Hello Lou,
What I would like to link is:
1) "la" (no space after) in <orig> (I think this tag is more
appropriate here than <sic>) to "la" in <reg>
2) "cho sa" in <orig> to "cosa" in <reg>
I cannot figure out an easier way than something like:
<choice>
<orig>
<seg type="ot" rend="no_space_after">la</seg>
<seg type="ot">
<seg rend="space_after">cho</seg>
<seg>sa</seg>
</seg>
</orig>
<reg>
<seg type="ot">la</seg>
<seg type="ot">cosa</seg>
</reg>
</choice>
and then use linking and alignment methods.
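As a rough illustration of the alignment step, here is a minimal Python sketch that pairs the orthographic tokens of <orig> and <reg> by position. It is only a sketch under assumptions: the function names (token_text, align) are hypothetical, the fragment is parsed without the TEI namespace, and I assume that both sides contain the same number of <seg type="ot"> elements in the same order.

```python
import xml.etree.ElementTree as ET

# The <choice> fragment from above, without the TEI namespace (assumption).
SAMPLE = """
<choice>
  <orig>
    <seg type="ot" rend="no_space_after">la</seg>
    <seg type="ot">
      <seg rend="space_after">cho</seg>
      <seg>sa</seg>
    </seg>
  </orig>
  <reg>
    <seg type="ot">la</seg>
    <seg type="ot">cosa</seg>
  </reg>
</choice>
""".strip()


def token_text(seg):
    """Return the surface form of one orthographic token.

    Nested <seg> children are concatenated, and a space is re-inserted
    after any child marked rend="space_after".
    """
    parts = seg.findall("seg")
    if not parts:
        return (seg.text or "").strip()
    out = []
    for part in parts:
        out.append((part.text or "").strip())
        if part.get("rend") == "space_after":
            out.append(" ")
    return "".join(out)


def align(choice):
    """Pair original and regularized tokens by position (hypothetical helper)."""
    orig = [token_text(s) for s in choice.findall("./orig/seg[@type='ot']")]
    reg = [token_text(s) for s in choice.findall("./reg/seg[@type='ot']")]
    if len(orig) != len(reg):
        raise ValueError("orig and reg do not have the same number of tokens")
    return list(zip(orig, reg))


pairs = align(ET.fromstring(SAMPLE))
for original, regularized in pairs:
    print(repr(original), "->", repr(regularized))
```

Positional pairing is the simplest possible choice; in a real edition one would more likely give each token an @xml:id and record the correspondences explicitly.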
The question is:
If I want the tokenization to be done automatically, how should
I pre-tag this kind of regularization?
It seems to me that Emmanuele's initial idea of introducing a new
tag makes sense.
I would suggest something like <sb/> for "space break" with
@type="orig|reg":
<choice>
<orig>la<sb type="reg"/>cho<sb type="orig"/>sa</orig>
<reg>la cosa</reg>
</choice>
or even
la <sb type="reg"/> <choice><orig>cho<sb type="orig"/>sa</orig>
<reg>cosa</reg></choice>
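To make the idea a bit more concrete, here is a minimal Python sketch of how a tokenizer might consume such markup. Everything in it is an assumption for illustration: <sb/> is only a proposed element, the function name render is hypothetical, and I read type="orig" as "space present in the original only" and type="reg" as "space present in the regularized spacing only".

```python
import xml.etree.ElementTree as ET


def render(elem, view):
    """Return the text of elem for one spacing view ("orig" or "reg").

    Each <sb/> whose @type equals the requested view is realized as a
    space; every other <sb/> is silently dropped.
    """
    out = [elem.text or ""]
    for child in elem:
        if child.tag == "sb":
            if child.get("type") == view:
                out.append(" ")
        else:
            out.append(render(child, view))
        # Text following the child element belongs to the parent.
        out.append(child.tail or "")
    return "".join(out)


orig = ET.fromstring('<orig>la<sb type="reg"/>cho<sb type="orig"/>sa</orig>')

original_spacing = render(orig, "orig")    # spacing as in the source
normalized_spacing = render(orig, "reg")   # original spelling, normalized spacing
tokens = normalized_spacing.split()        # tokenize on normalized spacing

print(original_spacing)
print(normalized_spacing)
print(tokens)
```

The point of the sketch is that a single <orig> string then carries both spacings, so spacing and spelling regularization can be handled separately.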
Best,
Ale(x|ks)e[iïyj]
----- Original Message -----
From: "Lou Burnard" <[log in to unmask]>
To: "Alexey LAVRENTEV" <[log in to unmask]>
Cc: <[log in to unmask]>
Sent: Monday, March 03, 2008 11:07 AM
Subject: Re: Critical Insert
> Hello Alexe[iy]!
>
> Thanks for the suggestion as to a possible use-case. But surely this could
> be coped with quite easily:
>
> <choice>
> <sic><seg>lacho</seg><seg>sa</seg></sic>
> <reg><seg>la</seg> <seg>cosa</seg></reg>
> </choice>
>
> (I've used <seg> here rather than <w> to avoid arguments about what a word
> is, but the same principles apply; you could also say something like
> <seg type="orthographicToken"> or introduce a new element <ot> for that
> purpose!)
>
> Once you have the tokens identified, you can use the usual linking and
> alignment methods to associate them, of course.
>
>
>
>
> Alexey LAVRENTEV wrote:
>> Lou Burnard wrote:
>>>
>>> What's wrong with
>>>
>>> <choice>
>>> <sic>lacho sa</sic>
>>> <reg>la cosa</reg>
>>> </choice>
>>>
>>> ?
>>>
>>> What kinds of query or analysis might you realistically do that this
>>> encoding would not support?
>>
>> I am afraid that this markup conflates two different kinds
>> of regularization:
>> a) spelling,
>> b) spacing.
>>
>> In diplomatic editions, it is not uncommon to have normalized spacing
>> and original spelling.
>>
>> Such an encoding can also cause problems for linguistic
>> tokenization.
>> If you tokenize on the basis of the regularized form but want to keep
>> track of the original one, how will you link the two?
>>
>> Best,
>>
>> Alexei
>