What I would like to link is:
1) "la" (no space after) in <orig> (I think this tag is more
appropriate here than <sic>) to "la" in <reg>
2) "cho sa" in <orig> to "cosa" in <reg>
I cannot figure out an easier way than something like
<seg type="ot" rend="no_space_after">la</seg>
and then using linking and alignment methods.
The question is:
If I want the tokenization to be done automatically, how should
I pre-tag this kind of regularization?
It seems to me that Emmanuele's initial idea of introducing a new
tag makes sense.
I would suggest something like <sb/> for "space break" with
<orig>la<sb type="reg"/>cho<sb type="orig"/>sa</orig>
la <sb type="reg"/> <choice><orig>cho<sb type="orig"/>sa</orig>
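To make the idea a bit more concrete, here is a minimal Python sketch of how such pre-tagging could drive automatic tokenization. Everything in it is an assumption for illustration only: <sb/> is the hypothetical element suggested above (it is not part of TEI), and spacing_views is just a helper name I made up. It reads the first example and derives the original and the regularized spacing from the same character stream; spelling regularization ("cho sa" to "cosa") is deliberately left out.

```python
import xml.etree.ElementTree as ET


def spacing_views(xml_fragment):
    """Return (original, regularized) strings for a fragment that uses
    the hypothetical <sb type="orig|reg"/> space-break element."""
    root = ET.fromstring(xml_fragment)
    # Both views start with the text that precedes the first child element.
    orig_parts = [root.text or ""]
    reg_parts = [root.text or ""]
    for child in root:
        if child.tag == "sb":
            kind = child.get("type")
            if kind == "orig":
                # Space present in the source document only.
                orig_parts.append(" ")
            elif kind == "reg":
                # Space introduced by the regularization only.
                reg_parts.append(" ")
        # Text following the element belongs to both views.
        tail = child.tail or ""
        orig_parts.append(tail)
        reg_parts.append(tail)
    return "".join(orig_parts), "".join(reg_parts)


fragment = '<orig>la<sb type="reg"/>cho<sb type="orig"/>sa</orig>'
original, regularized = spacing_views(fragment)
print(original.split())     # ['lacho', 'sa']
print(regularized.split())  # ['la', 'chosa']
```

Since both token lists come from the same characters, the tokens of one view could in principle be linked back to the other by character offsets, without pre-tagging each token by hand.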
----- Original Message -----
From: "Lou Burnard" <[log in to unmask]>
To: "Alexey LAVRENTEV" <[log in to unmask]>
Sent: Monday, March 03, 2008 11:07 AM
Subject: Re: Critical Insert
> Hello Alexe[iy]!
> Thanks for the suggestion as to a possible use-case. But surely this could
> be coped with quite easily:
> <reg><seg>la</seg> <seg>cosa</seg></reg>
> (I've used <seg> here rather than <w> to avoid arguments about what a word
> is, but the same principles apply; you could also say something like
> <seg type="orthographicToken"> or introduce a new element <ot> for that.)
> Once you have the tokens identified, you can use the usual linking and
> alignment methods to associate them, of course.
> Alexey LAVRENTEV wrote:
>> Lou Burnard wrote:
>>> What's wrong with
>>> <sic>lacho sa</sic>
>>> <reg>la cosa</reg>
>>> What kinds of query or analysis might you realistically do that this
>>> encoding would not support?
>> I am afraid that this markup conflates two different kinds
>> of regularization:
>> a) spelling,
>> b) spacing.
>> In diplomatic editions, it is not uncommon to have normalized spacing
>> and original spelling.
>> Such an encoding can also cause problems in case of linguistic
>> If you tokenize on the basis of the regularized form but want to keep a
>> of the original one, how will you link the two?