Thanks Fabio, James, Lou and Piotr for interesting replies!

Lou is absolutely right that the underlying problem is the overlap 
between the linguistic structure and the orthographic surface.

Another example I could give is

<choice>
   <abbr>q<am>&#x305;</am>lle</abbr>
   <expan>q<ex>u'e</ex>lle</expan>
</choice>

where an abbreviation mark (combining horizontal bar) expands to a 
letter in one word and a letter in another. How should this be tokenized?

Piotr would say that stand-off is my friend, but I believe that in 
every case one still needs to know how to "inject" the stand-off markup.

My initial question was in fact much more pragmatic. I was looking for a 
way to indicate that one and the same glyph expands to two characters in 
two different words.

The @next & @prev solution seemed like overkill to me, as it implies 
assigning a unique identifier to each <g> (which I did not intend to do 
at that stage of encoding), and att.fragmentable appeared to be a 
convenient solution.
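
For comparison, a minimal sketch of what the @next / @prev encoding 
might look like. The identifiers ct1a and ct1b are made up for 
illustration; #ctlig is the glyph reference from my first message.

```xml
<!-- ct1a and ct1b are hypothetical xml:id values, one per <g> -->
<w>don<g xml:id="ct1a" ref="#ctlig" next="#ct1b">c</g></w>
<w><g xml:id="ct1b" ref="#ctlig" prev="#ct1a">t</g>u</w>
```

Every half of a cross-word ligature needs its own identifier here, 
which is the cost I was hoping to avoid.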

However, it did not occur to me that the glyph might be decomposed, as 
Lou suggested. So the right solution would probably be to use @sameAs 
(which implies @xml:id) to make it clear that one and the same glyph 
belongs to two words... if I am getting it right.
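
A very tentative sketch of how I read this: the glyph is encoded once 
with an @xml:id, and a second <g> in the following word points back to 
it. The identifier ct1 is made up for illustration, and I am not sure 
whether the second <g> should be empty or repeat the content.

```xml
<!-- ct1 is a hypothetical identifier; #ctlig comes from my first message -->
<w>don<g xml:id="ct1" ref="#ctlig">ct</g></w>
<w><g sameAs="#ct1"/>u</w>
```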

Best,

Alexey

On 13/02/2015 15:14, Piotr Bański wrote:
> Things pronounced bonkers one day need not stay in that category as 
> the research progresses, and the TEI does wear a few pioneer's badges.
>
> You may have just helped Alexey to lay foundations for a new section 
> of ch.5 -- how about "Glyph deconstruction: free variation vs. 
> contextual variants". And since it's Alexey who's involved, we might 
> see it begin to happen already at the next TEI-MM... just sayin'.
>
> If this is taken seriously, then the semantics of @ref would simply 
> need to be properly defined, to fit the context (or some kind of grid 
> mapping would have to be devised). Nice.
>
> Best,
>
>   P.
>
>
>
> On 13/02/15 12:21, Lou Burnard wrote:
>> I suggest that the underlying problem here is nothing to do with <g>s,
>> fragmentable or otherwise, but rather with the desire to have a
>> tokenisation which is not well-structured with respect to the
>> orthographic structure. As such it's no different from such problems as
>> how to encode things like "it's" or "isn't" in English.
>>
>> I must say I don't like the idea of fragmentary <g>s, if only because I
>> don't know whether the @ref attribute is then supposed to point to the
>> whole (reconstituted) glyph, as you have in your example, or rather
>> whether it's supposed to point to a partial glyph. If that sounds
>> bonkers, consider someone trying to encode as separate glyphs the
>> strokes that constitute a single Chinese character.
>>
>>
>>
>> On 13/02/15 10:35, James Cummings wrote:
>>> Hi Alexey, Fabio,
>>>
>>> I think I'd do as Fabio suggests. To answer your underlying question,
>>> I guess that no one had considered that <g> might be fragmented in
>>> this way. I certainly hadn't. If you find lots of examples of this,
>>> they could be used as evidence for a clear feature request.
>>>
>>> -James
>>>
>>>
>>> On 13/02/15 08:29, Fabio Ciotti wrote:
>>>> Dear Alexey.
>>>>
>>>> you have @next and @prev for coreferencing the two parts of the
>>>> ligature (or what I consider a misuse of @corresp), and they are
>>>> available in <g>.
>>>>
>>>> Fabio
>>>>
>>>> 2015-02-11 17:42 GMT+01:00 Lavrentev Alexey
>>>> <[log in to unmask]>:
>>>>> Dear all,
>>>>>
>>>>> I am working on a project that involves annotation and alignment
>>>>> with image
>>>>> zones of individual characters in manuscript transcriptions
>>>>> (http://oriflamms.hypotheses.org).
>>>>>
>>>>> We are going to use <g> to encode ligatures, exactly as shown in an
>>>>> example
>>>>> in the Guidelines
>>>>> (http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-g.html).
>>>>>
>>>>> At a certain stage of the project, the transcriptions will need to be
>>>>> tokenised at word and character level using <w> and <c> tags.
>>>>>
>>>>> Although we have not yet seen such a case in the corpus, it is
>>>>> possible that
>>>>> a ligature joins the last letter of a word with the first letter of
>>>>> another,
>>>>> e.g. don&ctlig;u = "donc tu" in modern French.
>>>>>
>>>>> In the tokenized transcription, I would like to do something like 
>>>>> this
>>>>>
>>>>> <w>don<g ref="#ctlig" part="I">c</g></w> <w><g ref="#ctlig"
>>>>> part="F">t</g>u</w>
>>>>>
>>>>> but, unlike segLike elements (including <c>), <g> is not a member of
>>>>> the att.fragmentable class.
>>>>>
>>>>> Of course, I could use the <c> element instead of <g>, but then I
>>>>> would lose
>>>>> the semantics of <g> and the ref attribute.
>>>>>
>>>>> So, my question is whether there is a particular reason for <g> not
>>>>> being
>>>>> a member of att.fragmentable, and, if not, whether it is worth
>>>>> submitting a
>>>>> feature request.
>>>>>
>>>>> Otherwise, I would be very grateful for any alternative encoding
>>>>> proposal
>>>>> for "cross-word" glyphs.
>>>>>
>>>>> Best,
>>>>>
>>>>> Alexei
>>>
>>>
>>