As I understand it, and unless things have changed recently,
we have the opposite problem: our retrieval engine *cannot*
'skip over' a tag of any kind, empty or otherwise, but it
*can* skip over a character (i.e. index it as NULL). This
has undoubtedly influenced our policy of encoding
all hyphens as characters rather than as elements.
Thus 'cha|racter' and 'cha-racter' are indexed as 'character', whereas
'cha<g>racter' would be indexed as two words, 'cha' and 'racter'.
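
The indexing behaviour described above can be sketched roughly as follows. This is only an illustration, not our actual retrieval engine: the function name index_tokens and the set of skipped characters are assumptions made for the example.

```python
import re

# Characters assumed to be "skipped" (indexed as NULL); the set is illustrative.
SKIPPED_CHARS = {"|", "-", "\u00ad"}

def index_tokens(text):
    """Return index terms: any tag breaks a word, skipped characters vanish."""
    # Any tag, empty or otherwise, acts as a word boundary.
    without_tags = re.sub(r"<[^>]+>", " ", text)
    # Skipped characters are dropped without splitting the word.
    cleaned = "".join(ch for ch in without_tags if ch not in SKIPPED_CHARS)
    return cleaned.split()

print(index_tokens("cha|racter"))
print(index_tokens("cha-racter"))
print(index_tokens("cha<g>racter"))
```

Under these assumptions the first two calls yield a single term and the third yields two, which is why a character is easier for such an engine to ignore than an element.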
pfs
On Fri, 4 Dec 2009, Jens Østergaard Petersen wrote:
> Most kinds of character-based formatting occur before and after the formatted string and will not interfere with searches. One exception is the extra interspacing of letters common in German, but this can be marked with @rend and recreated in the output with CSS. Soft hyphens, however, pose a problem.
>
> If one searches on an index, say with Lucene, one could encode a soft hyphen using a character entity, in the manner recommended in the Guidelines, and then remember to remove it prior to indexing. But if one searches directly on the document, this will not help, unless one establishes a soft-hyphen-free version of the document, which is surely overkill.
>
> If one e.g. passes an XQuery on some element contents, a soft hyphen encoded as a character entity would interfere with the query, but if it was marked with an empty tag, say <softHyphen />, it would just be passed over, and searching for "Wittgenstein" would find "Wittgen<softHyphen />stein" (and even "Wittgen<softHyphen /></l>stein"), whereas it would not find "Wittgen­stein" or "Wittgen stein" (you probably can't see the soft hyphen, unless you read your mail in pure text, but it's there).
>
> Since soft hyphens look exactly like hard hyphens (if they can be seen at all), wouldn't it be easier to manage soft hyphens if we could use an empty tag to signal them, and to remove or transform them more transparently, or neglect them, as in XQuery?
>
> My problem is also that I wish to register hyphenation, but I do not wish my "keyboarders" to be loaded with too many technicalities, and I believe that the character code difference between the two (not to mention the other hyphens and dashes) will be seen as technical. Having an empty tag would make this part easier as well.
>
> Jens
>
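
Jens's point about direct searching can be illustrated with a small sketch. It assumes the query compares against the concatenated text nodes of an element (roughly its XPath string value), and it uses Python's standard xml.etree.ElementTree rather than an XQuery processor. The element name softHyphen is the one proposed in the quoted message, not an existing TEI element, and string_value is a helper invented for the example.

```python
import xml.etree.ElementTree as ET

SOFT_HYPHEN = "\u00ad"  # the character a soft-hyphen entity resolves to

def string_value(xml_fragment):
    """Concatenate all text nodes of the fragment, ignoring the tags."""
    return "".join(ET.fromstring(xml_fragment).itertext())

# Soft hyphen marked with an empty element: it contributes no text.
tagged = string_value("<l>Wittgen<softHyphen />stein</l>")

# The word continues in the next line element (no whitespace between lines).
across_lines = string_value(
    "<lg><l>Wittgen<softHyphen /></l><l>stein</l></lg>"
)

# Soft hyphen encoded as a character: it stays inside the string.
as_char = string_value("<l>Wittgen" + SOFT_HYPHEN + "stein</l>")

print("Wittgenstein" in tagged)
print("Wittgenstein" in across_lines)
print("Wittgenstein" in as_char)
# Stripping the character first, as one would before building an index:
print("Wittgenstein" in as_char.replace(SOFT_HYPHEN, ""))
```

In this sketch only the character-encoded form fails to match until the soft hyphen is stripped. The match across line elements depends on there being no whitespace between them, which real documents may not guarantee.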
--------------------------------------------------------------------
Paul Schaffner | [log in to unmask] | http://www.umich.edu/~pfs/
316-C Hatcher Library N, Univ. of Michigan, Ann Arbor MI 48109-1205
--------------------------------------------------------------------