On 18/06/12 10:37, Roberto Rosselli Del Turco wrote:
> &C, &c for italics (&C is the opening tag, &c the closing one)
> 1 follows a vowel to mean grave accent
> $####$ to markup a page number (0001 and following)
> &V following text is in verse
> &P following text is in prose
> § marks a paragraph
> What would be the best method to export these texts in TEI XML? I
> ruled out XSLT since the input text is not a well formed document,
> perhaps some PERL script, or in a similar language?
Since that looks very much like a standardised encoding language, I would
try to work structurally: first of all, I would try to write a grammar
for the input encoding language. That way you develop a good
understanding of the input, and you get into a position where it
becomes feasible to write a simple parser for your input language.
For parsing documents I would recommend a recursive approach, where you
parse the whole thing as the document and then split it recursively into
parts (if I remember correctly, this is known as recursive descent, a
kind of LL parser).
For each of the components of the input language you then have a
function handling it, where the recursion depth represents the structure
of the input tree. This tree can easily be used to create a
corresponding DOM that is either directly TEI or can be used in its XML
serialized form for a simple transformation. The simplest case would be
that every function just produces one XML element. This may look like
a lot of overhead, but it has some strong advantages: the parser is a
validation step for the input material, since it simply fails on
unexpected input. You therefore cannot overlook some kind of annotation,
which can easily happen with pure pattern matching approaches using e.g.
sed. I have done this myself with annotation languages in Perl, which was
a very pleasant experience. But any programming language you feel
...). Nevertheless, a good pattern matching library certainly helps.
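
To make the idea concrete, here is a minimal sketch in Python (not Perl,
purely for brevity) of one function per grammar rule. Everything in it is
my assumption rather than a description of your data: the grammar in the
comment is guessed from the codes you listed, the names (tokenize, Parser,
to_tei) are made up, and the output elements (div with a type attribute,
p, hi, pb) are only one plausible TEI mapping. It also treats every "1"
after a vowel as a grave accent, which breaks if real digits occur in the
text.

```python
import re
import xml.etree.ElementTree as ET

# Assumed grammar (a guess from the codes listed in the question):
#   document  := section*
#   section   := ("&V" | "&P") (page | paragraph)*
#   paragraph := "§" (text | italic | page)*
#   italic    := "&C" text* "&c"
#   page      := "$" DIGIT{4} "$"
TOKEN = re.compile(
    r"(?P<tag>&[CcVP])|\$(?P<page>\d{4})\$|(?P<par>§)|(?P<text>[^&$§]+)")
GRAVE = {"a": "à", "e": "è", "i": "ì", "o": "ò", "u": "ù",
         "A": "À", "E": "È", "I": "Ì", "O": "Ò", "U": "Ù"}


def tokenize(src):
    """Split the input into (kind, value) tokens; fail on anything unknown."""
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected input at offset {pos}: {src[pos:pos + 10]!r}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens


def add_text(parent, raw):
    """Append text to an element, turning vowel + '1' into a grave accent."""
    text = re.sub(r"([aeiouAEIOU])1", lambda m: GRAVE[m.group(1)], raw)
    if len(parent):  # text after a child element goes into that child's tail
        parent[-1].tail = (parent[-1].tail or "") + text
    else:
        parent.text = (parent.text or "") + text


class Parser:
    """Recursive descent: one method per grammar rule, one element per method."""

    def __init__(self, tokens):
        self.tokens, self.i = tokens, 0

    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else ("eof", "")

    def next(self):
        tok = self.peek()
        self.i += 1
        return tok

    def fail(self, rule):
        raise SyntaxError(f"unexpected {self.peek()!r} in {rule}")

    def skip_space(self):
        while self.peek()[0] == "text" and not self.peek()[1].strip():
            self.next()

    def document(self):
        body = ET.Element("body")
        self.skip_space()
        while self.peek()[0] != "eof":
            body.append(self.section())
        return body

    def section(self):
        if self.peek() not in (("tag", "&V"), ("tag", "&P")):
            self.fail("document")
        mode = "verse" if self.next()[1] == "&V" else "prose"
        div = ET.Element("div", type=mode)
        self.skip_space()
        while self.peek()[0] in ("page", "par"):
            div.append(self.page() if self.peek()[0] == "page" else self.paragraph())
            self.skip_space()
        return div

    def paragraph(self):
        self.next()  # consume the paragraph mark
        p = ET.Element("p")
        while True:
            kind, value = self.peek()
            if kind == "text":
                add_text(p, self.next()[1])
            elif kind == "page":
                p.append(self.page())
            elif (kind, value) == ("tag", "&C"):
                p.append(self.italic())
            else:
                return p

    def italic(self):
        self.next()  # consume &C
        hi = ET.Element("hi", rend="italic")
        while self.peek()[0] == "text":
            add_text(hi, self.next()[1])
        if self.peek() != ("tag", "&c"):
            self.fail("italic")  # e.g. a forgotten closing &c
        self.next()
        return hi

    def page(self):
        return ET.Element("pb", n=self.next()[1])


def to_tei(src):
    """Parse the encoded text and return the resulting XML as a string."""
    return ET.tostring(Parser(tokenize(src)).document(), encoding="unicode")


if __name__ == "__main__":
    print(to_tei("&P$0001$§Una citta1 &Cbella&c.&V§Versi"))
```

Note how an unknown code such as "&Q", a missing "&c" or text outside of
any paragraph raises an error instead of being silently passed through.
That is the validation effect I mean above.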