On 18/06/2012 at 10:46, stuart yeates wrote:
> There may, of course, already be someone who has walked this
> particular migration road and can help you.
A Google search shows that this kind of DBT conversion was
already done in Pisa in the 1990s.
In "ITALIAN PAROLE CORPUS: AN OVERVIEW",
one can see that the DBT format was converted to SGML/TEI-CES:
"A detailed list of all the DBT and SGML tags used for each
medium is given in the LE-PAROLE Italian Corpus Documentation
(Goggi et al.,1997)."
Unfortunately, access to the cited paper
is restricted to ACM members...
Nevertheless, since this is a speech transcription corpus, the DBT
tags involved in this example may not be very useful to you; only
the framework and the people who did the work may be.
In the TXM platform, we use scripts written in Groovy (a Python-like
language) to convert such TXT-based formats to XML-TEI.
For example, the Alceste format has the
following tags in the running text:
**** a_1 b_2 ... for text separators and metadata values
-* ... for speech turns (not used)
You can see the sample Groovy script here:
[Sorry for the Java flavour of this script; we should
make it more idiomatic Groovy when we have time...]
It is simple to patch and use that kind of script:
- you install TXM
- you copy the script to your machine
- you change the infile and outfile parameters (lines 175-176)
- you open the script from TXM and call it from the menu
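
As a rough illustration of the Alceste conversion described above, here
is a minimal sketch in Python. It is not the TXM Groovy script, only an
independent example under stated assumptions: the function name
alceste_to_xml is hypothetical, each name_value token on a "****" line
is turned into an attribute of a <text> element, every other non-empty
line becomes a <p>, and the output is a simplified TEI-like skeleton
rather than a complete, valid TEI document.

```python
from xml.sax.saxutils import escape, quoteattr


def alceste_to_xml(source: str) -> str:
    """Convert Alceste-style plain text to a simplified TEI-like XML string.

    Assumptions (not taken from the TXM script):
    - "****" lines start a new text and carry name_value metadata tokens;
    - "-*" lines (speech turns) are skipped;
    - lines found before the first "****" line are dropped.
    """
    out = ["<teiCorpus>"]
    in_text = False
    for raw in source.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("****"):
            if in_text:
                out.append("</text>")
            attrs = []
            for token in line[4:].split():
                # Tolerate an optional leading "*" in front of each token.
                name, sep, value = token.lstrip("*").partition("_")
                if sep and name.isidentifier():
                    attrs.append(f" {name}={quoteattr(value)}")
            out.append(f"<text{''.join(attrs)}>")
            in_text = True
        elif line.startswith("-*"):
            continue  # speech turns are not used
        elif in_text:
            out.append(f"<p>{escape(line)}</p>")
    if in_text:
        out.append("</text>")
    out.append("</teiCorpus>")
    return "\n".join(out)
```

Calling alceste_to_xml on the content of an input file and writing the
returned string to an output file plays the same role as the infile and
outfile parameters mentioned in the steps above.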
Another example script is the Hyperbase format
converter, which handles the following tags:
&&& fullname, shortname, veryshortname for text separators
for paragraph separators
The script is here:
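
Independently of that script, the Hyperbase case can be sketched in
Python along the following lines. This is only an example under
assumptions: hyperbase_to_xml is a hypothetical name, the three names on
a "&&&" line are assumed to be comma-separated, and the paragraph
separator is passed in by the caller as paragraph_marker and assumed to
stand alone on its own line. The output is again a simplified TEI-like
skeleton, not a complete TEI document.

```python
from xml.sax.saxutils import escape, quoteattr


def hyperbase_to_xml(source: str, paragraph_marker: str) -> str:
    """Convert Hyperbase-style plain text to a simplified TEI-like XML string.

    Assumptions (not taken from the TXM script):
    - "&&& fullname, shortname, veryshortname" lines start a new text;
    - a line equal to paragraph_marker ends the current paragraph;
    - lines found before the first "&&&" line are dropped.
    """
    out = ["<teiCorpus>"]
    paragraph = []  # lines of the paragraph being collected
    in_text = False

    def flush():
        # Emit the collected lines as one <p> element, if there are any.
        if paragraph:
            out.append(f"<p>{escape(' '.join(paragraph))}</p>")
            paragraph.clear()

    for raw in source.splitlines():
        line = raw.strip()
        if line.startswith("&&&"):
            flush()
            if in_text:
                out.append("</text>")
            names = [n.strip() for n in line.strip("& ").split(",")]
            names += [""] * (3 - len(names))  # pad missing names
            full, short, very_short = names[:3]
            out.append(
                f"<text fullname={quoteattr(full)}"
                f" shortname={quoteattr(short)}"
                f" veryshortname={quoteattr(very_short)}>"
            )
            in_text = True
        elif line == paragraph_marker:
            flush()
        elif line and in_text:
            paragraph.append(line)
    flush()
    if in_text:
        out.append("</text>")
    out.append("</teiCorpus>")
    return "\n".join(out)
```

Collecting lines until the marker is seen keeps each paragraph in one
<p> element even when it spans several lines of the source file.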
PS. TXM is a framework similar to PISYSTEM
Dr. Serge Heiden, http://textometrie.ens-lyon.fr
ENS de Lyon/CNRS - ICAR UMR5191, Institut de Linguistique Française
15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883