Hi, it's nice to see someone ask this and follow the hints that come in 
return.

I have recently used Apache Tika, which "extracts metadata and
structured text content from various documents using existing parser
libraries".

http://tika.apache.org/

As far as I understand, it uses PDFBox (or at least regularly includes
routines from it), but I have not checked the details.

Tika produces plain-text output by default, but I suppose it could also
be made to create other output. It is the usual Java tooling, yet
surprisingly easy to get working. It runs quite smoothly, although some
issues remain (e.g. ligatures, hyphenation at line ends) that might
confuse later processing steps. Also, if you use it out of the box,
e.g. from shell scripts (and not exactly "the Java way", which I have
not tried in that scenario), it turns out to be quite slow when you have
many (i.e. hundreds of) small files. As a rough estimate, it takes a few
seconds per file on a seasoned Core 2 Duo when used as described.
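
To illustrate the ligature and line-end hyphenation issues, here is a
minimal post-processing sketch that could run on the extracted plain
text. It is not part of Tika; the class and method names are made up for
this example, and it uses only the Java standard library. The hyphen
rule is a crude heuristic: it assumes that a hyphen at a line end
followed by a lowercase letter is a word break, so it will also merge
genuine compounds that happen to be split there.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical helper for cleaning plain text extracted from a PDF.
final class ExtractedTextCleanup {

    // Letter, hyphen, line break, optional indentation, lowercase letter.
    private static final Pattern EOL_HYPHEN =
            Pattern.compile("(\\p{L})-\\r?\\n[ \\t]*(\\p{Ll})");

    private ExtractedTextCleanup() { }

    // Replaces the Latin ligature characters U+FB00..U+FB06 (ff, fi, fl,
    // ffi, ffl, long s-t, st) with their plain-letter equivalents.
    static String expandLigatures(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 0xFB00 && c <= 0xFB06) {
                // NFKC maps a ligature to its compatibility decomposition.
                out.append(Normalizer.normalize(String.valueOf(c),
                        Normalizer.Form.NFKC));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Joins words that were hyphenated across a line break (heuristic).
    static String joinHyphenatedLines(String text) {
        return EOL_HYPHEN.matcher(text).replaceAll("$1$2");
    }

    // Applies both steps: ligatures first, then hyphenation.
    static String clean(String text) {
        return joinHyphenatedLines(expandLigatures(text));
    }
}
```

Calling ExtractedTextCleanup.clean(...) on each extracted file before
the next processing step would be one way to use it.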

It also extracts some metadata. It would be nice if it could do both
(structured text and metadata) in one run and create, for instance,
TEI files. :)
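
As a rough idea of what such a combined output could look like, the
sketch below wraps already-extracted plain text and a metadata map in a
minimal TEI skeleton. Nothing here comes from Tika or PDFBox: the class
name, the choice to store metadata as <note type="..."> elements in a
<notesStmt>, and the fixed publicationStmt/sourceDesc wording are all
assumptions made for illustration. Metadata keys are assumed to be
simple tokens without whitespace, and paragraphs are assumed to be
separated by blank lines.

```java
import java.util.Map;

// Hypothetical writer that builds a minimal TEI document as a string.
final class MinimalTeiWriter {

    private MinimalTeiWriter() { }

    // Escapes the characters that are special in XML text and attributes.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    static String toTei(String title, Map<String, String> metadata,
                        String plainText) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<TEI xmlns=\"http://www.tei-c.org/ns/1.0\">\n");
        sb.append("  <teiHeader>\n    <fileDesc>\n");
        sb.append("      <titleStmt><title>").append(escape(title))
          .append("</title></titleStmt>\n");
        sb.append("      <publicationStmt><p>Unpublished conversion.</p>")
          .append("</publicationStmt>\n");
        if (!metadata.isEmpty()) {
            // Assumed mapping: one note per metadata key/value pair.
            sb.append("      <notesStmt>\n");
            for (Map.Entry<String, String> e : metadata.entrySet()) {
                sb.append("        <note type=\"").append(escape(e.getKey()))
                  .append("\">").append(escape(e.getValue()))
                  .append("</note>\n");
            }
            sb.append("      </notesStmt>\n");
        }
        sb.append("      <sourceDesc><p>Converted from a PDF file.</p>")
          .append("</sourceDesc>\n");
        sb.append("    </fileDesc>\n  </teiHeader>\n");
        sb.append("  <text>\n    <body>\n");
        int paragraphs = 0;
        // Blank lines separate paragraphs in the extracted text.
        for (String para : plainText.split("\\R\\s*\\R")) {
            String trimmed = para.trim();
            if (!trimmed.isEmpty()) {
                sb.append("      <p>").append(escape(trimmed))
                  .append("</p>\n");
                paragraphs++;
            }
        }
        if (paragraphs == 0) {
            sb.append("      <p/>\n"); // body must not be empty
        }
        sb.append("    </body>\n  </text>\n</TEI>\n");
        return sb.toString();
    }
}
```

The result is only a starting point: real structure (headings,
footnotes, page breaks) would still have to come from the PDF layer.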

However, I would suppose that most Tika users apply it "embedded" in 
their own "processing pipeline" ...

Regards,
Frank



On 22.09.2013 11:50, Laurent Romary wrote:
> As some of you know there has been https://github.com/kermitt2/grobid
> around, mainly trained on scholarly papers and particularly good for
> meta-data extraction. Full text re-structuring is progressing on a
> regular basis. Maybe you want to have a try.
> Cheers,
> Laurent
>
> On 22 Sept. 2013 at 11:12, Eberhard von Kitzing wrote:
>
>> Dear all,
>>
>> There are standard ways to create PDF from TEI. But are there any
>> programs designed to convert PDF to TEI? Of course, there are many
>> programs to convert PDF to text files. But it is often a rather long
>> way from these text files to TEI.
>>
>> I am currently extending a class in Apache PDFBox to produce a TEI
>> file that already contains more information than a simple text
>> file. Perhaps there are people among the readers who have already
>> thought about this problem and could therefore offer some good
>> advice.
>>
>> All the best, Eberhard.
>>
>> _____________________________
>> Eberhard von Kitzing
>> Carl-Zuckmayer-Str. 17
>> D 69126 Heidelberg
>>
>> [log in to unmask] <mailto:[log in to unmask]>
>>
>> Tel.: +49 6221 385129 (home)
>> Tel.: +49 2405 413524 (work)
>> Mob.: +49 172 2419568
>> FAX: +49 3221 2348315
>>
>
> Laurent Romary
> INRIA & HUB-IDSL
> [log in to unmask] <mailto:[log in to unmask]>
>
>
>