This is fairly well-trodden ground, I think. It goes in layers:
- can you read the medium at all? can you get a stream of bits?
- can you understand the file system? (CP/M, anyone?)
- can you reconstruct those bits into "letters" (bytes)?
- can you understand the encoding? (i.e. is it UTF-8, ASCII, some IBM thing, binary?)
- can you understand what the sequence of letters or bits is supposed to do? i.e. is it XML or TeX?
- can you grok the semantics of that XML? is it HTML or TEI?
- if it's binary (PDF, doc, image formats), can you a) identify which format it is, and b) reconstruct what it is trying to do?
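
The encoding and format-identification layers above can be sketched roughly as follows. This is only an illustration, not a real identification tool: the names `MAGIC`, `sniff` and `guess_markup` are made up for this example, the signature table covers just a handful of well-known formats, and the markup guess is a crude heuristic.

```python
# A few well-known leading byte signatures ("magic numbers").
# Deliberately incomplete; a real tool would carry a far larger table.
MAGIC = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"%PDF-", "PDF document"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 container (e.g. old .doc)"),
]


def guess_markup(text: str) -> str:
    """Crude guess at what a decoded text file is 'supposed to do'."""
    head = text.lstrip()
    if head.startswith("<"):
        return "looks like XML or SGML markup"
    if "\\documentclass" in text or "\\begin{" in text:
        return "looks like TeX"
    return "no markup recognised"


def sniff(data: bytes) -> str:
    """Walk the layers: binary signature first, then text encoding, then markup."""
    for signature, name in MAGIC:
        if data.startswith(signature):
            return name
    # Try the strictest text encoding first; utf-8-sig also accepts a leading BOM.
    for encoding, label in (("ascii", "ASCII"), ("utf-8-sig", "UTF-8")):
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            continue
        return f"{label} text, {guess_markup(text)}"
    # Could be "some IBM thing" (EBCDIC), a legacy 8-bit code page, or binary.
    return "unknown binary or legacy 8-bit encoding"
```

For instance, `sniff(b"%PDF-1.4\n")` reports a PDF document, while bytes that match no signature and decode as neither ASCII nor UTF-8 fall through to the last case, which is exactly where the hard work starts.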
If you get as far as a text file, past the encoding, you are likely to be able to understand
what is happening. If it's a binary image format, working strictly from published specs
should get you back the image; i.e. one should be able to write a new reader of PNG
from first principles. If it's a binary document format without a formal specification...
Even worse, if it's a proprietary database format, you're really in trouble :-}
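
To give a feel for what "a new reader of PNG from first principles" involves, here is a minimal sketch of the first step, written only from the container layout in the published PNG specification (8-byte signature, then chunks of length, type, data and CRC). The function names `read_png_chunks` and `image_header` are invented for this example, and it stops well short of a real reader: it does not inflate or unfilter the pixel data.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_png_chunks(data: bytes):
    """Split a PNG byte stream into (type, body) chunks, checking each CRC."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG stream")
    chunks = []
    pos = len(PNG_SIGNATURE)
    while pos < len(data):
        # Each chunk: 4-byte big-endian length, 4-byte type, body, 4-byte CRC.
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        (crc,) = struct.unpack(">I", data[pos + 8 + length:pos + 12 + length])
        # The CRC covers the chunk type and body, but not the length field.
        if zlib.crc32(ctype + body) != crc:
            raise ValueError(f"bad CRC in chunk {ctype!r}")
        chunks.append((ctype, body))
        pos += 12 + length
        if ctype == b"IEND":
            break
    return chunks


def image_header(chunks):
    """Decode the IHDR chunk, which must come first in a valid PNG."""
    ctype, body = chunks[0]
    if ctype != b"IHDR":
        raise ValueError("first chunk is not IHDR")
    width, height, bit_depth, colour_type, _comp, _filt, _interlace = struct.unpack(
        ">IIBBBBB", body
    )
    return {
        "width": width,
        "height": height,
        "bit_depth": bit_depth,
        "colour_type": colour_type,
    }
```

Everything here comes from the public specification, which is the point: none of it depends on having the original software. The same exercise is not possible for a binary format whose layout was never written down.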
XML is not a panacea, obviously, but it's a help.
Information Manager, Oxford University Computing Services
13 Banbury Road, Oxford OX2 6NN. Phone +44 1865 283431
I only ask of God
that I not be indifferent to the future