-------- Original Message --------
Subject: Fwd: Developing Linguistic Corpora: a guide to good practice
The Arts and Humanities Data Service (AHDS) has published a new Guide to
Good Practice, 'Developing Linguistic Corpora'. It is edited by Martin
Wynne from the Literature, Languages and Linguistics branch of the AHDS,
which is hosted by the Oxford Text Archive.
The printed book can be ordered online from Oxbow Books
(http://www.oxbowbooks.com/) for £15 plus post and packing, and the full
text is available for free online at http://ahds.ac.uk/linguistic-corpora/
In this volume, a selection of leading experts offers advice to help
readers ensure that their corpus is well designed and fit for its
intended purpose.
As John Sinclair writes in the first chapter: "A corpus is a remarkable
thing, not so much because it is a collection of language text, but because
of the properties that it acquires if it is well-designed and
carefully-constructed."
The collection includes the following chapters:
* 'Corpus and text: basic principles' by John Sinclair
* 'Adding linguistic annotation' by Geoffrey Leech
* 'Metadata for corpus work' by Lou Burnard
* 'Character encoding in corpus construction' by Tony McEnery and
Richard Xiao
* 'Spoken language corpora' by Paul Thompson
* 'Archiving, distribution and preservation' by Martin Wynne
John Sinclair sets out ten principles for corpus design, plus a new
definition of a corpus. Geoffrey Leech offers a taxonomy of types of
annotations as well as clear guidelines and some provisional standards for
annotation at various linguistic levels. Lou Burnard explains the different
types of metadata which can be provided for a corpus, and gives examples of
how these can be implemented using the Text Encoding Initiative guidelines.
Tony McEnery and Richard Xiao take on the tricky issue of encoding
characters in languages other than English, giving an historical overview
of the various solutions, leading to a discussion of how to use Unicode
today in encoding corpus texts. Paul Thompson draws on his experience in
developing the British Academic Spoken English (BASE) corpus to set out the
stages involved in the development and exploitation of a corpus of speech,
covering data collection, transcription, markup and annotation, and access.
In chapter six, Martin Wynne explains how good planning and design can help
to ensure the ongoing availability and usefulness of a corpus.
This and other guides in the series are available from
http://ahds.ac.uk/creating/guides/
Alastair
Alastair Dunning
Arts and Humanities Data Service
http://ahds.ac.uk/
King's College London
0207 848 1972
--
Dr James Cummings, Oxford Text Archive, University of Oxford
James dot Cummings at oucs dot ox dot ac dot uk