Announcing a NEW CORPUS from the LDC
Voicemail Corpus - Part I
The Voicemail Corpus - Part I was created by the following
researchers at IBM:
M. Padmanabhan, G. Ramaswamy, B. Ramabhadran, P.S.
Gopalakrishnan, and C. Dunn.
This CD-ROM corpus consists of 1801 voicemail messages,
collected from volunteers at various IBM sites in the United
States, which make up the training data set, plus 42 messages in
the development test set. The average voicemail message is 31
seconds long and contains about 100 words. Approximately 38%
of the messages are from male speakers; the remainder are from
female speakers. All messages were transcribed by IBM.
During the collection period, volunteers were asked to forward
some of their voicemail messages to a local extension number set
up for the purpose of collecting this data. The messages were
then collected periodically from the voicemailbox of this local
extension and added to the database.
DirectTalk6000 (DT6K) software was used to transfer the
voicemail messages to the computer. DT6K is an application that
runs under the AIX operating system on a host computer, and can
interface to a phone line through special hardware on the host
computer. Note that the data was collected from IBM sites all
over the US whereas the host computer that the DT6K application
was running on was located at a single IBM site. Consequently,
when the application dialed into the phonemail system of an IBM
site in a different state, the voicemail messages were played
out over a long-distance line before they were recorded on the
host computer.
The data was sampled at 8 kHz, and recorded in 8-bit u-law
compressed format onto a local disk of the host computer. The
messages were compressed by the proprietary compression
techniques used by the ROLM phonemail system, which is the
phonemail system in use at various IBM locations.
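
For readers who want to inspect the audio, the following is a minimal
sketch of decoding 8-bit u-law samples to linear values and estimating
a message's duration at the stated 8 kHz rate. It assumes the standard
G.711 u-law encoding and raw, headerless sample bytes; the actual file
layout on the CD-ROM is not described here, so any header would need
to be stripped first. The function names are illustrative only.

```python
SAMPLE_RATE_HZ = 8000  # sampling rate stated in the announcement


def ulaw_to_linear(code: int) -> int:
    """Decode one 8-bit u-law code word (G.711 assumed) to a linear sample."""
    u = ~code & 0xFF              # u-law code words are stored bit-inverted
    sign = u & 0x80               # top bit carries the sign
    exponent = (u >> 4) & 0x07    # 3-bit segment number
    mantissa = u & 0x0F           # 4-bit step within the segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode(raw: bytes) -> list[int]:
    """Decode a headerless buffer of u-law bytes into linear samples."""
    return [ulaw_to_linear(b) for b in raw]


def duration_seconds(raw: bytes, rate: int = SAMPLE_RATE_HZ) -> float:
    """One byte per sample, so duration is byte count divided by the rate."""
    return len(raw) / rate
```

Under these assumptions, a message of average length (31 seconds)
would occupy about 248,000 bytes of sample data.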
IBM would like to acknowledge the support of DARPA for funding
this data collection effort under Grant MDA972-97-C-0012 and is
also extremely grateful to George di Simone and Ira Ellis
(Watson telephone system support) for their help in setting up
the data collection process. IBM would also like to thank Dr.
Ellen Eide for helping with the verification of transcripts and
Dr. Salim Roukos, Dr. David Nahamoo, and Dr. Lalit Bahl for
their help and support. Finally, thanks are due to the various
volunteers who contributed their voicemail messages to the
Institutions that have membership in the LDC during the 1998
Membership Year will be able to receive this corpus in the same
manner as all other text and speech corpora published by the
LDC.
If you would like to order a copy of this corpus, please email
your request to <[log in to unmask]>. If you need
additional information before placing your order, or would like
to inquire about membership in the LDC, please send email or
call (215) 898-0464.
Further information about the LDC and its available corpora can
be accessed on the Linguistic Data Consortium WWW Home Page at