Machine Language

Sep 1, 2010

How does one go about teaching a machine a human language? This was the question DARPA (Defense Advanced Research Projects Agency) had in mind when it issued a call for the creation of an organization to support human language technology research and development—a call answered with the establishment of the Linguistic Data Consortium (LDC) at Penn. “Think of it as a semester abroad for computers,” jokes Mark Liberman, the Trustee Professor of Phonetics and Professor of Computer and Information Science who founded the Consortium in 1992. “Learning a language takes a lot of experience.”

In order to gain this “experience,” a computer requires vast amounts of human language data, as well as directions for interpretation. These collections are often too time-consuming and expensive for individual research groups to create, and in providing shared resources to speech and language researchers around the world, the Consortium helps facilitate intellectual exchange. The LDC also acts as an intermediary for intellectual property rights. Its contracts with over 70 data providers allow researchers at more than a thousand institutions to use billions of words of text, and tens of thousands of hours of speech, without violating the copyrights of publishers and broadcasters.

LDC team members are skilled in some combination of linguistics, computer science and project management. Though many of the methods the Consortium uses to gather speech and language data are automated, the majority of data sets depend on human analysis. One of the key research areas the LDC supports is speech recognition (speech-to-text). Speech is collected from a variety of different sources, including satellite dishes and cable feeds. It is also captured from human subjects recruited to participate in telephone conversations and face-to-face interviews. Afterwards, the recordings are transcribed and stored, along with information about the source and the recording process.

In order to learn to “understand” speech or text, machines, just like humans, need information about meaning. To provide these data, LDC annotators are often asked to tag texts for “entities,” such as people or places. Researchers then use these tagged texts to develop and test programs that can extract the same sort of information automatically from new material.

To develop and test methods for speaker identification, researchers need examples of speakers recorded in multiple places, talking about multiple topics, using multiple recording devices. Otherwise, instead of learning to recognize differences among speakers, machine algorithms would learn to recognize differences among microphones, differences between rooms, or even differences between casual conversations and formal interviews.

Technology derived from this research could eventually lend authorities the ability to match threatening phone calls to suspects in custody, or enable a telephone banking service to identify a customer by voice alone.

Some Consortium projects work toward a very different goal, explains Christopher Cieri, LDC Executive Director since 1998. For example, annotator Alyaa Abbood has spent the last two years updating a 1960s-era Iraqi-Arabic dictionary, a U.S. Department of Education–sponsored collaboration between the Consortium and Georgetown University Press. Her work will lead to a new, standardized edition for use in academia and other venues. In all, the Consortium has published data containing material in 75 languages.

In addition to his professorial duties and work at the LDC, Consortium founder Mark Liberman is also Faculty Director of College Houses and Academic Services and founder of Language Log, a blog that presents linguistic research in a popular form and dissects linguistic idiosyncrasies in popular media and literature. “The LDC has played an important role in the last 20 years of progress,” Liberman says. “We continue to be in the middle of exciting new developments. I look forward to an increased impact on speech and language science, and to applications in new areas.”

Arts & Sciences News