Machine Language

How does one go about teaching a machine a human language? This was the question DARPA (Defense Advanced Research Projects Agency) had in mind when it issued a call for the creation of an organization to support human language technology research and development—a call answered with the establishment of the Linguistic Data Consortium (LDC) at Penn. “Think of it as a semester abroad for computers,” jokes Mark Liberman, the Trustee Professor of Phonetics and Professor of Computer and Information Science who founded the Consortium in 1992. “Learning a language takes a lot of experience.”

In order to gain this “experience,” a computer requires vast amounts of human language data, as well as directions for interpretation. These collections are often too time-consuming and expensive for individual research groups to create, and in providing shared resources to speech and language researchers around the world, the Consortium helps facilitate intellectual exchange. The LDC also acts as an intermediary for intellectual property rights. Its contracts with over 70 data providers allow researchers at more than a thousand institutions to use billions of words of text, and tens of thousands of hours of speech, without violating the copyrights of publishers and broadcasters.

LDC team members are skilled in some combination of linguistics, computer science and project management. Though many of the methods the Consortium uses to gather speech and language data are automated, the majority of data sets depend on human analysis. One of the key research areas the LDC supports is speech recognition (speech-to-text). Speech is collected from a variety of sources, including satellite dishes and cable feeds. It is also captured from human subjects recruited to participate in telephone conversations and face-to-face interviews. Afterwards, the recordings are transcribed and stored, along with information about the source and the recording process.

In order to learn to “understand” speech or text, machines, just like humans, need information about meaning. To provide these data, LDC annotators are often asked to tag texts for “entities,” such as people or places. Researchers then use these tagged texts to develop and test programs that can extract the same sort of information automatically from new material.

To develop and test methods for speaker identification, researchers need examples of speakers recorded in multiple places, talking about multiple topics, using multiple recording devices. Otherwise, instead of learning to recognize differences among speakers, machine algorithms would learn to recognize differences among microphones, differences between rooms, or even differences between casual conversations and formal interviews.

Technology derived from this research could eventually lend authorities the ability to match threatening phone calls to suspects in custody, or enable a telephone banking service to identify a customer by voice alone.

Some Consortium projects work toward a very different goal, explains Christopher Cieri, LDC Executive Director since 1998. For example, annotator Alyaa Abbood has spent the last two years updating a 1960s-era Iraqi-Arabic dictionary, a U.S. Department of Education–sponsored collaboration between the Consortium and Georgetown University Press. Her work will lead to a new, standardized edition for use in academia and other venues. In all, the Consortium has published data containing material in 75 languages.

In addition to his professorial duties and work at the LDC, Consortium founder Mark Liberman is also Faculty Director of College Houses and Academic Services and founder of Language Log, a blog that presents linguistic research in a popular form and dissects linguistic idiosyncrasies in popular media and literature. “The LDC has played an important role in the last 20 years of progress,” Liberman says. “We continue to be in the middle of exciting new developments. I look forward to an increased impact on speech and language science, and to applications in new areas.”
