
New approach may open up speech recognition to more languages

December 07, 2016

Speech recognition systems, such as those that convert speech to text on cellphones, are generally the result of machine learning. A computer pores through thousands or even millions of audio files and their transcriptions, and learns which acoustic features correspond to which typed words.

But transcribing recordings is costly, time-consuming work, which has limited speech recognition to a small subset of languages spoken in wealthy nations.

At the Neural Information Processing Systems conference this week, researchers from MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) are presenting a new approach to training speech-recognition systems that doesn't depend on transcription. Instead, their system analyzes correspondences between images and spoken descriptions of those images, as captured in a large collection of audio recordings. The system then learns which acoustic features of the recordings correlate with which image characteristics.

"The goal of this work is to try to get the machine to learn language more like the way humans do," says Jim Glass, a senior research scientist at CSAIL and a co-author on the paper describing the new system. "The current methods that people use to train up speech recognizers are very supervised. You get an utterance, and you're told what's said. And you do this for a large body of data.

"Big advances have been made -- Siri, Google -- but it's expensive to get those annotations, and people have thus focused on, really, the major languages of the world. There are 7,000 languages, and I think less than 2 percent have ASR [automatic speech recognition] capability, and probably nothing is going to be done to address the others. So if you're trying to think about how technology can be beneficial for society at large, it's interesting to think about what we need to do to change the current situation. And the approach we've been taking through the years is looking at what we can learn with less supervision."

Joining Glass on the paper are first author David Harwath, a graduate student in electrical engineering and computer science (EECS) at MIT; and Antonio Torralba, an EECS professor.

Visual semantics

The version of the system reported in the new paper doesn't correlate recorded speech with written text; instead, it correlates speech with groups of thematically related images. But that correlation could serve as the basis for others, such as a correlation between speech and written text.

If, for instance, an utterance is associated with a particular class of images, and the images have text terms associated with them, it should be possible to find a likely transcription of the utterance, all without human intervention. Similarly, a class of images with associated text terms in different languages could provide a way to do automatic translation.

Conversely, text terms associated with similar clusters of images, such as, say, "storm" and "clouds," could be inferred to have related meanings. Because the system in some sense learns words' meanings -- the images associated with them -- and not just their sounds, it has a wider range of potential applications than a standard speech recognition system.

To test their system, the researchers used a database of 1,000 images, each of which had a recording of a free-form verbal description associated with it. They would feed their system one of the recordings and ask it to retrieve the 10 images that best matched it. That set of 10 images would contain the correct one 31 percent of the time.
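As a concrete picture of that retrieval test, here is a minimal sketch in Python (not the researchers' code; the array layout and use of NumPy are illustrative assumptions):

```python
# Minimal sketch of the retrieval test: embed all 1,000 recordings
# and images, rank images by dot-product similarity for each
# recording, and check whether the matching image lands in the top 10.
import numpy as np

def recall_at_10(audio_vecs, image_vecs):
    # audio_vecs, image_vecs: arrays of shape (1000, d), where row i
    # of each array corresponds to the same image/recording pair.
    scores = audio_vecs @ image_vecs.T           # all-pairs similarities
    top10 = np.argsort(-scores, axis=1)[:, :10]  # 10 best images per query
    hits = [i in top10[i] for i in range(len(audio_vecs))]
    return float(np.mean(hits))                  # about 0.31 in this test
```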

"I always emphasize that we're just taking baby steps here and have a long way to go," Glass says. "But it's an encouraging start."

The researchers trained their system on images from a huge database built by Torralba; Aude Oliva, a principal research scientist at CSAIL; and their students. Through Amazon's Mechanical Turk crowdsourcing site, they hired people to describe the images verbally, using whatever phrasing came to mind, for about 10 to 20 seconds.

For an initial demonstration of the researchers' approach, that kind of tailored data was necessary to ensure good results. But the ultimate aim is to train the system using digital video, with minimal human involvement. "I think this will extrapolate naturally to video," Glass says.

Merging modalities

To build their system, the researchers used neural networks, machine-learning systems that approximately mimic the structure of the brain. Neural networks are composed of processing nodes that, like individual neurons, are capable of only very simple computations but are connected to each other in dense networks. Data is fed to a network's input nodes, which modify it and feed it to other nodes, which modify it and feed it to still other nodes, and so on. When a neural network is being trained, it constantly modifies the operations executed by its nodes in order to improve its performance on a specified task.
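As a toy illustration of that node-level computation (not the researchers' model), a single layer of nodes can be written as a weighted sum of inputs passed through a simple nonlinearity:

```python
# Toy illustration: each "node" computes a weighted sum of its
# inputs; stacking layers feeds one set of nodes into the next.
import numpy as np

def layer(inputs, weights, biases):
    # A rectified weighted sum -- one simple per-node computation.
    return np.maximum(0, weights @ inputs + biases)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                              # input node values
h = layer(x, rng.normal(size=(8, 4)), np.zeros(8))  # hidden nodes
y = layer(h, rng.normal(size=(2, 8)), np.zeros(2))  # output nodes
```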

The researchers' network is, in effect, two separate networks: one that takes images as input and one that takes spectrograms, which represent audio signals as changes of amplitude, over time, in their component frequencies. The output of the top layer of each network is a 1,024-dimensional vector -- a sequence of 1,024 numbers.
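In code, that two-branch design might be sketched as follows (a minimal PyTorch sketch; the layer choices are hypothetical, since the article specifies only that each branch ends in a 1,024-dimensional vector):

```python
# Minimal sketch (not the authors' architecture) of two networks
# that map images and spectrograms into a shared 1,024-d space.
import torch
import torch.nn as nn

EMBED_DIM = 1024  # dimensionality reported in the article

class ImageBranch(nn.Module):
    def __init__(self):
        super().__init__()
        # Hypothetical stand-in for the image network.
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.project = nn.Linear(64, EMBED_DIM)

    def forward(self, images):             # (batch, 3, height, width)
        return self.project(self.features(images).flatten(1))

class AudioBranch(nn.Module):
    def __init__(self):
        super().__init__()
        # Hypothetical stand-in for the spectrogram network; the
        # spectrogram is treated as a 1-channel frequency-by-time grid.
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.project = nn.Linear(64, EMBED_DIM)

    def forward(self, spectrograms):       # (batch, 1, freq, time)
        return self.project(self.features(spectrograms).flatten(1))
```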

The final node in the network takes the dot product of the two vectors. That is, it multiplies the corresponding terms in the vectors together and adds them all up to produce a single number. During training, the networks had to try to maximize the dot product when the audio signal corresponded to an image and minimize it when it didn't.
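Continuing the sketch above, the matching score and one plausible form of that training objective might look like this (the margin ranking loss is an assumption; the article says only that matched dot products are pushed up and mismatched ones pushed down):

```python
# Sketch of the dot-product score and a margin-based objective.
import torch

def similarity(image_vecs, audio_vecs):
    # Dot product: multiply corresponding terms and sum them up.
    return (image_vecs * audio_vecs).sum(dim=-1)

def ranking_loss(image_vecs, audio_vecs, margin=1.0):
    # Matched pairs sit at the same row index; rolling the audio
    # batch by one row creates mismatched pairs.
    pos = similarity(image_vecs, audio_vecs)
    neg = similarity(image_vecs, audio_vecs.roll(1, dims=0))
    # Push matched scores above mismatched ones by at least `margin`.
    return torch.clamp(margin - pos + neg, min=0).mean()
```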

For every spectrogram that the researchers' system analyzes, it can identify the points at which the dot-product peaks. In experiments, those peaks reliably picked out words that provided accurate image labels -- "baseball," for instance, in a photo of a baseball pitcher in action, or "grassy" and "field" for an image of a grassy field.
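A sketch of that localization step, under the assumption that the audio network can emit one embedding per time step (the article doesn't describe the mechanism):

```python
# Sketch: score each time step of a spectrogram against an image
# embedding and pick the peaks, which in the experiments tended to
# fall on label-like words ("baseball", "grassy", "field").
import numpy as np

def frame_scores(frame_embeddings, image_embedding):
    # frame_embeddings: (time, d); image_embedding: (d,)
    return frame_embeddings @ image_embedding  # dot product per time step

def top_peaks(scores, k=3):
    # Indices of the k highest-scoring time steps, best first.
    return np.argsort(scores)[::-1][:k]
```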

In ongoing work, the researchers have refined the system so that it can pick out spectrograms of individual words and identify just those regions of an image that correspond to them.
Additional background

PAPER: Unsupervised learning of spoken language with visual context
http://papers.nips.cc/paper/6186-unsupervised-learning-of-spoken-language-with-visual-context.pdf

ARCHIVE: Learning spoken language
http://news.mit.edu/2015/learning-spoken-language-phoneme-data-0914

ARCHIVE: Object recognition for free
http://news.mit.edu/2015/visual-scenes-object-recognition-0508

ARCHIVE: Automatic speaker tracking in audio recordings
http://news.mit.edu/2013/automatic-speaker-tracking-in-audio-recordings-1018

Massachusetts Institute of Technology
