New method enables high-quality speech separation

June 05, 2018

CHICAGO--People have a natural knack for focusing on what a single person is saying, even when there are competing conversations in the background or other distracting sounds. For instance, people can often make out what someone is saying at a crowded restaurant, during a noisy party, or while watching televised debates in which multiple pundits talk over one another. To date, computationally mimicking this natural human ability to isolate speech with any accuracy has proven difficult.

"Computers are becoming better and better at understanding speech, but still have significant difficulty understanding speech when several people are speaking together or when there is a lot of noise," says Ariel Ephrat, a PhD candidate at Hebrew University of Jerusalem-Israel and lead author of the research. (Ephrat developed the new model while interning at Google the summer of 2017.) "We humans know how to understand speech in such conditions naturally, but we want computers to be able to do it as well as us, maybe even better."

To this end, Ephrat and his colleagues at Google have developed a novel audio-visual model for isolating and enhancing the speech of a desired speaker in a video. The team's deep network-based model incorporates both visual and auditory signals to isolate and enhance any speaker in any video, even in challenging real-world scenarios such as video conferencing, where multiple participants often talk at once, and noisy bars, with their mix of background noise, music, and competing conversations.

The team, which includes Google's Inbar Mosseri, Oran Lang, Tali Dekel, Kevin Wilson, Avinatan Hassidim, William T. Freeman, and Michael Rubinstein, will present their work at SIGGRAPH 2018, held 12-16 August in Vancouver, British Columbia. The annual conference and exhibition showcases the world's leading professionals, academics, and creative minds at the forefront of computer graphics and interactive techniques.

In this work, the researchers used not only auditory cues to separate speech but also visual cues in the video--i.e., the subject's lip movements and potentially other facial movements that may correlate with what he or she is saying. These visual features are used to "focus" the audio on the single subject who is speaking and to improve the quality of the speech separation.

To train their joint audio-visual model, Ephrat and collaborators curated a new dataset, "AVSpeech," composed of thousands of YouTube videos and other online video segments, such as TED Talks, how-to videos, and high-quality lectures. From AVSpeech, the researchers generated a training set of so-called "synthetic cocktail parties"--mixtures of face videos with clean speech and other speech audio tracks with background noise. To isolate speech from these videos, the user is only required to specify the face of the person in the video whose audio is to be singled out.
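The core idea behind a "synthetic cocktail party" is simple: sum a clean speech waveform with interfering speech and background noise, keeping the clean track as the supervised training target. The sketch below illustrates that mixing step only; the function name, the gain parameter, and the peak normalization are illustrative assumptions, not the paper's exact data pipeline.

```python
import numpy as np

def mix_cocktail(clean, interferers, noise, noise_gain=0.5):
    """Form one synthetic cocktail-party mixture (illustrative sketch).

    clean       -- mono waveform of the target speaker, float samples in [-1, 1]
    interferers -- list of mono waveforms of competing speakers
    noise       -- mono background-noise waveform
    noise_gain  -- scale applied to the noise track (assumed parameter)

    The shortest track determines the mixture length. The clean track is
    kept separately as the training target; only the mixture is returned.
    """
    n = min(len(clean), len(noise), *(len(x) for x in interferers))
    mixture = clean[:n].astype(np.float64).copy()
    for x in interferers:
        mixture += x[:n]                  # add competing speech
    mixture += noise_gain * noise[:n]     # add scaled background noise
    # Renormalize so the mixture stays in [-1, 1] and will not clip
    # if written back out as 16-bit audio.
    peak = np.max(np.abs(mixture))
    if peak > 1.0:
        mixture /= peak
    return mixture

# Toy usage with synthetic tones standing in for real speech tracks:
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
target = 0.5 * np.sin(2 * np.pi * 220.0 * t)
other = 0.5 * np.sin(2 * np.pi * 330.0 * t)
hum = 0.1 * np.sin(2 * np.pi * 50.0 * t)
mix = mix_cocktail(target, [other], hum)
```

A model trained on such pairs sees the mixture as input (together with the target speaker's face track) and learns to recover the clean waveform, which is known exactly because the mixture was constructed synthetically.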

In multiple examples detailed in the paper, titled "Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation," the new method produced superior results compared to existing audio-only methods on pure speech mixtures, and delivered significant improvements in extracting clear audio from mixtures containing overlapping speech and background noise in real-world scenarios. While the focus of the work is speech separation and enhancement, the team's novel method could also be applied to automatic speech recognition (ASR) and video transcription--i.e., closed-captioning capabilities on streaming videos and TV. In a demonstration, the new joint audio-visual model produced more accurate captions in scenarios where two or more speakers were involved.

Surprised at first by how well their method worked, the researchers are excited about its future potential.

"We haven't seen speech separation done 'in-the-wild' at such quality before. This is why we see an exciting future for this technology," notes Ephrat. "There is more work needed before this technology lands in consumer hands, but with the promising preliminary results that we've shown, we can certainly see it supporting a range of applications in the future, like video captioning, video conferencing, and even improved hearing aids if such devices could be combined with cameras."

The researchers are currently exploring opportunities for incorporating the technology into various Google products.

For the full paper and videos, visit the group's project page.


ACM, the Association for Computing Machinery, is the world's largest educational and scientific computing society, uniting educators, researchers, and professionals to inspire dialogue, share resources, and address the field's challenges. ACM SIGGRAPH is a special interest group within ACM that serves as an interdisciplinary community for members in research, technology, and applications in computer graphics and interactive techniques. SIGGRAPH is the world's leading annual interdisciplinary educational experience showcasing the latest in computer graphics and interactive techniques. SIGGRAPH 2018, marking the 45th annual conference hosted by ACM SIGGRAPH, will take place from 12-16 August at the Vancouver Convention Centre in Vancouver, B.C.

To register for SIGGRAPH 2018 and hear from the authors themselves, visit

Association for Computing Machinery
