
New method enables high-quality speech separation

June 05, 2018

CHICAGO--People have a natural knack for focusing on what a single person is saying, even when there are competing conversations in the background or other distracting sounds. For instance, people can often make out what is being said by someone at a crowded restaurant, during a noisy party, or while viewing televised debates where multiple pundits are talking over one another. To date, being able to computationally--and accurately--mimic this natural human ability to isolate speech has been a difficult task.

"Computers are becoming better and better at understanding speech, but still have significant difficulty understanding speech when several people are speaking together or when there is a lot of noise," says Ariel Ephrat, a PhD candidate at Hebrew University of Jerusalem-Israel and lead author of the research. (Ephrat developed the new model while interning at Google the summer of 2017.) "We humans know how to understand speech in such conditions naturally, but we want computers to be able to do it as well as us, maybe even better."

To this end, Ephrat and his colleagues at Google have developed a novel audio-visual model for isolating and enhancing the speech of desired speakers in a video. The team's deep network-based model incorporates both visual and auditory signals in order to isolate and enhance any speaker in any video, even in challenging real-world scenarios, such as video conferencing, where multiple participants often talk at once, and noisy bars, with their mix of background noise, music, and competing conversations.

The team, which includes Google's Inbar Mosseri, Oran Lang, Tali Dekel, Kevin Wilson, Avinatan Hassidim, William T. Freeman, and Michael Rubinstein, will present their work at SIGGRAPH 2018, held 12-16 August in Vancouver, British Columbia. The annual conference and exhibition showcases the world's leading professionals, academics, and creative minds at the forefront of computer graphics and interactive techniques.

In this work, the researchers focused not only on auditory cues to separate speech but also on visual cues in the video--i.e., the subject's lip movements and potentially other facial movements that correspond to what he or she is saying. These visual features are used to "focus" the audio on a single subject who is speaking and to improve the quality of speech separation.
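
The paper describes the team's actual architecture; as a rough illustration only, the sketch below shows one common way such audio-visual fusion can be set up, assuming a mask-based approach in which per-frame face embeddings are combined with the noisy spectrogram to predict a time-frequency mask for the selected speaker. The layer types, sizes, and the use of PyTorch here are illustrative assumptions, not the authors' design.

```python
# Conceptual sketch only -- not the authors' code. Assumes a mask-based
# audio-visual separation setup: per-frame face embeddings are fused with the
# mixture spectrogram, and the network predicts a mask for the chosen speaker.
import torch
import torch.nn as nn

class AudioVisualSeparator(nn.Module):
    def __init__(self, visual_dim=512, freq_bins=257, hidden=400):
        super().__init__()
        # Fuse visual embedding and audio spectrogram frame by frame.
        self.rnn = nn.LSTM(visual_dim + freq_bins, hidden,
                           batch_first=True, bidirectional=True)
        # Predict a per-time-frequency mask for the selected speaker.
        self.mask = nn.Linear(2 * hidden, freq_bins)

    def forward(self, face_embeddings, mixture_spec):
        # face_embeddings: (batch, time, visual_dim), one embedding per video frame
        # mixture_spec:    (batch, time, freq_bins), magnitude spectrogram of the mix
        fused, _ = self.rnn(torch.cat([face_embeddings, mixture_spec], dim=-1))
        mask = torch.sigmoid(self.mask(fused))
        return mask * mixture_spec  # estimated spectrogram of the target speaker
```

In such a setup, conditioning the mask on the selected face is what lets the same network separate out whichever speaker the user points to, rather than being tied to a fixed set of voices.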

To train their joint audio-visual model, Ephrat and collaborators curated a new dataset, "AVSpeech," comprising thousands of YouTube videos and other online video segments, such as TED Talks, how-to videos, and high-quality lectures. From AVSpeech, the researchers generated a training set of so-called "synthetic cocktail parties"--mixtures of face videos with clean speech and other speech audio tracks with background noise. To isolate speech from these videos, the user need only specify the face of the person in the video whose audio is to be singled out.
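
As a rough illustration of that data-generation step, the snippet below mixes a target speaker's clean speech with a competing speaker and background noise at illustrative signal-to-noise ratios. The file names and mixing levels are hypothetical; the actual AVSpeech mixing pipeline is described in the paper.

```python
# Illustrative sketch of building one "synthetic cocktail party" training
# example: clean speech from the target speaker's video is mixed with another
# speaker's audio and background noise. File names and SNRs are hypothetical.
import numpy as np
import soundfile as sf

def mix_at_snr(signal, interference, snr_db):
    """Scale `interference` so the signal-to-interference ratio equals snr_db."""
    sig_power = np.mean(signal ** 2)
    int_power = np.mean(interference ** 2) + 1e-12
    scale = np.sqrt(sig_power / (int_power * 10 ** (snr_db / 10)))
    return signal + scale * interference

target, sr = sf.read("target_speaker_clean.wav")   # clean speech of the target face
other, _ = sf.read("other_speaker_clean.wav")      # competing speech
noise, _ = sf.read("background_noise.wav")         # e.g., crowd noise or music

n = min(len(target), len(other), len(noise))
mixture = mix_at_snr(target[:n], other[:n], snr_db=0)   # overlapping speech
mixture = mix_at_snr(mixture, noise[:n], snr_db=10)     # add background noise
sf.write("synthetic_mixture.wav", mixture, sr)

# A training pair would then be: (video of the target face, mixture) -> clean target speech.
```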

In multiple examples detailed in the paper, titled "Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation," the new method produced superior results compared with existing audio-only methods on pure speech mixtures, and delivered significant improvements in recovering clear audio from mixtures containing overlapping speech and background noise in real-world scenarios. While the focus of the work is speech separation and enhancement, the team's novel method could also be applied to automatic speech recognition (ASR) and video transcription--i.e., closed-captioning capabilities on streaming videos and TV. In a demonstration, the new joint audio-visual model produced more accurate captions in scenarios where two or more speakers were involved.

Surprised at first by how well their method worked, the researchers are excited about its future potential.

"We haven't seen speech separation done 'in-the-wild' at such quality before. This is why we see an exciting future for this technology," notes Ephrat. "There is more work needed before this technology lands in consumer hands, but with the promising preliminary results that we've shown, we can certainly see it supporting a range of applications in the future, like video captioning, video conferencing, and even improved hearing aids if such devices could be combined with cameras."

The researchers are currently exploring opportunities for incorporating the technology into various Google products.
-end-
For the full paper and videos, visit the group's project page.

About ACM, ACM SIGGRAPH, and SIGGRAPH 2018

ACM, the Association for Computing Machinery, is the world's largest educational and scientific computing society, uniting educators, researchers, and professionals to inspire dialogue, share resources, and address the field's challenges. ACM SIGGRAPH is a special interest group within ACM that serves as an interdisciplinary community for members in research, technology, and applications in computer graphics and interactive techniques. SIGGRAPH is the world's leading annual interdisciplinary educational experience showcasing the latest in computer graphics and interactive techniques. SIGGRAPH 2018, marking the 45th annual conference hosted by ACM SIGGRAPH, will take place from 12-16 August at the Vancouver Convention Centre in Vancouver, B.C.

To register for SIGGRAPH 2018 and hear from the authors themselves, visit s2018.siggraph.org/attend/register.

Association for Computing Machinery
