New method for high-speed synthesis of natural voices

February 05, 2019


To date, many speech synthesis systems have adopted the vocoder approach, a method for synthesizing speech waveforms that is widely used in cellular-phone networks and other applications. However, the quality of the speech waveforms synthesized by these methods has remained inferior to that of the human voice. In 2016, an influential overseas technology company proposed WaveNet--a speech-synthesis method based on deep-learning algorithms--and demonstrated the ability to synthesize high-quality speech waveforms resembling the human voice. However, one drawback of WaveNet is the extremely complex structure of its neural networks, which demand large quantities of voice data for machine learning and require parameter tuning and various other laborious trial-and-error procedures to be repeated many times before accurate predictions can be obtained.

Overview and achievements of the research

One of the most well-known vocoders is the source-filter vocoder, which was developed in the 1960s and remains in widespread use today. The NII research team infused the conventional source-filter vocoder method with modern neural-network algorithms to develop a new technique for synthesizing high-quality speech waveforms resembling the human voice. Among the advantages of this neural source-filter (NSF) method is the simple structure of its neural networks, which require only about 1 hour of voice data for machine learning and can obtain correct predictive results without extensive parameter tuning. Moreover, large-scale listening tests have demonstrated that speech waveforms produced by NSF techniques are comparable in quality to those generated by WaveNet.
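For readers unfamiliar with the classical source-filter idea mentioned above, the following is a minimal sketch of it in plain Python: a periodic excitation signal (the "source") is passed through resonant filters that stand in for the vocal tract (the "filter"). This is not the NSF method and is not taken from the released source code; the function names, the impulse-train excitation, and the formant frequencies and bandwidths are illustrative assumptions only.

```python
import math

def impulse_train(f0_hz, sample_rate, num_samples):
    """Voiced excitation (the 'source'): one unit pulse per pitch period."""
    period = round(sample_rate / f0_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(num_samples)]

def resonator_coeffs(center_hz, bandwidth_hz, sample_rate):
    """Denominator coefficients (a1, a2) of a two-pole resonator for one formant."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius, always < 1
    theta = 2.0 * math.pi * center_hz / sample_rate      # pole angle
    return -2.0 * r * math.cos(theta), r * r

def all_pole_filter(x, a1, a2):
    """Apply y[n] = x[n] - a1*y[n-1] - a2*y[n-2] (the 'filter')."""
    y, y1, y2 = [], 0.0, 0.0
    for xn in x:
        yn = xn - a1 * y1 - a2 * y2
        y.append(yn)
        y2, y1 = y1, yn
    return y

def synthesize(f0_hz, formants, sample_rate=16000, duration_s=0.05):
    """Pass the excitation through one resonator per (center, bandwidth) pair."""
    num_samples = int(sample_rate * duration_s)
    signal = impulse_train(f0_hz, sample_rate, num_samples)
    for center_hz, bandwidth_hz in formants:
        a1, a2 = resonator_coeffs(center_hz, bandwidth_hz, sample_rate)
        signal = all_pole_filter(signal, a1, a2)
    return signal

# Hypothetical settings: 100 Hz pitch and two made-up formants.
waveform = synthesize(100.0, [(700.0, 80.0), (1200.0, 90.0)])
```

In this sketch the excitation and filter parameters are fixed by hand. According to the description above, the NSF approach keeps the source-filter framing but uses neural networks in place of such hand-designed components.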

Future outlook

Because the theoretical basis of NSF differs from the patented technologies used by influential overseas ICT companies, the adoption of NSF techniques is likely to spur new technological advances in speech synthesis. For this reason, the source code implementing the NSF method has been made available to the public at no cost, allowing it to be widely used.
Source code, trained NSF models, and the actual NSF-synthesized speech samples (both Japanese and English) are available at the following sites:

Source code:

Trained models (may be executed to generate English-language voices):

Voice samples (Japanese or English):

Associate Professor Junichi Yamagishi makes the following comment:

"We hope that our NSF method will create new business opportunities for Japanese AI firms that use voice-based interfaces. In future work, we will work to make the method available for use as a real-time voice-synthesis engine in a wide variety of systems. We are also planning to add speaker adaptation and other related features to the NSF method."

Please visit the following page for comparisons of actual human voices to voice waveforms produced by source-filter vocoder methods, by WaveNet, and by NSF.

*The video is narrated in Japanese only.

About this research project

The research described here was supported by the Japan Science and Technology Agency under CREST JPMJCR18A6 and by the Japan Society for the Promotion of Science under Grants-in-Aid for Scientific Research "KAKENHI" 16H06302, 16K16096, 17H04687, 18H04120, 18H04112, and 18KT0051.

Paper title and authors

Title: Neural source-filter-based waveform model for statistical parametric speech synthesis
Authors: Xin Wang, Shinji Takaki, Junichi Yamagishi
Venue: International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2019 (accepted February 01, 2019)

Date announced: October 30, 2018 (arXiv: )

Research Organization of Information and Systems