Many videos today are recorded using only a single microphone, such as those built into smartphones or cameras. While this is convenient, it means that the recorded sound does not contain information about direction or distance. As a result, even when a video clearly shows where sounds are coming from, the audio itself often feels flat and unrealistic.
Binaural audio, sometimes called 3D audio, reproduces how humans naturally hear sound using both ears. It allows listeners to perceive where a sound is coming from and how far away it is, creating a strong sense of immersion. However, recording binaural audio typically requires special equipment that imitates the shape of the human head and ears, making it expensive and impractical for everyday use.
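As background on why two channels matter: the brain localizes sound largely from small differences in arrival time and loudness between the two ears. The short script below is an illustration of standard acoustics, not anything from the study itself; it estimates the interaural time difference using the classic Woodworth spherical-head approximation.

```python
import numpy as np

HEAD_RADIUS_M = 0.0875   # average human head radius, about 8.75 cm
SPEED_OF_SOUND = 343.0   # speed of sound in air (m/s) at roughly 20 °C

def itd_woodworth(azimuth_rad: float) -> float:
    """Interaural time difference (seconds) for a distant source.

    Woodworth's spherical-head approximation; azimuth 0 is straight
    ahead, pi/2 is directly to one side.
    """
    return (HEAD_RADIUS_M / SPEED_OF_SOUND) * (azimuth_rad + np.sin(azimuth_rad))

# A source 90 degrees to the right reaches the near ear ~0.66 ms earlier:
print(f"{itd_woodworth(np.pi / 2) * 1e3:.2f} ms")
```

Cues this small are exactly what a single microphone cannot capture, which is why they must be synthesized after the fact.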
To address this problem, researchers at the University of Electro-Communications developed an AI-based method that can generate binaural audio from ordinary monaural recordings. Instead of relying on special microphones, the system uses visual information from the video itself to guide how the sound should be spatialized.
The key idea behind the method is that people naturally use vision to interpret sound. When watching a video, viewers expect sounds to come from the objects they see on the screen. For example, if a musician is standing on the right side of the image, listeners intuitively expect the sound to be heard from the right. The proposed system learns this relationship by analyzing both the audio and the video together.
The AI first identifies sound-producing objects in the video and estimates their positions within the scene. It then uses this visual and positional information, along with the original monaural audio, to generate a new version of the sound that matches the visual layout. This allows the system to create a sense of direction and space that was not present in the original recording.
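To make that pipeline concrete, here is a deliberately simplified sketch of how such conditioning can be wired up. The layer sizes, feature extractors, noising step, and placeholder data are illustrative assumptions, not the authors' implementation; the paper uses a diffusion model conditioned on visual and positional information, and this sketch only mirrors that idea in miniature.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch (my simplification, not the authors' released code) of the
# conditioning idea: a denoising network learns to predict the noise added to
# the target two-channel audio, given (a) the original mono audio, (b) visual
# features of the detected sound sources, and (c) their on-screen positions.
# A real diffusion model would also condition on the noise level / timestep
# and use a far larger architecture.
class ConditionedDenoiser(nn.Module):
    def __init__(self, d: int = 128, visual_dim: int = 512):
        super().__init__()
        self.audio_proj = nn.Conv1d(1, d, kernel_size=7, padding=3)  # mono waveform
        self.visual_proj = nn.Linear(visual_dim, d)  # e.g. image-encoder features
        self.pos_proj = nn.Linear(2, d)              # normalized (x, y) source position
        self.denoise = nn.Sequential(
            nn.Conv1d(2 + d, d, 7, padding=3), nn.GELU(),
            nn.Conv1d(d, 2, 7, padding=3),           # predicted noise, both channels
        )

    def forward(self, noisy_binaural, mono, visual_feat, position):
        cond = self.audio_proj(mono)                                      # (B, d, T)
        source = self.visual_proj(visual_feat) + self.pos_proj(position)  # (B, d)
        cond = cond + source.unsqueeze(-1)            # broadcast source embedding over time
        x = torch.cat([noisy_binaural, cond], dim=1)  # (B, 2 + d, T)
        return self.denoise(x)

# One schematic training step on placeholder data:
model = ConditionedDenoiser()
binaural = torch.randn(4, 2, 16000)        # stand-in for ground-truth binaural clips
mono = binaural.mean(dim=1, keepdim=True)  # the monaural input
visual = torch.randn(4, 512)               # stand-in visual features
position = torch.rand(4, 2)                # stand-in on-screen positions
noise = torch.randn_like(binaural)
noisy = 0.7 * binaural + 0.7 * noise       # one noising step, schematically
loss = F.mse_loss(model(noisy, mono, visual, position), noise)
loss.backward()
```

At inference time, such a network starts from pure noise and is guided step by step by the same conditioning signals until a clean two-channel signal emerges that matches the scene.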
Previous AI approaches often worked well only under artificial conditions. Many were trained and evaluated on audio that had been artificially derived from binaural recordings and therefore still contained subtle spatial cues. When applied to real recordings made with a single microphone, such methods often failed to reproduce any meaningful spatial effect. In contrast, the new approach was designed from the start to work with real-world monaural recordings.
To test the system under practical conditions, the researchers created a new dataset that includes synchronized video, single-microphone audio, and true binaural recordings captured in real environments. Using this dataset, they conducted experiments and listening tests to compare the proposed method with existing techniques.
The results showed that the new method produced audio that listeners perceived as coming from directions consistent with the visual scene. In situations where earlier methods collapsed into flat, monaural sound, the proposed approach was able to preserve a clear sense of spatial placement. Although some noise and limitations remain, especially in complex environments, the overall results demonstrate a significant improvement in realistic 3D sound generation.
By enabling immersive spatial audio without special recording equipment, this technology could make 3D sound more accessible for everyday video content. Potential applications include online videos, digital entertainment, virtual and augmented reality, and enhancing older recordings that were originally captured without spatial audio.
Authors: Haruka Okano, Ryohei Orihara, Yasuyuki Tahara, Yuichi Sei
Affiliation: University of Electro-Communications, Tokyo, Japan
Method of Research: Computational simulation/modeling
Subject of Research: Not applicable
Article Title: Binaural Audio Generation Using Diffusion Model Conditioned on Visual and Positional Information of the Sound Sources
Article Publication Date: 12-Nov-2025
COI Statement: The authors declare no competing interests.