Spatial Sound in Virtual Reality

7 Spatial Audio

The perception of immersion is influenced by physiological (human anatomy), psychological, and environmental factors. The shape of the head and ears plays an important role in the localization of sounds. Virtual simulations of spatial audio are best realized when an understanding of the relevant psychoacoustic and acoustic cues is built into the signal processing design.

Today, there are many recording formats, hardware and software tools that can be used to realize the desired spatial imagery. 3D graphics engines (Unity, Unreal) are mainly designed for the production of video games and include a set of tools for placing sounds in a virtual world. There are dedicated audio engines (FMOD, Wwise) for games that integrate with the 3D graphics engines, and there are numerous SDKs that can be used with a variety of languages and platforms. For example, Google's Resonance Audio SDK supports many platforms. According to Google, "Resonance Audio is a multi-platform spatial audio SDK, delivering high fidelity at scale". It "simulates how sound waves interact with human ears and their environment" (https://resonance-audio.github.io/resonance-audio/discover/overview.html).

There are two main components to spatial audio: one is the sound world into which the listener is immersed; the other is the listener, or rather the listener's head, within that world. To create a convincing experience, the movements of the head must be tracked and the world rotated accordingly. Spatial audio is thus a sonic experience in which the audio changes with the movement of the listener's head. This experience can be produced by four speakers (quad), surround sound (5.1, 7.1), and higher configurations with hundreds of speakers; it can also be produced over headphones. Before VR, sound localization was explored by composers of Musik im Raum (music in space).
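To make the relationship between head tracking and world rotation concrete, here is a minimal sketch in Python/NumPy. The function name and the coordinate convention (listener at the origin, x pointing forward, y to the left) are illustrative assumptions, not taken from any particular SDK; the point is simply that rotating the head by +yaw is equivalent to rotating the world by -yaw.

```python
import numpy as np

def source_azimuth_relative_to_head(source_pos, head_yaw_rad):
    """Return the azimuth (radians) of a world-fixed source as heard by a
    listener whose head has turned by head_yaw_rad about the vertical axis.

    Turning the head by +yaw is equivalent to rotating the world by -yaw,
    which is why head tracking lets the virtual sound world stay fixed.
    """
    x, y = source_pos                 # listener at origin; x = front, y = left
    world_azimuth = np.arctan2(y, x)  # azimuth of the source in world coordinates
    return world_azimuth - head_yaw_rad

# A source fixed 90 degrees to the left of the world's front direction:
rel = source_azimuth_relative_to_head((0.0, 1.0), np.radians(90))
print(np.degrees(rel))  # 0.0: after turning the head 90 degrees left, the source is straight ahead
```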

Spatialization refers to the virtual projection and localization of a sound source in space, including the algorithms that create its trajectories and, in the case of reproducing audio recorded with soundfield techniques, the encoding and decoding algorithms and techniques.

Stereo and multichannel surround-sound reproduction systems are speaker-dependent, that is, they depend on a given speaker setup. For example, 5.1 surround always consists of five full-bandwidth speakers (20 Hz–20 kHz) and a subwoofer used mainly for low-frequency effects. This is the common setup in a home theater. The layout does not really immerse the listener in a sound field, but it creates clarity. In practice, it functions as a stereo (L-R) system for music and sound effects, a center speaker for dialog, and two back speakers (BL-BR) for special effects, with the added subwoofer enhancing effects such as explosions, machines, etc. Each channel is focused on a center point called the sweet spot, where the listener needs to be located in order to experience the system the way the sound mixer intended.

As opposed to this listener-centric approach, a sound field creates a speaker-independent representation of sound. The term sound field refers to the capture, reproduction and description of sound waves. A sound field includes the acoustic phenomena encountered by the sound wave from its point of origin to its point of observation, the observer being the soundfield microphone. Virtual sound worlds can be created by combining soundfield recordings for ambiance with monaural recordings and sound synthesis for sounds that need to be spatialized. A virtual sound experience is finally accomplished by tracking the movement of the listener's head and rotating the world accordingly.
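As a brief illustration of the listener-centric, speaker-dependent approach, a monaural source can be placed between two speakers with constant-power amplitude panning. The sketch below uses a standard sine/cosine panning law; it is not a technique described in this text, and the function name and parameter range are illustrative assumptions.

```python
import numpy as np

def constant_power_pan(mono, pan):
    """Pan a mono signal into stereo with a constant-power (sine/cosine) law.

    pan ranges from -1.0 (hard left) to +1.0 (hard right); 0.0 keeps the
    source centred with equal power in both channels.
    """
    theta = (pan + 1.0) * np.pi / 4.0   # map [-1, 1] onto [0, pi/2]
    left = np.cos(theta) * mono
    right = np.sin(theta) * mono
    return np.stack([left, right], axis=-1)

# Place a 440 Hz tone halfway toward the right of a stereo image:
sr = 48000
t = np.arange(sr) / sr
stereo = constant_power_pan(np.sin(2 * np.pi * 440 * t), pan=0.5)
```

The constant-power law keeps the perceived loudness roughly constant as the source moves across the image, which is why it is preferred over simple linear crossfading.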

The sound field approach started with the work of Blumlein, who patented the X/Y and M/S recording techniques in 1931, the latter leading the way to the development of Ambisonics. Mid-Side is a coincident technique, meaning that both microphones are placed as closely as possible to each other, so that the stereo image is created by differences in loudness rather than by time delay. In theory the setup is coincident, but in practice one capsule is placed above the other, which creates a spacing problem, and makes coincidence harder to maintain, as more capsules are added. The M/S signals are produced using an omnidirectional capsule (MID) and a figure-of-eight (SIDE) microphone.

The MID microphone picks up sound from all directions, whereas the SIDE microphone captures the lateral components. By convention, the left components have a positive phase and the right ones a negative phase. When reproducing the signals, and unlike X/Y, whose two outputs can be assigned directly to the two channels, the M/S spatial encoding needs to be "decoded". Since the left components picked up by the SIDE microphone are in phase with the MID components, summing the outputs of the MID and SIDE microphones extracts the left part of the sound field; the right components, being in opposite phase, are eliminated. In the same way, if the difference of the MID and SIDE outputs is computed instead of their sum, the right components are extracted and the left ones eliminated. The left (L) and right (R) signals for stereophonic reproduction are thus derived from the MID and SIDE signals by the following matrix operation:

L = M + S
R = M - S
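In code, this decoding matrix is simply a per-sample sum and difference. The sketch below, written with NumPy-style arrays and illustrative function names, also shows the inverse operation, with a factor of 0.5 (an assumption about gain convention) so that an encode followed by a decode returns the original signals.

```python
import numpy as np

def ms_decode(mid, side):
    """Decode coincident Mid/Side signals to Left/Right, per the matrix above."""
    left = mid + side
    right = mid - side
    return left, right

def ms_encode(left, right):
    """Inverse operation: derive Mid/Side from an existing Left/Right pair.

    The factor of 0.5 keeps the round trip ms_decode(*ms_encode(L, R))
    at unity gain.
    """
    mid = 0.5 * (left + right)
    side = 0.5 * (left - right)
    return mid, side
```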

Encoding and decoding matrices are at the core of Ambisonics, and algorithms and techniques continue to be developed or revisited to achieve greater positional accuracy.
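As one illustration of such an encoding matrix, the sketch below encodes a mono signal into traditional first-order B-format (W, X, Y, Z). It assumes the traditional Furse-Malham convention with a 1/sqrt(2) weight on W; other normalizations (SN3D, N3D) use different weights, and the function name is illustrative rather than taken from any library.

```python
import numpy as np

def encode_first_order_bformat(mono, azimuth, elevation):
    """Encode a mono signal into first-order B-format (W, X, Y, Z).

    Azimuth is measured counter-clockwise from the front, elevation upward
    from the horizontal plane, both in radians. The 1/sqrt(2) weight on W
    follows the traditional (Furse-Malham) convention.
    """
    w = mono / np.sqrt(2.0)
    x = mono * np.cos(azimuth) * np.cos(elevation)
    y = mono * np.sin(azimuth) * np.cos(elevation)
    z = mono * np.sin(elevation)
    return np.stack([w, x, y, z], axis=-1)

# Encode a short noise burst arriving from 45 degrees to the left, on the horizon:
sig = np.random.randn(48000)
bformat = encode_first_order_bformat(sig, np.radians(45), 0.0)
```

A decoder for a particular speaker layout, or a binaural renderer for headphones, then applies a matching decoding matrix to these four channels, which is where the positional accuracy of the overall system is determined.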