One of our early goals with Smash Hit was to combine audiovisual realism with highly abstract landscapes and environments. A lot of effort was put into making realistic shadows and visuals, and our sound designer spent long hours finding the perfect glass breaking sound. However, without proper acoustics to back up the different environments, the sense of presence simply wouldn't be there.
To achieve full control over the audio processing and add environmental effects I needed to do all the mixing myself. Platform-dependent solutions like OpenAL and OpenSL cannot be trusted here, because support for environmental effects is device/firmware specific and missing in most mobile implementations. Even if it were available, it would be virtually impossible to reliably map parameters between OpenSL and OpenAL. As in most cases with multi-platform game development, DIY is the way to go.
Showcasing a few different acoustic environments
Writing a software mixer is quite rewarding – a small, well defined task with a handful of operations performed on a large chunk of data, thus very suitable for SIMD optimization. A software mixer is also one of those few subsystems that has, or can have, a real world analogue - the physical audio mixer. I chose a conventional, physical abstraction, so my interface classes are named Mixer, Channel, Effect, etc, but there might be better ways to structure it.
The biggest hurdle when writing a software mixer turned out to be the actual mixing. Two samples playing at the same time are added together, but what happens if they both play at maximum volume? The intuitive implementation, and what also happens in the real world, is clipping. This is what most real audio programs do. Clipping is a form of distortion, where audio levels beyond the minimum or maximum are simply clamped to the physical threshold, effectively destroying or reshaping the waveform. In audio software, you would typically adjust the levels manually in order to avoid clipping, but in games, where audio is interactive, this can be really tricky. Say for instance you have a click sound for buttons. If there are no other sounds playing you want the click played back at maximum volume, but if there is music in the background the volume level needs to be lowered. If there is an explosion nearby it needs to be adjusted even further to avoid clipping.
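To make the clipping problem concrete, here is a minimal Python sketch of naive additive mixing with hard clipping. It assumes samples are floats normalized to the range -1.0 to 1.0; the function name and the sample values are made up for illustration and are not taken from the Smash Hit code.

```python
def mix_with_clipping(samples):
    """Add simultaneous samples and clamp the sum to [-1.0, 1.0]."""
    total = sum(samples)
    return max(-1.0, min(1.0, total))


# Hypothetical instantaneous sample values for three sounds.
click, music, explosion = 0.9, 0.6, 0.8

print(mix_with_clipping([click]))                    # 0.9, passes through untouched
print(mix_with_clipping([click, music, explosion]))  # 1.0, the sum of 2.3 is clamped
```

Everything above the threshold is flattened to 1.0, which is the waveform damage described above.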
One way to reduce clipping is to transform the output signal in a non-linear fashion, so that it never really reaches the maximum level. This still has the problem that it will affect the result when there is only one sample playing. Hence, when the click is played back in isolation it won't be at maximum volume.
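As an illustration of that drawback, the sketch below uses tanh as the non-linear transfer curve. The choice of tanh is an assumption made for the example; the text above does not name a particular curve.

```python
import math


def soft_clip(x):
    """Squash any input smoothly into the open interval (-1.0, 1.0)."""
    return math.tanh(x)


print(soft_clip(2.3))  # loud mix: stays just below 1.0 instead of clipping
print(soft_clip(1.0))  # lone full-volume click: only roughly 0.76
```

The curve protects the loud mix, but the isolated click also loses about a quarter of its level.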
Some people suggest that the output should be averaged in the case of multiple channels. So if there are three sounds, A, B and C, playing at once, you mix them as (A+B+C)/3. This is not a good way to do it, because the formula doesn't know anything about the content of each channel (B and C can for instance be silent, still resulting in A played back at a third of the volume).
What we need is some form of audio compression - an algorithm that compresses audio dynamically, based on the current levels. Real audio compressors are pretty advanced, with a sliding window to analyze the current audio content and adjust the levels accordingly. Fortunately there is a "magic formula" that sounds good enough in most cases. I found this solution by Viktor T. Toth:
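A tiny sketch of the averaging approach shows the problem. The function name and the sample values are hypothetical.

```python
def mix_by_averaging(samples):
    """Average all channels, regardless of what each one contains."""
    return sum(samples) / len(samples)


# Channel A is playing, channels B and C are silent.
a, b, c = 0.9, 0.0, 0.0
print(mix_by_averaging([a, b, c]))  # about 0.3: A is cut to a third of its level
```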
mix=A+B-A*B
but when adapting it to floating point math I realized that a slight modification is more suitable for dealing with negative numbers:
mix=A+B-abs(A)*B
Each channel is added to the mix separately, one at a time, using the following pseudo code:
mix = 0
for each channel C
    mix = mix + C - abs(mix)*C
This means that if there is only one channel playing, it will pass unmodified through the mixer. The same applies if there are two channels but one is completely silent. If both channels have the maximum value (1.0), the result will be 1.0, and anything in between will be compressed dynamically. It is definitely not the best or most accurate way to do it, but considering how cheap it is, it sounds amazingly good. I use this for all mixing in Smash Hit, and there are typically 10-20 channels playing simultaneously, so it does handle complex scenarios quite well.
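The pseudo code above translates almost directly into Python. This is a sketch for a single sample frame, assuming samples are floats normalized to -1.0 to 1.0; the function names are made up for the example, and a real mixer would run the same arithmetic over whole buffers.

```python
def mix_channels(channel_samples):
    """Mix one sample from each channel using mix + C - abs(mix) * C."""
    mix = 0.0
    for c in channel_samples:
        mix = mix + c - abs(mix) * c
    return mix


def mix_channels_without_abs(channel_samples):
    """The unmodified A + B - A*B form, shown only for comparison."""
    mix = 0.0
    for c in channel_samples:
        mix = mix + c - mix * c
    return mix


print(mix_channels([0.7]))         # 0.7, a single channel passes unmodified
print(mix_channels([0.7, 0.0]))    # 0.7, a silent channel changes nothing
print(mix_channels([1.0, 1.0]))    # 1.0, two full-volume channels stay at 1.0
print(mix_channels([0.5, 0.5]))    # 0.75, in-between levels are compressed
print(mix_channels([-0.5, -0.5]))  # -0.75, negative samples behave symmetrically
print(mix_channels_without_abs([-0.5, -0.5]))  # -1.25, outside the valid range
```

The last line shows why the abs() matters for signed floating point samples: without it, two negative samples push the result below -1.0.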
I use three separate mixers in Smash Hit – the HUD mixer, which is used for all button clicks and menu sounds, the gameplay mixer, which represents all 3D sounds, and the music mixer, which is used for streaming music. The gameplay mixer has a series of audio effects attached to it to emulate the acoustics of different room types.
Given how useful a reverb effect is in game development, it's quite surprising to me how difficult it was to find any implementations or even an explanation online. At first glance, the reverb effect seems much like a long series of small echoes, but when trying it, the result sounds exactly like that – a long series of small echoes, not the warm, rich acoustics of a big church. If one tries to make the echoes shorter, it turns more and more metallic, like being inside a sewage pipe.
There is a great series of blog posts about digital reverberation by Christian Floisand that contains a lot of the theory and also a practical implementation: Digital reverberation and Algorithmic Reverbs: The Moorer Design.
It uses a series of parallel comb filters passed through all-pass filters in series. Each comb filter is basically a short delay line with feedback, representing reflected sounds, while the all-pass filters are used to thicken and diffuse the reflected sounds by altering the phase. I don't know enough signal theory to fully understand the all-pass filter, but it works great and implementation is fairly easy.
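The structure described above can be sketched in a few lines of Python. This is a generic textbook-style arrangement and not the Smash Hit implementation: the class names, the delay lengths, the gain values and the choice to average the comb outputs are all assumptions made for the example.

```python
class CombFilter:
    """Delay line with feedback: y[n] = x[n-D] + g * y[n-D]."""

    def __init__(self, delay_samples, feedback):
        self.buffer = [0.0] * delay_samples
        self.index = 0
        self.feedback = feedback

    def process(self, x):
        delayed = self.buffer[self.index]
        # Feed the delayed output back into the line together with the input.
        self.buffer[self.index] = x + self.feedback * delayed
        self.index = (self.index + 1) % len(self.buffer)
        return delayed


class AllPassFilter:
    """Schroeder all-pass: y[n] = -g * x[n] + x[n-D] + g * y[n-D]."""

    def __init__(self, delay_samples, gain):
        self.buffer = [0.0] * delay_samples
        self.index = 0
        self.gain = gain

    def process(self, x):
        delayed = self.buffer[self.index]
        v = x + self.gain * delayed
        self.buffer[self.index] = v
        self.index = (self.index + 1) % len(self.buffer)
        return delayed - self.gain * v


class SimpleReverb:
    """Parallel comb filters followed by all-pass filters in series."""

    def __init__(self, comb_delays, allpass_delays, feedback=0.8, allpass_gain=0.7):
        self.combs = [CombFilter(d, feedback) for d in comb_delays]
        self.allpasses = [AllPassFilter(d, allpass_gain) for d in allpass_delays]

    def process(self, x):
        # Every comb sees the same input; averaging keeps the level in check.
        y = sum(comb.process(x) for comb in self.combs) / len(self.combs)
        for allpass in self.allpasses:
            y = allpass.process(y)
        return y


# Hypothetical delay lengths in samples, picked only for illustration.
reverb = SimpleReverb(comb_delays=[1557, 1617, 1491, 1422], allpass_delays=[225, 556])
impulse = [1.0] + [0.0] * 9999
tail = [reverb.process(sample) for sample in impulse]
print(sum(1 for sample in tail if sample != 0.0), "non-zero samples in the tail")
```

Feeding a single impulse through the network is a quick way to hear or plot what a given set of parameters does.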
In addition to the comb filters and all-pass filters I also added a couple of tap-delays (delay line without feedback), representing early reflections on hard surfaces, as well as low-pass filters in each comb filter, which gives a great way to control the room characteristics. Christian's article suggests the use of six comb filters, but for performance reasons I cut it down to four. I'm using four tap-delays and two all-pass filters, plus a pre-delay on the entire late reflection network (the chain on the left).
All audio is processed in stereo in Smash Hit, so the reverb needs to be processed separately on the left and right channels. I slightly randomize the loop time in the comb filters differently for the left and right channel, which gives the final mix a very nice stereo spread and a much better sense of presence.
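The two additions can be sketched as follows. The one-pole low-pass in the feedback loop and the multi-tap delay shown here are common textbook forms chosen for the example; the class names, the damping parameter and all numeric values are assumptions, not the actual Smash Hit code.

```python
class LowPassCombFilter:
    """Comb filter with a one-pole low-pass inside the feedback loop."""

    def __init__(self, delay_samples, feedback, damping):
        self.buffer = [0.0] * delay_samples
        self.index = 0
        self.feedback = feedback
        self.damping = damping  # 0.0 disables the filter, higher values sound darker
        self.filtered = 0.0

    def process(self, x):
        delayed = self.buffer[self.index]
        # Smooth the signal before feeding it back, so each pass loses treble.
        self.filtered = delayed * (1.0 - self.damping) + self.filtered * self.damping
        self.buffer[self.index] = x + self.feedback * self.filtered
        self.index = (self.index + 1) % len(self.buffer)
        return delayed


class TapDelay:
    """Delay line without feedback, read at several (delay, gain) taps."""

    def __init__(self, taps):
        self.taps = taps
        self.buffer = [0.0] * (max(delay for delay, _ in taps) + 1)
        self.index = 0

    def process(self, x):
        self.buffer[self.index] = x
        size = len(self.buffer)
        out = sum(gain * self.buffer[(self.index - delay) % size]
                  for delay, gain in self.taps)
        self.index = (self.index + 1) % size
        return out


# Hypothetical early reflections: delays in samples and their gains.
early = TapDelay([(2, 0.5), (4, 0.25)])
print([early.process(s) for s in [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
# [0.0, 0.0, 0.5, 0.0, 0.25, 0.0]
```

A pre-delay fits the same shape: it is a TapDelay with a single tap and a gain of 1.0, placed in front of the comb filters.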
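One possible way to get different loop times per side is to jitter a shared set of base delays once at setup. The sketch below is a guess at how this could look; the function name, the 5% spread, the seed and the base delay values are all hypothetical.

```python
import random


def stereo_comb_delays(base_delays, spread=0.05, seed=1):
    """Return (left, right) delay lists, each jittered within +/- spread."""
    rng = random.Random(seed)

    def jitter(delay):
        return max(1, round(delay * (1.0 + rng.uniform(-spread, spread))))

    left = [jitter(delay) for delay in base_delays]
    right = [jitter(delay) for delay in base_delays]
    return left, right


# Hypothetical base loop times in samples.
left, right = stereo_comb_delays([1557, 1617, 1491, 1422])
print(left)
print(right)
```

Each side then gets its own set of comb filters built from its own list, so the two reverb tails decorrelate while keeping the same overall character.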
In addition to reverb I also implemented a regular echo as well as a low-pass filter. The parameters of these three filters are used to give each room its unique acoustics.
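For completeness, here is a minimal sketch of the two simpler effects: a one-pole low-pass filter and a feedback echo. Both are standard textbook forms picked for the example; the class names, the parameters and the 44100 Hz default sample rate are assumptions.

```python
import math


class LowPassFilter:
    """One-pole low-pass: the output moves a fraction alpha towards each input."""

    def __init__(self, cutoff_hz, sample_rate=44100):
        self.alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
        self.state = 0.0

    def process(self, x):
        self.state += self.alpha * (x - self.state)
        return self.state


class Echo:
    """Dry signal plus a delayed copy that is fed back into the delay line."""

    def __init__(self, delay_samples, feedback, wet):
        self.buffer = [0.0] * delay_samples
        self.index = 0
        self.feedback = feedback
        self.wet = wet

    def process(self, x):
        delayed = self.buffer[self.index]
        self.buffer[self.index] = x + self.feedback * delayed
        self.index = (self.index + 1) % len(self.buffer)
        return x + self.wet * delayed


echo = Echo(delay_samples=2, feedback=0.5, wet=0.5)
print([echo.process(s) for s in [1.0, 0.0, 0.0, 0.0, 0.0]])
# [1.0, 0.0, 0.5, 0.0, 0.25]
```

A room preset would then just be a set of values for the reverb, the echo and the low-pass cutoff.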