The Audio Engineering Society conference in Manhattan last week had a surprisingly robust set of technical sessions on game audio. A session led by Chris Jahnkow, senior sound designer at Sony Computer Entertainment America, and Scott Selfon, senior audio specialist at Microsoft, provided some key takeaways that even game makers outside of audio design would find interesting.
Changing the Controller Can Change the (Audio) Game
With the growth of full player motion as game control, the masters of game audio have been thrown a new challenge. Game audio used to be designed much the same way as film and video audio: for a player or players sitting still in the sweet spot of whatever audio system they had at home.
"When a player's only input mechanism was their thumbs you didn't need to worry about the position of the thumbs," said Selfon.
With the player becoming a moving target, in both a literal and a figurative sense, game audio faces a whole new set of opportunities and challenges.
Player as Recipient of Audio
How dynamic should the output of the audio be for a moving player? According to the panelists, most games are still using the traditional perspective. True, some surround sound is used for spatializing audio when something passes the player or happens behind them, but that's still most often dealt with from the perspective of a seated, stationary player.
A moving player introduces questions of how much the sound should pan, or shift focus, depending on a player's position in the gameplay space. "You want to avoid giving the player audio whiplash," said Selfon.
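One way to avoid that "audio whiplash" is to compute a target pan from the player's position but ease the mix toward it rather than snapping. The sketch below is a minimal illustration of that idea; the function names, the two-meter screen half-width, and the smoothing time constant are all assumptions for the example, not anything the panelists specified.

```python
import math

def target_pan(player_x, half_width=2.0):
    """Map the player's lateral offset (meters from center) to a pan
    value in [-1.0, 1.0]. The half_width scale is illustrative."""
    return max(-1.0, min(1.0, player_x / half_width))

def smooth_pan(current, target, dt, time_constant=0.25):
    """Exponentially ease the current pan toward the target so a sudden
    sidestep glides the mix over instead of snapping it."""
    alpha = 1.0 - math.exp(-dt / time_constant)
    return current + alpha * (target - current)
```

Called once per frame with the frame's `dt`, a larger `time_constant` gives gentler pan moves at the cost of responsiveness; tuning that trade-off is exactly the subtlety the panelists were describing.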
"We'll go through a phase of trying every bell and whistle of motion-controlled audio just because we can," said Jahnkow, referring to the industry as a whole. "You really want to look at what [that kind of manipulation of the sound] lends to the game. ... Audio is a very subtle and subjective thing and we do our jobs best when it is least noticed."
Sound for Gameplay Enhancement
"What is the sound of one hand waving?" asked Selfon.
For example, take a Star Wars lightsaber cutting through the air. The sound it makes should change depending on the direction of the slash or thrust, which hand it's in, and so on. Likewise, samples of real-world sounds have to be recorded and designed with the understanding that they are informing the player as to whether or not they are making the moves correctly.
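A motion-driven sound like that might be selected and shaped from the swing parameters. The sketch below shows one plausible mapping; the sample names, speed thresholds, and pitch/gain formulas are all invented for illustration, not taken from any shipped game.

```python
def saber_swing_cue(direction, speed):
    """Pick a whoosh variant from swing direction ('slash' or 'thrust')
    and speed in m/s, and scale pitch/gain with speed so the sound
    itself tells the player how committed the move was.
    All names and constants here are hypothetical."""
    base = "saber_thrust" if direction == "thrust" else "saber_slash"
    layer = "heavy" if speed > 4.0 else "light"
    pitch = min(1.5, 0.8 + speed * 0.1)   # faster swing -> higher pitch
    gain = min(1.0, speed / 6.0)          # faster swing -> louder
    return {"sample": f"{base}_{layer}", "pitch": pitch, "gain": gain}
```

Because the output varies continuously with speed, a half-hearted wave audibly differs from a full slash, which is the feedback role the panel described.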
Jahnkow pointed out that motion needs to be taken into account as early as the Foley recording stage. "You need to cover the potential dynamic range of sound effects to match the motion control." He then provided some case examples from his work on an upcoming Sony title, Carnival Island.
He had to create game physics sound hooks, controlled by XML, to produce different cues for different object impacts (for example, the different ways the bowling ball sounds when it hits the pins, depending on the "success" and power of the throw). While this has been done in traditional in-game audio for a long time, the input to the system was generally much more straightforward when the controls were limited to a stick and buttons on a gamepad.
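The shape of such a data-driven hook table can be sketched as follows. This is not Sony's actual format: the XML structure, thresholds, and sample names are made up to show how a physics impact speed could select a cue from externally authored data.

```python
import xml.etree.ElementTree as ET

# Hypothetical hook file; a real project would load this from disk and
# use its own schema, thresholds, and sample names.
HOOKS_XML = """
<impacts object="bowling_ball" against="pin">
  <cue min_speed="0.0" sample="pin_graze"/>
  <cue min_speed="2.0" sample="pin_solid"/>
  <cue min_speed="6.0" sample="pin_smash"/>
</impacts>
"""

def load_cues(xml_text):
    """Parse the hook table into (threshold, sample) pairs, sorted by
    ascending threshold so selection below is a simple scan."""
    root = ET.fromstring(xml_text)
    return sorted(
        ((float(c.get("min_speed")), c.get("sample")) for c in root),
        key=lambda row: row[0],
    )

def pick_cue(cues, impact_speed):
    """Return the cue with the highest threshold the impact clears."""
    chosen = cues[0][1]
    for min_speed, sample in cues:
        if impact_speed >= min_speed:
            chosen = sample
    return chosen
```

Keeping the table outside the code is what lets a sound designer retune how a "successful" throw sounds without touching engine logic.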
"When you use motion controllers it's a lot more 'free form' than input from a standard controller, and your audio representations have to be more clear, more progressive and less tied to visuals generated by machine." said Jahnkow.
"[Twisted Pixel's] The Gunstringer
did a great job with the audio being part of the game, and the way in which my moves then move my avatar and triggers Foley that's variable based on the quality of my motion," he said.
Another example is the Gears of War "active reload." If you time it right you can reload faster, so there's a visual display but also an audio cue. Experienced players don't look at the screen when going for an active reload; they listen for the audio cue, and when they've done it just right, the player character rewards them by saying "Nice."
Voice recognition provides game and interaction designers with a potentially wider palette of inputs for in-game navigation and command and control. Not only does voice do a better job of keeping the player in the virtual experience, it can also modify, be combined with, and/or replace gestures or controller interactions.
Some of the most interesting possibilities are combos. "Imagine a tennis game where a grunt with a swing equals extra power," said Selfon. "Most of the pros seem to think it does, anyway."
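Mechanically, a combo like that reduces to checking whether a detected vocalization falls inside a short window around the gesture. The sketch below is a minimal version of that fusion; the window size and bonus value are invented for the example.

```python
def swing_power(swing_time, grunt_times, window=0.15, base=1.0, bonus=0.3):
    """Return the swing's power multiplier. If a detected grunt lands
    within `window` seconds of the racket swing, add a bonus.
    All constants are illustrative, not from any real game."""
    if any(abs(t - swing_time) <= window for t in grunt_times):
        return base + bonus
    return base
```

The window has to be generous enough to absorb the speech-detection latency discussed below, or players who grunt on time would still miss the bonus.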
Kinect's "Far Talk" microphone array, in combination with the Xbox 360, handles speech two different ways. Selfon played two several examples of how it works under different conditions.
The Kinect/360 combination already starts with a couple of advantages in being able to pull speech out of background sound. First, the Kinect's multichannel echo cancellation system is calibrated by the user during initial setup. The system sends chirps to the speakers it's connected to and listens back, helping the system cancel out game sound.
Another bonus the system has, whether it's playing a game or playing back a movie, is that it already knows what sound it is putting out into the player's environment to begin with. This helps it when it's isolating speech from background noise. Additionally, the Kinect has two different pipelines, including one for chat that cleans up input but allows for some game audio bleed through.
That processing is done entirely on the Kinect device. The second pipeline, for speech recognition, works harder to clean up bleed through. That requires a combination of Kinect and Xbox 360 processing and, as a result, can impact the cycles used by the game itself.
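The core trick, knowing the outgoing signal and subtracting its echo from the microphone input, can be demonstrated with a toy adaptive filter. This is not Kinect's actual algorithm, just a standard least-mean-squares (LMS) echo canceller on synthetic data, with the echo path, filter length, and step size all chosen arbitrarily for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000
game_audio = rng.standard_normal(n)            # reference: what the console plays out
room = np.array([0.6, 0.3, 0.1])               # toy speaker-to-mic echo path
echo = np.convolve(game_audio, room)[:n]
speech = np.zeros(n)
speech[3000:3200] = rng.standard_normal(200)   # late burst of player speech
mic = echo + speech                            # what the microphone hears

# LMS adaptive filter: learn the echo path from the known reference,
# subtract the estimated echo, and pass on the residual (mostly speech).
taps, mu = 8, 0.01
w = np.zeros(taps)                             # filter weights (echo-path estimate)
buf = np.zeros(taps)                           # recent reference samples
residual = np.zeros(n)
for i in range(n):
    buf = np.roll(buf, 1)
    buf[0] = game_audio[i]
    e = mic[i] - w @ buf                       # mic minus estimated echo
    w += mu * e * buf                          # LMS weight update
    residual[i] = e
```

After the filter converges, the residual in speech-free stretches is close to silence while the speech burst passes through, which is why knowing its own output gives the console such a head start on isolating the player's voice.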
To demonstrate the capabilities of the technology, Selfon played back a sample of raw gameplay sound with a player talking over the general audio explosions and chaos you expect in a military-style video game. What the voice was saying was discernible, but it took effort to make out. The same recording, played back after being processed for speech recognition, sounded like someone calling from a speakerphone: there was a little ringing, but it was clear enough for phoneme detection.
Things to Keep in Mind for Speech Input to Games
Latency-tolerant design:
Speech input can be slow: it takes time for sound to reach the mic(s), get processed, and so on. The delay can range from tens to hundreds of milliseconds, compared with the nearly instant combination of muscle twitch and hard-wired controllers. Even a human listening to another human takes a few hundred milliseconds to process speech.
Sometimes the better part of valor is to just detect incoming sound, assume the player is uttering the right command, and then worry about processing and parsing. Selfon used the example of a rocket coming toward the player, where the player is supposed to yell "duck" to trigger the avatar's move. Designers might just use any vocal input to initiate the move, instead of waiting to confirm "duck" was the right word.
There are all kinds of opportunities for new gameplay mechanics and combinations. Which choices will make audio fun and challenging instead of just being gratuitous? Magic spell memorization? Speed of repetition of phrases? Mimicking the words of an NPC?
There needs to be a strategy for avoiding competing dialogue between player and NPC. Will you use conversational cues, handle the dialogue insertion by mixing in real time, or leave space in the mix? Would you combine the system with gestures, such as having the player wave or raise their hand, to give them more natural ways to interact with the game?
Voice Without Motion:
Voice input raises challenges with or without motion in the mix. Children's voices and use of language differ from adults'. Accents and localization also need to be taken into consideration. All of these vocal variations become even more challenging when the player isn't staying in one place.