This article is based on academic research recently published in the proceedings of CHI Play, the 2015 Annual Symposium on Computer-Human Interaction in Play. It has been re-written for a Gamasutra audience.
Advances in voice recognition technology have seen a proliferation of voice-based interfaces, embedded in smartphones, smart watches, smart homes – and game platforms. While the use of player-to-player voice in games is widespread and well-researched, the use of voice as an input in games is relatively unexplored. By this, we mean the use of the player’s voice as a controller, typically alongside other input modalities.
Voice interaction in games is curiously polarizing. Players routinely express frustration with its accuracy, its social awkwardness, its potential for griefing, and its inefficiency. However, some instances of voice interaction are well received and credited with increasing flow and immersion, and voice-based interfaces are increasingly emphasized in videogame marketing.
In our research, we have set out to understand the dimensions of these failures and successes, aiming to look deeper at player experiences that go beyond an examination of speech recognition accuracy. We found that understanding the player's identity in a game can go a long way in improving the reception of voice interaction.
AN EARLY HISTORY OF VOICE INTERACTION IN GAMES
The history of voice interaction in video games has been significantly influenced by the console market. While the earliest examples of voice interaction in games were enabled on PCs (such as Command: Aces of the Deep, a submarine simulator that allowed verbal commands), the presence of enabling hardware on console platforms has often shaped the development of voice interaction in games.
Hey You, Pikachu! was the only game released in the United States to use the Nintendo 64 system’s voice recognition unit (VRU). This game allowed players to interact with Pikachu through voice, giving simple commands and responses such as identifying something Pikachu had picked up (e.g., an apple) or telling Pikachu to use an ability to open a box. The implementation of voice in Hey You, Pikachu! is strikingly similar to modern examples, featuring an icon in the user interface that reflects when the user’s voice is being registered.
The only other game developed for the Nintendo 64’s voice recognition peripheral was the Japanese-only Densha de Go! 64, a train driving simulation game where the microphone was used to announce train stations to passengers.
Sega followed suit in 1999 with Seaman on Dreamcast, one of the few games to use the system’s microphone attachment. Seaman is a pet simulation game narrated by Leonard Nimoy where the player uses the microphone to converse with, and guide, a humanoid fish. Seaman was the third bestselling Dreamcast game in Japan and retains a cult following. A similar game, N.U.D.E.@ Natural Ultimate Digital Experiment, featuring a female android, was released on Xbox, only in Japan.
Sixth-generation game consoles in the early 2000s introduced online play, for which voice communication between players became important. Games in the tactical combat genre required coordination between teammates, and so in some cases even came packaged with headset microphones. As a consequence, several games in this genre, including SOCOM: U.S. Navy SEALs [2002, and sequels] and Tom Clancy’s Rainbow Six 3, were among the first to enable players to give orders to AI teammates using voice commands.
A small number of games have experimented with voice-only interaction, such as Tom Clancy’s EndWar, “mic mode” in Mario Party 6, and Lifeline, a role-playing adventure game which the player controls almost entirely through speaking commands to characters. Among the most successful voice interaction games are karaoke series such as SingStar [2004, and sequels] and Rock Band [2007, and sequels], which typically came packaged with dedicated stage-style microphones.
Perhaps reflecting an increased faith in the quality of voice interaction software, in 2014 the voice-only tactical combat game There Came an Echo successfully raised $115,569 of funding from 3,906 backers on Kickstarter, and around the same time some level of voice interaction started to become typical in AAA titles.
To understand the nature of player experiences with voice interaction in games, we examined online discussions, reviews and “Let’s Play” videos around games with voice interaction. We collected 166 professional reviews, 2,951 amateur reviews, 84 discussion threads and 69 Let’s Play videos, and analysed each reference to or use of voice commands. This gave us a way to gain insight into player experiences with minimal interference, and to explore the breadth of issues and successes associated with these new interfaces in commercially available games.
The following four games represent the dominant ways in which voice interaction is being used as a multi-modal interface in games, and exemplify our argument around identity dissonance.
TOMB RAIDER: DEFINITIVE EDITION
Tomb Raider: Definitive Edition uses voice recognition to permit simple voice commands. These allow the player to bring up menu items (e.g. by saying “show map”), switch between weapons (e.g. “pistol” or “bow”), and pause/resume the game. In our review, we found that users raised two kinds of issues around the voice interaction: those related to performance, and those related to discomfort.
Performance issues included reports that the speech recognition was not reliable, and more generally, complaints that it was “faster just pressing a button”. We frequently saw that speed, and subsequently improved performance, were regarded as metrics by which to evaluate the voice interface, due to its effect on the player’s sense of physical mastery. In some cases, players acknowledged that the voice configuration did improve the flow of play (e.g. enabling them to change a weapon instantly while engaged in combat).
Issues of discomfort with the voice interface were raised in both online discussions and reviews. Players frequently noted that repeatedly yelling “shotgun” at their television was “uncomfortable” and “embarrassing”, and that it restricted the use of the interface to when other people were not present to be bothered by the noise. Similarly, we noted multiple accounts of the “pause” command being used by non-players to grief or control players, in a way that could presumably be used by a frustrated parent to end play:
[my] wife hates when I game with her [at] home or awake and thinks it’s fun to use voice commands to turn it off and so do my kids
SPLINTER CELL: BLACKLIST
In Splinter Cell: Blacklist, the protagonist (Sam Fisher) must navigate through areas patrolled by hostile enemy guards, using stealth rather than brute force. The player can use actions such as throwing a rock to make the enemy investigate the noise, so that Sam Fisher may ambush them from behind or sneak past undetected.
In the Xbox One version of the game, the user can yell “Hey you!” to the Kinect sensor, and Sam Fisher accordingly calls out “Hey you!” in the game, making a virtual sound which the enemy guards will investigate (see Figure 2).
Figure 2: Splinter Cell: Blacklist (2014) allows players to distract in-game enemies by shouting “Hey you!”
This feature was very well received, with one reviewer noting “the ability to relate directly with fictional characters is an [sic] powerful idea” and online discussions lamenting that there were so few commands that worked with the interface. In comparison to other examples of voice interaction, players liked that they were doing what their character would actually do, rather than something “unnatural” that they would not normally say out loud (such as yelling “shotgun” in Tomb Raider).
Issues with discomfort were not entirely absent, however, as there remained a disjuncture between actions appropriate in the game world and actions appropriate in the real-world context. One player mentioned that they had been “caught” by their partner yelling various words at their television with no feedback, in an attempt to test the game for other voice commands. We also noted that in Let’s Play videos, users attempted to engage the voice recognition function with “hey buddy!” and “come here!” until the correct “hey you!” registered.
FIFA 14
FIFA 14 implemented voice commands during offline matches. Players can select substitutions (by saying “substitution” followed by the substitute’s name), change team formations (e.g. “formation two”), use custom tactics (e.g. “offside trap”), and change the mentality of the players (e.g. “ultra attacking”).
It is possible to do all of these things using the controller, but voice allows them without pausing play, and so avoids interrupting the experience. The implementation of voice in FIFA is one of the most commended implementations in a contemporary game. We noted a large number of positive comments about the “well-conceived”, “effective” and “useful” voice interactions.
Players praised the voice interaction options both for improving their ability to perform in the game (as doing the same tasks with the controller required “pausing the game or pressing difficult button combinations on your D-Pad and los[ing] focus on the ball”), and for avoiding a sense of discomfort; a common sentiment was that the commands “don’t feel artificial or put on” and were “natural”.
RYSE: SON OF ROME
Ryse: Son of Rome is a third-person combat game for Xbox One in which the user plays a Roman centurion, occasionally commanding other troops in battle. Ryse features voice commands such as “fire volley” and “charge” that are relevant to events in the game’s linear story, and the opportunity to use them is triggered by in-game events.
Overwhelmingly, players spoke positively about the voice commands, with the feature commonly being referred to as “immersive”, and negative comments limited to the infrequent opportunities to use them. In the context of the game’s ancient Roman setting, we identified numerous instances where players noted the appeal of play-acting the character, as they “Put on the roman soldier epic voice for it and everything”, reflecting the sense of a virtually embodied “real” voice we noted in the example of Splinter Cell.
PLAYER IDENTITY DISSONANCE AND VOICE INTERACTION
These four games illustrate how multi-modal voice interaction is being integrated into contemporary console games. While issues remain around the accuracy of voice recognition technology and whether the implementation improves the player’s in-game performance, we argue that the key issue with regard to voice interaction in games is best understood through the lens of identity dissonance.
Based on earlier research into EVE Online, we distinguish between four types of identities present in a game play situation: the user (the “real” human who plays); the player (a social identity); the character (an identity within a game’s imaginary); and the avatar (the character’s virtual depiction). This framework does not suggest that players necessarily identify with their characters, but instead establishes them as separate identity constructs which may overlap and inform each other in a game-play situation.
Through this lens, it becomes clear that in the example of Splinter Cell, voice was well received because of a voice-based resonance between the user’s player identity and the character identity of Sam Fisher; the user saying “hey you” in the real world meant that their character said “hey you” in the virtual world, with the expected effect.
Virtually embodying the player’s real voice increases the perception of overlap between the player and character identities. Players’ comments indicated that this convergence of identities could be contributing to an increase in their sense of flow and immersion.
In contrast, voice interaction in Tomb Raider afforded no such convergence, as the in-game character did not (and would not) yell “shotgun” or “reload” in the middle of combat. We argue that this configuration of voice interaction, cited by numerous players and reviewers as “unnatural” and “uncomfortable”, causes a dissonance between the player and character identities that can diminish the player's sense of flow and immersion in the game. While in some cases changing weapons by voice command was faster (presumably improving the flow of the experience), the identity dissonance appears to negate this positive effect, as the balance of the commentary was negative towards this feature.
Approaching voice interaction in games through this lens reveals something interesting about the character identity in sports simulation games like FIFA. One interpretation of the player’s role in FIFA is as the manager or coach, particularly in career mode. The voice commands as implemented in FIFA accord with this personification; mentalities like “defensive” and tactics like “offside trap” are commands that a coach or manager might yell out from the sidelines to their players, and several commenters felt these were things they would already yell at their TV during intense moments of play. Reflecting and playing with this idea, the player’s character can receive a letter from the board of directors in FIFA chastising them for swearing too much where the microphone could hear them.
As noted earlier, we also identified how many players would mimic the voice and (British) accent of the protagonist in Ryse: Son of Rome when giving voice commands to other troops. Rather than simply enunciating “fire volley” in a calm and reliably recognizable tone, many players would shout the command as if the urgency in their own voice would be conveyed to the virtual archers. These emergent practices reflect players’ desire for resonance between their own vocal identity and that of the character, and their incorporation into game design could further improve player experience around voice interaction.
This raises a risk of identity dissonance when it is difficult for the player to make their voice mimic their character’s voice, which affects player experience. As the majority of games employ only male protagonists, this may mean that female players will have a different, less immersive or more uncomfortable experience using voice interaction in games.
This critique is particularly interesting in the context of (infamous) comments by Tomb Raider executive producer Ron Rosenberg, who suggested that “when people play Lara, they don't really project themselves into the character... they're more like, ‘I want to protect her’”. This prejudicial, male-oriented and dissociated conceptualization of the player-character relationship (the player is “kind of like her helper”, according to Rosenberg) was not mentioned in the context of the game’s voice interaction, and yet it is reflected in the configuration of voice interaction in the game’s design, where player-voice is configured as a command to the character rather than a convergence between the two identities present in the play situation.
These examples demonstrate how voice interaction in games with persistent identities must take into account the game’s imaginary, and the identity of the player in that imaginary. Where voice interaction is not related to the virtually embodied experience, it causes dissonance between the user and their character, thereby negatively affecting the way the game is experienced and reviewed.
User commentary indicates that embodying a player’s voice through their in-world character provides an opportunity to increase immersion and flow, and appears to circumvent the widespread criticisms of voice interaction as “unnatural”, “forced” or “embarrassing”.
Based on this exploratory study, we have proposed a theoretical understanding that can guide the design of, and future research into, voice interaction in digital games. We intend to test this theoretical understanding further through experimental game design and online data analysis. We are interested in identifying other experiences players have with voice interaction to guide this research.
Considering the wide range of contemporary commercial games that utilize voice interaction in some form, and the lack of research into the usability and design of voice interaction in games, such work seems immediately necessary.
As voice and other natural user interfaces (such as gesture and eye tracking) are increasingly being integrated with modern games, understanding how player identity and embodiment influence the experience of novel interfaces may be important for their success in game design and beyond. Questions around voice and embodiment in games with more complex player identities (e.g. in Banished, are you the mayor? The collective consciousness of the town? God?) also need to be further explored.
Marcus Carter is a Research Fellow in the Microsoft Research Centre for Social Natural User Interfaces, at The University of Melbourne. His PhD focused on treacherous play in EVE Online, such as scamming and espionage. He has also researched DayZ, Warhammer 40,000 and Candy Crush Saga. He is an editor of the forthcoming academic collection on EVE Online, Internet Spaceships are Serious Business: An EVE Online Reader. See his personal website, www.marcuscarter.com.
Fraser Allison is a PhD candidate at the Microsoft Research Centre for Social Natural User Interfaces at The University of Melbourne, researching voice interaction with virtual characters. His other research has looked at immersion and communicating subjective experiences in digital games. Prior to academia he was the technology manager at a consulting firm.