
Speech Recognition and Storytelling in Plan Be

In this article I talk a bit about Speech Recognition, about voice and games, and then we dive into the storytelling + voice = experience part of my game.

Valentina Chrysostomou, Blogger

December 7, 2015

25 Min Read








Plan Be, a game I created that will be out in January for free, is a voice controlled game. http://gamejolt.com/games/plan-be/86295

It’s one of the few out there and, if I’m not mistaken, the only one that focuses on storytelling via speech input.

In this article I’ll outline some of the design decisions taken that gave the game its voice (see what I did there?).

Firstly I’ll talk a bit about Speech Recognition, then about voice and games, and then we’ll dive into the storytelling + voice = experience part of my game.


If you are looking for technical speech recognition information you will not find any in this article. Feel free to contact me with any questions.


Speech Recognition

Voice or speech recognition is the ability of a machine or program to receive and interpret dictation, or to understand and carry out spoken commands. The way it does that is very complex, so to avoid the trouble of creating that from scratch (which, I will be honest, is not in my skill set) I have used the Windows Speech Recognition software.

That is a way of me telling you that you won’t be able to play my game if you don’t have that installed. But, it is already pre-installed on almost all PCs that have Windows so you'll be fine!

In my game I use this software to recognize the player’s voice and phrases they say. Of course, the player cannot utter any phrase they want and expect the game to do something. There are various phrases and words that have been written in a file that the game will understand.
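
To make the idea of a fixed phrase list concrete, here is a minimal Python sketch. It is not the game's code and it does not call Windows Speech Recognition; it assumes the recognizer has already handed back a text string. The phrase list and the names `normalize` and `match_phrase` are made up for illustration.

```python
import string
from typing import Optional

# Hypothetical phrase list; in the game the valid phrases live in a file.
VALID_PHRASES = ["keep going", "stop running", "move back", "wait"]


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation and collapse whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(cleaned.split())


# Look-up table from the normalized form back to the phrase as written.
_PHRASE_INDEX = {normalize(p): p for p in VALID_PHRASES}


def match_phrase(recognized_text: str) -> Optional[str]:
    """Return the valid phrase that was said, or None if it is not listed."""
    return _PHRASE_INDEX.get(normalize(recognized_text))
```

Anything that is not on the list, including a phrase with a word left out, simply comes back as `None`.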

If you are thinking that all the above create restrictions, you are right.

  • You are expected to talk in English - Windows Speech Recognition will use the language installed on your PC so if it’s not in English it won’t work.

  • You are expected to talk clearly – you can whisper or shout but you must do so clearly.

  • You are expected to say the exact phrase the game asks you – if you leave words out it won’t work.

  • You are expected to calibrate it to your own voice for better recognition.

But those are ok. Really. English is not my native language but it recognizes me just fine. It is very responsive, something that at first I was not expecting. It can understand a variety of words and phrases that exist in a huge database. Trust me, I have tried to narrate a paragraph from Romeo and Juliet and it understood everything. It really is a great experience, but it does have flaws that might make or break the game for some.

Heavy accents, unclear speech, low-quality microphones and noise in the room are not problems to be dismissed. I have tried different microphones, and my old one wasn’t catching the phrases as well as my newer one. It’s not a nice experience. It can get frustrating repeating the same words or trying to speak with an accent only for the game to not understand you.

I accept those limitations because this is speech recognition after all. It’s not perfect and we cannot expect it to be. That’s why I can understand if for some it might break the game. 

The fact that you have to repeat phrases exactly as the game demands was one of my concerns from a design standpoint. Not because you have to repeat them exactly as they are, but because you have no choice in what you say. This might be counter-intuitive for some players; it might feel uncomfortable or pressuring. It would be amazing if we could say whatever we wanted and the game understood it and then did something, but I do not believe that this is possible right now. Speech recognition is very specific and sensitive, and not only that: when you have a defined experience, you can’t leave what the player says to chance.

Imagine a character asking you something and you have to respond back. What do you say? How do you say it? Does what you say matter to the gameplay and does it change the story?

Those things sound vague and difficult to do. In my game I have added something that attempts this, called the hidden dialogue. I’ll mention it later in a bit more detail.

So with all the above in mind, I had a clear picture of how I could use this software to create a game in which you can talk. I knew the restrictions, I knew what I could use, the responsiveness, the flexibility. Now it was time to build the game. Easy right?


Voice in Games

When we say “using your voice in games” what comes to your mind? Do you think about various commands, like “run, shoot, go right, jump”? Do you think of existing games and try to think what it would be like to use your voice to make the character shoot or walk? Do you think of other new innovative ideas where you could use your voice everywhere in your game?

That’s all cool! I thought of that too, but I used little to none of it, and I’ll explain why.

As we have seen above, voice recognition is pretty responsive but it still has flaws. Imagine playing a game where you talk to your PC but it responds slowly, might not understand the speech, and might even process the speech as something completely different from what the player intended.

Let’s play out an example and see how it goes. I’ll take an existing game and give it some voice commands. Game: Call of Duty. Voice commands: “run, shoot, jump, pause game”.

So we start the game and we play and suddenly a wild enemy appears!

“Shoot!” I exclaim.

Oh but where do I shoot? Left? Right? Up, down? Screen width divided by 2, minus 30 pixels to the right? Where do I shoot? Ok fine, let’s say the targets appear only on the left, right, up and down side of the screen.

“Shoot right!” I exclaim.

The game takes a second to register that but at the same time the enemy saw me and starts shooting back.

I yell “Run!” But where do I run? Ok we’ve been through that before, so let’s just add left, right, up and down to this one too.

“Run left!” I yell, and my character runs left behind cover and the enemy advances. Phew, I probably lost some health getting shot already but I still got some time before he gets me.

“Shoot right!” I repeat and nothing happens. It didn’t register. I say it again but by the time I do, the enemy comes and continues to shoot me.

“Pause game” I say and it takes me 1 second to say it, 1 second until it registers and now I’m dead.

Had so much fun playing! Not only is an FPS a fast-paced game, it also needs absolute precision. FPS wouldn’t be the genre for voice controls, simply because pressing a button is a better input than speech when you are trying to shoot a gun. At least right now, with our technology.

Ok, so “shoot, run and 360 no scope” may not be the best things to voice to get an immediate response.

What about using the interface with voice, like I tried to do with “Pause game”? Sounds very cool! You can do it in your game, but I didn’t. Why? Because the button input in this case is still better than the speech input. If I need to visit the bathroom, I need to pause the game RIGHT NOW. Innovation sounds very nice, but I’d rather have user satisfaction as a priority first. You can, of course, use both voice and buttons, which seems like a better decision, but I decided to implement voice only where I deemed it necessary and where it felt like the actual context supports it (something I'll talk about later).

So here’s my rule:


If any action in your game can be done better with any input but speech input, you’re using speech recognition wrong.


That doesn’t mean you can’t do it or you shouldn’t. This is just my opinion and what it meant for me to create a game that utilizes voice recognition.

So then, what kind of game can we create that takes all this into account and how can we create it? Maybe we should stop thinking too hard and start thinking simply. What do we do with voice in real life? I’ll give you 3 chances to find at least one good answer.

My answer to that was “we speak”, like in conversations.

Now, of course there are complications with that too but all of the complications with this are the same with any voice controlled game.

  • What if the player is shy?

  • What if the player doesn’t want to speak?

  • What if they don’t know the language or they do but they have different accents?

Those are all valid, but they are still the same problems you’d encounter with any voice controlled game. Can I avoid them? I guess I can avoid them about as much as I could when creating a Kinect game, which requires motion and gestures, for players who can’t be bothered to move.

So, keeping all of those things in mind, I decided to follow the idea of having conversations and to find out whether, by speaking to a character, we can create a new type of storytelling in games: whether playing a game in which you have to talk to a character creates a bond with them or immerses you more in the world, whether it’s fun to do, whether it’s engaging and interesting… or none of that.

As we all know though, storytelling in games isn’t only the cutscene dialogue…


Storytelling + Voice + Everything Else

Ok, so let’s define video game storytelling before we move on, at least the way I see it. Video game storytelling isn’t just the writing and dialogue. Games are not a passive entertainment but an active one in which you participate. Because they are interactive, the gameplay and the mechanics of the game are just as important to storytelling.

Storytelling in games is the dialogue, the background chat of the crowd, the way the character acts or speaks, it’s the world and level design, the art style, the mechanics, the objectives, the lighting, the UI, the music, the animations, the clothing design, the pacing- ok it’s a LOT of things!

So creating a “story-driven” game is not as simple as writing the script and being done with it. The whole development team has to be in on it and think as storytellers. In my case, I had to think of all these factors and aspects to create the experience. I’ll mostly talk about the voice though.

That’s why I’d rather use the word storytelling and not writing because writing is too specific and points to the script of the story. I actually started creating the main mechanics of the game BEFORE having any clue what I would write for the story. I only had my pitch idea which was this:


Location: CDC Underground Facility. Sitting behind a computer desk, locked into the security room, you try to guide a scientist to safety as an accident caused most of the personnel to have unstable and violent conditions.


Now, even if I started creating the game before the story, I made sure I had this pitch to guide me simply because this is the core concept of the game and as we said storytelling isn’t only the script.

A very important aspect I had to take into account before creating anything was “what kind of game can I create?” I had to think of a concept that would actually justify the voice use. We mentioned above Call of Duty and how voice might be a little problematic with that game. If you think about it, it’s not only problematic because of its genre, but because there is absolutely no reason to use it in the game.
What is the context that drives it? What is the reasoning behind it and how does the player understand that reasoning without explaining? Why do you absolutely need to use voice? If I didn’t buy into it and it didn’t feel normal I wouldn’t want to play it.

So the above description, a not-so-unique idea for a game, popped into my head. In movies you usually see scenes where a guy in front of a computer guides someone else over a walkie-talkie or an earpiece. This absolutely clicked with me and interested me because it made sense! I would use my voice to guide someone and talk to them. From a concept, gameplay and story standpoint it fits perfectly!


If you use speech recognition for the sake of using it you are doing it wrong.


Give it a reason to exist. Nobody wants to talk to a computer, they feel silly. Tackle that by finding a good reason for them to talk. In my game you literally talk to a character, it’s no extra addition or an afterthought; it’s the very main mechanic. You will never doubt why you have to use voice to do that, simply because you cannot do it in any other way!

The pitch helped me see clearly a lot of things about the game mechanics and how the voice takes part in the game world. Having the pitch in my mind was enough to make me create all the basic programming, mechanics and even the art style of the game. The art style became a big UI in which you could see a log screen of the dialogue, a documents “folder” with picked up documents, the controls of the game, the microphone and phrases you’d utter and the character stats which are Stamina and Oxygen.

This is what it looks like:











It didn’t look like that at first. It went through a lot of iteration until I decided to make it more “familiar”, with a Windows 8 style UI, to represent a window you could actually have on your PC.

There are some small details here and there too that make it more “alive” like the real date and time you play the game, FPS counter, location of the character, the name of the program used to keep track of the facility, some tips appearing every once in a while for the player, etc. Those too add to the storytelling. If you’re supposed to be looking at a radar on your computer it must appear like you are looking at a radar on your computer.

So not only the UI must look like it but also the game itself, which is that window on the left that takes the most space on the screen. It shows a map/blueprint of the facility, the enemies and the main character, the doors, loudspeakers and pick up items.

The facility’s design went through iteration too (as everything does, really). I’m talking about it because at first it was just plain blue, and it didn’t look like an actual place. Adding environmental items and some blueprint-style lines made it look like a real place, gave a sense of space and made it feel inhabited: a place where people could work.

Yeah that matters to the storytelling too. The static that appears on the radar once in a while matters too, the fact that the character blinks only when he is stationary matters too, the fact that enemies are red and triangular to show danger (even though a bit cliché) matters too. Everything matters and thinking about all these things to make it more polished is what will make it a better experience.

All these details came much later, as this is mainly polishing and refining things.

The main concept and mechanics of the game went like this:

  • You can open doors (you are in the control room and have access to everything)

  • You can enable loudspeakers to distract enemies

  • Guide the scientist to specific spots in the level to pick up items (oxygen packs and documents)

  • Help the scientist avoid danger and get to safety

  • Read documents that give backstory or some helpful door codes to use on doors

  • Guide the scientist to hiding spots (shadows in which the enemies don’t sense you)

Pretty “cliché” stuff and basic, but I was ok with them because the focus would be the voice. I mention the word “cliché” a lot. It’s true. I wanted innovation but at the same time I wanted to keep familiarity. It would help the player’s mind ease into the concept while adding something new to it. It’s not wrong to do something completely new, and doing what I did does not mean I took the safe or easy route. It means that I decided it would be better for the idea I had in mind, the story and the mechanics. Remember, this project is only the beginning of what we can do with voice recognition. Let’s build on it.

So, all those mechanics add to the story and the story helps the mechanics, and here's how that happens: the fact that I am in the control room and the facility is on lockdown means only I have access to some of the facility’s features, like the doors and loudspeakers. Using those, I can help the scientist move forward. That also means something else: that the scientist is relying on me to get him out. That in turn means that I am someone willing to help. It also means that specific character attributes for both main characters are already in place and I didn’t even have to think about them. The mechanics demand it in the story.

Clearly we can see from this train of thought how story derives from the mechanics. It went from the simple act of opening doors to the characters’ personality traits. Those personality traits can then be used to define the core of a character, which you flesh out and give a backstory to support, and then let loose in the game world you created to make decisions.

My journey of creating the game’s story started not only from those gameplay mechanics but also from its very main mechanic: the voice. And that’s all it took. I used it as a restriction and not a feature because I like working under pressure or limitations.

So I’m going to be talking to a character in the game but what else? Talking to a character is a story voice mechanic and I needed a gameplay voice mechanic too. That became the dictation or commands you have to give to the character. So you don’t only talk to them, you utter commands like “run, keep going, stop, wait, stop running, move back” etc.

In case you noticed, the above commands aren’t like the ones we have talked about above in our Call of Duty example. They aren’t “shoot right, move left”. They are more natural.

In the very beginning (not of the universe) the gameplay consisted of looking at your level and finding markers on it that the character could move to. Then you uttered, for example, “Marker C4” and the character went there. Imagine playing like that, moving from marker to marker like a little soldier. That’s not actually bad if you are creating a strategy game where you have to move into positions. And my gameplay consisted of that until I had written the first draft of my story. Then I had to reconsider the gameplay.

Let’s revisit the pitch idea:


Location: CDC Underground Facility. Sitting behind a computer desk, locked into the security room, you try to guide a scientist to safety as an accident caused most of the personnel to have unstable and violent conditions.


The keyword is “guide”. You are supposed to be helping the scientist, not commanding him. Saying “Move to Marker C4” is simply not natural. A scientist who works with you in that facility knows where to go if the story demands he goes, let’s say, to a specific laboratory. He isn’t a fool that needs to be told to move every 5 meters. The characters are real people and the scientist is not your pawn. You are working as a team trying to survive this. So the gameplay had to change and the way you talked had to change.


The facility is dark, it’s under lockdown and he knows where to go but he is afraid; he needs your help as you can see the facility on the computer monitor and any other movement. He starts moving towards the open door but an enemy is close and you tell him to stop. The enemy leaves and you tell him to continue because now it is safe to move. After a while it seems safe and he speaks to you, wondering what is going on. You reply back.


Now this is the experience I wanted to create. One step towards this experience was having voice commands sounding not like commands.

The marker system didn’t completely perish out of existence though. I used it for another aspect of the game. Say you need him to deviate from the path he is taking towards that laboratory because there’s a pick up item in a room. You tell him to wait and then mark the spot on the level and tell him to go there. He picks it up and you tell him to continue and he does. So there is a commanding feeling but now it’s minimal and happens for a reason.
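
The detour described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the game's implementation: the class, the coordinates and the phrases “go there” and “continue” are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Point = Tuple[int, int]


@dataclass
class Scientist:
    goal: Point                     # where the story is sending him
    detour: Optional[Point] = None  # spot marked by the player, if any
    moving: bool = True

    def current_target(self) -> Point:
        """A marked detour takes priority over the story goal."""
        return self.detour if self.detour is not None else self.goal

    def hear(self, phrase: str, marked: Optional[Point] = None) -> None:
        """React to a recognized phrase (the phrases are illustrative)."""
        if phrase == "wait":
            self.moving = False
        elif phrase == "go there" and marked is not None:
            self.detour = marked
            self.moving = True
        elif phrase in ("continue", "keep going"):
            self.detour = None  # back to the path he already knows
            self.moving = True
```

The point of the design is that the player never has to say where the laboratory is; the marker only exists for the short detour.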

In one of the paragraphs above, I mentioned the hidden dialogue. This basically means that you can ask anything you want, and as long as you say it the right way and it has been written as a valid phrase, the character will respond. Nothing forces you to say it; it’s completely optional. These are things that you might be wondering about, or facts from the story and the game world. You can go ahead and talk about them. This is very experimental and somewhat difficult. Like I mentioned above, speech recognition is sensitive, so you need to say something specific for it to register. But that isn’t the main problem. The main problem is the player’s thought. What in the world will the player think to say, how will they say it, and how do I know that? The answer is “with playtests”: watching people play the game and asking them to write down what they would ask. That would be great, but unfortunately, due to time limitations, I had no such luxury. I did take player feedback on that, but it wasn’t as extensive as I would have liked.
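
One way to picture the hidden dialogue is as a table where several phrasings a player might try all lead to the same reply. The Python sketch below is only an assumption about how such a table could look; the questions and the placeholder replies are invented for the example and are not lines from the game.

```python
from typing import List, Optional

# Hypothetical topics: (phrasings a player might try, placeholder reply).
HIDDEN_DIALOGUE = [
    (["what happened here", "what caused the accident"], "<reply about the accident>"),
    (["are you hurt", "are you okay"], "<reply about his condition>"),
]

# Flatten the table so every phrasing points straight at its reply.
_REPLY_BY_PHRASE = {
    phrase: reply
    for phrasings, reply in HIDDEN_DIALOGUE
    for phrase in phrasings
}


def hidden_reply(recognized: str) -> Optional[str]:
    """Return the reply for an optional question, or None if nothing matches."""
    return _REPLY_BY_PHRASE.get(recognized.lower().strip())


def hidden_phrases() -> List[str]:
    """Every phrasing the recognizer would need to listen for."""
    return sorted(_REPLY_BY_PHRASE)
```

Because the dialogue is optional, a `None` result needs no reaction at all. Each phrasing that playtesters come up with becomes one more entry in the table.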

Moving on, I had to make the gameplay more natural and the character feel human. Talking to him and not really commanding him are good steps towards that, but I decided to take it a step further. The character is a human being. He knows where to go but he is counting on you to tell him to stop or move. Fair enough. What happens in the game if an enemy grabs him and he starts losing “health”? (Health is the oxygen in the game for various spoiler-y reasons; go play the game when it comes out, okay?)

I, as a player, can tell him to run or keep moving and he will do so. What if I don’t tell him anything? Will he just stand there taking “damage”? That’s what usually happens with game characters, but in this game the scientist IS NOT the character you "embody". You play you, yourself, and the scientist is someone else. There are two main characters in the game: the scientist and you, sitting behind the PC monitor. So in order to make him feel real, I made it so that if you do not tell him to run or move, after a few seconds he will do so himself. Would anyone just stand there dying? No.

In the game you, as a player, also tell the character when to use an oxygen pack. If the oxygen level in his suit falls to a very small percentage, he will use an oxygen pack all by himself. Like I said, he is a human being and won’t let himself die if you don’t tell him. Does all this sound natural? It does to me. Does it raise some questions like “does that take control from the player”? Yes, it does raise questions. I had to answer that too when I was developing the game.

Of course, I mustn’t take control from the player, but you have to understand that this actually gives more control to the player. Why? Because when an enemy grabs the scientist, he starts losing oxygen rapidly. If you let his oxygen reach a very small percentage and then an enemy grabs him, he is most likely a lost cause. In order to avoid that, you have to keep the oxygen at normal levels, so as a player you keep an eye on it and make sure it doesn’t reach low levels. This makes the player vigilant and careful, and they have to decide when to use oxygen. On the other hand, if they don’t, the game won’t completely punish them; instead the scientist will take the initiative and use an oxygen pack. This is how I balanced it to make sure the control is in the player's hands but at the same time you feel that the character is a real person and not just another AI.
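
The oxygen balance above comes down to one threshold check. Here is a minimal Python sketch of it; the threshold and refill numbers are assumed values picked for the example, not the ones used in the game.

```python
from dataclasses import dataclass

AUTO_USE_THRESHOLD = 10.0  # percent; assumed value
PACK_REFILL = 50.0         # percent restored per pack; assumed value


@dataclass
class Suit:
    oxygen: float = 100.0
    packs: int = 1

    def use_pack(self) -> bool:
        """Use one pack (the player asked, or the scientist decided himself)."""
        if self.packs == 0:
            return False
        self.packs -= 1
        self.oxygen = min(100.0, self.oxygen + PACK_REFILL)
        return True

    def drain(self, amount: float) -> bool:
        """Lose oxygen; return True if the scientist used a pack on his own."""
        self.oxygen = max(0.0, self.oxygen - amount)
        if self.oxygen <= AUTO_USE_THRESHOLD:
            return self.use_pack()
        return False
```

A player who tops up early by voice never triggers the automatic branch, which is the sense in which the control stays with the player.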

Same with the character getting away from an enemy’s grasp. He won’t escape immediately; he will do so after some seconds and only if he finds another place to go to. If he doesn’t find a good spot to move to, he won’t. The seconds the player has to wait are also more than the time it would normally take to utter a command, meaning that if the character escapes, you have deliberately waited so he would do that (or you panicked and you didn’t know what to say – trust me it happened).
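
The escape rule can be sketched as a small timer. Again this is an illustration under assumptions, not the game's code: the delay value, the command phrases and the `escape_spot_available` flag are all made up for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Seconds before he frees himself; assumed value, chosen to be longer
# than it takes to say a command.
SELF_ESCAPE_DELAY = 4.0


@dataclass
class GrabState:
    grabbed_for: float = 0.0  # seconds since the enemy grabbed him
    escaped: bool = False

    def update(self, dt: float, command: Optional[str],
               escape_spot_available: bool) -> str:
        """Advance by dt seconds and report who freed him, if anyone."""
        if self.escaped:
            return "free"
        if command in ("run", "keep going"):
            self.escaped = True
            return "player"
        self.grabbed_for += dt
        if self.grabbed_for >= SELF_ESCAPE_DELAY and escape_spot_available:
            self.escaped = True
            return "self"
        return "held"
```

A spoken command always wins over the timer, so the scientist only frees himself when the player has stayed silent for the whole delay.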

You can obviously see how I approached the rest of the game too. You might wonder “does it really take this much thought to create such a simple game?” I agree the game is very simple actually. Some gameplay aspects are seen in other games, some decisions made are very basic when designing a game and some of the things above are really common sense (maybe?).

So why am I writing all this? Because this game is very experimental. I have not played a game that attempted to combine story and voice recognition before. I didn’t know where to start in order to create it. Where are the blueprints and the design documents that state what I have to do? This isn’t another third-person game, and think how many of those we have and yet how difficult it still is to create a new third-person IP.

This is why I’m writing this. I’m not here to say that the game is perfect because I have thought of all the above beforehand. I’m not here to say this is a great story-driven game made with voice. No. Maybe you’ll hate the story, maybe you’ll hate the gameplay or maybe you’ll love everything!

I’m here to say this is the first game that attempts that. Whether people say it’s awful or incredible won’t matter; what will matter is that we can take this feedback and make the next story-driven, voice controlled game even better.


The game will be out in January for free. Check it out here: http://gamejolt.com/games/plan-be/86295


Tips for creating a voice controlled game:

  • Speech recognition is not perfect yet. But it’s really good and you should use it.

  • If any action in your game can be done better with any input but speech input, you’re using speech recognition wrong.

  • If you’re going to be talking to a character they must be less AI and more human.

  • The mechanics of your game aren’t separate from the voice use. Make sure it all makes sense and be consistent in how you use voice.

  • People feel silly talking to the PC. Give them a good reason; justify the use of voice in your game.

  • If you’ll use voice make sure it’s not an afterthought for your game. Decide if you’ll use it for speech or dictation or both.

  • Menu controlling needs thinking. Think about how voice is going to make it a better experience; don’t just add it for the sake of adding it.

  • Make sure the grammar you use for valid speech is something that people feel natural saying. Playtest it!

  • Manage active words and phrases with rules. You do not need the dialogue phrases when you are not engaging in a dialogue.

  • Think of how you will teach your players the grammar set they can use if you have too big of a grammar set.

  • Make speech failure into a feature; make the character say “sorry didn’t hear you” or add static to show that it wasn’t registered.

  • Pick your grammar carefully. Do not add “sea” and “see” for obvious reasons.

  • Don’t expect people to get accustomed to it easily. Ease them into it.

  • Give players some breathing room between the times they have to use their voice.
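
Two of the tips above, managing active phrases with rules and turning speech failure into a feature, fit in one small Python sketch. It is a sketch under assumptions, not the game's code: the rule names and phrases are invented, and a real recognizer would supply the `recognized` text (or nothing at all when it fails).

```python
from typing import Optional, Set

# Hypothetical rule sets: which phrases are listened for in which context.
RULES = {
    "movement": {"run", "stop", "wait", "keep going"},
    "dialogue": {"yes", "no", "i don't know"},
}


class Grammar:
    def __init__(self) -> None:
        self.active: Set[str] = {"movement"}

    def enable(self, rule: str) -> None:
        self.active.add(rule)

    def disable(self, rule: str) -> None:
        self.active.discard(rule)

    def active_phrases(self) -> Set[str]:
        """Union of the phrases from every rule that is switched on."""
        return set().union(*(RULES[rule] for rule in self.active))

    def handle(self, recognized: Optional[str]) -> str:
        """Accept an active phrase, otherwise answer in character."""
        if recognized is None or recognized not in self.active_phrases():
            return "Sorry, didn't hear you."
        return f"OK: {recognized}"
```

Switching the dialogue rule on only while a conversation is running keeps the active phrase set small, so there are fewer phrases for the recognizer to confuse with each other.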
