Howdy. My name is Michael, and I implemented speech recognition for my upcoming game Radio General. In it, I wanted players to command their troops with their voice, like in Tom Clancy's Endwar (see below).
This was a long journey, but I'm hoping my experiences can help those who wish to follow (or decide not to follow!).
State of speech recognition in gaming:
Even though we increasingly talk to Siri and Cortana on our phones, voice commands in games have not become popular - I can count on one hand the number of games that use them (Tom Clancy's Endwar, There Came an Echo, Kinect games?). Speech recognition in games just isn't done. There's a multitude of reasons for this: voice controls are slower than mouse and keyboard, the recognition isn't perfect, and talking out loud isn't always ideal (you may live with others/children/pets). Despite all that, I still wanted speech recognition for my game.
So how does one put speech recognition in their PC game? Well, there are two obvious options. Let's go over each.
1. Microsoft's built-in speech API. Windows has had voice-to-text for a while, and has expanded it most recently in Windows 10. This is easy to use, but unfortunately inflexible. You can't control the words the recognition is looking for, and it's only available on Windows 10. If you want Mac or Linux support, you'd have to implement a different solution.
2. An open-source speech library (CMUSphinx or Kaldi). The benefit of using an open-source library is that we can make changes to it, and it can be built for any OS - one solution can fit all. The downside, however, is that the speech accuracy out of the box is... bad. It'll take some work to tune it into something usable.
We went with #2, and chose CMUSphinx since one of our team had used it before. Our game uses Unity, and there are several examples of how to set up Sphinx in it.
Next we had to answer: how did we want voice commands to be used in-game? To answer this, we'll need a brief description of my game.
Radio General is a real-time strategy game where you can't see your units. Instead you talk with them over the radio. You receive verbal reports, and then issue orders back.
So, in-game, you're commanding 3-8 units, and you need to be able to ask them a few questions and tell them where to go. We decided to give each unit a unique, alphabetical name. NATO hadn't been formed yet, so we used an older radio phonetic alphabet: Able, Baker, Charlie, Dog, Easy, Fox, etc. (I really like Oboe for O).
We decided to emulate Tom Clancy's Endwar, with most commands following this syntax:
<RECIPIENTS> <ORDER> <EXTRA INFORMATION>
Ex: Able move to J 5.
We then came up with a full list of words and orders we thought we wanted (spoiler alert: some of these were dropped), along with their importance. Here's that full list.
Finally, we knew what we wanted. When running in-game, the speech recognition provides us with a string of detected words, in order. To use this string, we wrote a simple script that checks whether any unit names are present, and then whether any order words are present (move to, head, report status, etc.).
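As a rough sketch, that parsing step can look something like the following. The unit names and order words here are illustrative stand-ins, not the game's actual word list:

```python
# Hypothetical sketch of the command-parsing step: scan the recognizer's
# detected word string for unit names, then for an order keyword.
UNIT_NAMES = {"ABLE", "BAKER", "CHARLIE", "DOG", "EASY", "FOX"}
ORDERS = ["REPORT STATUS", "MOVE TO", "RETREAT", "HEAD"]  # multi-word first

def parse_command(detected: str):
    """Return (recipients, order, extra_words), or None if no order found."""
    words = detected.upper().split()
    recipients = [w for w in words if w in UNIT_NAMES]
    sentence = " ".join(words)
    for order in ORDERS:
        idx = sentence.find(order)
        if idx != -1:
            # Everything after the order keyword is extra info (e.g. a grid ref).
            extra = sentence[idx + len(order):].split()
            return recipients, order, extra
    return None
```

For example, `parse_command("Able move to J 5")` splits out Able as the recipient, "MOVE TO" as the order, and "J 5" as the extra information.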
Out of the box, the results were horrible. It was detecting words when no one was talking, and when talking it detected all sorts of garbage words that weren't even in our game. We tried it out with different people and different microphones, and everyone agreed: it sucked. It was frustrating, and nobody wanted to use it.
We needed to limit the words it was looking for, so that it only listened for words needed to play the game. For that, we needed a dictionary file. A dictionary file is a simple text file that lists the possible words that can be detected, along with their pronunciation.
ABLE EY B AH L
The pronunciation is broken up into phonemes. You can even list several ways to pronounce the same word:
ARE AA R
ARE(2) ER
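As an illustration (this loader is my own sketch, not Sphinx's code), the dictionary format is simple enough to parse in a few lines, with the `(2)`-style suffix marking alternate pronunciations of the same word:

```python
def load_dictionary(lines):
    """Parse CMU-style dictionary lines into {word: [pronunciations]}."""
    pronunciations = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        word, phones = parts[0], parts[1:]
        # Strip the (2), (3), ... variant marker so alternates group together.
        base = word.split("(")[0]
        pronunciations.setdefault(base, []).append(phones)
    return pronunciations
```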
This dictionary file helped - we no longer got irrelevant garbage words. However, the accuracy was still very low (~70%). 70% might sound high, but it's not - imagine if a third of mouse-clicks just didn't work when playing an RTS. Players would try it for 5 minutes, and then give up.
The next step to improve accuracy is training the model. I'll skim over most of the technical bits, but training involves collecting a large amount of LABELLED voice clips, and feeding them into a model. The labelled part is important - we need to know EXACTLY what words are said, and in what order, for each voice file. If mislabelled, the model's accuracy may become even worse. Garbage in, garbage out.
Our recognition accuracy improved drastically after training on these datasets, but unfortunately these datasets don't mention anti-tank guns. Or artillery barrages. Dang. So for words that were missing, the accuracy still remained poor. To fill this gap, we needed to collect data on very specific words for our game, ideally in combinations that will be used in-game (ex: a full clip saying: "deploy reserve anti-tank guns").
We needed to create our own dataset to fill the missing gaps. So whenever a tester played the game, we recorded their voice commands. Every time they held down SPACEBAR to talk, a new .wav file was recorded, with the filename being what the speech recognition detected. This approach is absolutely fantastic - you want to train on data as close as possible to what your players will be generating. Testers are actually playing the game with various microphone setups and different accents, mirroring what your playerbase will be doing. Awesome!
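The recording step can be sketched roughly like this. The function name, directory, and audio parameters are my assumptions for illustration (16 kHz mono is a common sample rate for Sphinx acoustic models), not the game's actual code:

```python
import re
import wave

def save_clip(samples: bytes, detected: str, out_dir: str = "clips") -> str:
    """Save a push-to-talk recording, named by the detected transcript.

    'samples' is assumed to be raw 16-bit mono PCM audio.
    """
    # Filenames can't hold arbitrary text: keep letters/digits, join with '_'.
    label = re.sub(r"[^A-Za-z0-9 ]", "", detected).strip().replace(" ", "_")
    path = f"{out_dir}/{label or 'unrecognized'}.wav"
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 16-bit samples
        wav.setframerate(16000)  # 16 kHz
        wav.writeframes(samples)
    return path
```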
The problem with recording testers is that our speech recognition isn't perfect, and the labelled recordings were often wrong. Remember: garbage in, garbage out. This data needed to be cleaned and relabelled. We had a person listen to each file, and correct the filename with what was actually said. Here's what the cleaned tester recordings look like.
Finally, with this excellent real-world data, we trained our model again. Part of the training process is to generate a statistical language model. This file lists a number of words, and their likelihood of appearing together (or by themselves). Think of it as a bunch of weighted terms. Here's an example of what that looks like:
-0.3860 DOG MOVE TO
-1.9542 DOG MOVE WEST
-0.3010 DOG REPORT STATUS
-0.3010 DOG RETREAT </s>
-0.3010 DOG WHERE ARE
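Those leading numbers are base-10 log probabilities, the standard convention in ARPA-format language models, so a weight of -0.3010 corresponds to roughly a 50% likelihood:

```python
# ARPA-format language models store base-10 log probabilities.
# Converting back shows how much likelier common phrases are than rare ones.
def arpa_probability(log10_weight: float) -> float:
    return 10 ** log10_weight

print(round(arpa_probability(-0.3010), 2))  # 0.5
print(arpa_probability(-1.9542) < arpa_probability(-0.3860))  # True:
# "DOG MOVE WEST" is weighted as far less likely than "DOG MOVE TO"
```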
After all this training, our testers actually found the voice controls usable (and sometimes, fun!). However, we still found giving grid coordinates to be inaccurate. Coordinates are specified by <LETTER> <NUMBER> (ex: A 6, B 9). The problem is that a lot of letters sound similar: B, C, E, G, V, T (all ending with 'eeeee'). After much debate, we switched grid coordinate letters over to NATO lettering (ex: Alpha, Bravo, Charlie, Delta). No, this isn't historically accurate, but you NEED to be able to specify grid coordinates with accuracy (lest you artillery your own men).
The last step was to make actually USING the voice commands in-game more friendly. This took the form of mid-speech feedback: the game displays which voice commands are available to you right now, and commands and items are highlighted as you speak, showing what remaining words are needed to complete the command. Here's what that looks like.
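One way to think about that feedback is prefix-matching the words heard so far against the command grammar. A small sketch under that assumption (the command strings are illustrative, not the game's real grammar):

```python
# Minimal sketch of mid-speech feedback: given the words heard so far,
# filter the command list to those still completable, and report what
# the player would need to say to finish each one.
COMMANDS = [
    "ABLE MOVE TO",
    "ABLE REPORT STATUS",
    "BAKER MOVE TO",
    "BAKER RETREAT",
]

def available_completions(spoken_so_far: str):
    """Remaining words for each command whose start matches what was spoken."""
    heard = spoken_so_far.upper().split()
    matches = []
    for command in COMMANDS:
        words = command.split()
        if words[:len(heard)] == heard:
            # These leftover words are what the UI would highlight next.
            matches.append(" ".join(words[len(heard):]))
    return matches
```

Saying "Able" narrows the display to "MOVE TO" and "REPORT STATUS"; an empty remainder means the command is complete.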
So there you have it! That's how we implemented speech recognition in our game. If you have a good microphone, and speak English with a Canadian accent, the accuracy is quite high. If you talk with other accents... well... your mileage will vary (British accents are ROUGH).