While the Xbox 360's Kinect has proven popular with the mass consumer, developing games that accurately reflect player movement, and really take advantage of the 3D motion-sensing capabilities, has been a major challenge.
Here, David Quinn, who works at Microsoft's Rare studio in the UK as a Kinect engineer, details how he has approached different challenges when developing for the system and how he's handled them, over the course of developing Kinect Sports and its sequel.
How do you do a game like darts, where most of the player's arm is occluded by the body? How do you handle golf, when the Kinect camera loses track of the player's arms during the swing? How do you handle different accents across the UK and the U.S.?
Since Rare is a Microsoft first party, does the stuff you write end up going back into the Kinect SDK?
DQ: There are a couple of things Rare has done that have gone into the SDK. The avateering system, we did that at Rare; that's where you take the 20-joint skeleton and turn it into the 70-joint avatar. And this machine learning system that we've recently built with the platform team for Kinect Sports 2; we helped out with that as well. They did the more mathematical side, and we worked on the tools.
Have you seen implementations of Kinect in third party games that have impressed you or that do things that you weren't expecting?
DQ: Sure. What Mass Effect has recently done with Kinect's speech system is an excellent use of speech. We pushed speech in Sports 2; that was always going to be a huge thing for us. It was going to be a key thing, a differentiator from Sports 1. But what the Mass Effect guys have done is bring it into a core title, showing it could be used with a controller. It doesn't have to be the "get up and dance" kind of experience. You can use speech in Kinect in a more core title, and it really demonstrated that. I think from here on in you'll see a lot of speech in core games.
Are you primarily concentrating on the skeleton and the visual tracking, or do you work a lot with speech as well?
DQ: I work with both of them, yeah. It's odd; Kinect is like a brand, but it's actually a group of technologies, really. I'm kind of the Kinect rep at the studio, so I kind of touch both. I did all the speech work for Sports 2, basically by myself, and then quite a bit of gesture work as well. The machine learning system in golf was kind of my responsibility as well.
Can you describe what that accomplishes?
DQ: For golf, the major problem is the player's side faces the camera, so we don't actually get a great feed off the skeleton tracking system, because the back half of the body is completely occluded. All those joints are kind of inferred, basically. It gives a good guess of where it thinks it is, but it has no real meaning.
So when the player does a backswing, we track their hands a little, detecting when they do a forward swing. We worked out a hacky -- "hacky" is a bad word -- an unscientific way of running the animation. But when the player actually hits the ball and it flies off into the air, that has to be very reliable, because getting it wrong is so detrimental to gameplay. Obviously, that's the entire game: hitting the ball.
So, in the early days of golf, we kind of had it so you could do a full backswing and then just drop your hands; we didn't want the ball to go, but our hand-coded system would actually release the ball.
That's when we went to the ATG guys, the advanced tech group in Microsoft: "This is kind of what we're seeing. We've got a problem with the golf swing; do you have any recommendations?" They came back with this idea of creating a machine learning system for gestures.
What we basically ended up doing was recording about 1600 clips of people doing golf swings in front of Kinect, tagging in the clip where the ball should release, and then getting the computer itself to work out what's consistent among all those clips.
Then what happens is the trainer creates a classifier, and we run that classifier at runtime, so we can pipe a live feed into the classifier, and it can go, "Yes, the ball should release now," because it's been trained on a load of clips. It knows when it should happen. When the golf ball flies off in golf, it's done in that system; there's no hand-written code. It's all mathematical.
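The actual trainer and classifier Rare used aren't described in detail here, but the train-then-classify split Quinn outlines can be illustrated with a toy sketch. This is an assumption-laden simplification: the "feature" is just forward hand speed per frame, and "training" merely picks the decision threshold that separates the tagged release frames from the rest of each clip.

```python
# Toy sketch of the train/classify split described above. The real
# system learned from ~1600 tagged skeleton clips; here, clips are
# just lists of per-frame hand speeds, tagged with the release frame.

def train(clips):
    """clips: list of (hand_speeds, release_frame) pairs.
    Returns a speed threshold learned from the tagged frames."""
    release_speeds = [speeds[tag] for speeds, tag in clips]
    other_speeds = [s for speeds, tag in clips
                    for i, s in enumerate(speeds) if i != tag]
    # Split the two populations halfway between their means.
    mean = lambda xs: sum(xs) / len(xs)
    return (mean(release_speeds) + mean(other_speeds)) / 2

def make_classifier(threshold):
    """Runtime side: feed it live frames, it says 'release now'
    the first time the hand speed crosses the learned threshold."""
    def classify(hand_speed):
        return hand_speed >= threshold
    return classify

# "Record" three tiny clips: speeds per frame, release at frame 2.
clips = [([0.1, 0.4, 3.0, 1.0], 2),
         ([0.2, 0.5, 2.8, 0.9], 2),
         ([0.1, 0.3, 3.2, 1.1], 2)]
classify = make_classifier(train(clips))

# Pipe in a "live feed" and find the frame where the ball releases.
live_feed = [0.1, 0.6, 2.9, 0.8]
release_frame = next(i for i, s in enumerate(live_feed) if classify(s))
print(release_frame)  # -> 2
```

The point of the structure, as in the real system, is that the runtime classifier contains no hand-written gesture logic; everything it "knows" came from the tagged clips.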
Does that have more overhead than other solutions?
DQ: You'd be surprised. 1600 clips sounds like a lot, but the thing is we record them quite quickly. I just wrote a tool, basically, that we ran on five dev kits at once. We had family days at Rare, so everyone would bring in their kids and partners. We wanted a wide cross-section of people doing the swings. Everyone would stand in front of their dev kits, and we would say, "Turn to the side. Do a golf swing." And we would just record them all onto the server.
The other interesting thing is, once we had all those clips, an engineer doesn't really need to tag them up. We actually gave it to our testers and said, "Here's a hundred clips. Spend the next hour tagging them." They can just go through in a video-editing tool and say, "Here it is. Here it is. Here it is."
So it's not really an engineering-driven problem. That really helps as well. That's basically how we did all those tags. Now that we've done that with golf, we're actually doing that with all of our events.
Is it a more effective way to determine natural motion -- the kind of motions players will do?
DQ: It's another tool in the tool belt, basically. The machine learning system we use in golf is very discrete; it's good at detecting specific events: the ball should release now. For example, table tennis is a very analog, skill-driven system, so it's a different kind of gesture.
You have to look at what you're trying to detect and then pick the right tool. Machine learning is just another one of those tools -- a very powerful one. I don't think we could have done golf to the level that we did without having that system.
You've worked with Kinect since the Project Natal days. Has Kinect come further in terms of recognizing people's movements and recognizing multiple people in front of the camera than you actually anticipated?
DQ: Yeah, I think so. Since the Kinect launched, we've had two upgrades to the tracking system: more data sets, more training. Every time that's happened we've seen it getting better and better. Whether it's beyond my expectations, I was pretty blown away the first time I saw it (laughs), so it's a very high bar.
I know you're working on Sports, and that sort of does limit things. You're going to pick specific sports. When you're working with the designers, do they have to come to you and say, "This is the idea we want to do. Can you figure out a way to do this in engineering?" Or is it more of a back-and-forth where you're like, "This is the kind of tracking that is possible"?
DQ: It's definitely a back-and-forth. I'd say for Sports 2, they picked a ton of tough ones for us: darts, baseball, and golf. When they first suggested darts, I was almost in disbelief.
Because your hand's going to be right in front of your face.
DQ: Absolutely. For the precise motion that they wanted, I was almost one of the guys going, "No, no, no. We can't do that." But then you look at it and start thinking about how we could actually do it. Darts is actually brilliant; it's one of my favorite games in Sports 2.
Darts uses a system we didn't use at all in Sports 1; it's built almost entirely around the depth feed, that image feed of how far everything is from the camera. We actually don't use the skeleton as much.
That's not something we really did much in Sports 1. It's just looking at all the information Kinect gives you and working out which bits you should look at to run the system that you want. The skeleton tracking, the depth feed -- all kinds of stuff.
Is it as much about excluding information as it is about including information?
DQ: It's definitely working out the context -- exactly what you're looking at. An example of tailoring the information was the boxing punch in Sports 1. Initially, we were looking at the skeleton feed, thinking that would be the best way to detect it; obviously, as the hand launches forward, that's a punch.
But since the hand's in front of your body, it's one of those occlusion issues. The skeleton feed can struggle with occlusion. So in the end, we turned to the depth feed and painted these panes of glass in front of the player. When you punched through the panes, they all broke; that's how we did the punch.
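Rare's actual implementation isn't shown, but the "panes of glass" idea can be sketched roughly. The assumptions here are invented for illustration: virtual panes sit at fixed distances in front of the player, a pane "breaks" when any depth sample in the hand's region comes closer to the camera than that pane, and a punch registers only once every pane has broken.

```python
# Rough sketch of the "panes of glass" punch check over a depth feed
# (per-pixel distance from the camera). Pane distances are invented.

PANE_DEPTHS_MM = [1800, 1600, 1400]  # hypothetical pane distances

def update_panes(broken, hand_region_depths):
    """broken: set of pane depths already shattered this attempt.
    hand_region_depths: depth samples (mm) over the hand's pixels."""
    nearest = min(hand_region_depths)
    for pane in PANE_DEPTHS_MM:
        if nearest < pane:
            broken.add(pane)
    return broken

def is_punch(broken):
    # A punch counts only when every pane has been broken.
    return len(broken) == len(PANE_DEPTHS_MM)

# Simulate a punch: the hand sweeps toward the camera frame by frame.
frames = [[2100, 2050], [1750, 1790], [1550, 1500], [1350, 1380]]
broken = set()
punched_at = None
for i, depths in enumerate(frames):
    broken = update_panes(broken, depths)
    if is_punch(broken):
        punched_at = i
        break
print(punched_at)  # -> 3
```

Requiring all panes to break is what makes this robust to occlusion and drift: a hand hovering mid-way breaks only the farthest pane, while a real forward thrust sweeps through them all.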
It's one of those instances of taking the consistent information the game is receiving, and trying to look at specific bits [of that information]. That can vary from sport to sport depending upon if it's an analog-y moving game, or a precise dart throw, or a specific moment like a golf swing for when we want to release the ball. So they're all quite different problems.
Are there any challenges in Kinect engineering that you haven't had a chance to tackle yet, or is there something that you're looking forward to tackling?
DQ: I think the big one coming up is speech. We pushed speech pretty hard in Sports 2. There was speech in the first round of launch titles; Kinectimals obviously had speech. But for Sports 2, from day one, the entire UI was gonna be speech-driven. Every game event had to have speech incorporated into it.
But it was also a very say-what-you-see approach; in golf, you change club [by saying] "four iron," kind of thing. What I'd like to see and what we're investigating now is a more natural conversation way of talking to the Kinect, so you can say, "Hey, caddy, give me a five iron," or "Hey, caddy, what should I use now?"
We're looking at that now, improving the speech system, so I think that would probably be the one that I'm personally the most interested in, mainly because I did so much work with speech in Sports 2.
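The gap between "say what you see" and the more conversational phrasing Quinn describes can be illustrated with a toy parser. A real Kinect title would declare its commands as a speech grammar for the recognizer; this hypothetical sketch just scans the recognized utterance for a known club phrase, so "four iron" and "hey caddy, give me a four iron" resolve to the same intent.

```python
# Illustrative only: one intent ("select club") reachable from both
# terse and conversational phrasings. Club list is an assumption.

CLUBS = ["driver", "three wood", "four iron", "five iron",
         "nine iron", "pitching wedge", "putter"]

def parse_club_request(utterance):
    """Return the club named anywhere in the utterance, or None."""
    text = utterance.lower()
    for club in CLUBS:
        if club in text:
            return club
    return None

print(parse_club_request("four iron"))                       # say-what-you-see
print(parse_club_request("hey caddy, give me a five iron"))  # conversational
print(parse_club_request("hey caddy, what should I use?"))   # no club named -> None
```

A question like "what should I use now?" falls through to None here, which is exactly the kind of open-ended request that needs more than keyword spotting.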
Do you think that at this point with most of the visual input, whether it's 3D data or skeleton data, you've now encountered enough situations where you have a good toolbox to solve any of those problems?
DQ: Yeah, I think so. It's interesting now, as we look at new ideas, how quickly the engineers who've worked with Kinect a lot can pick out what the challenges will be. "If we do this event or this style of game, these are the things that we're going to have to deal with." That's just because we have so much experience with it now.
Our 13 sports now have been so varied -- as we said before, the gestures vary from sport to sport, so we have a good cross-section of what we've been doing and how we've solved problems in the past. As new ideas come in, we can all think, "This will be a challenge," or "Yes, we could do that pretty easily; we can copy what we did in track and field."
I think the only place that you might have new frontiers is if you go to a totally new genre like an adventure game. We recently did an article with Blitz Games for Puss in Boots, and one thing discussed was that if the developers did one-to-one tracking with the character, the character didn't look heroic on-screen anymore, because people have an exaggerated assumption of how cool they look when they're doing things -- which isn't exactly a problem you have with Sports.
DQ: Yeah. If you look at Star Wars, what they've done there is some really interesting stuff, blending that one-to-one with extra animation so you use both at once. That means you get your power moment.
I've played it a couple of times, and it's interesting. When you stand there and realize that the character on screen is really puffing up their chest and getting ready for a swing, you find yourself mimicking that, and start doing it yourself, because you're getting into the thing. We call it "augmenteering" at Rare, a play on "avateering," which is that one-to-one mapping of animation. We did a little bit of augmenteering in Sports, but most of the time we were trying to get the one-to-one -- the player in the game -- as much as we could.
When it comes to speech, how much of a problem do you have with accents?
DQ: The speech system at the moment has what we call acoustic models. I'm Australian, but I actually run the UK English model because I think I've been in England long enough that I've lost my twang. Say we have execs come across from the States; if we leave the kits in U.S. mode, recognition does go down for the UK people speaking. So the acoustic models are quite tailored to the regions. The UK model contains Scottish, Irish, the thick, pommy accents, whereas the U.S. model has the Southern, and all of the American ones.
The reason those models exist and are different is that they have to include those accents for the regions. Our biggest challenge -- we have a Scottish guy at work, and he has the thickest accent. He actually interviewed me, and I could hardly understand what he was saying. If it works for him, we know it works. He's our test case, basically. "Good. It works for him." (laughs)
Whenever you do speech, they always recommend getting native speakers in front of the game, so we were sending people out to Japan and to Germany and everywhere to get native speakers talking and testing in front of the game.
Basically, what we're doing is lowering a number so it's as low as it needs to be to detect the speech, but still high enough to reject false accepts. It's just a tuning; we just dial it back and forth. We always have that infamous week at Rare where I turn it down too low and the game's just jumping around on any noise because it's just accepting everything. My name is mud for a week, and then we just turn it up again. It's really iterative, just trying to find that special spot -- and that special spot's different for each acoustic model, so the U.S. number is different from the UK number. It's just a tuning process.
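The "number" Quinn is dialing back and forth is a confidence threshold on the recognizer's output. As a sketch, with invented playtest data: each recognition attempt yields a confidence score, and tuning picks the lowest threshold that still rejects all the noise, so real commands get through while the game doesn't "jump around on any noise."

```python
# Sketch of the tuning loop described above. Confidence values and
# candidate thresholds are invented; each acoustic model (UK, U.S.)
# would get its own tuned number.

def tune_threshold(samples, candidates):
    """samples: (confidence, is_real_command) pairs from playtests.
    Returns the lowest candidate threshold with no false accepts."""
    def false_accepts(t):
        return sum(1 for conf, real in samples if conf >= t and not real)
    usable = [t for t in candidates if false_accepts(t) == 0]
    return min(usable) if usable else max(candidates)

# Invented playtest log: confidences for real commands vs. noise.
samples = [(0.92, True), (0.85, True), (0.78, True),
           (0.55, False), (0.40, False), (0.61, False)]
candidates = [0.3, 0.5, 0.7, 0.9]

threshold = tune_threshold(samples, candidates)
print(threshold)  # -> 0.7
```

Set it to 0.3 here and every noise sample is accepted (the infamous week); 0.9 rejects noise but also risks ignoring quieter real commands, which is why the sweet spot is the lowest value that still holds the line.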