In this article, I'm going to describe Talking Heads, our facial animation system which uses parsed speech and a skeletal animation system to reduce the workload involved in creating facial animation on large-scale game projects. SCEE's Team Soho is based in the heart of London, surrounded by a plethora of postproduction houses. We have always found it difficult to find and keep talented animators, especially with so many appealing film projects being created on our doorstep here in Soho.
The Getaway is one of SCEE's groundbreaking in-house projects. It is being designed by Team Soho, the studio that brought you Porsche Challenge, Total NBA, and This Is Football. It integrates the dark, gritty atmosphere of films like Lock, Stock, and Two Smoking Barrels and The Long Good Friday with a living, breathing, digital rendition of London. The player will journey through an action adventure in the shoes of a professional criminal and an embittered police detective, seeing the story unfold through the eyes of two completely different characters, each with their own agenda.
The Getaway takes place in possibly the largest environment ever seen in a video game; we have painstakingly re-created over 50 square kilometers of the heart of London in blistering photorealistic detail. The player will be able to drive across the capital from Kensington Palace to the Tower of London. But the game involves much more than just racing; the player must leave their vehicle and enter buildings on foot to commit crimes ranging from bank robberies to gang hits.
So, with a huge project such as The Getaway in development and unable to find enough talented people, the decision was made to create Talking Heads, a system that would severely cut down on the number of man-hours spent on tedious lip-synching.
Breaking It Down
The first decision to be made was whether to use a typical blend-shape animation process or to use a skeleton-based system. When you add up the number of phonemes and emotions required to create a believable talking head, you soon realize that blend shapes become impractical. One character might have a minimum of six emotions, 16 phonemes, and a bunch of facial movements such as blinking, breathing, and raising an eyebrow. Blend shapes require huge amounts of modeling, and also huge amounts of data storage on your chosen gaming platform.
The skeleton-based system would also present certain problems. Each joint created in the skeleton hierarchy has to mimic a specific muscle group in the face.
"If you want to know exactly which muscle performs a certain action, then you won't find an answer in Gray's Anatomy. The experts still haven't defined the subject of facial expression. Though psychologists have been busy updating our knowledge of the face, anatomists have not." -- Gary Faigin, The Artist's Complete Guide to Facial Expression
Most information on the Internet is either too vague or far too specialized. I found no one who could tell me what actually makes us smile. The only way forward was to work with a mirror close at hand, studying my own emotions and expressions. I also studied the emotions of friends, family, work colleagues, and people in everyday life. I have studied many books on facial animation and over the years attended many seminars. I strongly recommend a book by Gary Faigin, The Artist's Complete Guide to Facial Expression. If you can, try to catch Richard Williams in one of his three-day master classes; his insight into animation comes from working with the people who created some of the best Disney classics.
Building Your Head
Only part of the face is used in most expressions; the whole face rarely moves at once. The areas around the eyes and brows and around the mouth contain the greatest number of muscle groups, and they are the areas that change the most when we create an expression. We look at these two regions first and gather most of our information from them. Although other areas of the face do move (the cheeks in a smile, for example), 80 percent of an emotion is portrayed through these two areas.
Neutral positions. We can detect changes in a human face because we understand what a face in repose looks like: the positions of the brow and the mouth, and how wide the eyes are. These elements are constant from face to face, whether or not we are familiar with a person's face at rest (see Figure 1).
This changed the way we built our models, adding greater detail around the eyes and the mouth. Simulating the muscle rings seen in anatomy books allowed for greater movement in the face at these points.
The proportions of the face are the key to building a good head. Get this right and you are well on the way to creating realistic facial animation. Asymmetry is another goal to strive for when modeling your heads. Do not create half a head and flip it across to create the other half. The human head is not perfectly symmetrical.
Study of facial proportions by Leonardo da Vinci.
There are many rules concerning facial proportions. The overall shape of the head is governed by a simple rule: The height of the skull and the depth of the skull are nearly the same. The average skull is only two-thirds as wide as it is tall. The human head can be divided into thirds: forehead to brow; brow to base of nose; and base of nose to chin. The most consistent rule is that the halfway point of the head falls in the middle of the eyes. Exceptions to this are rare. A few other general rules:
- The width of the nose at the base is the same as the width of an eye.
- The distance between the brow and the bottom of the nose governs the height of the ear.
- The width of the mouth is the same as the distance between the centers of the pupils.
- The angle between the top lip and the bottom lip is 7.5 degrees.
- The bottom of the cheekbones is the same height as the end of the nose.
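Rules like these are easy to check mechanically when building or validating a head model. Here is a minimal sketch; the landmark names, measurements, and tolerance are my own illustration, not part of any production tool:

```python
# Hypothetical landmark measurements for a head model, in arbitrary units.
# Names and the 10 percent tolerance are illustrative.

def check_proportions(head, tolerance=0.1):
    """Return a list of proportion rules the head violates."""
    failures = []

    def ratio_ok(a, b):
        return abs(a - b) <= tolerance * max(a, b)

    if not ratio_ok(head["nose_base_width"], head["eye_width"]):
        failures.append("nose base should be as wide as an eye")
    if not ratio_ok(head["mouth_width"], head["pupil_distance"]):
        failures.append("mouth should span the distance between the pupils")
    if not ratio_ok(head["skull_width"], head["skull_height"] * 2 / 3):
        failures.append("skull should be two-thirds as wide as it is tall")
    return failures

head = {
    "nose_base_width": 3.0, "eye_width": 3.1,
    "mouth_width": 6.2, "pupil_distance": 6.0,
    "skull_width": 16.0, "skull_height": 24.0,
}
print(check_proportions(head))  # [] -- all rules satisfied
```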
The heads for The Getaway all stem from one model. This head contains the correct polygon count, animation system, and weighting. We scan actors using a system created by a company called Eyetronics, a very powerful and cost-effective scanning process. A grid is projected onto the face of the person you wish to scan, and photographs are taken. These photographs are passed through the software and converted into 3D meshes. Each mesh is sewn together by the software, and you end up with a perfect 3D model of the person you scanned. At the same time it creates a texture map and applies this to the model.
Then the original head model, the one that contains the correct polygon count and animation, is morphed into the shape of the scanned head. Alan Dann, an artist here at SCEE, wrote proprietary in-house technology to morph the heads inside Maya. The joints in the skeleton hierarchy are proportionally moved to compensate for the changes in the head. We are left with a model that has the stipulated in-game requirements but looks like the actor we wish to see in the game.
1,500-polygon model used for high-res in-game scenes and medium-resolution cut scenes.
The Getaway heads are designed with an incredible level of detail. We use a 4,000-polygon model for extreme close-ups in the real-time cut scenes. The highest-resolution in-game model is 1,500 polygons, which includes tongue, teeth, eyelashes, and hair.
The skeleton hierarchy also contains level of detail; we remove joints as the characters move further away from the camera. Eventually only three joints remain, enough to rotate the head and open the mouth using the jaw.
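The idea of distance-based joint LOD can be sketched in a few lines. The thresholds and joint counts below are illustrative, not the actual values used in The Getaway:

```python
# Sketch of distance-based joint LOD. Joint names and distance thresholds
# are illustrative assumptions, not the production values.
FULL_RIG = ["neck", "head", "jaw"] + [f"face_{i}" for i in range(22)]
CORE_RIG = ["neck", "head", "jaw"]  # enough to turn the head and open the mouth

def joints_for_distance(distance):
    """Pick the joint set for a character at `distance` metres from the camera."""
    if distance < 5.0:
        return FULL_RIG          # full facial rig for close-ups
    if distance < 15.0:
        return FULL_RIG[:13]     # drop the subtler facial joints
    return CORE_RIG              # far away: only three joints remain

print(len(joints_for_distance(30.0)))  # 3
```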
Creating the Skeleton
The skeleton hierarchy was created based on the above study. Two main joints are used as the controls: the neck and the head. The "neck" is the base, the joint that is constrained to the skeleton of the character model. It can be driven either by constraints or by motion capture data copied across from the character model, giving us the point at which we have seamless interaction between the head and body. The "head" joint controls slight head movements: shaking and nodding, random head motions, and positions taken up in different expressions (the head leans forward in anger or drops when sad). This is the joint that all other joints spring from; it's used as the controlling joint, and wherever it goes, the rest of the joints follow. The other joints relate to specific muscle groups of the face:
- Six joints control the forehead and eyebrows.
- Three joints control each eye: one in each eyelid and one for the eye itself.
- Two joints, one on either side of the nose.
- Two joints control each cheek.
- Two joints on either side of the jaw.
- Three joints in the tongue.
- Four joints control the lips.
Front and side views of the facial animation system, showing the skeleton hierarchy.
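As a data structure, the hierarchy above is just a parent map rooted at the neck. A sketch with illustrative joint names (the real rig's naming is not documented here):

```python
# Illustrative parent map for the facial skeleton described above.
# Every facial joint ultimately hangs off the "head" joint, which
# hangs off "neck" -- wherever the head goes, the rest follow.
HIERARCHY = {
    "head": "neck",
    "jaw": "head",
    "tongue_1": "jaw", "tongue_2": "tongue_1", "tongue_3": "tongue_2",
    "eye_l": "head", "eyelid_upper_l": "head",
    "brow_l_inner": "head",
    "lip_upper": "head", "lip_lower": "jaw",
}

def chain_to_root(joint):
    """Walk up the hierarchy from a joint to the root."""
    chain = [joint]
    while chain[-1] in HIERARCHY:
        chain.append(HIERARCHY[chain[-1]])
    return chain

print(chain_to_root("tongue_3"))
# ['tongue_3', 'tongue_2', 'tongue_1', 'jaw', 'head', 'neck']
```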
The idea behind this mass of joints is that they simulate certain muscle groups. The muscles of the face are attached to the skull at one end; the other end is attached straight to the flesh or to another muscle group. This is different from muscles in the body, which are always attached to a bone at both ends. Since a muscle simply contracts, it should be a simple case of animating the scales of our joints to simulate these contractions. Unfortunately this is not the case, as there are actually hundreds of muscles which all interact. To achieve realistic expressions we had to rotate, scale, and translate the joints.
How do you go about assigning an arbitrary head model to this skeleton? The original skinning of the character took two whole days of meticulous weighting, using Maya and its paint weights tool to achieve this.
I didn't wish to do this for every head. Joe Kilner, a programmer here at SCEE who was writing the animation system with me, came up with a MEL script (Maya Embedded Language) that would copy weights from one model to another. The script basically saved out the weights of the vertices using two guidelines: the vertex's normal direction and UV coordinates. This enabled us to export weights from one head and import them onto another.
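The matching idea can be sketched in a few lines of Python. This illustrates the principle only: it matches by closest UV coordinate and leaves out the normal-direction check that the real MEL script also used.

```python
# Sketch of UV-based weight transfer between two heads that share a
# texture layout. Simplified assumption: nearest UV alone is enough;
# the production script also used vertex normals to disambiguate.

def transfer_weights(source_verts, target_verts):
    """Each vert is a dict with 'uv'; source verts also carry 'weights'."""
    def uv_dist2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    for tv in target_verts:
        nearest = min(source_verts, key=lambda sv: uv_dist2(sv["uv"], tv["uv"]))
        tv["weights"] = dict(nearest["weights"])  # copy joint -> weight map
    return target_verts

src = [{"uv": (0.1, 0.1), "weights": {"jaw": 1.0}},
       {"uv": (0.9, 0.9), "weights": {"brow": 1.0}}]
dst = [{"uv": (0.12, 0.08)}]
print(transfer_weights(src, dst)[0]["weights"])  # {'jaw': 1.0}
```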
For this to work, we had to make sure that all of our head textures conform to a particular fixed template. The added bonus of this is that then we can apply any texture to any head. The template also made it easier to create our face textures.
Emotions and the Face
Research has shown that people recognize six universal emotions: sadness, anger, joy, fear, disgust, and surprise. Other expressions are more ambiguous: if you mix the six together, people offer differing opinions on what the result suggests. Physical states such as pain, sleepiness, passion, and physical exertion also tend to be harder to recognize. So if you wish to make sure that the emotion you are trying to portray is recognized, you must rely on the overall attitude or animation of the character. Shyness, for example, is created with a slight smile and downcast eyes, but this could be misinterpreted as embarrassment or self-satisfaction.
Emotions are closely linked to each other. Worry is a less intense form of fear, disdain is a mild version of disgust, and sternness is a mild version of anger. Basically blending the six universal emotions or using lesser versions of the full emotions gives us all the nuances of the human face.
A typical face texture in The Getaway.
Emotions and the System
Creating the emotions on your base skeleton is the next step. Which emotions should the system incorporate? We use the six universal emotions, some physical emotions, a phoneme set, and a whole load of facial and head movements. The system inside Maya runs off the back of three locators, each of which controls a different set of Set Driven Keys. A locator in Maya is a null object to which attributes can be added.
The first locator controls expressions. Each of the following is an attribute on the locator: sadness, anger, joy, fear, disgust, surprise, shock, perplexed, asleep, pain, exertion, and shout. Each attribute has a value which ranges from 0 to 10.
The skeleton is set to a neutral pose which is keyed at zero on all the emotion attributes. Then the joints are scaled, rotated, and translated into an expression, for example, "sad." Using Maya's Set Driven Key, this position is keyed onto a value of 5 on the sadness attribute. Then at a value of 10, "crying open mouthed" is keyed, giving us a full emotional range for sadness. Now the face is set up so that Maya can blend from a "neutral" pose to one of "sad" and then continue on to "crying."
Sadness attribute keyed at a value of 0, 5, and 10.
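A Set Driven Key is, in effect, a piecewise-linear curve mapping the driver attribute to driven joint values. A sketch, with made-up brow rotations standing in for the neutral, sad, and crying poses:

```python
# Sketch of Set Driven Key evaluation as piecewise-linear interpolation.
# The brow rotation values are invented for illustration.

def driven_value(keys, driver):
    """keys: sorted list of (driver_value, driven_value) pairs."""
    if driver <= keys[0][0]:
        return keys[0][1]
    for (x0, y0), (x1, y1) in zip(keys, keys[1:]):
        if driver <= x1:
            t = (driver - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    return keys[-1][1]  # clamp past the last key

# brow rotation in degrees: neutral 0, sad -8, crying -14 (made-up values)
brow_keys = [(0, 0.0), (5, -8.0), (10, -14.0)]
print(driven_value(brow_keys, 2.5))  # -4.0: halfway between neutral and sad
```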
For each emotion attribute, several different keys are assigned as above. This gives the character a full range of human emotions. These emotion attributes can then be mixed together to achieve subtle effects.
A mixture of joy and sadness produces a sad smile, while anger and joy produce a wicked grin. The process is additive, which means that mixing emotions over certain values starts to pull the face apart. A good rule of thumb is never to let the total of the attributes exceed the maximum attribute value. As we have keyed ours between 0 and 10, we try never to exceed 10. If you mix three emotion attributes together and they have equal values then each cannot exceed 3.3. There are attributes that can be mixed at greater levels, but trial and error is a great way of finding out which you can mix and which you can't.
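That rule of thumb can be enforced automatically by scaling a mix down whenever its total exceeds the maximum attribute value. A sketch; the uniform-scaling strategy is my own illustration, not how Maya or Talking Heads handles it:

```python
# Sketch: keep the total of mixed emotion attributes at or below the
# maximum attribute value (10 in our setup) so the face isn't pulled apart.

MAX_TOTAL = 10.0

def clamp_mix(emotions):
    """emotions: dict of attribute -> value. Scale down if the total exceeds 10."""
    total = sum(emotions.values())
    if total <= MAX_TOTAL:
        return dict(emotions)
    scale = MAX_TOTAL / total
    return {name: value * scale for name, value in emotions.items()}

mix = clamp_mix({"joy": 6.0, "anger": 6.0})  # wicked grin, but 12 > 10
print(mix)  # {'joy': 5.0, 'anger': 5.0}
```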
Phonemes and Visemes
"A phoneme is the smallest part of a grammatical system that distinguishes one utterance from another in a language or dialect." -- Bill Fleming and Darris Dobbs, Animating Facial Features and Expressions
Basically, a phoneme is the sound we hear in speech. Combining phonemes, rather than letters, creates words. The word "foot" would be represented by "f-uh-t."
Visual phonemes (visemes) are the mouth shapes and tongue positions that you create to make a phoneme sound during speech. The common myth is that there are only nine visual phonemes. You can create wonderful animation from just these nine; however, there are in fact 16 visual phonemes. Although some may look very similar externally, the tongue changes position.
Our second locator controls the phonemes. They are assigned in exactly the same way as the emotion attributes. An exaggerated form of each phoneme is keyed at 10. When creating the lip-synching we generally only use values up to 3.
The phoneme set shown is Anglo-American. This can be replaced with any phoneme set from around the world. You can conceivably make your character talk in any language you wish.
Two rules for the use of visual phonemes:
- Don't animate behind synch. Do not try to animate behind the dialogue; in fact, it's better to animate your phonemes one or two frames in front of the dialogue. Before you can utter a sound you must first make the correct mouth shape.
- Don't exaggerate. The actual range of movement while talking is fairly limited. Study your own mouth movements.
Talking Heads tries to simulate realistic facial movements, and "less is more" is true for all parts of the system. The mouth doesn't open much at all while talking, so don't make your visual phonemes exaggerated.
The third locator controls aspects of the face that are so natural that we don't even think about them. These attributes are essential if you want to achieve realistic facial animation.
Blinking. A human blinks once every four seconds. This timing can change according to what emotional state the character is in. If anger is your dominant attribute, then the blink rate should drop to once every six seconds. The reason behind this is physical: the eyes open wide in anger, achieving a glare. If the character is acting nervous, then the blink rate increases to once every two seconds. This reaction is involuntary. Blinking not only brings realism to your characters, it also emphasizes a particular emotion or mood.
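A sketch of how blink keys might be scheduled from the rates above; the emotion names and scheduling logic are illustrative:

```python
# Sketch of emotion-dependent blink scheduling. Intervals follow the
# rates described above: four seconds normally, six when angry, two
# when nervous. Emotion names are illustrative.

BLINK_INTERVAL = {"neutral": 4.0, "anger": 6.0, "nervous": 2.0}

def blink_times(duration, emotion="neutral"):
    """Return the times (in seconds) at which blink keys should be laid down."""
    interval = BLINK_INTERVAL.get(emotion, 4.0)
    t, times = interval, []
    while t <= duration:
        times.append(t)
        t += interval
    return times

print(blink_times(12.0, "anger"))  # [6.0, 12.0]
```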
Facial shrug and raising eyebrows. These attributes are generally used when the character is silent, listening to a conversation, etc. The human face is never static, it's constantly moving. This movement can take many forms. Slight head movement, constant eye movement and blinking are excellent at keeping the character alive. Raising an eyebrow or performing a facial shrug can be used in conjunction with emotion attributes to add a little extra emphasis to the emotion.
Nodding and shaking the head. Whenever we encounter a positive or negative statement, we either nod in agreement or shake our head in disapproval. These are involuntary acts and the quickest ways to state your point of view without opening your mouth. Note that the neutral position of these two attributes is set at 5. This allows the head to move in four separate directions, up, down, left, and right.
Random head motion. We realized very quickly when animating our heads that when you talk you are constantly moving your head. The random head attribute simulates this slight movement.
Breath. The breathing attribute is set at several different positions. It can simulate slight breathing to full gasps.
The Fourth Locator
There is one final locator that I haven't yet mentioned. This locator is called the "look at" and controls what the character is seeing. The joints that control the eyes are constrained using aim constraints in Maya. This forces the joints to always track/point at the "look at" locator. You can then use the locator to control the character's point of view. You can animate this locator and enable your character to glance away during a conversation. The angles of the eye joints are linked via an expression with the head joint. If the eyes are forced to rotate more than 20 degrees to follow the "look at" locator, the head rotates to compensate. We found this to be very realistic, mimicking the movement of the head (Figure 13).
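The eye/head coupling can be sketched like this. The sketch works in a single yaw axis for clarity; the 20-degree limit comes from the system described above, while everything else is illustrative:

```python
# Sketch of the "look at" coupling: the eyes aim at a target, and if that
# would force them past 20 degrees, the head rotates to absorb the excess.
# Single-axis (yaw only) simplification; the maths is illustrative.
import math

EYE_LIMIT = 20.0  # degrees the eyes may rotate before the head helps

def aim(head_yaw, target_x, target_z):
    """Return (eye_yaw, head_yaw) needed to look at a point in head space."""
    target_yaw = math.degrees(math.atan2(target_x, target_z))
    eye_yaw = target_yaw - head_yaw
    if abs(eye_yaw) > EYE_LIMIT:
        # the head turns to take up the slack; the eyes stay at their limit
        overshoot = eye_yaw - math.copysign(EYE_LIMIT, eye_yaw)
        head_yaw += overshoot
        eye_yaw = math.copysign(EYE_LIMIT, eye_yaw)
    return eye_yaw, head_yaw

eye, head = aim(0.0, 1.0, 1.0)  # target 45 degrees to the right
print(round(eye, 1), round(head, 1))  # 20.0 25.0
```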
Tips and Tricks
Here are a few additional pointers for animators when animating facial expressions.
You must have two frames to be able to read it! When you are laying down keyframes for your lip-synching, always make sure that the consonants last for a minimum of two frames at 24 fps. Obviously, if you are running at 60 fps on PS2, then triple this. Any phoneme that is a consonant, such as p, b, m, f, or t, must be keyed in this way. This rule cannot be broken; the mouth must be in a closed state for the two frames. If you don't make sure of this, then you will not be able to read what the character is saying. If you have no time to fit this in, steal from the previous word.
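A sketch of enforcing this rule on a phoneme track; the data layout and the frame-stealing strategy are my own illustration:

```python
# Sketch: extend any consonant key shorter than the minimum hold,
# stealing frames from the previous phoneme when there is no room.
# Frame counts assume 24 fps; double or triple the minimum at higher rates.

CONSONANTS = {"p", "b", "m", "f", "t"}
MIN_FRAMES = 2

def enforce_consonants(track):
    """track: list of [phoneme, frame_count]; mutated in place and returned."""
    for i, entry in enumerate(track):
        phoneme, frames = entry
        if phoneme in CONSONANTS and frames < MIN_FRAMES:
            needed = MIN_FRAMES - frames
            if i > 0:
                track[i - 1][1] = max(1, track[i - 1][1] - needed)  # steal
            entry[1] = MIN_FRAMES
    return track

print(enforce_consonants([["f", 1], ["uh", 4], ["t", 1]]))
# [['f', 2], ['uh', 3], ['t', 2]]
```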
Make sure your animation is ahead of your timeline. The easy way to do this is to animate to your sound file. When you are happy with your animation and lip-synching, move the sound forward in the timeline and make sure that the animation starts one to two frames before the sound. You cannot utter a peep unless you have made the correct mouth shape. This will improve your lip-synching.
Subtlety is king. I cannot stress too much how important this is. The urge once you have created your system is to go mad. The human face is a subtle machine, keep your movements to a minimum and your animations will look much more realistic.
Move the eyes. If you want to keep your character alive keep the eyes moving. When we are talking to someone we spend 80 percent of our time tracking their eyes and mouth and 20 percent glancing at their hands and body.
Head synch is almost as important as lip-synch. Every word and pause should have a separate head pose. We use random head motion to achieve this. Some words need accenting or emphasizing. Listen to your sound file and pick out the words that are stressed; these are the ones to which you should add extra head movement.
We have talked about the basics of facial animation, why we chose a skeleton-based system, and how we put this into practice. The next step is to explain exactly how Talking Heads works.
As I've mentioned before, the point of a system like this is to reduce the workload and demands on a small group of animators working on a large project. The only way that this can happen is to hand over some of the more tedious tasks of facial animation to the computer.
Our facial animation system works on three levels: the first is concentrated around achieving believable lip-synching, the second around laying down blocks of emotions, and the third on underlying secondary animation such as blinking or breathing.
Lip-synching. The first step is to record an uncompressed 44kHz .WAV file of the chosen actor and script. A good point to mention here is that your script should contain a series of natural pauses. A good actor or voice-over artist should give you this automatically. Remember, you want the best performance you can get. The sound file contains all the hints you will need to animate emotions and will carry your animation. The pauses aid the system, allowing it to work out where it is in the .WAV file when it calculates the phonemes.
We then create a text file, which is an exact script of the .WAV file. During the creation of the phonemes, the text file is matched against a phoneme dictionary. There are many such dictionaries on the web; it's just a matter of finding a free one (see For More Information). The dictionary contains a huge list of words and their phoneme equivalents. By checking the script against this dictionary, the system determines the phonemes required to make the words. Some obscure words are not covered, and we enter these into our dictionary by hand.
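The dictionary lookup itself is straightforward. A sketch with a toy two-word dictionary; a real one, such as the freely available CMU Pronouncing Dictionary, covers tens of thousands of words:

```python
# Sketch of the dictionary step: map each script word to its phonemes,
# collecting unknown words so they can be added to the dictionary by hand.
# The two-entry dictionary here is purely illustrative.

DICTIONARY = {
    "foot": ["f", "uh", "t"],
    "the": ["dh", "ah"],
}

def words_to_phonemes(script):
    phonemes, unknown = [], []
    for word in script.lower().split():
        if word in DICTIONARY:
            phonemes.extend(DICTIONARY[word])
        else:
            unknown.append(word)  # enter these by hand later
    return phonemes, unknown

print(words_to_phonemes("The foot"))  # (['dh', 'ah', 'f', 'uh', 't'], [])
```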
Most of the development time of Talking Heads was taken up working out how to parse the .WAV file. This is all custom software which enables us to scan through our sound file and work out the timings between the words. We also work out the timing between phonemes, which is very important.
Talking Heads then lays down keyframes for the phonemes in Maya. It does this by taking the information from the dictionary and the .WAV file and matching them, phoneme against length of time. As mentioned before these keys are assigned to the locator that controls the phonemes. This allows for easy editing of the phonemes at a later stage by an animator, or the creation of a complete new phoneme animation if the producer decides that he wants to change the script. So a one-minute animation that could take a week to animate by hand can be created in half an hour. Then the animator is free to refine and polish as he sees fit.
One advantage to the system is the creation of language SKUs. We produce products for a global market, and there is nothing more frustrating than re-doing tedious lip-synching for each country. Talking Heads gets around this problem quite efficiently. You have to create a phoneme set for each language and find a corresponding phoneme dictionary, but once you have done this the system works in exactly the same way as before. You can lay down animations in English, French, German, Japanese, or whatever language you wish.
Emotions. The next step is to add blocks of emotion. To do this we edit the text file that we created from the .WAV file. A simple markup language is used to define various emotions throughout the script.
As you can see, emotions are added and given values. These values correspond with those on the emotion locator. An Anger value of 2.2 gives the character a slight sneer, and by the end of this sentence the character would smirk. In this way, huge amounts of characterization can be added. We video our actors at the time we record the sound, either in the sound studio or the motion capture studio. We can then play back the video recording of the scene we are editing and lay down broad emotions using the actor's face as a guideline.
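A sketch of a markup pass in this spirit. The tag syntax and parsing below are invented for illustration; the actual markup format used in Talking Heads is not shown here:

```python
# Sketch of parsing emotion markup out of a script. The <Emotion=value>
# tag syntax is an assumption made for this example.
import re

TAG = re.compile(r"<(\w+)=([\d.]+)>")

def parse_emotions(marked_script):
    """Return (clean_text, [(word_index, emotion, value), ...])."""
    events, words = [], []
    for token in marked_script.split():
        m = TAG.fullmatch(token)
        if m:
            events.append((len(words), m.group(1).lower(), float(m.group(2))))
        else:
            words.append(token)
    return " ".join(words), events

text, events = parse_emotions("<Anger=2.2> You never listen <Joy=4.0> do you")
print(events)  # [(0, 'anger', 2.2), (3, 'joy', 4.0)]
```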
The advantage of editing a text file is that anyone can do it. You do not have to be an animator or understand how a complicated software package works. As long as the person who is editing knows what the different emotion values look like, they can edit any script. Using the video of the actor's face allows anyone to see which emotions should be placed where and when.
Later on, an animator can take the scene that has been set up using the script and go in and make changes where necessary. This allows our animators to concentrate their talents on more detailed facial animation, adding subtlety and characterization by editing the sliders in the animation system and laying keys down by hand.
Specials. The third area to be covered by the Talking Heads system concentrates on a wide range of subtle human movements. These are the keys to bringing your character to life. Talking Heads takes the text file and creates emotions from the markup language as it matches phonemes and timings. It also sets about laying down a series of secondary animations and keying these to the third locator. As mentioned before, this locator deals with blinking, random head motion, nodding and shaking of the head, breathing, and so on.
Blinking is controlled by the emotion that is set in the text file. If the character has anger set using the markup language, then the system only sets blinking keyframes once every six seconds: when angry, the face takes on a scowl, the eyes open wide, and blinking is reduced to show as much of the whites of the eyes as possible. The system stores a blink interval for each emotion and uses the emotion with the highest value as the prime emotion for blinking. A slight randomness is also added, which will occasionally key in a double blink. The normal blinking rate is once every four seconds; if the character is lying or acting suspiciously, this rate increases to once every two seconds.
Random head motion is keyed only when keyframes are present for phonemes. This means that the character always moves his head when he is speaking. This is a subtle effect; be careful with the movement, as a little goes a long way. The next pass looks for positive and negative statements. It tracks certain words such as "yes, no, agree, disagree, sure, certainly, never." When it finds such words, it sets keyframes for nodding and shaking of the head. Using the timing from the script, it applies a set of decreasing values on the nod and shake head Set Driven Keys. This gives us very realistic motion.
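The decreasing-value pass can be sketched as follows. The trigger words come from the text above; the amplitudes, decay rate, and timing are illustrative:

```python
# Sketch of the positive/negative pass: scan for trigger words and emit
# nod or shake keys with decreasing amplitude around the neutral value
# of 5 on the nod/shake attributes. Decay and timing values are invented.

POSITIVE = {"yes", "agree", "sure", "certainly"}
NEGATIVE = {"no", "disagree", "never"}

def head_keys(word, start_time):
    """Return (time, value) keys oscillating around neutral 5, dying away."""
    if word in POSITIVE or word in NEGATIVE:
        keys, amplitude, t = [], 3.0, start_time
        sign = 1.0
        while amplitude > 0.5:
            keys.append((round(t, 2), 5.0 + sign * amplitude))
            sign, amplitude, t = -sign, amplitude * 0.5, t + 0.15
        keys.append((round(t, 2), 5.0))  # settle back to neutral
        return keys
    return []

print(head_keys("never", 1.0))
# [(1.0, 8.0), (1.15, 3.5), (1.3, 5.75), (1.45, 5.0)]
```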
Breathing is automatic; the system keys values when it reaches the end of a sentence. This value can differ depending on the physical state of the character. Normal values are hardly detectable, while extreme values mimic gasping for breath.
At this stage the system also creates keys for random eye motion. This keeps the character alive at all times. If your character stops moving at any point, the illusion of life is broken.
Set up and ready to go. Once everything has run through Talking Heads, we have a fully animating human head. At this stage an animator has not even overseen the process. Our character blinks, breathes, moves, talks, and expresses a full range of human emotion.
At this point we schedule our animators onto certain scenes and they make subtle changes to improve the overall animation, making sure that the character is reacting to what other characters are saying and doing.
More Refined in Less Time
The process of creating Talking Heads has been a long nine months, and changes are still being made. We continue to tinker with and evolve the system to achieve the most believable facial animation seen in a computer game. Whether we have done this successfully will only be seen when The Getaway is eventually released.
The next step is to incorporate Talking Heads into real-time. This would allow our in-game NPCs to react to whatever the player does. This is already in motion and we hope to see this happening in The Getaway.
Facial animation can be achieved without huge animation teams. The process of creating Talking Heads has been an extremely worthwhile experience. We are now able to turn out excellent animations in very short times. Our team of animators is free to embellish facial animation, adding real character and concentrating their efforts on creating the huge amount of animation required for in-game and cutscenes.
Gavin Moore has worked in the games industry for 10 years. He is currently the senior animator on The Getaway at Sony Computer Entertainment Europe's Team Soho. He is in charge of a team of artists and animators responsible for all aspects of character creation and animation in the game. Gavin can be reached at [email protected]
For More Information
Faigin, Gary. The Artist's Complete Guide to Facial Expression. New York: Watson-Guptill, 1990.
Fleming, Bill, and Darris Dobbs. Animating Facial Features and Expressions. Rockland, Mass.: Charles River Media, 1999.
Parke, Frederic I., and Keith Waters. Computer Facial Animation. Wellesley, Mass.: A. K. Peters, 1996.