Anyone who has ever been in a professional production situation realizes that
real-world coding these days requires a broad area of expertise. When
this expertise is lacking, developers need to be humble enough to look
things up and turn to people around them who are more experienced in
that particular area.
As I continue to explore areas of graphics technology, I have attempted to document the research and resources I have used in creating projects for my company. My research demands change from month to month depending on what is needed at the time. This month, I have the need to develop some facial animation techniques, particularly lip sync. This means I need to shelve my physics research for a bit and get some other work done. I hope to get back to moments of inertia, and such, real soon.
And Now for Something Completely Different
My problem right now is facial animation. In particular, I need to know enough in order to create a production pathway and technology to display real-time lip sync. My first step when trying to develop new technology is to take a historic look at the problem and examine previous solutions. The first people I could think of who had explored facial animation in depth were the animators who created cartoons and feature animation in the early days of Disney and Max Fleischer.
Facial animation in games has built on this tradition. Chiefly, this has been achieved through cut-scene movies animated using many of the same methods. Games like Full Throttle and The Curse of Monkey Island used facial animation for their 2D cartoon characters in the same way that the Disney animators would have. More recently, games have begun to include some facial animation in real-time 3D projects. Tomb Raider has had scenes in which the 3D characters pantomime the dialog, but the face is not actually animated. Grim Fandango uses texture animation and mesh animation for a basic level of facial animation. Even console titles like Banjo Kazooie are experimenting with real-time “lip-flap” without having a dialog track at all. How do I leverage this tradition into my own project?
Phonemes and Visemes
No discussion of facial animation is possible without discussing phonemes. Jake Rodgers’s article “Animating Facial Expressions” (Game Developer, November 1998) defined a phoneme as an abstract unit of the phonetic system of a language that corresponds to a set of similar speech sounds. More simply, phonemes are the individual sounds that make up speech. A naive facial animation system may attempt to create a separate facial position for each phoneme. However, in English (at least where I speak it) there are about 35 phonemes. Other regional dialects may add more.
Now, that’s a lot of facial positions to create and keep organized. Luckily, the Disney animators realized a long time ago that using all phonemes was overkill. When creating animation, an artist is not concerned with individual sounds, just how the mouth looks while making them. Fewer facial positions are necessary to visually represent speech since several sounds can be made with the same mouth position. These visual references to groups of phonemes are called visemes. How do I know which phonemes to combine into one viseme? Disney animators relied on a chart of 12 archetypal mouth positions to represent speech as you can see in Figure 1.
Figure 1. The 12 classic Disney mouth positions.
Each mouth position or viseme represented one or more phonemes. This reference chart became a standard method of creating animation. As a game developer, however, I am concerned with the number of positions I need to support. What if my game only has room for eight visemes? What if I could support 15 visemes? Would it look better?
Throughout my career, I have seen many facial animation guidelines with different numbers of visemes and different organizations of phonemes. They all seem similar to the Disney 12, but they also seem to have been derived from animators talking to a mirror and doing some guessing.
I wanted to establish a method that would be optimal for whatever number of visemes I wanted to support. Along with the animator’s eye for mouth positions, there are more scientific models that reduce sounds into visual components. For the deaf community, which does not hear phonemes, spoken language recognition relies entirely on lip reading. Lip-reading methods base speech recognition on 18 speech postures. Some of these mouth postures show very subtle differences that a hearing individual may not see.
So, the Disney 12 and the lip reading 18 are a good place to start. However, making sense of the organization of these lists requires a look at what is physically going on when we speak. I am fortunate to have a linguist right in the office. It’s times like this when it helps to know people in all sorts of fields, no matter how obscure.
The field of linguistics, specifically phonetics, compares phonemes according to their actual physical attributes. The grouping does not really concentrate on the visual aspects, as sounds rely on things going on in the throat and in the mouth, as well as on the lips. But, perhaps this can help me organize the phonemes a bit.
Sounds can be categorized according to voicing, manner of articulation (airflow), and the places of articulation. There are more, but these will get the job done. As speakers of English, we automatically create sounds correctly without thinking about what is going on inside the mouth. Yet, when we see a bad animation, we know it doesn’t look quite right although we may not know why. With the information below, you will be equipped to know why things look wrong. Now for some group participation. This is an interactive article. Go on, no one is looking. The categories we want to examine are:
Voiced vs. Voiceless. Put your hand on your throat and say something. You can feel an intermittent vibration. Now say, “p-at, b-at, p-at, b-at,” (emphasizing the initial consonant). In some sounds the vocal cords are vibrating together (b - voiced) and in some the vocal cords are apart (p - voiceless). Looking at the face, however, there is no visual difference between voiced and voiceless sounds. This is an automatic no-brainer as far as reducing sounds into one viseme: any pair of sounds that differs only in voicing can be reduced to the same viseme. In English, that eliminates eight phonemes.
Nasal vs. Oral. Put your fingers on your nose. Slowly say “momentary.” You can feel your nose vibrating when you are saying the “m.” Some sounds are said through the nasal cavity, but most are said through the oral cavity. These are also not visibly different. So again, we have an automatic reduction in phonemes. All three nasal sounds in English can be included in the oral viseme counterpart.
Manners of Speech. Sounds can also be differentiated by the amount of opening through the oral tract. These also do not offer a visible clue, but are very important for categorizing phonemes. Sounds that have complete closure of the airstream are called stops. Sounds that have a partially obstructed closure and turbulent airflow are called fricatives. A sound that combines a stop and a fricative is called an affricate. Sounds that have a narrowing of the vocal tract, but no turbulent airflow, are called approximants. And then there are sounds that have relatively no obstruction of the airflow; these are the vowels.
Figure 2. Side cut-out view of places of articulation.
Places of Articulation. This involves where the sound is being made in the mouth. This is where the visible differences occur. There are several places of articulation (see Figure 2) involving the lips, teeth, tongue, and stuff in the back of the mouth (the palate, velum, and glottis) for the consonants. Vowel placement is based on the relative height of the tongue and whether the tongue is more front or back in the mouth. A differentiating factor not listed in Chart 1 is lip rounding. This is not associated with any particular place of articulation and will be addressed below. Whew.
As I said, there are 35 phonemes in my dialect of American English. You may have more. Chart 1 is a summary of these phonemes. Read the chart from the front of the mouth to the back of the mouth. Try saying each of the words that illustrate the phoneme that is in bold. Have a look in the mirror and see what is going on, as well as feel what is going on inside your head. By using the distinctions of voicing and oral/nasal, we have already eliminated 11 phonemes. Let’s continue the reduction of phonemes into the usable visemes.
Take It to the Limit
According to the chart, there are three bilabials, which are sounds made with both lips. They are [b], [p], and [m]. According to Figures 3a, 3b, and 3c, they have different attributes inside the mouth. B and P only differ in that the B makes use of the vocal cords and P does not. The M sound is voiced like the B sound, but it is nasal. The cool thing about these sounds is that while there are differences inside the mouth, visually there is no difference. If you look in a mirror and say “buy,” “pie,” and “my” they all look identical. We have reduced three phonemes into one viseme.
Chart 1. American English phoneme summary chart.
While you’re working, remember that you are thinking with respect to sounds (phonemes), not letters. In many cases a phoneme is made up of multiple letters. So, if we go through Chart 1, we can continue to reduce the 35 phonemes into 13 visemes. For the most part, the visemes are categorized along the lines of the Places of Articulation (with the exception of [r]).
Take a look at the following listing of visemes. It describes the look of each phoneme in American English. The only phoneme not listed is [h]. “In English, ‘h’ acts like a consonant, but from an articulatory point of view it is simply the voiceless counterpart of the following vowel.” (Ladefoged, 1982:33-4). In other words, treat [h] like the vowel that comes after it.
1. [p, b, m] - Closed lips.
2. [w] & [boot] - Pursed lips.
3. [r*] & [book] - Rounded open lips with corner of lips slightly puckered. If you look at Chart 1, [r] is made in the same place in the mouth as the sounds of #7 below. One of the attributes not denoted in the chart is lip rounding. If [r] is at the beginning of a word, then it fits here. Try saying “right” vs. “car.”
4. [v] & [f] - Lower lip drawn up to upper teeth.
5. [thy] & [thigh] - Tongue between teeth, no gaps on sides.
6. [l] - Tip of tongue behind open teeth, gaps on sides.
7. [d,t,z,s,r*,n] - Relaxed mouth with mostly closed teeth with pinkness of tongue behind teeth (tip of tongue on ridge behind upper teeth).
8. [vision, shy, jive, chime] - Slightly open mouth with mostly closed teeth and corners of lips slightly tightened.
9. [y, g, k, hang, uh-oh] - Slightly open mouth with mostly closed teeth.
10. [beat, bit] - Wide, slightly open mouth.
11. [bait, bet, but] - Neutral mouth with slightly parted teeth and slightly dropped jaw.
12. [boat] - Very round lips, slightly dropped jaw.
13. [bat, bought] - Open mouth with very dropped jaw.
To see how helpful this information can be when animating a face, take a word like “hack.” It has four letters, three phonemes, and only two visemes (13 and 9 in the listing).
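The listing amounts to a lookup table from phonemes to visemes. Here is a minimal sketch of one; the ARPAbet-style phoneme symbols, the dictionary name, and the function are my own illustrative choices, but the groupings follow the 13-viseme listing above, including the rule of treating [h] as the vowel that follows it.

```python
# Map each phoneme to one of the 13 visemes in the listing.
# Symbols are illustrative ARPAbet-style labels, not from the article.
PHONEME_TO_VISEME = {
    "p": 1, "b": 1, "m": 1,                   # closed lips
    "w": 2, "uw": 2,                          # pursed lips (boot)
    "r": 3, "uh": 3,                          # rounded open lips (book); word-initial [r]
    "v": 4, "f": 4,                           # lower lip to upper teeth
    "dh": 5, "th": 5,                         # tongue between teeth (thy, thigh)
    "l": 6,                                   # tip of tongue behind open teeth
    "d": 7, "t": 7, "z": 7, "s": 7, "n": 7,   # relaxed mouth, mostly closed teeth
    "zh": 8, "sh": 8, "jh": 8, "ch": 8,       # vision, shy, jive, chime
    "y": 9, "g": 9, "k": 9, "ng": 9,          # slightly open, mostly closed teeth
    "iy": 10, "ih": 10,                       # wide, slightly open (beat, bit)
    "ey": 11, "eh": 11, "ah": 11,             # neutral mouth (bait, bet, but)
    "ow": 12,                                 # very round lips (boat)
    "ae": 13, "ao": 13,                       # open mouth, dropped jaw (bat, bought)
}

def visemes_for(phonemes):
    """Collapse a phoneme sequence into the viseme sequence to display."""
    out = []
    for i, p in enumerate(phonemes):
        if p == "h":                # [h] takes the viseme of the vowel after it
            p = phonemes[i + 1]     # (safe: [h] never ends an English word)
        out.append(PHONEME_TO_VISEME[p])
    return out
```

With this table, “hack” ([h], [ae], [k]) yields visemes 13, 13, 9 - three phonemes, but only the two distinct mouth positions noted above.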
Say that you don’t have enough space to include 13 visemes and whatever emotions you want expressed. Well, by using Chart 1 and the list of visemes in the listing, you can make logical decisions about where to cut. For example, if you only have room for 12 visemes, you can combine visemes 5 and 6 or 6 and 7 in the listing. For 11 visemes, continue combining visemes by incorporating visemes 7 and 9. For 10, combine visemes 2 and 3. For 9, combine 8 with the new viseme 7/9. For 8, combine 11 and 13.
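Those successive merges can be captured in a small remapping routine. This is only a sketch of one way to encode them (the names are mine, and I have taken the 5-into-6 option for the first cut):

```python
# Each step folds one viseme into another, taking the budget 13 -> 12 -> 11
# -> 10 -> 9 -> 8.  Merging 8 into 9 also joins it with 7, which was folded
# into 9 two steps earlier - giving the combined 7/8/9 group.
MERGE_STEPS = [(5, 6), (7, 9), (2, 3), (8, 9), (11, 13)]

def build_remap(budget):
    """Return a dict taking each of the 13 full visemes to one of `budget`
    surviving visemes (8 <= budget <= 13)."""
    remap = {v: v for v in range(1, 14)}
    for src, dst in MERGE_STEPS[: 13 - budget]:
        for v in remap:
            if remap[v] == src:      # fold src's whole group into dst's group
                remap[v] = remap[dst]
    return remap
```

A full budget returns the identity mapping; a budget of 8 leaves exactly eight distinct mouth positions, with visemes 7, 8, and 9 all drawn the same way.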
If I were really pressed for space, I could keep combining and drop this list down further. Most drastic would be three frames (Open, Closed, and Pursed as in boot) or even a simple two frames of lip flap open and closed. In this case you would just alternate between opened and closed once in a while. But that isn’t very fun or realistic, is it?
These viseme descriptions are enough to realistically represent speech. However, the use of individual visemes is more an artistic judgment than a hard rule. When speaking, people tend to slur phonemes together. They do not clearly articulate each phoneme all the time. Also, the look of a viseme can change depending on the visemes that surround it. For example, the Disney guidelines describe the use of a slightly different viseme for B, P, and M if they precede the “ea” sound, as in “beat.”
This dependency on surrounding sounds is called co-articulation and makes viseme choice more complicated. This is one reason that the automatic phoneme recognition software in some packages doesn’t always provide realistic results. Smooth interpolation between viseme keyframes can help, but this alone may not be good enough. In many cases, it requires an artistic judgment about which viseme really looks best. In computer animation, realistic looks are all that matter. So, when you work, put in the viseme that looks best.
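At its simplest, interpolation between viseme keyframes is just a crossfade of morph-target weights: the outgoing mouth position fades out as the incoming one fades in. A minimal sketch, with names of my own choosing (ease curves and real co-articulation rules would layer on top of this):

```python
def blend_weights(t, key_a, key_b, num_visemes=13):
    """Crossfade between two viseme keys.  At t=0 the result is all key_a,
    at t=1 all key_b.  Returns per-viseme weights for driving morph targets
    (index 0 unused; visemes are numbered 1..num_visemes)."""
    weights = [0.0] * (num_visemes + 1)
    weights[key_a] += 1.0 - t   # outgoing viseme fading out
    weights[key_b] += t         # incoming viseme fading in
    return weights
```

Because the two contributions always sum to one, the blended face never collapses or doubles up, even when the two keys are the same viseme.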
Emphasis and exaggeration are also very important in animation. You may wish to punch up a sound by the use of a viseme to punctuate the animation. This emphasis along with the addition of secondary animation to express emotion is key to a believable sequence.
In addition to these viseme frames, you will want to have a neutral frame that you can use for pauses. In fast speech, you may not want to add the neutral frame between all words, but in general it gives good visual cues to sentence boundaries.
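One way to act on that is a pass over the viseme track that splices in the neutral frame only where the silence between words is long enough to read as a pause. The threshold and all the names here are hypothetical, just to show the idea:

```python
NEUTRAL = 0  # an extra "rest" frame, outside the 13 speech visemes

def insert_neutral(word_visemes, gaps_after, min_pause=0.15):
    """word_visemes: one viseme list per word; gaps_after: silent gap in
    seconds following each word.  Splice in a neutral frame wherever the
    pause is long enough to be a visual sentence/phrase boundary."""
    track = []
    for visemes, gap in zip(word_visemes, gaps_after):
        track.extend(visemes)
        if gap >= min_pause:
            track.append(NEUTRAL)
    return track
```

Fast, run-together speech (tiny gaps) passes through untouched, while genuine pauses pick up the mouth-closed rest position.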
Side view of the sound [m], as in “my.”
Side view of the sound [b], as in “buy.”
Side view of the sound [p], as in “pie.”
So What Do I Do with This Stuff?
So far, I have been discussing issues that only seem important to the artists working on the facial animation. If the only use of facial animation in your project is for pre-rendered cut scenes, this may be true. However, I believe facial animation will become an important aspect in real-time 3D rendering as we take character simulation to the next level. This requires a close relationship between the art assets and engine features.
As a technical lead on a cutting-edge 3D project, you will be required to create the production pathway that the artists will use to create assets. You will be responsible for deciding how many visemes the engine can support and the manner in which the meshes must be created. Having a clear understanding of what goes into the creation of the assets will allow you to interface more effectively with those creating the assets.
However, even with the viseme count I am still not ready to set the artists loose creating my viseme frames. There are several basic engine decisions that I must make before modeling begins. Unfortunately, I will have to wait until the next column to dig into that. Until then, think back on my 3D morphing column (“Mighty Morphing Mesh Machine,” December 1998) as well as last year’s skeletal deformation column (“Skin Them Bones,” Graphic Content, May 1998) and see if you can get a jump on the rest of the class.
Special thanks go to my partner in crime, Margaret Pomeroy. She was able to explain to me what was really going on when I made all those funny faces in the mirror. When she was studying ancient languages in school I am sure she never imagined working on lip-synching character dialog.
For Further Info
• Culhane, Shamus. Animation from Script to Screen. New York: St. Martin’s Press, 1988.
• Ladefoged, Peter. A Course in Phonetics. San Diego: Harcourt Brace Jovanovich, 1982.
• Maestri, George. [digital] Character Animation. Indianapolis: New Riders Publishing, 1996.
• Parke, Frederic I. and Keith Waters. Computer Facial Animation. Wellesley: A. K. Peters, 1996.
Jeff Lander often sounds like he knows what he’s talking about. Actually, he’s just lip-synched to someone who really knows what’s going on. Let him know you are on to the scam at [email protected].