In my preceding article, "Read My Lips: Facial Animation Techniques," I left off with a nice short list of the visemes I would need to represent speech realistically. However, now I am left with the not insignificant problem of determining exactly how to display these visemes in a real-time application.
It may seem as if this is purely an art problem, better left to your art staff. Or, if you are a one-person development team, at least left to the creative side of your brain. However, your analytical side needs to inject itself in here a bit. This is one of those early production decisions you read about so much in the Postmortem column that can make or break your schedule and budget. Choose wisely and everything will work out great. Choose poorly and your art staff, or even your own brain, will throttle you.
For the final result, I want a 3D real-time character that can deliver various pieces of dialog in the most convincing manner possible. Thanks to the information learned last month, I know I can severely limit the amount of work I need to do. I know that with 13 visemes, or visual phoneme positions, I can reasonably represent most sounds I expect to encounter. I even have a nice mapping from American English to my set of visemes. Most other languages could probably be represented by these visemes as well, but could require a different mapping table.
From this information I can expect that if I can reasonably represent these 13 visemes with my character mesh, then continuous lip-synch should be possible. So the problem really comes down to how I construct and manipulate those meshes.
Certainly, the obvious method for creating these 13 visemes is to generate 13 versions of my character head mesh, one to represent each viseme. I can then use the morphing techniques I discussed in my column “Mighty Morphing Mesh Machine,” in the December 1998 issue of Game Developer to interpolate smoothly between different sounds.
Figure 1. The “l” viseme as seen at the start of the word “life.”
Modeling the face to match the visemes is pretty easy. Once the artist has the base mesh created, each viseme can be generated by deforming the mesh any way necessary to get the right target frame. As long as no vertices are added or deleted and the triangle topology remains the same, everything should work out great. Figure 1 shows an image of a character displaying the “L” viseme, as in the word “life.” The tongue is behind the top teeth, slightly cupped, leaving gaps at the side of the mouth, and the teeth are slightly parted.
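In code, the morphing itself is nothing exotic: each vertex of the displayed mesh is a linear blend between the base mesh and the viseme target. Here is a minimal sketch; the data layout and function name are my own, not from any particular engine.

```python
def lerp_mesh(base, target, t):
    """Blend every vertex from the base mesh toward a viseme target.

    base, target: parallel lists of (x, y, z) tuples with identical topology.
    t: blend amount, 0.0 = base frame, 1.0 = full viseme.
    """
    return [tuple(b + t * (v - b) for b, v in zip(bv, tv))
            for bv, tv in zip(base, target)]
```

Because the topology never changes, only vertex positions need to be interpolated; normals can be re-derived afterward or blended the same way.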
Sounds pretty good so far. Just create 13 morph targets for the visemes in addition to the base frame and you’re done. Life’s great, back to physics, right? Well, not quite yet.
Suppose in addition to simply lip-synching dialog, your characters must express some emotion. You want them to be able to say things sadly, or speak cheerfully. We need to add an emotional component to the system.
Adding Some Heart to the Story
At first glance, it may seem that you can simply add some additional morph targets for the base emotions. Most people describe six basic emotions. Here they are with some of their traits. (See Goldfinger under “For Further Info” for photo examples of the six emotions.)
1. Happiness: Mouth smiles open or closed, cheeks puff, eyes narrow.
2. Sadness: Mouth corners pull down, brows incline, upper eyelids droop.
3. Surprise: Brows raise up and arch, upper eyelids raise, jaw drops.
4. Fear: Brows raise and draw together, upper eyelids raise, lower eyelids tense upwards, jaw drops, mouth corners go out and down.
5. Anger: Inner brows pull together and down, upper eyelids raise, nostrils may flare, lips are closed tightly or open exposing teeth.
6. Disgust: Middle portion of upper lip pulls up exposing teeth, inner brows pull together and down, nose wrinkles.
There are variations on these emotions, such as contempt, pain, distress, and excitement, but you get the idea. Very distinct versions of these six will get the message across.
The key thing to notice about this list is that many of these emotions directly affect the same regions of the model as the visemes. If you simply layer these emotions on top of the existing viseme morph targets, you can get an additive effect. This can lead to ugly results.
Figure 2. A very surprised “L.”
For example, let me start with the “L” sound from before and blend in a surprised emotion at 100 percent. The “L” sound moves the tongue up to the top set of teeth and parts the mouth slightly. However, the surprise target drops the jaw even farther but leaves the tongue alone. This combination blends into the odd-looking character you see in Figure 2.
This problem really becomes apparent when the two meshes are actually fighting each other. For example, the “oo” viseme drives the lips into a tight, pursed shape while the surprise emotion drives the lips apart. Nothing pretty or realistic will come out of that combination.
When I ran into this issue a couple of years ago, the solution was tied to the weighting. By assigning a weight or priority to each morph target, I can compensate for these problems. I give the “oo” viseme priority over the surprise frame. This will suppress the effect that the surprise emotion has over shared vertices.
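One way to sketch that priority scheme in code (the scheme is from the text above; the data layout is my own invention): apply targets from highest priority down, and let each applied target use up the remaining influence on the vertices it touches, so lower-priority targets are suppressed where they overlap.

```python
def blend_with_priority(base, targets):
    """Blend morph targets so higher-priority ones suppress lower ones.

    base: list of (x, y, z) tuples.
    targets: list of (deltas, weight, priority), where deltas maps a
    vertex index to its (dx, dy, dz) offset from the base mesh.
    """
    ordered = sorted(targets, key=lambda t: -t[2])  # highest priority first
    out = [list(v) for v in base]
    remaining = [1.0] * len(base)   # influence left over at each vertex
    for deltas, weight, _priority in ordered:
        for i, d in deltas.items():
            w = weight * remaining[i]
            for axis in range(3):
                out[i][axis] += w * d[axis]
            remaining[i] *= (1.0 - weight)  # suppress later targets here
    return [tuple(v) for v in out]
```

With this, a fully weighted “oo” viseme leaves nothing behind for the surprise frame to move on the shared lip vertices.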
Taking It to Muscle Beach
Most of the academic research on facial animation has not approached the problem from a viseme basis. This is due to a fundamental drawback of the viseme-based approach: every source frame of animation is completely specified. While I can specify the amount each frame contributes to the final model, I cannot create new source models dynamically. Say, for example, I want to allow the character to raise one eyebrow. With the frames I have described so far, this would not be possible. To accomplish this goal, I would need to create individual morph targets with each eyebrow raised individually. Since a viseme can incorporate a combination of many facial actions, isolating these actions can lead to an explosive need for source meshes. You may find yourself breaking these targets into isolated regions of the face.
For this reason, researchers such as Frederic Parke and Keith Waters began examining how the face actually works biologically. By examining the muscle structure underneath the skin, a parametric representation of the face became possible. In fact, psychologists Paul Ekman and Wallace Friesen developed a system to determine emotional state based on the measurement of individual muscle groups as “action units.” Their system, called the Facial Action Coding System (FACS), describes 50 of these action units that can combine to create thousands of facial expressions. By creating a facial model that is controlled via these action units, Waters was able to simulate the effect that changes in the action units have on the skin.
Figure 3. The zygomaticus major muscle will put a smile on your face.
While I’m not sure if artists are ready to start creating parametric models controlled by virtual muscles, there are definitely some lessons to be learned here. With this system, it’s possible to describe any facial expression using these 50 parameters. It also completely avoids the additive morph problem I ran into with the viseme system. Once a muscle is completely contracted, it cannot contract any further. This limits the expression to ones that are at least physically possible.
Artist-Driven Muscle-Based Facial Animation
Animation tools are not really developed to a point where artists can place virtual muscles and attach them to a model. This would require a serious custom application that the artists may be reluctant even to use. However, that doesn’t mean that these methods are not available for game production. It just requires a different way of thinking about modeling.
For instance, let me take a look at creating a simple smile. Biologically, I smile by contracting the zygomaticus major muscle on each side of my face. This muscle connects the outside of the zygomatic bone to the corner of the mouth as shown in Figure 3. Contract one muscle and half a smile is born.
Figure 4. Pucker up: incisivus labii at work.
O.K. Mr. Science, what does that have to do with modeling? Well, this muscle contracts in a linear fashion. Take a neutral mouth and deform it as you would when the left zygomaticus major is contracted. This mesh can be used to create a delta table for all vertices that change. Repeat this process for all the muscles you wish to simulate and you have all the data you need to start making faces. You will find that you probably don’t need all 50 muscle groups described in the FACS system. Particularly if your model has a low polygon count, this will be overkill. The point is to create the muscle frames necessary to create all the visemes and emotions you will need, plus any additional flexibility you want. You will probably want to add some eye blinks, perhaps some eye shifts, and tongue movement to make the simulation more realistic.
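That delta-table idea reads directly into code. Assuming the same parallel-list mesh layout as a morphing system would use (my convention, not any particular package's), you record only the vertices a muscle actually moves, then scale those offsets by the contraction amount:

```python
def make_delta_table(neutral, contracted, eps=1e-6):
    """Record which vertices a muscle moves, and by how much.

    neutral, contracted: parallel lists of (x, y, z) tuples; the second
    is the artist's mesh deformed as if the muscle were fully contracted.
    Returns a dict of vertex index -> (dx, dy, dz).
    """
    table = {}
    for i, (n, c) in enumerate(zip(neutral, contracted)):
        delta = tuple(cc - nc for nc, cc in zip(n, c))
        if any(abs(component) > eps for component in delta):
            table[i] = delta
    return table

def apply_muscle(mesh, table, contraction):
    """Apply a muscle's delta table at a contraction amount in [0, 1]."""
    out = [list(v) for v in mesh]
    for i, delta in table.items():
        for axis in range(3):
            out[i][axis] += contraction * delta[axis]
    return [tuple(v) for v in out]
```

The table stays sparse, which is exactly what you want when a muscle such as the zygomaticus major only touches a small patch of the face.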
The FACS system is a scientifically-based general modeling system. It does not consider the individual features of a particular model. By allowing the modeler to deform the mesh for the muscles instead of using this algorithmic system, I am giving up general flexibility over a variety of meshes. However, I gain creative control by allowing for exaggeration as well as artistic judgement.
The downside is that it is now much harder to describe to the artists what it is you need. You need to purchase some sort of anatomy book (see my suggestions at the end of the column) and figure out exactly what you want to achieve. Your artists are going to resist. You had this nice list of 13 visemes and now you are creating more work. They don’t know what an incisivus labii is and don’t want to. You can explain that it is what makes Lara pucker up and they won’t care. You will have to win the staff over by showing the creative possibilities for character expression that are now available. They probably still won’t care, so get the producer to force them to do it. I have created a sample muscle set in Chart 1. This will give you some groups from which to pick.
Chart 1. The basic muscle groups involved in facial animation.
Now I need to relate these individual muscle meshes to the viseme and emotional states. This is accomplished with “muscle macros” that blend the percentages of the basic muscles to form complex expressions. This flexibility permits speech and emotion in any language without the need for special meshes.
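A muscle macro can be as simple as a table of contraction amounts. The clamping step matters: as noted earlier, a fully contracted muscle cannot contract any further, and that clamp is what keeps stacked viseme and emotion macros physically plausible. The muscle names below are illustrative.

```python
def expand_macro(macro, intensity=1.0):
    """Scale every muscle contraction in a macro by an overall intensity."""
    return {muscle: amount * intensity for muscle, amount in macro.items()}

def combine(*contraction_sets):
    """Merge expanded macros, clamping each muscle at full contraction."""
    out = {}
    for contractions in contraction_sets:
        for muscle, amount in contractions.items():
            out[muscle] = min(1.0, out.get(muscle, 0.0) + amount)
    return out

# A hypothetical macro: both zygomaticus major muscles make a smile.
SMILE = {"zygomaticus_major_l": 1.0, "zygomaticus_major_r": 1.0}
```

Layering a happy emotion under a viseme then just means combining two expanded macros; any shared muscle saturates instead of blowing past its limit the way raw additive morphs did.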
I still need to handle the case where several muscles interact with the same vertices. However, now there is a biological foundation to what you are doing.
Certain muscles counteract the actions of other muscles. For example, the muscles needed to create the “oo” viseme (the incisivus labii) will counter the effect of the jaw dropping (the digastric, for those of you playing along at home). One real-time animation package I have been working with, Geppetto from Quantumworks, calls these Muscle Relation Channels. You can create a simple mathematical expression between the two muscles to enforce this relationship. You can see this effect in Figure 5.
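Geppetto's internals aren't published, so this is only the shape of the idea: a relation channel is a small expression that lets one muscle's contraction scale back another's before the deltas are applied.

```python
def apply_relation(contractions, driver, target, survives):
    """Sketch of a muscle relation channel: the driver suppresses the target.

    contractions: dict of muscle name -> contraction amount in [0, 1].
    survives(d): fraction of the target's contraction that remains when
    the driver is contracted by d (e.g. lambda d: 1.0 - d).
    Returns a new dict; the input is left untouched.
    """
    out = dict(contractions)
    out[target] = out[target] * survives(out[driver])
    return out
```

With `survives = lambda d: 1.0 - d`, a fully pursed “oo” zeroes out the jaw drop, which is the counteraction just described.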
Figure 5. W.C. Fields’s jaw is open and then blended with the “oo” viseme. Image courtesy of Virtual Celebrity Productions and Quantumworks.
Now for the Animation
I finally have my system set up and my models created. It is time to create some real-time animation. The time-tested animation production method is to take a track of audio dialog and go through it, matching the visemes in your model set to the dialog. Then, in a second pass, go through it and add any emotional elements you want. This, as you can imagine, is pretty time consuming. Complicating matters is that there are not many off-the-shelf solutions to help you out. The job requires handling data in a very special way, and most commercial animation packages are not up to the task without help.
Detecting the individual phonemes within an audio track is one part of the puzzle that you can get help with. There is an excellent animation utility called Magpie Pro from Third Wish Software that simplifies this task. It can take an audio track and automatically analyze it for phoneme patterns you provide. While not entirely accurate, it will at least get you started. From there you can manually match up the visemes to the waveform until it looks right. The software also allows you to create additional channels for things such as emotions and eye movements. All this information can be exported as a text file containing the transition information. This in turn can be converted directly to a game-ready stream of data. You can see Magpie Pro in action in Figure 6.
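Turning such an exported text file into a game-ready stream is then trivial. I won't reproduce Magpie Pro's actual export format here, so assume a hypothetical dump with one "frame label" pair per line:

```python
def parse_track(text):
    """Parse a hypothetical 'frame label' export into sorted keyframes.

    Each non-comment line holds a frame number and a viseme/emotion
    label, e.g. '12 L'. Returns a list of (frame, label) tuples.
    """
    keys = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                       # skip blanks and comments
        frame, label = line.split(None, 1)
        keys.append((int(frame), label))
    keys.sort()
    return keys
```

At runtime the resulting list is just a sequence of targets to interpolate between with the morphing or muscle-macro code.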
Figure 6. Magpie Pro simplifies the task of isolating phoneme patterns in your audio track.
Wire Me Up, Baby
With all the high-tech toys available these days, it may seem like a waste to spend all this time hand-synching dialog. What about this performance capture everyone has been talking about? There are many facial capture devices on the market. Some determine facial movements by looking at dots placed on the subject’s face. Others use a video analysis method for determining facial position. For more detailed information on this aspect, have a look at Jake Rodgers’s article “Animating Facial Expressions” in the November 1998 issue of Game Developer. The end result is a series of vectors that describe how certain points on the face move during a capture session. The number of points that can be captured varies based on the system used; typically you get from about eight to hundreds of sensor positions in either 2D or 3D. The data is commonly brought into an animation system such as Softimage or Maya, where the data points drive the deformation of a model. Filmbox by Kaydara is designed specifically to aid in the process of capturing, cleaning up, and applying this form of data. Filmbox can also apply suppressive expressions and inverse kinematic constraints, and perform audio analysis similar to Magpie Pro’s.
This form of motion capture clearly can speed up the process of generating animation information. However, it’s geared much more toward traditional animation and high-end performance animation. In this respect it doesn’t really suit the real-time game developer’s needs. It’s possible to drive a real-time character by using the raw motion capture data to drive a facial deformation model. However, for a real-time game application, I do not believe this is currently feasible.
In order to convert this stream of positional data into my limited real-time animation system, I would need to analyze the data and determine what visemes and emotions the performer is trying to convey. You need a filtering method that will take the multiple sample points and select the viseme or muscle action that is occurring. This is really the key to making motion capture data usable for real-time character animation. This area of research, termed gesture recognition, is pretty active right now. There is a lot of information out there for study. However, Quantumworks’ Geppetto provides gesture recognition from motion capture data to drive “muscle macros” as both a standalone and a plug-in for Filmbox.
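At its simplest, that filtering step is a classification problem: which known viseme pose do the captured marker positions most resemble? Real gesture recognition is far more sophisticated than this, but a nearest-template sketch shows the idea (the marker layouts here are invented):

```python
def classify_pose(sample, templates):
    """Return the name of the viseme template closest to a captured frame.

    sample: list of (x, y) marker positions from one capture frame.
    templates: dict of viseme name -> list of (x, y) marker positions,
    in the same marker order as the sample.
    """
    def sq_dist(a, b):
        # Sum of squared distances over all corresponding markers.
        return sum((pa - pb) ** 2
                   for qa, qb in zip(a, b)
                   for pa, pb in zip(qa, qb))
    return min(templates, key=lambda name: sq_dist(sample, templates[name]))
```

Run per frame and debounced over time, a classifier like this reduces hundreds of raw samples to the stream of viseme or muscle-macro keys the real-time system actually wants.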
Where Do We Go from Here?
Between viseme-based and muscle-based facial animation, you can see that there are a lot of possible approaches and creative areas to explore. In fact, the whole field has really opened up to game development in terms of opportunities for game productions as well as tool developers. Games are going to need content to start filling up those new DVD drives and I think facial animation is a great way to take our productions to the next level.
For Further Information
• Ekman, P., and W. Friesen. Manual for the Facial Action Coding System. Palo Alto, Calif.: Consulting Psychologists Press, 1977.
• Faigin, Gary. The Artist’s Complete Guide to Facial Expression. New York: Watson-Guptill Publications, 1990.
• Goldfinger, Eliot. Human Anatomy for Artists. New York: Oxford University Press, 1991.
• Landreth, C. “Faces with Personality: Modeling Faces That Exude Personality When Animated.” Computer Graphics World (February 1996): p. 58(3).
• Waters, Keith. “A Muscle Model for Animating Three-Dimensional Facial Expression.” Computer Graphics (SIGGRAPH ’87 Proceedings) Vol. 21, No. 4 (July 1987): pp. 17–24.
Thanks to Steve Tice of Quantumworks Corporation for the skull model and the use of Geppetto as well as insight into muscle-based animation systems. The W. C. Fields image is courtesy of Virtual Celebrity Productions LLC (http://www.virtualceleb.com) created using Geppetto. The female kiss image is courtesy of Tom Knight of Imagination Works.
When not massaging the faces of digital beauties or doing stunt falls in a mo-cap rig, Jeff can be found flapping his own lips at Darwin 3D. Send him some snappier dialogue at [email protected].