Rich facial animation is a vital aspect of modern games, and blend shapes, or morph targets, are one of the most common and effective techniques available. Their traditional application has restrictions, however, that are now being broken on DirectX 10 hardware. Given the prevalence of DX9 hardware (and the lack of any DX10 consoles), those DX10 techniques, while novel and more powerful than the one presented here, are of limited practical use. This paper presents a powerful approach to blendshapes that runs in DX9 on Shader Model 3 hardware.
Background and History
A brief overview and history of blendshapes is available in Chapter 3 of GPU Gems 3, by Tristan Lorach of NVIDIA. Some key points are:
- Hardware restrictions allowed blending between only a few blendshapes at a time. This is because morph data is sent as vertex attributes, and a maximum of 16 attributes can be sent to the GPU.
- Of these attributes, each blendshape needs between one and three (for position, normal, and tangent). Fewer can be sent, or they can be packed differently, but of course at the cost of visual quality.
- These restrictions lead to blendshape animation usually blending between only two or three targets, which restricts the expressiveness and nuance of the animation: typically games will have blendshapes for ‘happy’, ‘frown’, etc. This differs from film-style blendshapes, which add localized targets together to create dynamic expressions (‘smile’, ‘jaw open’, and ‘squint’ targets may all combine into a laughing expression, for example).
There are currently four options for breaking this barrier. The first two are dealt with in Lorach 2007: using DX10 Stream-Out or DX10 Buffer-Templates. The third also uses DX10 and is very similar to this implementation (in fact it can be considered an improvement on this technique, using DX10 capabilities): what Microsoft calls “Sparse Morph Targets”, available in the August 2007 DirectX SDK. That technique writes the morph data to textures, where each vertex accesses a pixel on the texture via its vertex index. The fourth option is the one presented here: its main goal is to adapt the Sparse Morph Target technique into one very easily implemented on DX9 hardware.
Facial Blendshape Animation Paradigms
There are two main types of blendshape animation, already mentioned above. The first, the one implemented in current games, uses ‘high-level’ blendshapes that control entire expressions, plus some localized animations. Characters may blend between an expression and/or phoneme, plus eye blinking. It is important to remember that the blendshape blending in occupies a slot separate from the one blending out, which means ‘happy’, ‘ah’, ‘ff’, and ‘blink’ are all that can be active simultaneously. This results in unexpressive dialogue, or expression without speech.
The second paradigm is that detailed in the book “Stop Staring” by Jason Osipa and common in film productions. It uses a large number of localized blendshapes to provide dynamic, varied, and nuanced expression. His ‘Simple Setup’, for example, has 16 blendshapes: Jaw_Open, Smile, Narrow, Frown, Sneer, UprLip_Up, LwrLip_Dn, Squint, BrowsMid_Up, BrowsMid_Dn, BrowsOut_Up, Brows_Squeeze, Teeth_Open, Tongue_Out, Tongue_Up, and Brows_Dn. This setup does not provide asymmetry, however, though changes can be made. Osipa’s ‘Complex Setup’, with 40 targets, provides film-quality control, and while it would be more expensive to drive with Texture-Based Blendshapes, it is certainly possible.
Blendshapes or Bones?
There are a few important benefits of using bones over traditional blendshapes in games. The first is flexibility in animation: bones have none of the blending restrictions blendshapes have, which can provide a higher level of expression and diversity than morph animation. The second is a high degree of proceduralism: dynamically syncing dialogue while providing strong and varied expression has been possible only with bones, because of the same blending restrictions mentioned above.
High-quality bone animation has its own problems, however. To get more fidelity and control out of a bone rig, more bones must be added, which increases CPU and GPU overhead. The common limit of four bone influences per vertex may also be quickly reached in some areas. Without excessive bone counts, bone rigs cannot match the fidelity of a good morph rig: modelers and animators control every vert in a blendshape, but are restricted in a bone rig.
With Texture Blendshapes, however, the diversity and proceduralism of bone animation can be combined with the nuance and control of film-style blendshapes. Instead of driving bones procedurally (or driving expressions and phonemes that drive bones), we can drive blendshapes that dynamically combine to provide maximum fidelity for the cost.
Texture Blendshape Pipeline
The Texture Blendshape pipeline differs substantially from the traditional blendshape pipeline. It can be broken into three parts: mesh preprocessing and texture generation, GPU texture blending passes, and GPU rendering passes.
Mesh Preprocessing and Texture Generation
Generating textures and mesh preprocessing can be done in the DCC tool or in an external app. The scripting required in a DCC app is relatively simple, but if an external program is available it should be simple to add the functionality there. One important thing to note is that because D3D splits vertices that have multiple UVs, normals, colors, etc., the exported mesh and the mesh in the DCC tool will have different vertex counts (the DCC mesh will have an equal or smaller number).
This vertex count is important because it is how we will access the textures. One of the key components of Sparse Morph Targets is the way the texture is stored and the ability to access it by vertex index. Since we do not have that capability in DX9, we must substitute for it: we will use the W component of our UVW texcoord as our vertex index.
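As a concrete sketch (not code from the sample; it assumes a square morph texture of dimension texSize, with the index stored in the W of TEXCOORD0), the shader-side mapping from vertex index to texel might look like this in HLSL:

// A minimal sketch: maps a vertex index to the texel where that
// vertex's morph data was written, assuming a square texture.
float2 IndexToUV(float index, float texSize)
{
    float u = fmod(index, texSize);          // column
    float v = floor(index / texSize);        // row
    return (float2(u, v) + 0.5) / texSize;   // sample at the texel center
}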
Given an original mesh, we compute the positions, normals, and tangents of all morph targets as deltas from the original mesh, so we can additively combine them in the shader. We can do the same for any other vertex attributes, such as UV coordinates (adding a UV texture can correctively fix UV stretching caused by the morphing). We then write out our texture, indexing the pixels with a modulus and integer division. The MAXScript code for pixel indexing is as follows:
texSize = 64
theTex = bitmap texSize texSize color:black hdr:true
-- 'obj' is the mesh being exported; one pixel is written per vertex.
for i = 1 to (getNumVerts obj) do
(
    -- Row-major layout: column from the remainder, row from integer division.
    texU = (mod (i - 1) theTex.width) as integer
    texV = ((i - 1) / theTex.width) as integer
    -- ... compute someColor from this vertex's morph delta ...
    setPixels theTex [texU, texV] #(someColor)
)
It is important that “i” remains the same between this step and the GPU. Ideally we generate the textures and the W coord at the same time; alternatively, we can use the W coord (instead of the actual vert index) to look up where to write our pixels during texture generation.
It is also worth noting that verts that aren’t affected by any blendshape can all be given the same index to conserve texture space, though ideally the part of the mesh that needs blendshapes (such as the head) would use a separate object and shader.
We write out our textures and save them in a DX-compatible format.
Morph Animation Tracks
At this point, we have our textures and a single mesh with vertex indices saved as the W component of its texture coordinate. We must digress for a moment, though, to discuss how the mesh will be animated, i.e., how animation tracks will be handled.
Each blendshape requires an additional texture sampler in each of the position, normal, and tangent blending passes, for a total of three. Since these textures are quite small, this is a minor cost per target, but it adds up in a full-featured setup of 40 or so blendshapes (40 texture samplers per pass and 120 textures!). Note also that ps_3_0 exposes only 16 samplers, so blending that many targets would have to be batched across multiple passes per attribute. Clearly it is worthwhile to limit the number of morph targets applied at once; artist experimentation in the DCC tool should yield a good number.
Each shape also needs a control for its influence, which is additional overhead and setup (though much less than required for something like bone animation). How the animation ‘track’ is passed is not the subject of this paper, though I will offer a couple of ideas. It should not differ much from the techniques used for traditional blendshape tracks and animation setup; for instance, an animation file stating the active blendshapes and their influences at each frame can be written out from the DCC tool quite simply.
Blending the Morph Textures
The vertex shader of our first GPU pass generates a simple quad; in the pixel shader, we sample the position texture for each morph target, multiply each by its respective influence float, and add them together. We write the result to a render target in a VTF-compatible format (16 or 32 bpc ARGB). We do the same in subsequent passes for the normal and tangent textures. If any of our vectors are compressed (normals and tangents especially), we may want to decompress and normalize them before writing to the render target; if not, we can do this in the pixel shader of the render pass. Experimentation will show which is faster.
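As an illustrative sketch (not the shader from the sample; the sampler names, the four-target count, and the ‘influences’ constant are assumptions), the position-blending pixel shader might look like:

// Position-blending pass, ps_3_0. One sampler per active morph target;
// the quad's UVs address one vertex's texel per pixel.
sampler2D morphPosTex0 : register(s0);
sampler2D morphPosTex1 : register(s1);
sampler2D morphPosTex2 : register(s2);
sampler2D morphPosTex3 : register(s3);
float4 influences;   // one influence weight per active target

float4 BlendPositionsPS(float2 uv : TEXCOORD0) : COLOR0
{
    // Sum the weighted position deltas for the vertex this texel represents.
    float3 delta = tex2D(morphPosTex0, uv).xyz * influences.x
                 + tex2D(morphPosTex1, uv).xyz * influences.y
                 + tex2D(morphPosTex2, uv).xyz * influences.z
                 + tex2D(morphPosTex3, uv).xyz * influences.w;
    return float4(delta, 0);   // alpha is free for packed extras (see Additions)
}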
Applying the Morph
At this point, we have three render targets: the blended position, normal, and tangent deltas. First, we need to convert the vertex index into the proper U and V coordinates at which to sample these textures. That done, we sample each using Vertex Texture Fetch and add the results to the original position, normal, and tangent to get our morphed mesh. From there we can deform via skeletal animation and/or light the mesh as normal.
One thing to keep in mind is that Vertex Shader 3.0 supports at most four vertex texture samplers; the three blended targets fit within that limit, with one sampler to spare.
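A minimal sketch of the morph-apply vertex shader (vs_3_0), assuming the three blended render targets are bound as vertex texture samplers; the names here (blendedPosTex, texSize, worldViewProj) are illustrative, not from the sample:

sampler2D blendedPosTex  : register(s0);
sampler2D blendedNormTex : register(s1);
sampler2D blendedTanTex  : register(s2);
float     texSize;        // morph texture dimension, e.g. 64
float4x4  worldViewProj;

struct VS_IN
{
    float4 pos      : POSITION;
    float3 normal   : NORMAL;
    float3 tangent  : TANGENT;
    float3 texcoord : TEXCOORD0;   // xy = mesh UVs, z = vertex index
};

struct VS_OUT
{
    float4 pos     : POSITION;
    float2 uv      : TEXCOORD0;
    float3 normal  : TEXCOORD1;
    float3 tangent : TEXCOORD2;
};

VS_OUT ApplyMorphVS(VS_IN v)
{
    // Recover the texel holding this vertex's blended deltas from the index in W.
    float2 uv  = (float2(fmod(v.texcoord.z, texSize),
                         floor(v.texcoord.z / texSize)) + 0.5) / texSize;
    float4 luv = float4(uv, 0, 0);   // tex2Dlod requires an explicit LOD

    // Vertex Texture Fetch: add the pre-blended deltas to the bind pose.
    float3 pos     = v.pos.xyz + tex2Dlod(blendedPosTex, luv).xyz;
    float3 normal  = normalize(v.normal  + tex2Dlod(blendedNormTex, luv).xyz);
    float3 tangent = normalize(v.tangent + tex2Dlod(blendedTanTex,  luv).xyz);

    // Skinning would go here, operating on the morphed pos/normal/tangent.
    VS_OUT o;
    o.pos     = mul(float4(pos, 1.0), worldViewProj);
    o.uv      = v.texcoord.xy;
    o.normal  = normal;
    o.tangent = tangent;
    return o;
}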
Additions
There are two important additions that can be made to texture blendshapes, neither done in the sample. The first is the addition of a UV texture (or adding UV coords to the alpha channels of two already-used maps). This UV texture can assign the vertices new or offset UV coordinates to accommodate stretching in morph targets with big geometry changes.
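A sketch of how such offsets might be applied in the morph-apply vertex shader, assuming they were packed into the alpha channels of the blended position and normal render targets (the sampler names and packing are assumptions):

sampler2D blendedPosTex  : register(s0);   // xyz = position delta, a = U offset
sampler2D blendedNormTex : register(s1);   // xyz = normal delta,   a = V offset

float2 MorphedUV(float2 meshUV, float4 luv)
{
    // 'luv' addresses this vertex's texel in the blended render targets.
    float2 uvOffset = float2(tex2Dlod(blendedPosTex,  luv).a,
                             tex2Dlod(blendedNormTex, luv).a);
    return meshUV + uvOffset;
}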
Another use is for wrinkle maps. Wrinkle maps can be pre-processed depending on how the edge lengths from one vert to its neighbors change, and the output saved as a mask per blendshape. This mask can then be stored in the alpha channel of another texture and passed into the pixel shader as a blend factor between two or three normal maps (with two maps, a 0 or 1 is unchanged; with three, 0.5 grey is unchanged); a sketch follows at the end of this section.
Assuming we use three VTF samplers (one each for position, normal, and tangent), the two extra channels from a UV blend map and the one from a wrinkle mask can occupy the alpha channels of those textures, adding no additional samples.
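As a sketch of the pixel-shader side of wrinkle mapping (the three-normal-map variant; the sampler names and the mask living in an alpha channel are assumptions):

sampler2D neutralNormal    : register(s0);
sampler2D compressedNormal : register(s1);
sampler2D stretchedNormal  : register(s2);
sampler2D wrinkleMaskTex   : register(s3);   // blended mask in alpha, 0.5 = unchanged

float3 WrinkleNormal(float2 uv)
{
    float  mask = tex2D(wrinkleMaskTex, uv).a;
    float3 n    = tex2D(neutralNormal,    uv).xyz * 2 - 1;
    float3 nc   = tex2D(compressedNormal, uv).xyz * 2 - 1;
    float3 ns   = tex2D(stretchedNormal,  uv).xyz * 2 - 1;
    // Below 0.5, blend toward the compressed map; above, toward stretched.
    float3 blended = mask < 0.5 ? lerp(nc, n, mask * 2)
                                : lerp(n, ns, mask * 2 - 1);
    return normalize(blended);
}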
Optimizations
A number of optimizations could be made to this technique, some of which have already been mentioned. First, the number of active blendshapes can be limited and their textures dynamically swapped. Second, several improvements from the DX10 “Sparse Morph Targets” sample may carry over (especially the ‘subrect’ method). Finally, the tangent and/or normal textures can be left out at the cost of visual fidelity, which may be acceptable for relatively small morph variations; test the impact on your particular characters.
Conclusions
Texture-based morph targets (TBMT) provide a high degree of flexibility and fidelity for facial animation, or for any use of morph targets. I see two main uses. The first is where a high degree of fidelity in facial animation is needed. The hand-keying required for such facial animation is often prohibitive for most games, especially with the increased use of universal/facial capture; non-realistic or non-human faces, however, can still benefit from TBMT and the fidelity it provides.
The second benefit of TBMT is that, because morph targets can be localized, an approach similar to bone-driven facial animation can be taken. This allows the proceduralism of facial bone animation with the fidelity and detail of morph targets. To approach the quality and nuance of morph targets with bones, a large number of bones are needed, which becomes prohibitively expensive as well as difficult to set up and skin. TBMT provides the quality of a film-like morph target setup, the ability to procedurally produce facial animation demanded by many games, and the speed that a real-time implementation requires.