FaceFormer: Speech-Driven 3D Facial Animation with Transformers

Yingruo Fan¹

Zhaojiang Lin²

Jun Saito³

Wenping Wang¹,⁴

Taku Komura¹

¹The University of Hong Kong

²The Hong Kong University of Science and Technology

³Adobe Research

⁴Texas A&M University

Given the raw audio input and a neutral 3D face mesh, our proposed end-to-end Transformer-based architecture, dubbed FaceFormer, can autoregressively synthesize a sequence of realistic 3D facial motions with accurate lip movements.

Abstract

Speech-driven 3D facial animation is challenging due to the complex geometry of human faces and the limited availability of 3D audio-visual data. Prior works typically focus on learning phoneme-level features from short audio windows with limited context, which occasionally results in inaccurate lip movements. To tackle this limitation, we propose a Transformer-based autoregressive model, FaceFormer, which encodes the long-term audio context and autoregressively predicts a sequence of animated 3D face meshes. To cope with data scarcity, we integrate self-supervised pre-trained speech representations. We also devise two biased attention mechanisms well suited to this task: biased cross-modal multi-head (MH) attention, and biased causal MH self-attention with a periodic positional encoding strategy. The former effectively aligns the audio and motion modalities, whereas the latter provides the ability to generalize to longer audio sequences. Extensive experiments and a perceptual user study show that our approach outperforms existing state-of-the-art methods.
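The abstract names a periodic positional encoding and a biased causal self-attention but does not give their formulas. The NumPy sketch below is one plausible reading, not the released implementation: it assumes the encoding is a standard sinusoidal one whose position index wraps every `period` frames, and that the causal bias blocks future frames and penalizes past frames by how many periods back they lie. The function names, the `period` and `slope` parameters, and the exact form of the bias are all assumptions made for illustration.

```python
import numpy as np


def periodic_positional_encoding(num_frames, d_model, period):
    """Sinusoidal encoding whose position index repeats every `period` frames.

    Assumption: positions are wrapped with a modulo, so frame t and frame
    t + period receive the same encoding. `d_model` must be even.
    """
    positions = np.arange(num_frames) % period            # wrapped positions
    dims = np.arange(0, d_model, 2)                        # even channel indices
    div = np.power(10000.0, dims / d_model)                # per-channel wavelength
    angles = positions[:, None] / div[None, :]             # (num_frames, d_model / 2)
    pe = np.zeros((num_frames, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe


def biased_causal_mask(num_frames, period, slope=1.0):
    """Additive attention bias: -inf for future frames, a period-wise
    linear penalty for past frames (hypothetical form of the bias)."""
    i = np.arange(num_frames)[:, None]                     # query (current) frame
    j = np.arange(num_frames)[None, :]                     # key (attended) frame
    bias = -slope * ((i - j) // period).astype(float)      # older periods score lower
    bias[j > i] = -np.inf                                  # causal: no peeking ahead
    return bias


def biased_attention_weights(q, k, bias):
    """Scaled dot-product attention weights with an additive bias."""
    scores = q @ k.T / np.sqrt(q.shape[-1]) + bias
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


# Toy usage: 6 frames, 4 channels, period of 2 frames.
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4)) + periodic_positional_encoding(6, 4, period=2)
weights = biased_attention_weights(x, x, biased_causal_mask(6, period=2))
```

Because the encoding only ever sees positions in `[0, period)`, a sequence longer than any seen in training introduces no unseen position values, which is consistent with the abstract's claim about generalizing to longer audio.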

Paper

FaceFormer: Speech-Driven 3D Facial Animation with Transformers. CVPR 2022.

Yingruo Fan, Zhaojiang Lin, Jun Saito, Wenping Wang, Taku Komura


Acknowledgement

We gratefully acknowledge ETHZ-CVL for providing the B3D(AC)² database and MPI-IS for releasing the VOCASET dataset.