Inside Our AI Lipsync Pipeline: From Wav2Lip to Broadcast
When a brand like Britannia or Ching's needs a TV commercial to air across 12 Indian languages, our AI lipsync pipeline transforms a single Hindi source ad into broadcast-ready multilingual versions, with photorealistic lip movements synced to each language's audio track. Here's exactly how we do it, from the models under the hood to the final broadcast QC pass.
The Pipeline Overview
Our multilingual lipsync pipeline processes a single source video through five core stages. Each stage runs in Docker containers on AWS GPU instances (typically p4d.24xlarge or g5.12xlarge), and the entire flow is managed by a custom Python task queue built on Celery + Redis.
At a high level, the pipeline works as follows:
- Face Detection & Tracking: locate and track all faces across every frame
- Audio Alignment: process the translated audio and extract phoneme timing
- Wav2Lip Inference: generate new lip movements synced to the target audio
- Post-Processing: blend the generated lips back into the original frame
- Broadcast QC: validate output against TV broadcast specifications
Each stage is modular and version-controlled. If the face detection model gets upgraded, we swap it without touching the rest of the pipeline. This modularity is critical when processing hundreds of videos per month.
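To make the modularity concrete, here is a minimal sketch of a stage registry in which one stage can be swapped without touching the others. The `Pipeline` class, the job dictionary, and the stage names are hypothetical illustrations, not our production code, which runs these stages as Celery tasks.

```python
from typing import Any, Callable, Dict, List, Tuple

Job = Dict[str, Any]
Stage = Callable[[Job], Job]

# Stage names mirror the five stages listed above (identifiers are illustrative).
STAGE_NAMES = [
    "face_detection",
    "audio_alignment",
    "wav2lip_inference",
    "post_processing",
    "broadcast_qc",
]


class Pipeline:
    """Ordered list of named stages; each stage takes and returns a job dict."""

    def __init__(self) -> None:
        self._stages: List[Tuple[str, Stage]] = []

    def add(self, name: str, fn: Stage) -> None:
        self._stages.append((name, fn))

    def swap(self, name: str, fn: Stage) -> None:
        """Replace one stage in place, leaving the others untouched."""
        for i, (stage_name, _) in enumerate(self._stages):
            if stage_name == name:
                self._stages[i] = (name, fn)
                return
        raise KeyError(name)

    def run(self, job: Job) -> Job:
        for name, fn in self._stages:
            job = fn(job)
            # Record which stages have completed, in order.
            job.setdefault("completed", []).append(name)
        return job


def build_pipeline() -> Pipeline:
    """Build a pipeline of pass-through placeholder stages."""
    pipeline = Pipeline()
    for name in STAGE_NAMES:
        pipeline.add(name, lambda job: job)
    return pipeline
```

Upgrading the detector then becomes a single `pipeline.swap("face_detection", new_detector)` call, with the downstream stages unchanged.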
Step 1: Face Detection & Landmark Extraction
Before we can modify lip movements, we need to know exactly where the face is in every frame. We use a two-stage approach:
RetinaFace for Detection
RetinaFace is a single-stage face detector that provides bounding boxes and five facial landmarks (left eye, right eye, nose tip, left mouth corner, right mouth corner) at near real-time speeds. We run it at 1080p resolution to ensure we capture faces even in wide shots. The original RetinaFace paper reports 91.4% AP on the WIDER FACE hard test set, and that robustness is critical for detecting faces in complex broadcast footage with motion blur, multiple subjects, and varied lighting.
MediaPipe Face Mesh for Dense Landmarks
Once we have the face bounding box from RetinaFace, we pass the cropped face region through MediaPipe's Face Mesh model, which extracts 468 3D facial landmarks. These dense landmarks give us precise geometry of the jaw, lips, and chin, which is essential for smooth compositing later. We specifically track the lip landmarks (the outer and inner lip contours, which MediaPipe groups in its FACEMESH_LIPS connection set) to create an accurate mask for blending.
The combined output is a per-frame data structure containing:
- Bounding box coordinates (x, y, w, h)
- 468 3D landmark positions
- Head pose estimation (yaw, pitch, roll)
- Confidence scores for detection quality
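As a rough sketch, the per-frame record could be modelled with dataclasses like this. The class and field names are assumptions made for illustration, as is the 0-1 confidence range.

```python
from dataclasses import dataclass
from typing import List, Tuple

NUM_LANDMARKS = 468  # dense Face Mesh landmarks per face


@dataclass(frozen=True)
class HeadPose:
    yaw: float
    pitch: float
    roll: float


@dataclass(frozen=True)
class FaceFrame:
    """Tracking output for one face in one video frame (hypothetical schema)."""

    frame_index: int
    bbox: Tuple[int, int, int, int]  # x, y, w, h in pixels
    landmarks: List[Tuple[float, float, float]]  # one (x, y, z) per landmark
    pose: HeadPose
    confidence: float  # assumed to be normalised to 0..1

    def __post_init__(self) -> None:
        # Validate early so later stages can rely on the shape of the data.
        if len(self.landmarks) != NUM_LANDMARKS:
            raise ValueError(
                f"expected {NUM_LANDMARKS} landmarks, got {len(self.landmarks)}"
            )
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
```

Validating at construction time means a malformed detection fails in Step 1 rather than surfacing as a compositing glitch several stages later.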
Step 2: Audio Processing & Phoneme Alignment
The dubbed audio tracks arrive from professional voice-over studios in each target language. Before feeding them to Wav2Lip, we process them through several stages:
Audio Normalization
We use FFmpeg's loudnorm filter to normalize all audio tracks to -24 LUFS (the broadcast standard for Indian television). This ensures consistent volume levels across all language versions and prevents clipping artifacts during the lipsync process.
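A minimal sketch of how such a normalization command might be assembled is shown below. Only the -24 LUFS target comes from the text above. The true-peak and loudness-range values, the 48 kHz output rate, and the function name are illustrative assumptions.

```python
from typing import List


def loudnorm_command(
    src: str,
    dst: str,
    target_lufs: float = -24.0,   # integrated loudness target from the text
    true_peak_db: float = -2.0,   # assumed true-peak ceiling
    loudness_range: float = 7.0,  # assumed loudness range target
) -> List[str]:
    """Build an FFmpeg argument list that applies the loudnorm filter."""
    audio_filter = (
        f"loudnorm=I={target_lufs}:TP={true_peak_db}:LRA={loudness_range}"
    )
    # Set the output sample rate explicitly; 48 kHz is an assumption here.
    return ["ffmpeg", "-y", "-i", src, "-af", audio_filter, "-ar", "48000", dst]
```

The list can be passed to `subprocess.run(cmd, check=True)`. This is single-pass normalization; a two-pass run that feeds measured values back into the filter is typically more accurate.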
Forced Alignment with Montreal Forced Aligner (MFA)
For languages where we need precise phoneme timing (particularly for longer dialogues), we run the Montreal Forced Aligner. MFA uses HMM-based acoustic models trained per language to map each phoneme to its start and end time in the audio. This gives us a TextGrid file with fine-grained phoneme boundaries that we use to validate the Wav2Lip output.
Mel-Spectrogram Extraction
Wav2Lip operates on 80-band mel-spectrograms rather than raw audio waveforms. We extract these using a 16kHz sample rate, 800-sample FFT window, and 200-sample hop length, matching the original Wav2Lip training parameters exactly. This is computed using librosa and cached for each audio track to avoid redundant computation during iterative processing.
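These parameters fix the timing arithmetic: 16,000 samples per second divided by a 200-sample hop gives 80 mel frames per second. The sketch below shows how each video frame could be mapped to a mel window. The 16-frame window follows the public Wav2Lip reference inference script as we recall it, so treat it as an assumption. The function names are illustrative.

```python
from typing import List, Tuple

SAMPLE_RATE = 16_000  # Hz
HOP_LENGTH = 200      # samples between consecutive mel frames
MEL_STEP_SIZE = 16    # mel frames fed to the model per video frame (assumed)


def mel_frames_per_second() -> float:
    return SAMPLE_RATE / HOP_LENGTH


def mel_chunk_bounds(num_mel_frames: int, fps: float) -> List[Tuple[int, int]]:
    """Return (start, end) mel-frame indices, one window per video frame."""
    if num_mel_frames < MEL_STEP_SIZE:
        raise ValueError("audio is shorter than one mel window")
    mel_per_video_frame = mel_frames_per_second() / fps
    bounds: List[Tuple[int, int]] = []
    i = 0
    while True:
        start = int(i * mel_per_video_frame)
        if start + MEL_STEP_SIZE > num_mel_frames:
            # Past the end of the audio: reuse the last full window and stop.
            bounds.append((num_mel_frames - MEL_STEP_SIZE, num_mel_frames))
            break
        bounds.append((start, start + MEL_STEP_SIZE))
        i += 1
    return bounds
```

At 25 fps each video frame advances the window by 3.2 mel frames, so consecutive windows overlap heavily and give the model audio context on both sides of the frame.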
Step 3: Wav2Lip, The Core Model
Wav2Lip (published at ACM Multimedia 2020) is the foundational model in our pipeline. It takes two inputs, a face image and a mel-spectrogram chunk, and outputs a face image with lip movements synchronized to the audio.
Architecture Overview
Wav2Lip uses a dual-encoder architecture:
- Face Encoder: A CNN that processes the lower-half-masked face image through a series of residual blocks, producing a 512-dimensional face embedding
- Audio Encoder: A CNN that processes the mel-spectrogram through convolutional layers, producing a 512-dimensional audio embedding
- Decoder: These embeddings are concatenated and decoded through transposed convolutions to generate the lip region
The model is trained with a combination of L1 reconstruction loss and a SyncNet-based perceptual loss. The SyncNet discriminator ensures that generated lips are not just visually plausible but also correctly synchronized with the audio. It is essentially a lip-reading model that acts as a judge.
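The two loss terms can be sketched in plain Python as follows. The sync term follows the formulation in the Wav2Lip paper as we understand it: cosine similarity between the video and audio embeddings, read as a probability and penalised with a negative log. The 0.03 weight and the function names are illustrative assumptions, and the sketch covers only the two terms named above.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def l1_loss(generated: Vector, target: Vector) -> float:
    """Mean absolute difference between generated and ground-truth pixels."""
    return sum(abs(g - t) for g, t in zip(generated, target)) / len(target)


def sync_probability(video_emb: Vector, audio_emb: Vector, eps: float = 1e-8) -> float:
    """Cosine similarity of the two embeddings, read as P(in sync)."""
    dot = sum(v * a for v, a in zip(video_emb, audio_emb))
    norm = math.sqrt(sum(v * v for v in video_emb)) * math.sqrt(
        sum(a * a for a in audio_emb)
    )
    return dot / max(norm, eps)


def sync_loss(pairs: List[Tuple[Vector, Vector]], eps: float = 1e-8) -> float:
    """Average negative log-probability of being in sync over a batch."""
    total = 0.0
    for video_emb, audio_emb in pairs:
        # Clamp so that a zero or negative similarity does not break the log.
        total += -math.log(max(sync_probability(video_emb, audio_emb), eps))
    return total / len(pairs)


def total_loss(recon: float, sync: float, sync_weight: float = 0.03) -> float:
    """Weighted sum of the two terms; the weight is an assumed value."""
    return (1.0 - sync_weight) * recon + sync_weight * sync
```

Perfectly aligned embeddings give a sync loss near zero, while orthogonal ones are penalised heavily, which is what pushes the generator towards lips that match the audio rather than lips that merely look plausible.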
Our Modifications
We've made several key modifications to the vanilla Wav2Lip model:
- Resolution Enhancement: We operate at 384×384 pixel face crops (vs. the original 96×96), using a super-resolution post-processing step with Real-ESRGAN to upscale the lip region after generation
- Temporal Smoothing: We apply a 5-frame Gaussian temporal filter to prevent flickering artifacts between consecutive frames
- Batch Processing: We process frames in batches of 128 on A100 GPUs, achieving ~45 FPS throughput, fast enough to run inference on a 30-second TVC in under a minute
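The temporal smoothing step can be illustrated with a 5-tap Gaussian filter applied along the time axis to a single value, for example one pixel or one landmark coordinate tracked across frames. The sigma value and the edge handling (repeating the first and last frame) are assumptions made for this sketch.

```python
import math
from typing import List, Sequence


def gaussian_kernel(size: int = 5, sigma: float = 1.0) -> List[float]:
    """Normalised 1D Gaussian weights centred on the middle tap."""
    half = size // 2
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma)) for k in range(-half, half + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_track(values: Sequence[float], size: int = 5, sigma: float = 1.0) -> List[float]:
    """Smooth a per-frame value over time with a Gaussian window."""
    kernel = gaussian_kernel(size, sigma)
    half = size // 2
    n = len(values)
    smoothed: List[float] = []
    for t in range(n):
        acc = 0.0
        for k, weight in enumerate(kernel):
            # Clamp indices so the first and last frames are repeated at the edges.
            idx = min(max(t + k - half, 0), n - 1)
            acc += weight * values[idx]
        smoothed.append(acc)
    return smoothed
```

A one-frame spike is spread over its neighbours instead of flashing on screen, while a steady value passes through unchanged.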
Step 4: Post-Processing & Compositing
The Wav2Lip output gives us a face crop with new lip movements, but it needs to be seamlessly composited back into the original video frame. This is where our post-processing pipeline comes in:
Feathered Alpha Masking
Using the 468 facial landmarks from Step 1, we create a precise alpha mask around the lower face region. The mask is feathered with a Gaussian blur (σ=3px) at the edges to ensure smooth blending between the generated lip region and the original face. The mask is dynamically adjusted per frame based on the head pose: wider when the face is frontal, narrower when it is turned.
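To show what feathering does, here is a simplified one-dimensional sketch: a hard 0/1 mask along a single pixel row is blurred with a Gaussian (σ=3) and then used as the alpha channel for blending. A real mask is two-dimensional and derived from the landmarks. The function names and the 3σ kernel truncation are assumptions.

```python
import math
from typing import List, Sequence


def feather(mask_row: Sequence[float], sigma: float = 3.0) -> List[float]:
    """Blur a hard 0/1 mask row with a Gaussian to soften its edges."""
    radius = int(math.ceil(3 * sigma))  # truncate the kernel at 3 sigma
    kernel = [math.exp(-(k * k) / (2.0 * sigma * sigma)) for k in range(-radius, radius + 1)]
    total = sum(kernel)
    n = len(mask_row)
    out: List[float] = []
    for i in range(n):
        acc = 0.0
        for k in range(-radius, radius + 1):
            j = min(max(i + k, 0), n - 1)  # repeat edge pixels
            acc += kernel[k + radius] * mask_row[j]
        out.append(acc / total)
    return out


def blend(original: Sequence[float], generated: Sequence[float], alpha: Sequence[float]) -> List[float]:
    """Per-pixel alpha blend: alpha=1 keeps generated, alpha=0 keeps original."""
    return [a * g + (1.0 - a) * o for o, g, a in zip(original, generated, alpha)]
```

Pixels well inside the mask take the generated value, pixels well outside keep the original, and the transition band mixes the two so no hard seam is visible.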
Color Matching
The generated lip region may have slightly different color characteristics than the surrounding skin. We apply histogram matching in LAB color space between the generated region and a reference patch from the original frame. This corrects for any color drift introduced by the neural network.
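Histogram matching itself is a CDF lookup. The sketch below works on one 8-bit channel given as a flat list of integers; in practice it would be applied to each LAB channel in turn. The function names are illustrative, and a production version would use vectorised array operations rather than Python lists.

```python
from bisect import bisect_left
from typing import List, Sequence


def cdf(values: Sequence[int], levels: int = 256) -> List[float]:
    """Cumulative distribution of integer pixel values in [0, levels)."""
    hist = [0] * levels
    for v in values:
        hist[v] += 1
    total = len(values)
    running = 0
    out: List[float] = []
    for count in hist:
        running += count
        out.append(running / total)
    return out


def match_histogram(source: Sequence[int], reference: Sequence[int], levels: int = 256) -> List[int]:
    """Remap source values so their distribution follows the reference."""
    src_cdf = cdf(source, levels)
    ref_cdf = cdf(reference, levels)
    lookup: List[int] = []
    for level in range(levels):
        # First reference level whose CDF reaches the source level's CDF.
        j = bisect_left(ref_cdf, src_cdf[level])
        lookup.append(min(j, levels - 1))
    return [lookup[v] for v in source]
```

Here `source` would be a channel of the generated lip region and `reference` the same channel of the skin patch sampled from the original frame.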
Sharpening & Denoising
We apply an unsharp mask (radius=1, amount=0.5) to the composited region to restore the crispness lost during the neural network's downsampling-upsampling process. A light denoising pass (using FFmpeg's nlmeans filter, a non-local means denoiser) removes any remaining high-frequency noise patterns.
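An unsharp mask adds back a scaled copy of the detail that a blur removes: sharpened = original + amount × (original − blurred). The one-dimensional sketch below uses a 3-tap box blur as a stand-in for the radius-1 blur. The choice of blur and the function names are assumptions.

```python
from typing import List, Sequence


def box_blur(row: Sequence[float], radius: int = 1) -> List[float]:
    """Average each pixel with its neighbours, repeating edge pixels."""
    n = len(row)
    out: List[float] = []
    for i in range(n):
        window = [row[min(max(i + k, 0), n - 1)] for k in range(-radius, radius + 1)]
        out.append(sum(window) / len(window))
    return out


def unsharp_mask(row: Sequence[float], radius: int = 1, amount: float = 0.5) -> List[float]:
    """Sharpen by adding back the detail removed by the blur, clamped to 0..255."""
    blurred = box_blur(row, radius)
    return [
        min(255.0, max(0.0, p + amount * (p - b)))
        for p, b in zip(row, blurred)
    ]
```

Flat regions are left untouched because the blur changes nothing there, while edges gain contrast on both sides of the transition.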
Final Frame Assembly
The composited face is placed back into the original frame using the tracked bounding box coordinates. We use OpenCV's seamlessClone (with NORMAL_CLONE flag) for the final blend, which automatically handles lighting and color continuity at the boundary. The entire compositing step runs on CPU and processes at ~200 FPS, so it's not a bottleneck.
Step 5: Broadcast QC & Delivery
Every output goes through automated and manual quality checks before delivery:
- SyncNet Confidence Score: We run SyncNet on the final output and require a minimum confidence score of 7.5 (out of 10). Anything below triggers re-processing with adjusted parameters
- VMAF Quality Score: We compute VMAF (Video Multimethod Assessment Fusion) between the output and original to ensure visual quality remains above 85/100
- Broadcast Specs: Final encoding uses ProRes 422 HQ for broadcast delivery, with ITU-R BT.709 color space and -24 LUFS audio levels
- Native Speaker Review: Each language version is reviewed by a native speaker for naturalness of lip movements
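The automated part of these checks reduces to a simple gate. In the sketch below the two thresholds come from the list above, while the `QCResult` type and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import List

MIN_SYNC_CONFIDENCE = 7.5  # SyncNet threshold from the checklist
MIN_VMAF = 85.0            # VMAF threshold from the checklist


@dataclass
class QCResult:
    passed: bool
    failures: List[str]


def automated_qc(sync_confidence: float, vmaf: float) -> QCResult:
    """Apply the automated thresholds and report which checks failed."""
    failures: List[str] = []
    if sync_confidence < MIN_SYNC_CONFIDENCE:
        failures.append(
            f"SyncNet confidence {sync_confidence:.2f} is below {MIN_SYNC_CONFIDENCE}"
        )
    if vmaf < MIN_VMAF:
        failures.append(f"VMAF {vmaf:.1f} is below {MIN_VMAF}")
    return QCResult(passed=not failures, failures=failures)
```

A failed result would send the job back for re-processing, and only passing outputs move on to native speaker review.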
GPU Infrastructure & Parallel Processing
For a typical 12-language campaign, we process all language versions in parallel using a cluster of GPU instances. Our infrastructure is built on:
- AWS EC2 p4d/g5 instances with NVIDIA A100/A10G GPUs
- Docker containers with CUDA 12.x and PyTorch 2.x
- Celery + Redis for distributed task orchestration
- S3 for input/output video storage with presigned URLs
- CloudWatch for monitoring GPU utilization, queue depth, and processing times
A 30-second TVC processed into 12 languages completes in approximately 15-20 minutes wall-clock time (including upload, all processing stages, and QC checks). Individual language processing takes 3-4 minutes on a single A100.
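The fan-out across languages can be sketched with the standard library alone. The production system dispatches Celery tasks to GPU workers, so the thread pool, the placeholder `process_language` function, and the output naming scheme here are stand-ins for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Sequence


def process_language(source_video: str, language: str) -> str:
    """Placeholder for one full per-language run; returns the output path."""
    stem = source_video.rsplit(".", 1)[0]
    return f"{stem}_{language}.mov"  # hypothetical naming scheme


def process_campaign(
    source_video: str, languages: Sequence[str], max_workers: int = 12
) -> Dict[str, str]:
    """Run every language version concurrently and collect the outputs."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            lang: pool.submit(process_language, source_video, lang)
            for lang in languages
        }
        # result() blocks until that language is done and re-raises any error.
        return {lang: future.result() for lang, future in futures.items()}
```

Run sequentially, 12 languages at 3-4 minutes each would take 36-48 minutes. With parallel workers the wall-clock time is set by the slowest language plus shared overhead such as upload and QC.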
Results & What's Next
This pipeline has processed 100+ broadcast ads for major Indian FMCG brands, with a first-pass success rate of 92% (meaning 92% of outputs pass QC without re-processing). The remaining 8% typically involve edge cases like extreme head angles, multiple overlapping faces, or very fast dialogue.
We're currently working on several improvements:
- MuseTalk Integration: Evaluating MuseTalk as an alternative to Wav2Lip for higher-resolution output without the super-resolution step
- Real-time Preview: Building a web-based preview system that shows approximate lipsync results in real-time before full processing
- Emotion Transfer: Preserving and transferring emotional expressions (smiles, frowns) from the source to the lipsync output, so that emotional nuances aren't lost during the relipping process
If you're interested in multilingual campaigns or want to learn more about our technology, reach out at [email protected].