Real-Time Avatars

A comparative guide to building interactive digital humans

Overview

Interactive digital humans that respond in near real-time to user input are becoming central to virtual communication, gaming, and AI assistants. Achieving a convincing digital human requires balancing visual realism, low latency, precise controllability, and feasible deployment.

Recent advances (2023-2024) have produced several distinct approaches to real-time responsive avatars, each with unique trade-offs in latency, fidelity, control, and system cost. Additionally, streaming infrastructure like LiveKit now enables production-ready avatar deployment with multiple provider integrations.

  • Graphics (MetaHuman Pipeline): Game-engine characters driven by performance capture or animation rigs for real-time rendering in Unreal Engine.
  • AI/ML (Generative Video Models): Diffusion or transformer-based models that directly synthesize avatar video frames from audio or other signals.
  • Neural 3D (Gaussian Splatting): Neural 3D scene representation using Gaussian primitives that can be efficiently animated and rendered in real time.
  • Infrastructure (Streaming Avatars): Production-ready WebRTC infrastructure integrating multiple avatar providers with voice AI agents via LiveKit.

Approach 1

MetaHuman Pipeline

Epic Games' MetaHuman framework exemplifies the graphics-based approach to digital humans. MetaHumans are highly detailed 3D character models with rigged faces and bodies, designed for real-time rendering in Unreal Engine.

Key Features

  • 60+ FPS rendering with ~30-50ms latency
  • Precise control via rigs and blendshapes
  • Live Link support for real-time streaming
  • No per-person ML training required

Limitations

  • CGI look may not achieve true photorealism
  • Significant content creation effort upfront
  • Requires a capable GPU and game engine
  • Manual design needed for specific likenesses

How It Works

Input (camera/audio) → Tracking (ARKit via Live Link) → Animation (blendshapes) → Render (Unreal Engine)
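
In code terms, the tracking stage amounts to receiving a stream of ARKit blendshape weights from the Live Link Face app over the local network and forwarding them to the animation layer. Below is a minimal, hedged sketch of such a receiver in Python; the port number and the decode_blendshapes helper are illustrative assumptions, since real packets must be parsed according to Epic's Live Link protocol (in practice Unreal's Live Link plugin handles this for you).

# Sketch of the tracking step: listen for Live Link Face packets on the local
# network. The port and decode_blendshapes are assumptions for illustration.
import socket

LISTEN_PORT = 11111  # assumed default; configure the target/port in the Live Link Face app

def decode_blendshapes(packet: bytes) -> dict[str, float]:
    """Hypothetical placeholder: a real implementation must parse the packet
    according to Epic's Live Link Face format (blendshape names and weights)."""
    return {}

def main() -> None:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", LISTEN_PORT))
    while True:
        packet, _addr = sock.recvfrom(4096)
        weights = decode_blendshapes(packet)          # e.g. {"jawOpen": 0.42, ...}
        # Forward the weights to your animation/render layer here.
        print(weights.get("jawOpen", 0.0))

if __name__ == "__main__":
    main()
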
Approach 2

Generative Video Models

AI generative models, often based on diffusion or transformer architectures, directly synthesize video frames of a talking or moving person. A single input image can be turned into a lifelike talking video with one-shot generalization to unseen identities.

Key Features

  • Photorealistic output from minimal input
  • One-shot: no per-subject training needed
  • Natural behaviors (blinks, head movements)
  • 20-30 FPS achievable on high-end GPUs

Limitations

  • Heavy compute requirements (A100+ GPU)
  • Limited explicit control over output
  • Risk of artifacts or identity drift
  • Higher first-frame latency (~0.3-1s)

Key Techniques

Autoregressive Streaming

Models like CausVid use block-wise causal attention for 40x speedup over vanilla diffusion.
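
To make the idea concrete, the sketch below builds a block-wise causal attention mask in PyTorch: tokens attend freely within their own block and to all earlier blocks, but never to future blocks, which is what allows frames to be generated and streamed block by block. This illustrates the masking pattern only, not CausVid's implementation.

# Block-wise causal attention mask: True = the query may attend that key.
import torch
import torch.nn.functional as F

def block_causal_mask(num_tokens: int, block_size: int) -> torch.Tensor:
    """True where a query token may attend a key token: same or earlier block."""
    block_ids = torch.arange(num_tokens) // block_size
    return block_ids.unsqueeze(1) >= block_ids.unsqueeze(0)

# Example: 12 frame-tokens in blocks of 4; later blocks never see future blocks.
q = k = v = torch.randn(1, 2, 12, 16)          # (batch, heads, tokens, dim)
mask = block_causal_mask(12, 4)                # (12, 12) boolean mask
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
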

Long-term Consistency

Reference Sink and RAPR techniques prevent identity drift over extended generation.

Adversarial Refinement

Second-stage discriminator training recovers detail lost in distillation.
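
As a rough, hedged illustration of what such a refinement stage can look like, the sketch below defines a generic hinge GAN loss pair that could be applied between real frames and the distilled generator's frames; the networks and loss weighting are placeholders, not the setup of any specific published system.

# Generic hinge GAN losses for a second-stage refinement pass (illustrative only).
import torch
import torch.nn.functional as F

def d_hinge_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    # Discriminator: push real scores above +1 and fake scores below -1.
    return F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()

def g_hinge_loss(d_fake: torch.Tensor) -> torch.Tensor:
    # Generator: raise the discriminator's score on generated frames.
    return -d_fake.mean()

# Typical wiring (placeholder names, not a specific model):
#   fake = generator(conditioning)
#   loss_d = d_hinge_loss(discriminator(real), discriminator(fake.detach()))
#   loss_g = reconstruction_loss + lambda_adv * g_hinge_loss(discriminator(fake))
real_scores, fake_scores = torch.randn(8), torch.randn(8)      # smoke test
print(d_hinge_loss(real_scores, fake_scores).item(), g_hinge_loss(fake_scores).item())
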

Approach 3

Neural Gaussian Splatting

3D Gaussian Splatting (3DGS) enables real-time rendering of photorealistic 3D scenes using a cloud of Gaussian primitives. By capturing a person as textured 3D Gaussians that can be animated, we get a streaming neural avatar that runs extremely fast and looks realistic.
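
As a simplified illustration of how such an avatar can be driven, the sketch below stores only the Gaussian centers and re-poses them each frame with linear blend skinning from per-bone transforms; real systems also transform each Gaussian's rotation/covariance and handle faces and garments separately, so treat this as a conceptual sketch rather than any project's actual code.

# Animate Gaussian-splat centers with linear blend skinning (conceptual sketch).
import numpy as np

N, B = 10_000, 24                                       # Gaussians, bones
means = np.random.rand(N, 3)                            # Gaussian centers (rest pose)
skin_weights = np.random.dirichlet(np.ones(B), size=N)  # (N, B), rows sum to 1

def skin_means(means: np.ndarray, weights: np.ndarray,
               bone_transforms: np.ndarray) -> np.ndarray:
    """bone_transforms: (B, 4, 4) rest-to-posed transforms. Returns posed centers."""
    homo = np.concatenate([means, np.ones((len(means), 1))], axis=1)   # (N, 4)
    per_bone = np.einsum("bij,nj->nbi", bone_transforms, homo)         # (N, B, 4)
    blended = np.einsum("nb,nbi->ni", weights, per_bone)               # (N, 4)
    return blended[:, :3]

# Each frame: update bone_transforms from a tracker or parametric body model,
# re-skin the centers, then hand the posed Gaussians to the splatting renderer.
identity = np.tile(np.eye(4), (B, 1, 1))
posed = skin_means(means, skin_weights, identity)
assert np.allclose(posed, means)                        # identity pose leaves centers unchanged
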

Key Features

  • 60+ FPS rendering on consumer GPUs
  • Photorealistic for the captured subject
  • Multi-view consistent output for AR/VR
  • Can be driven by parametric models

Limitations

  • Requires multi-view capture per person
  • Hours of training time per identity
  • Fixed identity (one model = one person)
  • Quality degrades outside training range

Notable Projects

D3GA (Drivable 3D Gaussian Avatars)

Factors full human avatar into layered Gaussian clusters (body, garments, face) attached to a deformable cage rig.

GaussianSpeech

First to generate photorealistic multi-view talking head sequences from audio input with expression-dependent details.

Production Ready

Streaming Avatars with LiveKit

LiveKit Agents provides production-ready infrastructure for deploying real-time avatars at scale. Rather than building avatar rendering from scratch, it integrates multiple avatar providers through a unified API, handling WebRTC streaming, synchronization, and voice AI pipelines automatically.

Key Features

  • Multiple avatar providers (Tavus, Hedra, Simli, etc.)
  • Built-in voice AI pipeline (STT + LLM + TTS)
  • WebRTC-based low-latency streaming
  • Production deployment with load balancing
  • Cross-platform SDKs (Web, iOS, Android, Flutter)

Limitations

  • Requires a third-party avatar provider subscription
  • Less control over the avatar rendering pipeline
  • Dependent on provider capabilities and quality
  • Per-minute or per-session pricing from providers

Architecture

Agent Session (Python/Node.js) → Avatar Worker (Provider API) → LiveKit Room (WebRTC) → Client (Web/Mobile)

The avatar worker joins as a separate participant, receiving audio from the agent and publishing synchronized video back to users. This minimizes latency by having the provider connect directly to LiveKit rooms.

Supported Avatar Providers

Tavus

Photorealistic digital twins with custom voice cloning and persona training.

Hedra

Character-based avatars with expressive animations and customizable styles.

Simli

Real-time lip-sync avatars optimized for conversational AI applications.

Anam

AI-powered digital humans with natural gestures and emotional expressions.

Beyond Presence

Enterprise-grade avatars for customer service and virtual assistance.

bitHuman

Hyper-realistic avatars with advanced facial animation technology.

Side-by-Side Comparison

Aspect            | MetaHuman                   | Generative                     | Gaussian                          | Streaming
Latency           | ~30-50ms (60+ FPS)          | ~0.3-1s first frame, 20-30 FPS | <100ms (30-60 FPS)                | ~100-300ms (provider dependent)
Visual Realism    | High-quality CGI            | Photorealistic                 | Photorealistic (subject-specific) | Varies by provider
Controllability   | Explicit, fine-grained      | Limited, audio-driven          | Moderate to high                  | Audio-driven, provider APIs
New Identity      | Moderate effort (modeling)  | One-shot (just an image)       | High effort (capture + training)  | Provider-specific setup
Training Required | None per character          | Base model only                | Per-subject (hours)               | None (managed by provider)
Hardware          | Gaming GPU                  | A100+ or cloud                 | Consumer GPU                      | Any (cloud-hosted)
Best For          | Production, precise control | Quick deployment, any face     | VR/AR telepresence                | Voice AI apps, rapid deploy

Getting Started Tutorial

Choose your approach based on your requirements. Below are quick-start guides for each method with links to open-source implementations.

Tutorial 1

MetaHuman + Live Link

1. Install Unreal Engine 5

Download from the Epic Games Launcher. MetaHuman requires UE 5.0+.

2. Create a MetaHuman

Use MetaHuman Creator (metahuman.unrealengine.com) to design or import a character.

3. Set up Live Link Face

Install the Live Link Face app on an iPhone. Connect it to Unreal via your local network.

4. Enable Live Link in your project

Add the Live Link plugin, create a Live Link preset, and connect the ARKit face data to your MetaHuman blueprint.

Alternative: NVIDIA Omniverse Audio2Face can drive MetaHuman lips from audio.

Tutorial 2

SadTalker (Diffusion-based)

1. Clone the repository

git clone https://github.com/OpenTalker/SadTalker.git

2. Install dependencies

pip install -r requirements.txt

3. Download pretrained models

Run the download script or manually download checkpoints from the releases page.

4. Generate a talking head

python inference.py --source_image face.jpg --driven_audio speech.wav

Other options: GeneFace++, OmniAvatar, Avatarify (for real-time webcam)

Tutorial 3

D3GA (Gaussian Avatars)

1. Clone the D3GA repository

git clone https://github.com/facebookresearch/D3GA.git

2. Capture multi-view video

Record the subject from multiple angles. The more viewpoints, the better the reconstruction.

3. Train the Gaussian avatar

Run the training script with your captured data. This may take several hours depending on data size.

4. Drive with motion data

Use FLAME parameters, body poses, or audio input to animate your trained avatar in real time, as sketched below.
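
A hedged sketch of that drive loop follows; the TrainedAvatar class and its render() signature are hypothetical placeholders for illustration and do not reflect D3GA's actual interfaces, which are documented in the repository.

# Hypothetical real-time drive loop for a trained Gaussian avatar.
import time
import numpy as np

class TrainedAvatar:
    """Placeholder for a trained avatar model; not D3GA's actual API."""
    def render(self, expression: np.ndarray, pose: np.ndarray) -> np.ndarray:
        # A real model would pose and splat the Gaussians; here we return a blank frame.
        return np.zeros((512, 512, 3), dtype=np.uint8)

def drive(avatar: TrainedAvatar, param_stream, target_fps: float = 30.0) -> None:
    """param_stream yields (expression, pose) pairs from a tracker or audio model."""
    frame_budget = 1.0 / target_fps
    for expression, pose in param_stream:
        start = time.perf_counter()
        frame = avatar.render(expression, pose)
        # ...display or stream `frame` here...
        elapsed = time.perf_counter() - start
        time.sleep(max(0.0, frame_budget - elapsed))   # hold the target frame rate
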

Also see: GaussianSpeech (audio-driven), GaussianTalker
Tutorial 4

LiveKit Agents + Avatar

1. Install the LiveKit Agents SDK

pip install livekit-agents livekit-plugins-hedra

2. Configure API credentials

Set up a LiveKit Cloud account and obtain API keys from your chosen avatar provider (Hedra, Tavus, Simli, etc.).

3. Create the Agent + Avatar session

from livekit.agents import AgentSession
from livekit.plugins import hedra
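
# Note: this snippet is assumed to run inside a LiveKit agent entrypoint,
# where `ctx` is the JobContext passed to the entrypoint function.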

# Create voice AI agent
agent_session = AgentSession(
    stt="assemblyai/universal-streaming",
    llm="openai/gpt-4.1-mini",
    tts="cartesia/sonic-3"
)

# Create avatar session
avatar_session = hedra.AvatarSession()

# Start avatar with agent
await avatar_session.start(
    agent_session=agent_session,
    room=ctx.room
)

4. Deploy and connect the frontend

Use LiveKit's React hooks or native SDKs to display the avatar video track. The avatar worker publishes synchronized audio/video to the room.

Providers: Tavus, Hedra, Simli, Anam, Beyond Presence, bitHuman, LiveAvatar

Open Source Resources