Unison | Shihao Cheng

Unison is a unified framework for human-centric audio-video generation that jointly models motion, speech, and ambient sound.

Existing audio-video generators struggle with two long-standing problems:

Speech–SFX interference — speech and sound effects collapse into a single noisy stream.
Motion–audio desynchronization — visual motion drifts away from the audio it is supposed to drive.

Unison addresses both with dedicated modality-aware tokenization and a harmonization objective, producing temporally aligned, semantically consistent multimodal outputs.

First Author. ECCV 2026.