Unison
Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation (ECCV 2026)
Unison is a unified framework for human-centric audio-video generation that jointly models motion, speech, and ambient sound.
Existing audio-video generators struggle with two long-standing problems:
- Speech–SFX interference — speech and sound effects collapse into a single noisy stream.
- Motion–audio desynchronization — visual motion drifts away from the audio it is supposed to drive.
Unison resolves both via dedicated modality-aware tokenization and a harmonization objective, producing temporally aligned, semantically consistent multi-modal outputs.
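To make the motion-audio desynchronization problem concrete, the sketch below estimates the frame offset between a per-frame motion-energy signal and an audio envelope using plain cross-correlation. This is a generic diagnostic, not Unison's tokenization or harmonization objective; the function name `estimate_lag` and the toy signals are hypothetical and chosen only for illustration.

```python
import math


def estimate_lag(motion, audio, max_lag):
    """Return the lag (in frames) that best aligns audio[i + lag] with motion[i].

    A positive result means the audio event happens later than the motion.
    Hypothetical helper for illustration; not part of Unison.
    """
    def centered(xs):
        # Remove the mean so constant offsets do not dominate the score.
        mean = sum(xs) / len(xs)
        return [x - mean for x in xs]

    m, a = centered(motion), centered(audio)
    best_lag, best_score = 0, -math.inf
    for lag in range(-max_lag, max_lag + 1):
        # Sum of products over the frames where both signals are defined.
        score = 0.0
        for i, mv in enumerate(m):
            j = i + lag
            if 0 <= j < len(a):
                score += mv * a[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Toy signals: a motion pulse at frame 5 and the matching sound at frame 8.
motion = [0.0] * 20
audio = [0.0] * 20
motion[5] = 1.0
audio[8] = 1.0

lag = estimate_lag(motion, audio, max_lag=5)
print(lag)  # 3: the audio trails the motion by three frames
```

A lag near zero over a whole clip would suggest the two streams are aligned; a lag that grows over time would correspond to the drift described above.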
First Author. ECCV 2026.