Unison

Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation (ECCV 2026)

Unison is a unified framework for human-centric audio-video generation that jointly models motion, speech, and ambient sound.

Existing audio-video generators struggle with two long-standing problems:

  • Speech–SFX interference: speech and sound effects collapse into a single noisy stream.
  • Motion–audio desynchronization: visual motion drifts out of sync with the audio it should accompany.

Unison resolves both via dedicated modality-aware tokenization and a harmonization objective, producing temporally aligned, semantically consistent multi-modal outputs.

First Author. ECCV 2026.