DreamDance: Animating Human Images
by Enriching 3D Geometry Cues from 2D Poses


Yatian Pang1,3, Bin Zhu1, Bin Lin1, Mingzhe Zheng4, Francis E. H. Tay3, Ser-Nam Lim5,6, Harry Yang4,6, Li Yuan1,2,*

1Peking University   2PengCheng Laboratory   3NUS   4HKUST   5UCF   6Everlyn AI




Abstract

In this work, we present DreamDance, a novel method for animating human images using only skeleton pose sequences as conditional inputs. Existing approaches struggle to generate coherent, high-quality content in an efficient, user-friendly manner: methods that rely solely on 2D pose guidance lack 3D cues and produce suboptimal results, while methods that use 3D representations as guidance achieve higher quality but require a cumbersome, time-intensive process. To address these limitations, DreamDance enriches 3D geometry cues from 2D poses with an efficient diffusion model, enabling high-quality human image animation under multi-level guidance. Our key insight is that human images naturally exhibit multiple levels of correlation, progressing from coarse skeleton poses to fine-grained geometry cues, and further from these geometry cues to explicit appearance details. Capturing such correlations enriches the guidance signals, promoting intra-frame coherence and inter-frame consistency. Specifically, we construct the TikTok-Dance5K dataset, comprising 5K high-quality dance videos with detailed frame annotations, including human pose, depth, and normal maps. Next, we introduce a Mutually Aligned Geometry Diffusion Model to generate fine-grained depth and normal maps for enriched guidance. Finally, a Cross-domain Controller incorporates multi-level guidance to animate human images effectively with a video diffusion model. Extensive experiments demonstrate that our method achieves state-of-the-art performance in animating human images.

Method

Overview of DreamDance framework. The Mutually Aligned Geometry Diffusion Model generates detailed depth and normal maps to enrich guidance signals that are mutually aligned across modalities and time. The Cross-domain Controlled Video Diffusion Model utilizes a cross-domain controller to integrate multiple levels of guidance, producing high-quality human animations.
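As a rough, hedged illustration of the idea above (not the paper's actual implementation, which uses diffusion networks), the multi-level guidance can be pictured as per-frame pose, depth, and normal maps stacked into one conditioning tensor and mixed by a controller. The function names, channel counts, and the toy 1x1 projection below are assumptions for the sketch only.

```python
import numpy as np

def fuse_guidance(pose, depth, normal):
    """Stack per-frame guidance maps along the channel axis.

    pose:   (T, 3, H, W) skeleton pose renderings
    depth:  (T, 1, H, W) depth maps (here standing in for the geometry stage output)
    normal: (T, 3, H, W) normal maps (likewise)
    returns (T, 7, H, W) fused conditioning tensor
    """
    assert pose.shape[0] == depth.shape[0] == normal.shape[0], "frame counts must match"
    return np.concatenate([pose, depth, normal], axis=1)

def project_guidance(cond, weight):
    """Toy 1x1 'controller' projection: mixes the 7 guidance channels into
    C feature channels independently at every pixel and frame."""
    # cond: (T, 7, H, W), weight: (C, 7) -> (T, C, H, W)
    return np.einsum("cg,tghw->tchw", weight, cond)

# Small dummy sequence of 4 frames at 16x16 resolution.
T, H, W, C = 4, 16, 16, 8
rng = np.random.default_rng(0)
pose = rng.random((T, 3, H, W))
depth = rng.random((T, 1, H, W))
normal = rng.random((T, 3, H, W))

cond = fuse_guidance(pose, depth, normal)           # (4, 7, 16, 16)
feats = project_guidance(cond, rng.random((C, 7)))  # (4, 8, 16, 16)
print(cond.shape, feats.shape)
```

In the actual framework the controller is a learned network injecting these fused features into the video diffusion model; the sketch only shows the channel-wise fusion of the three guidance modalities.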

Experiments

We show the pseudo ground truth in the last column, with the generated results displayed in the penultimate column.

Cross-ID Animation

The linked page shows the generated normal and depth maps.




Unseen Domain

Comparisons with baseline methods