We present AlignNet, a model that synchronizes a video with a reference audio under non-uniform and irregular misalignment. AlignNet learns, end-to-end, a dense correspondence between each video frame and the audio. Our method is built on simple and well-established principles: attention, pyramidal processing, warping, and an affinity function. Together with the model, we release Dance50, a dancing dataset for training and evaluation. Qualitative, quantitative, and subjective evaluations on dance-music alignment and speech-lip alignment show that our method far outperforms state-of-the-art methods.
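To make the four building blocks named above concrete, here is a minimal sketch of how an affinity function and attention can produce a dense video-to-audio correspondence, refined coarse-to-fine (the pyramidal step). Every function name, feature shape, and the two-level pyramid below are illustrative assumptions for exposition, not the released AlignNet implementation.

import torch
import torch.nn.functional as F

def affinity(video_feats, audio_feats):
    """Cosine-similarity affinity between every video frame and audio step.
    video_feats: (T_v, C), audio_feats: (T_a, C) -> affinity matrix (T_v, T_a)."""
    v = F.normalize(video_feats, dim=-1)
    a = F.normalize(audio_feats, dim=-1)
    return v @ a.t()

def attend_and_warp(video_feats, audio_feats):
    """Soft attention over audio steps assigns each video frame a fractional
    audio position; warping the audio track to those positions aligns it."""
    aff = affinity(video_feats, audio_feats)    # (T_v, T_a)
    attn = aff.softmax(dim=-1)                  # attention weights per frame
    positions = torch.arange(audio_feats.shape[0], dtype=torch.float32)
    # Expected audio index per video frame: a dense correspondence estimate.
    return attn @ positions                     # (T_v,)

# Pyramidal (coarse-to-fine) use: estimate correspondences on temporally
# downsampled audio features first, then refine at the full resolution.
if __name__ == "__main__":
    T_v, T_a, C = 8, 32, 16
    v, a = torch.randn(T_v, C), torch.randn(T_a, C)
    for scale in (4, 1):  # coarse level, then fine level
        corr = attend_and_warp(v, a[::scale])
        print(f"scale {scale}: correspondences {corr * scale}")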
Paper and Bibtex
Citation
Jianren Wang*, Zhaoyuan Fang*, Hang Zhao. AlignNet: A Unifying Approach to Audio-Visual Alignment. In WACV, 2020.
@inproceedings{jianren20alignnet,
Author = {Wang, Jianren and Fang, Zhaoyuan and Zhao, Hang},
Title = {AlignNet: A Unifying Approach to Audio-Visual Alignment},
Booktitle = {WACV},
Year = {2020}
}
Acknowledgements
We would like to thank David Held, Antonio Torralba, and the members of CMU R-Pad and MIT CSAIL for fruitful discussions. The work was carried out while JW and ZF were at CMU and HZ was at MIT. This work was supported by a PanGU Young Investigator Award to JW.