We present AlignNet, a model that synchronizes a video with a reference audio under non-uniform and irregular misalignment. AlignNet learns, end-to-end, a dense correspondence between each video frame and the audio. Our method is built on simple and well-established principles: attention, pyramidal processing, warping, and an affinity function. Together with the model, we release Dance50, a dancing dataset for training and evaluation. Qualitative, quantitative, and subjective evaluations on dance-music alignment and speech-lip alignment show that our method far outperforms state-of-the-art methods.
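To make the four building blocks named above concrete, here is a minimal sketch of how an affinity function and attention can produce a dense video-to-audio correspondence, refined coarse-to-fine (the pyramidal step). Every function name, feature shape, and the two-level pyramid below are illustrative assumptions for exposition, not the released AlignNet implementation.

import torch
import torch.nn.functional as F

def affinity(video_feats, audio_feats):
    """Cosine-similarity affinity between every video frame and audio step.
    video_feats: (T_v, C), audio_feats: (T_a, C) -> affinity matrix (T_v, T_a)."""
    v = F.normalize(video_feats, dim=-1)
    a = F.normalize(audio_feats, dim=-1)
    return v @ a.t()

def attend_and_warp(video_feats, audio_feats):
    """Soft attention over audio steps assigns each video frame a fractional
    audio position; warping the audio track to those positions aligns it."""
    aff = affinity(video_feats, audio_feats)    # (T_v, T_a)
    attn = aff.softmax(dim=-1)                  # attention weights per frame
    positions = torch.arange(audio_feats.shape[0], dtype=torch.float32)
    # Expected audio index per video frame: a dense correspondence estimate.
    return attn @ positions                     # (T_v,)

# Pyramidal (coarse-to-fine) use: estimate correspondences on temporally
# downsampled audio features first, then refine at the full resolution.
if __name__ == "__main__":
    T_v, T_a, C = 8, 32, 16
    v, a = torch.randn(T_v, C), torch.randn(T_a, C)
    for scale in (4, 1):  # coarse level, then fine level
        corr = attend_and_warp(v, a[::scale])
        print(f"scale {scale}: correspondences {corr * scale}")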
Paper and Bibtex
Citation
Jianren Wang*, Zhaoyuan Fang*, Hang Zhao. AlignNet: A Unifying Approach to Audio-Visual Alignment. In WACV, 2020.
@inproceedings{jianren20alignnet,
Author = {Wang, Jianren and Fang, Zhaoyuan and Zhao, Hang},
Title = {AlignNet: A Unifying Approach to Audio-Visual Alignment},
Booktitle = {WACV},
Year = {2020}
}
Acknowledgements
We would like to thank David Held, Antonio Torralba, and the members of CMU R-Pad and MIT CSAIL for fruitful discussions. The work was carried out while JW and ZF were at CMU and HZ was at MIT. This work was supported by a PanGU Young Investigator Award to JW.