AI Watchdog: HD-VG-130M

Explore original journalism about this data set through AI Watchdog, The Atlantic’s ongoing investigation into the generative-AI industry.


HD-VG-130M is a collection of 1,549,408 YouTube videos, according to our analysis of the data set. It was compiled by Microsoft and published in 2023 to train video-generating AI models. The data set’s compilers sought out high-definition videos with a high “aesthetic score,” as judged by an AI model. To prevent text from accidentally appearing in videos generated by AI models, they rejected videos that contained a creator’s name or watermark. The data set has been used by Nvidia, and its developers claim that “🎉 Up to January 2024, our data set has been downloaded by more than 50 universities and research institutes!”