AI Watchdog: HD-VILA-100M

Explore original journalism about this data set through AI Watchdog, The Atlantic’s ongoing investigation into the generative-AI industry.


HD-VILA-100M is a collection of 3,098,462 YouTube videos, according to our analysis of the data set. Microsoft Research Asia compiled it and published it in 2021 to train video-based AI models. It contains approximately 100 million video clips, each labeled with its YouTube subtitles. The data set is hosted on Microsoft’s website and has been used by Meta, ByteDance, Snap, and Tencent.