AI Watchdog: MiraData-330K

September 10, 2025

Explore original journalism about this data set through AI Watchdog, The Atlantic’s ongoing investigation into the generative-AI industry.

MiraData-330K is a collection of 330,000 video clips, drawn from 225,544 videos on YouTube, Videvo, Pixabay, and Pexels and paired with text captions. This data set features long video clips (averaging 72.1 seconds each) and captions (averaging 318 words). It was compiled by researchers at Tencent’s Applied Research Center and released in 2024. The researchers used GPT-4V to generate captions. The data set is hosted on Hugging Face, an AI-development hub, and has been downloaded more than 2,500 times.
