AI Watchdog: Koala-36M

Explore original journalism about this data set through AI Watchdog, The Atlantic’s ongoing investigation into the generative-AI industry.


Koala-36M is a collection of 2,430,928 videos from YouTube split into 36 million clips and paired with text captions, according to our analysis of the data set. It was compiled by Kuaishou Technology, the Chinese creator of a popular short-form-video platform, and released in 2024. The data set is a refinement of Snap’s Panda-70M—developers applied different video-filtering, segmenting, and caption-generating strategies. It is hosted on Hugging Face, an AI-development hub.