AI Watchdog: OpenSubtitles
Explore original journalism about this data set through AI Watchdog, The Atlantic’s ongoing investigation into the generative-AI industry.
OpenSubtitles is a collection of subtitles taken from more than 53,000 movies and 85,000 television-show episodes. The subtitles were mostly screen-ripped by users of OpenSubtitles.org, a website for sharing unauthorized copies of subtitles. The collection includes at least 616 episodes of The Simpsons and every film nominated for Best Picture from 1950 to 2016. OpenSubtitles is distributed as part of the Pile, a huge data set of data sets that consists of copyrighted books, patent applications, European Parliament transcriptions, and much more. It has been used by Apple, Anthropic, Meta, Nvidia, Salesforce, Bloomberg, EleutherAI, Databricks, and Cerebras, as well as for more than 500 open-source AI models.