There’s a question that vexes generative AI companies: “What content was used to train the model?” Some companies dodge the question while others tackle it head-on, but whether an AI company has obtained content without permission for its own business purposes remains a thorny issue.
At best, a company offers only a vague description of a “curated dataset”; at worst, the question sparks a debate over whether everything on the internet is inherently fair game for training.
Documents obtained by 404 Media reveal that some of the data used to train Runway’s latest AI video generation tool, Gen-3, may have been scraped from thousands of YouTube channels, including those of popular media companies such as Pixar, Netflix, Disney, and Sony.
404 Media did not elaborate on how it obtained the document, nor could it confirm that every video listed in it was actually used to train Gen-3, but the document offers insight into the techniques AI companies may use when scraping copyrighted material to train their models.
A former Runway employee described the method to 404 Media: the leaked documents included 14 spreadsheets with keywords like “beach” and “rain” listed next to the names of Runway employees.
According to sources, those employees were tasked with finding videos and channels matching their assigned keywords, then using YouTube video downloader tools routed through proxies to scrape the videos and channels without being blocked by Google.
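The workflow described above amounts to a simple keyword-to-assignee pipeline. As a rough, purely hypothetical sketch of what such a spreadsheet-driven assignment might look like (the column names and values here are invented for illustration, not taken from the leaked documents):

```python
import csv
import io

# Invented stand-in for a leaked spreadsheet: keywords paired with the
# employee assigned to collect matching videos (per 404 Media's description).
SHEET = """keyword,assignee
beach,employee_a
rain,employee_b
storm,employee_a
"""

def build_assignments(csv_text):
    """Group keywords by the person tasked with collecting videos for them."""
    assignments = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        assignments.setdefault(row["assignee"], []).append(row["keyword"])
    return assignments

print(build_assignments(SHEET))
# {'employee_a': ['beach', 'storm'], 'employee_b': ['rain']}
```

Each assignee would then search for matching content and feed the results to a downloader, which is where the reported proxy routing would come in.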
And it apparently wasn’t just YouTube content that was scraped: the spreadsheet also contains 14 links to non-YouTube sources, among them a website dedicated to streaming popular manga and anime films that has thousands of copyright-infringement claims against it.
In essence, it appears that pirated media was at least considered as training data, if not directly scraped and used.
404 Media went a step further and attempted to generate videos in Gen-3 using prompts built from keywords found in the spreadsheet, producing clips strikingly similar in style to the related content.
Runway itself is partly funded by companies like Google, and if it is indeed scraping content from Google’s platform without creators’ permission, it would likely find itself in serious trouble, to say nothing of broader legal ramifications.
The issue of AI content theft is a thorny one, but training data aside, the model itself still has problems. Ars Technica recently tried generating videos with Gen-3 Alpha, and a cat came out with a human hand. I don’t know what content was used to train this particular version of the model, but there’s clearly room for improvement, whatever the methodology.