Some of the world’s largest technology companies trained AI models on a dataset that included unauthorized transcripts of more than 173,000 YouTube videos, a new investigation by Proof News has found. The dataset, created by the nonprofit EleutherAI, contained transcripts of YouTube videos from over 48,000 channels and was used by companies such as Apple, NVIDIA, and Anthropic. The findings highlight an uncomfortable truth about AI: the technology is built on data taken from creators without their consent or compensation.
The dataset does not include any YouTube videos or images, but it does include video transcripts from some of the platform’s biggest creators, such as Marques Brownlee and MrBeast, as well as major news publishers, such as The New York Times, the BBC, and ABC News. Subtitles from Engadget videos are also part of the dataset.
“Apple sources data for its AI from multiple companies,” Brownlee wrote on X. “One of these companies harvests a ton of data and transcripts from YouTube videos, including mine,” he added. “This will be a long-term, evolving issue.”
Apple sources data for its AI from multiple companies
One of them scraped a ton of data and transcripts from YouTube videos, including mine.
Apple technically avoids the “flaw” because it doesn’t scrape.
But this will be an evolving issue for a long time https://t.co/U93riaeSlY
— Marques Brownlee (@MKBHD) July 16, 2024
A Google spokesperson told Engadget that previous comments by YouTube CEO Neal Mohan, who said that companies using YouTube data to train AI models would violate the platform’s terms of service, still stand. Apple, NVIDIA, Anthropic, and EleutherAI did not respond to Engadget’s requests for comment.
Until now, AI companies have not been transparent about the data they use to train their models. Earlier this month, artists and photographers criticized Apple for not disclosing the origins of the training data for Apple Intelligence, the company’s proprietary generative AI that will be included in millions of Apple devices this year.
YouTube in particular, the world’s largest video repository, is a treasure trove of audio, video, and images, as well as transcripts, making it an attractive source of data for training AI models. Earlier this year, The Wall Street Journal asked OpenAI’s Chief Technology Officer Mira Murati whether the company used YouTube videos to train Sora, OpenAI’s upcoming AI video generation tool. “I won’t go into the details of the data used, but it was publicly available or licensed data,” Murati said at the time. Alphabet CEO Sundar Pichai has also said that companies using YouTube data to train AI models would violate the platform’s terms of service.
If you want to check whether the dataset includes subtitles from your own YouTube videos or from your favorite channels, visit Proof News’ search tool.
Update, July 16, 2024 3:17 PM PST: This story has been updated to add a statement from Google.