OpenAI Reportedly Transcribed 1 Million Hours of YouTube Videos to Train GPT-4

Apr 07, 2024
OpenAI reportedly transcribed more than one million hours of YouTube videos to train GPT-4, according to The New York Times on Saturday. The report comes just days after YouTube CEO Neal Mohan told Bloomberg that transcribing YouTube videos for AI training would be a “clear violation” of the platform’s policies.

“When a creator uploads their hard work to our platform, they have certain expectations. One of those expectations is that the terms of service is going to be abided by,” said Mohan in an interview with Bloomberg last week. “But it does not allow for things like transcripts or video bits to be downloaded.”

The New York Times report, citing sources, alleges that OpenAI team members, including President Greg Brockman, personally helped collect the YouTube videos. The article details how OpenAI, like many tech companies, is struggling to collect enough data to train massive AI models. OpenAI allegedly used Whisper, its AI transcription software, to transcribe the videos and produce more data to train GPT-4, the latest model underlying ChatGPT.

OpenAI and Google did not immediately respond to Gizmodo’s requests for comment.

The New York Times report could have massive implications for OpenAI and Google’s ongoing battle at the forefront of generative AI development. Google is unlikely to go quietly if OpenAI is using its content to make ChatGPT even greater. However, the company has made no such allegations yet. In a statement to The Verge this weekend, a Google spokesperson merely said he’s “seen unconfirmed reports” about OpenAI’s training.

YouTube’s terms of service prohibit users from downloading its content, including through botnets or scrapers, unless they have clear permission from the company. YouTube also prohibits using its content for any purpose “independent” of its service.

OpenAI’s Chief Technology Officer, Mira Murati, said she was “not sure” whether YouTube videos were used to train her company’s text-to-video AI model Sora when asked by The Wall Street Journal in March. The New York Times report mentions nothing about Sora, or about the use of actual YouTube video bits. However, her reluctance to answer the question directly has fueled further speculation.

The New York Times is itself in a copyright battle with OpenAI at the moment. OpenAI and Meta are also being sued by a number of authors and content houses for training their AI on copyrighted works.

If these reports are true, they could raise entirely new questions about copyright law in the AI world. Most copyright complaints around AI have been brought by small publishers, but Google could add some real weight to this fight if it chooses to take part. It would also give Google a way to slow down OpenAI, which is undoubtedly winning the AI race at the moment.