Apple, NVIDIA and Anthropic reportedly used YouTube transcripts without permission to train AI models

Some of the world’s largest tech companies trained their AI models on a dataset that included transcripts of more than 173,000 YouTube videos without permission, a new investigation from Proof News has found. The dataset, which was created by a nonprofit company called EleutherAI, contains transcripts of YouTube videos from more than 48,000 channels and was used by Apple, NVIDIA and Anthropic among other companies. The findings of the investigation spotlight AI’s uncomfortable truth: the technology is largely built on the backs of data siphoned from creators without their consent or compensation.

The dataset doesn’t include any videos or images from YouTube, but it contains video transcripts from the platform’s biggest creators, including Marques Brownlee and MrBeast, as well as large news publishers like The New York Times, the BBC, and ABC News. Subtitles from videos belonging to Engadget are also part of the dataset.

“Apple has sourced data for their AI from several companies,” Brownlee posted on X. “One of them scraped tons of data/transcripts from YouTube videos, including mine,” he added. “This is going to be an evolving problem for a long time.”

Apple has sourced data for their AI from several companies

One of them scraped tons of data/transcripts from YouTube videos, including mine

Apple technically avoids “fault” here because they’re not the ones scraping

But this is going to be an evolving problem for a long time https://t.co/U93riaeSlY

— Marques Brownlee (@MKBHD) July 16, 2024

YouTube, Apple, NVIDIA, Anthropic and EleutherAI did not respond to a request for comment from Engadget.

So far, AI companies haven’t been transparent about the data used to train their models. Earlier this month, artists and photographers criticized Apple for failing to reveal the source of training data for Apple Intelligence, the company’s own spin on generative AI coming to millions of Apple devices this year.

YouTube, the world’s largest repository of videos, is a goldmine of not only transcripts but also audio, video, and images, making it an attractive source of training data for AI models. Earlier this year, OpenAI’s chief technology officer, Mira Murati, evaded questions from The Wall Street Journal about whether the company used YouTube videos to train Sora, OpenAI’s upcoming AI video generation tool. “I’m not going to go into the details of the data that was used, but it was publicly available or licensed data,” Murati said at the time. Both YouTube CEO Neal Mohan and Alphabet CEO Sundar Pichai have said that companies using data from YouTube to train their AI models would violate the platform’s terms of service.

If you want to see if subtitles from your YouTube videos or from your favorite channels are part of the dataset, head over to Proof News’ lookup tool.
