{"id":20348,"date":"2025-09-10T15:58:08","date_gmt":"2025-09-10T15:58:08","guid":{"rendered":"https:\/\/naijaglobalnews.org\/?p=20348"},"modified":"2025-09-10T15:58:08","modified_gmt":"2025-09-10T15:58:08","slug":"at-least-15-million-youtube-videos-have-been-snatched-by-ai-companies","status":"publish","type":"post","link":"https:\/\/naijaglobalnews.org\/?p=20348","title":{"rendered":"At Least 15 Million YouTube Videos Have Been Snatched by AI Companies"},"content":{"rendered":"<p>\n<\/p>\n<p class=\"ArticleParagraph_root__4mszW\" data-flatplan-paragraph=\"true\"><em>Editor\u2019s note: This analysis is part of <\/em>The Atlantic<em>\u2019s investigation into how YouTube videos are taken to train AI tools. You can use the search tool directly here, to see whether videos you\u2019ve created or watched are included in the data sets. This work is part of AI Watchdog, <\/em>The Atlantic<em>\u2019s ongoing investigation into the generative-AI industry.<\/em><\/p>\n<p class=\"ArticleParagraph_root__4mszW ArticleParagraph_dropcap__uIVzg\" data-flatplan-paragraph=\"true\" data-flatplan-dropcap=\"true\">W<span class=\"smallcaps\">hen Jon Peters uploaded his first video<\/span> to YouTube in 2010, he had no idea where it would lead. He was a professional woodworker running a small business who decided to film himself making a dining table with some old legs he had found in a barn. It turned out that people liked his candid style, and as he posted more videos, a fan base began to grow. \u201cAll of a sudden there\u2019s people who appreciate the work I\u2019m doing,\u201d he told me. \u201cThe comments were a motivator.\u201d Fifteen years later, his channel has more than 1 million subscribers. 
Sometimes he gets photos of people in their shops, following his guidance from a big TV on the wall\u2014most of his viewers, Peters told me, are woodworkers looking to him for instruction.<\/p>\n<p class=\"ArticleParagraph_root__4mszW\" data-flatplan-paragraph=\"true\">But Peters\u2019s channel could soon be obsolete, along with millions of other videos created by people who share their expertise and advice on YouTube. Over the past few months, I\u2019ve discovered more than 15.8 million videos from more than 2 million channels that tech companies have, without permission, downloaded to train AI products. Nearly 1 million of them, by my count, are how-to videos. You can find these videos in at least 13 different data sets distributed by AI developers at tech companies, universities, and research organizations, through websites such as Hugging Face, an online AI-development hub.<\/p>\n<p class=\"ArticleParagraph_root__4mszW\" data-flatplan-paragraph=\"true\">In most cases the videos are anonymized, meaning that titles and creator names are not included. I was able to identify the videos by extracting unique identifiers from the data sets and looking them up on YouTube\u2014similar to the process I followed when I revealed the contents of the Books3, OpenSubtitles, and LibGen data sets. You can search the data sets using the tool below, typing in channel names like \u201cMrBeast\u201d or \u201cJames Charles,\u201d for example.<\/p>\n<\/p>\n<p class=\"ArticleParagraph_root__4mszW\" data-flatplan-paragraph=\"true\">(<em>A note for users: Just because a video appears in these data sets does not mean it was used for training by AI companies, which could choose to omit certain videos when developing their products.<\/em>)<\/p>\n<p class=\"ArticleParagraph_root__4mszW\" data-flatplan-paragraph=\"true\">To create AI products capable of generating video, developers need huge quantities of videos, and YouTube has become a common source. 
Although YouTube does offer paying subscribers the ability to download videos and watch them through the company\u2019s app whenever they\u2019d like, this is something different: Video files are being ripped from YouTube en masse and saved in files that are then fed to AI algorithms. This kind of downloading violates the platform\u2019s terms of service, but many tools allow AI developers to download videos in this way. YouTube appears to have done little, if anything, to stop the mass downloading, and the company did not respond to my request for comment.<\/p>\n<p class=\"ArticleParagraph_root__4mszW\" data-flatplan-paragraph=\"true\">Not all YouTube videos are copyrighted (and some are uploaded by people who don\u2019t own the copyrights), but many are. Unauthorized copying or distribution of those videos is illegal, but whether AI training constitutes a form of copying or distribution is still a question being debated in many ongoing lawsuits. Tech companies have argued that training is a \u201cfair use\u201d of copyrighted work, and some judges have disagreed in their responses. How the courts ultimately apply the law to this novel technology could have massive consequences for creators\u2019 motivations to post their work on YouTube and similar platforms\u2014if tech companies are able to continue taking creators\u2019 work to build AI products that compete with them, then creators may have little choice but to stop sharing.<\/p>\n<p class=\"ArticleParagraph_root__4mszW ArticleParagraph_dropcap__uIVzg\" data-flatplan-paragraph=\"true\" data-flatplan-dropcap=\"true\">G<span class=\"smallcaps\">enerative-AI tools are already producing<\/span> videos that compete with human-made work on YouTube. AI-generated history videos with hundreds of thousands of views and many inaccuracies are drowning out fact-checked, expert-produced content. 
Popular music-remix videos are frequently created using this technology, and many of them perform better than human-made videos.<\/p>\n<p class=\"ArticleParagraph_root__4mszW\" data-flatplan-paragraph=\"true\">The problem extends far beyond YouTube, however. Most modern chatbots are \u201cmultimodal,\u201d meaning they can respond to a question by creating relevant media. Google\u2019s Gemini chatbot, for instance, will produce short clips for paying users. Soon, you may be able to ask ChatGPT or another generative-AI tool about how to build a table from found legs and get a custom how-to video in response. Even if that response isn\u2019t as good as any video Peters would make, it will be immediate, and it will be tailor-made to your specifications. The online-publishing business has already been decimated by text-generation tools; video creators should expect similar challenges from generative-AI tools in the near future.<\/p>\n<p class=\"ArticleParagraph_root__4mszW\" data-flatplan-paragraph=\"true\">Many major tech companies have used these data sets to train AI, according to research papers I\u2019ve read and AI developers I\u2019ve spoken with. The group includes Microsoft, Meta, Amazon, Nvidia, Runway, ByteDance, Snap, and Tencent. I reached out to each of these companies to ask about their use of these data sets. Only Meta, Amazon, and Nvidia responded. All three said they \u201crespect\u201d content creators and believe that their use of the work is legal under existing copyright law. Amazon also shared that, where video is concerned, it is currently focused on developing ways to generate \u201ccompelling, high-quality advertisements from simple prompts.\u201d<\/p>\n<p class=\"ArticleParagraph_root__4mszW\" data-flatplan-paragraph=\"true\">We can\u2019t be certain whether all these companies will use the videos to create for-profit video-generating tools. Some of the work they\u2019ve done may be simply experimental. 
But a few of these companies have an obvious interest in pursuing commercial products: Meta, for instance, is developing a suite of tools called Movie Gen that creates videos from text prompts, and Snap offers \u201cAI Video Lenses\u201d that allow users to augment their videos with generative AI. Videos such as the ones in these data sets are the raw material for products like these; much as ChatGPT couldn\u2019t write like Shakespeare without first \u201creading\u201d Shakespeare, a video generator couldn\u2019t construct a fake newscast without \u201cwatching\u201d tons of recorded broadcasts. In fact, a large number of the videos in these data sets are from news and educational channels, such as the BBC (which has at least 33,000 videos in the data sets, across its various brands) and TED (nearly 50,000). Hundreds of thousands of others\u2014if not more\u2014are from individual creators, such as Peters.<\/p>\n<p class=\"ArticleParagraph_root__4mszW\" data-flatplan-paragraph=\"true\">AI companies are more interested in some videos than others. A spreadsheet leaked to <em>404 Media<\/em> by a former employee at Runway, which builds AI video-generation tools, shows what the company valued about certain channels: \u201chigh camera movement,\u201d \u201cbeautiful cinematic landscapes,\u201d \u201chigh quality scenes from movies,\u201d \u201csuper high quality sci-fi short films.\u201d One channel was labeled \u201cTHE HOLY GRAIL OF CAR CINEMATICS SO FAR\u201d; another was labeled \u201conly 4 videos but they are really well done.\u201d<\/p>\n<p class=\"ArticleParagraph_root__4mszW\" data-flatplan-paragraph=\"true\">Developers seek out high-quality videos in a variety of ways. Curators of two of the data sets collected here\u2014HowTo100M and HD-VILA-100M\u2014prioritized videos with high view counts on YouTube, equating popularity with quality. 
The creators of another data set, HD-VG-130M, noted that \u201chigh view count does not guarantee video quality,\u201d and used an AI model to select videos of high \u201caesthetic quality.\u201d Data-set creators often try to avoid videos that contain overlaid text, such as subtitles and logos, so these identifying features don\u2019t appear in videos generated by their model. So, some advice for YouTubers: Putting a watermark or logo on your videos, even a small one, makes them less desirable for training.<\/p>\n<p class=\"ArticleParagraph_root__4mszW\" data-flatplan-paragraph=\"true\">To prepare the videos for training, developers split the footage into short clips, in many cases cutting wherever there is a scene or camera change. Each clip is then given an English-language description of the visual scene so the model can be trained to correlate words with moving images, and to generate videos from text prompts. AI developers have a few methods of writing these captions. One way is to pay workers to do it. Another is to use separate AI models to generate a description automatically. The latter is more common, because of its lower cost.<\/p>\n<p class=\"ArticleParagraph_root__4mszW ArticleParagraph_dropcap__uIVzg\" data-flatplan-paragraph=\"true\" data-flatplan-dropcap=\"true\">A<span class=\"smallcaps\">I video tools aren\u2019t yet<\/span> as mainstream as chatbots or image generators, but they are already in wide use. You may already have seen AI-manipulated video without realizing it. For example, TED has been using AI to dub speakers\u2019 talks in different languages. This includes the video as well as the audio: Speakers\u2019 mouths are lip-synched with the new words so it looks like they\u2019re speaking Japanese, French, or Russian. 
Nishat Ruiter, TED\u2019s general counsel, told me this is done with the speakers\u2019 knowledge and consent.<\/p>\n<p class=\"ArticleParagraph_root__4mszW\" data-flatplan-paragraph=\"true\">There are also consumer-facing products for tweaking videos with AI. If your face doesn\u2019t look right, for example, you can try a face-enhancer such as Facetune, or ditch your mug entirely with a face-swapper such as Facewow. With Runway\u2019s Aleph, you can change the colors of objects, or turn sunshine into a snowstorm.<\/p>\n<p class=\"ArticleParagraph_root__4mszW\" data-flatplan-paragraph=\"true\">Then there are tools that generate new videos based on an image you provide. Google encourages Gemini users to animate their \u201cfavorite photos.\u201d The result is a clip that extrapolates eight seconds of movement from an initial image, making a person dance, cook, or swing a golf club. These are often both amazing and creepy. \u201cTalking head generation\u201d\u2014for employee-orientation videos, for example\u2014is also advancing. Vidnoz AI promises to generate \u201cRealistic AI Spokespersons of Any Style.\u201d A company called Arcads will generate a complete advertisement, with actors and voiceover. ByteDance, the company that operates TikTok, offers a similar product called Symphony Creative Studio. Other applications of AI video generation include virtual try-on of clothes, generating custom video games, and animating cartoon characters and people.<\/p>\n<p class=\"ArticleParagraph_root__4mszW\" data-flatplan-paragraph=\"true\">Some companies are both working with AI and simultaneously fighting to defend their content from being pilfered by AI companies. This reflects the Wild West mentality in AI right now\u2014companies exploiting legal gray areas to see how they can profit. 
As I investigated these data sets, I learned about an incident involving TED\u2014again, one of the most-pilfered organizations in the data sets captured here, and one that is attempting to employ AI to advance its own business. In June, the Cannes Lions international advertising festival gave one of its Grand Prix awards to an ad that included deepfaked footage from a TED talk by DeAndrea Salvador, currently a state senator in North Carolina. The ad agency, DM9, \u201cused AI cloning to change her talk and repurposed it for a commercial ad campaign,\u201d Ruiter told me on a video call recently. When the manipulation was discovered, the Cannes Lions festival withdrew the award. Last month, Salvador sued DM9 along with its clients\u2014Whirlpool and Consul\u2014for misappropriation of her likeness, among other things. DM9 apologized for the incident and cited \u201ca series of failures in the production and sending\u201d of the ad. A spokesperson from Whirlpool told me the company was unaware the senator\u2019s remarks had been altered.<\/p>\n<p class=\"ArticleParagraph_root__4mszW\" data-flatplan-paragraph=\"true\">Others in the film industry have filed lawsuits against AI companies for training with their content. In June, Disney and Universal sued Midjourney, the maker of an image-generating tool that can produce images containing recognizable characters (Warner Brothers joined the lawsuit last week). The lawsuit called Midjourney a \u201cbottomless pit of plagiarism.\u201d The following month, two adult-film companies sued Meta for downloading (and distributing through BitTorrent) more than 2,000 of their videos. Neither Midjourney nor Meta has responded to the allegations, and neither responded to my request for comment. 
One YouTuber filed their own lawsuit: In August of last year, David Millette sued Nvidia for unjust enrichment and unfair competition with regard to the training of its Cosmos AI, but the case was voluntarily dismissed months later.<\/p>\n<p class=\"ArticleParagraph_root__4mszW\" data-flatplan-paragraph=\"true\">The Disney characters and the deepfaked Salvador ad are just two instances of how these tools can be damaging. The floodgates may soon be opening further. Thanks to the enormous amount of investment in the technology, generated videos are beginning to appear everywhere. One company, DeepBrain AI, pays \u201ccreators\u201d to post AI-generated videos made with its tools on YouTube. It currently offers $500 for a video that gets 10,000 views, a relatively low threshold. Companies that run social-media platforms, such as Google and Meta, also pay users for content, through ad-revenue sharing, and many directly encourage the posting of AI-generated content. Not surprisingly, a coterie of gurus has arrived to teach the secrets of making money with AI-generated content.<\/p>\n<p class=\"ArticleParagraph_root__4mszW\" data-flatplan-paragraph=\"true\">Google and Meta have also trained AI tools on large quantities of videos from their own platforms: Google has taken at least 70 million clips from YouTube, and Meta has taken more than 65 million clips from Instagram. If these companies succeed in flooding their platforms with synthetic videos, human creators could be left with the unenviable task of competing with machines that churn out endless content based on their original work. And social media will become even less social than it is.<\/p>\n<p class=\"ArticleParagraph_root__4mszW\" data-flatplan-paragraph=\"true\">I asked Peters if he knew his videos had been taken from YouTube to train AI. He said he didn\u2019t, but he wasn\u2019t surprised. \u201cI think everything\u2019s gonna get stolen,\u201d he told me. But he didn\u2019t know what to do about it. 
\u201cDo I quit, or do I just keep making videos and hope people want to connect with a person?\u201d<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Editor\u2019s note: This analysis is part of The Atlantic\u2019s investigation into how YouTube videos are taken to train AI tools. You can use the search tool directly here, to see whether videos you\u2019ve created or watched are included in the data sets. This work is part of AI Watchdog, The Atlantic\u2019s ongoing investigation into the<\/p>\n","protected":false},"author":1,"featured_media":20349,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[55],"tags":[477,1305,12515,2359,1265],"class_list":{"0":"post-20348","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-social-issues","8":"tag-companies","9":"tag-million","10":"tag-snatched","11":"tag-videos","12":"tag-youtube"},"_links":{"self":[{"href":"https:\/\/naijaglobalnews.org\/index.php?rest_route=\/wp\/v2\/posts\/20348","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/naijaglobalnews.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/naijaglobalnews.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/naijaglobalnews.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/naijaglobalnews.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=20348"}],"version-history":[{"count":0,"href":"https:\/\/naijaglobalnews.org\/index.php?rest_route=\/wp\/v2\/posts\/20348\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/naijaglobalnews.org\/index.php?rest_route=\/wp\/v2\/media\/20349"}],"wp:attachment":[{"href":"https:\/\/naijaglobalnews.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=20348"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/naijaglobalnews.org\/
index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=20348"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/naijaglobalnews.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=20348"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}