About 245 news organisations across nine countries are blocking AI firms from mining the Internet Archive's vast web history, highlighting a growing clash between digital preservation and copyright as AI training raises new legal questions.

News publishers are drawing a line around the Internet Archive as they try to stop AI firms from mining old web pages for training data, turning a long-standing preservation tool into an unexpected front in the copyright fight. Euronews reported that about 245 news organisations in nine countries are now seeking to block at least one of the Archive’s crawlers, with many of the affected sites belonging to major publishers including USA Today’s parent company. The concern is no longer just about search or storage, but about whether archived journalism is being repurposed without permission or payment.

The scale of the Archive explains why the issue has become so sensitive. With more than a trillion web pages saved since 1996, the Wayback Machine has become a crucial record of disappearing or altered online material, including reporting from outlets such as CNN, The New York Times, The Guardian and USA Today. For historians, lawyers and editors, it can provide proof of what was published and when. For AI companies, the same trove offers structured, dated text and images that are attractive for training large language models.

That tension is now feeding into a wider legal and commercial struggle over journalism and artificial intelligence. Reuters has reported in recent months that major publishers, including The New York Times, are pursuing AI companies over copyright and licensing, while The Atlantic has noted that courts are still defining how copyright applies to AI-generated and AI-assisted work. In that environment, publishers see archived copies not as neutral history, but as another possible route for systems to ingest their work at scale.

The Internet Archive insists it is caught in the middle. Mark Graham, director of the Wayback Machine, has argued that the real problem is AI companies using archive interfaces as a shortcut to content they did not create, and the Archive has in some cases tried to curb bulk downloads and automated extraction. At the same time, it says preservation remains essential, because pages can be edited, removed or quietly rewritten after publication. Some publishers, including The Guardian, have opted for tighter limits rather than complete blocks, while digital rights campaigners and journalists are pushing back against broad restrictions that could erase pieces of the web's public memory.

Source Reference Map

Inspired by headline at: [1]

Sources by paragraph:

- Paragraph 1: [2], [6]
- Paragraph 2: [1], [6]
- Paragraph 3: [3], [4]
- Paragraph 4: [1], [2], [7]

Source: Noah Wire Services