Illustration of the Wikipedia website application. Image Credits: Riccardo Milani / Hans Lucas via AFP / Getty Images

The Wikimedia Foundation, the umbrella organization of Wikipedia and a dozen or so other crowdsourced knowledge projects, said on Wednesday that bandwidth consumption for multimedia downloads from Wikimedia Commons has surged by 50% since January 2024.

The reason, the outfit wrote in a blog post on Tuesday, isn't growing demand from knowledge-hungry humans, but automated, data-hungry scrapers looking to train AI models.

“Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs,” the post reads.

Wikimedia Commons is a freely accessible repository of images, videos, and audio files that are available under open licenses or are otherwise in the public domain.

Digging into the numbers, Wikimedia says that almost two-thirds (65%) of the most “expensive” traffic (that is, the most resource-intensive in terms of the kind of content consumed) was from bots. However, just 35% of overall pageviews come from these bots. The reason for this disparity, according to Wikimedia, is that frequently accessed content stays closer to the user in its cache, while less frequently accessed content is stored farther away in the “core data center,” which is more expensive to serve content from. This is exactly the kind of content that bots typically go looking for.

“While human readers tend to focus on specific – often similar – topics, crawler bots tend to ‘bulk read’ larger numbers of pages and visit also the less popular pages,” Wikimedia writes. “This means these types of requests are more likely to get forwarded to the core data center, which makes it much more expensive in terms of consumption of our resources.”
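
To see why that matters, here is a toy Python model of the dynamic. The cost ratio, cache size, and catalogue size are invented purely for illustration; only the 65/35 split above comes from Wikimedia. The point is that a crawler sweeping the long tail misses the cache far more often, so a minority of pageviews can generate a majority of the serving cost.

```python
import random

# Illustrative-only cost model: serving from the edge cache is cheap, while a miss
# that falls through to the core data center is assumed to be 10x more expensive.
CACHE_HIT_COST = 1
CORE_MISS_COST = 10
CACHED_PAGES = set(range(1_000))    # the "popular" pages kept close to readers
TOTAL_PAGES = 100_000               # the full catalogue, mostly long-tail content

def request_cost(page_id: int) -> int:
    return CACHE_HIT_COST if page_id in CACHED_PAGES else CORE_MISS_COST

# Humans cluster on popular pages; bulk-reading bots sweep the whole catalogue.
human_views = [random.randrange(1_000) for _ in range(65_000)]      # ~65% of pageviews
bot_views = [random.randrange(TOTAL_PAGES) for _ in range(35_000)]  # ~35% of pageviews

human_cost = sum(request_cost(p) for p in human_views)
bot_cost = sum(request_cost(p) for p in bot_views)
total = human_cost + bot_cost
print(f"Bots: 35% of pageviews, {bot_cost / total:.0%} of serving cost")
```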

The long and short of all this is that the Wikimedia Foundation's site reliability team has to spend a lot of time and resources blocking crawlers to avert disruption for regular users. And all this before we consider the cloud costs the Foundation is faced with.

In truth, this is part of a fast-growing trend that is threatening the very existence of the open internet. Last month, software engineer and open source advocate Drew DeVault bemoaned the fact that AI crawlers ignore “robots.txt” files that are designed to ward off automated traffic. And “pragmatic engineer” Gergely Orosz also complained last week that AI scrapers from companies such as Meta have driven up bandwidth demands for his own projects.
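
For context, robots.txt is purely advisory. A well-behaved crawler checks it before fetching anything, as in this minimal Python sketch (the “ExampleAIBot” user agent is hypothetical); nothing technically stops a scraper from skipping the check, which is exactly DeVault's complaint.

```python
import urllib.robotparser

# Fetch and parse the site's crawl rules, then ask whether a given user agent
# may fetch a given URL. Nothing enforces the answer; compliance is voluntary.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://commons.wikimedia.org/robots.txt")
rp.read()

url = "https://commons.wikimedia.org/wiki/Special:Random"
if rp.can_fetch("ExampleAIBot", url):
    print("robots.txt permits fetching", url)
else:
    print("robots.txt asks this crawler not to fetch", url)
```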

While open source infrastructure, in particular, is in the firing line, developers are fighting back with “cleverness and vengeance,” as TechCrunch wrote last week. Some tech companies are doing their bit to address the issue, too: Cloudflare, for example, recently launched AI Labyrinth, which uses AI-generated content to slow crawlers down.
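
Cloudflare hasn't published AI Labyrinth's internals, but the general “tarpit” idea it builds on is easy to sketch: when a request looks like it comes from a scraper, serve an endless maze of generated links instead of real content. The snippet below is a bare-bones illustration of that idea only, not Cloudflare's implementation; the user-agent substrings and decoy pages are just examples.

```python
import random
import string
from http.server import BaseHTTPRequestHandler, HTTPServer

# Example user-agent substrings associated with known AI crawlers; illustrative only.
SUSPECT_AGENTS = ("GPTBot", "CCBot", "Bytespider")

def random_slug(n: int = 12) -> str:
    return "".join(random.choices(string.ascii_lowercase, k=n))

def fake_page() -> bytes:
    # A real deployment would fill this with AI-generated filler text; here it is
    # just a page of links that lead only to more generated pages.
    links = "".join(f'<a href="/maze/{random_slug()}">more</a><br>\n' for _ in range(20))
    return f"<html><body>{links}</body></html>".encode()

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        ua = self.headers.get("User-Agent", "")
        if self.path.startswith("/maze/") or any(a in ua for a in SUSPECT_AGENTS):
            body = fake_page()  # decoy content that wastes the bot's crawl budget
        else:
            body = b"<html><body>Real article content here.</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), Handler).serve_forever()
```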

However, it's very much a cat-and-mouse game that could ultimately force many publishers to duck for cover behind logins and paywalls, to the detriment of everyone who uses the web today.