Artificial intelligence/training: Difference between revisions

Revision as of 03:00, 27 April 2026

⚠️This article has been marked as incomplete. Sourcing or verifiability needs additional work.


AI training is the process by which data is fed into an AI model in order to adjust its weights, so that the model's output more closely matches the patterns in its training data.

How it works

There are several ways to implement AI, and even more ways to train it; the best-known training technique is backpropagation. LLMs must be trained on massive amounts of data, a task that is only feasible through automation. This is in contrast to curated data-sets, where both the data and the training are handled in a more carefully controlled environment. Automated training on massive data-sets typically uses internet websites as sources, and the scraping process is similar to how web search engines index and cache pages.
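
As a rough illustration of what "adjusting weights" means, the sketch below fits a model with a single weight using gradient descent, the update step that backpropagation supplies gradients for. The function name, learning rate, and training data are invented for this example; real models have billions of weights and far more elaborate training loops.

```python
def train_single_weight(samples, learning_rate=0.05, epochs=200):
    """Fit y ≈ w * x by gradient descent on the mean squared error."""
    w = 0.0  # the model's only weight, starting from an arbitrary value
    for _ in range(epochs):
        # Gradient of mean((w*x - y)^2) with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in samples) / len(samples)
        w -= learning_rate * grad  # nudge the weight to reduce the error
    return w


# Hypothetical training data that follows y = 3x exactly.
samples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
learned_w = train_single_weight(samples)
print(round(learned_w, 6))
```

After enough passes over the data, the weight settles near 3, which is the value that makes the model's output match the examples it was shown.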

Why it is a problem

Intellectual property laundering

Most, if not all, of the data used for training is copied indiscriminately, without checking licenses or copyright terms.^{[citation needed]} This is very controversial. Some people argue that it is "fair use" because AI systems learn in ways similar to animal and human brains; others claim it is more like a parrot learning phrases. Some claim that the result is "transformative" and therefore still fair use, while others say it is akin to tracing images (this applies mostly to image models, though the analogy can work for text models).^{[citation needed - too many opinions]}

Ultimately, it depends a lot on the technical details of how each model works, so none of those arguments are universal.

Some people request that, at the very least, the sources of the training data be publicly disclosed, for the sake of transparency and attribution.^[1]

Energy use

While self-hosted models can be trained on a single consumer-grade GPU, corporate-grade (or "enterprise") models are trained in data-centers with hundreds or thousands of GPUs, which are known for being more power-hungry than CPUs. The resulting energy consumption can worsen climate change.

Bandwidth abuse

Massive data needs massive bandwidth. Scraping web pages across the entire internet requires sending millions of requests to all known servers. Some AI companies go as far as to repeatedly request the same content (or several revisions of the same content) in frequent bursts at short intervals, which, from the server's side, is indistinguishable from a DDoS attack.

Examples

While "mainstream" companies such as OpenAI, Anthropic, and Meta appear to follow industry-standard practices for web crawlers, others (such as Alibaba^[2]) ignore them, causing what amount to distributed denial-of-service attacks that damage access to freely accessible websites. This is particularly an issue for websites that are large or contain many dynamic links.

Ethical web scrapers, known as "spiders" because they crawl the web, follow a minimum set of guidelines. Specifically, they follow robots.txt, a text file found at the root of a domain that indicates:

Paths bots are allowed to index
Paths bots should not index
How long the bot should wait in between requests to the server, to reduce load
The sitemap of the website's content

These rules are typically configured for all bots, with minor adjustments made to individual bots as needed. Additionally, specific web pages may use the robots meta tag to control use of their output.
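
The four kinds of rules listed above can be seen in a small, made-up robots.txt file. The sketch below reads it with Python's standard `urllib.robotparser` module; the domain, paths, and bot names are placeholders rather than rules taken from any real site. Note that `Crawl-delay` is a widely used extension rather than part of the core standard, so not every crawler honors it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an imaginary wiki.
ROBOTS_TXT = """\
User-agent: *
Allow: /articles/
Disallow: /index.php
Crawl-delay: 10

User-agent: ExampleBot
Disallow: /

Sitemap: https://wiki.example.org/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse the text directly, no network access

base = "https://wiki.example.org"
# Rules for all bots: articles may be indexed, script URLs may not.
allowed_article = parser.can_fetch("PoliteBot", base + "/articles/Main_Page")
blocked_script = parser.can_fetch("PoliteBot", base + "/index.php?action=history")
# A per-bot adjustment: ExampleBot is excluded from the whole site.
blocked_bot = parser.can_fetch("ExampleBot", base + "/articles/Main_Page")
# Seconds the site asks bots to wait between requests.
delay_seconds = parser.crawl_delay("PoliteBot")
# Sitemap URLs declared by the site.
sitemaps = parser.site_maps()
```

A well-behaved crawler would check `can_fetch` before every request and sleep for `delay_seconds` between requests to the same host.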

While it is good practice for a bot to respect robots.txt, there is no requirement to do so, and there is no punishment for not following a website's wishes. It is additionally standard practice, but in no way enforced, that bots use a User-Agent header to uniquely identify themselves. This allows a website operator to observe a bot's traffic patterns, potentially blocking the bot outright if its scraping is not desirable. The header also typically contains a URL or email address that can be used to contact the operator in case of anomalies observed in its traffic.
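
A minimal sketch of such self-identification, using Python's standard `urllib.request` module, might look like the following. The bot name, version, URL, and email address are hypothetical, and the request is only constructed here, not sent.

```python
import urllib.request

# Hypothetical bot name, version, and contact details.
USER_AGENT = "ExampleBot/1.0 (+https://bot.example.org/about; bot-admin@example.org)"


def build_request(url):
    """Create a request that identifies the bot instead of imitating a browser."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


request = build_request("https://wiki.example.org/articles/Main_Page")
# urllib stores header names with only the first letter capitalized.
sent_header = request.get_header("User-agent")
```

An operator who sees this string in the server logs can tell the traffic apart from browsers, apply a bot-specific rule, or use the contact details to report a problem.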

Unethical AI scraper bots do not follow robots.txt; in fact, they may not even request the file at all. Instead, they start from an entry point such as the root home page (/) and work their way through an exponentially growing list of links as they find them, with little to no delay between requests. The bots use false User-Agent header strings that correspond to real web browsers on desktop or mobile operating systems, so blocking them would also block legitimate users, or at least legitimate users on VPNs.

Some AI services opt to use separate User-Agent strings, potentially also ignoring robots.txt, when a request is made through user command rather than as part of model training. For example, ChatGPT identifies itself as ChatGPT-User rather than its standard OpenAI when it uses the "search the web" command - even if searching the web was an automatic decision. In a less favorable example, Perplexity AI in this same situation falsely identifies as a standard Chrome web browser running on Windows. AI companies defend this under the belief that they are not a "spider", but rather a "user agent" (like a web browser), when called upon by a user's request.^[3]

Less legitimate bots use a wide distribution of IP addresses, further reducing options for the website to protect itself. This is a clear attempt to bypass any IP-based request throttling and rate limiting the website may implement. They are also known to ignore HTTP response status codes that indicate a server error (5xx), or warnings that the client needs to slow down (429 Too Many Requests) or has been entirely blocked (403 Forbidden).
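
For contrast, a crawler that does respect these status codes could decide its next step roughly as sketched below. The function name, the default delays, and the exponential backoff schedule are assumptions made for illustration; only the numeric seconds form of the standard `Retry-After` header is handled.

```python
def next_action(status_code, retry_after=None, attempt=0,
                base_delay=5.0, max_delay=3600.0):
    """Return ("stop", None), ("wait", seconds), or ("continue", None)."""
    if status_code == 403:
        # The site has blocked this client: stop crawling it entirely.
        return ("stop", None)
    if status_code == 429 or 500 <= status_code <= 599:
        # Prefer the server's own hint when it is given as a number of seconds.
        if retry_after is not None and retry_after.isdigit():
            return ("wait", float(retry_after))
        # Otherwise back off exponentially, up to a ceiling.
        return ("wait", min(max_delay, base_delay * 2 ** attempt))
    return ("continue", None)


blocked = next_action(403)
throttled = next_action(429, retry_after="120")
server_error = next_action(503, attempt=3)
ok = next_action(200)
```

The point of the design is that the wait grows with every failed attempt, so a struggling server sees less traffic from the crawler rather than more.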

Effect on users

To protect against unethical crawlers, due to concerns of both intellectual property and service disruption, websites adopt practices that affect the experience of real users:

Bot check walls: The user may be required to pass a security check "wall". While usually automatic for the user, this can affect legitimate bots. When a website protection service such as Cloudflare is not confident as to whether the visitor is legitimate, it may present a CAPTCHA to be manually filled out. An example is "Google Sorry", a CAPTCHA wall frequently seen when using Google Search via a VPN. An example that's popular in the FOSS community is Anubis.
Login walls: Should bots be found to pass CAPTCHA walls, the website may advance to requiring logging in to view content. A major recent example of this is YouTube's "Sign in to confirm you're not a bot" messages.
JavaScript requirement: Most websites do not need JavaScript to deliver their content. However, as many scrapers expect content to be found directly in the HTML, it is often an easy workaround to use JavaScript to "insert" the content after the page has loaded. This may reduce the responsiveness of the website, increase points of failure, and prevent security-conscious users who disable JavaScript from viewing the website.
IP address blocking: Blocking IP addresses, especially by blocking entire providers via their autonomous system number, always comes with some risk of blocking legitimate users. Particularly, this may restrict access to users making use of a VPN.
Heuristic blocking: Patterns in request headers may give away that the request is being made by an unethical bot, despite attempts to act as a legitimate visitor. Heuristics are imperfect and may block legitimate users, especially those that may use less common browsers.
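
To make the heuristic approach and its weakness concrete, the sketch below scores a request by counting a few warning signs in its headers. The specific signals, the function name, and the example browsers are assumptions chosen for illustration, not rules used by any particular website or protection service.

```python
def suspicion_score(headers):
    """Count hypothetical warning signs in a mapping of request headers."""
    lowered = {name.lower(): value for name, value in headers.items()}
    user_agent = lowered.get("user-agent", "")
    claims_browser = user_agent.startswith("Mozilla/")
    score = 0
    if not user_agent:
        score += 1  # no identification at all
    if claims_browser and "accept-language" not in lowered:
        score += 1  # assumption: mainstream browsers normally send this header
    if claims_browser and "accept-encoding" not in lowered:
        score += 1  # assumption: mainstream browsers normally send this header
    return score


CHROME_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
             "(KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36")

typical_browser = {
    "User-Agent": CHROME_UA,
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
}
disguised_bot = {"User-Agent": CHROME_UA}
# A made-up minimal browser that is legitimate but sends few headers.
minimal_browser = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/0.1"}

scores = {
    "typical_browser": suspicion_score(typical_browser),
    "disguised_bot": suspicion_score(disguised_bot),
    "minimal_browser": suspicion_score(minimal_browser),
}
```

The disguised bot and the uncommon but legitimate browser receive the same score, which is exactly the kind of false positive described above.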

In rare situations, a website operator may redirect detected bot traffic elsewhere, for example to speed test files hosted by ISPs that contain multiple gigabytes of random data. This may disrupt the bot, but the effectiveness of the tactic is unknown.

The need to respond to unethical scraping also further consolidates the web into the control of a few large web application firewall (WAF) services, most notably Cloudflare, as website owners find themselves otherwise unable to protect their service from being disrupted by such traffic.

Case studies

Diaspora

On 27 December 2024, the open-source social network project Diaspora noted that 70% of traffic across its infrastructure was in service of AI scrapers.^[4] Particularly, the project noted that bots had followed links to crawl every individual edit in their MediaWiki instance, causing an exponential increase in the number of unique requests being made.

LVFS

The Linux Vendor Firmware Service (LVFS) provides a free central store of firmware updates, such as for UEFI motherboards and SSD controllers. This feature is integrated with many Linux distributions through the fwupd daemon. For situations where internet access is not permitted, the service allows users to make a local mirror of the entire 100+ GB store.

On 9 January 2025, the project announced that it would introduce a login wall around its mirror feature, citing unnecessary use of its bandwidth.^[5] Up to 1,000 files may be downloaded per day without logging in. The author later mentioned on Mastodon that the problem appears to be caused by AI scraping.^[6]

LWN.net

On 21 January 2025, Jonathan Corbet, maintainer of the Linux news website LWN.net, made the following post to social.kernel.org:

Should you be wondering why @LWN #LWN is occasionally sluggish... since the new year, the DDOS onslaughts from AI-scraper bots has picked up considerably. Only a small fraction of our traffic is serving actual human readers at this point. At times, some bot decides to hit us from hundreds of IP addresses at once, clogging the works. They don't identify themselves as bots, and robots.txt is the only thing they *don't* read off the site.
This is beyond unsustainable. We are going to have to put time into deploying some sort of active defenses just to keep the site online. I think I'd even rather be writing about accounting systems than dealing with this cr*p. And it's not just us, of course; this behavior is going to wreck the net even more than it's already wrecked.

He later commented:^[7]

We do indeed see a kind of pattern. Every IP stays below the threshold for our fuses, but the overload is overwhelming. Any form of active defense will probably have to figure out to block entire subnets instead of individual addresses, and even that might not be enough.
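
The pattern described in these posts, where each address stays under a per-IP limit while the combined load is overwhelming, can be illustrated by counting requests per subnet instead of per address. The sketch below uses Python's standard `ipaddress` module; the /24 grouping, the limits, and the log data (drawn from address ranges reserved for documentation) are assumptions for this example and do not describe LWN's actual defenses.

```python
import ipaddress
from collections import Counter


def requests_per_subnet(client_ips, prefix_length=24):
    """Count requests per IPv4 subnet of the given prefix length."""
    counts = Counter()
    for ip in client_ips:
        # strict=False lets a host address stand in for its enclosing network.
        network = ipaddress.ip_network(f"{ip}/{prefix_length}", strict=False)
        counts[str(network)] += 1
    return counts


def subnets_over_limit(client_ips, limit, prefix_length=24):
    """Return the subnets whose combined request count exceeds the limit."""
    counts = requests_per_subnet(client_ips, prefix_length)
    return sorted(net for net, n in counts.items() if n > limit)


# Hypothetical log: 200 addresses in one /24, five requests each,
# plus one ordinary reader making eight requests.
log = [f"203.0.113.{host}" for host in range(1, 201) for _ in range(5)]
log += ["198.51.100.7"] * 8

busiest_single_ip = max(Counter(log).values())
flagged = subnets_over_limit(log, limit=100)
```

In this made-up log, no single address makes more than eight requests, yet one subnet accounts for a thousand. Blocking at the subnet level catches that, at the cost of also blocking any legitimate reader who shares the range.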

MediaWiki, Wikipedia, and the Wikimedia Foundation

MediaWiki is of particular interest to LLM training due to the vast amount of factual, plain-text content wikis tend to hold. While Wikipedia and the Wikimedia Foundation host the most well-known wikis, numerous smaller wikis exist thanks to the work of many independent editors. The strength of wiki architecture is its ability for every edit to be audited by anyone, at any time - you can still view the first edit to Wikipedia from 2002. This makes wikis a hybrid of a static website and a dynamic web app, which becomes problematic when poorly-designed bots attempt to scrape them.^[4]

The Apple Wiki, which documents internal details of Apple's hardware and software, holds more than 50,000 articles. On 2 August 2024, and again on 5 January 2025, the service was disrupted by scraping efforts.^[8] The wiki contains a considerable amount of information that is scraped by legitimate security research tools, making it difficult for the website to block non-legitimate requests. Efforts to block unethical scraping and protect the wiki have disrupted these legitimate tools. The large article count, combined with more than 280,000 total edits over the wiki's lifetime, creates an untenable situation in which it is simply not possible to scrape the website without causing significant service disruption.

On 1 April 2025, the Wikimedia Foundation indicated that its infrastructure has been under increasing pressure from content scraping bots since January 2024, with the particularly critical metric that "65% of our most expensive traffic comes from bots", even though bots account for an estimated 35% of all traffic. The bots create traffic patterns that are significantly unlike human traffic patterns, effectively bypassing Wikimedia's caching infrastructure and placing significant load on the core servers. A blog post provides an example where bot traffic caused the Wikimedia Commons service to become unstable during a human traffic spike. The Foundation is considering the introduction of a Responsible Use of Infrastructure policy to ensure the continued stability of its services.^[9]
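
The two percentages can be combined into a rough comparison of how often a bot request lands in the "most expensive" category relative to a human request. The calculation below is a derived estimate, not a figure published by the Foundation, and it assumes both percentages count the same kind of request.

```python
def relative_expensive_rate(bot_share_of_expensive, bot_share_of_all):
    """How many times likelier a bot request is to be expensive than a human one."""
    # Expensive requests per unit of traffic, for bots and for humans.
    bot_rate = bot_share_of_expensive / bot_share_of_all
    human_rate = (1 - bot_share_of_expensive) / (1 - bot_share_of_all)
    return bot_rate / human_rate


ratio = relative_expensive_rate(0.65, 0.35)
print(round(ratio, 2))
```

Under that assumption, a bot request is roughly 3.4 times as likely as a human request to be one of the expensive ones, which is consistent with the description of bots bypassing the caching layer.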

Perplexity AI and news outlets

Perplexity AI, founded in August 2022, is a service built on large language models that aims to be viewed as a general search engine. It encourages users to consume news through its summaries of stories.

On 15 June 2024, an investigation by Apple blog MacStories found that Perplexity does not follow its own documented policies when accessing content the user requests from the web. In their testing, the scraper pretended to be Chrome 111 running on Windows 10, connecting from an IP address not found in Perplexity's publicly-listed IP address ranges.^[10] MacStories' findings were confirmed by a WIRED investigation.^[11] Perplexity responded by removing its list of IP addresses.

On 27 June 2024, Amazon announced an investigation into Perplexity AI, suggesting the behavior may be considered abusive under Amazon Web Services terms of service:^[3]

"AWS's terms of service prohibit abusive and illegal activities and our customers are responsible for complying with those terms," [AWS spokesperson Patrick] Neighorn said in a statement. "We routinely receive reports of alleged abuse from a variety of sources and engage our customers to understand those reports."

Read the Docs

In an early example, on 25 July 2024, open source documentation website Read the Docs detailed cases of abusive bots downloading large amounts of content from the service. Particularly, the significant range of IP addresses used in an aggressive manner rendered existing rate limiting ineffective. Taking action to block traffic identified by Cloudflare as "AI crawlers" reduced bandwidth requirements by 75%, at a cost saving of $1,500 USD/month.^[12]

SourceHut and Fedora Linux

On 15 March 2025, an infrastructure manager for the Fedora Linux open source project discussed what was assumed to be a large language model crawler attack against the Pagure.io Git source code hosting service. The project made the decision to block the entire country of Brazil for some time, while also blocking access to certain repositories whose traffic was creating significant CPU usage.^[13]^[14]

On 17 March 2025, the Git source code host SourceHut announced that the service was being disrupted by large language model crawlers. Mitigations deployed to reduce disruption involved requiring login for some areas of the service and blocking IP ranges of cloud providers, which affected legitimate use of the website by its users.^[15] In response to the event, SourceHut founder Drew DeVault wrote a blog post entitled "Please stop externalizing your costs directly into my face", discussing his frustration with ongoing and ever-adapting attacks that must be addressed in a timely fashion to reduce disruption to legitimate SourceHut users. DeVault estimates that "20-100%" of his time is now spent addressing such attacks.

References

[1] Tunney, Justine (23 Aug 2024). "AI Training Shouldn't Erase Authorship". Retrieved 26 Apr 2026.
[2] Venerandi, Niccolò. "FOSS infrastructure is under attack by AI companies". LibreNews. Archived from the original on 17 Feb 2026. Retrieved 23 Feb 2026.
[3] Mehrotra, Dhruv (27 Jun 2024). "Amazon Is Investigating Perplexity Over Claims of Scraping Abuse". WIRED. Archived from the original on 2 Feb 2026.
[4] Schubert, Dennis (27 Dec 2024). "In the last 60 days, the diaspora* web assets received 11.3 million requests ..." Archived from the original on 3 Dec 2025 – via Diaspora*.
[5] Hughes, Richard (9 Jan 2025). "Authentication soon required to mirror the entire LVFS". Linux Vendor Firmware Service (LVFS) mailing list. Archived from the original on 1 Sep 2025.
[6] Hughes, Richard (22 Jan 2025). "Commentary citing 'Authentication soon required to mirror the entire LVFS'". Archived from the original on 19 Oct 2025 – via Mastodon.
[7] Knop, Dirk (22 Jan 2025). "AI bots paralyze Linux news site and others". Heise (in English and Deutsch). Archived from the original on 18 Oct 2025.
[8] "Bot traffic abuse". The Apple Wiki. Archived from the original on 30 Jan 2026.
[9] Mueller, Birgit; Danis, Chris; Lavagetto, Giuseppe (1 Apr 2025). "How crawlers impact the operations of the Wikimedia projects". Wikimedia Foundation. Archived from the original on 7 Feb 2026.
[10] Knight, Robb (15 Jun 2024). "Perplexity AI Is Lying about Their User Agent". Archived from the original on 23 Jan 2026.
[11] Mehrotra, Dhruv; Marchman, Tim (19 Jun 2024). "Perplexity Is a Bullshit Machine". WIRED. Archived from the original on 1 Feb 2026.
[12] Holscher, Eric (25 Jul 2024). "AI crawlers need to be more respectful". Read the Docs. Archived from the original on 28 Oct 2025.
[13] Fenzi, Kevin (15 Mar 2025). "Mid March infra bits 2025". Archived from the original on 15 Feb 2026.
[14] Fenzi, Kevin (29 Mar 2025). "Late March infra bits 2025". Archived from the original on 18 Oct 2025.
[15] "LLM crawlers continue to DDoS SourceHut". sr.ht status. 17 Mar 2025. Archived from the original on 20 Dec 2025.


@@ Line 7: / Line 7: @@
 ==Why it is a problem==
-=== Intellectual property laundering ===
+===Intellectual property laundering===
 Most, if not all, of the data used for training is copied indiscriminately, without even checking licenses or any copyright terms.{{Citation needed}} This is very controversial. Some people argue that it is "fair use" because AI systems learn in ways similar to animal and human brains, others claim it's more like [[wikt:parroting|a parrot learning phrases]], others claim that it's "transformative" so it's still fair-use, others say it's akin to [[wikipedia:Tracing_(art)|tracing images]] (this applies mostly to image models, though the analogy can work for text models).{{Citation needed|reason=too many opinions}}
@@ Line 14: / Line 14: @@
 Some people request that, at the very least, the sources of the training data must be publicly disclosed, for the sake of [[wikipedia:Transparency_(behavior)|transparency]] and [[wikipedia:Attribution_(copyright)|attribution]].<ref>{{Cite web |last=Tunney |first=Justine |date=2024-08-23 |title=AI Training Shouldn't Erase Authorship |url=https://justine.lol/history/ |access-date=2026-04-26}}</ref>
-=== Ecosystem damage ===
+===Energy use===
-TO-DO
+While [[Self-hosting|self-hosted]] models can be trained with a single consumer-grade GPU, data-centers with hundreds or thousands of GPUs (known for being more power-hungry than CPUs) are used to train corporate-grade (or "enterprise") models. This can worsen [[wikipedia:Climate_change|climate change]].
+=== Bandwidth abuse ===
+Massive data needs massive bandwidth. Scraping web-pages across the entire internet requires sending millions of requests to all known servers. Some AI companies go as far as to ''repeatedly'' send requests for the same content (or several revisions of the same content) as frequent bursts in short intervals, which is indistinguishable from [[wikipedia:Denial-of-service_attack|DDoS attacks]].
 ==Examples==