

==Examples==
While "mainstream" companies such as [[OpenAI]], [[Anthropic]], and [[Meta]] appear to correctly follow industry-standard practices for web crawlers, others ignore them (such as [[wikipedia:Alibaba_Group|Alibaba]]<ref>{{Cite news |last=Venerandi |first=Niccolò |title=FOSS infrastructure is under attack by AI companies |url=https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies |access-date=2026-02-23 |work=LibreNews}}</ref>), causing [[wikipedia:Denial-of-service attack|distributed denial of service attacks]] that degrade access to freely accessible websites. This is a particular problem for websites that are large or contain many dynamic links.


Ethical website scrapers, known as "spiders" or crawlers, follow a minimum set of guidelines. Specifically, they honor [[wikipedia:robots.txt|robots.txt]], a text file found at the root of a domain that indicates:


On 17 March 2025, the Git source code host SourceHut announced that the service was being disrupted by large language model crawlers. Mitigations deployed to reduce the disruption included requiring login for some areas of the service and blocking IP ranges of cloud providers, which also affected legitimate use of the website.<ref>{{Cite web |date=17 Mar 2025 |title=LLM crawlers continue to DDoS SourceHut |url=https://status.sr.ht/issues/2025-03-17-git.sr.ht-llms/ |website=sr.ht status |url-status=live |archive-url=http://web.archive.org/web/20251220125852/https://status.sr.ht/issues/2025-03-17-git.sr.ht-llms/ |archive-date=20 Dec 2025}}</ref> In response, SourceHut founder Drew DeVault wrote a blog post titled "[https://drewdevault.com/2025/03/17/2025-03-17-Stop-externalizing-your-costs-on-me.html Please stop externalizing your costs directly into my face]", describing his frustration with ongoing, ever-adapting attacks that must be addressed promptly to limit disruption to legitimate SourceHut users. DeVault estimates that "20-100%" of his time is now spent addressing such attacks.


==References==