Artificial intelligence/training
While [[Self-hosting|self-hosted]] models can be trained on a single consumer-grade GPU, corporate-grade (or "enterprise") models are trained in data centers with hundreds or thousands of GPUs, which draw considerably more power than CPUs. This can worsen [[wikipedia:Climate_change|climate change]].
===Bandwidth abuse===
Massive data needs massive bandwidth. Scraping web pages across the entire internet means sending millions of requests to every known server. Some AI companies go as far as ''repeatedly'' requesting the same content (or several revisions of the same content) in frequent bursts over short intervals, a traffic pattern indistinguishable from [[wikipedia:Denial-of-service_attack|distributed denial-of-service (DDoS) attacks]].
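For contrast, a minimal sketch of the opposite, polite behavior is shown below: requests are spaced out rather than sent in bursts, and content that has already been fetched is never requested again. The URLs and the two-second delay are illustrative placeholders, not any real crawler's configuration.

<syntaxhighlight lang="python">
# Sketch of polite crawling behavior, for contrast with the burst pattern
# described above: one request at a time, a pause between requests, and no
# re-fetching of URLs that have already been retrieved.
# The URLs and the delay are illustrative assumptions.
import time
import urllib.error
import urllib.request

urls = [
    "https://example.org/",
    "https://example.org/page-two.html",
    "https://example.org/",              # duplicate: must not be fetched twice
]

seen = set()
for url in urls:
    if url in seen:
        continue                         # never re-request content already held
    seen.add(url)
    try:
        with urllib.request.urlopen(url) as response:
            print(f"fetched {url} ({len(response.read())} bytes)")
    except urllib.error.URLError as err:
        print(f"could not fetch {url}: {err}")
    time.sleep(2.0)                      # space requests out instead of bursting
</syntaxhighlight>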
==Examples==
While "mainstream" companies such as [[OpenAI]], [[Anthropic]], and [[Meta]] appear to follow industry-standard practices for web crawlers, others ignore them (such as [[wikipedia:Alibaba_Group|Alibaba]]<ref>{{Cite news |last=Venerandi |first=Niccolò |title=FOSS infrastructure is under attack by AI companies |url=https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies |access-date=2026-02-23 |work=LibreNews |archive-url=http://web.archive.org/web/20260217195639/https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/ |archive-date=2026-02-17}}</ref>), causing DDoS attacks that damage access to freely accessible websites. This is particularly an issue for websites that are large or contain many dynamic links.
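The industry-standard practice referred to above centers on <code>robots.txt</code>, described below. A minimal sketch of such a compliance check, using Python's standard-library <code>urllib.robotparser</code>, follows; the user-agent string and URLs are illustrative assumptions, not taken from any real crawler.

<syntaxhighlight lang="python">
# Sketch of a robots.txt compliance check using only the Python standard
# library. The user-agent name and target URL are hypothetical examples.
from urllib import robotparser

USER_AGENT = "ExampleResearchBot"              # hypothetical crawler name
TARGET = "https://example.org/archive/page-1.html"

rp = robotparser.RobotFileParser()
rp.set_url("https://example.org/robots.txt")   # robots.txt sits at the domain root
rp.read()                                      # fetch and parse the file

if rp.can_fetch(USER_AGENT, TARGET):
    delay = rp.crawl_delay(USER_AGENT) or 1.0  # respect Crawl-delay if the site sets one
    print(f"allowed to fetch {TARGET}; waiting {delay}s between requests")
else:
    print(f"robots.txt disallows {TARGET} for {USER_AGENT}")
</syntaxhighlight>

A crawler that respects the <code>Disallow</code> rules and any <code>Crawl-delay</code> value avoids both the unwanted requests and the burst pattern described in the previous section.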
Ethical web scrapers, known as "spiders" that crawl the web, follow a minimum set of guidelines. Specifically, they respect <code>[[wikipedia:robots.txt|robots.txt]]</code>, a text file found at the root of a domain that indicates: