{{Incomplete}}
{{Ph-T-Int}}
'''AI training''' is the process of feeding data into an AI model in order to adjust its weights, so that the model's outputs closely match the patterns in its training data.


==How it works==
{{Ph-T-HIW}}
There are several ways to implement AI models, and even more ways to train them; the most well-known training algorithm is [[wikipedia:Backpropagation|backpropagation]]. With respect to data, LLMs must be trained on massive amounts of text, a task that is only feasible via automation. This contrasts with curated data-sets, where both the data collection and the training happen in a more carefully controlled environment. Automated training on massive data-sets typically uses internet websites as sources. The scraping process is similar to how web [[wikipedia:Search_engine|search-engines]] index and [[wikipedia:Cache_(computing)|cache]] pages.
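As a toy illustration of the weight-adjustment idea behind backpropagation, the loop below runs gradient descent on a single linear "neuron". The function being learned, the learning rate, and the epoch count are illustrative assumptions, not how any production LLM is trained:

```python
# Minimal sketch of training by gradient descent, assuming a single
# linear "neuron" y = w * x and squared-error loss. Illustrative only.

def train(pairs, lr=0.1, epochs=100):
    w = 0.0  # start with an untrained weight
    for _ in range(epochs):
        for x, target in pairs:
            y = w * x                     # forward pass: model prediction
            grad = 2 * (y - target) * x   # dLoss/dw for loss = (y - target)^2
            w -= lr * grad                # nudge the weight against the gradient
    return w

# The "data fed into the model": samples of the function y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = train(data)  # converges toward w = 2.0
```

Real backpropagation applies this same gradient rule through many layers of weights via the chain rule; the principle of "adjust weights to reduce output error" is unchanged.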


==Why it is a problem==
{{Ph-T-WIIAP}}
 
===Intellectual property laundering===
Most, if not all, of the training data is copied indiscriminately, without checking licenses or other copyright terms.{{Citation needed}} This is highly controversial. Some argue it is "fair use" because AI systems learn in ways similar to animal and human brains; others counter that it is more like [[wikt:parroting|a parrot learning phrases]]. Some claim the result is "transformative" and therefore still fair use, while others say it is akin to [[wikipedia:Tracing_(art)|tracing images]] (an analogy that applies mostly to image models, though it can also work for text models).{{Citation needed|reason=too many opinions}}
 
Ultimately, it depends a lot on the technical details of how each model works, so none of those arguments are universal.
 
Some people request that, at the very least, the sources of the training data must be publicly disclosed, for the sake of [[wikipedia:Transparency_(behavior)|transparency]] and [[wikipedia:Attribution_(copyright)|attribution]].<ref>{{Cite web |last=Tunney |first=Justine |date=2024-08-23 |title=AI Training Shouldn't Erase Authorship |url=https://justine.lol/history/ |access-date=2026-04-26}}</ref>
 
===Energy use===
While [[Self-hosting|self-hosted]] models can be trained on a single consumer-grade [[wikipedia:Graphics_processing_unit|GPU]], corporate-grade (or "enterprise") models are trained in data-centers with hundreds or thousands of GPUs, which are more power-hungry than CPUs. This energy use can worsen [[wikipedia:Climate_change|climate change]].
 
===Bandwidth abuse===
Massive data needs massive bandwidth. Scraping web-pages across the entire internet requires sending millions of requests to all known servers. Some AI companies go as far as to ''repeatedly'' request the same content (or several revisions of the same content) in frequent bursts over short intervals, making the traffic indistinguishable from a [[wikipedia:Denial-of-service_attack|distributed denial-of-service (DDoS) attack]].
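By contrast, a "polite" crawler rate-limits its own requests instead of sending bursts. A minimal sketch, where the <code>fetch</code> callback and the delay value are illustrative assumptions rather than any real crawler's settings:

```python
# Minimal sketch of a rate-limited ("polite") crawler loop, as opposed to
# DDoS-like request bursts. The fetch callback and delay are assumptions.
import time

def crawl(urls, fetch, delay_seconds=1.0):
    """Fetch each URL in turn, pausing between requests to the server."""
    results = []
    for url in urls:
        results.append(fetch(url))
        time.sleep(delay_seconds)  # wait before sending the next request
    return results
```

Real crawlers additionally keep per-host request budgets, and some honor the non-standard <code>Crawl-delay</code> hint that sites publish in <code>robots.txt</code>.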
 
===Chip shortage===
Many AI companies have pre-ordered massive quantities of computer components, more than can even fit in their current data-centers, in anticipation of ''more'' data-centers being built.{{Citation needed}} This has made such components scarce and caused prices to spike, most notably the [[wikipedia:2024–present_global_memory_supply_shortage|increase in RAM prices]].


==Examples==
While "mainstream" companies such as [[OpenAI]], [[Anthropic]], and [[Meta]] appear to follow industry-standard practices for web crawlers, others ignore them (such as [[wikipedia:Alibaba_Group|Alibaba]]<ref>{{Cite news |last=Venerandi |first=Niccolò |title=FOSS infrastructure is under attack by AI companies |url=https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies |access-date=2026-02-23 |work=LibreNews |archive-url=http://web.archive.org/web/20260217195639/https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/ |archive-date=17 Feb 2026}}</ref>), effectively mounting DDoS attacks that degrade access to freely-accessible websites. This is particularly an issue for websites that are large or contain many dynamic links.


Ethical web scrapers (also known as "spiders", because they crawl the web) follow a minimum set of guidelines. Specifically, they honor <code>[[wikipedia:robots.txt|robots.txt]]</code>, a text file found at the root of a domain that indicates: