Rudxain (talk | contribs)
m link GPU
Rudxain (talk | contribs)
m link scraping
 
(7 intermediate revisions by 3 users not shown)
Line 1: Line 1:
{{Incomplete}}
{{Incomplete}}
'''AI training''' is a process by which data is fed into an AI model, in order to adjust its weights. This makes the output of the model closely match that of its input.
'''AI training''' is a process by which data is fed into an [[Artificial intelligence|AI]] model, in order to adjust its weights. This makes the output of the model closely match that of its input.


==How it works==
==How it works==
There are several ways to implement AI, and even more ways to train them, the most well-known being [[wikipedia:Backpropagation|backpropagation]]. With respect to the data-set, LLMs must be trained on massive amounts of data, which is a task that's only feasible via automation. This is in contrast to curated data-sets, in which both the data and the training is done in a more carefully controlled environment. Automated training on massive data-sets is typically done using internet web-sites as sources. The process of scraping is similar to how web-[[wikipedia:Search_engine|search-engines]] index and [[wikipedia:Cache_(computing)|cache]] pages.
There are several ways to implement AI, and even more ways to train them, the most well-known being [[wikipedia:Backpropagation|backpropagation]]. With respect to the data-set, LLMs must be trained on massive amounts of data, which is a task that's only feasible via automation. This is in contrast to curated data-sets, in which both the data and the training is done in a more carefully controlled environment. Automated training on massive data-sets is typically done using internet web-sites as sources. The process of [[wikipedia:Web_scraping|scraping]] is similar to how web-[[wikipedia:Search_engine|search-engines]] index and [[wikipedia:Cache_(computing)|cache]] pages.


==Why it is a problem==
==Why it is a problem==
Line 12: Line 12:
Ultimately, it depends a lot on the technical details of how each model works, so none of those arguments are universal.
Ultimately, it depends a lot on the technical details of how each model works, so none of those arguments are universal.


Some people request that, at the very least, the sources of the training data must be publicly disclosed, for the sake of [[wikipedia:Transparency_(behavior)|transparency]] and [[wikipedia:Attribution_(copyright)|attribution]].<ref>{{Cite web |last=Tunney |first=Justine |date=2024-08-23 |title=AI Training Shouldn't Erase Authorship |url=https://justine.lol/history/ |access-date=2026-04-26}}</ref>
Some people request that, at the very least, the sources of the training data must be publicly disclosed, for the sake of [[wikipedia:Transparency_(behavior)|transparency]] and [[wikipedia:Attribution_(copyright)|attribution]].<ref>{{Cite web |last=Tunney |first=Justine |date=2024-08-23 |title=AI Training Shouldn't Erase Authorship |url=https://justine.lol/history/ |access-date=2026-04-26}}</ref> Others propose that authors should be [[wikipedia:Financial_compensation|compensated]] in some way, and that AI should be regulated.<ref>{{Cite web |last=Hartzell |first=Jimmy |date=2024-06-30 |title=Large Language Models Should Have to Obey Copyright |url=https://www.thecodedmessage.com/posts/llm-copyright/ |access-date=2026-05-18 |website=The Coded Message}}</ref>


===Energy use===
===Energy use===
Line 19: Line 19:
===Bandwidth abuse===
===Bandwidth abuse===
Massive data needs massive bandwidth. Scraping web-pages across the entire internet requires sending millions of requests to all known servers. Some AI companies go as far as to ''repeatedly'' send requests for the same content (or several revisions of the same content) as frequent bursts in short intervals, which is indistinguishable from [[wikipedia:Denial-of-service_attack|distributed denial-of-service (DDoS) attacks]].
Massive data needs massive bandwidth. Scraping web-pages across the entire internet requires sending millions of requests to all known servers. Some AI companies go as far as to ''repeatedly'' send requests for the same content (or several revisions of the same content) as frequent bursts in short intervals, which is indistinguishable from [[wikipedia:Denial-of-service_attack|distributed denial-of-service (DDoS) attacks]].
===Chip shortage===
Many AI companies have pre-ordered massive amounts of computer components. So much that it doesn't even fit in their current data-centers. This is done in anticipation for ''more'' data-centers being built.{{Citation needed}} This has caused such components to become scarce, and prices to spike. The most notable being the [[RAM shortage|increase in RAM prices]].


==Examples==
==Examples==
Line 34: Line 37:
While it is good practice for a bot to respect <code>robots.txt</code>, there is no requirement for it, and there is no punishment for not following a website's wishes. It is additionally standard practice, but in no way enforced, that bots use a [[wikipedia:User-Agent header|User-Agent header]] to uniquely identify itself. This allows a website operator to observe a bot's traffic patterns, potentially blocking the bot outright if its scraping is not desirable. The header also typically contains a URL or email address that can be used to contact the operator in case of anomalies observed in its traffic.
While it is good practice for a bot to respect <code>robots.txt</code>, there is no requirement for it, and there is no punishment for not following a website's wishes. It is additionally standard practice, but in no way enforced, that bots use a [[wikipedia:User-Agent header|User-Agent header]] to uniquely identify itself. This allows a website operator to observe a bot's traffic patterns, potentially blocking the bot outright if its scraping is not desirable. The header also typically contains a URL or email address that can be used to contact the operator in case of anomalies observed in its traffic.


Unethical [[Artificial_intelligence|AI]] scraper bots do not follow <code>robots.txt</code> - in fact, they may not even request this file at all. They typically completely ignore it, instead opting to start from an entry point such as the root home page (<code>/</code>), working its way through an exponentially growing list of links as it finds them, with little to no delay between requests. The bots use false User-Agent header strings that would correspond to real web browsers on desktop or mobile operating systems - blocking them would also block legitimate users, or at least legitimate users on VPNs.
Unethical AI scraper bots do not follow <code>robots.txt</code> - in fact, they may not even request this file at all. They typically completely ignore it, instead opting to start from an entry point such as the root home page (<code>/</code>), working its way through an exponentially growing list of links as it finds them, with little to no delay between requests. The bots use false User-Agent header strings that would correspond to real web browsers on desktop or mobile operating systems - blocking them would also block legitimate users, or at least legitimate users on VPNs.


Some AI services opt to use separate User-Agent strings, potentially also ignoring <code>robots.txt</code>, when a request is made through user command rather than as part of model training. For example, ChatGPT identifies itself as <code>ChatGPT-User</code> rather than its standard <code>OpenAI</code> when it uses the "search the web" command - even if searching the web was an automatic decision. In a less favorable example, Perplexity AI in this same situation falsely identifies as a standard [[Google_Chrome|Chrome]] web browser running on [[Microsoft_Windows|Windows]]. AI companies defend this under the belief that they are not a "spider", but rather a "user agent" (like a web browser), when called upon by a user's request.<ref name="perplexity-aws" />
Some AI services opt to use separate User-Agent strings, potentially also ignoring <code>robots.txt</code>, when a request is made through user command rather than as part of model training. For example, ChatGPT identifies itself as <code>ChatGPT-User</code> rather than its standard <code>OpenAI</code> when it uses the "search the web" command - even if searching the web was an automatic decision. In a less favorable example, Perplexity AI in this same situation falsely identifies as a standard [[Google_Chrome|Chrome]] web browser running on [[Microsoft_Windows|Windows]]. AI companies defend this under the belief that they are not a "spider", but rather a "user agent" (like a web browser), when called upon by a user's request.<ref name="perplexity-aws" />
Line 43: Line 46:
To protect against unethical crawlers, due to concerns of both intellectual property and service disruption, websites adopt practices that affect the experience of real users:
To protect against unethical crawlers, due to concerns of both intellectual property and service disruption, websites adopt practices that affect the experience of real users:


*'''Bot check walls''': The user may be required to pass a security check "wall". While usually automatic for the user, this can affect legitimate bots. When a website protection service such as [[Cloudflare]] is not confident as to whether the visitor is legitimate, it may present a [[CAPTCHA]] to be manually filled out. An example is "Google Sorry", a CAPTCHA wall frequently seen when using Google Search via a VPN. An example that's popular in the FOSS community is [[wikipedia:Anubis_(software)|Anubis]].
*'''Bot check walls''': The user may be required to pass a security check "wall". While usually automatic for the user, this can affect legitimate bots. When a website protection service such as [[Cloudflare]] is not confident as to whether the visitor is legitimate, it may present a [[CAPTCHA]] to be manually filled out. An example is "Google Sorry", a CAPTCHA wall frequently seen when using Google Search via a VPN. An example that's popular in the FOSS community is {{Wplink|Anubis (software)|Anubis}}.
*'''Login walls''': Should bots be found to pass CAPTCHA walls, the website may advance to requiring logging in to view content. A major recent example of this is [[YouTube]]'s "Sign in to confirm you're not a bot" messages.
*'''Login walls''': Should bots be found to pass CAPTCHA walls, the website may advance to requiring logging in to view content. A major recent example of this is [[YouTube]]'s "Sign in to confirm you're not a bot" messages.
*'''[[JavaScript]] requirement''': Most websites do not need JavaScript to deliver their content. However, as many scrapers expect content to be found directly in the HTML, it is often an easy workaround to use JavaScript to "insert" the content after the page has loaded. This may reduce the responsiveness of the website, increasing points of failure, and preventing security-conscious users who disable JavaScript from viewing the website.
*'''JavaScript requirement''': Most websites do not need [[JavaScript]] to deliver their content. However, as many scrapers expect content to be found directly in the HTML, it is often an easy workaround to use JavaScript to "insert" the content after the page has loaded. This may reduce the responsiveness of the website, increasing points of failure, and preventing security-conscious users who disable JavaScript from viewing the website.
*'''IP address blocking''': Blocking IP addresses, especially by blocking entire providers via their [[wikipedia:Autonomous system (Internet)|autonomous system number]], always comes with some risk of blocking legitimate users. Particularly, this may restrict access to users making use of a VPN.
*'''IP address blocking''': Blocking IP addresses, especially by blocking entire providers via their {{Wplink|Autonomous system (Internet)|autonomous system number}}, always comes with some risk of blocking legitimate users. Particularly, this may restrict access to users making use of a VPN.
*'''Heuristic blocking''': Patterns in request headers may give away that the request is being made by an unethical bot, despite attempts to act as a legitimate visitor. Heuristics are imperfect and may block legitimate users, especially those that may use less common browsers.
*'''Heuristic blocking''': Patterns in request headers may give away that the request is being made by an unethical bot, despite attempts to act as a legitimate visitor. Heuristics are imperfect and may block legitimate users, especially those that may use less common browsers.


In rare situations, a website operator may redirect detected bot traffic, such as to download speed test files hosted by ISPs containing multiple gigabytes of random garbage data. This may have the effect of disrupting the bot, but its effectiveness is unknown.
In rare situations, a website operator may redirect detected bot traffic, such as to download speed test files hosted by ISPs containing multiple gigabytes of random garbage data. This may have the effect of disrupting the bot, but its effectiveness is unknown.


The need to respond to unethical scraping also further consolidates the web into the control of a few large [[wikipedia:Web application firewall|web application firewall]] (WAF) services, most notably [[Cloudflare]], as website owners find themselves otherwise unable to protect their service from being disrupted by such traffic.
The need to respond to unethical scraping also further consolidates the web into the control of a few large {{Wplink|Web application firewall|web application firewall}} (WAF) services, most notably [[Cloudflare]], as website owners find themselves otherwise unable to protect their service from being disrupted by such traffic.


===Case studies===
===Case studies===
Line 107: Line 110:
{{reflist}}
{{reflist}}


[[Category:Artificial intelligence]]
[[Category:Artificial intelligence|*Training]]