While "mainstream" companies such as [[OpenAI]], [[Anthropic]], and [[Meta]] appear to correctly follow industry-standard practice for web crawlers, others ignore them (such as [[wikipedia:Alibaba_Group|Alibaba]]<ref>{{Cite news |last=Venerandi |first=Niccolò |title=FOSS infrastructure is under attack by AI companies |url=https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies |access-date=2026-02-23 |work=LibreNews |archive-url=http://web.archive.org/web/20260217195639/https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/ |archive-date=17 Feb 2026}}</ref>), causing [[wikipedia:Denial-of-service attack|distributed denial of service attacks]] which damage access to freely-accessible websites. This is particularly an issue for websites that are large or contain many dynamic links.
While "mainstream" companies such as [[OpenAI]], [[Anthropic]], and [[Meta]] appear to correctly follow industry-standard practice for web crawlers, others ignore them (such as [[wikipedia:Alibaba_Group|Alibaba]]<ref>{{Cite news |last=Venerandi |first=Niccolò |title=FOSS infrastructure is under attack by AI companies |url=https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies |access-date=2026-02-23 |work=LibreNews |archive-url=http://web.archive.org/web/20260217195639/https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/ |archive-date=17 Feb 2026}}</ref>), causing [[wikipedia:Denial-of-service attack|distributed denial of service attacks]] which damage access to freely-accessible websites. This is particularly an issue for websites that are large or contain many dynamic links.


Ethical website scrapers, known as "spiders", follow a minimum set of guidelines when crawling the web. Specifically, they honor <code>[[wikipedia:robots.txt|robots.txt]]</code>, a text file found at the root of a domain that indicates:


*Paths bots are allowed to index
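
A minimal <code>robots.txt</code> expressing such rules might look like the following (a sketch: the bot name and paths are hypothetical, and <code>Crawl-delay</code> is a widely recognized but non-standard directive):

<pre>
# Rules for all bots
User-agent: *
Disallow: /search/
Crawl-delay: 10

# Stricter rules for one specific bot (hypothetical name)
User-agent: ExampleBot
Disallow: /
</pre>
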
These rules are typically configured for all bots, with minor adjustments made for individual bots as needed. Additionally, specific web pages may use the [[wikipedia:noindex|robots meta tag]] to control how their content is indexed and used.
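
For example, a page can ask compliant bots not to index it or follow its links with a single tag in its <code>&lt;head&gt;</code>:

<pre>
<meta name="robots" content="noindex, nofollow">
</pre>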


While it is good practice for a bot to respect <code>robots.txt</code>, there is no requirement to do so, and no punishment for ignoring a website's wishes. It is additionally standard practice, though in no way enforced, for bots to send a [[wikipedia:User-Agent header|User-Agent header]] that uniquely identifies them. This allows a website operator to observe a bot's traffic patterns and potentially block the bot outright if its scraping is not desirable. The header also typically contains a URL or email address that can be used to contact the bot's operator if anomalies are observed in its traffic.
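
A well-behaved crawler's header conventionally includes a version, a contact URL, and sometimes an email address, along these lines (the bot name, URL, and address here are hypothetical):

<pre>
User-Agent: ExampleBot/1.2 (+https://example.com/bot-info; crawler@example.com)
</pre>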


Unethical [[Artificial_intelligence|AI]] scraper bots do not follow <code>robots.txt</code>; in fact, they may not even request the file at all. They typically ignore it completely, instead starting from an entry point such as the root home page (<code>/</code>) and working their way through an exponentially growing list of links as they find them, with little to no delay between requests. These bots use false User-Agent header strings corresponding to real web browsers on desktop or mobile operating systems, so blocking them would also block legitimate users, or at least legitimate users on VPNs.
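
Such a bot might send a header indistinguishable from an ordinary browser's, for example (version numbers illustrative):

<pre>
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36
</pre>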


Some AI services opt to use separate User-Agent strings, potentially also ignoring <code>robots.txt</code>, when a request is made at a user's command rather than as part of model training. For example, ChatGPT identifies itself as <code>ChatGPT-User</code> rather than its standard <code>OpenAI</code> string when it uses the "search the web" command, even if searching the web was an automatic decision. In a less favorable example, Perplexity AI in the same situation falsely identifies itself as a standard Chrome web browser running on Windows. AI companies defend this practice by arguing that, when called upon by a user's request, they are acting not as a "spider" but as a "user agent" (like a web browser).<ref name="perplexity-aws" />
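
A site wishing to opt out of these user-triggered fetches as well can list the disclosed token in its <code>robots.txt</code>, assuming the bot honors it:

<pre>
User-agent: ChatGPT-User
Disallow: /
</pre>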


Less legitimate bots spread their requests across a wide distribution of IP addresses, in a clear attempt to bypass any IP-based request throttling or rate limiting the website may implement, further reducing the website's options for protecting itself. They are also known to ignore HTTP response status codes that indicate a server error ([[wikipedia:HTTP status code#5xx server errors|5xx]]), or warnings that the client needs to slow down ([[wikipedia:HTTP status code#429|429 Too Many Requests]]) or has been blocked entirely ([[wikipedia:HTTP status code#403|403 Forbidden]]).
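
A compliant client receiving a response like the following would pause before retrying; abusive scrapers simply keep requesting (the delay value is illustrative):

<pre>
HTTP/1.1 429 Too Many Requests
Retry-After: 120
Content-Type: text/plain

Rate limit exceeded. Retry after 120 seconds.
</pre>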