Artificial intelligence/training: Difference between revisions

(One intermediate revision by the same user not shown)

Line 3:

==How it works==

There are several ways to implement AI, and even more ways to train them, the most well-known being [[wikipedia:Backpropagation|backpropagation]]. With respect to the data-set, LLMs must be trained on massive amounts of data, which is a task that's only feasible via automation. This is in contrast to curated data-sets, in which both the data and the training is done in a more carefully controlled environment. Automated training on massive data-sets is typically done using internet web-sites as sources. The process of scraping is similar to how web-[[wikipedia:Search_engine|search-engines]] index and [[wikipedia:Cache_(computing)|cache]] pages.

There are several ways to implement AI, and even more ways to train them, the most well-known being [[wikipedia:Backpropagation|backpropagation]]. With respect to the data-set, LLMs must be trained on massive amounts of data, which is a task that's only feasible via automation. This is in contrast to curated data-sets, in which both the data and the training is done in a more carefully controlled environment. Automated training on massive data-sets is typically done using internet web-sites as sources. The process of [[wikipedia:Web_scraping|scraping]] is similar to how web-[[wikipedia:Search_engine|search-engines]] index and [[wikipedia:Cache_(computing)|cache]] pages.

==Why it is a problem==

Line 37:

While it is good practice for a bot to respect <code>robots.txt</code>, there is no requirement for it, and there is no punishment for not following a website's wishes. It is additionally standard practice, but in no way enforced, that bots use a [[wikipedia:User-Agent header|User-Agent header]] to uniquely identify itself. This allows a website operator to observe a bot's traffic patterns, potentially blocking the bot outright if its scraping is not desirable. The header also typically contains a URL or email address that can be used to contact the operator in case of anomalies observed in its traffic.

Unethical ~~[[Artificial_intelligence|~~AI]] scraper bots do not follow <code>robots.txt</code> - in fact, they may not even request this file at all. They typically completely ignore it, instead opting to start from an entry point such as the root home page (<code>/</code>), working its way through an exponentially growing list of links as it finds them, with little to no delay between requests. The bots use false User-Agent header strings that would correspond to real web browsers on desktop or mobile operating systems - blocking them would also block legitimate users, or at least legitimate users on VPNs.

Unethical AI scraper bots do not follow <code>robots.txt</code> - in fact, they may not even request this file at all. They typically completely ignore it, instead opting to start from an entry point such as the root home page (<code>/</code>), working its way through an exponentially growing list of links as it finds them, with little to no delay between requests. The bots use false User-Agent header strings that would correspond to real web browsers on desktop or mobile operating systems - blocking them would also block legitimate users, or at least legitimate users on VPNs.

Some AI services opt to use separate User-Agent strings, potentially also ignoring <code>robots.txt</code>, when a request is made through user command rather than as part of model training. For example, ChatGPT identifies itself as <code>ChatGPT-User</code> rather than its standard <code>OpenAI</code> when it uses the "search the web" command - even if searching the web was an automatic decision. In a less favorable example, Perplexity AI in this same situation falsely identifies as a standard [[Google_Chrome|Chrome]] web browser running on [[Microsoft_Windows|Windows]]. AI companies defend this under the belief that they are not a "spider", but rather a "user agent" (like a web browser), when called upon by a user's request.<ref name="perplexity-aws" />

@@ Line 3: / Line 3: @@
 ==How it works==
-There are several ways to implement AI, and even more ways to train them, the most well-known being [[wikipedia:Backpropagation|backpropagation]]. With respect to the data-set, LLMs must be trained on massive amounts of data, which is a task that's only feasible via automation. This is in contrast to curated data-sets, in which both the data and the training is done in a more carefully controlled environment. Automated training on massive data-sets is typically done using internet web-sites as sources. The process of scraping is similar to how web-[[wikipedia:Search_engine|search-engines]] index and [[wikipedia:Cache_(computing)|cache]] pages.
+There are several ways to implement AI, and even more ways to train them, the most well-known being [[wikipedia:Backpropagation|backpropagation]]. With respect to the data-set, LLMs must be trained on massive amounts of data, which is a task that's only feasible via automation. This is in contrast to curated data-sets, in which both the data and the training is done in a more carefully controlled environment. Automated training on massive data-sets is typically done using internet web-sites as sources. The process of [[wikipedia:Web_scraping|scraping]] is similar to how web-[[wikipedia:Search_engine|search-engines]] index and [[wikipedia:Cache_(computing)|cache]] pages.
 ==Why it is a problem==
@@ Line 37: / Line 37: @@
 While it is good practice for a bot to respect <code>robots.txt</code>, there is no requirement for it, and there is no punishment for not following a website's wishes. It is additionally standard practice, but in no way enforced, that bots use a [[wikipedia:User-Agent header|User-Agent header]] to uniquely identify itself. This allows a website operator to observe a bot's traffic patterns, potentially blocking the bot outright if its scraping is not desirable. The header also typically contains a URL or email address that can be used to contact the operator in case of anomalies observed in its traffic.
-Unethical [[Artificial_intelligence|AI]] scraper bots do not follow <code>robots.txt</code> - in fact, they may not even request this file at all. They typically completely ignore it, instead opting to start from an entry point such as the root home page (<code>/</code>), working its way through an exponentially growing list of links as it finds them, with little to no delay between requests. The bots use false User-Agent header strings that would correspond to real web browsers on desktop or mobile operating systems - blocking them would also block legitimate users, or at least legitimate users on VPNs.
+Unethical AI scraper bots do not follow <code>robots.txt</code> - in fact, they may not even request this file at all. They typically completely ignore it, instead opting to start from an entry point such as the root home page (<code>/</code>), working its way through an exponentially growing list of links as it finds them, with little to no delay between requests. The bots use false User-Agent header strings that would correspond to real web browsers on desktop or mobile operating systems - blocking them would also block legitimate users, or at least legitimate users on VPNs.
 Some AI services opt to use separate User-Agent strings, potentially also ignoring <code>robots.txt</code>, when a request is made through user command rather than as part of model training. For example, ChatGPT identifies itself as <code>ChatGPT-User</code> rather than its standard <code>OpenAI</code> when it uses the "search the web" command - even if searching the web was an automatic decision. In a less favorable example, Perplexity AI in this same situation falsely identifies as a standard [[Google_Chrome|Chrome]] web browser running on [[Microsoft_Windows|Windows]]. AI companies defend this under the belief that they are not a "spider", but rather a "user agent" (like a web browser), when called upon by a user's request.<ref name="perplexity-aws" />