Artificial intelligence/training: Difference between revisions
m cat |
m link scraping |
| (One intermediate revision by the same user not shown) | |||
| Line 3: | Line 3: | ||
==How it works==
There are several ways to implement AI, and even more ways to train it, the most well-known being [[wikipedia:Backpropagation|backpropagation]]. With respect to the dataset, LLMs must be trained on massive amounts of data, a task that is only feasible through automation. This is in contrast to curated datasets, in which both the data and the training are handled in a more carefully controlled environment. Automated training on massive datasets typically uses internet websites as sources. The process of [[wikipedia:Web_scraping|scraping]] is similar to how web [[wikipedia:Search_engine|search engines]] index and [[wikipedia:Cache_(computing)|cache]] pages.
==Why it is a problem==
| Line 37: | Line 37: | ||
While it is good practice for a bot to respect <code>robots.txt</code>, there is no requirement to do so, and there is no penalty for ignoring a website's wishes. It is also standard practice, though in no way enforced, for a bot to use a [[wikipedia:User-Agent header|User-Agent header]] to uniquely identify itself. This allows a website operator to observe a bot's traffic patterns and potentially block the bot outright if its scraping is not desirable. The header also typically contains a URL or email address that can be used to contact the bot's operator if anomalies are observed in its traffic.
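As a minimal sketch of what a well-behaved bot does, the following Python example checks a <code>robots.txt</code> policy before fetching a URL, using the standard library's <code>urllib.robotparser</code>. The bot name <code>ExampleBot</code>, the contact details, the <code>example.org</code> URLs, and the policy text are all hypothetical and chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site (not taken from any real site).
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow: /
"""

# Hypothetical User-Agent string: a unique bot name plus contact details,
# so the site operator can identify the bot and reach its owner.
USER_AGENT = "ExampleBot/1.0 (+https://example.org/bot; bot-admin@example.org)"


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True only if the site's policy allows this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    for url in ("https://example.org/index.html", "https://example.org/private/data"):
        print(url, "allowed" if may_fetch(rules, url) else "disallowed")
```

Nothing forces a bot to call a check like <code>may_fetch</code> before each request; compliance is entirely voluntary, which is the weakness described above.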
Unethical AI scraper bots do not follow <code>robots.txt</code>; in fact, they may not even request the file at all. They typically start from an entry point such as the root home page (<code>/</code>) and work their way through an exponentially growing list of links as they find them, with little to no delay between requests. The bots use false User-Agent header strings that correspond to real web browsers on desktop or mobile operating systems, so blocking them would also block legitimate users, or at least legitimate users on VPNs.
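To illustrate how a crawl that starts at <code>/</code> fans out through discovered links, here is a small simulation over an in-memory site. It makes no network requests; the <code>PAGES</code> mapping and its paths are hypothetical, and a breadth-first queue is only one plausible traversal order, not a description of any particular bot.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "website": path -> HTML body. No network access is involved.
PAGES = {
    "/": '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<a href="/a/1">1</a> <a href="/a/2">2</a> <a href="/">home</a>',
    "/b": '<a href="/b/1">1</a>',
    "/a/1": "",
    "/a/2": "",
    "/b/1": "",
}


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html: str) -> list:
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


def crawl_order(start: str = "/") -> list:
    """Visit pages breadth-first from the entry point, queueing each new link once."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        path = queue.popleft()
        order.append(path)  # a real bot would issue an HTTP request here
        for link in extract_links(PAGES.get(path, "")):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


if __name__ == "__main__":
    print(crawl_order())
```

Each visited page can add several new entries to the queue, which is why the list of pending requests grows so quickly on a heavily interlinked site. A considerate crawler would also pause between requests and consult <code>robots.txt</code> before each one.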
Some AI services opt to use separate User-Agent strings, potentially also ignoring <code>robots.txt</code>, when a request is made at a user's command rather than as part of model training. For example, ChatGPT identifies itself as <code>ChatGPT-User</code> rather than its standard <code>OpenAI</code> when it uses the "search the web" command, even if searching the web was an automatic decision. In a less favorable example, Perplexity AI in this same situation falsely identifies as a standard [[Google_Chrome|Chrome]] web browser running on [[Microsoft_Windows|Windows]]. AI companies defend this on the grounds that they are acting not as a "spider" but as a "user agent" (like a web browser) when called upon by a user's request.<ref name="perplexity-aws" />
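The practical difference between the two behaviors can be sketched from the website operator's side. The example below sorts requests by User-Agent string; the token list contains only <code>ChatGPT-User</code> as named above, and both sample header strings are illustrative assumptions rather than exact strings sent by any service.

```python
# Tokens that a bot voluntarily places in its User-Agent header.
# Only "ChatGPT-User" comes from the text above; a real list would be longer.
DECLARED_BOT_TOKENS = ("ChatGPT-User",)


def classify_user_agent(user_agent: str) -> str:
    """Classify a request using nothing but its self-reported User-Agent header."""
    ua = user_agent.lower()
    if any(token.lower() in ua for token in DECLARED_BOT_TOKENS):
        return "declared-bot"
    if ua.startswith("mozilla/"):
        # Real browsers and bots impersonating browsers both land here.
        return "browser-like"
    return "unknown"


# Illustrative header values (assumptions, not captured traffic).
DECLARED = "Mozilla/5.0 (compatible; ChatGPT-User/1.0)"
CHROME_ON_WINDOWS = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)

if __name__ == "__main__":
    for header in (DECLARED, CHROME_ON_WINDOWS):
        print(classify_user_agent(header), "<-", header)
```

A bot that sends the Chrome-style string is classified exactly like a human visitor, so a User-Agent filter alone cannot block it without also blocking real browsers.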