Artificial intelligence/training: Difference between revisions
m link CAPTCHA |
Added archive URLs for 1 citation(s) using CRWCitationBot |
||
| Line 9: | Line 9: | ||
==Examples== | ==Examples== | ||
While "mainstream" companies such as [[OpenAI]], [[Anthropic]], and [[Meta]] appear to correctly follow industry-standard practice for web crawlers, others ignore them (such as [[wikipedia:Alibaba_Group|Alibaba]]<ref>{{Cite news |last=Venerandi |first=Niccolò |title=FOSS infrastructure is under attack by AI companies |url=https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies |access-date=2026-02-23 |work=LibreNews}}</ref>), causing [[wikipedia:Denial-of-service attack|distributed denial of service attacks]] which damage access to freely-accessible websites. This is particularly an issue for websites that are large or contain many dynamic links. | While "mainstream" companies such as [[OpenAI]], [[Anthropic]], and [[Meta]] appear to correctly follow industry-standard practice for web crawlers, others ignore them (such as [[wikipedia:Alibaba_Group|Alibaba]]<ref>{{Cite news |last=Venerandi |first=Niccolò |title=FOSS infrastructure is under attack by AI companies |url=https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies |access-date=2026-02-23 |work=LibreNews |archive-url=http://web.archive.org/web/20260217195639/https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/ |archive-date=17 Feb 2026}}</ref>), causing [[wikipedia:Denial-of-service attack|distributed denial of service attacks]] which damage access to freely-accessible websites. This is particularly an issue for websites that are large or contain many dynamic links. | ||
Ethical website scrapers, known as "spiders" that crawl the web, follow a certain set of minimum guidelines. Specifically, they follow [[wikipedia:robots.txt|robots.txt]], a text file found at the root of a domain that indicates: | Ethical website scrapers, known as "spiders" that crawl the web, follow a certain set of minimum guidelines. Specifically, they follow [[wikipedia:robots.txt|robots.txt]], a text file found at the root of a domain that indicates: | ||