</blockquote>

====MediaWiki, Wikipedia, and the Wikimedia Foundation====
[[wikipedia:MediaWiki|MediaWiki]] is of particular interest for LLM training due to the vast amount of factual, plain-text content wikis tend to hold. While [[wikipedia:Wikipedia|Wikipedia]] and the [[wikipedia:Wikimedia Foundation|Wikimedia Foundation]] host the most well-known wikis, numerous smaller wikis exist thanks to the work of many independent editors. The strength of the wiki architecture is that every edit can be audited by anyone, at any time - you can still view [https://en.wikipedia.org/w/index.php?oldid=1 the first edit to Wikipedia] from 2002. This makes wikis a hybrid of a static website and a dynamic web app, which becomes problematic when poorly-designed bots attempt to scrape them: requesting every historical revision forces the server to render each page from the database, bypassing the caches that serve ordinary readers.<ref name="geraspora" />
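
As an illustrative sketch (not drawn from the cited sources), the difference between abusive and well-behaved scraping is visible in how a bot talks to a wiki. MediaWiki's Action API accepts a <code>maxlag</code> parameter that asks the server to reject the request whenever its databases are lagging, so a polite bot backs off under load instead of competing with human readers; bulk consumers can avoid the live site entirely by downloading the periodic database dumps. The endpoint and article title below are arbitrary examples.

<syntaxhighlight lang="python">
import time
import requests

API = "https://en.wikipedia.org/w/api.php"  # any MediaWiki Action API endpoint

def fetch_revisions(title: str, limit: int = 50) -> dict:
    """Fetch revision metadata politely, honouring MediaWiki's maxlag."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvlimit": limit,
        "rvprop": "ids|timestamp|user",
        "maxlag": 5,       # ask the server to refuse us if replication lag exceeds 5s
        "format": "json",
    }
    # Identify the bot and a contact address; anonymous user agents are often blocked.
    headers = {"User-Agent": "ExampleWikiBot/0.1 (contact@example.org)"}
    while True:
        data = requests.get(API, params=params, headers=headers, timeout=30).json()
        if data.get("error", {}).get("code") == "maxlag":
            time.sleep(5)  # server is busy: back off and retry
            continue
        return data

print(fetch_revisions("Artificial intelligence"))
</syntaxhighlight>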


<!-- COI alert: I, [[User:kirb]], am an admin for The Apple Wiki. Hopefully this is neutral enough?
-->The Apple Wiki, which documents internal details of Apple's hardware and software, holds more than 50,000 articles. On 2 August 2024, and again on 5 January 2025, the service was disrupted by scraping efforts.<ref>https://theapplewiki.com/wiki/The_Apple_Wiki:Community_portal#Bot_traffic_abuse</ref> The wiki contains a considerable amount of information that is retrieved by legitimate security research tools, making it difficult for the website to block illegitimate requests; efforts to block unethical scraping and protect the wiki have disrupted these legitimate tools. The large article count, combined with more than 280,000 total edits over the wiki's lifetime, creates an untenable situation in which it is simply not possible to scrape the website without causing significant service disruption.

On 1 April 2025, the Wikimedia Foundation reported that its infrastructure had been under increasing pressure from content-scraping bots since January 2024, with the particularly critical finding that "65% of our most expensive traffic comes from bots", despite bots being estimated to generate only 35% of all traffic. Bot traffic patterns differ significantly from human browsing patterns, effectively bypassing Wikimedia's caching infrastructure and placing significant load on the core servers. The accompanying blog post gives an example in which bot traffic caused the [[wikipedia:Wikimedia Commons|Wikimedia Commons]] service to become unstable during a spike in human traffic. The Foundation is considering introducing a Responsible Use of Infrastructure policy to ensure the continued stability of its services.<ref>https://diff.wikimedia.org/2025/04/01/how-crawlers-impact-the-operations-of-the-wikimedia-projects/</ref>
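
A toy simulation (illustrative only; the numbers are arbitrary and not taken from the cited post) shows why crawler traffic defeats caching: human readers concentrate on a small set of popular pages that stay resident in a cache, while a crawler sweeping the entire page space misses the cache on nearly every request.

<syntaxhighlight lang="python">
import random
from collections import OrderedDict

PAGES = 100_000     # size of the page space (arbitrary)
CACHE_SIZE = 1_000  # cache holds 1% of pages (arbitrary)
REQUESTS = 50_000

def hit_rate(request_stream):
    """Measure the hit rate of a simple LRU cache over a stream of page IDs."""
    cache = OrderedDict()
    hits = total = 0
    for page in request_stream:
        total += 1
        if page in cache:
            hits += 1
            cache.move_to_end(page)   # mark as recently used
        else:
            cache[page] = True
            if len(cache) > CACHE_SIZE:
                cache.popitem(last=False)  # evict the least recently used page
    return hits / total

# Human-like traffic: heavily skewed toward a few popular pages.
human = (min(int(random.paretovariate(1.2)), PAGES) for _ in range(REQUESTS))
# Crawler-like traffic: uniform over the whole page space.
crawler = (random.randrange(PAGES) for _ in range(REQUESTS))

print(f"human-like hit rate:   {hit_rate(human):.0%}")
print(f"crawler-like hit rate: {hit_rate(crawler):.0%}")
</syntaxhighlight>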


====Perplexity AI and news outlets====
"AWS's terms of service prohibit abusive and illegal activities and our customers are responsible for complying with those terms," [AWS spokesperson Patrick] Neighorn said in a statement. "We routinely receive reports of alleged abuse from a variety of sources and engage our customers to understand those reports."
"AWS's terms of service prohibit abusive and illegal activities and our customers are responsible for complying with those terms," [AWS spokesperson Patrick] Neighorn said in a statement. "We routinely receive reports of alleged abuse from a variety of sources and engage our customers to understand those reports."
</blockquote>
</blockquote>
====Read the Docs====
In an early example, on 25 July 2024, the open source documentation host Read the Docs detailed cases of abusive bots downloading large amounts of content from the service. In particular, because the aggressive crawlers spread their requests across a large range of IP addresses, existing rate limiting was rendered ineffective. Blocking traffic that Cloudflare identified as "AI crawlers" reduced the site's bandwidth usage by 75%, saving approximately US$1,500 per month.<ref>https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse/</ref>
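
Cloudflare's "AI crawlers" classification is proprietary, but its simplest ingredient can be sketched as user-agent filtering. The WSGI middleware below is an illustration, not Read the Docs' actual configuration: the token list is incomplete, and because user agents are trivially spoofed, real deployments combine such checks with IP-level and behavioural signals.

<syntaxhighlight lang="python">
# Minimal WSGI middleware that rejects known AI crawler user agents.
# The token list is illustrative and incomplete.
AI_CRAWLER_TOKENS = ("GPTBot", "CCBot", "ClaudeBot", "PerplexityBot", "Bytespider")

def block_ai_crawlers(app):
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if any(token in user_agent for token in AI_CRAWLER_TOKENS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"AI crawlers are not permitted on this site.\n"]
        return app(environ, start_response)
    return middleware

# Usage: wrap any WSGI application, e.g.
#   application = block_ai_crawlers(my_wsgi_app)
</syntaxhighlight>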
====SourceHut and Fedora Linux====
On 15 March 2025, an infrastructure manager for the [[wikipedia:Fedora Linux|Fedora Linux]] open source project described a suspected large language model crawling attack against the pagure.io Git source code hosting service. The project decided to block all traffic from Brazil for some time, and also blocked access to certain repositories whose traffic was creating significant CPU usage.<ref>https://www.scrye.com/blogs/nirik/posts/2025/03/15/mid-march-infra-bits-2025/</ref><ref>https://www.scrye.com/blogs/nirik/posts/2025/03/29/late-march-infra-bits-2025/</ref>

On 17 March 2025, the Git source code host SourceHut announced that the service was being disrupted by large language model crawlers. Mitigations deployed to reduce the disruption included requiring login for some areas of the service and blocking the IP ranges of several cloud providers, both of which affected legitimate use of the website.<ref>https://status.sr.ht/issues/2025-03-17-git.sr.ht-llms/</ref> In response to the event, SourceHut founder Drew DeVault wrote a blog post entitled "[https://drewdevault.com/2025/03/17/2025-03-17-Stop-externalizing-your-costs-on-me.html Please stop externalizing your costs directly into my face]", discussing his frustration with ongoing, ever-adapting attacks that must be addressed in a timely fashion to reduce disruption to legitimate SourceHut users. DeVault estimates that between "20-100%" of his time is now spent addressing such attacks.
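
Blocking cloud provider address space, as SourceHut did, generally amounts to matching each client IP against published CIDR ranges. The sketch below is illustrative only: the ranges shown are reserved documentation networks standing in for the providers' machine-readable range lists.

<syntaxhighlight lang="python">
import ipaddress

# Placeholder CIDR blocks; real deployments import the cloud providers'
# published, machine-readable address range lists.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("198.51.100.0/24"),  # reserved documentation range
    ipaddress.ip_network("203.0.113.0/24"),   # reserved documentation range
]

def is_blocked(client_ip: str) -> bool:
    """Return True if the client address falls inside any blocked range."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in BLOCKED_NETWORKS)

print(is_blocked("198.51.100.42"))  # True
print(is_blocked("192.0.2.1"))      # False
</syntaxhighlight>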


==Privacy concerns of online AI models==