Artificial intelligence: Difference between revisions

(6 intermediate revisions by 3 users not shown)

Line 3:

[[wikipedia:Generative artificial intelligence|Generative artificial intelligence]] models are trained through vast amounts of existing human-generated content. Using the example of an LLM, by learning about common trends in sentence structure, the model is able to form complete sentences and show artificial "knowledge" of a topic. The artificial nature may cause [[wikipedia:Hallucination (artificial intelligence)|hallucination]] through confidently-written, but mostly or entirely incorrect, output.

The current well-funded, lucrative industry of artificial intelligence tools has resulted in rampant unethical use of content. Startups intending to produce AI services have been scraping the internet for content to train future models at a concerning pace, with no regard for copyright law, as members of the field are concerned that they are approaching the limit of publicly-available content to train from.<ref>https://observer.com/2024/12/openai-cofounder-ilya-sutskever-ai-data-peak/</ref>

The current well-funded, lucrative industry of artificial intelligence tools has resulted in rampant unethical use of content. Startups intending to produce AI services have been scraping the internet for content to train future models at a concerning pace, with no regard for copyright law, as members of the field are concerned that they are approaching the limit of publicly-available content to train from.<ref>{{Cite web |last=Tremayne-Pengelly |first=Alexandra |date=16 Dec 2024 |title=Ilya Sutskever Warns A.I. Is Running Out of Data—Here’s What Will Happen Next |url=https://observer.com/2024/12/openai-cofounder-ilya-sutskever-ai-data-peak/ |website=Observer}}</ref>

==Unethical website scraping==

Further Reading: [[Nonconsensual Scraping|Nonconsensual scraping]]

While "mainstream" companies such as [[OpenAI]], [[Anthropic]], and [[Meta]] appear to correctly follow industry-standard practice for web crawlers, others ignore them, causing [[wikipedia:Denial-of-service attack|distributed denial of service attacks]] which damage access to freely-accessible websites. This is particularly an issue for websites that are large or contain many dynamic links.
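The industry-standard practice referred to above is usually taken to mean the Robots Exclusion Protocol (<code>robots.txt</code>). As a minimal sketch, assuming a hypothetical crawler named <code>ExampleBot</code> and made-up rules for a hypothetical host, a well-behaved crawler could check each URL with Python's standard <code>urllib.robotparser</code> before requesting it. Neither the rules nor the URLs below come from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; real sites publish their own rules.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /w/index.php
Crawl-delay: 10

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True only if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# A dynamic revision URL, disallowed for ExampleBot by the rules above.
print(may_fetch(parser, "ExampleBot", "https://wiki.example/w/index.php?oldid=1"))
# An ordinary article URL, not covered by any Disallow rule.
print(may_fetch(parser, "ExampleBot", "https://wiki.example/wiki/Main_Page"))
# The delay in seconds the site asks this crawler to leave between requests.
print(parser.crawl_delay("ExampleBot"))
```

The check is entirely voluntary: nothing stops a crawler from skipping it, which is why sites affected by non-compliant bots fall back on blocking.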

Line 40:

Line 41:

===Case studies===

====Diaspora====

On 27 December 2024, the open-source social network project Diaspora noted that 70% of traffic across its infrastructure was in service of AI scrapers.<ref name="geraspora">https://pod.geraspora.de/posts/17342163</ref> Particularly, the project noted that bots had followed links to crawl every individual edit in their [[#MediaWiki|MediaWiki]] instance, causing an exponential increase in the number of unique requests being made.

On 27 December 2024, the open-source social network project Diaspora noted that 70% of traffic across its infrastructure was in service of AI scrapers.<ref name="geraspora">{{Cite web |last=Schubert |first=Dennis |date=27 Dec 2024 |title=In the last 60 days, the diaspora* web assets received 11.3 million requests ... |url=https://pod.geraspora.de/posts/17342163 |via=Diaspora*}}</ref> Particularly, the project noted that bots had followed links to crawl every individual edit in their [[#MediaWiki|MediaWiki]] instance, causing an exponential increase in the number of unique requests being made.

====LVFS====

The [https://fwupd.org/ Linux Vendor Firmware Service] (LVFS) provides a free central store of firmware updates, such as for UEFI motherboards and SSD controllers. This feature is integrated with many Linux distributions through the <code>fwupd</code> daemon. For situations where internet access is not permitted, the service allows users to make a local mirror of the entire 100+ GB store.

On 9 January 2025, the project announced that it would introduce a login wall around its mirror feature, citing unnecessary use of its bandwidth.<ref>https://lore.kernel.org/lvfs-announce/zDlhotSvKqnMDfkCKaE_u4-8uvWsgkuj18ifLBwrLN9vWWrIJjrYQ-QfhpY3xuwIXuZgzOVajW99ymoWmijTdngeFRVjM0BxhPZquUzbDfM=@hughsie.com/T/</ref> Up to 1,000 files may be downloaded per day without logging in. The author later mentioned on Mastodon that the problem appears to be caused by AI scraping.<ref>https://mastodon.social/@hughsie/113871373001227969</ref>

On 9 January 2025, the project announced that it would introduce a login wall around its mirror feature, citing unnecessary use of its bandwidth.<ref>{{Cite web |last=Hughes |first=Richard |date=9 Jan 2025 |title=Authentication soon required to mirror the entire LVFS |url=https://lore.kernel.org/lvfs-announce/zDlhotSvKqnMDfkCKaE_u4-8uvWsgkuj18ifLBwrLN9vWWrIJjrYQ-QfhpY3xuwIXuZgzOVajW99ymoWmijTdngeFRVjM0BxhPZquUzbDfM=@hughsie.com/T/ |website=Linux Vendor Firmware Service (LVFS) mailing list}}</ref> Up to 1,000 files may be downloaded per day without logging in. The author later mentioned on Mastodon that the problem appears to be caused by AI scraping.<ref>{{Cite web |last=Hughes |first=Richard |date=22 Jan 2025 |title=Commentary citing 'Authentication soon required to mirror the entire LVFS' |url=https://mastodon.social/@hughsie/113871373001227969 |via=Mastodon}}</ref>

====LWN.net====

Line 56:

Line 57:

</blockquote>

He later commented:<ref>https://www.heise.de/en/news/AI-bots-paralyze-Linux-news-site-and-others-10252162.html</ref>

He later commented:<ref>{{Cite web |last=Knop |first=Dirk |date=22 Jan 2025 |title=AI bots paralyze Linux news site and others |url=https://www.heise.de/en/news/AI-bots-paralyze-Linux-news-site-and-others-10252162.html |website=Heise |language=en, de}}</ref>

Line 62:

Line 63:

</blockquote>

====MediaWiki====

====MediaWiki, Wikipedia, and the Wikimedia Foundation====

[[wikipedia:MediaWiki|MediaWiki]] is of particular interest to LLM training due to the vast amount of factual, plain-text content wikis tend to hold. While [[wikipedia:Wikipedia|Wikipedia]] and the [[wikipedia:Wikimedia Foundation|Wikimedia Foundation]] host the most well-known wikis, numerous smaller wikis exist thanks to the work of many independent editors. The strength of wiki architecture is its ability for every edit to be audited by anyone, at any time - you can still view [https://en.wikipedia.org/w/index.php?oldid=1 the first edit to Wikipedia] from 2002. This makes wikis a hybrid of a static website and a dynamic web app, which becomes problematic when poorly-designed bots attempt to scrape them.<ref name="geraspora" />

<!-- COI alert: I, [[User:kirb]], am an admin for The Apple Wiki. Hopefully this is neutral enough?

-->The Apple Wiki, which documents internal details of Apple's hardware and software, holds more than 50,000 articles. On 2 August 2024, with a repeat occurrence on 5 January 2025, the service was disrupted by scraping efforts.<ref>https://theapplewiki.com/wiki/The_Apple_Wiki:Community_portal#Bot_traffic_abuse</ref> The wiki contains a considerable amount of information that is scraped by legitimate security research tools, making it difficult for the website to block non-legitimate requests. Efforts to block unethical scraping and protect the wiki have disrupted these legitimate tools. The large article count, combined with more than 280,000 total edits over the wiki's lifetime, create an untenable situation where it is simply not possible to scrape the website without causing significant service disruption.

-->The Apple Wiki, which documents internal details of Apple's hardware and software, holds more than 50,000 articles. On 2 August 2024, with a repeat occurrence on 5 January 2025, the service was disrupted by scraping efforts.<ref>{{Cite web |title=Bot traffic abuse |url=https://theapplewiki.com/wiki/The_Apple_Wiki:Community_portal#Bot_traffic_abuse |website=The Apple Wiki}}</ref> The wiki contains a considerable amount of information that is scraped by legitimate security research tools, making it difficult for the website to block non-legitimate requests. Efforts to block unethical scraping and protect the wiki have disrupted these legitimate tools. The large article count, combined with more than 280,000 total edits over the wiki's lifetime, create an untenable situation where it is simply not possible to scrape the website without causing significant service disruption.

On 1 April 2025, the Wikimedia Foundation indicated that its infrastructure has been under increasing pressure from content scraping bots since January 2024, with the particularly critical metric that "65% of our most expensive traffic comes from bots", despite estimating 35% of all traffic as coming from bots. The bots create traffic patterns that are significantly unlike human traffic patterns, effectively bypassing Wikimedia's caching infrastructure and placing significant load on the core servers. A blog post provides an example where bot traffic caused the [[wikipedia:Wikimedia Commons|Wikimedia Commons]] service to become unstable during a human traffic spike. The Foundation is considering introduction of a Responsible Use of Infrastructure policy to ensure the continued stability of their services.<ref>{{Cite web |last=Mueller |first=Birgit |last2=Danis |first2=Chris |last3=Lavagetto |first3=Giuseppe |date=1 Apr 2025 |title=How crawlers impact the operations of the Wikimedia projects |url=https://diff.wikimedia.org/2025/04/01/how-crawlers-impact-the-operations-of-the-wikimedia-projects/ |website=Wikimedia Foundation}}</ref>

====Perplexity AI and news outlets====

[[Perplexity AI]], founded in August 2022, is a company whose large language model-based service aims to be viewed as a general search engine. It encourages users to consume news through its summaries of stories.

On 15 June 2024, Apple blog MacStories found that Perplexity does not follow its own documented policies when accessing content the user requests from the web. In their testing, the scraper pretended to be Chrome 111 running on Windows 10, connecting from an IP address not found in Perplexity's posted IP address ranges.<ref>https://rknight.me/blog/perplexity-ai-is-lying-about-its-user-agent/</ref> Two days later, this was corroborated by WIRED.<ref>https://www.wired.com/story/perplexity-is-a-bullshit-machine/</ref> Perplexity responded by removing its list of IP addresses.

On 15 June 2024, an investigation by Apple blog MacStories found that Perplexity does not follow its own documented policies when accessing content the user requests from the web. In their testing, the scraper pretended to be Chrome 111 running on Windows 10, connecting from an IP address not found in Perplexity's publicly-listed IP address ranges.<ref>{{Cite web |last=Knight |first=Robb |date=15 Jun 2024 |title=Perplexity AI Is Lying about Their User Agent |url=https://rknight.me/blog/perplexity-ai-is-lying-about-its-user-agent/}}</ref> MacStories' findings were confirmed by a WIRED investigation.<ref>{{Cite web |last=Mehrotra |first=Dhruv |last2=Marchman |first2=Tim |date=19 Jun 2024 |title=Perplexity Is a Bullshit Machine |url=https://www.wired.com/story/perplexity-is-a-bullshit-machine/ |website=WIRED}}</ref> Perplexity responded by removing its list of IP addresses.
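To illustrate why this behavior is hard for site operators to counter, the following is a minimal sketch of the two checks a website might apply: matching a declared crawler token in the User-Agent header, and matching the client address against published ranges. The token list, the address ranges (reserved documentation ranges), and the function names are assumptions made for this example, not any operator's real data.

```python
import ipaddress

# Illustrative values only: the token list and the address ranges are
# assumptions for this sketch, not any operator's published data.
DECLARED_BOT_TOKENS = ("PerplexityBot",)
PUBLISHED_RANGES = [ipaddress.ip_network("192.0.2.0/24")]  # documentation range


def declares_itself(user_agent: str) -> bool:
    """True if the User-Agent header contains a known crawler token."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in DECLARED_BOT_TOKENS)


def in_published_ranges(ip: str) -> bool:
    """True if the client address falls inside a published crawler range."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in PUBLISHED_RANGES)


def looks_like_known_crawler(user_agent: str, ip: str) -> bool:
    """A site can only recognise a crawler that identifies itself somehow."""
    return declares_itself(user_agent) or in_published_ranges(ip)


declared = "Mozilla/5.0 (compatible; PerplexityBot/1.0)"
spoofed = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36"
)

# A request that declares itself and comes from a listed range is recognised.
print(looks_like_known_crawler(declared, "192.0.2.10"))
# A request with a browser-like header from an unlisted address is not.
print(looks_like_known_crawler(spoofed, "198.51.100.7"))
```

Both checks rely on the crawler being honest about who it is, so a request with a browser-like header from an unlisted address passes as ordinary human traffic.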

On 27 June 2024, [[Amazon]] announced an investigation into Perplexity AI, citing a terms of service clause requiring bots hosted on Amazon Web Services to honor robots.txt:<ref name="perplexity-aws">https://www.wired.com/story/aws-perplexity-bot-scraping-investigation/</ref>

On 27 June 2024, [[Amazon]] announced an investigation into Perplexity AI, suggesting the behavior may be considered abusive under Amazon Web Services terms of service:<ref name="perplexity-aws">{{Cite web |last=Mehrotra |first=Dhruv |date=27 Jun 2024 |title=Amazon Is Investigating Perplexity Over Claims of Scraping Abuse |url=https://www.wired.com/story/aws-perplexity-bot-scraping-investigation/ |website=WIRED}}</ref>

Line 79:

Line 82:

</blockquote>

== Privacy concerns of online AI models ==

====Read the Docs====

There are several concerns with using online AI models like [[ChatGPT]], not only because they are proprietary, but also because there is no guarantee to where your data ends up being stored or used for.

In an early example, on 25 July 2024, open source documentation website Read the Docs detailed cases of abusive bots downloading large amounts of content from the service. Particularly, the significant range of IP addresses used in an aggressive manner rendered existing rate limiting ineffective. Taking action to block traffic identified by Cloudflare as "AI crawlers" reduced bandwidth requirements by 75%, at a cost saving of $1,500 USD/month.<ref>{{Cite web |last=Holscher |first=Eric |date=25 Jul 2024 |title=AI crawlers need to be more respectful |url=https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse/ |website=Read the Docs}}</ref>
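The point about rate limiting can be shown with a small simulation. This is a sketch under stated assumptions, not Read the Docs' actual setup: the <code>PerIPRateLimiter</code> class, the limit of 10 requests per 60-second window, and the addresses (taken from reserved documentation ranges) are all hypothetical. It compares 1,000 requests sent from one address with the same 1,000 requests spread across 200 addresses.

```python
from collections import defaultdict


class PerIPRateLimiter:
    """Fixed-window limiter: at most `limit` requests per IP per window."""

    def __init__(self, limit: int, window_seconds: int) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self._counts = defaultdict(int)  # (ip, window index) -> request count

    def allow(self, ip: str, now: float) -> bool:
        """Record the request and return True if it is within the limit."""
        key = (ip, int(now // self.window_seconds))
        if self._counts[key] >= self.limit:
            return False
        self._counts[key] += 1
        return True


def count_allowed(limiter: PerIPRateLimiter, ips: list, total_requests: int,
                  now: float = 0.0) -> int:
    """Send requests round-robin from `ips` and count how many get through."""
    allowed = 0
    for i in range(total_requests):
        if limiter.allow(ips[i % len(ips)], now):
            allowed += 1
    return allowed


# One aggressive client: everything beyond the per-IP limit is rejected.
single = count_allowed(PerIPRateLimiter(10, 60), ["203.0.113.1"], 1000)

# The same load spread over 200 addresses: each stays under the limit.
many_ips = [f"198.51.100.{n}" for n in range(1, 201)]
spread = count_allowed(PerIPRateLimiter(10, 60), many_ips, 1000)

print(single, spread)
```

Because each address stays below the per-IP threshold, the limiter never triggers, even though the server receives the same total load. This is why operators in the cases here resorted to blocking whole categories of traffic.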

====SourceHut and Fedora Linux====

On 15 March 2025, an infrastructure manager for the [[wikipedia:Fedora Linux|Fedora Linux]] open source project discussed a suspected large language model crawling attack against the Pagure.io Git source code hosting service. The project made the decision to block the entire country of Brazil for some time, while also blocking access to certain repositories whose traffic was creating significant CPU usage.<ref>{{Cite web |last=Fenzi |first=Kevin |date=15 Mar 2025 |title=Mid March infra bits 2025 |url=https://www.scrye.com/blogs/nirik/posts/2025/03/15/mid-march-infra-bits-2025/}}</ref><ref>{{Cite web |last=Fenzi |first=Kevin |date=29 Mar 2025 |title=Late March infra bits 2025 |url=https://www.scrye.com/blogs/nirik/posts/2025/03/29/late-march-infra-bits-2025/}}</ref>

Luckily there is an alternative which solves many of these concerns, which is to run AI models locally. There currently exist different models that are small enough to run on a personal computer. Those models are indicated with a smaller parameter size, for instance models with 1.5B or 7B parameters. If the computer has a relatively modern GPU, it can also run one of the larger models for more accurate answers, as these models have GPU-acceleration. The software that will be recommended below runs on all major computer platforms (Windows/macOs/Linux). Be cautious if you download other kinds of models besides the major models, as platforms like HuggingFace allow anyone to upload.<ref>https://huggingface.co/</ref>

On 17 March 2025, the Git source code host SourceHut announced that the service was being disrupted by large language model crawlers. Mitigations deployed to reduce disruption involved requiring login for some areas of the service, and blocking IP ranges of cloud providers, affecting legitimate use of the website by its users.<ref>{{Cite web |date=17 Mar 2025 |title=LLM crawlers continue to DDoS SourceHut |url=https://status.sr.ht/issues/2025-03-17-git.sr.ht-llms/ |website=sr.ht status}}</ref> In response to the event, SourceHut founder Drew DeVault wrote a blog post entitled "[https://drewdevault.com/2025/03/17/2025-03-17-Stop-externalizing-your-costs-on-me.html Please stop externalizing your costs directly into my face]", discussing his frustrations with having ongoing and ever-adapting attacks that must be addressed in a timely fashion to reduce disruption to legitimate SourceHut users. DeVault estimates that between "20-100%" of his time is now spent addressing such attacks.

=== LM Studio ===

One of the easiest software to start with to run these models is LM Studio.<ref>https://lmstudio.ai/</ref> It is user-friendly as it has a graphical user interface aimed at beginners, and allows you to get started with just a few clicks. It recommends appropriately sized models for your specific computer hardware, and manages the rest of the installation for you. In terms of storage, you will need a few gigabytes to store the models locally, which you only have to do once. With the models installed, no further internet connection is required.<ref>[https://www.youtube.com/@NetworkChuck NetworkChuck]: [https://www.youtube.com/watch?v=7TR-FLWNVHY The only way to run deepseek]</ref> The software allows opening chats with the large language model, which you can also organize into folders.

=== Ollama ===

==Privacy concerns of online AI models==

If you are fine with just using the terminal, another option is to install software like Ollama.<ref>https://ollama.com/</ref> Once installed, you can simply invoke the run command with the model you want to use, and it will download that model if it has not done that already. The website lists the most common models to run, like Llama ([[Meta]]) DeepSeek ([[DeepSeek]]), Phi ([[Microsoft]]), Mistral ([[Mistral AI]]), Gemma ([[Google]]). If you are a more advanced user, you can also run Ollama inside Docker.<ref>https://ollama.com/blog/ollama-is-now-available-as-an-official-docker-image</ref> That allows isolating the model completely from your host system, which may be what you want to be extra secure.

There are several concerns with using online AI models like [[ChatGPT]] ([[OpenAI]]), not only because they are proprietary, but also because there is no guarantee about where your data is stored or what it is used for. Recently developed local AI models are an alternative to these online AI models, as they work offline once they have been downloaded from platforms like [https://huggingface.co/ HuggingFace]. Commonly run models include Llama ([[Meta]]), DeepSeek ([[DeepSeek]]), Phi ([[Microsoft]]), Mistral ([[Mistral AI]]), and Gemma ([[Google]]).

==References==

@@ Line 3: / Line 3: @@
 [[wikipedia:Generative artificial intelligence|Generative artificial intelligence]] models are trained through vast amounts of existing human-generated content. Using the example of an LLM, by learning about common trends in sentence structure, the model is able to form complete sentences and show artificial "knowledge" of a topic. The artificial nature may cause [[wikipedia:Hallucination (artificial intelligence)|hallucination]] through confidently-written, but mostly or entirely incorrect, output.
-The current well-funded, lucrative industry of artificial intelligence tools has resulted in rampant unethical use of content. Startups intending to produce AI services have been scraping the internet for content to train future models at a concerning pace, with no regard for copyright law, as members of the field are concerned that they are approaching the limit of publicly-available content to train from.<ref>https://observer.com/2024/12/openai-cofounder-ilya-sutskever-ai-data-peak/</ref>
+The current well-funded, lucrative industry of artificial intelligence tools has resulted in rampant unethical use of content. Startups intending to produce AI services have been scraping the internet for content to train future models at a concerning pace, with no regard for copyright law, as members of the field are concerned that they are approaching the limit of publicly-available content to train from.<ref>{{Cite web |last=Tremayne-Pengelly |first=Alexandra |date=16 Dec 2024 |title=Ilya Sutskever Warns A.I. Is Running Out of Data—Here’s What Will Happen Next |url=https://observer.com/2024/12/openai-cofounder-ilya-sutskever-ai-data-peak/ |website=Observer}}</ref>
 ==Unethical website scraping==
+ Further Reading: [[Nonconsensual Scraping|Nonconsensual scraping]]
 While "mainstream" companies such as [[OpenAI]], [[Anthropic]], and [[Meta]] appear to correctly follow industry-standard practice for web crawlers, others ignore them, causing [[wikipedia:Denial-of-service attack|distributed denial of service attacks]] which damage access to freely-accessible websites. This is particularly an issue for websites that are large or contain many dynamic links.
@@ Line 40: / Line 41: @@
 ===Case studies===
 ====Diaspora====
-On 27 December 2024, the open-source social network project Diaspora noted that 70% of traffic across its infrastructure was in service of AI scrapers.<ref name="geraspora">https://pod.geraspora.de/posts/17342163</ref> Particularly, the project noted that bots had followed links to crawl every individual edit in their [[#MediaWiki|MediaWiki]] instance, causing an exponential increase in the number of unique requests being made.
+On 27 December 2024, the open-source social network project Diaspora noted that 70% of traffic across its infrastructure was in service of AI scrapers.<ref name="geraspora">{{Cite web |last=Schubert |first=Dennis |date=27 Dec 2024 |title=In the last 60 days, the diaspora* web assets received 11.3 million requests ... |url=https://pod.geraspora.de/posts/17342163 |via=Diaspora*}}</ref> Particularly, the project noted that bots had followed links to crawl every individual edit in their [[#MediaWiki|MediaWiki]] instance, causing an exponential increase in the number of unique requests being made.
 ====LVFS====
 The [https://fwupd.org/ Linux Vendor Firmware Service] (LVFS) provides a free central store of firmware updates, such as for UEFI motherboards and SSD controllers. This feature is integrated with many Linux distributions through the <code>fwupd</code> daemon. For situations where internet access is not permitted, the service allows users to make a local mirror of the entire 100+ GB store.
-On 9 January 2025, the project announced that it would introduce a login wall around its mirror feature, citing unnecessary use of its bandwidth.<ref>https://lore.kernel.org/lvfs-announce/zDlhotSvKqnMDfkCKaE_u4-8uvWsgkuj18ifLBwrLN9vWWrIJjrYQ-QfhpY3xuwIXuZgzOVajW99ymoWmijTdngeFRVjM0BxhPZquUzbDfM=@hughsie.com/T/</ref> Up to 1,000 files may be downloaded per day without logging in. The author later mentioned on Mastodon that the problem appears to be caused by AI scraping.<ref>https://mastodon.social/@hughsie/113871373001227969</ref>
+On 9 January 2025, the project announced that it would introduce a login wall around its mirror feature, citing unnecessary use of its bandwidth.<ref>{{Cite web |last=Hughes |first=Richard |date=9 Jan 2025 |title=Authentication soon required to mirror the entire LVFS |url=https://lore.kernel.org/lvfs-announce/zDlhotSvKqnMDfkCKaE_u4-8uvWsgkuj18ifLBwrLN9vWWrIJjrYQ-QfhpY3xuwIXuZgzOVajW99ymoWmijTdngeFRVjM0BxhPZquUzbDfM=@hughsie.com/T/ |website=Linux Vendor Firmware Service (LVFS) mailing list}}</ref> Up to 1,000 files may be downloaded per day without logging in. The author later mentioned on Mastodon that the problem appears to be caused by AI scraping.<ref>{{Cite web |last=Hughes |first=Richard |date=22 Jan 2025 |title=Commentary citing 'Authentication soon required to mirror the entire LVFS' |url=https://mastodon.social/@hughsie/113871373001227969 |via=Mastodon}}</ref>
 ====LWN.net====
@@ Line 56: / Line 57: @@
 </blockquote>
-He later commented:<ref>https://www.heise.de/en/news/AI-bots-paralyze-Linux-news-site-and-others-10252162.html</ref>
+He later commented:<ref>{{Cite web |last=Knop |first=Dirk |date=22 Jan 2025 |title=AI bots paralyze Linux news site and others |url=https://www.heise.de/en/news/AI-bots-paralyze-Linux-news-site-and-others-10252162.html |website=Heise |language=en, de}}</ref>
 <blockquote>
@@ Line 62: / Line 63: @@
 </blockquote>
-====MediaWiki====
+====MediaWiki, Wikipedia, and the Wikimedia Foundation====
 [[wikipedia:MediaWiki|MediaWiki]] is of particular interest to LLM training due to the vast amount of factual, plain-text content wikis tend to hold. While [[wikipedia:Wikipedia|Wikipedia]] and the [[wikipedia:Wikimedia Foundation|Wikimedia Foundation]] host the most well-known wikis, numerous smaller wikis exist thanks to the work of many independent editors. The strength of wiki architecture is its ability for every edit to be audited by anyone, at any time - you can still view [https://en.wikipedia.org/w/index.php?oldid=1 the first edit to Wikipedia] from 2002. This makes wikis a hybrid of a static website and a dynamic web app, which becomes problematic when poorly-designed bots attempt to scrape them.<ref name="geraspora" />
 <!-- COI alert: I, [[User:kirb]], am an admin for The Apple Wiki. Hopefully this is neutral enough?
--->The Apple Wiki, which documents internal details of Apple's hardware and software, holds more than 50,000 articles. On 2 August 2024, with a repeat occurrence on 5 January 2025, the service was disrupted by scraping efforts.<ref>https://theapplewiki.com/wiki/The_Apple_Wiki:Community_portal#Bot_traffic_abuse</ref> The wiki contains a considerable amount of information that is scraped by legitimate security research tools, making it difficult for the website to block non-legitimate requests. Efforts to block unethical scraping and protect the wiki have disrupted these legitimate tools. The large article count, combined with more than 280,000 total edits over the wiki's lifetime, create an untenable situation where it is simply not possible to scrape the website without causing significant service disruption.
+-->The Apple Wiki, which documents internal details of Apple's hardware and software, holds more than 50,000 articles. On 2 August 2024, with a repeat occurrence on 5 January 2025, the service was disrupted by scraping efforts.<ref>{{Cite web |title=Bot traffic abuse |url=https://theapplewiki.com/wiki/The_Apple_Wiki:Community_portal#Bot_traffic_abuse |website=The Apple Wiki}}</ref> The wiki contains a considerable amount of information that is scraped by legitimate security research tools, making it difficult for the website to block non-legitimate requests. Efforts to block unethical scraping and protect the wiki have disrupted these legitimate tools. The large article count, combined with more than 280,000 total edits over the wiki's lifetime, create an untenable situation where it is simply not possible to scrape the website without causing significant service disruption.
+On 1 April 2025, the Wikimedia Foundation indicated that its infrastructure has been under increasing pressure from content scraping bots since January 2024, with the particularly critical metric that "65% of our most expensive traffic comes from bots", despite estimating 35% of all traffic as coming from bots. The bots create traffic patterns that are significantly unlike human traffic patterns, effectively bypassing Wikimedia's caching infrastructure and placing significant load on the core servers. A blog post provides an example where bot traffic caused the [[wikipedia:Wikimedia Commons|Wikimedia Commons]] service to become unstable during a human traffic spike. The Foundation is considering introduction of a Responsible Use of Infrastructure policy to ensure the continued stability of their services.<ref>{{Cite web |last=Mueller |first=Birgit |last2=Danis |first2=Chris |last3=Lavagetto |first3=Giuseppe |date=1 Apr 2025 |title=How crawlers impact the operations of the Wikimedia projects |url=https://diff.wikimedia.org/2025/04/01/how-crawlers-impact-the-operations-of-the-wikimedia-projects/ |website=Wikimedia Foundation}}</ref>
 ====Perplexity AI and news outlets====
 [[Perplexity AI]], founded in August 2022, is a company whose large language model-based service aims to be viewed as a general search engine. It encourages users to consume news through its summaries of stories.
-On 15 June 2024, Apple blog MacStories found that Perplexity does not follow its own documented policies when accessing content the user requests from the web. In their testing, the scraper pretended to be Chrome 111 running on Windows 10, connecting from an IP address not found in Perplexity's posted IP address ranges.<ref>https://rknight.me/blog/perplexity-ai-is-lying-about-its-user-agent/</ref> Two days later, this was corroborated by WIRED.<ref>https://www.wired.com/story/perplexity-is-a-bullshit-machine/</ref> Perplexity responded by removing its list of IP addresses.
+On 15 June 2024, an investigation by Apple blog MacStories found that Perplexity does not follow its own documented policies when accessing content the user requests from the web. In their testing, the scraper pretended to be Chrome 111 running on Windows 10, connecting from an IP address not found in Perplexity's publicly-listed IP address ranges.<ref>{{Cite web |last=Knight |first=Robb |date=15 Jun 2024 |title=Perplexity AI Is Lying about Their User Agent |url=https://rknight.me/blog/perplexity-ai-is-lying-about-its-user-agent/}}</ref> MacStories' findings were confirmed by a WIRED investigation.<ref>{{Cite web |last=Mehrotra |first=Dhruv |last2=Marchman |first2=Tim |date=19 Jun 2024 |title=Perplexity Is a Bullshit Machine |url=https://www.wired.com/story/perplexity-is-a-bullshit-machine/ |website=WIRED}}</ref> Perplexity responded by removing its list of IP addresses.
-On 27 June 2024, [[Amazon]] announced an investigation into Perplexity AI, citing a terms of service clause requiring bots hosted on Amazon Web Services to honor robots.txt:<ref name="perplexity-aws">https://www.wired.com/story/aws-perplexity-bot-scraping-investigation/</ref>
+On 27 June 2024, [[Amazon]] announced an investigation into Perplexity AI, suggesting the behavior may be considered abusive under Amazon Web Services terms of service:<ref name="perplexity-aws">{{Cite web |last=Mehrotra |first=Dhruv |date=27 Jun 2024 |title=Amazon Is Investigating Perplexity Over Claims of Scraping Abuse |url=https://www.wired.com/story/aws-perplexity-bot-scraping-investigation/ |website=WIRED}}</ref>
 <blockquote>
@@ Line 79: / Line 82: @@
 </blockquote>
-== Privacy concerns of online AI models ==
+====Read the Docs====
-There are several concerns with using online AI models like [[ChatGPT]], not only because they are proprietary, but also because there is no guarantee to where your data ends up being stored or used for.
+In an early example, on 25 July 2024, open source documentation website Read the Docs detailed cases of abusive bots downloading large amounts of content from the service. Particularly, the significant range of IP addresses used in an aggressive manner rendered existing rate limiting ineffective. Taking action to block traffic identified by Cloudflare as "AI crawlers" reduced bandwidth requirements by 75%, at a cost saving of $1,500 USD/month.<ref>{{Cite web |last=Holscher |first=Eric |date=25 Jul 2024 |title=AI crawlers need to be more respectful |url=https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse/ |website=Read the Docs}}</ref>
+====SourceHut and Fedora Linux====
+On 15 March 2025, an infrastructure manager for the [[wikipedia:Fedora Linux|Fedora Linux]] open source project discussed a suspected large language model crawling attack against the Pagure.io Git source code hosting service. The project made the decision to block the entire country of Brazil for some time, while also blocking access to certain repositories whose traffic was creating significant CPU usage.<ref>{{Cite web |last=Fenzi |first=Kevin |date=15 Mar 2025 |title=Mid March infra bits 2025 |url=https://www.scrye.com/blogs/nirik/posts/2025/03/15/mid-march-infra-bits-2025/}}</ref><ref>{{Cite web |last=Fenzi |first=Kevin |date=29 Mar 2025 |title=Late March infra bits 2025 |url=https://www.scrye.com/blogs/nirik/posts/2025/03/29/late-march-infra-bits-2025/}}</ref>
-Luckily there is an alternative which solves many of these concerns, which is to run AI models locally. There currently exist different models that are small enough to run on a personal computer. Those models are indicated with a smaller parameter size, for instance models with 1.5B or 7B parameters. If the computer has a relatively modern GPU, it can also run one of the larger models for more accurate answers, as these models have GPU-acceleration. The software that will be recommended below runs on all major computer platforms (Windows/macOs/Linux). Be cautious if you download other kinds of models besides the major models, as platforms like HuggingFace allow anyone to upload.<ref>https://huggingface.co/</ref>
+On 17 March 2025, the Git source code host SourceHut announced that the service was being disrupted by large language model crawlers. Mitigations deployed to reduce disruption involved requiring login for some areas of the service, and blocking IP ranges of cloud providers, affecting legitimate use of the website by its users.<ref>{{Cite web |date=17 Mar 2025 |title=LLM crawlers continue to DDoS SourceHut |url=https://status.sr.ht/issues/2025-03-17-git.sr.ht-llms/ |website=sr.ht status}}</ref> In response to the event, SourceHut founder Drew DeVault wrote a blog post entitled "[https://drewdevault.com/2025/03/17/2025-03-17-Stop-externalizing-your-costs-on-me.html Please stop externalizing your costs directly into my face]", discussing his frustrations with having ongoing and ever-adapting attacks that must be addressed in a timely fashion to reduce disruption to legitimate SourceHut users. DeVault estimates that between "20-100%" of his time is now spent addressing such attacks.
-=== LM Studio ===
-One of the easiest software to start with to run these models is LM Studio.<ref>https://lmstudio.ai/</ref> It is user-friendly as it has a graphical user interface aimed at beginners, and allows you to get started with just a few clicks. It recommends appropriately sized models for your specific computer hardware, and manages the rest of the installation for you. In terms of storage, you will need a few gigabytes to store the models locally, which you only have to do once. With the models installed, no further internet connection is required.<ref>[https://www.youtube.com/@NetworkChuck NetworkChuck]: [https://www.youtube.com/watch?v=7TR-FLWNVHY The only way to run deepseek]</ref> The software allows opening chats with the large language model, which you can also organize into folders.
-=== Ollama ===
+==Privacy concerns of online AI models==
-If you are fine with just using the terminal, another option is to install software like Ollama.<ref>https://ollama.com/</ref> Once installed, you can simply invoke the run command with the model you want to use, and it will download that model if it has not done that already. The website lists the most common models to run, like Llama ([[Meta]]) DeepSeek ([[DeepSeek]]), Phi ([[Microsoft]]), Mistral ([[Mistral AI]]), Gemma ([[Google]]). If you are a more advanced user, you can also run Ollama inside Docker.<ref>https://ollama.com/blog/ollama-is-now-available-as-an-official-docker-image</ref> That allows isolating the model completely from your host system, which may be what you want to be extra secure.
+There are several concerns with using online AI models like [[ChatGPT]] ([[OpenAI]]), not only because they are proprietary, but also because there is no guarantee about where your data is stored or what it is used for. Recently developed local AI models are an alternative to these online AI models, as they work offline once they have been downloaded from platforms like [https://huggingface.co/ HuggingFace]. Commonly run models include Llama ([[Meta]]), DeepSeek ([[DeepSeek]]), Phi ([[Microsoft]]), Mistral ([[Mistral AI]]), and Gemma ([[Google]]).
 ==References==