The Apple Wiki, which documents internal details of Apple's hardware and software, holds more than 50,000 articles. On 2 August 2024, and again on 5 January 2025, the service was disrupted by scraping efforts.<ref>https://theapplewiki.com/wiki/The_Apple_Wiki:Community_portal#Bot_traffic_abuse</ref> Much of the wiki's information is retrieved by legitimate security research tools, which makes it difficult for the website to block non-legitimate requests, and efforts to block unethical scraping and protect the wiki have disrupted these legitimate tools. The large article count, combined with more than 280,000 edits over the wiki's lifetime, creates a situation in which the site cannot be scraped in bulk without causing significant service disruption.


On 1 April 2025, the Wikimedia Foundation reported that its infrastructure had been under increasing pressure from content-scraping bots since January 2024, noting in particular that "65% of our most expensive traffic comes from bots", even though bots were estimated to generate only 35% of all traffic. The bots produce traffic patterns significantly unlike those of human readers, effectively bypassing Wikimedia's caching infrastructure and placing significant load on the core servers. A blog post gives an example in which bot traffic caused the [[wikipedia:Wikimedia Commons|Wikimedia Commons]] service to become unstable during a spike in human traffic. The Foundation is considering introducing a Responsible Use of Infrastructure policy to ensure the continued stability of its services.<ref>https://diff.wikimedia.org/2025/04/01/how-crawlers-impact-the-operations-of-the-wikimedia-projects/</ref>
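
The reasoning behind the caching claim can be illustrated with a toy simulation (the page count, cache size, and popularity distribution below are hypothetical and are not Wikimedia figures): human readers concentrate on a small set of popular pages that a cache can absorb, while a crawler that walks the long tail of pages defeats the cache and sends most of its requests through to the origin servers.

<syntaxhighlight lang="python">
import random
from collections import OrderedDict

def hit_rate(requests, cache_size):
    """Fraction of requests served by a simple LRU cache."""
    cache, hits = OrderedDict(), 0
    for page in requests:
        if page in cache:
            hits += 1
            cache.move_to_end(page)
        else:
            cache[page] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used page
    return hits / len(requests)

PAGES = 50_000   # hypothetical article count
CACHE = 2_000    # cache holds about 4% of the pages
N = 200_000      # requests per simulated workload

# Human-like traffic: heavily skewed towards a few popular pages.
human = [min(int(random.paretovariate(1.2)), PAGES) for _ in range(N)]
# Crawler traffic: walks the long tail roughly uniformly.
crawler = [random.randint(1, PAGES) for _ in range(N)]

print(f"human-like cache hit rate: {hit_rate(human, CACHE):.0%}")
print(f"crawler cache hit rate:    {hit_rate(crawler, CACHE):.0%}")
</syntaxhighlight>

On such a workload the human-like traffic is mostly served from the cache, while the crawler's hit rate falls to roughly the fraction of pages the cache can hold, so nearly every crawler request reaches the backend servers.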


====Perplexity AI and news outlets====
</blockquote>


====Read the Docs====
In an early example, on 25 July 2024, the open source documentation host Read the Docs detailed cases of abusive bots downloading large amounts of content from the service. In particular, the wide range of IP addresses used by the aggressive crawlers rendered existing rate limiting ineffective. Blocking traffic identified by Cloudflare as "AI crawlers" reduced bandwidth usage by 75%, saving approximately US$1,500 per month.<ref>https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse/</ref>
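
Cloudflare's "AI crawler" classification is proprietary, but the simpler fallback of rejecting requests whose User-Agent identifies a known AI crawler can be sketched as a small WSGI middleware (the bot list below is an illustrative, partial example, not the list Read the Docs or Cloudflare uses):

<syntaxhighlight lang="python">
# Substrings of User-Agent headers sent by self-identifying AI crawlers.
# Illustrative and partial; a real deployment needs a maintained list.
AI_CRAWLER_UA_SUBSTRINGS = (
    "GPTBot", "ClaudeBot", "CCBot", "Bytespider", "Amazonbot", "PerplexityBot",
)

def block_ai_crawlers(app):
    """WSGI middleware that returns 403 for matching user agents."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if any(bot in user_agent for bot in AI_CRAWLER_UA_SUBSTRINGS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Automated AI crawling is not permitted.\n"]
        return app(environ, start_response)  # pass ordinary traffic through
    return middleware
</syntaxhighlight>

Filtering of this kind only stops crawlers that identify themselves truthfully; crawlers that spoof browser user agents require CDN-level classification or network-level blocks such as those described below.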
 
====SourceHut and Fedora Linux====
On 15 March 2025, an infrastructure manager for the [[wikipedia:Fedora Linux|Fedora Linux]] open source project described a suspected large language model crawling attack against the project's Pagure (pagure.io) Git source code hosting service. The project temporarily blocked all traffic from Brazil, and also blocked access to certain repositories whose crawler traffic was generating significant CPU load.<ref>https://www.scrye.com/blogs/nirik/posts/2025/03/15/mid-march-infra-bits-2025/</ref><ref>https://www.scrye.com/blogs/nirik/posts/2025/03/29/late-march-infra-bits-2025/</ref>
 
On 17 March 2025, the Git source code host SourceHut announced that the service was being disrupted by large language model crawlers. Mitigations deployed to reduce the disruption included requiring login for some areas of the service and blocking IP ranges belonging to cloud providers, which also affected legitimate use of the website.<ref>https://status.sr.ht/issues/2025-03-17-git.sr.ht-llms/</ref> In response to the event, SourceHut founder Drew DeVault wrote a blog post entitled "[https://drewdevault.com/2025/03/17/2025-03-17-Stop-externalizing-your-costs-on-me.html Please stop externalizing your costs directly into my face]", describing his frustration with ongoing and ever-adapting attacks that must be addressed promptly to limit disruption to legitimate SourceHut users. DeVault estimates that between "20-100%" of his time is now spent addressing such attacks.
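
A minimal sketch of the cloud IP-range blocking approach is shown below (the CIDR blocks are documentation placeholders, and the announcement does not describe SourceHut's actual mechanism): the server checks each client address against a provider's published ranges and denies or challenges matching requests.

<syntaxhighlight lang="python">
import ipaddress

# Placeholder CIDR blocks standing in for a cloud provider's published
# address ranges; a real deployment would refresh these periodically from
# the provider's machine-readable list rather than hard-coding them.
CLOUD_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("203.0.113.0/24", "198.51.100.0/24", "2001:db8::/32")
]

def is_cloud_address(client_ip: str) -> bool:
    """Return True if the client address falls inside a listed cloud range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in CLOUD_NETWORKS)

print(is_cloud_address("203.0.113.7"))  # True  -> block or require login
print(is_cloud_address("192.0.2.10"))   # False -> serve normally
</syntaxhighlight>

The side effect noted above follows directly: legitimate users whose traffic happens to originate from the same cloud networks are caught by the same rule.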


==Privacy concerns of online AI models==