Artificial intelligence: Difference between revisions

Kirb (talk | contribs)
+AI cat
Kirb (talk | contribs)
Case studies: Trying to make wiki section even more neutral
Line 40: Line 40:
=== Case studies ===
==== Diaspora ====
On 27 December 2024, the open-source social network project Diaspora noted that 70% of traffic across its infrastructure was in service of AI scrapers.<ref>https://pod.geraspora.de/posts/17342163</ref> In particular, the project noted that bots had followed links to crawl every individual edit in its [[wikipedia:MediaWiki|MediaWiki]] instance, causing an exponential increase in the number of unique requests being made.
On 27 December 2024, the open-source social network project Diaspora noted that 70% of traffic across its infrastructure was in service of AI scrapers.<ref name="geraspora">https://pod.geraspora.de/posts/17342163</ref> In particular, the project noted that bots had followed links to crawl every individual edit in its [[#MediaWiki|MediaWiki]] instance, causing an exponential increase in the number of unique requests being made.


==== LVFS ====
Line 61: Line 61:
We do indeed see a kind of pattern. Every IP stays below the threshold for our fuses, but the overload is overwhelming. Any form of active defense will probably have to figure out to block entire subnets instead of individual addresses, and even that might not be enough.
</blockquote>
</blockquote>
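The pattern described in the quote above, where every address stays under a per-IP limit while the combined load is overwhelming, can be illustrated with a minimal sketch. The request log, the thresholds, and the function name <code>over_limit</code> are hypothetical and chosen only for illustration; they do not describe how LVFS or any other project actually configures its defenses.

```python
from collections import Counter
import ipaddress


def over_limit(ips, limit, prefix_len=None):
    """Return the keys whose request count exceeds `limit`.

    With prefix_len=None, requests are counted per address. Otherwise they
    are grouped by the enclosing IPv4 network of that prefix length.
    """
    def key(ip):
        if prefix_len is None:
            return ip
        # strict=False lets us pass a host address and get its network back.
        return str(ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False))

    counts = Counter(key(ip) for ip in ips)
    return {k: n for k, n in counts.items() if n > limit}


# Hypothetical log: 200 addresses in one /24 (a documentation range),
# 5 requests each, plus one ordinary visitor making 3 requests.
requests = [f"203.0.113.{host}" for host in range(1, 201) for _ in range(5)]
requests += ["198.51.100.7"] * 3

per_ip = over_limit(requests, limit=10)                      # assumed per-address limit
per_subnet = over_limit(requests, limit=100, prefix_len=24)  # assumed per-subnet limit

print(per_ip)      # {}
print(per_subnet)  # {'203.0.113.0/24': 1000}
```

No single address trips the per-IP limit, but grouping by /24 exposes the aggregate. As the quote notes, even this may not be enough: a scraper spread across many unrelated subnets would stay below a per-subnet limit as well.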
==== MediaWiki ====
[[wikipedia:MediaWiki|MediaWiki]] is of particular interest for LLM training because of the large amount of factual, plain-text content that wikis tend to hold. While [[wikipedia:Wikipedia|Wikipedia]] and the [[wikipedia:Wikimedia Foundation|Wikimedia Foundation]] host the best-known wikis, numerous smaller wikis exist thanks to the work of many independent editors. A strength of the wiki architecture is that every edit can be audited by anyone, at any time; for example, you can still view [https://en.wikipedia.org/w/index.php?oldid=1 the first edit to Wikipedia] from 2002. This makes a wiki a hybrid of a static website and a dynamic web app, which becomes problematic when poorly designed bots attempt to scrape it.<ref name="geraspora" />
<!-- COI alert: I, [[User:kirb]], am an admin for The Apple Wiki. Hopefully this is neutral enough?
-->The Apple Wiki, which documents internal details of Apple's hardware and software, holds more than 50,000 articles. On 2 August 2024, with a repeat occurrence on 5 January 2025, the service was disrupted by scraping efforts.<ref>https://theapplewiki.com/wiki/The_Apple_Wiki:Community_portal#Bot_traffic_abuse</ref> The wiki contains a considerable amount of information that is scraped by legitimate security research tools, making it difficult for the website to block non-legitimate requests. Efforts to block unethical scraping and protect the wiki have disrupted these legitimate tools. The large article count, combined with more than 280,000 total edits over the wiki's lifetime, creates an untenable situation in which it is simply not possible to scrape the website without causing significant service disruption.


==== Perplexity AI and news outlets ====
Line 72: Line 78:
"AWS's terms of service prohibit abusive and illegal activities and our customers are responsible for complying with those terms," [AWS spokesperson Patrick] Neighorn said in a statement. "We routinely receive reports of alleged abuse from a variety of sources and engage our customers to understand those reports."
</blockquote>
</blockquote>
==== The Apple Wiki ====
<!-- COI alert: I, [[User:kirb]], am an admin for The Apple Wiki. Hopefully this is neutral enough? -->
The Apple Wiki, a MediaWiki instance that documents internal details of Apple's hardware and software, holds more than 50,000 articles. On 2 August 2024, with a repeat occurrence on 5 January 2025, the service was disrupted by scraping efforts.<ref>https://theapplewiki.com/wiki/The_Apple_Wiki:Community_portal#Bot_traffic_abuse</ref> The wiki contains a considerable amount of information that is scraped by legitimate security research tools, making it difficult for the website to block abusive requests. Efforts to block unethical scraping and protect the wiki have disrupted these legitimate tools. The large article count, combined with the more than 280,000 total edits, creates an untenable situation in which it is simply not possible to scrape the website without causing significant service disruption.


== References ==