Artificial intelligence
====MediaWiki, Wikipedia, and the Wikimedia Foundation====
Wikis running [[wikipedia:MediaWiki|MediaWiki]] are of particular interest for LLM training because wikis tend to hold a vast amount of factual, plain-text content. While [[wikipedia:Wikipedia|Wikipedia]] and the [[wikipedia:Wikimedia Foundation|Wikimedia Foundation]] host the most well-known wikis, numerous smaller wikis exist thanks to the work of many independent editors. A strength of wiki architecture is that anyone can audit every edit at any time: you can still view [https://en.wikipedia.org/w/index.php?oldid=1 the first edit to Wikipedia] from 2002. This makes wikis a hybrid of a static website and a dynamic web app, which becomes problematic when poorly designed bots attempt to scrape them.<ref name="geraspora" />

<!-- COI alert: I, [[User:kirb]], am an admin for The Apple Wiki. Hopefully this is neutral enough? -->The Apple Wiki, which documents internal details of Apple's hardware and software, holds more than 50,000 articles. Scraping disrupted the service on 2 August 2024 and again on 5 January 2025.<ref>https://theapplewiki.com/wiki/The_Apple_Wiki:Community_portal#Bot_traffic_abuse</ref> Much of the wiki's information is scraped by legitimate security research tools, which makes it difficult for the website to block illegitimate requests: efforts to block unethical scraping and protect the wiki have also disrupted these legitimate tools. The large article count, combined with more than 280,000 total edits over the wiki's lifetime, creates an untenable situation in which it is simply not possible to scrape the website without causing significant service disruption.

On 1 April 2025, the Wikimedia Foundation reported that its infrastructure had been under increasing pressure from content-scraping bots since January 2024. It highlighted one particularly critical metric: "65% of our most expensive traffic comes from bots", even though it estimates that bots account for only about 35% of overall pageviews. Bot traffic patterns differ significantly from human ones, effectively bypassing Wikimedia's caching infrastructure and placing significant load on the core servers. The Foundation's blog post gives an example in which bot traffic caused the [[wikipedia:Wikimedia Commons|Wikimedia Commons]] service to become unstable during a spike in human traffic. The Foundation is considering introducing a Responsible Use of Infrastructure policy to ensure the continued stability of its services.<ref>https://diff.wikimedia.org/2025/04/01/how-crawlers-impact-the-operations-of-the-wikimedia-projects/</ref>