The current well-funded, lucrative industry of artificial intelligence tools has resulted in rampant unethical use of content. Startups building AI services have been scraping the internet for training content at a concerning pace, with no regard for copyright law, as figures in the field warn that they are approaching the limit of publicly available content to train from.<ref>https://observer.com/2024/12/openai-cofounder-ilya-sutskever-ai-data-peak/</ref>

== Unethical website scraping ==
While "mainstream" companies such as [[OpenAI]], [[Anthropic]], and [[Meta]] appear to correctly follow industry-standard practice for web crawlers, others ignore them, causing [[wikipedia:Denial-of-service attack|distributed denial of service attacks]] which damage access to freely-accessible websites. This is particularly an issue for websites that are large or contain many dynamic links.
While "mainstream" companies such as [[OpenAI]], [[Anthropic]], and [[Meta]] appear to correctly follow industry-standard practice for web crawlers, others ignore them, causing [[wikipedia:Denial-of-service attack|distributed denial of service attacks]] which damage access to freely-accessible websites. This is particularly an issue for websites that are large or contain many dynamic links.


Ethical website scrapers, known as "spiders", follow a minimum set of guidelines as they crawl the web. In particular, they respect [[wikipedia:robots.txt|robots.txt]], a text file found at the root of a domain that indicates:

* Paths bots are allowed to index
* Paths bots should not index
* How long the bot should wait between requests to the server, to reduce load
* The [[wikipedia:Sitemaps|sitemap]] of the website's content

These rules are typically configured for all bots, with minor adjustments made for individual bots as needed. Additionally, specific web pages may use the [[wikipedia:noindex|robots meta tag]] to control how their output is used.
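
As a rough illustration of how a well-behaved crawler is expected to use this file, the following sketch checks a hypothetical robots.txt with Python's built-in <code>urllib.robotparser</code> module. The domain, paths, and bot name are made-up examples, not any real site's policy.

<syntaxhighlight lang="python">
# Minimal sketch: an ethical crawler consulting a (hypothetical) robots.txt
# before fetching anything, using only the Python standard library.
from urllib.robotparser import RobotFileParser

SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Allow: /
Crawl-delay: 10
Sitemap: https://example.org/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

bot = "ExampleBot"  # hypothetical crawler name
print(parser.can_fetch(bot, "https://example.org/articles/page"))  # True: path is allowed
print(parser.can_fetch(bot, "https://example.org/search?q=ai"))    # False: path is disallowed
print(parser.crawl_delay(bot))                                     # 10 (seconds to wait between requests)
print(parser.site_maps())                                          # ['https://example.org/sitemap.xml'] (Python 3.8+)
</syntaxhighlight>
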
Less legitimate bots spread their traffic across a wide distribution of IP addresses, in a clear attempt to bypass any IP-based request throttling and rate limiting the website may implement, further reducing the options the website has to protect itself. They are also known to ignore HTTP response status codes that indicate a server error ([[wikipedia:HTTP status code#5xx server errors|5xx]]), or warnings that the client needs to slow down ([[wikipedia:HTTP status code#429|429 Too Many Requests]]) or has been blocked entirely ([[wikipedia:HTTP status code#403|403 Forbidden]]).
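
As a sketch of the behaviour these status codes are meant to trigger, a polite client backs off or gives up rather than retrying immediately. The URL handling, retry limit, and delays below are illustrative assumptions using only the Python standard library, not a description of any particular crawler.

<syntaxhighlight lang="python">
# Hypothetical sketch of a client that respects 403, 429, and 5xx responses.
import time
import urllib.request
from urllib.error import HTTPError

def polite_fetch(url, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            with urllib.request.urlopen(url) as response:
                return response.read()
        except HTTPError as error:
            if error.code == 403:
                return None  # blocked outright: give up instead of retrying
            if error.code == 429 or 500 <= error.code < 600:
                # Honour Retry-After when present (assumed to be given in seconds),
                # otherwise back off exponentially before trying again.
                delay = int(error.headers.get("Retry-After", 2 ** attempt * 30))
                time.sleep(delay)
                continue
            raise  # other errors are not retried
    return None  # still failing after several attempts: stop rather than hammer the server
</syntaxhighlight>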

=== Effect on users ===
To protect against unethical crawlers, out of concern for both intellectual property and service disruption, websites adopt practices that affect the experience of real users:

* '''Bot check walls''': The user may be required to pass a security check "wall". While usually automatic for the user, this can affect legitimate bots. When a website protection service such as [[Cloudflare]] is not confident that a visitor is legitimate, it may present a CAPTCHA to be filled out manually. An example is "Google Sorry", a CAPTCHA wall frequently seen when using Google Search via a VPN.
* '''Login walls''': If bots are found to be getting past CAPTCHA walls, the website may escalate to requiring a login to view content. A major recent example of this is [[YouTube]]'s "Sign in to confirm you're not a bot" messages.
* '''JavaScript requirement''': Most websites do not need JavaScript to deliver their content. However, as many scrapers expect content to be found directly in the HTML, it is often an easy workaround to use JavaScript to "insert" the content after the page has loaded. This may reduce the responsiveness of the website, increase its points of failure, and prevent security-conscious users who disable JavaScript from viewing the website.
* '''IP address blocking''': Blocking IP addresses, especially by blocking entire providers via their [[wikipedia:Autonomous system (Internet)|autonomous system number]], always comes with some risk of blocking legitimate users. In particular, this may restrict access for users connecting through a VPN.
* '''Heuristic blocking''': Patterns in request headers may give away that a request is being made by an unethical bot, despite its attempts to pose as a legitimate visitor (a minimal example is sketched below this list). Heuristics are imperfect and may block legitimate users, especially those using less common browsers.
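
The sketch below illustrates the kind of header-based heuristic described above, and why it can misfire. The header names are standard HTTP, but the specific patterns are hypothetical rather than taken from any real firewall.

<syntaxhighlight lang="python">
# Hypothetical header heuristics; real services combine many more signals
# (IP reputation, TLS fingerprints, behavioural scoring) than shown here.
def looks_like_unwanted_bot(headers: dict) -> bool:
    user_agent = headers.get("User-Agent", "").lower()

    # A missing User-Agent, or a bare HTTP-library default, is a strong signal.
    if not user_agent or "python-requests" in user_agent or user_agent.startswith("curl/"):
        return True

    # Real browsers almost always send Accept-Language; many scrapers do not.
    if "Accept-Language" not in headers:
        return True

    return False

# A legitimate visitor using an unusual client is caught by the same rules:
print(looks_like_unwanted_bot({"User-Agent": "curl/8.5.0"}))  # True
</syntaxhighlight>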

In rare situations, a website operator may redirect detected bot traffic elsewhere, for example to ISP-hosted speed test files containing multiple gigabytes of random garbage data. This may disrupt the bot, but its effectiveness is unknown.

The need to respond to unethical scraping also further consolidates the web into the control of a few large [[wikipedia:Web application firewall|web application firewall]] (WAF) services, most notably [[Cloudflare]], as website owners find themselves otherwise unable to protect their service from being disrupted by such traffic.

=== Case studies ===
==== Diaspora ====
On 27 December 2024, the open-source social network project Diaspora noted that 70% of traffic across its infrastructure was in service of AI scrapers.<ref name="geraspora">https://pod.geraspora.de/posts/17342163</ref> In particular, bots had followed links to crawl every individual edit in the project's [[#MediaWiki|MediaWiki]] instance, causing a drastic increase in the number of unique requests being made.

==== LVFS ====
The [https://fwupd.org/ Linux Vendor Firmware Service] (LVFS) provides a free central store of firmware updates, such as for UEFI motherboards and SSD controllers. The service is integrated with many Linux distributions through the <code>fwupd</code> daemon. For situations where internet access is not permitted, it allows users to make a local mirror of the entire 100+ GB store.

On 9 January 2025, the project announced that it would introduce a login wall around its mirror feature, citing unnecessary use of its bandwidth.<ref>https://lore.kernel.org/lvfs-announce/zDlhotSvKqnMDfkCKaE_u4-8uvWsgkuj18ifLBwrLN9vWWrIJjrYQ-QfhpY3xuwIXuZgzOVajW99ymoWmijTdngeFRVjM0BxhPZquUzbDfM=@hughsie.com/T/</ref> Up to 1,000 files may be downloaded per day without logging in. The author later mentioned on Mastodon that the problem appears to be caused by AI scraping.<ref>https://mastodon.social/@hughsie/113871373001227969</ref>

==== LWN.net ====
On 21 January 2025, Jonathan Corbet, maintainer of the Linux news website [[wikipedia:LWN.net|LWN.net]], made the following [https://social.kernel.org/notice/AqJkUigsjad3gQc664 post] to social.kernel.org:

</blockquote>

==== MediaWiki ====
[[wikipedia:MediaWiki|MediaWiki]] is of particular interest for LLM training due to the vast amount of factual, plain-text content wikis tend to hold. While [[wikipedia:Wikipedia|Wikipedia]] and the [[wikipedia:Wikimedia Foundation|Wikimedia Foundation]] host the most well-known wikis, numerous smaller wikis exist thanks to the work of many independent editors. The strength of wiki architecture is that every edit can be audited by anyone, at any time; you can still view [https://en.wikipedia.org/w/index.php?oldid=1 the first edit to Wikipedia] from 2002. This makes wikis a hybrid of a static website and a dynamic web app, which becomes problematic when poorly-designed bots attempt to scrape them.<ref name="geraspora" />
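
A rough sketch of why this multiplies the amount of content to crawl: for every article, MediaWiki also exposes each stored revision, the diffs between revisions, and the revision history itself as distinct URLs. The wiki domain, page title, and revision IDs below are made up for illustration.

<syntaxhighlight lang="python">
# Hypothetical example: distinct MediaWiki URLs reachable for ONE page
# that has only five stored revisions.
BASE = "https://wiki.example.org/w/index.php"  # made-up wiki
TITLE = "Example_page"                          # made-up page title
revision_ids = [101, 205, 340, 412, 518]        # made-up revision IDs

urls = [f"{BASE}?title={TITLE}"]                                    # current version
urls += [f"{BASE}?title={TITLE}&oldid={r}" for r in revision_ids]   # every old revision
urls += [f"{BASE}?title={TITLE}&diff={b}&oldid={a}"                 # diffs linked from the history
         for a, b in zip(revision_ids, revision_ids[1:])]
urls.append(f"{BASE}?title={TITLE}&action=history")                 # the revision history page

print(len(urls))  # 11 distinct, dynamically generated URLs for a five-revision page
</syntaxhighlight>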

The Apple Wiki, which documents internal details of Apple's hardware and software, holds more than 50,000 articles. On 2 August 2024, with a repeat occurrence on 5 January 2025, the service was disrupted by scraping efforts.<ref>https://theapplewiki.com/wiki/The_Apple_Wiki:Community_portal#Bot_traffic_abuse</ref> The wiki contains a considerable amount of information that is scraped by legitimate security research tools, making it difficult for the website to block illegitimate requests. Efforts to block unethical scraping and protect the wiki have disrupted these legitimate tools. The large article count, combined with more than 280,000 total edits over the wiki's lifetime, creates an untenable situation in which it is simply not possible to scrape the website without causing significant service disruption.

==== Perplexity AI and news outlets ====
[[Perplexity AI]], founded in August 2022, is a large language model-based service that aims to be viewed as a general search engine. It encourages users to consume news through its summaries of stories.

</blockquote>

== Privacy concerns of online AI models ==
There are several concerns with using online AI models like [[ChatGPT]]: not only are they proprietary, but there is also no guarantee as to where your data ends up being stored or what it is used for.
 
Luckily, there is an alternative that addresses many of these concerns: running AI models locally. Various models are now small enough to run on a personal computer; these are indicated by a smaller parameter count, for instance 1.5B or 7B parameters. If the computer has a relatively modern GPU, it can also run one of the larger models for more accurate answers, as the software below supports GPU acceleration. The software recommended below runs on all major computer platforms (Windows, macOS, and Linux). Be cautious if you download models other than the major ones, as platforms like HuggingFace allow anyone to upload models.<ref>https://huggingface.co/</ref>
 
=== LM Studio ===
One of the easiest applications to start with for running these models is LM Studio.<ref>https://lmstudio.ai/</ref> It is user-friendly, with a graphical user interface aimed at beginners that lets you get started in just a few clicks. It recommends appropriately sized models for your specific computer hardware, and manages the rest of the installation for you. You will need a few gigabytes of storage for the models, which only need to be downloaded once. With the models installed, no further internet connection is required.<ref>[https://www.youtube.com/@NetworkChuck NetworkChuck]: [https://www.youtube.com/watch?v=7TR-FLWNVHY The only way to run deepseek]</ref> The software lets you open chats with the large language model, which you can also organize into folders.
 
=== Ollama ===
If you are fine with using the terminal, another option is to install software like Ollama.<ref>https://ollama.com/</ref> Once it is installed, you simply invoke the run command with the model you want to use, and it will first download that model if it is not already present. The website lists the most common models to run, such as Llama ([[Meta]]), DeepSeek ([[DeepSeek]]), Phi ([[Microsoft]]), Mistral ([[Mistral AI]]), and Gemma ([[Google]]). If you are a more advanced user, you can also run Ollama inside Docker.<ref>https://ollama.com/blog/ollama-is-now-available-as-an-official-docker-image</ref> That isolates the model completely from your host system, which may be desirable if you want to be extra secure.
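
For example, assuming Ollama is already installed, downloading and chatting with a model comes down to a single command. The model name below is just one example from the Ollama library and may change over time.

<pre>
ollama run llama3.2    # fetches the model on first use, then opens an interactive chat
ollama list            # shows which models are already stored locally
</pre>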
 
== References ==
<references />

[[Category:Artificial intelligence]]