Artificial intelligence/training: Difference between revisions
mention chip shortage |
To protect against unethical crawlers, over concerns about both intellectual property and service disruption, websites adopt practices that also affect the experience of real users:
*'''Bot check walls''': The user may be required to pass a security check "wall". While usually automatic for the user, this can affect legitimate bots. When a website protection service such as [[Cloudflare]] is not confident that the visitor is legitimate, it may present a [[CAPTCHA]] to be filled out manually. An example is "Google Sorry", a CAPTCHA wall frequently seen when using Google Search via a VPN. An example popular in the FOSS community is {{Wplink|Anubis (software)|Anubis}}.
*'''Login walls''': Should bots be found to pass CAPTCHA walls, the website may escalate to requiring a login to view content. A major recent example of this is [[YouTube]]'s "Sign in to confirm you're not a bot" messages.
*'''JavaScript requirement''': Most websites do not need [[JavaScript]] to deliver their content. However, as many scrapers expect content to be found directly in the HTML, it is often an easy workaround to use JavaScript to "insert" the content after the page has loaded. This may reduce the responsiveness of the website, increase points of failure, and prevent security-conscious users who disable JavaScript from viewing the website.
*'''IP address blocking''': Blocking IP addresses, especially by blocking entire providers via their {{Wplink|Autonomous system (Internet)|autonomous system number}}, always comes with some risk of blocking legitimate users. In particular, this may restrict access for users of a VPN.
*'''Heuristic blocking''': Patterns in request headers may give away that the request is being made by an unethical bot, despite attempts to act as a legitimate visitor. Heuristics are imperfect and may block legitimate users, especially those who use less common browsers.
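The last two practices in the list above can be illustrated with a minimal Python sketch. It is not taken from any real protection service: the blocked networks (reserved documentation ranges standing in for the prefixes of a blocked provider), the list of client hints, the scoring weights, and the threshold are all assumptions chosen for illustration.

```python
import ipaddress
from typing import Mapping

# Hypothetical blocklist. A real deployment would load the prefixes announced
# by a blocked autonomous system; these are reserved documentation ranges.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]

# Assumed substrings identifying scripted HTTP clients (illustrative only).
SCRIPTED_CLIENT_HINTS = ("python-requests", "curl", "scrapy")

# Assumed cut-off: requests scoring this much or more are refused.
BLOCK_THRESHOLD = 2


def is_blocked_ip(address: str) -> bool:
    """Return True if the address falls inside any blocked network."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in BLOCKED_NETWORKS)


def suspicion_score(headers: Mapping[str, str]) -> int:
    """Score a request by header patterns; higher means more bot-like."""
    lowered = {name.lower(): value for name, value in headers.items()}
    user_agent = lowered.get("user-agent", "").lower()
    score = 0
    # A missing User-Agent, or one naming a scripting library, weighs most.
    if not user_agent or any(h in user_agent for h in SCRIPTED_CLIENT_HINTS):
        score += 2
    # Mainstream browsers normally send both of these headers.
    if "accept" not in lowered:
        score += 1
    if "accept-language" not in lowered:
        score += 1
    return score


def should_block(address: str, headers: Mapping[str, str]) -> bool:
    """Combine the network blocklist with the header heuristic."""
    return is_blocked_ip(address) or suspicion_score(headers) >= BLOCK_THRESHOLD


if __name__ == "__main__":
    mainstream = {
        "User-Agent": "Mozilla/5.0",
        "Accept": "text/html",
        "Accept-Language": "en",
    }
    minimal = {"User-Agent": "ObscureBrowser/1.0"}  # hypothetical niche browser
    print(should_block("198.51.100.7", mainstream))  # typical visitor
    print(should_block("198.51.100.7", minimal))     # sparse headers
    print(should_block("203.0.113.9", mainstream))   # address in blocked range
```

Under these assumptions the three calls print <code>False</code>, <code>True</code>, and <code>True</code>. The second and third results show the cost described above: a legitimate visitor whose browser sends few headers, or who connects from inside a blocked range (as a VPN user might), is refused along with the bots.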
In rare situations, a website operator may redirect detected bot traffic elsewhere, for example to speed test files hosted by ISPs, which contain multiple gigabytes of random garbage data. This may disrupt the bot, but its effectiveness is unknown.
The need to respond to unethical scraping also further consolidates the web under the control of a few large {{Wplink|Web application firewall|web application firewall}} (WAF) services, most notably [[Cloudflare]], as website owners find themselves otherwise unable to protect their service from being disrupted by such traffic.
===Case studies===