Artificial intelligence/training: Difference between revisions
mention Alibaba as an offender |
m link CAPTCHA |
||
| Line 31: | Line 31: | ||
To protect against unethical crawlers, due to concerns of both intellectual property and service disruption, websites adopt practices that affect the experience of real users: | To protect against unethical crawlers, due to concerns of both intellectual property and service disruption, websites adopt practices that affect the experience of real users: | ||
*'''Bot check walls''': The user may be required to pass a security check "wall". While usually automatic for the user, this can affect legitimate bots. When a website protection service such as [[Cloudflare]] is not confident as to whether the visitor is legitimate, it may present a CAPTCHA to be manually filled out. An example is "Google Sorry", a CAPTCHA wall frequently seen when using Google Search via a VPN. An example that's popular in the FOSS community is [https://github.com/TecharoHQ/anubis Anubis]. | *'''Bot check walls''': The user may be required to pass a security check "wall". While usually automatic for the user, this can affect legitimate bots. When a website protection service such as [[Cloudflare]] is not confident as to whether the visitor is legitimate, it may present a [[CAPTCHA]] to be manually filled out. An example is "Google Sorry", a CAPTCHA wall frequently seen when using Google Search via a VPN. An example that's popular in the FOSS community is [https://github.com/TecharoHQ/anubis Anubis]. | ||
*'''Login walls''': Should bots be found to pass CAPTCHA walls, the website may advance to requiring logging in to view content. A major recent example of this is [[YouTube]]'s "Sign in to confirm you're not a bot" messages. | *'''Login walls''': Should bots be found to pass CAPTCHA walls, the website may advance to requiring logging in to view content. A major recent example of this is [[YouTube]]'s "Sign in to confirm you're not a bot" messages. | ||
*'''JavaScript requirement''': Most websites do not need JavaScript to deliver their content. However, as many scrapers expect content to be found directly in the HTML, it is often an easy workaround to use JavaScript to "insert" the content after the page has loaded. This may reduce the responsiveness of the website, increasing points of failure, and preventing security-conscious users who disable JavaScript from viewing the website. | *'''JavaScript requirement''': Most websites do not need JavaScript to deliver their content. However, as many scrapers expect content to be found directly in the HTML, it is often an easy workaround to use JavaScript to "insert" the content after the page has loaded. This may reduce the responsiveness of the website, increasing points of failure, and preventing security-conscious users who disable JavaScript from viewing the website. | ||