Artificial intelligence/training: Difference between revisions
m linkify: Chrome, Windows |
m re-link Anubis: GH -> WP |
||
| Line 31: | Line 31: | ||
To protect against unethical crawlers, due to concerns of both intellectual property and service disruption, websites adopt practices that affect the experience of real users: | To protect against unethical crawlers, due to concerns of both intellectual property and service disruption, websites adopt practices that affect the experience of real users: | ||
*'''Bot check walls''': The user may be required to pass a security check "wall". While usually automatic for the user, this can affect legitimate bots. When a website protection service such as [[Cloudflare]] is not confident as to whether the visitor is legitimate, it may present a [[CAPTCHA]] to be manually filled out. An example is "Google Sorry", a CAPTCHA wall frequently seen when using Google Search via a VPN. An example that's popular in the FOSS community is [ | *'''Bot check walls''': The user may be required to pass a security check "wall". While usually automatic for the user, this can affect legitimate bots. When a website protection service such as [[Cloudflare]] is not confident as to whether the visitor is legitimate, it may present a [[CAPTCHA]] to be manually filled out. An example is "Google Sorry", a CAPTCHA wall frequently seen when using Google Search via a VPN. An example that's popular in the FOSS community is [[wikipedia:Anubis_(software)|Anubis]]. | ||
*'''Login walls''': Should bots be found to pass CAPTCHA walls, the website may advance to requiring logging in to view content. A major recent example of this is [[YouTube]]'s "Sign in to confirm you're not a bot" messages. | *'''Login walls''': Should bots be found to pass CAPTCHA walls, the website may advance to requiring logging in to view content. A major recent example of this is [[YouTube]]'s "Sign in to confirm you're not a bot" messages. | ||
*'''[[JavaScript]] requirement''': Most websites do not need JavaScript to deliver their content. However, as many scrapers expect content to be found directly in the HTML, it is often an easy workaround to use JavaScript to "insert" the content after the page has loaded. This may reduce the responsiveness of the website, increasing points of failure, and preventing security-conscious users who disable JavaScript from viewing the website. | *'''[[JavaScript]] requirement''': Most websites do not need JavaScript to deliver their content. However, as many scrapers expect content to be found directly in the HTML, it is often an easy workaround to use JavaScript to "insert" the content after the page has loaded. This may reduce the responsiveness of the website, increasing points of failure, and preventing security-conscious users who disable JavaScript from viewing the website. | ||