Cloudflare, a major content delivery network (CDN) provider, recently launched a new feature designed to curb web scraping by AI companies. This functionality is available on both free and paid tiers of their service.
The core of this feature is AI-powered bot detection. According to Cloudflare, its system can identify automated content extraction attempts even when bots use techniques designed to evade traditional detection methods. “We’ve observed bot operators disguise themselves as real browsers using spoofed user agents,” Cloudflare engineers wrote in a blog post today. “Our global machine learning model has consistently detected this activity.”
One such example is a bot identified by Cloudflare that collects content for Perplexity AI Inc., a well-funded search engine startup. A Wired report last month highlighted how this bot’s scraping methods mimicked regular user traffic, making it difficult for website owners to block Perplexity AI’s data collection.
Cloudflare assigns a score from 1 to 99 to each website visit processed through its platform, with lower scores signifying a higher likelihood of bot activity. The bot collecting data for Perplexity AI consistently receives scores below 30, according to Cloudflare.
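To make the scoring concrete, here is a minimal Python sketch of how a site operator might act on such a score. It is an illustration only, not Cloudflare's implementation or API: the `Visit` class, the `is_likely_bot` and `filter_human_visits` helpers, and the use of 30 as a cutoff are all assumptions, with the cutoff borrowed from the "below 30" figure above rather than from any documented default.

```python
from dataclasses import dataclass

# Assumed cutoff, borrowed from the "below 30" figure in the article.
# It is not an official or recommended default.
LIKELY_BOT_THRESHOLD = 30


@dataclass
class Visit:
    """Hypothetical record of one request and the score attached to it."""
    path: str
    user_agent: str
    bot_score: int  # 1 to 99; lower means more likely automated


def is_likely_bot(score: int, threshold: int = LIKELY_BOT_THRESHOLD) -> bool:
    """Return True when the score falls strictly below the threshold."""
    if not 1 <= score <= 99:
        raise ValueError(f"bot score must be between 1 and 99, got {score}")
    return score < threshold


def filter_human_visits(visits: list[Visit]) -> list[Visit]:
    """Keep only the visits whose score does not suggest automation."""
    return [v for v in visits if not is_likely_bot(v.bot_score)]


if __name__ == "__main__":
    visits = [
        # A browser-like user agent does not help if the score is low.
        Visit("/article", "Mozilla/5.0 (looks like a browser)", 12),
        Visit("/article", "Mozilla/5.0 (ordinary reader)", 85),
    ]
    for visit in filter_human_visits(visits):
        print(visit.path, visit.bot_score)
```

Note that the decision depends only on the score, not on the user agent string, which mirrors the point that spoofed user agents alone do not prevent detection.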
“Malicious actors scraping websites often rely on identifiable tools and frameworks,” explained Cloudflare’s engineers. “We leverage our vast network, processing over 57 million requests per second on average, to assess the reliability of each fingerprint we encounter.” The feature is designed to adapt over time, keeping pace with evolving technical signatures of AI scraping bots and the emergence of new crawlers. Additionally, Cloudflare is introducing a tool that allows website owners to report any new bots they encounter.
This development comes amidst ongoing debate regarding the ethics of web scraping for AI training purposes. While some argue that scraping publicly available data is fair game, others raise concerns about copyright infringement and the potential misuse of scraped content. Cloudflare’s new feature equips website owners with greater control over their content, potentially impacting how AI companies gather data for training their large language models.