AI Search Engine Perplexity Accused of Stealth Tactics to Evade Website Blocks

Cloudflare Accuses Perplexity of Violating Web Norms with Stealth Crawling Techniques

AI search engine Perplexity is under fire for allegedly using “stealth tactics” to bypass website rules that prohibit its bots from accessing content. Cloudflare, a network security and optimization company, made the allegations in a recent blog post, claiming that Perplexity’s actions flout Internet norms that have been in place for decades. If true, the conduct would raise serious questions about the ethics of AI web crawling.

According to Cloudflare researchers, customers complained that Perplexity’s scraping bots continued to access their websites’ content even after being explicitly disallowed in robots.txt files and blocked by Web application firewalls. A robots.txt file tells crawlers which content they may access and relies on voluntary compliance, while a firewall rule blocks requests from specific bots outright.

To investigate these claims, Cloudflare researchers ran their own tests and found that when known Perplexity crawlers were blocked by robots.txt files or firewall rules, the company switched to stealth bots. These bots used several techniques to conceal their origin, including rotating through multiple IP addresses not listed in Perplexity’s official IP range. The bots also used different Autonomous System Numbers (ASNs) to further evade detection. According to Cloudflare, this allowed Perplexity to continue crawling tens of thousands of domains and to make millions of requests each day.

“This undeclared crawler utilized multiple IPs not listed in Perplexity’s official IP range, and would rotate through these IPs in response to the restrictive robots.txt policy and block from Cloudflare,” the researchers explained. If Cloudflare’s account is accurate, the stealth approach not only circumvented robots.txt exclusions but also flouted longstanding web standards.
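To make the idea of an “official IP range” concrete, here is a minimal Python sketch of how a site operator might check whether a request that claims to come from a declared crawler actually originates from that crawler's published addresses. It is not Cloudflare's detection method. The ranges below are reserved documentation blocks standing in for a published list, and the `PerplexityBot` token and the `classify` helper are assumptions made for illustration.

```python
import ipaddress

# Hypothetical stand-ins for a crawler's published IP ranges.
# These are RFC 5737 documentation blocks, not real Perplexity addresses.
DECLARED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def in_declared_range(ip: str, ranges=DECLARED_RANGES) -> bool:
    """Return True if the address falls inside any published range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in ranges)


def classify(user_agent: str, ip: str, declared_token: str = "PerplexityBot") -> str:
    """Label a request by comparing its claimed identity with its source IP."""
    claims_identity = declared_token.lower() in user_agent.lower()
    from_declared_ip = in_declared_range(ip)
    if claims_identity and from_declared_ip:
        return "declared"          # identity and origin agree
    if claims_identity:
        return "unverified-claim"  # says it is the bot, but IP is not listed
    if from_declared_ip:
        return "undeclared-agent"  # listed IP, but no identifying token
    return "unknown"               # no declared identity, no listed IP


print(classify("Mozilla/5.0 (compatible; PerplexityBot/1.0)", "192.0.2.10"))
print(classify("Mozilla/5.0 Chrome/124.0", "203.0.113.5"))
```

A request that neither declares itself nor comes from a published range lands in the "unknown" bucket, which is why a check like this cannot by itself attribute stealth traffic to a particular operator.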

Internet Norms at Risk

The allegations raise questions about whether web crawlers still adhere to rules established more than three decades ago. In 1994, engineer Martijn Koster introduced the Robots Exclusion Protocol, which lets websites tell crawlers whether they are allowed to access specific content. The protocol, implemented via the robots.txt file, has become an essential tool for managing how content is indexed and shared by automated systems. The Internet Engineering Task Force formalized it as a proposed standard (RFC 9309) in 2022.
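As an illustration of how the protocol works in practice, the following Python sketch uses the standard library's `urllib.robotparser` to perform the check a compliant crawler is expected to make before fetching a page. The robots.txt content, the `example.com` URLs, and the `is_allowed` helper are assumptions for the example, and `PerplexityBot` is used only as an illustrative user-agent token.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is excluded from the whole
# site, and every other crawler is excluded only from /private/.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # Parse from a string so the example needs no network access.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "PerplexityBot", "https://example.com/article"))
print(is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/article"))
```

Nothing in this check is enforced by the server. The crawler runs it voluntarily, so a crawler that changes its user-agent string or skips the check entirely is not stopped by robots.txt alone.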

Cloudflare’s blog post highlights that if Perplexity’s behavior is confirmed, it would be in direct violation of these norms. The researchers pointed out that bots should be transparent, respect website preferences, and clearly serve a legitimate purpose. However, Perplexity’s alleged use of stealth tactics conflicts with these expectations, prompting Cloudflare to take action.

Prior Allegations Against Perplexity

Cloudflare’s accusations are not the first to be leveled against Perplexity over improper crawling practices. Last year, Reddit CEO Steve Huffman criticized the AI search engine and two other AI companies, Microsoft and Anthropic, for disregarding website restrictions and acting as though all online content is fair game for scraping. Huffman described the situation as “a real pain in the ass” for Reddit, as the company struggled to block unauthorized AI crawlers.

Additionally, several publishers have accused Perplexity of plagiarizing content. Notably, Forbes accused the AI search engine of “cynical theft” after a post closely mirroring one of Forbes’ proprietary articles appeared a day after the original was published. Wired, a sister publication of Ars Technica, also raised concerns about suspicious traffic patterns from IP addresses potentially linked to Perplexity. These patterns suggested that Perplexity was disregarding robots.txt exclusions and manipulating its bot’s identity string to bypass blocks.

Cloudflare’s Response

In response to the findings, Cloudflare has taken steps to block Perplexity’s stealth crawlers from accessing its content delivery network. The company has removed Perplexity from its list of verified bots and added heuristics to its rules to block the alleged stealth activity.

“We’ve observed that Perplexity’s behavior is incompatible with web crawling best practices,” Cloudflare’s researchers wrote. “Based on this, we have taken action to prevent this bot from accessing websites through our services.”

Perplexity has not responded to requests for comment on the allegations.

As the debate over the ethical implications of AI-driven web scraping continues to evolve, these latest allegations serve as a reminder of the delicate balance between content access, copyright concerns, and the need for transparency in automated systems. Whether Perplexity will adjust its practices remains to be seen, but its actions highlight the growing tension between AI companies and the broader web community.
