Content producers spend a lot of time worrying about Google's search algorithms. But maybe it's time to think less about how frequently Google crawls your site -- and more about the potential damage from evil Googlebot imposters, who assume Googlebot’s identity to gain privileged access to websites and online information.
According to new research released today by Incapsula, a web security firm, millions of these “evil twins” are used for distributed denial-of-service (DDoS) attacks, hacking, spam, content theft and other shady activities on a daily basis.
Marc Gaffan, Incapsula’s co-founder and Chief Business Officer, shared a disturbing statistic. "For every 25 Googlebots that visit your site, you will also be visited by a fake Googlebot," he said.
Why worry? Because more than 23 percent of these fake Googlebots are designed to wreak havoc on your website.
In case you've been worrying more about content than creepy, crawly things, here's a primer. A web crawler — or "spider" — is an Internet bot that systematically crawls the World Wide Web, typically for the purpose of Web indexing.
Among the innumerable creatures roaming the web, Incapsula contends, "few are as intriguing as Googlebot – a web crawler that facilitates knowledge exchange between billions of humans, influencing our perceptions, preferences and imaginations in more ways than we can even comprehend."
Googlebots crawl the web to discover new and updated pages to be added to that ever-so-important Google index.
Incapsula observed more than 400 million search engine visits to 10,000 sites, resulting in more than 2.19 billion page crawls over a 30-day period. It found:
- Googlebot’s average visit rate per website is 187 visits per day
- Google’s average crawl rate is four pages per visit
- Google doesn’t crawl popular websites any more frequently than smaller websites
- Content-heavy and frequently updated websites, including big forums, news sites and high-scale e-shops with a wide array of frequently updated products, are more thoroughly crawled
- Googlebots crawl more pages than all other search engines combined
Incapsula researchers noted they were surprised the Majestic12 Bot appears fourth on the Most Active Search Engines list, significantly outranking Yandex, a very popular Russian search engine. "Conspiracy theory buffs will recognize Majestic 12’s name for its connection to the (alleged) Roswell UFO landings. While the MJ12 bot is clearly non-human, it has much earthier origins and its own share of controversy."
Googlebot’s 'Unruly Alter Ego'
Incapsula refers to a fake Googlebot as "Mr. Hack … scanner, spammer and DDoS imposter." By definition, it is a bot that operates with Googlebot’s HTTP(S) user-agent but is not what it claims to be. "For those who are unfamiliar with the term, 'user-agent' is an online equivalent of an ID card, used to identify website visitors — browsers or bots," Incapsula stated.
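To see how flimsy that "ID card" is, consider this minimal Python sketch. It builds an HTTP request carrying Googlebot's publicly documented user-agent string; the target URL is a stand-in, and no request is actually sent. The point is that the header is set entirely client-side, so the string alone proves nothing:

```python
import urllib.request

# Googlebot's publicly documented desktop user-agent string.
GOOGLEBOT_UA = (
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)

# Any client can attach this header to any request -- the "ID card"
# is self-reported and nothing validates it at this stage.
req = urllib.request.Request(
    "https://example.com/",  # stand-in target, never actually fetched here
    headers={"User-Agent": GOOGLEBOT_UA},
)

print(req.get_header("User-agent"))
```

This is exactly what a fake Googlebot does at scale: claim the identity in a header and rely on the site to take it at face value.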
Fake Googlebot visits originate from botnets -- clusters of compromised connected devices (e.g., Trojan-infected personal computers) exploited for malicious purposes. Most originate from the US. However, significant numbers of fake Googlebots also come from the UK, France, Belgium, Denmark and China.
Why Use Spoofed Googlebots? Simple. "'Google ID' is as close as a bot can get to having a VIP backstage pass for every show in town," Gaffan said.
Incapsula researchers, who inspected more than 50 million Googlebot imposter visits and also considered findings from their "DDoS Threat Landscape" report, published earlier this year, noted:
"Most website operators know that to block Googlebot is to disappear from Google. Consequently, to preserve their SEO rankings, these website owners will go out of their way to ensure unhindered Googlebot access to their site, at all times. In practical terms, this may translate into exceptions to security rules and lenient rate limiting practices."
These numbers "make all sorts of sense because DDoS is just the situation where Googlebot’s ID can come in handy, particularly in the case of security solutions that still rely on rate limiting instead of case-by-case traffic inspection," Incapsula reports.
Website operators are often faced with a harsh "all or nothing" dilemma: they can block all Googlebot agents and risk loss of traffic, or allow all Googlebots in and risk fakes and downtime.
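The loophole described above can be made concrete with a naive rate-limiter sketch. The thresholds here are illustrative, not from Incapsula's report, and the blanket Googlebot exemption is the deliberate flaw: anything presenting the right user-agent string bypasses the limit entirely, which is exactly the access a DDoS imposter wants.

```python
import time
from collections import defaultdict, deque
from typing import Optional

WINDOW_SECONDS = 10  # illustrative window, not a recommended value
MAX_REQUESTS = 5     # illustrative per-window cap

_hits = defaultdict(deque)  # per-IP timestamps of recent requests

def allow(ip: str, user_agent: str, now: Optional[float] = None) -> bool:
    """Naive rate limiter that waves through anything claiming to be Googlebot."""
    if "Googlebot" in user_agent:
        # The dangerous exemption: trust the self-reported user-agent
        # and skip rate limiting entirely.
        return True
    now = time.monotonic() if now is None else now
    q = _hits[ip]
    # Drop timestamps that have aged out of the window.
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()
    if len(q) >= MAX_REQUESTS:
        return False
    q.append(now)
    return True
```

With this logic, an ordinary client is throttled after five requests in the window, while a flood of spoofed-Googlebot requests from the same botnet sails through unchecked.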
There are ways to separate the real from the fakes. But that takes a combination of security heuristics, including IP and autonomous system number (ASN) verification — and most sites lack the processing power and software capabilities these strategies require.
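One such heuristic is the double DNS lookup that Google itself documents for verifying Googlebot: reverse-resolve the visitor's IP, check that the hostname falls under googlebot.com or google.com, then forward-resolve that hostname and confirm it maps back to the same IP. The sketch below shows the shape of that check; the function names are illustrative, and the network lookups are real calls that a production system would cache and harden.

```python
import socket

# Domains under which genuine Googlebot reverse-DNS hostnames resolve.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def hostname_looks_like_google(hostname: str) -> bool:
    """Pure string check: does a reverse-DNS hostname sit under a Google domain?"""
    return hostname.rstrip(".").endswith(GOOGLE_SUFFIXES)

def is_real_googlebot(ip: str) -> bool:
    """Double DNS lookup: reverse-resolve the IP, vet the domain,
    then forward-resolve the hostname and confirm it maps back."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)   # reverse DNS
    except OSError:
        return False
    if not hostname_looks_like_google(hostname):
        return False
    try:
        forward_ip = socket.gethostbyname(hostname)  # forward DNS
    except OSError:
        return False
    return forward_ip == ip
```

Note the suffix check alone is not enough -- a spoofer can register a lookalike domain -- which is why the forward-resolution round trip back to the original IP is the step that actually closes the loop.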