Which bots consume my server’s resources the most? What’s the situation with bots at the beginning of 2020?
To answer these questions, I decided to analyze the access log of one tool I run as webmaster. From 2020-03-14 00:00:00 to 2020-03-18 00:00:00, the bots I was able to identify were:
- Ahrefs (1965 hits) – botsbreeder’s backlink scraper based in Singapore. I like the tool itself, but at this level the bot’s activity has no direct value for me.
Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)
- Aspiegel (1062 hits) – voracious search engine (SE) bot from Huawei, potential traffic source
Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; AspiegelBot)
- Semrush (907 hits) – backlink scraper, no value
Mozilla/5.0 (compatible; SemrushBot/6~bl; +http://www.semrush.com/bot.html)
- Majestic (647 hits) – backlink scraper, no value
Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)
- Moz (477 hits) – backlink scraper, no value
Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, [email protected])
- Bingbot (150 hits) – SE from Mrkvosoft 🥕, potential traffic source
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
- Seznam (85 hits) – SE, potential traffic source
Mozilla/5.0 (compatible; SeznamBot/3.2; +http://napoveda.seznam.cz/en/seznambot-intro/)
- Google (only 67 hits) – Big Brother, potential traffic source
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.92 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
- Yandex (24 hits) – SE, potential traffic source
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
- SearchAtlas.com (4 hits) – no idea
SearchAtlas.com SEO Crawler
- Coccoc (2 hits) – some cute little Vietnamese SE, potential traffic source
Mozilla/5.0 (compatible; coccocbot-web/1.0; +http://help.coccoc.com/searchengine)
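Per-bot counts like the ones above can be pulled out of an access log with a few lines of Python. This is only a minimal sketch, not the script used for this post: it assumes the log is in the common "combined" format, where the User-Agent is the last quoted field on each line, and it does no filtering by date. The function name `count_bot_hits` and the sample log lines (with documentation-range IP addresses) are made up for illustration; the tokens come from the UA strings listed above.

```python
import re
from collections import Counter

# Substrings to look for in the User-Agent, mapped to a display name.
# The tokens are taken from the UA strings listed above.
BOT_TOKENS = {
    "AhrefsBot": "Ahrefs",
    "AspiegelBot": "Aspiegel",
    "SemrushBot": "Semrush",
    "MJ12bot": "Majestic",
    "DotBot": "Moz",
    "bingbot": "Bingbot",
    "SeznamBot": "Seznam",
    "Googlebot": "Google",
    "YandexBot": "Yandex",
    "SearchAtlas": "SearchAtlas.com",
    "coccocbot": "Coccoc",
}

# Assumption: combined log format, so the User-Agent is the last quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')


def count_bot_hits(lines):
    """Count hits per known bot from an iterable of access log lines."""
    hits = Counter()
    for line in lines:
        match = UA_PATTERN.search(line)
        if not match:
            continue  # line does not end with a quoted field
        user_agent = match.group(1).lower()
        for token, name in BOT_TOKENS.items():
            if token.lower() in user_agent:
                hits[name] += 1
                break  # count each request for one bot only
    return hits


# Hypothetical sample lines, only to show the expected input shape.
SAMPLE_LINES = [
    '203.0.113.5 - - [14/Mar/2020:00:00:01 +0000] "GET /about HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)"',
    '203.0.113.5 - - [14/Mar/2020:00:00:09 +0000] "GET /faq HTTP/1.1" 200 734 "-" '
    '"Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)"',
    '203.0.113.9 - - [14/Mar/2020:00:01:30 +0000] "GET / HTTP/1.1" 200 1024 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.20 - - [14/Mar/2020:00:02:11 +0000] "GET / HTTP/1.1" 200 1024 "-" '
    '"Mozilla/5.0 (X11; Linux x86_64; rv:74.0) Gecko/20100101 Firefox/74.0"',
]

if __name__ == "__main__":
    print(count_bot_hits(SAMPLE_LINES).most_common())
    # [('Ahrefs', 2), ('Google', 1)]
```

For a real log, pass an open file object instead of `SAMPLE_LINES`. Matching on substrings is crude and trusts whatever the client claims to be, so treat the numbers as an estimate.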
I’m sure you’ve noticed the surprisingly low number of Google visits. It isn’t a mistake: GoogleBot really does crawl at a low frequency. I think there are several reasons:
- Google’s bot is pretty well tuned after all those years of experience.
- The tool processes data from Google, so Google doesn’t need to index what it already has.
- The site is policed by Google Analytics.
Back to backlink monitors
During the analyzed period, backlink monitors accounted for about 75% of all bot traffic. They consume bandwidth, the server’s processing time and, most important of all, API limits. All without bringing any direct value to the tool itself. My conclusion? Block them!
Am I the only one with this opinion? Definitely not. In an article I stumbled on while searching for information about UA strings, “Eli the Bearded” comes to the same conclusion, block SEO bots:
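The share can be rechecked from the hit counts listed earlier. The sketch below assumes that "backlink monitors" means the four bots described above as backlink scrapers (Ahrefs, Semrush, Majestic, Moz); that grouping is my reading of the list, and SearchAtlas.com is left out because its purpose is unknown.

```python
# Hit counts copied from the list above (2020-03-14 to 2020-03-18).
hits = {
    "Ahrefs": 1965,
    "Aspiegel": 1062,
    "Semrush": 907,
    "Majestic": 647,
    "Moz": 477,
    "Bingbot": 150,
    "Seznam": 85,
    "Google": 67,
    "Yandex": 24,
    "SearchAtlas.com": 4,
    "Coccoc": 2,
}

# Assumption: these four are the "backlink monitors" the text refers to.
backlink_monitors = {"Ahrefs", "Semrush", "Majestic", "Moz"}

total = sum(hits.values())
monitor_hits = sum(count for name, count in hits.items() if name in backlink_monitors)
share = monitor_hits / total

print(f"{monitor_hits} of {total} identified bot hits = {share:.1%}")
# 3996 of 5390 identified bot hits = 74.1%
```

That is just under three quarters of the identified bot traffic.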
I don’t know everything MJ12bot does, but I do know one thing it does is power paid access to “incoming” links reports via “Majestic Site Explorer”: “Access raw exports from £79.99 a month”. So let me get this, you crawl sites to sell people lists of who links to them? Why should I waste my bandwidth giving you pages?
But clearly it is Megaindex that is abusive. At the .com version of the site I read “MegaIndex is a powerful and versatile competitive intelligence suite for online marketing, from SEO and PPC to social media and advertising research.” Again, this is a [BS] use of my resources (bandwidth, web server CPU) for some commercial enterprise that cannot benefit me.
So: another new plugin is born, browser_block. Goodbye Megaindex. Goodbye Majestic.
I’ll complete his list with: Goodbye Ahrefs. Goodbye SemRush. Goodbye Moz.
Backlink scrapers have no value for me. Moreover, my tool has no value for them (there are no “valuable” links). To save resources on both sides, I decided to block their bots from certain sections of the tool (result pages). Bots are still allowed to crawl general sections like about, contact, FAQs, etc.
This is what my robots.txt looks like now:
# Relaxed rules for well trained bots
User-agent: AhrefsBot
User-agent: MJ12bot
User-agent: dotbot
Disallow: /url?par=*

# Since SemrushBot is less developed than is normal for a particular
# age (search what it mean in Oxford dictionary) there is a need to
# be more strict.
User-agent: SemrushBot
Disallow: /
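To see what these rules should mean for a given URL, here is a rough Python sketch of how a wildcard-aware crawler might read them. It is an illustration under stated assumptions, not the logic of any particular bot: the rules are transcribed by hand into a hypothetical `RULES` dictionary, `*` is treated as "any sequence of characters" and a trailing `$` as "end of URL" (the special characters described in RFC 9309), and `Allow` lines and longest-match precedence are ignored because this file does not use them. The helper names `pattern_to_regex` and `is_allowed` are made up for the example.

```python
import re

# Disallow patterns per bot, transcribed by hand from the robots.txt above.
# Keys are lowercase because user-agent matching is case-insensitive.
RULES = {
    "ahrefsbot": ["/url?par=*"],
    "mj12bot": ["/url?par=*"],
    "dotbot": ["/url?par=*"],
    "semrushbot": ["/"],
}


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a regex matched from the start."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything literally, then turn each "*" into ".*".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(bot, path):
    """Return True if no Disallow pattern listed for `bot` matches `path`."""
    for pattern in RULES.get(bot.lower(), []):
        if pattern_to_regex(pattern).match(path):
            return False
    return True


if __name__ == "__main__":
    print(is_allowed("AhrefsBot", "/url?par=example.com"))   # False
    print(is_allowed("AhrefsBot", "/about"))                 # True
    print(is_allowed("SemrushBot", "/about"))                # False
    print(is_allowed("Googlebot", "/url?par=example.com"))   # True
```

Whether a bot actually behaves this way is up to the bot; robots.txt is a request, not an enforcement mechanism.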
Now I’m curious which company has a well-behaved robot obeying robots.txt.
Update: I have added a fresh version of robots.txt. I think the answer to the previous question is obvious 🙂
Update 2: YouTube video where Google’s engineer explains how GoogleBot works with