Gray Matter


Blocking Web Robots on This Website

RT Cunningham | September 5, 2020 (UTC) | Web Development

In the past, and on all of my other websites, I always blocked web robots that didn’t benefit me in any way. With this website, I decided to forget about trying to block them (except for obviously questionable bots). I’m not blocking countries either.

I used to maintain a list of web robots, a bot list. The last one I added to the list was in June 2020. I wasted a lot of time researching what they were and what value they might have. Blocking bad bots was a lot like playing Whac-A-Mole.

Web Robots Everywhere

There are all kinds of web robots out there. I don’t know where I read it, but traffic from web robots surpassed traffic from regular visitors years ago. There are search bots, advertising bots, social bots, data collection bots and bad bots.

I used to block Baidu (based in China) and Yandex (based in Russia), along with a host of other search bots for places other than English-speaking countries, with my Nginx firewall. There are a lot of bots that should still be blocked, but I’m not going to bother with more than a few.

The Web Robots I Still Block

I only have 14 rules for blocking user agents. Here they are:

~^\' 1;
~^\" 1;
~*^acebookexternalhit 1;
~*^mozilla$ 1;
~*^Mozilla/5\.0\ \(compatible\)$ 1;
~*^Mozilla/5\.0\ Firefox/26\.0$ 1;
~*curl 1;
~*msie\ 5 1;
~*msie\ 6 1;
~*User-Agent 1;
~userAgent 1;
~*wget 1;
~ia_archiver 1;
~Screaming\ Frog 1;
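
Rules in this form are the kind that sit inside an Nginx map block keyed on the user agent. The sketch below shows one way that could look. It is an illustration under assumptions, not my exact configuration: $bad_bot is a hypothetical variable name, only three of the rules are repeated, and the server block is stripped down to the one check.

```nginx
# Goes in the http context. $bad_bot is a hypothetical variable name.
map $http_user_agent $bad_bot {
    default             0;   # anything not matched below is allowed
    ~*curl              1;   # ~* is a case-insensitive regex match
    ~*wget              1;
    ~Screaming\ Frog    1;   # ~ is case-sensitive; the space is escaped
    # ...the remaining rules from the list above go here...
}

server {
    # Hypothetical server block; listen, server_name and root are omitted.
    if ($bad_bot) {
        return 403;
    }
}
```

In a map, a leading ~ marks a case-sensitive regular expression and ~* a case-insensitive one, which is why short tokens like curl and wget use ~* while exact names use ~.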

I’m sure I’ll add more, eventually. As you can see, those are pretty obvious.

I also block a few referrers. Most are nonexistent websites and URLs with the scheme (http or https) missing.

Finally, I block a lot of fake search engines. I maintain a list of CIDRs like this:

1; # SoftLayer
1; # US Ettnet
1; # US Choopa
1; # US Quadranet
1; # NL
1; # US Krypt
1; # US Comcast
1; # VN
1; # US Eonix
1; # US DigitalOcean
1; # US Nexeon
1; # BY
1; # NL
1; # DE Internetbolaget
1; # FR Internetbolaget
1; # DE Internetbolaget
1; # ZA
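
Lists like these can be wired up with Nginx's geo and map blocks. What follows is only a sketch under assumptions: the CIDRs are reserved documentation ranges rather than real entries, $bad_referer is a hypothetical name, and feeding the CIDR list into the $fake_ua variable used in the conditions below is a guess at how the pieces fit together.

```nginx
# http context. geo matches against $remote_addr by default.
geo $fake_ua {
    default          0;
    192.0.2.0/24     1;  # placeholder (RFC 5737 documentation range)
    198.51.100.0/24  1;  # placeholder
    203.0.113.0/24   1;  # placeholder
}

# Hypothetical referrer check: flag values that lack an http or https scheme.
map $http_referer $bad_referer {
    default          1;  # anything else, including scheme-less values
    ""               0;  # no Referer header at all is allowed
    ~*^https?://     0;  # well-formed http or https referrers are allowed
}
```

Referrers that point at nonexistent websites would need their own entries in the map, listed as regular expressions ahead of the general https?:// rule, because Nginx checks regular expressions in the order they appear.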

These conditions do the actual blocking:

# Start empty so the two flags below can be concatenated.
set $bad_ua "";
# A: $fake_ua is set (presumably from the fake search engine CIDR list).
if ($fake_ua) { set $bad_ua "A"; }
# B: the user agent claims to be a major search bot. One pattern, so "B" is appended at most once.
if ($http_user_agent ~* "bingbot|duckduckbot|googlebot|msnbot|slurp") { set $bad_ua "${bad_ua}B"; }
# Block only when both flags are set.
if ($bad_ua = "AB") { return 403; }
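
The same two-part test (a flagged address that also claims to be a search bot) can be expressed without a chain of if statements by combining two variables in a map. This is an alternative sketch, not what I actually run: $claims_search_bot and $block_fake_bot are hypothetical names, and it assumes $fake_ua evaluates to exactly 1 for flagged addresses.

```nginx
# http context. Flags user agents that claim to be a major search bot.
map $http_user_agent $claims_search_bot {
    default                                        0;
    ~*bingbot|duckduckbot|googlebot|msnbot|slurp   1;
}

# Combines both flags; only the pair "1:1" results in a block.
map "$fake_ua:$claims_search_bot" $block_fake_bot {
    default  0;
    "1:1"    1;
}

server {
    # Hypothetical server block; other directives are omitted.
    if ($block_fake_bot) {
        return 403;
    }
}
```

Because map variables are evaluated only when they are used, this keeps the matching logic in one place and leaves a single if in the server block.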

Asking for Trouble?

Maybe, but I don’t think so. Unlike my other websites, those that I’m trying to combine into this one, this website is completely static. It’s pure HTML, CSS and JavaScript, with no underlying programming language. There’s no cache because it’s static to begin with.

Image Attribution: DrSJS from Pixabay
