.htaccess rules for blocking bots with an extra condition

Try it like this:

RewriteCond %{HTTP_USER_AGENT} (ADmantX|Proximic|Barkrowler|X-Middleton) [NC]
RewriteRule ^ - [F]

This will block any request where the User-Agent string contains ADmantX, Proximic, Barkrowler, or X-Middleton. The NC flag makes this a case-insensitive match. Whether that is strictly required in this example I don't know, but generally this could be a case-sensitive match, since User-Agent strings (even from bad bots) are usually consistent with regard to case.

The regex prefix ^.* and suffix .*$ are superfluous.

The regex pattern (A|B|C|D) is called alternation. It essentially means A or B or C or D will match.
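To illustrate, the same alternation behaviour can be sketched with Python's re module (a stand-in here, not what Apache runs internally, though mod_rewrite also uses PCRE-style regex); re.IGNORECASE plays the role of the [NC] flag, and the hypothetical UA strings are made up for the demo:

```python
import re

# Mirrors the RewriteCond pattern; IGNORECASE stands in for [NC].
pattern = re.compile(r"(ADmantX|Proximic|Barkrowler|X-Middleton)", re.IGNORECASE)

# A substring match anywhere in the UA string is enough to trigger the rule.
blocked = pattern.search("Mozilla/5.0 (compatible; proximic; +https://example.com/bot)")
allowed = pattern.search("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
print(bool(blocked), bool(allowed))  # True False
```

Note that "proximic" still matches despite the lowercase "p", which is exactly what [NC] buys you.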

The RewriteRule pattern .* can be simplified to ^ – which is also marginally more efficient, since the pattern does not need to actually match anything here, only to be successful.

The L flag is not required when the F flag is used – it is implied.


UPDATE:

It appears that X-Middleton (or rather X-Middleton/1) is appended to all User-Agent strings that reach your site, as they pass through the Ezoic reverse proxy. So, simply blocking based on the presence of this string in the User-Agent header (as above) is not going to work, since it will block all requests!

If X-Middleton is simply appended to the UA string and no further processing occurs, then you could theoretically block the request when X-Middleton appears twice (or more) in the UA string, in order to catch any request where X-Middleton occurred in the original request.

To handle this situation you would create an additional rule. For example:

RewriteCond %{HTTP_USER_AGENT} (X-Middleton).*\1 [NC]
RewriteRule ^ - [F]

\1 is an internal backreference that matches the first captured subpattern, i.e. "X-Middleton". So the condition is only successful when the string "X-Middleton" occurs at least twice, separated by any number of characters (or none).

The above will block blah X-Middleton blah X-Middleton/1, but not blah blah X-Middleton/1 (case-insensitive match).
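The backreference behaviour can be checked the same way with Python's re module (again just an illustrative stand-in for the PCRE matching Apache performs), using the two example UA strings from above:

```python
import re

# Mirrors the back-referenced RewriteCond pattern; IGNORECASE stands in for [NC].
pattern = re.compile(r"(X-Middleton).*\1", re.IGNORECASE)

# Blocked: "X-Middleton" occurs twice (original fake UA plus the proxy-appended token).
fake = "blah X-Middleton blah X-Middleton/1"
# Allowed: only the single proxy-appended occurrence is present.
real = "blah blah X-Middleton/1"
print(bool(pattern.search(fake)), bool(pattern.search(real)))  # True False
```

The backreference \1 also honours the case-insensitive flag, so a fake "x-middleton" in the original UA would still be caught.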

However, I would want to see an example access log entry (User-Agent string) of such a request before going live with this. It shouldn't block real user requests, but it might not block fake requests either. If you don't have an actual fake request, then you can mock one up by customising the User-Agent string your browser sends (in Chrome's DevTools: Customize menu > More tools > Network conditions, OR install a User-Agent switcher plugin), OR use curl to make the request, e.g. curl -A "<custom-user-agent>" <siteurl> – where you'd set <custom-user-agent> to blah X-Middleton blah or something.

I would also be interested to see the complete list of HTTP request headers that are reaching your application server, as there may be a better way to solve this. (I find it unusual that an intermediate proxy would modify the User-Agent, without also providing the original value. Although, maybe there are no additional options?)