
Create a browser whitelist with htaccess

To properly set up a browser and bot whitelist with Apache’s .htaccess file, you need to analyze your latest access logs (at least a full month’s worth) and build a list of the services and bots that need access to your site. This is not a trivial task if you use analytics, CDN, security, social networking, and other services. Luckily, many such services use fairly standard user agents, so they may not need separate rules.

I suggest importing your latest monthly log(s) into Microsoft Excel, LibreOffice Calc, Google Sheets, or other spreadsheet software, grouping the data by user agent string, and sorting by the number of occurrences. This gives you a rough overview of which browsers and bots access your site the most. Common online services scan your site at least once a day, so they should appear within the first half of the results.

To find out whether a request originated from a legitimate IP address or range, use http://bgp.he.net/ or a similar service. The most commonly faked user agent is, of course, Googlebot. Make sure that your .htaccess file has a rule for blocking fake Google bots; a quick web search will turn up several approaches.
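As a rough illustration (this rule pair is my sketch, not part of the tutorial’s rule set), fake Googlebots can be denied by combining a user agent check with an address check. The ^66\.249\. range is an assumption based on Google’s published crawler addresses; verify it against current documentation before relying on it:

# Sketch: deny "Googlebot" user agents coming from outside Google's crawler range
# (the 66.249.x.x range is an assumption -- verify against Google's documentation)
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteCond %{REMOTE_ADDR} !^66\.249\.
RewriteRule ^.*$ - [F,L]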

Why a whitelist and not a blacklist? Simple: blacklists tend to grow huge and need far more daily work, while a fine-tuned whitelist can go for months or even years without changes.

Admittedly, this tutorial is not for Apache beginners.

An extremely basic example of a standard user agent string

You also need some basic knowledge of standard user agent strings. For example, just “Firefox 47.0” is most probably a bot trying to crawl or steal your site’s contents; the correct string should be something like “Mozilla/5.0 (Windows NT 10.0; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0”.

Of course, the correct string looks a bit different on a Mac, Linux, iOS, or Android device, and on other versions of Windows.
But a standard one follows the pattern “Mozilla/5.0 ([operating system and/or device details]; rv:47.0) Gecko/20100101 Firefox/47.0 [more details for smartphones]”. Note that the rv: number must match the Firefox/ number; otherwise it is a fake user agent.
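To show the idea in rule form (a minimal sketch of mine, not part of the whitelist below), mod_rewrite can check that the two numbers agree with a regex backreference. Test it carefully, because Firefox derivatives such as SeaMonkey append their own token after the Firefox/xx.0 part:

# Sketch: deny user agents that claim Firefox but whose rv: and Firefox/ versions disagree
RewriteCond %{HTTP_USER_AGENT} Firefox/ [NC]
RewriteCond %{HTTP_USER_AGENT} !rv:([0-9.]+)\)\ Gecko/[0-9.]+\ Firefox/\1
RewriteRule ^.*$ - [F,L]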

Here’s a simple list of other common web browser user agent strings on a 64-bit Windows 10 desktop. As usual, version numbers change over time, and other operating systems and device types have their own variations.

  • Google Chrome – Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36
  • Internet Explorer 11 – Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko
  • Internet Explorer 11 in Compatibility Mode – Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; Tablet PC 2.0)
  • Microsoft Edge – Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2486.0 Safari/537.36 Edge/13.10586
  • Opera – Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36 OPR/38.0.2220.41

You can browse common user-agent strings at http://www.useragentstring.com/pages/useragentstring.php. It also lists many known good and bad online services – helpful in setting up your whitelist.

.htaccess browser and bot whitelist rules example

First, you might want to create an informative page (for example, strangebrowser.html) and redirect visitors with non-standard browsers there. After you’ve thoroughly tested your list, you can replace the redirection rule with a deny (error 403) rule. Please remember to add the <meta content="noindex, nofollow, noarchive" name="robots"> line to the head section of your information page to prevent Google, Bing, and other search robots from indexing it. You might also want to add analytics code to the page if you use such services.

So, let’s set up the basic list. This example uses a separate rule for each whitelisted item for better readability. After testing, you might want to merge the list into one or more longer rules (see the sketch below) and remove most comments.

A few syntax notes:

  • The exclamation mark (!) means “does not match”.
  • The caret or circumflex symbol (^) anchors the pattern to the beginning of the user agent string; rules without it match anywhere in the string.
  • [NC] means “no case”, i.e. the match is not case-sensitive.
  • You must escape dots and spaces with a backslash (\) in your own rules, or the rules may not work properly or may stop Apache from working (error 500).
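For instance, after testing, four of the conditions from the list below could be collapsed into a single alternation like this (an illustrative merge of mine; note the escaped space in Google\ favicon, and keep any ^ anchors inside each alternative when merging anchored rules):

# Merged example: one condition replaces four separate rules
RewriteCond %{HTTP_USER_AGENT} !(Applebot/|bingbot/|DuckDuckGo|Google\ favicon)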

Please note that some services should be allowed only while in use: LoadImpact can be used to take down your site, and the W3C validator is often abused for content scraping.

Create a backup of your .htaccess file before making any modifications!

## User-agent whitelist example by winhelp.info
## Version 1.268, last modified: 2020-10-03
# Exclude your informative page here to prevent redirect loops
RewriteCond %{REQUEST_URI} !strangebrowser\.html
# Exclusions for known problems
# Covenant Eyes parental monitoring
RewriteCond %{REMOTE_ADDR} !^69\.41\.14\.
# The real Googlebot sometimes uses just "Google" as its user agent string from this IP range
RewriteCond %{REMOTE_ADDR} !^66\.102\.6\.
# Qwant search engine uses a non-standard user agent string
RewriteCond %{REMOTE_ADDR} !^194\.187\.168\.
RewriteCond %{REMOTE_ADDR} !^194\.187\.170\.
RewriteCond %{REMOTE_ADDR} !^194\.187\.171\.
# Some files can be crawled by empty or blacklisted user agents; this is normal
# Cropped-logo is specific to a WordPress theme; comment it out or change its name if necessary
RewriteCond %{REQUEST_URI} !ads\.txt
RewriteCond %{REQUEST_URI} !apple-touch-icon
RewriteCond %{REQUEST_URI} !cropped-logo
RewriteCond %{REQUEST_URI} !favicon\.
RewriteCond %{REQUEST_URI} !robots\.txt
# Search engine bot whitelist; comment out or remove everything your site does not need. Add your own services.
RewriteCond %{HTTP_USER_AGENT} !Applebot/
RewriteCond %{HTTP_USER_AGENT} !^AppleNewsBot$
RewriteCond %{HTTP_USER_AGENT} !BaiduSpider [NC]
RewriteCond %{HTTP_USER_AGENT} !bingbot/
RewriteCond %{HTTP_USER_AGENT} !^cortex/
RewriteCond %{HTTP_USER_AGENT} !^DDG-Android-
RewriteCond %{HTTP_USER_AGENT} !DuckDuckGo
RewriteCond %{HTTP_USER_AGENT} !Exabot/
RewriteCond %{HTTP_USER_AGENT} !Google\ favicon
RewriteCond %{HTTP_USER_AGENT} !^Google-AMPHTML
RewriteCond %{HTTP_USER_AGENT} !\ Google-SearchByImage\)
RewriteCond %{HTTP_USER_AGENT} !^Googlebot/2
RewriteCond %{HTTP_USER_AGENT} !^Googlebot-Image/
RewriteCond %{HTTP_USER_AGENT} !^Google-Cloud-ML-Vision$
RewriteCond %{HTTP_USER_AGENT} !istellabot/
RewriteCond %{HTTP_USER_AGENT} !Moatbot/
RewriteCond %{HTTP_USER_AGENT} !MS\ Search\ 6
RewriteCond %{HTTP_USER_AGENT} !opensiteexplorer\.org
RewriteCond %{HTTP_USER_AGENT} !parsijoo-update-crawler
RewriteCond %{HTTP_USER_AGENT} !^Qwantify/
RewriteCond %{HTTP_USER_AGENT} !search\.msn\.com
RewriteCond %{HTTP_USER_AGENT} !SeznamBot/
RewriteCond %{HTTP_USER_AGENT} !sogou\.com/docs/
RewriteCond %{HTTP_USER_AGENT} !www\.google\.com
RewriteCond %{HTTP_USER_AGENT} !yahoo-help\.jp
RewriteCond %{HTTP_USER_AGENT} !^YahooCache
RewriteCond %{HTTP_USER_AGENT} !yandex\.com
# Advertising-related bot whitelist; comment out if your site does not display ads. Add your own services.
RewriteCond %{HTTP_USER_AGENT} !^admantx
RewriteCond %{HTTP_USER_AGENT} !^adreview/
RewriteCond %{HTTP_USER_AGENT} !^Adsbot-Google
RewriteCond %{HTTP_USER_AGENT} !AdxPsfFetcher-Google
RewriteCond %{HTTP_USER_AGENT} !^bidswitchbot/
RewriteCond %{HTTP_USER_AGENT} !Cliqzbot/
RewriteCond %{HTTP_USER_AGENT} !comscore\.com
RewriteCond %{HTTP_USER_AGENT} !^ias-.*admantx
RewriteCond %{HTTP_USER_AGENT} !^media-bot$
RewriteCond %{HTTP_USER_AGENT} !Mediapartners-Google
RewriteCond %{HTTP_USER_AGENT} !Pulsepoint\ XT3\ web\ scraper
RewriteCond %{HTTP_USER_AGENT} !Taboolabot/
# Other known bots; comment out or remove everything your site does not need. Add your own services.
RewriteCond %{HTTP_USER_AGENT} !Alexa\ Verification\ Agent
RewriteCond %{HTTP_USER_AGENT} !bitlybot/
RewriteCond %{HTTP_USER_AGENT} !Blackboard\ Safeassign
RewriteCond %{HTTP_USER_AGENT} !^CCBot/
RewriteCond %{HTTP_USER_AGENT} !^Clickagy\ Intelligence\ Bot
RewriteCond %{HTTP_USER_AGENT} !CloudFlare
RewriteCond %{HTTP_USER_AGENT} !Disqus/
RewriteCond %{HTTP_USER_AGENT} !^dproxy/
RewriteCond %{HTTP_USER_AGENT} !^EasyBib\ AutoCite
RewriteCond %{HTTP_USER_AGENT} !facebookexternalhit/
RewriteCond %{HTTP_USER_AGENT} !^FeedBurner/
RewriteCond %{HTTP_USER_AGENT} !Feedly/
RewriteCond %{HTTP_USER_AGENT} !filterdb\.iss\.net
RewriteCond %{HTTP_USER_AGENT} !^Flipboard/
RewriteCond %{HTTP_USER_AGENT} !^GetIntent\ Crawler
RewriteCond %{HTTP_USER_AGENT} !^GG\ PeekBot
RewriteCond %{HTTP_USER_AGENT} !^Grammarly/
RewriteCond %{HTTP_USER_AGENT} !^HubPages
RewriteCond %{HTTP_USER_AGENT} !^IAS\ crawler
RewriteCond %{HTTP_USER_AGENT} !^Leikibot/
RewriteCond %{HTTP_USER_AGENT} !^LinkedInBot/
RewriteCond %{HTTP_USER_AGENT} !^LongURL\ API
RewriteCond %{HTTP_USER_AGENT} !^Pinterest/
RewriteCond %{HTTP_USER_AGENT} !^Quora\ Link\ Preview/
RewriteCond %{HTTP_USER_AGENT} !rogerbot
RewriteCond %{HTTP_USER_AGENT} !^RyteBot/
RewriteCond %{HTTP_USER_AGENT} !^SafeDNSBot
RewriteCond %{HTTP_USER_AGENT} !^SafeSearch\ microdata\ crawler
RewriteCond %{HTTP_USER_AGENT} !^SiteTruth\.com
RewriteCond %{HTTP_USER_AGENT} !^Slackbot\ [0-9]
RewriteCond %{HTTP_USER_AGENT} !^Slackbot-LinkExpanding
RewriteCond %{HTTP_USER_AGENT} !^Sprinklr
RewriteCond %{HTTP_USER_AGENT} !^TelegramBot\ \(
RewriteCond %{HTTP_USER_AGENT} !^TinEye-bot/
RewriteCond %{HTTP_USER_AGENT} !TweetmemeBot/
RewriteCond %{HTTP_USER_AGENT} !^TurnitinBot\ \(
RewriteCond %{HTTP_USER_AGENT} !^Twitterbot/
RewriteCond %{HTTP_USER_AGENT} !^WhatsApp/
RewriteCond %{HTTP_USER_AGENT} !Wikipedia\ Broken
RewriteCond %{HTTP_USER_AGENT} !visionutils [NC]
RewriteCond %{HTTP_USER_AGENT} !www\.deadlinkchecker\.com
# Services that should not always be allowed; remove the comment mark only while the service is in use
#RewriteCond %{HTTP_USER_AGENT} !^LoadImpact
#RewriteCond %{HTTP_USER_AGENT} !validator\.w3\.org
# Browser whitelist; includes most modern browsers, excludes text-based Lynx and Links
RewriteCond %{HTTP_USER_AGENT} !Dalvik/
RewriteCond %{HTTP_USER_AGENT} !^Dillo/
RewriteCond %{HTTP_USER_AGENT} !Dolfin/
RewriteCond %{HTTP_USER_AGENT} !^Dorado\ WAP-Browser/
RewriteCond %{HTTP_USER_AGENT} !MIDP
RewriteCond %{HTTP_USER_AGENT} !Mobile
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/\d\.0\ \(([a-z]|[A-Z])
RewriteCond %{HTTP_USER_AGENT} !MSIE\ .*Windows.*Trident/
RewriteCond %{HTTP_USER_AGENT} !NetFront/
RewriteCond %{HTTP_USER_AGENT} !ObigoInternetBrowser/
RewriteCond %{HTTP_USER_AGENT} !^Opera/9\.
RewriteCond %{HTTP_USER_AGENT} !^Safari/.*\ CFNetwork/.*\ Darwin/
RewriteCond %{HTTP_USER_AGENT} !^UCWEB/
# Redirect during the testing period
RewriteRule .* /strangebrowser.html? [R=307,L]
# Comment out the line above and remove the comment mark below to block non-standard browsers after the testing period
# RewriteRule ^.*$ - [F,L]

Browser whitelist rules should go near the beginning of the <IfModule mod_rewrite.c> section. You might want to add them after the RewriteBase / rule and the redirection rules for the HTTP(S) protocol and the www prefix.
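As a rough sketch of that ordering (the domain and the redirect rules here are placeholders, not part of the tutorial):

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
# 1. Protocol and www redirects first (example.com is a placeholder)
RewriteCond %{HTTPS} !=on [OR]
RewriteCond %{HTTP_HOST} !^www\. [NC]
RewriteRule ^(.*)$ https://www.example.com/$1 [R=301,L]
# 2. The browser and bot whitelist from this tutorial goes here
# 3. Application rewrites (for example, the WordPress front controller) come last
</IfModule>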

After your first attempt works, keep scanning your access logs daily to catch anything you might have missed.

Please also read how to block fake user agents that might bypass this whitelist: Stop fake user agents with htaccess.