Web Scraping Guide: Headers & User-Agents Optimization Checklist

In our Web Scraping Without Getting Blocked guide, we saw that there are a number of ways for websites to determine that you are a scraper and block you.

Many developers focus most of their attention on using proxies to avoid getting blocked. However, one of the most overlooked leading causes of getting blocked is giving away your identity as a scraper in your request headers.

Headers are sent along with every HTTP request and provide important metadata about the request to the receiving website, so it knows who you are and how to process the request.

When it comes to web scraping it is vital that you optimize your headers or else you run the risk that websites will detect that you are a scraper and block your requests.

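In this guide, we will walk you through the Header & User-Agent Optimization Checklist:

  • Why You Need To Use Real Web Browser Headers
  • Mimicking Real Browser Headers
  • Ensuring Proper Header Order
  • Rotating Headers & User-Agents
  • Keeping Headers Up To Date
  • Remove Bad HTTP Headers
  • Managed Headers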


Why You Need To Use Real Web Browser Headers

By default, most HTTP libraries (Python Requests, Scrapy, Node.js Axios, etc.) either don't attach real browser headers to your requests or include headers that identify the library being used. Either one immediately tells the website you are trying to scrape that you are a scraper, not a real user.

For example, let's send a request to http://httpbin.org/headers with the Python Requests library using the default settings:


import requests

r = requests.get('http://httpbin.org/headers')
print(r.text)

You will get a response like this that shows what headers we sent to the website:


{
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.26.0"
  }
}

Here we can see that the Python Requests library attaches very few headers to the request, and even identifies itself as the Python Requests library in the User-Agent header.


"User-Agent": "python-requests/2.26.0",

If you tried to scrape a website like this, it would be obvious to the website that you are a web scraper, and it could quickly block your IP address from accessing the website.
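To make this concrete, here is a minimal sketch of the kind of check a website could run against incoming headers. It is an illustration only: the function name `looks_like_http_library` and the list of library names are assumptions made for this example, and real anti-bot systems are considerably more sophisticated.

```python
import re

# Hypothetical list of User-Agent prefixes sent by common HTTP libraries.
LIBRARY_UA_PATTERN = re.compile(
    r"^(python-requests|python-urllib|scrapy|axios|curl|go-http-client)\b",
    re.IGNORECASE,
)


def looks_like_http_library(headers):
    """Return True if the headers resemble an HTTP library's defaults.

    Assumes header names use their canonical capitalisation.
    """
    user_agent = headers.get("User-Agent", "")
    # A missing User-Agent, or one that names a library, is a strong signal.
    if not user_agent or LIBRARY_UA_PATTERN.match(user_agent):
        return True
    # Browsers typically send Accept-Language; the Requests defaults do not.
    return "Accept-Language" not in headers
```

Passing in the default Python Requests headers shown above returns `True`, because the User-Agent starts with `python-requests`.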

That is why we need to optimize our headers when web scraping.


Mimicking Real Browser Headers

To avoid our scraper's requests being detected and blocked, we need to make them blend into the normal website traffic and seem like they are coming from a user using a real web browser.

To do so we need to mimic real browser headers when we send requests to a website.

Here are example headers when using a Chrome browser on a MacOS machine:


Host: 127.0.0.1:65432
Connection: keep-alive
Cache-Control: max-age=0
sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="99", "Google Chrome";v="99"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "macOS"
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Sec-Fetch-Dest: document
Accept-Encoding: gzip, deflate, br
Accept-Language: en-GB,en-US;q=0.9,en;q=0.8

As we can see, a real web browser sends a lot of headers to the website every time it makes a request.

When our scraper's requests don't have headers like these, it is really obvious to the website that we aren't a real user, and oftentimes it will block our IP address.

Every browser (Chrome, Firefox, Edge, etc.) attaches slightly different headers in a different order, and these also vary with the operating system the browser is running on. So it is important to ensure the headers (and header order) we attach to our requests are correct.

As our goal is to blend into a website's normal web traffic, we should use the most common browser/operating system combinations when making requests, such as:

  • Chrome on Windows
  • Safari on MacOS
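As a rough sketch, the Chrome on macOS headers shown earlier in this section can be stored as a dictionary and attached to each request. The names `CHROME_MACOS_HEADERS` and `build_headers` are made up for this example, and `Host` and `Connection` are left out on the assumption that your HTTP client sets them itself.

```python
# Values copied from the Chrome on macOS example in this section.
# Host and Connection are omitted: the HTTP client normally adds them.
CHROME_MACOS_HEADERS = {
    "Cache-Control": "max-age=0",
    "sec-ch-ua": '" Not A;Brand";v="99", "Chromium";v="99", "Google Chrome";v="99"',
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": '"macOS"',
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36"
    ),
    "Accept": (
        "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,"
        "image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9"
    ),
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-User": "?1",
    "Sec-Fetch-Dest": "document",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8",
}


def build_headers(base, extra=None):
    """Return a copy of a browser header set, optionally with site-specific extras."""
    headers = dict(base)  # copy so the shared template is never mutated
    if extra:
        headers.update(extra)
    return headers
```

With Python Requests, this could be used as `requests.get(url, headers=build_headers(CHROME_MACOS_HEADERS))`. One caveat: advertising `br` in Accept-Encoding means the server may reply with Brotli-compressed content, which Requests can only decode if a Brotli package is installed, so you may prefer to drop `br` if it is not.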

Generating HTTP Headers

To use real browser headers in our scrapers we first need to gather them.

To do so, open Developer Tools in your browser by right-clicking on the page and selecting Inspect, then visit a website. For example: google.com

From here open the Network tab, and select Fetch/XHR.

This will show all the network requests that we used to get the google.com page.

Click on the first network request in the sidebar and select the Headers tab. This will show all the request and response headers our browser sent and received.

We care about the Request Headers, as we can copy these into our scraper and clean out some of the redundant headers or headers that are too specific like cookie or x-client-data.

In general you only want to include the following headers with your requests, unless a website requires you to send others to access the data you need.


Host
Connection
Cache-Control
sec-ch-ua
sec-ch-ua-mobile
sec-ch-ua-platform
Upgrade-Insecure-Requests
User-Agent
Accept
Sec-Fetch-Site
Sec-Fetch-Mode
Sec-Fetch-User
Sec-Fetch-Dest
Accept-Encoding
Accept-Language

This can be a bit of a trial and error process, as you find the optimal header set for your target website.
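The clean-up step can be automated. The sketch below parses request headers pasted from Developer Tools as `Name: value` lines and keeps only the headers from the list above, dropping anything else such as `cookie` or `x-client-data`. The function name `clean_copied_headers` is hypothetical, and the allowlist is just a starting point that you would adjust per website.

```python
# Header names from the list above, lower-cased for case-insensitive matching.
ALLOWED_HEADERS = {
    "host", "connection", "cache-control", "sec-ch-ua", "sec-ch-ua-mobile",
    "sec-ch-ua-platform", "upgrade-insecure-requests", "user-agent", "accept",
    "sec-fetch-site", "sec-fetch-mode", "sec-fetch-user", "sec-fetch-dest",
    "accept-encoding", "accept-language",
}


def clean_copied_headers(raw):
    """Parse 'Name: value' lines and keep only allowed headers, in their original order."""
    cleaned = {}
    for line in raw.splitlines():
        # Split on the first colon only, since values may contain colons too.
        name, sep, value = line.partition(":")
        if not sep:
            continue  # blank or malformed line
        name = name.strip()
        if name.lower() in ALLOWED_HEADERS:
            cleaned[name] = value.strip()
    return cleaned
```

Because Python dictionaries preserve insertion order, the result keeps the headers in the order the browser listed them.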

The order in which you add your headers can lead to your requests being flagged as suspicious so it is vital that you ensure you are using the correct header order when making requests.


Ensuring Proper Header Order

A common issue developers overlook when configuring headers for their web scrapers is the order of those headers.

Each browser sends its headers in a specific order, so if your scraper sends a request with the headers out of order, this can be used to detect your scraper.

For example, see how the header order for Chrome on Windows is different to the header order when using Firefox on Windows:

Chrome on Windows

Host: 127.0.0.1:65432
Connection: keep-alive
Cache-Control: max-age=0
sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="99", "Google Chrome";v="99"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "Windows"
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Sec-Fetch-Dest: document
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9

Firefox on Windows

Host: 127.0.0.1:65432
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none
Sec-Fetch-User: ?1
Cache-Control: max-age=0

What makes this issue more complicated is the fact that many HTTP clients implement their own header orders and don't respect the header orders you define in your scraper.

Take the Python Requests library, for example: depending on how it is used, it does not always respect the header order you define. (See Issue 5814 for more info and solutions).

To prevent this, you should make sure the HTTP client you use respects the header order you set in your scraper and doesn't override it with its own header order.
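On the scraper side, one way to keep the order consistent is to sort your headers against a reference order before handing them to the HTTP client. The sketch below uses the Chrome on Windows order shown above; `order_like_browser` is a hypothetical helper, and it only controls the order of the dictionary you pass in. Whether that order survives on the wire still depends on the HTTP client, so it is worth confirming against a service that echoes your request back.

```python
# Header order taken from the Chrome on Windows example above.
CHROME_WINDOWS_ORDER = [
    "Host", "Connection", "Cache-Control", "sec-ch-ua", "sec-ch-ua-mobile",
    "sec-ch-ua-platform", "Upgrade-Insecure-Requests", "User-Agent", "Accept",
    "Sec-Fetch-Site", "Sec-Fetch-Mode", "Sec-Fetch-User", "Sec-Fetch-Dest",
    "Accept-Encoding", "Accept-Language",
]


def order_like_browser(headers, reference_order):
    """Return a new dict with headers arranged in the reference browser order."""
    # Index by lower-cased name so matching is case-insensitive.
    remaining = {name.lower(): (name, value) for name, value in headers.items()}
    ordered = {}
    for reference_name in reference_order:
        entry = remaining.pop(reference_name.lower(), None)
        if entry is not None:
            ordered[entry[0]] = entry[1]
    # Headers missing from the reference keep their relative order at the end.
    for name, value in remaining.values():
        ordered[name] = value
    return ordered
```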


Rotating Headers & User-Agents

When scraping at scale, it isn't enough just to use real browser headers. You need hundreds or thousands of header sets that you can rotate through as you make requests.

This is especially important for the User-Agent header, which is probably the most important header for web scraping because it tells the website which browser you are using.

You should configure your scraper to rotate through a list of user-agents when scraping.

There are a number of databases out there that give you access to the most common user agents, like the whatismybrowser.com database, which you can use to generate your user-agent lists.

Important: The User-Agent should match the other standard headers you set for that particular browser. For example, you don't want to send a Chrome on Windows user-agent while the rest of the headers are those of Firefox on Windows. This will lead to you getting blocked.
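One way to keep the User-Agent consistent with the other headers is to rotate whole header profiles rather than individual User-Agent strings. The sketch below uses abbreviated versions of the Chrome and Firefox on Windows examples from earlier in this guide; `HEADER_PROFILES` and `pick_headers` are names invented for this example, and a real pool would hold far more than two profiles.

```python
import random

# Each profile is a complete, self-consistent header set for one browser.
HEADER_PROFILES = [
    {  # Chrome on Windows
        "sec-ch-ua": '" Not A;Brand";v="99", "Chromium";v="99", "Google Chrome";v="99"',
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-platform": '"Windows"',
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
        ),
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-US,en;q=0.9",
    },
    {  # Firefox on Windows
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) "
            "Gecko/20100101 Firefox/98.0"
        ),
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate",
        "Upgrade-Insecure-Requests": "1",
    },
]


def pick_headers(rng=random):
    """Return a copy of one randomly chosen header profile."""
    # Copy so per-request changes never leak back into the shared pool.
    return dict(rng.choice(HEADER_PROFILES))
```

Calling `pick_headers()` before each request gives you a matching User-Agent and header set every time, so a Chrome User-Agent is never paired with Firefox headers.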


Keeping Headers Up To Date

As browsers are constantly being updated, the default headers they send change regularly too, especially the user-agents.

Real users typically upgrade their browser automatically when a new browser version comes out, so it is very common for a large percentage of web users to be using the latest version of a browser very quickly after a new stable version has been released.

Therefore, to avoid your scrapers sticking out after a browser has been updated, you should regularly double check and update the headers your scrapers are using to make sure they are using the most popular headers.

Otherwise, your scrapers might get blocked at an increasing rate.
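A simple way to spot stale entries is to compare the browser version in each User-Agent against the current stable release. The sketch below is only an outline: it handles Chrome and Firefox User-Agents, the function names are hypothetical, and `latest_major` is a value you would need to look up and supply yourself.

```python
import re


def browser_major_version(user_agent):
    """Extract the Chrome or Firefox major version from a User-Agent, or None."""
    match = re.search(r"(?:Chrome|Firefox)/(\d+)", user_agent)
    return int(match.group(1)) if match else None


def is_outdated(user_agent, latest_major, max_lag=2):
    """Return True if the User-Agent is more than max_lag major versions behind."""
    major = browser_major_version(user_agent)
    if major is None:
        return True  # unrecognised User-Agents are treated as stale
    return latest_major - major > max_lag
```

Running `is_outdated` over your user-agent list on a schedule lets you drop or refresh old entries before they start to stand out.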


Remove Bad HTTP Headers

Unknown to many developers, the proxy servers you are using may be adding extra headers to your requests that can make your scrapers easily detectable.

When your request is forwarded from the proxy server to the target website, the proxy can sometimes add extra headers to the request without you knowing it.

Headers such as the following are commonly added by the intermediary server and are a clear sign the request was made through a proxy.


'Forwarded',
'Proxy-Authorization',
'X-Forwarded-For',
'Proxy-Authenticate',
'X-Requested-With',
'From',
'X-Real-Ip',
'Via',
'True-Client-Ip',
'Proxy_Connection'

Before you purchase a proxy package with a proxy provider, you should double check that their proxy server isn't adding these headers to your requests.

You can do so by sending a couple of requests through the proxy provider to a site like http://httpbin.org/headers and inspecting the response you get back.


import requests

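# Replace PROXY with the proxy URL supplied by your provider
proxies = {
    "http": "PROXY"
}

# httpbin echoes back the headers it received, so headers added by the proxy show up in the output
r = requests.get('http://httpbin.org/headers', proxies=proxies, verify=False)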
print(r.text)

If you see that the proxy server is adding suspicious headers to the request then either use a different proxy provider, or contact their support team and have them update their servers to drop those headers before they are sent to the target website.
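Rather than reading the response by eye, you could scan it for the headers listed above. The sketch below assumes the echo service returns JSON with a top-level `headers` object, as in the httpbin output shown earlier in this guide. The helper name `find_proxy_headers` is made up for this example.

```python
import json

# The headers listed above, lower-cased for case-insensitive matching.
PROXY_HEADERS = {
    "forwarded", "proxy-authorization", "x-forwarded-for", "proxy-authenticate",
    "x-requested-with", "from", "x-real-ip", "via", "true-client-ip",
    "proxy_connection",
}


def find_proxy_headers(response_text):
    """Return the sorted names of proxy-revealing headers found in the echoed request."""
    echoed = json.loads(response_text)["headers"]
    return sorted(name for name in echoed if name.lower() in PROXY_HEADERS)
```

With the request above, `find_proxy_headers(r.text)` returns a list of any suspicious headers that were echoed back. Bear in mind that an empty list only covers the headers the echo service chooses to report, so it is worth testing against more than one endpoint.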

This shouldn't happen with most of the big proxy providers, but we've seen it happen to those who are using a smaller proxy provider or who have built their own proxy network.


Managed Headers

Managing headers can be a bit of a pain, as you need to be optimizing for every website you are scraping.

You can do it yourself. However, a growing number of "smart" proxy providers do this optimization for you.

When you send a request to one of these providers, they take the URL you want to scrape and find the optimal header combination for each request to maximize its success rate.

This means you don't need to worry about anything we discussed in this guide, as they do it for you.

If you would like to find the best proxy provider for your use case then be sure to check out our free proxy comparison tool.

Or if you would like to let someone else find the best proxy provider for your use case then check out the ScrapeOps Proxy Aggregator which automatically finds the best proxy provider for your particular domain so you don't have to.


More Web Scraping Guides

In this guide, we went through why headers are important when web scraping and how you should manage them to ensure your scrapers don't get blocked.

If you would like to learn more about how else your web scrapers can get detected and blocked then check out our How to Scrape The Web Without Getting Blocked Guide.

Or if you would like to learn more about web scraping in general, then be sure to check out The Web Scraping Playbook or one of our more in-depth guides.