Web Scraping Part 5 - Using Fake User-Agents and Browser Headers
Modern websites don’t just check what you request - they scrutinize who’s asking. After scaling your scraper with retries and concurrency in Part 4, the next roadblock you’ll hit is aggressive bot‑detection based on HTTP fingerprints. User‑agents and browser headers are the two biggest tells: if they look synthetic or repetitive, your requests go straight to the ban list.
In Part 5 you’ll learn how to cloak your scraper so it blends in with ordinary traffic. We’ll cover why sites flag default headers, how to rotate realistic user‑agents, and how to send complete browser‑header sets that pass heuristic checks. By the end, you’ll know exactly how to make every request look like it came from a real person - laying the groundwork for the proxy strategies we tackle in Part 6.
- Python Requests + BeautifulSoup
- Node.js Axios + Cheerio
- Node.js Puppeteer
- Node.js Playwright
Python Requests/BS4 Beginners Series Part 5: Using Fake User-Agents and Browser Headers
So far in this Python Requests/BeautifulSoup 6-Part Beginner Series, we have learned how to build a basic web scraper in Part 1, scrape data from a website in Part 2, clean it up and save it to a file or database in Part 3, and make our scraper more robust and scalable by handling failed requests and using concurrency in Part 4.
In Part 5, we’ll explore how to use fake user-agents and browser headers to bypass restrictions on sites trying to prevent scraping.
- Getting Blocked and Banned While Web Scraping
- Using Fake User-Agents When Scraping
- Using Fake Browser Headers When Scraping
- Next Steps
If you prefer to follow along with a video then check out the video tutorial version here:
Need help scraping the web?
Then check out ScrapeOps, the complete toolkit for web scraping.
Python Requests/BeautifulSoup 6-Part Beginner Series
- Part 1: Basic Python Requests/BeautifulSoup Scraper - We'll go over the basics of scraping with Python, and build our first Python scraper. (Part 1)
- Part 2: Cleaning Dirty Data & Dealing With Edge Cases - Web data can be messy, unstructured, and have lots of edge cases. In this tutorial we'll make our scraper robust to these edge cases, using data classes and data cleaning pipelines. (Part 2)
- Part 3: Storing Data in AWS S3, MySQL & Postgres DBs - There are many different ways we can store scraped data, from CSV and JSON files to databases and S3 buckets. We'll explore several options and discuss their pros, cons, and the situations in which you would use them. (Part 3)
- Part 4: Managing Retries & Concurrency - Make our scraper more robust and scalable by handling failed requests and using concurrency. (Part 4)
- Part 5: Faking User-Agents & Browser Headers - Make our scraper production ready by using fake user-agents & browser headers to make our scrapers look more like real users. (Part 5)
- Part 6: Using Proxies To Avoid Getting Blocked - Explore how to use proxies to bypass anti-bot systems by hiding your real IP address and location. (Part 6)
The code for this project is available on GitHub.
Getting Blocked and Banned While Web Scraping
When you start scraping large volumes of data, you'll find that building and running scrapers is easy. The true difficulty lies in reliably retrieving HTML responses from the pages you want. While scraping a couple hundred pages with your local machine is easy, websites will quickly block your requests when you need to scrape thousands or millions.
Large websites like Amazon monitor visitors by tracking IP addresses and user-agents, detecting unusual behavior with sophisticated anti-bot measures. If you identify as a scraper, your request will be blocked.
However, by properly managing user-agents and browser headers during scraping, you can counter these anti-bot techniques. Strictly speaking, these techniques are optional for our beginner project on scraping chocolate.co.uk.
In this guide, we're still going to look at how to use fake user-agents and browser headers so that you can apply these techniques if you ever need to scrape a more difficult website like Amazon.
Using Fake User-Agents When Scraping
One of the most common reasons for getting blocked while web scraping is using bad User-Agent headers. When scraping data from websites, the site often doesn't want you to extract their information, so you need to appear like a legitimate user.
To do this, you must manage the User-Agent headers you send along with your HTTP requests.
What are User-Agents
A User-Agent (UA) is a string sent by the user's web browser to a server. It's sent as an HTTP header and helps websites identify the following information about the user making a request:
- Operating system: The user's operating system (e.g., Windows, macOS, Linux, Android, iOS)
- Browser: The specific browser being used (e.g., Chrome, Firefox, Safari, Edge)
- Browser version: The version of the browser
Here's an example of a user-agent string that might be sent when you visit a website using Chrome:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36
The user-agent string indicates that you are using Chrome version 109.0.0.0 on a 64-bit Windows 10 computer.
- The browser is Chrome
- The version of Chrome is 109.0.0.0
- The operating system is Windows 10
- The device is a 64-bit computer
Note that using an incorrectly formed user-agent can get your data extraction script blocked. Most web browsers tend to follow the format below:
Mozilla/5.0 (<system-information>) <platform> (<platform-details>) <extensions>
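To see how those pieces map onto a real string, here is the Chrome example from above lined up against that template (an informal breakdown; real user-agent strings vary slightly between browsers):
Mozilla/5.0                              -> historical prefix kept for compatibility
(Windows NT 10.0; Win64; x64)            -> <system-information>
AppleWebKit/537.36 (KHTML, like Gecko)   -> <platform> (<platform-details>)
Chrome/109.0.0.0 Safari/537.36           -> <extensions>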
Check out our Python Requests: Setting Fake User-Agents guide for more information about using fake user-agents in Python Requests.
Why Use Fake User-Agents in Web Scraping
To avoid detection during web scraping, you can use a fake user-agent to mimic a real user's browser. This replaces your default user-agent, making your scraper appear legitimate and reducing the risk of being blocked as a bot.
You should also vary the user-agent from request to request. Websites can detect large volumes of requests coming from the same user-agent and flag them as a potential bot.
With most Python HTTP clients, like Requests, the default user-agent string reveals that your request is coming from Python:
'User-Agent': 'python-requests/2.31.0'
This user-agent identifies your requests as being made by the Python Requests library, making it easy for the website to block your scraper. To bypass this, set a different user-agent before sending each request; managing your user-agents is crucial when scraping with Python Requests.
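If you want to check what your own installation sends, Requests exposes its default headers directly. This is just a quick sanity check; the exact version number will depend on the release you have installed:
import requests

# Inspect the default headers Requests attaches to every request
print(requests.utils.default_headers())
# e.g. {'User-Agent': 'python-requests/2.31.0', 'Accept-Encoding': 'gzip, deflate',
#       'Accept': '*/*', 'Connection': 'keep-alive'}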
How to Set a Fake User-Agent in Python Requests
Using Python Requests, setting a fake user-agent is straightforward. Define your desired user-agent string in a dictionary, then pass this dictionary to the headers parameter of your request.
import requests
head = {
"User-Agent": "Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148"}
r = requests.get('http://httpbin.org/headers', headers=head)
print(r.json())
Here’s the result:
You can see that the user-agent we provided is echoed back in the response, confirming it was sent with our request.
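Because httpbin simply echoes back the headers it receives, the printed JSON is roughly of this shape (a representative example rather than the exact output; httpbin may add a header or two of its own, such as a trace ID):
{'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'}}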
How to Rotate User-Agents
You can easily rotate user-agents by including a list of user-agents in your scraper and randomly selecting one for each request.
import requests
import random
user_agent_list = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36',
'Mozilla/5.0 (iPhone; CPU iPhone OS 14_4_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Mobile/15E148 Safari/604.1',
'Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1)',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/87.0.664.75',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18363',
]
head = {
"User-Agent": user_agent_list[random.randint(0, len(user_agent_list)-1)]}
r = requests.get('http://httpbin.org/headers', headers=head)
print(r.json())
Each request now uses a randomly selected user-agent from the list. Here’s the result:
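As a small aside, Python's random.choice performs the same index-picking in a single call, so the selection line can be written more compactly. The behaviour is equivalent; it's purely a stylistic choice:
import random

head = {"User-Agent": random.choice(user_agent_list)}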
How to Create a Custom Fake User-Agent Middleware
Let's see how to create fake user-agent middleware that can effectively manage thousands of fake user-agents. This middleware can then be seamlessly integrated into your final scraper code.
The optimal approach is to use a free user-agent API, such as the ScrapeOps Fake User-Agent API. This API enables you to use a current and comprehensive user-agent list when your scraper starts up and then pick a random user-agent for each request.
To use the ScrapeOps Fake User-Agents API, you simply need to send a request to the API endpoint to fetch a list of user-agents.
http://headers.scrapeops.io/v1/user-agents?api_key=YOUR_API_KEY
To use the ScrapeOps Fake User-Agent API, you first need an API key which you can get by signing up for a free account here.
Here’s a response from the API that shows an up-to-date list of user-agents that you can use for each request.
{
"result": [
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Safari/605.1.15",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Windows; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/603.3.8 (KHTML, like Gecko) Version/10.1.2 Safari/603.3.8",
"Mozilla/5.0 (Windows NT 10.0; Windows; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15",
"Mozilla/5.0 (Windows NT 10.0; Windows; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36"
]
}
To integrate the Fake User-Agent API, you should configure your scraper to retrieve a batch of the most up-to-date user-agents from the ScrapeOps Fake User-Agent API when the scraper starts. Then, configure your scraper to pick a random user-agent from this list for each request.
Now, what if the retrieved list of user-agents from the ScrapeOps Fake User-Agent API is empty? In such cases, you can use the fallback user-agent list, which we will define in a separate method.
Here’s the step-by-step explanation of building custom user-agent middleware:
Create a UserAgentMiddleware instance with a ScrapeOps API key and the number of user-agents to fetch. Then, create a dictionary with the User-Agent set to a random user-agent by calling the get_random_user_agent method. Finally, send an HTTP GET request to the URL using the headers containing the random user-agent.
# Get Fake User-Agents
user_agent_middleware = UserAgentMiddleware(
scrapeops_api_key='YOUR_API_KEY', num_user_agents=20)
head = {'User-Agent': user_agent_middleware.get_random_user_agent()}
# Make Request
response = requests.get(
'https://www.chocolate.co.uk/collections/all', headers=head)
Let's explore the UserAgentMiddleware class and its important methods. The class constructor (__init__) initializes the instance variables (user_agent_list, scrapeops_api_key) and calls the get_user_agents method to fetch user-agent strings.
class UserAgentMiddleware:
def __init__(self, scrapeops_api_key='', num_user_agents=10):
self.user_agent_list = []
self.scrapeops_api_key = scrapeops_api_key
self.get_user_agents(num_user_agents)
Now, the get_user_agents() method takes only the number of user-agents to fetch. First, it requests the ScrapeOps API to get the specified number of user-agents. If the response status code is 200, it parses the JSON response and extracts the user-agents.
If the list is empty or the status code is not 200, it prints a warning and uses a fallback user-agent list by calling the use_fallback_user_agent_list() method. The fallback user-agent list is a predefined list of user-agents to be used in case of API failure.
# This method is a part of the UserAgentMiddleware class.
def get_user_agents(self, num_user_agents):
response = requests.get('http://headers.scrapeops.io/v1/user-agents?api_key=' +
self.scrapeops_api_key + '&num_results=' + str(num_user_agents))
if response.status_code == 200:
json_response = response.json()
self.user_agent_list = json_response.get('result', [])
if len(self.user_agent_list) == 0:
print('WARNING: ScrapeOps user-agent list is empty.')
self.user_agent_list = self.use_fallback_user_agent_list()
else:
print(
f'WARNING: ScrapeOps Status Code is {response.status_code}, error message is:', response.text)
self.user_agent_list = self.use_fallback_user_agent_list()
The use_fallback_user_agent_list() method returns a list of user-agents in case of an API failure.
# This method is a part of the UserAgentMiddleware class.
def use_fallback_user_agent_list(self):
print('WARNING: Using fallback user-agent list.')
return [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36',
'Mozilla/5.0 (iPhone; CPU iPhone OS 14_4_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Mobile/15E148 Safari/604.1',
'Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1)',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/87.0.664.75',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18363',
]
Finally, we have defined a method that returns a random user-agent from the finalized list of user-agents. This list is prepared either by fetching user-agents from the ScrapeOps API or by calling the use_fallback_user_agent_list() method.
# This method is a part of the UserAgentMiddleware class.
def get_random_user_agent(self):
random_index = randint(0, len(self.user_agent_list) - 1)
return self.user_agent_list[random_index]
Here's the complete code for the User-Agent Middleware.
import requests
from random import randint
class UserAgentMiddleware:
def __init__(self, scrapeops_api_key='', num_user_agents=10):
self.user_agent_list = []
self.scrapeops_api_key = scrapeops_api_key
self.get_user_agents(num_user_agents)
def get_user_agents(self, num_user_agents):
response = requests.get('http://headers.scrapeops.io/v1/user-agents?api_key=' +
self.scrapeops_api_key + '&num_results=' + str(num_user_agents))
if response.status_code == 200:
json_response = response.json()
self.user_agent_list = json_response.get('result', [])
if len(self.user_agent_list) == 0:
print('WARNING: ScrapeOps user-agent list is empty.')
self.user_agent_list = self.use_fallback_user_agent_list()
else:
print(
f'WARNING: ScrapeOps Status Code is {response.status_code}, error message is:', response.text)
self.user_agent_list = self.use_fallback_user_agent_list()
def use_fallback_user_agent_list(self):
print('WARNING: Using fallback user-agent list.')
return [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36',
'Mozilla/5.0 (iPhone; CPU iPhone OS 14_4_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Mobile/15E148 Safari/604.1',
'Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1)',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/87.0.664.75',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18363',
]
def get_random_user_agent(self):
random_index = randint(0, len(self.user_agent_list) - 1)
return self.user_agent_list[random_index]
# Get Fake User-Agents
user_agent_middleware = UserAgentMiddleware(
scrapeops_api_key='YOUR_API_KEY', num_user_agents=20)
head = {'User-Agent': user_agent_middleware.get_random_user_agent()}
# Make Request
response = requests.get(
'https://www.chocolate.co.uk/collections/all', headers=head)
print(user_agent_middleware.get_random_user_agent())
print(user_agent_middleware.get_random_user_agent())
print(user_agent_middleware.get_random_user_agent())
Here’s the result:
When you call the get_random_user_agent method, it returns different user-agents without any additional messages. This indicates that these user-agents were successfully fetched directly from the ScrapeOps Fake User-Agent API.
Integrating User-Agent Middleware in a Scraper
Integrating the user-agent middleware into our scraper is easy. You just need to make a few minor changes to how requests are made inside the retry logic.
Here's the code for the retry logic, which was already discussed in Part 4 of this series.
class RetryLogic:
def __init__(self, retry_limit=5, anti_bot_check=False, use_fake_user_agents=False):
self.retry_limit = retry_limit
self.anti_bot_check = anti_bot_check
self.use_fake_user_agents = use_fake_user_agents
def make_request(self, url, method='GET', **kwargs):
kwargs.setdefault('allow_redirects', True)
## Use Fake User-Agents
head = kwargs.get('headers', {})
if self.use_fake_user_agents:
head['User-Agent'] = user_agent_middleware.get_random_user_agent()
## Retry Logic
for _ in range(self.retry_limit):
try:
response = requests.request(method, url, headers=head, **kwargs)
if response.status_code in [200, 404]:
if self.anti_bot_check and response.status_code == 200:
                        if not self.passed_anti_bot_check(response):
return False, response
return True, response
except Exception as e:
print('Error', e)
return False, None
def passed_anti_bot_check(self, response):
## Example Anti-Bot Check
if '<title>Robot or human?</title>' in response.text:
return False
## Passed All Tests
return True
Now, let's highlight the additions to this class:
- use_fake_user_agents parameter: We've added an extra parameter, use_fake_user_agents. When set to True, it modifies the request headers to include a randomly chosen user-agent.
- Modified make_request method: In the make_request method, for every request, a new user-agent is set by calling the get_random_user_agent() method.
Here's the complete code after integrating user-agent middleware into our scraper.
import os
import time
import csv
import requests
from bs4 import BeautifulSoup
from dataclasses import dataclass, field, fields, InitVar, asdict
import concurrent.futures
from random import randint
@dataclass
class Product:
name: str = ''
price_string: InitVar[str] = ''
price_gb: float = field(init=False)
price_usd: float = field(init=False)
url: str = ''
def __post_init__(self, price_string):
self.name = self.clean_name()
self.price_gb = self.clean_price(price_string)
self.price_usd = self.convert_price_to_usd()
self.url = self.create_absolute_url()
def clean_name(self):
if self.name == '':
return 'missing'
return self.name.strip()
def clean_price(self, price_string):
price_string = price_string.strip()
price_string = price_string.replace('Sale price£', '')
price_string = price_string.replace('Sale priceFrom £', '')
if price_string == '':
return 0.0
return float(price_string)
def convert_price_to_usd(self):
return self.price_gb * 1.21
def create_absolute_url(self):
if self.url == '':
return 'missing'
return 'https://www.chocolate.co.uk' + self.url
class ProductDataPipeline:
def __init__(self, csv_filename='', storage_queue_limit=5):
self.names_seen = []
self.storage_queue = []
self.storage_queue_limit = storage_queue_limit
self.csv_filename = csv_filename
self.csv_file_open = False
def save_to_csv(self):
self.csv_file_open = True
products_to_save = []
products_to_save.extend(self.storage_queue)
self.storage_queue.clear()
if not products_to_save:
return
keys = [field.name for field in fields(products_to_save[0])]
file_exists = os.path.isfile(
self.csv_filename) and os.path.getsize(self.csv_filename) > 0
with open(self.csv_filename, mode='a', newline='', encoding='utf-8') as output_file:
writer = csv.DictWriter(output_file, fieldnames=keys)
if not file_exists:
writer.writeheader()
for product in products_to_save:
writer.writerow(asdict(product))
self.csv_file_open = False
def clean_raw_product(self, scraped_data):
return Product(
name=scraped_data.get('name', ''),
price_string=scraped_data.get('price', ''),
url=scraped_data.get('url', '')
)
def is_duplicate(self, product_data):
if product_data.name in self.names_seen:
print(f"Duplicate item found: {product_data.name}. Item dropped.")
return True
self.names_seen.append(product_data.name)
return False
def add_product(self, scraped_data):
product = self.clean_raw_product(scraped_data)
if self.is_duplicate(product) == False:
self.storage_queue.append(product)
if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
self.save_to_csv()
def close_pipeline(self):
if self.csv_file_open:
time.sleep(3)
if len(self.storage_queue) > 0:
self.save_to_csv()
class RetryLogic:
def __init__(self, retry_limit=5, anti_bot_check=False, use_fake_user_agents=False):
self.retry_limit = retry_limit
self.anti_bot_check = anti_bot_check
self.use_fake_user_agents = use_fake_user_agents
def make_request(self, url, method='GET', **kwargs):
kwargs.setdefault('allow_redirects', True)
# Use Fake User-Agents
head = kwargs.get('headers', {})
if self.use_fake_user_agents:
head['User-Agent'] = user_agent_middleware.get_random_user_agent()
# Retry Logic
for _ in range(self.retry_limit):
try:
response = requests.request(
method, url, headers=head, **kwargs)
print(response)
if response.status_code in [200, 404]:
if self.anti_bot_check and response.status_code == 200:
                        if not self.passed_anti_bot_check(response):
return False, response
return True, response
except Exception as e:
print('Error', e)
return False, None
def passed_anti_bot_check(self, response):
# Example Anti-Bot Check
if '<title>Robot or human?</title>' in response.text:
return False
# Passed All Tests
return True
class UserAgentMiddleware:
def __init__(self, scrapeops_api_key='', num_user_agents=10):
self.user_agent_list = []
self.scrapeops_api_key = scrapeops_api_key
self.get_user_agents(num_user_agents)
def get_user_agents(self, num_user_agents):
response = requests.get('http://headers.scrapeops.io/v1/user-agents?api_key=' +
self.scrapeops_api_key + '&num_results=' + str(num_user_agents))
if response.status_code == 200:
json_response = response.json()
self.user_agent_list = json_response.get('result', [])
if len(self.user_agent_list) == 0:
print('WARNING: ScrapeOps user-agent list is empty.')
self.user_agent_list = self.use_fallback_user_agent_list()
else:
print(
f'WARNING: ScrapeOps Status Code is {response.status_code}, error message is:', response.text)
self.user_agent_list = self.use_fallback_user_agent_list()
def use_fallback_user_agent_list(self):
print('WARNING: Using fallback user-agent list.')
return [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36',
'Mozilla/5.0 (iPhone; CPU iPhone OS 14_4_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Mobile/15E148 Safari/604.1',
'Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1)',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/87.0.664.75',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18363',
]
def get_random_user_agent(self):
random_index = randint(0, len(self.user_agent_list) - 1)
return self.user_agent_list[random_index]
def scrape_page(url):
list_of_urls.remove(url)
valid, response = retry_request.make_request(url)
if valid and response.status_code == 200:
# Parse Data
soup = BeautifulSoup(response.content, 'html.parser')
products = soup.select('product-item')
for product in products:
name = product.select('a.product-item-meta__title')[0].get_text()
price = product.select('span.price')[
0].get_text().replace('\nSale price£', '')
url = product.select('div.product-item-meta a')[0]['href']
# Add To Data Pipeline
data_pipeline.add_product({
'name': name,
'price': price,
'url': url
})
# Next Page
next_page = soup.select('a[rel="next"]')
if len(next_page) > 0:
list_of_urls.append(
'https://www.chocolate.co.uk' + next_page[0]['href'])
# Scraping Function
def start_concurrent_scrape(num_threads=5):
while len(list_of_urls) > 0:
with concurrent.futures.ThreadPoolExecutor(max_workers=num_threads) as executor:
executor.map(scrape_page, list_of_urls)
list_of_urls = [
'https://www.chocolate.co.uk/collections/all',
]
if __name__ == "__main__":
data_pipeline = ProductDataPipeline(csv_filename='product_data.csv')
user_agent_middleware = UserAgentMiddleware(
scrapeops_api_key='YOUR_API_KEY', num_user_agents=20)
retry_request = RetryLogic(
retry_limit=3, anti_bot_check=False, use_fake_user_agents=True)
start_concurrent_scrape(num_threads=10)
data_pipeline.close_pipeline()
Using Fake Browser Headers When Scraping
For simple websites, setting an up-to-date user-agent is usually enough to scrape data reliably. However, many popular websites are increasingly using sophisticated anti-bot technologies to prevent data scraping. These solutions analyze not only your request's user-agent but also the other headers a real browser normally sends.
Why Choose Fake Browser Headers Instead of User-Agents
Using a full set of browser headers, not just a fake user-agent, makes your requests appear more like those of real users, making them harder to detect.
Here is an example of the headers a Chrome browser sends on a macOS machine:
sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="99", "Google Chrome";v="99"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "macOS"
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Sec-Fetch-Dest: document
Accept-Encoding: gzip, deflate, br
Accept-Language: en-GB,en-US;q=0.9,en;q=0.8
As we can see, real browsers send not only User-Agent strings but also several other headers to identify and customize their requests. So, to improve the reliability of our scrapers, we should also include these headers when scraping.
How to Set Fake Browser Headers in Python Requests
Before setting fake browser headers, let's see what headers Python Requests sends by default:
import requests
response = requests.get('https://httpbin.org/headers')
print(response.json()['headers'])
Here’s the result:
Compared to the full browser header set above, the default request sends only a handful of headers alongside the user-agent. When scraping thousands of pages from a website with anti-bot measures, this sparse fingerprint can lead to detection and blocking. To avoid such obstacles, we'll use fake browser headers.
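For reference, the default fingerprint of a plain requests.get call typically contains only something like the following (a representative shape rather than the exact output; httpbin may also append a trace-ID header of its own):
{'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.31.0'}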
Setting fake browser headers is similar to setting user-agents. Define your desired browser headers as key-value pairs in a dictionary, and then pass this dictionary to the headers parameter of your request.
import requests
import random
headers_list = [{
'authority': 'httpbin.org',
'cache-control': 'max-age=0',
'sec-ch-ua': '"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"',
'sec-ch-ua-mobile': '?0',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'sec-fetch-site': 'none',
'sec-fetch-mode': 'navigate',
'sec-fetch-user': '?1',
'sec-fetch-dest': 'document',
'accept-language': 'en-US,en;q=0.9',
}
]
head = random.choice(headers_list)
response = requests.get('https://httpbin.org/headers', headers=head)
print(response.json()['headers'])
We’re sending a request to httpbin.org with fake browser headers and retrieving the response. This is what we expect to see...
How to Create a Custom Fake Browser Agent Middleware
Creating custom fake browser agent middleware is very similar to creating custom fake user-agent middleware. You have two options: either build a list of fake browser headers manually or use the ScrapeOps Fake Browser Headers API to fetch an up-to-date list each time your scraper starts.
The ScrapeOps Fake Browser Headers API is a free API that returns a list of optimized fake browser headers, helping you evade blocks/bans and enhance the reliability of your web scrapers.
API Endpoint:
http://headers.scrapeops.io/v1/browser-headers?api_key=YOUR_API_KEY
Response:
{
"result": [
{
"upgrade-insecure-requests": "1",
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Windows; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"sec-ch-ua": "\".Not/A)Brand\";v=\"99\", \"Google Chrome\";v=\"103\", \"Chromium\";v=\"103\"",
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": "\"Windows\"",
"sec-fetch-site": "none",
"sec-fetch-mod": "",
"sec-fetch-user": "?1",
"accept-encoding": "gzip, deflate, br",
"accept-language": "bg-BG,bg;q=0.9,en-US;q=0.8,en;q=0.7"
},
{
"upgrade-insecure-requests": "1",
"user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"sec-ch-ua": "\".Not/A)Brand\";v=\"99\", \"Google Chrome\";v=\"103\", \"Chromium\";v=\"103\"",
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": "\"Linux\"",
"sec-fetch-site": "none",
"sec-fetch-mod": "",
"sec-fetch-user": "?1",
"accept-encoding": "gzip, deflate, br",
"accept-language": "fr-CH,fr;q=0.9,en-US;q=0.8,en;q=0.7"
}
]
}
To use the ScrapeOps Fake Browser Headers API, you first need an API key which you can get by signing up for a free account here.
To integrate the Fake Browser Headers API, you should configure your scraper to retrieve a batch of the most up-to-date headers upon startup. Then, configure it to pick a random header from this list for each request. Now, what if the retrieved list of headers from the ScrapeOps Fake Browser Headers API is empty? In such cases, you can use the fallback headers list.
The code closely resembles the user-agent middleware, with a few minor changes. The get_browser_headers() method sends a request to the ScrapeOps Fake Browser Headers API to retrieve browser headers. If the request is successful, it extracts and stores the headers. If the request fails or an error occurs, it prints a warning message and uses a fallback headers list.
import requests
from random import randint
class BrowserHeadersMiddleware:
def __init__(self, scrapeops_api_key='', num_headers=10):
self.browser_headers_list = []
self.scrapeops_api_key = scrapeops_api_key
self.get_browser_headers(num_headers)
def get_browser_headers(self, num_headers):
response = requests.get('http://headers.scrapeops.io/v1/browser-headers?api_key=' +
self.scrapeops_api_key + '&num_results=' + str(num_headers))
if response.status_code == 200:
json_response = response.json()
self.browser_headers_list = json_response.get('result', [])
if len(self.browser_headers_list) == 0:
print('WARNING: ScrapeOps headers list is empty.')
self.browser_headers_list = self.use_fallback_headers_list()
else:
print(
f'WARNING: ScrapeOps Status Code is {response.status_code}, error message is:', response.text)
self.browser_headers_list = self.use_fallback_headers_list()
def use_fallback_headers_list(self):
print('WARNING: Using fallback headers list.')
return [
{
"upgrade-insecure-requests": "1",
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Windows; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"sec-ch-ua": "\".Not/A)Brand\";v=\"99\", \"Google Chrome\";v=\"103\", \"Chromium\";v=\"103\"",
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": "\"Windows\"",
"sec-fetch-site": "none",
"sec-fetch-mod": "",
"sec-fetch-user": "?1",
"accept-encoding": "gzip, deflate, br",
"accept-language": "bg-BG,bg;q=0.9,en-US;q=0.8,en;q=0.7"
},
{
"upgrade-insecure-requests": "1",
"user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"sec-ch-ua": "\".Not/A)Brand\";v=\"99\", \"Google Chrome\";v=\"103\", \"Chromium\";v=\"103\"",
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": "\"Linux\"",
"sec-fetch-site": "none",
"sec-fetch-mod": "",
"sec-fetch-user": "?1",
"accept-encoding": "gzip, deflate, br",
"accept-language": "fr-CH,fr;q=0.9,en-US;q=0.8,en;q=0.7"
}
]
def get_random_browser_headers(self):
random_index = randint(0, len(self.browser_headers_list) - 1)
return self.browser_headers_list[random_index]
browser_headers_middleware = BrowserHeadersMiddleware(
scrapeops_api_key='YOUR_API_KEY', num_headers=20)
headers = browser_headers_middleware.get_random_browser_headers()
# Make Request
response = requests.get(
'https://www.chocolate.co.uk/collections/all', headers=headers)
print(browser_headers_middleware.get_random_browser_headers())
print(browser_headers_middleware.get_random_browser_headers())
print(browser_headers_middleware.get_random_browser_headers())
Here’s the result:
When you call the get_random_browser_headers method, it returns different headers without any additional messages. This indicates that the headers were successfully fetched directly from the ScrapeOps Fake Browser Headers API.
Integrating Fake Browser Headers Middleware
Integrating the fake browser-headers middleware into our scraper is easy. You just need to make a few minor changes to how requests are made inside the retry logic.
Here's the code for the retry logic, which was already discussed in Part 4 of this series.
class RetryLogic:
def __init__(self, retry_limit=5, anti_bot_check=False, use_fake_browser_headers=False):
self.retry_limit = retry_limit
self.anti_bot_check = anti_bot_check
self.use_fake_browser_headers = use_fake_browser_headers
def make_request(self, url, method='GET', **kwargs):
kwargs.setdefault('allow_redirects', True)
# Use Fake Browser Headers
headers = kwargs.get('headers', {})
if self.use_fake_browser_headers:
fake_browser_headers = browser_headers_middleware.get_random_browser_headers()
for key, value in fake_browser_headers.items():
headers[key] = value
# Retry Logic
for _ in range(self.retry_limit):
try:
response = requests.request(
method, url, headers=headers, **kwargs)
if response.status_code in [200, 404]:
if self.anti_bot_check and response.status_code == 200:
                        if not self.passed_anti_bot_check(response):
return False, response
return True, response
except Exception as e:
print('Error', e)
return False, None
def passed_anti_bot_check(self, response):
# Example Anti-Bot Check
if '<title>Robot or human?</title>' in response.text:
return False
# Passed All Tests
return True
Now, let's highlight the additions to this class:
- use_fake_browser_headers parameter: When set to True, a randomly chosen set of browser headers is merged into the request headers.
- Modified make_request method: For each request within this method, a random set of browser headers is fetched using the get_random_browser_headers method. The fake browser headers contain multiple key-value pairs, so we iterate through each pair and copy it into the headers dictionary.
Here's the complete code after integrating browser headers middleware into our scraper.
import os
import time
import csv
import requests
from bs4 import BeautifulSoup
from dataclasses import dataclass, field, fields, InitVar, asdict
import concurrent.futures
from random import randint
@dataclass
class Product:
name: str = ''
price_string: InitVar[str] = ''
price_gb: float = field(init=False)
price_usd: float = field(init=False)
url: str = ''
def __post_init__(self, price_string):
self.name = self.clean_name()
self.price_gb = self.clean_price(price_string)
self.price_usd = self.convert_price_to_usd()
self.url = self.create_absolute_url()
def clean_name(self):
if self.name == '':
return 'missing'
return self.name.strip()
def clean_price(self, price_string):
price_string = price_string.strip()
price_string = price_string.replace('Sale price£', '')
price_string = price_string.replace('Sale priceFrom £', '')
if price_string == '':
return 0.0
return float(price_string)
def convert_price_to_usd(self):
return self.price_gb * 1.21
def create_absolute_url(self):
if self.url == '':
return 'missing'
return 'https://www.chocolate.co.uk' + self.url
class ProductDataPipeline:
def __init__(self, csv_filename='', storage_queue_limit=5):
self.names_seen = []
self.storage_queue = []
self.storage_queue_limit = storage_queue_limit
self.csv_filename = csv_filename
self.csv_file_open = False
def save_to_csv(self):
self.csv_file_open = True
products_to_save = []
products_to_save.extend(self.storage_queue)
self.storage_queue.clear()
if not products_to_save:
return
keys = [field.name for field in fields(products_to_save[0])]
file_exists = os.path.isfile(
self.csv_filename) and os.path.getsize(self.csv_filename) > 0
with open(self.csv_filename, mode='a', newline='', encoding='utf-8') as output_file:
writer = csv.DictWriter(output_file, fieldnames=keys)
if not file_exists:
writer.writeheader()
for product in products_to_save:
writer.writerow(asdict(product))
self.csv_file_open = False
def clean_raw_product(self, scraped_data):
return Product(
name=scraped_data.get('name', ''),
price_string=scraped_data.get('price', ''),
url=scraped_data.get('url', '')
)
def is_duplicate(self, product_data):
if product_data.name in self.names_seen:
print(f"Duplicate item found: {product_data.name}. Item dropped.")
return True
self.names_seen.append(product_data.name)
return False
def add_product(self, scraped_data):
product = self.clean_raw_product(scraped_data)
if self.is_duplicate(product) == False:
self.storage_queue.append(product)
if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
self.save_to_csv()
def close_pipeline(self):
if self.csv_file_open:
time.sleep(3)
if len(self.storage_queue) > 0:
self.save_to_csv()
class RetryLogic:
def __init__(self, retry_limit=5, anti_bot_check=False, use_fake_browser_headers=False):
self.retry_limit = retry_limit
self.anti_bot_check = anti_bot_check
self.use_fake_browser_headers = use_fake_browser_headers
def make_request(self, url, method='GET', **kwargs):
kwargs.setdefault('allow_redirects', True)
# Use Fake Browser Headers
headers = kwargs.get('headers', {})
if self.use_fake_browser_headers:
fake_browser_headers = browser_headers_middleware.get_random_browser_headers()
for key, value in fake_browser_headers.items():
headers[key] = value
# Retry Logic
for _ in range(self.retry_limit):
try:
response = requests.request(
method, url, headers=headers, **kwargs)
if response.status_code in [200, 404]:
if self.anti_bot_check and response.status_code == 200:
                        if not self.passed_anti_bot_check(response):
return False, response
return True, response
except Exception as e:
print('Error', e)
return False, None
def passed_anti_bot_check(self, response):
# Example Anti-Bot Check
if '<title>Robot or human?</title>' in response.text:
return False
# Passed All Tests
return True
class BrowserHeadersMiddleware:
def __init__(self, scrapeops_api_key='', num_headers=10):
self.browser_headers_list = []
self.scrapeops_api_key = scrapeops_api_key
self.get_browser_headers(num_headers)
def get_browser_headers(self, num_headers):
response = requests.get('http://headers.scrapeops.io/v1/browser-headers?api_key=' +
self.scrapeops_api_key + '&num_results=' + str(num_headers))
if response.status_code == 200:
json_response = response.json()
self.browser_headers_list = json_response.get('result', [])
if len(self.browser_headers_list) == 0:
print('WARNING: ScrapeOps headers list is empty.')
self.browser_headers_list = self.use_fallback_headers_list()
else:
print(
f'WARNING: ScrapeOps Status Code is {response.status_code}, error message is:', response.text)
self.browser_headers_list = self.use_fallback_headers_list()
def use_fallback_headers_list(self):
print('WARNING: Using fallback headers list.')
return [
{
"upgrade-insecure-requests": "1",
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Windows; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"sec-ch-ua": "\".Not/A)Brand\";v=\"99\", \"Google Chrome\";v=\"103\", \"Chromium\";v=\"103\"",
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": "\"Windows\"",
"sec-fetch-site": "none",
"sec-fetch-mod": "",
"sec-fetch-user": "?1",
"accept-encoding": "gzip, deflate, br",
"accept-language": "bg-BG,bg;q=0.9,en-US;q=0.8,en;q=0.7"
},
{
"upgrade-insecure-requests": "1",
"user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"sec-ch-ua": "\".Not/A)Brand\";v=\"99\", \"Google Chrome\";v=\"103\", \"Chromium\";v=\"103\"",
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": "\"Linux\"",
"sec-fetch-site": "none",
"sec-fetch-mod": "",
"sec-fetch-user": "?1",
"accept-encoding": "gzip, deflate, br",
"accept-language": "fr-CH,fr;q=0.9,en-US;q=0.8,en;q=0.7"
}
]
def get_random_browser_headers(self):
random_index = randint(0, len(self.browser_headers_list) - 1)
return self.browser_headers_list[random_index]
def scrape_page(url):
list_of_urls.remove(url)
valid, response = retry_request.make_request(url)
if valid and response.status_code == 200:
# Parse Data
soup = BeautifulSoup(response.content, 'html.parser')
products = soup.select('product-item')
for product in products:
name = product.select('a.product-item-meta__title')[0].get_text()
price = product.select('span.price')[
0].get_text().replace('\nSale price£', '')
url = product.select('div.product-item-meta a')[0]['href']
# Add To Data Pipeline
data_pipeline.add_product({
'name': name,
'price': price,
'url': url
})
# Next Page
next_page = soup.select('a[rel="next"]')
if len(next_page) > 0:
list_of_urls.append(
'https://www.chocolate.co.uk' + next_page[0]['href'])
# Scraping Function
def start_concurrent_scrape(num_threads=5):
while len(list_of_urls) > 0:
with concurrent.futures.ThreadPoolExecutor(max_workers=num_threads) as executor:
executor.map(scrape_page, list_of_urls)
list_of_urls = [
'https://www.chocolate.co.uk/collections/all',
]
if __name__ == "__main__":
data_pipeline = ProductDataPipeline(csv_filename='product_data.csv')
browser_headers_middleware = BrowserHeadersMiddleware(
scrapeops_api_key='YOUR_API_KEY', num_headers=20)
retry_request = RetryLogic(
retry_limit=3, anti_bot_check=False, use_fake_browser_headers=True)
start_concurrent_scrape(num_threads=10)
data_pipeline.close_pipeline()
Node.js Axios/CheerioJS Beginners Series Part 5: Using Fake User-Agents and Browser Headers
So far in this Node.js Axios/CheerioJS 6-Part Beginner Series, we have learned how to build a basic web scraper in Part 1, scrape data from a website in Part 2, clean it up and save it to a file or database in Part 3, and make our scraper more robust and scalable by handling failed requests and using concurrency in Part 4.
In Part 5, we’ll explore how to use fake user-agents and browser headers to bypass restrictions on sites trying to prevent scraping.
- Getting Blocked and Banned While Web Scraping
- Using Fake User-Agents When Scraping
- Using Fake Browser Headers When Scraping
- Next Steps
Node.js Axios/CheerioJS 6-Part Beginner Series
This 6-part Node.js Axios/CheerioJS Beginner Series will walk you through building a web scraping project from scratch, covering everything from creating the scraper to deployment and scheduling.
- Part 1: Basic Node.js Cheerio Scraper - We'll learn the fundamentals of web scraping with Node.js and build your first scraper using Cheerio. (Part 1)
- Part 2: Cleaning Unruly Data & Handling Edge Cases - Web data can be messy and unpredictable. In this part, we'll create a robust scraper using data structures and cleaning techniques to handle these challenges. (Part 2)
- Part 3: Storing Scraped Data - Explore various options for storing your scraped data, including databases like MySQL or Postgres, cloud storage like AWS S3, and file formats like CSV and JSON. We'll discuss their pros, cons, and suitable use cases. (Part 3)
- Part 4: Managing Retries & Concurrency - Enhance your scraper's reliability and scalability by handling failed requests and utilizing concurrency. (Part 4)
- Part 5: Mimicking User Behavior - Learn how to create a production-ready scraper by simulating real users through user-agent and browser header manipulation. (This article)
- Part 6: Avoiding Detection with Proxies - Discover how to use proxies to bypass anti-bot systems by disguising your real IP address and location. (Part 6)
The code for this project is available on GitHub.
Need help scraping the web?
Then check out ScrapeOps, the complete toolkit for web scraping.
Getting Blocked and Banned While Web Scraping
When you start scraping large volumes of data, you'll find that building and running scrapers is easy. The true difficulty lies in reliably retrieving HTML responses from the pages you want. While scraping a couple hundred pages with your local machine is easy, websites will quickly block your requests when you need to scrape thousands or millions.
Large websites like Amazon monitor visitors by tracking IP addresses and user-agents, detecting unusual behavior with sophisticated anti-bot measures. If you identify as a scraper, your request will be blocked.
However, by properly managing user-agents and browser headers during scraping, you can counter these anti-bot techniques. Strictly speaking, these techniques are optional for our beginner project on scraping chocolate.co.uk.
In this guide, we're still going to look at how to use fake user-agents and browser headers so that you can apply these techniques if you ever need to scrape a more difficult website like Amazon.
Using Fake User-Agents When Scraping
One of the most common reasons for getting blocked while web scraping is using bad User-Agent headers. When scraping data from websites, the site often doesn't want you to extract their information, so you need to appear like a legitimate user.
To do this, you must manage the User-Agent headers you send along with your HTTP requests.
What are User-Agents
A User-Agent (UA) is a string sent by the user's web browser to a server. It's sent as an HTTP header and helps websites identify the following information about the user making a request:
- Operating system: The user's operating system (e.g., Windows, macOS, Linux, Android, iOS)
- Browser: The specific browser being used (e.g., Chrome, Firefox, Safari, Edge)
- Browser version: The version of the browser
Here's an example of a user-agent string that might be sent when you visit a website using Chrome:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36
The user-agent string indicates that you are using Chrome version 109.0.0.0 on a 64-bit Windows 10 computer.
- The browser is Chrome
- The version of Chrome is 109.0.0.0
- The operating system is Windows 10
- The device is a 64-bit computer
Note that using an incorrectly formed user-agent can get your data extraction script blocked. Most web browsers tend to follow the format below:
Mozilla/5.0 (<system-information>) <platform> (<platform-details>) <extensions>
Why Use Fake User-Agents in Web Scraping
To avoid detection during web scraping, you can use a fake user-agent to mimic a real user's browser. This replaces your default user-agent, making your scraper appear legitimate and reducing the risk of being blocked as a bot.
You should also vary the user-agent from request to request. Websites can detect large volumes of requests coming from the same user-agent and flag them as a potential bot.
With most Node.js HTTP clients, like Axios, the default user-agent string reveals that your request is coming from Node.js:
'User-Agent': 'axios/1.6.7'
This user-agent identifies your requests as being made by the Node.js Axios library, making it easy for the website to block your scraper. To hide this, set a different user-agent before sending each request; managing your user-agents is crucial when scraping with Axios.
How to Set a Fake User-Agent in Node with Axios
For Axios, setting a fake user-agent is pretty simple. You can define the User-Agent header before the request like this:
const axios = require("axios");
const headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
};
axios.get("http://httpbin.org/headers", { headers }).then((response) => {
console.log(response.data);
});
Here's the result of that code:
You can see that the user-agent we provided is echoed back in the response, confirming it was sent with our request.
How to Rotate User-Agents
You can easily rotate user-agents by including a list of user-agents in your scraper and randomly selecting one for each request.
const axios = require("axios");
const userAgents = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36",
"Mozilla/5.0 (iPhone; CPU iPhone OS 14_4_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Mobile/15E148 Safari/604.1",
"Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1)",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/87.0.664.75",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18363",
];
const headers = {
"User-Agent": userAgents[Math.floor(Math.random() * userAgents.length)],
};
axios.get("http://httpbin.org/headers", { headers }).then((response) => {
console.log(response.data);
});
Each request now uses a randomly selected user-agent from the list. Here are the results of running it twice:
How to Create a Custom Fake User-Agent Middleware
Let's see how to create fake user-agent middleware that can effectively manage thousands of fake user-agents. This middleware can then be seamlessly integrated into your final scraper code.
The optimal approach is to use a free user-agent API, such as the ScrapeOps Fake User-Agent API. This API enables you to use a current and comprehensive user-agent list when your scraper starts up and then pick a random user-agent for each request.
To use the ScrapeOps Fake User-Agents API, you simply need to send a request to the API endpoint to fetch a list of user-agents.
http://headers.scrapeops.io/v1/user-agents?api_key=YOUR_API_KEY
To use the ScrapeOps Fake User-Agent API, you first need an API key which you can get by signing up for a free account here
Here’s a response from the API that shows an up-to-date list of user-agents that you can use for each request.
{
"result": [
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Safari/605.1.15",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Windows; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/603.3.8 (KHTML, like Gecko) Version/10.1.2 Safari/603.3.8",
"Mozilla/5.0 (Windows NT 10.0; Windows; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15",
"Mozilla/5.0 (Windows NT 10.0; Windows; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36"
]
}
To integrate the Fake User-Agent API, you should configure your scraper to retrieve a batch of the most up-to-date user-agents from the ScrapeOps Fake User-Agent API when the scraper starts. Then, configure your scraper to pick a random user-agent from this list for each request.
Now, what if the retrieved list of user-agents from the ScrapeOps Fake User-Agent API is empty? In such cases, you can use the fallback user-agent list, which we will define in a separate method.
Here’s the step-by-step explanation of building custom user-agent middleware:
Create your getHeaders method, which contains your ScrapeOps API key and fallback headers. It also takes an argument, numHeaders, to specify the number of results we want from the API. If the returned list is empty or the request fails, it logs a warning and uses the fallback headers we defined. Otherwise, we parse and return the API's response.
async function getHeaders(numHeaders) {
const fallbackHeaders = [
"Mozilla/5.0 (Windows NT 10.0; Windows; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36",
];
const scrapeOpsKey = "<YOUR_SCRAPE_OPS_KEY>";
try {
const response = await axios.get(
`http://headers.scrapeops.io/v1/user-agents?api_key=${scrapeOpsKey}&num_results=${numHeaders}`
);
if (response.data.result.length > 0) {
return response.data.result;
} else {
console.error("No headers from ScrapeOps, using fallback headers");
return fallbackHeaders;
}
} catch (error) {
console.error(
"Failed to fetch headers from ScrapeOps, using fallback headers"
);
return fallbackHeaders;
}
}
Now, with that method added, we can call it during our scraper startup to fetch the user-agents, then pass them to subsequent calls to the scrape method, where they can be used during the request. This is where we add the randomization: using Math.random, we select a random index from the user-agent list and pass that user-agent to scrape to be used as the header options:
if (isMainThread) {
// ...
} else {
// Perform work
const { startUrl } = workerData;
let headers = [];
const handleWork = async (workUrl) => {
if (headers.length == 0) {
headers = await getHeaders(2);
}
const { nextUrl, products } = await scrape(workUrl, {
"User-Agent": headers[Math.floor(Math.random() * headers.length)],
});
for (const product of products) {
parentPort.postMessage(product);
}
if (nextUrl) {
console.log("Worker working on", nextUrl);
await handleWork(nextUrl);
}
};
handleWork(startUrl).then(() => console.log("Worker finished"));
}
Integrating User-Agent Middleware in a Scraper
Integrating the user-agent middleware into our scraper is easy. You just need to make a few minor changes to how requests are made inside the retry logic.
In Part 4 of the series, we made this makeRequest function that is responsible for the retry logic. We now need to add an extra argument to allow us to pass the random header to this function.
It should look something like this:
async function makeRequest(
url,
retries = 3,
antiBotCheck = false,
headers = {}
) {
for (let i = 0; i < retries; i++) {
try {
const response = await axios.get(url, {
headers: headers,
});
if ([200, 404].includes(response.status)) {
if (antiBotCheck && response.status == 200) {
if (response.data.includes("<title>Robot or human?</title>")) {
return null;
}
}
return response;
}
} catch (e) {
console.log(`Failed to fetch ${url}, retrying...`);
}
}
return null;
}
You can see our function doesn't change much. We add the headers argument (which defaults to an empty object) and then we pass that argument onto the axios.get(...) call as the headers option.
Using Fake Browser Headers When Scraping
For simple websites, setting an up-to-date user-agent is often enough to scrape data reliably. However, many popular websites are increasingly using sophisticated anti-bot technologies to prevent data scraping.
These solutions analyze not only your request's user-agent but also the other headers a real browser normally sends.
Why Choose Fake Browser Headers Instead of User-Agents
Using a full set of browser headers, not just a fake user-agent, makes your requests appear more like those of real users, making them harder to detect.
Here is an example header when using a Chrome browser on a MacOS machine:
sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="99", "Google Chrome";v="99"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "macOS"
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Sec-Fetch-Dest: document
Accept-Encoding: gzip, deflate, br
Accept-Language: en-GB,en-US;q=0.9,en;q=0.8
As we can see, real browsers send not only User-Agent strings but also several other headers to identify and customize their requests. So, to improve the reliability of our scrapers, we should also include these headers when scraping.
How to Set Fake Browser Headers in Node.js Axios
Before setting any fake headers, let's take a look at what is sent by default with Axios:
const axios = require("axios");

axios.get("http://httpbin.org/headers").then((response) => {
console.log(response.data);
});
Returns:
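The exact values depend on your axios and Node.js versions, but the response will look something like this - note how few headers are sent and how the User-Agent openly identifies axios:
{
  "headers": {
    "Accept": "application/json, text/plain, */*",
    "Accept-Encoding": "gzip, compress, deflate, br",
    "Host": "httpbin.org",
    "User-Agent": "axios/1.x.x",
    "X-Amzn-Trace-Id": "Root=1-..."
  }
}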
As you can tell, we're missing a lot compared to the real browser output above. This lack of headers can lead to detection and request blocking when scraping a large number of pages. To avoid that, we will show you how to use fake browser headers (which include user-agents).
Setting fake browser headers is similar to setting user-agents. Define your desired browser headers as key-value pairs in an object, and then pass this object to the headers option of your request.
const axios = require("axios");
const headers = {
authority: "httpbin.org",
"cache-control": "max-age=0",
"sec-ch-ua":
'"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"',
"sec-ch-ua-mobile": "?0",
"upgrade-insecure-requests": "1",
"user-agent":
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36",
accept:
"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"sec-fetch-site": "none",
"sec-fetch-mode": "navigate",
"sec-fetch-user": "?1",
"sec-fetch-dest": "document",
"accept-language": "en-US,en;q=0.9",
};
axios.get("http://httpbin.org/headers", { headers }).then((response) => {
console.log(response.data);
});
This sends a request to an httpbin.org endpoint. We should expect to see all our fake headers:
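If everything works, httpbin echoes our request headers back (header names are capitalized by httpbin). An abridged example of what you might see:
{
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Language": "en-US,en;q=0.9",
    "Cache-Control": "max-age=0",
    "Host": "httpbin.org",
    "Sec-Ch-Ua": "\"Chromium\";v=\"92\", \" Not A;Brand\";v=\"99\", \"Google Chrome\";v=\"92\"",
    "Sec-Fetch-Mode": "navigate",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36"
  }
}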
How to Create a Custom Fake Browser Headers Middleware
Creating a custom fake browser headers middleware is very similar to creating custom fake user-agent middleware. You have two options: either build a list of fake browser headers manually or use the ScrapeOps Fake Browser Headers API to fetch an up-to-date list each time your scraper starts.
The ScrapeOps Fake Browser Headers API is a free API that returns a list of optimized fake browser headers, helping you evade blocks/bans and enhance the reliability of your web scrapers.
API Endpoint:
http://headers.scrapeops.io/v1/browser-headers?api_key=YOUR_API_KEY
Response:
{
"result": [
{
"upgrade-insecure-requests": "1",
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Windows; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"sec-ch-ua": "\".Not/A)Brand\";v=\"99\", \"Google Chrome\";v=\"103\", \"Chromium\";v=\"103\"",
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": "\"Windows\"",
"sec-fetch-site": "none",
"sec-fetch-mod": "",
"sec-fetch-user": "?1",
"accept-encoding": "gzip, deflate, br",
"accept-language": "bg-BG,bg;q=0.9,en-US;q=0.8,en;q=0.7"
},
{
"upgrade-insecure-requests": "1",
"user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"sec-ch-ua": "\".Not/A)Brand\";v=\"99\", \"Google Chrome\";v=\"103\", \"Chromium\";v=\"103\"",
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": "\"Linux\"",
"sec-fetch-site": "none",
"sec-fetch-mod": "",
"sec-fetch-user": "?1",
"accept-encoding": "gzip, deflate, br",
"accept-language": "fr-CH,fr;q=0.9,en-US;q=0.8,en;q=0.7"
}
]
}
To use the ScrapeOps Fake Browser Headers API, you first need an API key which you can get by signing up for a free account here.
- To integrate the Fake Browser Headers API, you should configure your scraper to retrieve a batch of the most up-to-date headers upon startup.
- Then, configure it to pick a random header from this list for each request.
Now, what if the retrieved list of headers from the ScrapeOps Fake Browser Headers API is empty? In such cases, you can use the fallback headers list.
The code closely resembles the user-agent middleware, with a few minor changes. The getHeaders method now sends a request to the ScrapeOps Fake Browser Headers API rather than the User-Agents API. If the request is successful, it extracts the headers and stores them.
If the request fails or any other error occurs, it displays a warning message and uses a fallback headers list.
Finally, our scrape call can now use the headers directly because we are no longer setting just the User-Agent header.
const { nextUrl, products } = await scrape(
workUrl,
headers[Math.floor(Math.random() * headers.length)]
);
Here's the complete code that utilizes the browser headers API:
const axios = require("axios");
const cheerio = require("cheerio");
const fs = require("fs");
const {
Worker,
isMainThread,
parentPort,
workerData,
} = require("worker_threads");
class Product {
constructor(name, priceStr, url) {
this.name = this.cleanName(name);
this.priceGb = this.cleanPrice(priceStr);
this.priceUsd = this.convertPriceToUsd(this.priceGb);
this.url = this.createAbsoluteUrl(url);
}
cleanName(name) {
if (name == " " || name == "" || name == null) {
return "missing";
}
return name.trim();
}
cleanPrice(priceStr) {
priceStr = priceStr.trim();
priceStr = priceStr.replace("Sale price£", "");
priceStr = priceStr.replace("Sale priceFrom £", "");
if (priceStr == "") {
return 0.0;
}
return parseFloat(priceStr);
}
convertPriceToUsd(priceGb) {
return priceGb * 1.29;
}
createAbsoluteUrl(url) {
if (url == "" || url == null) {
return "missing";
}
return "https://www.chocolate.co.uk" + url;
}
}
class ProductDataPipeline {
constructor(csvFilename = "", storageQueueLimit = 5) {
this.seenProducts = new Set();
this.storageQueue = [];
this.csvFilename = csvFilename;
this.csvFileOpen = false;
this.storageQueueLimit = storageQueueLimit;
}
saveToCsv() {
this.csvFileOpen = true;
const fileExists = fs.existsSync(this.csvFilename);
const file = fs.createWriteStream(this.csvFilename, { flags: "a" });
if (!fileExists) {
file.write("name,priceGb,priceUsd,url\n");
}
for (const product of this.storageQueue) {
file.write(
`${product.name},${product.priceGb},${product.priceUsd},${product.url}\n`
);
}
file.end();
this.storageQueue = [];
this.csvFileOpen = false;
}
cleanRawProduct(rawProduct) {
return new Product(rawProduct.name, rawProduct.price, rawProduct.url);
}
isDuplicateProduct(product) {
if (!this.seenProducts.has(product.url)) {
this.seenProducts.add(product.url);
return false;
}
return true;
}
addProduct(rawProduct) {
const product = this.cleanRawProduct(rawProduct);
if (!this.isDuplicateProduct(product)) {
this.storageQueue.push(product);
if (
this.storageQueue.length >= this.storageQueueLimit &&
!this.csvFileOpen
) {
this.saveToCsv();
}
}
}
async close() {
while (this.csvFileOpen) {
// Wait for the file to be written
await new Promise((resolve) => setTimeout(resolve, 100));
}
if (this.storageQueue.length > 0) {
this.saveToCsv();
}
}
}
const listOfUrls = ["https://www.chocolate.co.uk/collections/all"];
async function getHeaders(numHeaders) {
const fallbackHeaders = [
{
"upgrade-insecure-requests": "1",
"user-agent":
"Mozilla/5.0 (Windows NT 10.0; Windows; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36",
accept:
"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"sec-ch-ua":
'".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": '"Windows"',
"sec-fetch-site": "none",
"sec-fetch-mod": "",
"sec-fetch-user": "?1",
"accept-encoding": "gzip, deflate, br",
"accept-language": "bg-BG,bg;q=0.9,en-US;q=0.8,en;q=0.7",
},
{
"upgrade-insecure-requests": "1",
"user-agent":
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36",
accept:
"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"sec-ch-ua":
'".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": '"Linux"',
"sec-fetch-site": "none",
"sec-fetch-mod": "",
"sec-fetch-user": "?1",
"accept-encoding": "gzip, deflate, br",
"accept-language": "fr-CH,fr;q=0.9,en-US;q=0.8,en;q=0.7",
},
];
const scrapeOpsKey = "<YOUR_SCRAPE_OPS_KEY>";
try {
const response = await axios.get(
`http://headers.scrapeops.io/v1/browser-headers?api_key=${scrapeOpsKey}&num_results=${numHeaders}`
);
if (response.data.result.length > 0) {
return response.data.result;
} else {
console.error("No headers from ScrapeOps, using fallback headers");
return fallbackHeaders;
}
} catch (error) {
console.error(
"Failed to fetch headers from ScrapeOps, using fallback headers"
);
return fallbackHeaders;
}
}
async function makeRequest(
url,
retries = 3,
antiBotCheck = false,
headers = {}
) {
for (let i = 0; i < retries; i++) {
try {
const response = await axios.get(url, {
headers: headers,
});
if ([200, 404].includes(response.status)) {
if (antiBotCheck && response.status == 200) {
if (response.data.includes("<title>Robot or human?</title>")) {
return null;
}
}
return response;
}
} catch (e) {
console.log(`Failed to fetch ${url}, retrying...`);
}
}
return null;
}
async function scrape(url, headers) {
const response = await makeRequest(url, 3, false, headers);
if (!response) {
throw new Error(`Failed to fetch ${url}`);
}
const html = response.data;
const $ = cheerio.load(html);
const productItems = $("product-item");
const products = [];
for (const productItem of productItems) {
const title = $(productItem).find(".product-item-meta__title").text();
const price = $(productItem).find(".price").first().text();
const url = $(productItem).find(".product-item-meta__title").attr("href");
products.push({ name: title, price: price, url: url });
}
const nextPage = $("a[rel='next']").attr("href");
return {
nextUrl: nextPage ? "https://www.chocolate.co.uk" + nextPage : null,
products: products,
};
}
if (isMainThread) {
const pipeline = new ProductDataPipeline("chocolate.csv", 5);
const workers = [];
for (const url of listOfUrls) {
workers.push(
new Promise((resolve, reject) => {
const worker = new Worker(__filename, {
workerData: { startUrl: url },
});
console.log("Worker created", worker.threadId, url);
worker.on("message", (product) => {
pipeline.addProduct(product);
});
worker.on("error", reject);
worker.on("exit", (code) => {
if (code !== 0) {
reject(new Error(`Worker stopped with exit code ${code}`));
} else {
console.log("Worker exited");
resolve();
}
});
})
);
}
Promise.all(workers)
.then(() => pipeline.close())
.then(() => console.log("Pipeline closed"));
} else {
// Perform work
const { startUrl } = workerData;
let headers = [];
const handleWork = async (workUrl) => {
if (headers.length == 0) {
headers = await getHeaders(2);
}
const { nextUrl, products } = await scrape(
workUrl,
headers[Math.floor(Math.random() * headers.length)]
);
for (const product of products) {
parentPort.postMessage(product);
}
if (nextUrl) {
console.log("Worker working on", nextUrl);
await handleWork(nextUrl);
}
};
handleWork(startUrl).then(() => console.log("Worker finished"));
}
NodeJS Puppeteer Beginners Series Part 5 - Faking User-Agents & Browser Headers
So far in this NodeJS Puppeteer 6-Part Beginner Series, we have learned how to build a basic web scraper Part 1, scrape data from a website in Part 2, clean it up, save it to a file or database in Part 3, and make our scraper more robust and scalable by handling failed requests and using concurrency in Part 4.
In Part 5, we’ll explore how to use fake user-agents and browser headers to bypass restrictions on sites trying to prevent scraping.
- Getting Blocked and Banned While Web Scraping
- Using Fake User-Agents When Scraping
- Using Fake Browser Headers When Scraping
- Creating Custom Middleware for User-Agents and Headers
- Integrating Middlewares in a Scraper
- Next Steps
Node.js Puppeteer 6-Part Beginner Series
-
Part 1: Basic Node.js Puppeteer Scraper - We'll learn the fundamentals of web scraping with Node.js and build your first scraper using Node.js Puppeteer. (Part 1)
-
Part 2: Cleaning Unruly Data & Handling Edge Cases - Web data can be messy and unpredictable. In this part, we'll create a robust scraper using data structures and cleaning techniques to handle these challenges. (Part 2)
-
Part 3: Storing Scraped Data in AWS S3, MySQL & Postgres DBs - Explore various options for storing your scraped data, including databases like MySQL or Postgres, cloud storage like AWS S3, and file formats like CSV and JSON. We'll discuss their pros, cons, and suitable use cases. (Part 3)
-
Part 4: Managing Retries & Concurrency - Enhance your scraper's reliability and scalability by handling failed requests and utilizing concurrency. (Part 4)
-
Part 5: Faking User-Agents & Browser Headers - Learn how to create a production-ready scraper by simulating real users through user-agent and browser header manipulation. (This article)
-
Part 6: Using Proxies To Avoid Getting Blocked - Discover how to use proxies to bypass anti-bot systems by disguising your real IP address and location. (Part 6)
Need help scraping the web?
Then check out ScrapeOps, the complete toolkit for web scraping.
Getting Blocked and Banned While Web Scraping
Web scraping large volumes of data can be challenging, especially when dealing with sophisticated anti-bot mechanisms. While it is easy to build and run scrapers, ensuring reliable retrieval of HTML responses from target pages is often difficult.
Websites like Amazon employ advanced techniques to detect and block scraping activities. This guide will show you how to use Puppeteer to mimic real browser interactions, thereby avoiding detection and blocks.
However, by properly managing user-agents and browser headers during scraping, you can counter these anti-bot techniques. These advanced techniques are optional for our beginner project on scraping chocolate.co.uk.
In this guide, we're still going to look at how to use fake user-agents and browser headers so that you can apply these techniques if you ever need to scrape a more difficult website like Amazon.
Using Fake User-Agents When Scraping
User-agents are pieces of information that your browser sends to a website, telling it what type of device and browser you're using. Many websites use this data to detect bots and block scraping attempts. To avoid this, you can rotate or fake user-agents to make your scraping activities look like they’re coming from different browsers or devices.
By simulating various user-agents, you can reduce the chances of being flagged as a bot and increase your scraping success on websites with anti-bot mechanisms.
What are User-Agents?
A user-agent is a string of text sent by your browser to a web server when you visit a website.
It's located in the HTTP header and contains details about your browser, operating system, and device, allowing the website to customize the content based on this information.
- Operating system: The user's operating system (e.g., Windows, macOS, Linux, Android, iOS)
- Browser: The specific browser being used (e.g., Chrome, Firefox, Safari, Edge)
- Browser version: The version of the browser
Here's an example of a user-agent string that might be sent when you visit a website using Chrome:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36
The user-agent string indicates that you are using Chrome version 109.0.0.0 on a 64-bit Windows 10 computer.
- The browser is Chrome
- The version of Chrome is 109.0.0.0
- The operating system is Windows 10
- The device is a 64-bit computer
Check out our Puppeteer Guide: Using Fake User Agents to get more information about using fake user-agents in NodeJS Puppeteer.
Why Use Fake User-Agents in Web Scraping
Websites often use user-agents to identify the type of browser, device, or bot making a request. If a site detects that multiple requests are coming from a bot or the same user-agent, it may block or throttle the requests to prevent scraping.
Using fake or rotating user-agents helps overcome these blocks by mimicking different browsers and devices, making the scraper appear as legitimate traffic.
This technique improves the likelihood of bypassing anti-scraping measures, allowing scrapers to access and collect data from websites that might otherwise restrict or block their efforts.
How to Set a Fake User-Agent in NodeJS Puppeteer
Just like with other scraping tools, using appropriate user-agents is crucial. Puppeteer allows you to easily set and rotate user-agents to avoid detection.
To set a user-agent in Puppeteer, you can use the setUserAgent method:
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36');
await page.goto('https://www.chocolate.co.uk');
const content = await page.content();
console.log(content);
await browser.close();
})();
How to Rotate User-Agents
You can rotate user-agents by creating a list of user-agents and selecting a random one for each request:
const puppeteer = require('puppeteer');
const userAgentList = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Safari/605.1.15',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36',
];
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
const randomUserAgent = userAgentList[Math.floor(Math.random() * userAgentList.length)];
await page.setUserAgent(randomUserAgent);
await page.goto('https://www.chocolate.co.uk');
const content = await page.content();
console.log(content);
await browser.close();
})();
Using Fake Browser Headers When Scraping
In addition to user-agents, setting other browser headers can make your scraping activities appear more legitimate.
Browser headers are key pieces of information that your browser sends to a website with each request. They include details like the user-agent, cookies, referrer, and accepted content types, helping the server understand the request and respond appropriately. Some websites use these headers to detect bots or scraping activity.
By faking or customizing browser headers, scrapers can disguise their requests to look like they’re coming from a regular user, not a bot. This helps avoid detection and allows scrapers to bypass security measures that block automated requests, improving the success of your web scraping.
Why Choose Fake Browser Headers Instead of User-Agents
While user-agents help disguise the type of browser and device making a request, websites often look at more than just the user-agent string to detect bots. Fake browser headers offer a broader and more convincing disguise by imitating a real browser's full request, including additional information like cookies, referrers, and accepted content types.
Here is an example header when using a Chrome browser on a MacOS machine:
sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="99", "Google Chrome";v="99"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "macOS"
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Sec-Fetch-Dest: document
Accept-Encoding: gzip, deflate, br
Accept-Language: en-GB,en-US;q=0.9,en;q=0.8
As we can see, real browsers send not only User-Agent
strings but also several other headers to identify and customize their requests. So, to improve the reliability of our scrapers, we should also include these headers when scraping.
How to Set Fake Browser Headers in NodeJS Puppeteer
To set custom headers in Puppeteer, use the setExtraHTTPHeaders method:
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.setExtraHTTPHeaders({
'Accept-Language': 'en-US,en;q=0.9',
'Upgrade-Insecure-Requests': '1',
});
await page.goto('https://www.chocolate.co.uk');
const content = await page.content();
console.log(content);
await browser.close();
})();
Rotating Browser Headers
Similar to user-agents, you can rotate headers to avoid detection:
const puppeteer = require('puppeteer');
const headersList = [
{
'Accept-Language': 'en-US,en;q=0.9',
'Upgrade-Insecure-Requests': '1',
},
{
'Accept-Language': 'fr-FR,fr;q=0.9',
'Upgrade-Insecure-Requests': '1',
},
{
'Accept-Language': 'es-ES,es;q=0.9',
'Upgrade-Insecure-Requests': '1',
},
];
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
const randomHeaders = headersList[Math.floor(Math.random() * headersList.length)];
await page.setExtraHTTPHeaders(randomHeaders);
await page.goto('https://www.chocolate.co.uk');
const content = await page.content();
console.log(content);
await browser.close();
})();
Creating Custom Middleware for User-Agents and Headers
To efficiently manage user-agents and headers, you can create a middleware that sets these values dynamically.
User-Agent Middleware
const puppeteer = require('puppeteer');

class UserAgentMiddleware {
constructor(userAgents) {
this.userAgents = userAgents;
}
getRandomUserAgent() {
const randomIndex = Math.floor(Math.random() * this.userAgents.length);
return this.userAgents[randomIndex];
}
}
const userAgentMiddleware = new UserAgentMiddleware([
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Safari/605.1.15',
]);
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.setUserAgent(userAgentMiddleware.getRandomUserAgent());
await page.goto('https://www.chocolate.co.uk');
const content = await page.content();
console.log(content);
await browser.close();
})();
Browser Headers Middleware
const puppeteer = require('puppeteer');

class BrowserHeadersMiddleware {
constructor(headersList) {
this.headersList = headersList;
}
getRandomHeaders() {
const randomIndex = Math.floor(Math.random() * this.headersList.length);
return this.headersList[randomIndex];
}
}
const headersMiddleware = new BrowserHeadersMiddleware([
{
'Accept-Language': 'en-US,en;q=0.9',
'Upgrade-Insecure-Requests': '1',
},
{
'Accept-Language': 'fr-FR,fr;q=0.9',
'Upgrade-Insecure-Requests': '1',
},
]);
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.setExtraHTTPHeaders(headersMiddleware.getRandomHeaders());
await page.goto('https://www.chocolate.co.uk');
const content = await page.content();
console.log(content);
await browser.close();
})();
Integrating Middlewares in a Scraper
Integrating the custom middlewares into a scraper is straightforward. Ensure the middleware is called before each page navigation.
const puppeteer = require('puppeteer');
class UserAgentMiddleware {
constructor(userAgents) {
this.userAgents = userAgents;
}
getRandomUserAgent() {
const randomIndex = Math.floor(Math.random() * this.userAgents.length);
return this.userAgents[randomIndex];
}
}
class BrowserHeadersMiddleware {
constructor(headersList) {
this.headersList = headersList;
}
getRandomHeaders() {
const randomIndex = Math.floor(Math.random() * this.headersList.length);
return this.headersList[randomIndex];
}
}
class Scraper {
constructor(userAgentMiddleware, headersMiddleware) {
this.userAgentMiddleware = userAgentMiddleware;
this.headersMiddleware = headersMiddleware;
}
async scrape(url) {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.setUserAgent(this.userAgentMiddleware.getRandomUserAgent());
await page.setExtraHTTPHeaders(this.headersMiddleware.getRandomHeaders());
await page.goto(url);
const content = await page.content();
await browser.close();
return content;
}
}
const userAgentMiddleware = new UserAgentMiddleware([
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Safari/605.1.15',
]);
const headersMiddleware = new BrowserHeadersMiddleware([
{
'Accept-Language': 'en-US,en;q=0.9',
'Upgrade-Insecure-Requests': '1',
},
{
'Accept-Language': 'fr-FR,fr;q=0.9',
'Upgrade-Insecure-Requests': '1',
},
]);
const scraper = new Scraper(userAgentMiddleware, headersMiddleware);
(async () => {
const content = await scraper.scrape('https://www.chocolate.co.uk');
console.log(content);
})();
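If you'd prefer not to hard-code the user-agent list, you could also populate UserAgentMiddleware from the ScrapeOps Fake User-Agent API at startup, the same way the Axios section does. Below is a minimal sketch; buildUserAgentMiddleware is a hypothetical helper (not part of the code above) and assumes axios is installed and <YOUR_SCRAPE_OPS_KEY> is replaced with your API key:
const axios = require('axios');

// Hypothetical helper: build a UserAgentMiddleware from the ScrapeOps
// Fake User-Agent API, falling back to a static list if the call fails
async function buildUserAgentMiddleware(numResults = 10) {
  const fallbackUserAgents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Safari/605.1.15',
  ];
  try {
    const response = await axios.get(
      `http://headers.scrapeops.io/v1/user-agents?api_key=<YOUR_SCRAPE_OPS_KEY>&num_results=${numResults}`
    );
    // Use the API results if any were returned, otherwise fall back to the static list
    const userAgents =
      response.data.result && response.data.result.length > 0
        ? response.data.result
        : fallbackUserAgents;
    return new UserAgentMiddleware(userAgents);
  } catch (error) {
    console.error('Failed to fetch user-agents from ScrapeOps, using fallback list');
    return new UserAgentMiddleware(fallbackUserAgents);
  }
}

// Usage: const scraper = new Scraper(await buildUserAgentMiddleware(), headersMiddleware);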
Node.js Playwright Beginners Series Part 5: Using Fake User-Agents and Browser Headers
Welcome to Part 5 of our Node.js Playwright Beginner Series!
So far in this series, we learned how to build a basic web scraper in Part 1, get it to scrape some data from a website in Part 2, clean up the data as it was being scraped and then save the data to a file or database in Part 3, and make our scraper more robust and scalable by handling failed requests and using concurrency in Part 4.
In this guide, we’ll walk through how to customize User-Agent strings and Browser Headers to make your scraper behave like a real user, rather than a headless browser like Playwright.
Many websites use advanced bot detection techniques to block scrapers. By making your scraper appear more like a legitimate user, you can minimize the risk of detection and ensure smoother scraping operations.
- Getting Blocked and Banned While Web Scraping
- Using Fake User-Agents When Scraping
- Using Fake Browser Headers When Scraping
- Next Steps
Node.js Playwright 6-Part Beginner Series
-
Part 1: Basic Node.js Playwright Scraper - We'll learn the fundamentals of web scraping with Node.js and build your first scraper using Playwright. (Part 1)
-
Part 2: Cleaning Unruly Data & Handling Edge Cases - Web data can be messy and unpredictable. In this part, we'll create a robust scraper using data structures and cleaning techniques to handle these challenges. (Part 2)
-
Part 3: Storing Scraped Data in AWS S3, MySQL & Postgres DBs - Explore various options for storing your scraped data, including databases like MySQL or Postgres, cloud storage like AWS S3, and file formats like CSV and JSON. We'll discuss their pros, cons, and suitable use cases. (Part 3)
-
Part 4: Managing Retries & Concurrency - Enhance your scraper's reliability and scalability by handling failed requests and utilizing concurrency. (Part 4)
-
Part 5: Faking User-Agents & Browser Headers - Learn how to create a production-ready scraper by simulating real users through user-agent and browser header manipulation. (This Article)
-
Part 6: Using Proxies To Avoid Getting Blocked - Discover how to use proxies to bypass anti-bot systems by disguising your real IP address and location. (Part 6)
Need help scraping the web?
Then check out ScrapeOps, the complete toolkit for web scraping.
Getting Blocked and Banned While Web Scraping
When scraping large volumes of data, you’ll quickly realize that building and running scrapers is the easy part-the real challenge is consistently retrieving HTML responses from the pages you want.
While scraping a few hundred pages on your local machine is manageable, websites will block your requests once you scale up to thousands or millions.
Major sites like Amazon monitor traffic using IP addresses and user-agents, employing advanced anti-bot systems to detect suspicious behavior. If your scraper is identified, your requests will be blocked.
Playwright scrapers are easily detected because their default settings signal bot-like behavior. Here’s why:
- User-Agent Strings: When Playwright runs in headless mode, it includes Headless in the User-Agent string, which makes it an easy target for websites monitoring for bots.
- Headers: Playwright's default headers differ from those sent by real browsers. Headers such as Accept-Language, Accept-Encoding, and User-Agent need to match typical browser requests. Any discrepancy can trigger bot detection mechanisms.
In this guide, we're still going to look at how to use fake user-agents and browser headers so that you can apply these techniques if you ever need to scrape a more difficult website like Amazon.
Using Fake User-Agents When Scraping
A common reason for getting blocked while web scraping is using bad User-Agent headers. Many websites are protective of their data and don’t want it scraped, so it's important to make your scraper appear as a legitimate user.
To achieve this, you need to carefully manage the User-Agent headers that are sent with your HTTP requests.
What are User-Agents?
A User-Agent is a string sent to the server via HTTP headers, allowing the server to identify the client making the request. This string typically contains information such as:
- Browser: The name and version of the browser (e.g., Chrome, Firefox).
- Operating System: The OS and its version (e.g., Windows 10, macOS).
- Rendering Engine: The engine used to display the content (e.g., WebKit, Gecko).
In Playwright, the default User-Agent string can expose that a request is coming from a headless browser, which is often flagged by websites.
Let's check the default User-Agent in Playwright by sending a request to the httpbin.io/user-agent endpoint:
const { chromium } = require('playwright');
(async () => {
const browser = await chromium.launch({ headless: true });
const context = await browser.newContext();
const page = await context.newPage();
await page.goto('https://httpbin.io/user-agent');
const content = await page.content();
console.log(content);
await browser.close();
})();
This outputs the following User-Agent string:
<html><head><meta name="color-scheme" content="light dark"><meta charset="utf-8"></head><body><pre>{
"user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/128.0.6613.18 Safari/537.36"
}
</pre><div class="json-formatter-container"></div></body></html>
The user-agent string indicates that the request came from headless Chrome version 128 running on a 64-bit Linux machine:
- The browser is Chrome (reported as HeadlessChrome)
- The browser version is 128.0.6613.18
- The operating system is Linux
- The device is a 64-bit computer
You can see the User-Agent contains the string HeadlessChrome, indicating that this request comes from a headless browser - a key signal for websites to detect bots.
Check out our Playwright Guide: Using Fake User Agents to get more information about using fake user-agents in NodeJS Playwright.
Why Use Fake User-Agents in Web Scraping
Fake User-Agents are used in web scraping to make requests appear as though they are coming from a real browser and a legitimate user rather than a bot.
You must set a unique user-agent for each request. Websites can detect repeated requests from the same user-agent and identify them as potential bots.
In Node.js, using Playwright for web scraping, the default User-Agent string may reveal that your requests are being made by an automated tool, which websites can detect and block. To avoid this, you should manually set a custom User-Agent before making each request to mimic a real browser.
For example, Playwright's default user-agent in headless mode looks like this:
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/128.0.6613.18 Safari/537.36
The HeadlessChrome token makes it obvious that your requests are coming from an automated browser, which could lead to blocking. Therefore, it's crucial to manage your user-agents when sending requests with Playwright.
How to Set a Fake User-Agent in Playwright
You can choose a genuine user agent from UserAgents.io for use in your code. For example, we've selected this user agent:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36
To use a specific User-Agent and override the default one in Playwright, you need to pass it to the newContext() method as the userAgent option. Check out the code below:
const { chromium } = require('playwright');
(async () => {
const browser = await chromium.launch({ headless: true });
const context = await browser.newContext({
userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
});
const page = await context.newPage();
await page.goto('https://httpbin.org/user-agent');
const content = await page.content();
console.log(content);
await browser.close();
})();
// <html><head><meta name="color-scheme" content="light dark"><meta charset="utf-8"></head><body><pre>{
// "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
// }
// </pre><div class="json-formatter-container"></div></body></html>
Our code has successfully overridden the default User-Agent, which previously contained a "Headless" string. It now resembles a genuine User-Agent without any such strings.
How to Rotate User-Agents
Using the same User-Agent for all requests isn't ideal. It can make your scraper appear suspicious since scrapers typically send a high volume of requests compared to regular users.
To mitigate this, you should rotate user agents and headers to simulate different profiles with each request.
Here's how you can do it:
const { chromium } = require('playwright');
const userAgents = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36",
"Mozilla/5.0 (iPhone; CPU iPhone OS 14_4_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Mobile/15E148 Safari/604.1",
"Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1)",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/87.0.664.75",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18363",
];
(async () => {
const browser = await chromium.launch({ headless: true });
const context = await browser.newContext({
userAgent: userAgents[Math.floor(Math.random() * userAgents.length)]
});
const page = await context.newPage();
await page.goto('http://httpbin.org/user-agent');
const content = await page.content();
console.log(content);
await browser.close();
})();
// <html><head><meta name="color-scheme" content="light dark"><meta charset="utf-8"></head><body><pre>{
//   "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36"
// }
// </pre><div class="json-formatter-container"></div></body></html>
// (The user-agent you see will be whichever one was randomly picked from the list.)
- Start by compiling a list of user-agents and storing them in an array called userAgents.
- Each time you create a new context, randomly select a user-agent from the array and pass it to the userAgent option in newContext().
Alternatively, you can use the npm package user-agents to get a larger dataset of user agents instead of manually listing them.
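For example, here's a minimal sketch using that package (assuming you've installed it with npm install user-agents; the deviceCategory filter shown is optional):
const { chromium } = require('playwright');
const UserAgent = require('user-agents'); // npm install user-agents

(async () => {
  // Generate a realistic desktop user-agent from the package's dataset
  const userAgent = new UserAgent({ deviceCategory: 'desktop' });

  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext({
    userAgent: userAgent.toString()
  });
  const page = await context.newPage();
  await page.goto('http://httpbin.org/user-agent');
  console.log(await page.content());
  await browser.close();
})();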
How to Create a Custom Fake User-Agent Middleware
Let's dive into creating a custom middleware that manages thousands of fake user agents efficiently. This middleware can be easily integrated into your scraper.
The best approach is to leverage a free user-agent API, like the ScrapeOps Fake User-Agent API. This API provides an up-to-date list of user agents, allowing your scraper to select a different one for each request.
To use the ScrapeOps Fake User-Agent API, you'll need to request a list of user agents from their endpoint:
http://headers.scrapeops.io/v1/user-agents?api_key=YOUR_API_KEY
To access this API, sign up for a free account and obtain an API key.
Here’s an example of the API response containing a list of user agents:
{
"result": [
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Safari/605.1.15",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Windows; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/603.3.8 (KHTML, like Gecko) Version/10.1.2 Safari/603.3.8",
"Mozilla/5.0 (Windows NT 10.0; Windows; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15",
"Mozilla/5.0 (Windows NT 10.0; Windows; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36"
]
}
To integrate the Fake User-Agent API into your scraper:
- Configure it to fetch a list of user agents when the scraper starts.
- Then, randomly select a user agent from this list for each request.
In case the list from the API is empty or unavailable, you can use a fallback list of user agents.
Here’s how to build the custom user-agent middleware:
- Create a Method to Fetch User Agents
Define a getHeaders() method to retrieve user agents from the ScrapeOps API and use fallback headers if needed:
const axios = require('axios');
async function getHeaders(numHeaders) {
const fallbackHeaders = [
"Mozilla/5.0 (Windows NT 10.0; Windows; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36",
];
const scrapeOpsKey = "<YOUR_SCRAPE_OPS_KEY>";
try {
const response = await axios.get(
`http://headers.scrapeops.io/v1/user-agents?api_key=${scrapeOpsKey}&num_results=${numHeaders}`
);
if (response.data.result.length > 0) {
return response.data.result;
} else {
console.error("No headers from ScrapeOps, using fallback headers");
return fallbackHeaders;
}
} catch (error) {
console.error(
"Failed to fetch headers from ScrapeOps, using fallback headers"
);
return fallbackHeaders;
}
}
- Use Random User Agents:
Call getHeaders() during the startup of your scraper to fetch user agents. Use these agents for each request by selecting a random one:
if (isMainThread) {
// ...
} else {
const { startUrl } = workerData;
let headers = [];
const handleWork = async (workUrl) => {
if (headers.length == 0) {
headers = await getHeaders(2);
}
// getHeaders() returns plain user-agent strings here, so wrap the chosen one
// in a header object before passing it to scrape() (which forwards it as extraHTTPHeaders)
const { nextUrl, products } = await scrape(workUrl, {
"user-agent": headers[Math.floor(Math.random() * headers.length)],
});
for (const product of products) {
parentPort.postMessage(product);
}
if (nextUrl) {
console.log("Worker working on", nextUrl);
await handleWork(nextUrl);
}
};
handleWork(startUrl).then(() => console.log("Worker finished"));
}
This setup ensures your scraper uses diverse user agents, making it less detectable and more effective.
Integrating User-Agent Middleware in a Scraper
Now that we've developed the getHeaders() middleware to fetch user agents from ScrapeOps using an Axios request, it's time to integrate it into our scraper.
To do this, we'll update our scrape() method from Part 4 to accept an additional parameter for headers. We'll then pass these headers as extraHTTPHeaders to the newPage() method.
Here's how you can implement it:
async function scrape(url, headers) {
const browser = await chromium.launch({ headless: true });
const page = await browser.newPage({
extraHTTPHeaders: headers
});
const response = await makeRequest(page, url);
if (!response) {
await browser.close();
return { nextUrl: null, products: [] };
}
const productItems = await page.$$eval("product-item", items =>
items.map(item => {
const titleElement = item.querySelector(".product-item-meta__title");
const priceElement = item.querySelector(".price");
return {
name: titleElement ? titleElement.textContent.trim() : null,
price: priceElement ? priceElement.textContent.trim() : null,
url: titleElement ? titleElement.getAttribute("href") : null
};
})
);
const nextUrl = await nextPage(page);
await browser.close();
return {
nextUrl: nextUrl,
products: productItems.filter(item => item.name && item.price && item.url)
};
}
Using Fake Browser Headers When Scraping
For basic websites, setting an up-to-date User-Agent may be sufficient for reliable data scraping. However, many popular sites now employ advanced anti-bot technologies that look beyond the user-agent to detect scraping activities.
These technologies analyze additional headers that a real browser typically sends along with the user-agent.
Why Choose Fake Browser Headers Instead of User-Agents
Incorporating a full set of browser headers, rather than just a fake user-agent, makes your requests more similar to those of genuine users. This approach helps your requests blend in better and reduces the likelihood of detection.
Here’s an example of the headers a Chrome browser on macOS might use:
sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="99", "Google Chrome";v="99"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "macOS"
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Sec-Fetch-Dest: document
Accept-Encoding: gzip, deflate, br
Accept-Language: en-GB,en-US;q=0.9,en;q=0.8
As shown, real browsers send not only a User-Agent string but also several additional headers to identify and customize their requests.
To enhance the reliability of your scrapers, you should include these headers along with your user-agent.
How to Set Fake Browser Headers in Node.js Playwright
Before we set custom headers, let’s examine the default headers sent by Playwright:
const { chromium } = require('playwright');
(async () => {
const browser = await chromium.launch({ headless: true });
const context = await browser.newContext();
const page = await context.newPage();
await page.goto('https://httpbin.org/headers');
const content = await page.content();
console.log(content);
await browser.close();
})();
The above code displays:
<html><head><meta name="color-scheme" content="light dark"><meta charset="utf-8"></head><body><pre>{
"headers": {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
"Accept-Encoding": "gzip, deflate, br, zstd",
"Host": "httpbin.org",
"Priority": "u=0, i",
"Sec-Ch-Ua": "\"Chromium\";v=\"128\", \"Not;A=Brand\";v=\"24\", \"HeadlessChrome\";v=\"128\"",
"Sec-Ch-Ua-Mobile": "?0",
"Sec-Ch-Ua-Platform": "\"Linux\"",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Sec-Fetch-User": "?1",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/128.0.6613.18 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-66e343d9-5b9187420b8de5f2661f4a73"
}
}
</pre><div class="json-formatter-container"></div></body></html>
As observed, the default headers are not as comprehensive as those sent by a real browser. This discrepancy can lead to detection and blocking when scraping numerous pages. To mitigate this, you should use a full set of fake browser headers.
To simulate a real browser, you'll need to set a full range of headers, not just the user-agent. Define these headers as key-value pairs and pass them to the extraHTTPHeaders parameter of the newPage() method. Here's an example:
const { chromium } = require('playwright');
const headers = {
authority: "httpbin.org",
"cache-control": "max-age=0",
"sec-ch-ua":
'"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"',
"sec-ch-ua-mobile": "?0",
"upgrade-insecure-requests": "1",
"user-agent":
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36",
accept:
"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"sec-fetch-site": "none",
"sec-fetch-mode": "navigate",
"sec-fetch-user": "?1",
"sec-fetch-dest": "document",
"accept-language": "en-US,en;q=0.9",
};
(async () => {
const browser = await chromium.launch({ headless: true });
// Note: context.newPage() takes no options, so pass the headers to browser.newPage(),
// which accepts context options such as extraHTTPHeaders
const page = await browser.newPage({
extraHTTPHeaders: headers
});
await page.goto('https://httpbin.org/headers');
const content = await page.content();
console.log(content);
await browser.close();
})();
In this code, we send a request to the httpbin.org/headers endpoint. You should see all the custom headers included in the request, making your scraping activity less detectable.
How to Create a Custom Fake Browser Headers Middleware
Creating a custom fake browser headers middleware is quite similar to setting up custom fake user-agent middleware.
You can either manually build a list of fake browser headers or use the ScrapeOps Fake Browser Headers API to get an updated list each time your scraper runs.
The ScrapeOps Fake Browser Headers API is a free service that provides a set of optimized fake browser headers. This can help you avoid blocks and bans, enhancing the reliability of your web scrapers.
The API endpoint you’ll use is:
http://headers.scrapeops.io/v1/browser-headers?api_key=YOUR_API_KEY
Here is an example response:
{
"result": [
{
"upgrade-insecure-requests": "1",
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Windows; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"sec-ch-ua": "\".Not/A)Brand\";v=\"99\", \"Google Chrome\";v=\"103\", \"Chromium\";v=\"103\"",
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": "\"Windows\"",
"sec-fetch-site": "none",
"sec-fetch-mode": "navigate",
"sec-fetch-user": "?1",
"accept-encoding": "gzip, deflate, br",
"accept-language": "bg-BG,bg;q=0.9,en-US;q=0.8,en;q=0.7"
},
{
"upgrade-insecure-requests": "1",
"user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"sec-ch-ua": "\".Not/A)Brand\";v=\"99\", \"Google Chrome\";v=\"103\", \"Chromium\";v=\"103\"",
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": "\"Linux\"",
"sec-fetch-site": "none",
"sec-fetch-mode": "navigate",
"sec-fetch-user": "?1",
"accept-encoding": "gzip, deflate, br",
"accept-language": "fr-CH,fr;q=0.9,en-US;q=0.8,en;q=0.7"
}
]
}
Steps to Integrate the API:
- Obtain an API Key: Sign up for a free account at ScrapeOps to get your API key.
- Configure Your Scraper: Set up your scraper to fetch a batch of updated headers from the API when it starts.
- Randomize Headers: For each request, select a random header from the list retrieved.
- Handle Empty or Failed Requests: If the header list is empty or the API request fails, use a predefined fallback header list.
Here is the complete code:
const { chromium } = require('playwright');
const fs = require('fs');
const { Worker, isMainThread, parentPort, workerData } = require('worker_threads');
const axios = require('axios');
class Product {
constructor(name, priceStr, url, conversionRate = 1.32) {
this.name = this.cleanName(name);
this.priceGb = this.cleanPrice(priceStr);
this.priceUsd = this.convertPriceToUsd(this.priceGb, conversionRate);
this.url = this.createAbsoluteUrl(url);
}
cleanName(name) {
return name?.trim() || "missing";
}
cleanPrice(priceStr) {
if (!priceStr?.trim()) {
return 0.0;
}
const cleanedPrice = priceStr
.replace(/Sale priceFrom £|Sale price£/g, "")
.trim();
return cleanedPrice ? parseFloat(cleanedPrice) : 0.0;
}
convertPriceToUsd(priceGb, conversionRate) {
return priceGb * conversionRate;
}
createAbsoluteUrl(url) {
return (url?.trim()) ? `https://www.chocolate.co.uk${url.trim()}` : "missing";
}
}
class ProductDataPipeline {
constructor(csvFilename = "", storageQueueLimit = 5) {
this.seenProducts = new Set();
this.storageQueue = [];
this.csvFilename = csvFilename;
this.csvFileOpen = false;
this.storageQueueLimit = storageQueueLimit;
}
saveToCsv() {
this.csvFileOpen = true;
const fileExists = fs.existsSync(this.csvFilename);
const file = fs.createWriteStream(this.csvFilename, { flags: "a" });
if (!fileExists) {
file.write("name,priceGb,priceUsd,url\n");
}
for (const product of this.storageQueue) {
file.write(
`${product.name},${product.priceGb},${product.priceUsd},${product.url}\n`
);
}
file.end();
this.storageQueue = [];
this.csvFileOpen = false;
}
cleanRawProduct(rawProduct) {
return new Product(rawProduct.name, rawProduct.price, rawProduct.url);
}
isDuplicateProduct(product) {
if (!this.seenProducts.has(product.url)) {
this.seenProducts.add(product.url);
return false;
}
return true;
}
addProduct(rawProduct) {
const product = this.cleanRawProduct(rawProduct);
if (!this.isDuplicateProduct(product)) {
this.storageQueue.push(product);
if (
this.storageQueue.length >= this.storageQueueLimit &&
!this.csvFileOpen
) {
this.saveToCsv();
}
}
}
async close() {
while (this.csvFileOpen) {
// Wait for the file to be written
await new Promise((resolve) => setTimeout(resolve, 1000));
}
if (this.storageQueue.length > 0) {
this.saveToCsv();
}
}
}
const listOfUrls = ["https://www.chocolate.co.uk/collections/all"];
const scrapeOpsKey = "<YOUR_SCRAPE_OPS_KEY>";
async function makeRequest(page, url, retries = 3, antiBotCheck = false) {
for (let i = 0; i < retries; i++) {
try {
const response = await page.goto(url);
const status = response.status();
if ([200, 404].includes(status)) {
if (antiBotCheck && status == 200) {
const content = await page.content();
if (content.includes("<title>Robot or human?</title>")) {
return null;
}
}
return response;
}
} catch (e) {
console.log(`Failed to fetch ${url}, retrying...`);
}
}
return null;
}
async function getHeaders(numHeaders) {
const fallbackHeaders = [
{
"upgrade-insecure-requests": "1",
"user-agent":
"Mozilla/5.0 (Windows NT 10.0; Windows; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36",
accept:
"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"sec-ch-ua":
'".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": '"Windows"',
"sec-fetch-site": "none",
"sec-fetch-mod": "",
"sec-fetch-user": "?1",
"accept-encoding": "gzip, deflate, br",
"accept-language": "bg-BG,bg;q=0.9,en-US;q=0.8,en;q=0.7",
},
{
"upgrade-insecure-requests": "1",
"user-agent":
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36",
accept:
"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"sec-ch-ua":
'".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": '"Linux"',
"sec-fetch-site": "none",
"sec-fetch-mod": "",
"sec-fetch-user": "?1",
"accept-encoding": "gzip, deflate, br",
"accept-language": "fr-CH,fr;q=0.9,en-US;q=0.8,en;q=0.7",
},
];
try {
const response = await axios.get(
`http://headers.scrapeops.io/v1/browser-headers?api_key=${scrapeOpsKey}&num_results=${numHeaders}`
);
if (response.data.result.length > 0) {
return response.data.result;
} else {
console.error("No headers from ScrapeOps, using fallback headers");
return fallbackHeaders;
}
} catch (error) {
console.error(
"Failed to fetch headers from ScrapeOps, using fallback headers"
);
return fallbackHeaders;
}
}
async function scrape(url, headers) {
const browser = await chromium.launch({ headless: true });
const page = await browser.newPage({
extraHTTPHeaders: headers
});
const response = await makeRequest(page, url);
if (!response) {
await browser.close();
return { nextUrl: null, products: [] };
}
const productItems = await page.$$eval("product-item", items =>
items.map(item => {
const titleElement = item.querySelector(".product-item-meta__title");
const priceElement = item.querySelector(".price");
return {
name: titleElement ? titleElement.textContent.trim() : null,
price: priceElement ? priceElement.textContent.trim() : null,
url: titleElement ? titleElement.getAttribute("href") : null
};
})
);
const nextUrl = await nextPage(page);
await browser.close();
return {
nextUrl: nextUrl,
products: productItems.filter(item => item.name && item.price && item.url)
};
}
async function nextPage(page) {
let nextUrl = null;
try {
nextUrl = await page.$eval("a.pagination__nav-item:nth-child(4)", item => item.href);
} catch (error) {
console.log('Last Page Reached');
}
return nextUrl;
}
if (isMainThread) {
const pipeline = new ProductDataPipeline("chocolate.csv", 5);
const workers = [];
for (const url of listOfUrls) {
workers.push(
new Promise((resolve, reject) => {
const worker = new Worker(__filename, {
workerData: { startUrl: url }
});
console.log("Worker created", worker.threadId, url);
worker.on("message", (product) => {
pipeline.addProduct(product);
});
worker.on("error", reject);
worker.on("exit", (code) => {
if (code !== 0) {
reject(new Error(`Worker stopped with exit code ${code}`));
} else {
console.log("Worker exited");
resolve();
}
});
})
);
}
Promise.all(workers)
.then(() => pipeline.close())
.then(() => console.log("Pipeline closed"));
} else {
const { startUrl } = workerData;
let headers = [];
const handleWork = async (workUrl) => {
if (headers.length == 0) {
headers = await getHeaders(2);
}
const { nextUrl, products } = await scrape(
workUrl,
headers[Math.floor(Math.random() * headers.length)]
);
for (const product of products) {
parentPort.postMessage(product);
}
if (nextUrl) {
console.log("Worker working on", nextUrl);
await handleWork(nextUrl);
}
};
handleWork(startUrl).then(() => console.log("Worker finished"));
}
// Worker created 1 https://www.chocolate.co.uk/collections/all
// Worker working on https://www.chocolate.co.uk/collections/all?page=2
// Worker working on https://www.chocolate.co.uk/collections/all?page=3
// Last Page Reached
// Worker finished
// Worker exited
// Pipeline closed
Integrating Fake Browser Headers Middleware
Integrating the headers middleware is straightforward. We only need to make a couple of small changes to the getHeaders() function that we previously created for fetching user-agents. In the updated function:
- We send a request to "http://headers.scrapeops.io/v1/browser-headers" with the necessary query parameters (api_key and num_results) and fetch the header sets using Axios, as sketched below.
- If the request fails or returns no results, we return the fallbackHeaders that we manually added to the script.
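Viewed in isolation, the updated getHeaders() function boils down to the minimal, trimmed sketch below (only a single, shortened fallback header set is shown and the rest of the scraper is omitted; the full fallback list lives in the complete listing that follows):
const axios = require("axios");

const scrapeOpsKey = "<YOUR_SCRAPE_OPS_KEY>";

async function getHeaders(numHeaders) {
  // Trimmed fallback: one hard-coded browser header set, used only if the API call fails.
  const fallbackHeaders = [
    {
      "user-agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36",
      accept:
        "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
      "accept-encoding": "gzip, deflate, br",
      "accept-language": "bg-BG,bg;q=0.9,en-US;q=0.8,en;q=0.7",
    },
  ];
  try {
    // Ask the ScrapeOps Browser Headers API for `numHeaders` realistic header sets.
    const response = await axios.get(
      `http://headers.scrapeops.io/v1/browser-headers?api_key=${scrapeOpsKey}&num_results=${numHeaders}`
    );
    if (response.data.result.length > 0) {
      return response.data.result;
    }
    console.error("No headers from ScrapeOps, using fallback headers");
    return fallbackHeaders;
  } catch (error) {
    console.error("Failed to fetch headers from ScrapeOps, using fallback headers");
    return fallbackHeaders;
  }
}
Each header set returned by getHeaders() is a plain object, so it can be passed straight to Playwright's extraHTTPHeaders option when creating a page, exactly as the scrape() function does in the full code below.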
After adding the browser headers middleware, here’s what the entire code will look like:
const { chromium } = require('playwright');
const fs = require('fs');
const { Worker, isMainThread, parentPort, workerData } = require('worker_threads');
const axios = require('axios');
class Product {
constructor(name, priceStr, url, conversionRate = 1.32) {
this.name = this.cleanName(name);
this.priceGb = this.cleanPrice(priceStr);
this.priceUsd = this.convertPriceToUsd(this.priceGb, conversionRate);
this.url = this.createAbsoluteUrl(url);
}
cleanName(name) {
return name?.trim() || "missing";
}
cleanPrice(priceStr) {
if (!priceStr?.trim()) {
return 0.0;
}
const cleanedPrice = priceStr
.replace(/Sale priceFrom £|Sale price£/g, "")
.trim();
return cleanedPrice ? parseFloat(cleanedPrice) : 0.0;
}
convertPriceToUsd(priceGb, conversionRate) {
return priceGb * conversionRate;
}
createAbsoluteUrl(url) {
return (url?.trim()) ? `https://www.chocolate.co.uk${url.trim()}` : "missing";
}
}
class ProductDataPipeline {
constructor(csvFilename = "", storageQueueLimit = 5) {
this.seenProducts = new Set();
this.storageQueue = [];
this.csvFilename = csvFilename;
this.csvFileOpen = false;
this.storageQueueLimit = storageQueueLimit;
}
saveToCsv() {
this.csvFileOpen = true;
const fileExists = fs.existsSync(this.csvFilename);
const file = fs.createWriteStream(this.csvFilename, { flags: "a" });
if (!fileExists) {
file.write("name,priceGb,priceUsd,url\n");
}
for (const product of this.storageQueue) {
file.write(
`${product.name},${product.priceGb},${product.priceUsd},${product.url}\n`
);
}
file.end();
this.storageQueue = [];
this.csvFileOpen = false;
}
cleanRawProduct(rawProduct) {
return new Product(rawProduct.name, rawProduct.price, rawProduct.url);
}
isDuplicateProduct(product) {
if (!this.seenProducts.has(product.url)) {
this.seenProducts.add(product.url);
return false;
}
return true;
}
addProduct(rawProduct) {
const product = this.cleanRawProduct(rawProduct);
if (!this.isDuplicateProduct(product)) {
this.storageQueue.push(product);
if (
this.storageQueue.length >= this.storageQueueLimit &&
!this.csvFileOpen
) {
this.saveToCsv();
}
}
}
async close() {
while (this.csvFileOpen) {
// Wait for the file to be written
await new Promise((resolve) => setTimeout(resolve, 1000));
}
if (this.storageQueue.length > 0) {
this.saveToCsv();
}
}
}
const listOfUrls = ["https://www.chocolate.co.uk/collections/all"];
const scrapeOpsKey = "<YOUR_SCRAPE_OPS_KEY>";
async function makeRequest(page, url, retries = 3, antiBotCheck = false) {
for (let i = 0; i < retries; i++) {
try {
const response = await page.goto(url);
const status = response.status();
if ([200, 404].includes(status)) {
if (antiBotCheck && status == 200) {
const content = await page.content();
if (content.includes("<title>Robot or human?</title>")) {
return null;
}
}
return response;
}
} catch (e) {
console.log(`Failed to fetch ${url}, retrying...`);
}
}
return null;
}
async function getHeaders(numHeaders) {
const fallbackHeaders = [
{
"upgrade-insecure-requests": "1",
"user-agent":
"Mozilla/5.0 (Windows NT 10.0; Windows; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36",
accept:
"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"sec-ch-ua":
'".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": '"Windows"',
"sec-fetch-site": "none",
"sec-fetch-mod": "",
"sec-fetch-user": "?1",
"accept-encoding": "gzip, deflate, br",
"accept-language": "bg-BG,bg;q=0.9,en-US;q=0.8,en;q=0.7",
},
{
"upgrade-insecure-requests": "1",
"user-agent":
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36",
accept:
"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"sec-ch-ua":
'".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": '"Linux"',
"sec-fetch-site": "none",
"sec-fetch-mod": "",
"sec-fetch-user": "?1",
"accept-encoding": "gzip, deflate, br",
"accept-language": "fr-CH,fr;q=0.9,en-US;q=0.8,en;q=0.7",
},
];
try {
const response = await axios.get(
`http://headers.scrapeops.io/v1/browser-headers?api_key=${scrapeOpsKey}&num_results=${numHeaders}`
);
if (response.data.result.length > 0) {
return response.data.result;
} else {
console.error("No headers from ScrapeOps, using fallback headers");
return fallbackHeaders;
}
} catch (error) {
console.error(
"Failed to fetch headers from ScrapeOps, using fallback headers"
);
return fallbackHeaders;
}
}
async function scrape(url, headers) {
const browser = await chromium.launch({ headless: true });
const page = await browser.newPage({
extraHTTPHeaders: headers
});
const response = await makeRequest(page, url);
if (!response) {
await browser.close();
return { nextUrl: null, products: [] };
}
const productItems = await page.$$eval("product-item", items =>
items.map(item => {
const titleElement = item.querySelector(".product-item-meta__title");
const priceElement = item.querySelector(".price");
return {
name: titleElement ? titleElement.textContent.trim() : null,
price: priceElement ? priceElement.textContent.trim() : null,
url: titleElement ? titleElement.getAttribute("href") : null
};
})
);
const nextUrl = await nextPage(page);
await browser.close();
return {
nextUrl: nextUrl,
products: productItems.filter(item => item.name && item.price && item.url)
};
}
async function nextPage(page) {
let nextUrl = null;
try {
nextUrl = await page.$eval("a.pagination__nav-item:nth-child(4)", item => item.href);
} catch (error) {
console.log('Last Page Reached');
}
return nextUrl;
}
if (isMainThread) {
const pipeline = new ProductDataPipeline("chocolate.csv", 5);
const workers = [];
for (const url of listOfUrls) {
workers.push(
new Promise((resolve, reject) => {
const worker = new Worker(__filename, {
workerData: { startUrl: url }
});
console.log("Worker created", worker.threadId, url);
worker.on("message", (product) => {
pipeline.addProduct(product);
});
worker.on("error", reject);
worker.on("exit", (code) => {
if (code !== 0) {
reject(new Error(`Worker stopped with exit code ${code}`));
} else {
console.log("Worker exited");
resolve();
}
});
})
);
}
Promise.all(workers)
.then(() => pipeline.close())
.then(() => console.log("Pipeline closed"));
} else {
const { startUrl } = workerData;
let headers = [];
const handleWork = async (workUrl) => {
if (headers.length == 0) {
headers = await getHeaders(2);
}
const { nextUrl, products } = await scrape(
workUrl,
headers[Math.floor(Math.random() * headers.length)]
);
for (const product of products) {
parentPort.postMessage(product);
}
if (nextUrl) {
console.log("Worker working on", nextUrl);
await handleWork(nextUrl);
}
};
handleWork(startUrl).then(() => console.log("Worker finished"));
}
// Worker created 1 https://www.chocolate.co.uk/collections/all
// Worker working on https://www.chocolate.co.uk/collections/all?page=2
// Worker working on https://www.chocolate.co.uk/collections/all?page=3
// Last Page Reached
// Worker finished
// Worker exited
// Pipeline closed
Next Steps
Now that you understand how to use fake user-agents and browser headers to overcome blocks and restrictions, you're ready to tackle more advanced scraping techniques.
You should now have a good grasp of how to cloak your requests so they look like genuine browser traffic, letting you scrape even heavily protected sites without tripping anti-bot alarms. If you run into any issues or have questions, drop them in the comments below and we'll jump in to help.
Want to dig into the code?
- Python Requests/BS4 version: grab the repository on GitHub here.
- Node.js implementations: see the full examples on GitHub here.
The next tutorial shows you how to layer in proxies - spreading requests across IP addresses to stay under the radar and unlock true production‑grade scale.