

NodeJS Puppeteer Beginners Series Part 5 - Faking User-Agents & Browser Headers

So far in this NodeJS Puppeteer 6-Part Beginner Series, we have learned how to build a basic web scraper in Part 1, clean unruly data and handle edge cases in Part 2, save the data to a file or database in Part 3, and make our scraper more robust and scalable by handling failed requests and using concurrency in Part 4.

In Part 5, we’ll explore how to use fake user-agents and browser headers to bypass restrictions on sites trying to prevent scraping.

Node.js Puppeteer 6-Part Beginner Series

  • Part 1: Basic Node.js Puppeteer Scraper - We'll learn the fundamentals of web scraping with Node.js and build our first scraper using NodeJS Puppeteer. (Part 1)

  • Part 2: Cleaning Unruly Data & Handling Edge Cases - Web data can be messy and unpredictable. In this part, we'll create a robust scraper using data structures and cleaning techniques to handle these challenges. (Part 2)

  • Part 3: Storing Scraped Data in AWS S3, MySQL & Postgres DBs - Explore various options for storing your scraped data, including databases like MySQL or Postgres, cloud storage like AWS S3, and file formats like CSV and JSON. We'll discuss their pros, cons, and suitable use cases. (Part 3)

  • Part 4: Managing Retries & Concurrency - Enhance your scraper's reliability and scalability by handling failed requests and utilizing concurrency. (Part 4)

  • Part 5: Faking User-Agents & Browser Headers - Learn how to create a production-ready scraper by simulating real users through user-agent and browser header manipulation. (This article)

  • Part 6: Using Proxies To Avoid Getting Blocked - Discover how to use proxies to bypass anti-bot systems by disguising your real IP address and location. (Part 6)



Getting Blocked and Banned While Web Scraping

Web scraping large volumes of data can be challenging, especially when dealing with sophisticated anti-bot mechanisms. While it is easy to build and run scrapers, ensuring reliable retrieval of HTML responses from target pages is often difficult.

Websites like Amazon employ advanced techniques to detect and block scraping activities. This guide will show you how to use Puppeteer to mimic real browser interactions, thereby avoiding detection and blocks.

However, by properly managing user-agents and browser headers during scraping, you can counter these anti-bot techniques. These techniques are optional for our beginner project on scraping chocolate.co.uk, but in this guide we're still going to look at how to use fake user-agents and browser headers so that you can apply them if you ever need to scrape a more difficult website like Amazon.


Using Fake User-Agents When Scraping

User-agents are pieces of information that your browser sends to a website, telling it what type of device and browser you're using. Many websites use this data to detect bots and block scraping attempts. To avoid this, you can rotate or fake user-agents to make your scraping activities look like they’re coming from different browsers or devices.

By simulating various user-agents, you can reduce the chances of being flagged as a bot and increase your scraping success on websites with anti-bot mechanisms.

What are User-Agents?

A user-agent is a string of text sent by your browser to a web server when you visit a website.

It's sent in the HTTP headers and contains details about your browser, operating system, and device, allowing the website to customize the content it serves. A user-agent string typically reveals:

  • Operating system: The user's operating system (e.g., Windows, macOS, Linux, Android, iOS)
  • Browser: The specific browser being used (e.g., Chrome, Firefox, Safari, Edge)
  • Browser version: The version of the browser

Here's an example of a user-agent string that might be sent when you visit a website using Chrome:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36

This user-agent string tells the website that:

  • The browser is Chrome
  • The version of Chrome is 109.0.0.0
  • The operating system is Windows 10
  • The device is a 64-bit computer
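
To see what your own Puppeteer setup sends by default, you can read the user-agent straight from the browser. The snippet below is a minimal sketch: browser.userAgent() returns the default user-agent string, which for headless Chrome typically contains "HeadlessChrome", one of the giveaways anti-bot systems look for.

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // The default user-agent Puppeteer reports (typically contains "HeadlessChrome")
  console.log(await browser.userAgent());

  // The user-agent visible to JavaScript running inside the page
  console.log(await page.evaluate(() => navigator.userAgent));

  await browser.close();
})();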

Check out our Puppeteer Guide: Using Fake User Agents for more information about using fake user-agents in NodeJS Puppeteer.

Why Use Fake User-Agents in Web Scraping

Websites often use user-agents to identify the type of browser, device, or bot making a request. If a site detects that multiple requests are coming from a bot or the same user-agent, it may block or throttle the requests to prevent scraping.

Using fake or rotating user-agents helps overcome these blocks by mimicking different browsers and devices, making the scraper appear as legitimate traffic.

This technique improves the likelihood of bypassing anti-scraping measures, allowing scrapers to access and collect data from websites that might otherwise restrict or block their efforts.

How to Set a Fake User-Agent in NodeJS Puppeteer

Just like with other scraping tools, using appropriate user-agents is crucial. Puppeteer allows you to easily set and rotate user-agents to avoid detection.

To set a user-agent in Puppeteer, you can use the setUserAgent method:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Set a fake user-agent before navigating so every request from this page uses it
  await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36');

  await page.goto('https://www.chocolate.co.uk');
  const content = await page.content();
  console.log(content);
  await browser.close();
})();

How to Rotate User-Agents

You can rotate user-agents by creating a list of user-agents and selecting a random one for each request:

const puppeteer = require('puppeteer');

const userAgentList = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Safari/605.1.15',
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36',
];

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Pick a random user-agent from the list for this session
  const randomUserAgent = userAgentList[Math.floor(Math.random() * userAgentList.length)];
  await page.setUserAgent(randomUserAgent);

  await page.goto('https://www.chocolate.co.uk');
  const content = await page.content();
  console.log(content);
  await browser.close();
})();

Using Fake Browser Headers When Scraping

In addition to user-agents, setting other browser headers can make your scraping activities appear more legitimate.

Browser headers are key pieces of information that your browser sends to a website with each request. They include details like the user-agent, cookies, referrer, and accepted content types, helping the server understand the request and respond appropriately. Some websites use these headers to detect bots or scraping activity.

By faking or customizing browser headers, scrapers can disguise their requests to look like they’re coming from a regular user, not a bot. This helps avoid detection and allows scrapers to bypass security measures that block automated requests, improving the success of your web scraping.

Why Choose Fake Browser Headers Instead of User-Agents

While user-agents help disguise the type of browser and device making a request, websites often look at more than just the user-agent string to detect bots. Fake browser headers offer a broader and more convincing disguise by imitating a real browser's full request, including additional information like cookies, referrers, and accepted content types.

Here is an example set of headers sent when using a Chrome browser on a macOS machine:

sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="99", "Google Chrome";v="99"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "macOS"
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Sec-Fetch-Dest: document
Accept-Encoding: gzip, deflate, br
Accept-Language: en-GB,en-US;q=0.9,en;q=0.8

As we can see, real browsers send not only User-Agent strings but also several other headers to identify and customize their requests. So, to improve the reliability of our scrapers, we should also include these headers when scraping.

How to Set Fake Browser Headers in NodeJS Puppeteer

To set custom headers in Puppeteer, use the setExtraHTTPHeaders method:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setExtraHTTPHeaders({
    'Accept-Language': 'en-US,en;q=0.9',
    'Upgrade-Insecure-Requests': '1',
  });
  await page.goto('https://www.chocolate.co.uk');
  const content = await page.content();
  console.log(content);
  await browser.close();
})();
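
The example above only sets two headers. If you want to mimic the fuller Chrome fingerprint shown earlier, one possible approach (a sketch, not the only way to do it) is to pair a matching user-agent with a larger header set. Headers that the browser manages itself, such as Accept-Encoding and the Sec-Fetch-* headers, are best left out so you don't end up with conflicting values.

const puppeteer = require('puppeteer');

// Chrome-on-macOS headers based on the example above. Keep the values consistent
// with the user-agent you set so the fingerprint doesn't contradict itself.
// Note: Chrome can also set its own client-hint (sec-ch-ua*) headers, so treat
// this as a sketch and verify the headers actually sent against a test endpoint.
const chromeHeaders = {
  'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="99", "Google Chrome";v="99"',
  'sec-ch-ua-mobile': '?0',
  'sec-ch-ua-platform': '"macOS"',
  'Upgrade-Insecure-Requests': '1',
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
  'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
};

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Match the user-agent to the Chrome/macOS values used in the headers above
  await page.setUserAgent('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36');
  await page.setExtraHTTPHeaders(chromeHeaders);

  await page.goto('https://www.chocolate.co.uk');
  const content = await page.content();
  console.log(content);
  await browser.close();
})();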

Rotating Browser Headers

Similar to user-agents, you can rotate headers to avoid detection:

const puppeteer = require('puppeteer');

const headersList = [
  {
    'Accept-Language': 'en-US,en;q=0.9',
    'Upgrade-Insecure-Requests': '1',
  },
  {
    'Accept-Language': 'fr-FR,fr;q=0.9',
    'Upgrade-Insecure-Requests': '1',
  },
  {
    'Accept-Language': 'es-ES,es;q=0.9',
    'Upgrade-Insecure-Requests': '1',
  },
];

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const randomHeaders = headersList[Math.floor(Math.random() * headersList.length)];
  await page.setExtraHTTPHeaders(randomHeaders);
  await page.goto('https://www.chocolate.co.uk');
  const content = await page.content();
  console.log(content);
  await browser.close();
})();

Creating Custom Middleware for User-Agents and Headers

To efficiently manage user-agents and headers, you can create a middleware that sets these values dynamically.

User-Agent Middleware

const puppeteer = require('puppeteer');

class UserAgentMiddleware {
  constructor(userAgents) {
    this.userAgents = userAgents;
  }

  // Return a random user-agent from the configured list
  getRandomUserAgent() {
    const randomIndex = Math.floor(Math.random() * this.userAgents.length);
    return this.userAgents[randomIndex];
  }
}

const userAgentMiddleware = new UserAgentMiddleware([
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Safari/605.1.15',
]);

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setUserAgent(userAgentMiddleware.getRandomUserAgent());
  await page.goto('https://www.chocolate.co.uk');
  const content = await page.content();
  console.log(content);
  await browser.close();
})();

Browser Headers Middleware

const puppeteer = require('puppeteer');

class BrowserHeadersMiddleware {
  constructor(headersList) {
    this.headersList = headersList;
  }

  // Return a random header set from the configured list
  getRandomHeaders() {
    const randomIndex = Math.floor(Math.random() * this.headersList.length);
    return this.headersList[randomIndex];
  }
}

const headersMiddleware = new BrowserHeadersMiddleware([
  {
    'Accept-Language': 'en-US,en;q=0.9',
    'Upgrade-Insecure-Requests': '1',
  },
  {
    'Accept-Language': 'fr-FR,fr;q=0.9',
    'Upgrade-Insecure-Requests': '1',
  },
]);

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setExtraHTTPHeaders(headersMiddleware.getRandomHeaders());
  await page.goto('https://www.chocolate.co.uk');
  const content = await page.content();
  console.log(content);
  await browser.close();
})();

Integrating Middlewares in a Scraper

Integrating the custom middlewares into a scraper is straightforward. Ensure the middleware is called before each page navigation.

const puppeteer = require('puppeteer');

class UserAgentMiddleware {
  constructor(userAgents) {
    this.userAgents = userAgents;
  }

  getRandomUserAgent() {
    const randomIndex = Math.floor(Math.random() * this.userAgents.length);
    return this.userAgents[randomIndex];
  }
}

class BrowserHeadersMiddleware {
  constructor(headersList) {
    this.headersList = headersList;
  }

  getRandomHeaders() {
    const randomIndex = Math.floor(Math.random() * this.headersList.length);
    return this.headersList[randomIndex];
  }
}

class Scraper {
  constructor(userAgentMiddleware, headersMiddleware) {
    this.userAgentMiddleware = userAgentMiddleware;
    this.headersMiddleware = headersMiddleware;
  }

  async scrape(url) {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.setUserAgent(this.userAgentMiddleware.getRandomUserAgent());
    await page.setExtraHTTPHeaders(this.headersMiddleware.getRandomHeaders());
    await page.goto(url);
    const content = await page.content();
    await browser.close();
    return content;
  }
}

const userAgentMiddleware = new UserAgentMiddleware([
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Safari/605.1.15',
]);

const headersMiddleware = new BrowserHeadersMiddleware([
  {
    'Accept-Language': 'en-US,en;q=0.9',
    'Upgrade-Insecure-Requests': '1',
  },
  {
    'Accept-Language': 'fr-FR,fr;q=0.9',
    'Upgrade-Insecure-Requests': '1',
  },
]);

const scraper = new Scraper(userAgentMiddleware, headersMiddleware);

(async () => {
  const content = await scraper.scrape('https://www.chocolate.co.uk');
  console.log(content);
})();
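
As a quick usage sketch (the URLs below are just placeholders for the pages you want to scrape), the same Scraper instance can be reused across many pages, with each call to scrape picking a freshly randomized user-agent and header set:

// Usage sketch: placeholder URLs for the pages you want to scrape
const urls = [
  'https://www.chocolate.co.uk/collections/all',
  'https://www.chocolate.co.uk/collections/all?page=2',
];

(async () => {
  for (const url of urls) {
    // Each call launches a fresh browser with a new random user-agent and headers
    const content = await scraper.scrape(url);
    console.log(`Fetched ${content.length} characters from ${url}`);
  }
})();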

Next Steps

With Puppeteer, you can effectively bypass many anti-bot mechanisms by simulating real browser interactions. By setting and rotating user-agents and browser headers, you can make your scraping activities appear more legitimate.

The next step in enhancing your scraper's reliability is to use proxies to distribute your requests and further reduce the risk of detection and blocking.