

Node.js Playwright Beginners Series Part 5: Using Fake User-Agents and Browser Headers

Welcome to Part 5 of our Node.js Playwright Beginner Series!

So far in this series, we built a basic web scraper in Part 1, got it scraping data from a website in Part 2, cleaned the data as it was being scraped and saved it to a file or database in Part 3, and made the scraper more robust and scalable by handling failed requests and using concurrency in Part 4.

In this guide, we'll walk through how to customize User-Agent strings and browser headers so that your scraper looks like a real user's browser rather than an automated headless browser driven by Playwright.

Many websites use advanced bot detection techniques to block scrapers. By making your scraper appear more like a legitimate user, you can minimize the risk of detection and ensure smoother scraping operations.

Node.js Playwright 6-Part Beginner Series

  • Part 1: Basic Node.js Playwright Scraper - We'll learn the fundamentals of web scraping with Node.js and build your first scraper using Playwright. (Part 1)

  • Part 2: Cleaning Unruly Data & Handling Edge Cases - Web data can be messy and unpredictable. In this part, we'll create a robust scraper using data structures and cleaning techniques to handle these challenges. (Part 2)

  • Part 3: Storing Scraped Data in AWS S3, MySQL & Postgres DBs - Explore various options for storing your scraped data, including databases like MySQL or Postgres, cloud storage like AWS S3, and file formats like CSV and JSON. We'll discuss their pros, cons, and suitable use cases. (Part 3)

  • Part 4: Managing Retries & Concurrency - Enhance your scraper's reliability and scalability by handling failed requests and utilizing concurrency. (Part 4)

  • Part 5: Faking User-Agents & Browser Headers - Learn how to create a production-ready scraper by simulating real users through user-agent and browser header manipulation. (This Article)

  • Part 6: Using Proxies To Avoid Getting Blocked - Discover how to use proxies to bypass anti-bot systems by disguising your real IP address and location. (Part 6)

Need help scraping the web?

Then check out ScrapeOps, the complete toolkit for web scraping.


Getting Blocked and Banned While Web Scraping

When scraping large volumes of data, you'll quickly realize that building and running scrapers is the easy part; the real challenge is consistently retrieving HTML responses from the pages you want.

While scraping a few hundred pages on your local machine is manageable, websites will block your requests once you scale up to thousands or millions.

Major sites like Amazon monitor traffic using IP addresses and user-agents, employing advanced anti-bot systems to detect suspicious behavior. If your scraper is identified, your requests will be blocked.

Playwright scrapers are easily detected because their default settings signal bot-like behavior. Here’s why:

  • User-Agent Strings: When Playwright runs in headless mode, it includes Headless in the User-Agent string, which makes it an easy target for websites monitoring for bots.

  • Headers: Playwright's default headers differ from those sent by real browsers. Headers such as Accept-Language, Accept-Encoding, and User-Agent need to match typical browser requests. Any discrepancy can trigger bot detection mechanisms.

In this guide, we'll look at how to use fake user-agents and browser headers so that you can apply these techniques whenever you need to scrape a more heavily protected website like Amazon.


Using Fake User-Agents When Scraping

A common reason for getting blocked while web scraping is using bad User-Agent headers. Many websites are protective of their data and don’t want it scraped, so it's important to make your scraper appear as a legitimate user.

To achieve this, you need to carefully manage the User-Agent headers that are sent with your HTTP requests.

What are User-Agents?

A User-Agent is a string sent to the server via HTTP headers, allowing the server to identify the client making the request. This string typically contains information such as:

  • Browser: The name and version of the browser (e.g., Chrome, Firefox).
  • Operating System: The OS and its version (e.g., Windows 10, macOS).
  • Rendering Engine: The engine used to display the content (e.g., WebKit, Gecko).
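
For example, here's how those pieces map onto the desktop Chrome User-Agent string we'll use later in this article (an illustrative breakdown):

// Mozilla/5.0 (Windows NT 10.0; Win64; x64)  -> legacy "Mozilla" token plus the operating system and architecture
// AppleWebKit/537.36 (KHTML, like Gecko)     -> rendering engine
// Chrome/91.0.4472.124                       -> browser name and version
// Safari/537.36                              -> legacy compatibility token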

In Playwright, the default User-Agent string can expose that a request is coming from a headless browser, which is often flagged by websites.

Let's check the default User-Agent in Playwright by sending a request to the httpbin.io/user-agent endpoint:

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({ headless: true });

  const context = await browser.newContext();
  const page = await context.newPage();
  await page.goto('https://httpbin.io/user-agent');

  const content = await page.content();
  console.log(content);

  await browser.close();
})();

This prints the page content returned by httpbin, which reveals the default User-Agent string:

<html><head><meta name="color-scheme" content="light dark"><meta charset="utf-8"></head><body><pre>{
"user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/128.0.6613.18 Safari/537.36"
}
</pre><div class="json-formatter-container"></div></body></html>

Breaking this user-agent string down:

  • Mozilla/5.0 is a legacy compatibility token that virtually all modern browsers send
  • (X11; Linux x86_64) identifies the operating system: 64-bit Linux
  • AppleWebKit/537.36 (KHTML, like Gecko) is the rendering engine
  • HeadlessChrome/128.0.6613.18 is the browser name and version

You can see the User-Agent contains the string HeadlessChrome, indicating that this request comes from a headless browser—a key signal for websites to detect bots.

Check out our Playwright: Using Fake User Agents guide for more information about using fake user-agents in Node.js Playwright.

Why Use Fake User-Agents in Web Scraping

Fake User-Agents are used in web scraping to make requests appear as though they are coming from a real browser and a legitimate user rather than a bot.

Ideally, you should also vary the user-agent from request to request: websites can detect a high volume of requests coming from the same user-agent and flag them as a potential bot.

When you use Playwright for web scraping in Node.js, the default User-Agent string can reveal that your requests come from an automated tool, which websites can detect and block. To avoid this, you should set a custom User-Agent that mimics a real browser before making each request.

As we saw above, Playwright's default user-agent looks like this:

'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/128.0.6613.18 Safari/537.36'

The HeadlessChrome token makes it obvious that your requests are coming from an automated browser, which could lead to blocking. Therefore, it's crucial to manage your user-agents when sending requests with Playwright.

How to Set a Fake User-Agent in Playwright

You can choose a genuine user agent from UserAgents.io for use in your code. For example, we've selected this user agent:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36

To use a specific User-Agent and override the default one in Playwright, pass it to the newContext() method via the userAgent option. Check out the code below:

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({ headless: true });

  const context = await browser.newContext({
    userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
  });

  const page = await context.newPage();
  await page.goto('https://httpbin.org/user-agent');

  const content = await page.content();
  console.log(content);

  await browser.close();
})();

// <html><head><meta name="color-scheme" content="light dark"><meta charset="utf-8"></head><body><pre>{
// "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
// }
// </pre><div class="json-formatter-container"></div></body></html>

Our code has successfully overridden the default User-Agent, which previously contained a "Headless" string. It now resembles a genuine User-Agent without any such strings.

How to Rotate User-Agents

Using the same User-Agent for all requests isn't ideal. It can make your scraper appear suspicious since scrapers typically send a high volume of requests compared to regular users.

To mitigate this, you should rotate user agents and headers to simulate different profiles with each request.

Here's how you can do it:

const { chromium } = require('playwright');

const userAgents = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36",
  "Mozilla/5.0 (iPhone; CPU iPhone OS 14_4_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Mobile/15E148 Safari/604.1",
  "Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1)",
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/87.0.664.75",
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18363",
];

(async () => {
  const browser = await chromium.launch({ headless: true });

  // Pick a random user agent from the list for this context
  const context = await browser.newContext({
    userAgent: userAgents[Math.floor(Math.random() * userAgents.length)]
  });

  const page = await context.newPage();

  await page.goto('http://httpbin.org/user-agent');

  const content = await page.content();
  console.log(content);

  await browser.close();
})();

// Example output (the user-agent is selected at random, so yours may differ):
// <html><head><meta name="color-scheme" content="light dark"><meta charset="utf-8"></head><body><pre>{
// "user-agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 14_4_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Mobile/15E148 Safari/604.1"
// }
// </pre><div class="json-formatter-container"></div></body></html>

Here's what's happening:
  • Start by compiling a list of user agents and storing them in an array called userAgents.
  • Each time you create a new context, randomly select a user agent from the array and pass it to the userAgent option in newContext().

Alternatively, you can use the npm package user-agents to get a larger dataset of user agents instead of manually listing them.
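
For example, here's a minimal sketch using the user-agents package (assuming you've installed it with npm install user-agents), which generates a realistic user-agent for each new context instead of picking from a hard-coded list:

const { chromium } = require('playwright');
const UserAgent = require('user-agents');

(async () => {
  const browser = await chromium.launch({ headless: true });

  // Generate a random, realistic desktop user-agent string for this context
  const userAgent = new UserAgent({ deviceCategory: 'desktop' });
  const context = await browser.newContext({ userAgent: userAgent.toString() });

  const page = await context.newPage();
  await page.goto('http://httpbin.org/user-agent');
  console.log(await page.content());

  await browser.close();
})();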

How to Create a Custom Fake User-Agent Middleware

Let's dive into creating a custom middleware that manages thousands of fake user agents efficiently. This middleware can be easily integrated into your scraper.

The best approach is to leverage a free user-agent API, like the ScrapeOps Fake User-Agent API. This API provides an up-to-date list of user agents, allowing your scraper to select a different one for each request.

To use the ScrapeOps Fake User-Agent API, you'll need to request a list of user agents from their endpoint:

http://headers.scrapeops.io/v1/user-agents?api_key=YOUR_API_KEY

To access this API, sign up for a free account and obtain an API key.

Here’s an example of the API response containing a list of user agents:

{
"result": [
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Safari/605.1.15",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Windows; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/603.3.8 (KHTML, like Gecko) Version/10.1.2 Safari/603.3.8",
"Mozilla/5.0 (Windows NT 10.0; Windows; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15",
"Mozilla/5.0 (Windows NT 10.0; Windows; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36"
]
}

To integrate the Fake User-Agent API into your scraper:

  • Configure it to fetch a list of user agents when the scraper starts.
  • Then, randomly select a user agent from this list for each request.

In case the list from the API is empty or unavailable, you can use a fallback list of user agents.

Here’s how to build the custom user-agent middleware:

  1. Create a Method to Fetch User Agents

Define a method getHeaders() to retrieve user agents from the ScrapeOps API and use fallback headers if needed:

const axios = require('axios');

async function getHeaders(numHeaders) {
  // User agents to fall back on if the ScrapeOps API is empty or unreachable
  const fallbackHeaders = [
    "Mozilla/5.0 (Windows NT 10.0; Windows; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36",
  ];
  const scrapeOpsKey = "<YOUR_SCRAPE_OPS_KEY>";

  try {
    const response = await axios.get(
      `http://headers.scrapeops.io/v1/user-agents?api_key=${scrapeOpsKey}&num_results=${numHeaders}`
    );

    if (response.data.result.length > 0) {
      return response.data.result;
    } else {
      console.error("No headers from ScrapeOps, using fallback headers");
      return fallbackHeaders;
    }
  } catch (error) {
    console.error(
      "Failed to fetch headers from ScrapeOps, using fallback headers"
    );
    return fallbackHeaders;
  }
}
  2. Use Random User Agents

Call getHeaders() during the startup of your scraper to fetch user agents. Use these agents for each request by selecting a random one:

if (isMainThread) {
  // ...
} else {
  const { startUrl } = workerData;
  let headers = [];

  const handleWork = async (workUrl) => {
    // Fetch the user agent list once, on the first request
    if (headers.length == 0) {
      headers = await getHeaders(2);
    }
    const { nextUrl, products } = await scrape(
      workUrl,
      headers[Math.floor(Math.random() * headers.length)]
    );
    for (const product of products) {
      parentPort.postMessage(product);
    }

    if (nextUrl) {
      console.log("Worker working on", nextUrl);
      await handleWork(nextUrl);
    }
  };

  handleWork(startUrl).then(() => console.log("Worker finished"));
}

This setup ensures your scraper uses diverse user agents, making it less detectable and more effective.

Integrating User-Agent Middleware in a Scraper

Now that we’ve developed the getHeaders() middleware to fetch user agents from ScrapeOps using an Axios request, it's time to integrate it into our scraper.

To do this, we'll update our scrape() method from Part 4 to accept an additional parameter for the user agent. We'll then pass it to the newPage() method via the userAgent option (newPage() accepts the same options as newContext()).

Here's how you can implement it:

async function scrape(url, userAgent) {
  const browser = await chromium.launch({ headless: true });
  // Apply the selected user agent to the page's browser context
  const page = await browser.newPage({
    userAgent: userAgent
  });

  const response = await makeRequest(page, url);
  if (!response) {
    await browser.close();
    return { nextUrl: null, products: [] };
  }

  const productItems = await page.$$eval("product-item", items =>
    items.map(item => {
      const titleElement = item.querySelector(".product-item-meta__title");
      const priceElement = item.querySelector(".price");
      return {
        name: titleElement ? titleElement.textContent.trim() : null,
        price: priceElement ? priceElement.textContent.trim() : null,
        url: titleElement ? titleElement.getAttribute("href") : null
      };
    })
  );

  const nextUrl = await nextPage(page);
  await browser.close();

  return {
    nextUrl: nextUrl,
    products: productItems.filter(item => item.name && item.price && item.url)
  };
}
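
If you want to try this outside the worker setup from Part 4, here's a minimal single-threaded sketch that wires getHeaders() and scrape() together (it assumes the makeRequest() and nextPage() helpers from earlier parts are also in scope):

(async () => {
  // Fetch a pool of user agents once at startup
  const userAgents = await getHeaders(10);
  const randomUserAgent = () =>
    userAgents[Math.floor(Math.random() * userAgents.length)];

  // Walk the paginated product listing, using a different user agent per page
  let url = "https://www.chocolate.co.uk/collections/all";
  while (url) {
    const { nextUrl, products } = await scrape(url, randomUserAgent());
    console.log(`Scraped ${products.length} products from ${url}`);
    url = nextUrl;
  }
})();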

Using Fake Browser Headers When Scraping

For basic websites, setting an up-to-date User-Agent may be sufficient for reliable data scraping. However, many popular sites now employ advanced anti-bot technologies that look beyond the user-agent to detect scraping activities.

These technologies analyze additional headers that a real browser typically sends along with the user-agent.

Why Choose Fake Browser Headers Instead of User-Agents

Incorporating a full set of browser headers, rather than just a fake user-agent, makes your requests more similar to those of genuine users. This approach helps your requests blend in better and reduces the likelihood of detection.

Here’s an example of the headers a Chrome browser on macOS might use:

sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="99", "Google Chrome";v="99"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "macOS"
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Sec-Fetch-Dest: document
Accept-Encoding: gzip, deflate, br
Accept-Language: en-GB,en-US;q=0.9,en;q=0.8

As shown, real browsers send not only a User-Agent string but also several additional headers to identify and customize their requests.

To enhance the reliability of your scrapers, you should include these headers along with your user-agent.

How to Set Fake Browser Headers in Node.js Playwright

Before we set custom headers, let’s examine the default headers sent by Playwright:

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({ headless: true });

  const context = await browser.newContext();
  const page = await context.newPage();

  await page.goto('https://httpbin.org/headers');

  const content = await page.content();
  console.log(content);

  await browser.close();
})();

The above code displays:

<html><head><meta name="color-scheme" content="light dark"><meta charset="utf-8"></head><body><pre>{
"headers": {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
"Accept-Encoding": "gzip, deflate, br, zstd",
"Host": "httpbin.org",
"Priority": "u=0, i",
"Sec-Ch-Ua": "\"Chromium\";v=\"128\", \"Not;A=Brand\";v=\"24\", \"HeadlessChrome\";v=\"128\"",
"Sec-Ch-Ua-Mobile": "?0",
"Sec-Ch-Ua-Platform": "\"Linux\"",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Sec-Fetch-User": "?1",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/128.0.6613.18 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-66e343d9-5b9187420b8de5f2661f4a73"
}
}
</pre><div class="json-formatter-container"></div></body></html>

As observed, the default headers are not as comprehensive as those sent by a real browser. This discrepancy can lead to detection and blocking when scraping numerous pages. To mitigate this, you should use a full set of fake browser headers.

To simulate a real browser, you'll need to set a full range of headers, not just the user-agent. Define these headers as key-value pairs and pass them via the extraHTTPHeaders option when creating the browser context. Here's an example:

const { chromium } = require('playwright');

const headers = {
  authority: "httpbin.org",
  "cache-control": "max-age=0",
  "sec-ch-ua":
    '"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"',
  "sec-ch-ua-mobile": "?0",
  "upgrade-insecure-requests": "1",
  "user-agent":
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36",
  accept:
    "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
  "sec-fetch-site": "none",
  "sec-fetch-mode": "navigate",
  "sec-fetch-user": "?1",
  "sec-fetch-dest": "document",
  "accept-language": "en-US,en;q=0.9",
};

(async () => {
  const browser = await chromium.launch({ headless: true });

  // Attach the full header set to the browser context
  const context = await browser.newContext({
    extraHTTPHeaders: headers
  });
  const page = await context.newPage();

  await page.goto('https://httpbin.org/headers');

  const content = await page.content();
  console.log(content);

  await browser.close();
})();

In this code, we send a request to the httpbin.org/headers endpoint. You should see all the custom headers included in the request, making your scraping activity less detectable.
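
If you'd rather check this programmatically than read the raw HTML, you can parse the JSON that httpbin echoes back inside the <pre> element. A small sketch you could add before browser.close() in the script above:

// httpbin.org/headers echoes the request headers back as JSON inside a <pre> tag
const body = await page.textContent('pre');
const echoed = JSON.parse(body);
console.log(echoed.headers['Accept-Language']); // should match the accept-language header we set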


How to Create a Custom Fake Browser Headers Middleware

Creating a custom fake browser headers middleware is quite similar to setting up the custom fake user-agent middleware.

You can either manually build a list of fake browser headers or use the ScrapeOps Fake Browser Headers API to get an updated list each time your scraper runs.

The ScrapeOps Fake Browser Headers API is a free service that provides a set of optimized fake browser headers. This can help you avoid blocks and bans, enhancing the reliability of your web scrapers.

The API endpoint you’ll use is:

http://headers.scrapeops.io/v1/browser-headers?api_key=YOUR_API_KEY

Here is an example response:

{
"result": [
{
"upgrade-insecure-requests": "1",
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Windows; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"sec-ch-ua": "\".Not/A)Brand\";v=\"99\", \"Google Chrome\";v=\"103\", \"Chromium\";v=\"103\"",
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": "\"Windows\"",
"sec-fetch-site": "none",
"sec-fetch-mode": "navigate",
"sec-fetch-user": "?1",
"accept-encoding": "gzip, deflate, br",
"accept-language": "bg-BG,bg;q=0.9,en-US;q=0.8,en;q=0.7"
},
{
"upgrade-insecure-requests": "1",
"user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"sec-ch-ua": "\".Not/A)Brand\";v=\"99\", \"Google Chrome\";v=\"103\", \"Chromium\";v=\"103\"",
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": "\"Linux\"",
"sec-fetch-site": "none",
"sec-fetch-mode": "navigate",
"sec-fetch-user": "?1",
"accept-encoding": "gzip, deflate, br",
"accept-language": "fr-CH,fr;q=0.9,en-US;q=0.8,en;q=0.7"
}
]
}

Steps to Integrate the API:

  • Obtain an API Key: Sign up for a free account at ScrapeOps to get your API key.
  • Configure Your Scraper: Set up your scraper to fetch a batch of updated headers from the API when it starts.
  • Randomize Headers: For each request, select a random header from the list retrieved.
  • Handle Empty or Failed Requests: If the header list is empty or the API request fails, use a predefined fallback header list.

Here's the updated getHeaders() middleware. Compared to the user-agent version, it requests full browser header sets from the browser-headers endpoint, and its fallback list contains complete header objects rather than plain user-agent strings. (It relies on the axios import and the module-level scrapeOpsKey constant defined in the full scraper shown in the next section.)

async function getHeaders(numHeaders) {
  // Full header sets to fall back on if the ScrapeOps API is empty or unreachable
  const fallbackHeaders = [
    {
      "upgrade-insecure-requests": "1",
      "user-agent":
        "Mozilla/5.0 (Windows NT 10.0; Windows; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36",
      accept:
        "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
      "sec-ch-ua":
        '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
      "sec-ch-ua-mobile": "?0",
      "sec-ch-ua-platform": '"Windows"',
      "sec-fetch-site": "none",
      "sec-fetch-mode": "navigate",
      "sec-fetch-user": "?1",
      "accept-encoding": "gzip, deflate, br",
      "accept-language": "bg-BG,bg;q=0.9,en-US;q=0.8,en;q=0.7",
    },
    {
      "upgrade-insecure-requests": "1",
      "user-agent":
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36",
      accept:
        "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
      "sec-ch-ua":
        '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
      "sec-ch-ua-mobile": "?0",
      "sec-ch-ua-platform": '"Linux"',
      "sec-fetch-site": "none",
      "sec-fetch-mode": "navigate",
      "sec-fetch-user": "?1",
      "accept-encoding": "gzip, deflate, br",
      "accept-language": "fr-CH,fr;q=0.9,en-US;q=0.8,en;q=0.7",
    },
  ];

  try {
    const response = await axios.get(
      `http://headers.scrapeops.io/v1/browser-headers?api_key=${scrapeOpsKey}&num_results=${numHeaders}`
    );

    if (response.data.result.length > 0) {
      return response.data.result;
    } else {
      console.error("No headers from ScrapeOps, using fallback headers");
      return fallbackHeaders;
    }
  } catch (error) {
    console.error(
      "Failed to fetch headers from ScrapeOps, using fallback headers"
    );
    return fallbackHeaders;
  }
}


Integrating Fake Browser Headers Middleware

Integrating the headers middleware is straightforward. Only a couple of small changes are needed compared to the user-agent version: getHeaders() now calls the browser-headers endpoint and returns full header objects instead of plain user-agent strings, and scrape() passes the randomly selected object to newPage() via the extraHTTPHeaders option rather than setting userAgent.

After adding the browser headers middleware, here's what the entire code looks like:

const { chromium } = require('playwright');
const fs = require('fs');
const { Worker, isMainThread, parentPort, workerData } = require('worker_threads');
const axios = require('axios');


class Product {
constructor(name, priceStr, url, conversionRate = 1.32) {
this.name = this.cleanName(name);
this.priceGb = this.cleanPrice(priceStr);
this.priceUsd = this.convertPriceToUsd(this.priceGb, conversionRate);
this.url = this.createAbsoluteUrl(url);
}

cleanName(name) {
return name?.trim() || "missing";
}

cleanPrice(priceStr) {
if (!priceStr?.trim()) {
return 0.0;
}

const cleanedPrice = priceStr
.replace(/Sale priceFrom £|Sale price£/g, "")
.trim();

return cleanedPrice ? parseFloat(cleanedPrice) : 0.0;
}

convertPriceToUsd(priceGb, conversionRate) {
return priceGb * conversionRate;
}

createAbsoluteUrl(url) {
return (url?.trim()) ? `https://www.chocolate.co.uk${url.trim()}` : "missing";
}
}

class ProductDataPipeline {
constructor(csvFilename = "", storageQueueLimit = 5) {
this.seenProducts = new Set();
this.storageQueue = [];
this.csvFilename = csvFilename;
this.csvFileOpen = false;
this.storageQueueLimit = storageQueueLimit;
}

saveToCsv() {
this.csvFileOpen = true;
const fileExists = fs.existsSync(this.csvFilename);
const file = fs.createWriteStream(this.csvFilename, { flags: "a" });
if (!fileExists) {
file.write("name,priceGb,priceUsd,url\n");
}
for (const product of this.storageQueue) {
file.write(
`${product.name},${product.priceGb},${product.priceUsd},${product.url}\n`
);
}
file.end();
this.storageQueue = [];
this.csvFileOpen = false;
}

cleanRawProduct(rawProduct) {
return new Product(rawProduct.name, rawProduct.price, rawProduct.url);
}

isDuplicateProduct(product) {
if (!this.seenProducts.has(product.url)) {
this.seenProducts.add(product.url);
return false;
}
return true;
}

addProduct(rawProduct) {
const product = this.cleanRawProduct(rawProduct);
if (!this.isDuplicateProduct(product)) {
this.storageQueue.push(product);
if (
this.storageQueue.length >= this.storageQueueLimit &&
!this.csvFileOpen
) {
this.saveToCsv();
}
}
}

async close() {
while (this.csvFileOpen) {
// Wait for the file to be written
await new Promise((resolve) => setTimeout(resolve, 1000));
}
if (this.storageQueue.length > 0) {
this.saveToCsv();
}
}
}

const listOfUrls = ["https://www.chocolate.co.uk/collections/all"];
const scrapeOpsKey = "<YOUR_SCRAPE_OPS_KEY>";

async function makeRequest(page, url, retries = 3, antiBotCheck = false) {
for (let i = 0; i < retries; i++) {
try {
const response = await page.goto(url);
const status = response.status();
if ([200, 404].includes(status)) {
if (antiBotCheck && status == 200) {
const content = await page.content();
if (content.includes("<title>Robot or human?</title>")) {
return null;
}
}
return response;
}
} catch (e) {
console.log(`Failed to fetch ${url}, retrying...`);
}
}
return null;
}

async function getHeaders(numHeaders) {
const fallbackHeaders = [
{
"upgrade-insecure-requests": "1",
"user-agent":
"Mozilla/5.0 (Windows NT 10.0; Windows; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36",
accept:
"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"sec-ch-ua":
'".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": '"Windows"',
"sec-fetch-site": "none",
"sec-fetch-mode": "navigate",
"sec-fetch-user": "?1",
"accept-encoding": "gzip, deflate, br",
"accept-language": "bg-BG,bg;q=0.9,en-US;q=0.8,en;q=0.7",
},
{
"upgrade-insecure-requests": "1",
"user-agent":
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36",
accept:
"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"sec-ch-ua":
'".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": '"Linux"',
"sec-fetch-site": "none",
"sec-fetch-mode": "navigate",
"sec-fetch-user": "?1",
"accept-encoding": "gzip, deflate, br",
"accept-language": "fr-CH,fr;q=0.9,en-US;q=0.8,en;q=0.7",
},
];


try {
const response = await axios.get(
`http://headers.scrapeops.io/v1/browser-headers?api_key=${scrapeOpsKey}&num_results=${numHeaders}`
);

if (response.data.result.length > 0) {
return response.data.result;
} else {
console.error("No headers from ScrapeOps, using fallback headers");
return fallbackHeaders;
}
} catch (error) {
console.error(
"Failed to fetch headers from ScrapeOps, using fallback headers"
);
return fallbackHeaders;
}
}

async function scrape(url, headers) {
const browser = await chromium.launch({ headless: true });
const page = await browser.newPage({
extraHTTPHeaders: headers
});

const response = await makeRequest(page, url);
if (!response) {
await browser.close();
return { nextUrl: null, products: [] };
}

const productItems = await page.$$eval("product-item", items =>
items.map(item => {
const titleElement = item.querySelector(".product-item-meta__title");
const priceElement = item.querySelector(".price");
return {
name: titleElement ? titleElement.textContent.trim() : null,
price: priceElement ? priceElement.textContent.trim() : null,
url: titleElement ? titleElement.getAttribute("href") : null
};
})
);

const nextUrl = await nextPage(page);
await browser.close();

return {
nextUrl: nextUrl,
products: productItems.filter(item => item.name && item.price && item.url)
};
}

async function nextPage(page) {
let nextUrl = null;
try {
nextUrl = await page.$eval("a.pagination__nav-item:nth-child(4)", item => item.href);
} catch (error) {
console.log('Last Page Reached');
}
return nextUrl;
}

if (isMainThread) {
const pipeline = new ProductDataPipeline("chocolate.csv", 5);
const workers = [];

for (const url of listOfUrls) {
workers.push(
new Promise((resolve, reject) => {
const worker = new Worker(__filename, {
workerData: { startUrl: url }
});
console.log("Worker created", worker.threadId, url);

worker.on("message", (product) => {
pipeline.addProduct(product);
});

worker.on("error", reject);
worker.on("exit", (code) => {
if (code !== 0) {
reject(new Error(`Worker stopped with exit code ${code}`));
} else {
console.log("Worker exited");
resolve();
}
});
})
);
}

Promise.all(workers)
.then(() => pipeline.close())
.then(() => console.log("Pipeline closed"));
} else {
const { startUrl } = workerData;
let headers = [];

const handleWork = async (workUrl) => {
if (headers.length == 0) {
headers = await getHeaders(2);
}
const { nextUrl, products } = await scrape(
workUrl,
headers[Math.floor(Math.random() * headers.length)]
);
for (const product of products) {
parentPort.postMessage(product);
}

if (nextUrl) {
console.log("Worker working on", nextUrl);
await handleWork(nextUrl);
}
};

handleWork(startUrl).then(() => console.log("Worker finished"));
}

// Worker created 1 https://www.chocolate.co.uk/collections/all
// Worker working on https://www.chocolate.co.uk/collections/all?page=2
// Worker working on https://www.chocolate.co.uk/collections/all?page=3
// Last Page Reached
// Worker finished
// Worker exited
// Pipeline closed

Next Steps

Now that you understand how to use User Agents and Browser Headers to overcome blocks and restrictions, you're ready to tackle more advanced scraping techniques.

In the next tutorial, we’ll dive into using proxies to bypass anti-bot measures.