

Node.js Playwright Beginners Series Part 4: Retries and Concurrency

So far in this Node.js Playwright Beginner Series, we learned how to build a basic web scraper in Part 1, scrape data from a website in Part 2, and clean that data and save it to a file or database in Part 3.

In Part 4 of our Node.js Playwright Beginner Series, we’ll focus on making our scraper more robust, scalable, and faster.

Let’s dive into it!

Node.js Playwright 6-Part Beginner Series

  • Part 1: Basic Node.js Playwright Scraper - Learn the fundamentals of web scraping with Node.js and build your first scraper using Playwright. (Part 1)

  • Part 2: Cleaning Unruly Data & Handling Edge Cases - Web data can be messy and unpredictable. In this part, we'll create a robust scraper using data structures and cleaning techniques to handle these challenges. (Part 2)

  • Part 3: Storing Scraped Data in AWS S3, MySQL & Postgres DBs - Explore various options for storing your scraped data, including databases like MySQL or Postgres, cloud storage like AWS S3, and file formats like CSV and JSON. We'll discuss their pros, cons, and suitable use cases. (Part 3)

  • Part 4: Managing Retries & Concurrency - Enhance your scraper's reliability and scalability by handling failed requests and utilizing concurrency. (This Article)

  • Part 5: Faking User-Agents & Browser Headers - Learn how to create a production-ready scraper by simulating real users through user-agent and browser header manipulation. (Part 5)

  • Part 6: Using Proxies To Avoid Getting Blocked - Discover how to use proxies to bypass anti-bot systems by disguising your real IP address and location. (Part 6)

Need help scraping the web?

Then check out ScrapeOps, the complete toolkit for web scraping.


Understanding Scraper Performance Bottlenecks

In web scraping projects, network delays are a primary bottleneck. While each request and response only takes a fraction of a second, these delays accumulate when scraping large numbers of pages (e.g., 5,000), leading to significant slowdowns.

Though humans may not notice delays on a few pages, scraping tools handling thousands of requests can experience delays stretching into hours.

In addition to network delays, other factors like parsing the data, identifying relevant information, and storing or processing it also impact performance. These CPU-intensive tasks can further slow down the overall scraping process.
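If you want to see where your own scraper spends its time, a quick (and admittedly rough) way is to time the network wait and the parsing step separately. The sketch below is not part of the final scraper; it assumes the chocolate.co.uk listing page used throughout this series:

// A minimal timing sketch, not part of the final scraper.
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();

  console.time("network");   // time spent waiting on the request/response
  await page.goto("https://www.chocolate.co.uk/collections/all");
  console.timeEnd("network");

  console.time("parsing");   // time spent extracting data inside the page
  await page.$$eval("product-item", items => items.map(item => item.textContent));
  console.timeEnd("parsing");

  await browser.close();
})();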


Retry Requests and Concurrency Importance

The retry mechanism in web scraping helps tackle several common problems, ensuring the scraper doesn't break or miss out on data due to temporary issues.

Here are some key problems it can solve:

  1. Network Issues:
    • Timeouts (Error Code: ETIMEDOUT): The server takes too long to respond, which can happen due to network congestion or server overload. Retrying gives the server another chance to respond.
    • Connection Resets (Error Code: ECONNRESET): The connection drops unexpectedly. A retry can reestablish the connection and continue scraping without data loss.

  2. Server-Side Errors:
    • 500 Internal Server Error: Indicates that the server encountered a problem. Retrying after a short delay can help if the issue is temporary.
    • 502 Bad Gateway: This error suggests a problem with the server's upstream or proxy. Retrying might allow you to connect after the issue clears up.
    • 503 Service Unavailable: Often means the server is overloaded or undergoing maintenance. Retrying after a delay can give the server time to recover.
    • 504 Gateway Timeout: This happens when the server is taking too long to respond, but it might work on retry.

  3. Client-Side Errors:
    • 408 Request Timeout: The request took too long to complete. A retry gives you a chance to succeed without adjusting the timeout manually.
    • 429 Too Many Requests: Indicates that you've hit the rate limit. Retrying with some delay gives you a chance to stay within limits while still scraping data.

Websites often use rate limits to manage traffic, and introducing delays between requests can help you stay within those limits and avoid being blocked. When scraping, you may come across pages with dynamically loaded content. In such cases, multiple attempts with timed retries might be needed to capture all the elements correctly.
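The makeRequest() function we build below keeps things simple and retries immediately, but as a rough sketch of the "retry with a delay" idea, you could back off a little longer on each attempt. fetchWithBackoff is a hypothetical helper, not part of the final code:

// A hedged sketch of retrying with an increasing delay (exponential backoff).
async function fetchWithBackoff(page, url, retries = 3) {
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      const response = await page.goto(url);
      // Retry on typical "try again later" statuses; return anything else.
      if (response && ![429, 500, 502, 503, 504].includes(response.status())) {
        return response;
      }
    } catch (e) {
      // Network errors such as ETIMEDOUT or ECONNRESET land here.
    }
    // Wait 1s, then 2s, then 4s, ... before the next attempt.
    await new Promise((resolve) => setTimeout(resolve, 1000 * 2 ** attempt));
  }
  return null;
}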

Without concurrency, a scraper sends one request at a time, waits for the response, and then sends the next one. This leads to unnecessary idle time when the scraper is just waiting for a server response.

Here’s how sequential execution looks:

Sequential execution diagram (Source: RealPython)

In the sequential execution model:

  • Active Time (Blue boxes): This is when your scraper is actively processing data or sending requests.
  • Waiting Time (Red boxes): This is when your scraper is paused, waiting for responses or I/O operations, such as downloading data or saving files.

With concurrency, your scraper sends multiple requests at the same time. Instead of waiting for each request to complete before starting the next, your program overlaps the waiting periods, significantly reducing overall runtime.

Concurrent execution diagram (Source: RealPython)

In the concurrent execution model:

  • Active Time (Blue boxes): Multiple requests are processed simultaneously, maximizing the use of available resources.
  • Waiting Time (Red boxes): Overlapped waiting periods for different requests reduce overall downtime, making the process more efficient.

By handling multiple open requests simultaneously, concurrency makes your scraper more efficient, especially when dealing with time-consuming tasks like I/O operations (network requests, reading/writing files). This leads to faster scraping and better performance, even with high-volume data extraction.
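Before we get to the worker_threads approach used later in this article, here is a simplified sketch of the same idea using Promise.all to overlap several page loads. It is only an illustration, and the second URL is assumed to exist on the site:

// A simplified concurrency sketch, not the approach used in the complete code.
const { chromium } = require('playwright');

(async () => {
  const urls = [
    "https://www.chocolate.co.uk/collections/all",
    "https://www.chocolate.co.uk/collections/all?page=2",
  ];

  const browser = await chromium.launch({ headless: true });

  // Each URL gets its own page, so the network waits overlap instead of stacking up.
  const titles = await Promise.all(
    urls.map(async (url) => {
      const page = await browser.newPage();
      await page.goto(url);
      const title = await page.title();
      await page.close();
      return title;
    })
  );

  console.log(titles);
  await browser.close();
})();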


Retry Logic Mechanism

Let's refine our scraper to include retry functionality. Previously, our scraper looped through URLs and made requests without checking the response status:

for (const url of listOfUrls) {
  const response = await page.goto(url);
  // Further processing...
}

We’ll enhance this by adding a retry mechanism: a makeRequest() function that handles retries and confirms a valid response.

Here’s the updated approach:

const response = await makeRequest(page, url, 3, false);
if (!response) {
  throw new Error(`Failed to fetch ${url}`);
}

The makeRequest() function takes four parameters:

  1. the Playwright page object,
  2. the URL,
  3. the number of retries, and
  4. an optional flag to check for bot detection.

It attempts to load the URL multiple times based on the retry count.

  • If the request completes with a 200 or 404 status within the allowed attempts, it returns the response.
  • If the anti-bot flag is set and a "Robot or human?" page is detected, it returns null.
  • Otherwise, it keeps retrying until the limit is reached and then returns null.

Here’s how it works:

async function makeRequest(page, url, retries = 3, antiBotCheck = false) {
  for (let i = 0; i < retries; i++) {
    try {
      const response = await page.goto(url);
      const status = response.status();
      if ([200, 404].includes(status)) {
        // Optionally check whether we were served an anti-bot page.
        if (antiBotCheck && status === 200) {
          const content = await page.content();
          if (content.includes("<title>Robot or human?</title>")) {
            return null;
          }
        }
        return response;
      }
    } catch (e) {
      console.log(`Failed to fetch ${url}, retrying...`);
    }
  }
  return null;
}

Concurrency Management

Concurrency is the ability to execute multiple tasks simultaneously, making efficient use of system resources and often speeding up execution. In Node.js, concurrency differs from traditional multi-threaded or multi-process approaches due to its single-threaded, event-driven architecture.

Node.js runs on a single thread, leveraging non-blocking I/O calls. This design allows it to handle thousands of concurrent connections efficiently without the overhead of thread context switching.

For CPU-intensive tasks, Node.js provides the worker_threads module, which allows JavaScript to run in parallel on separate threads. Rather than sharing memory (and risking race conditions), threads exchange data through message passing, for example via parentPort or a MessageChannel.

An alternative for concurrency in Node.js is the cluster module, which creates multiple child processes to utilize multi-core systems. For this guide, we'll focus on worker_threads as it's better suited for our needs.
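For reference, here is a minimal sketch of what the cluster approach looks like (it assumes Node 16+ for cluster.isPrimary); we won't build on it further:

// A minimal cluster sketch: forks whole processes rather than threads.
const cluster = require('cluster');
const os = require('os');

if (cluster.isPrimary) {
  // Fork one worker process per CPU core.
  for (let i = 0; i < os.cpus().length; i++) {
    cluster.fork();
  }
  cluster.on('exit', (worker) => console.log(`Worker ${worker.process.pid} exited`));
} else {
  // Each forked process would run its own scraping job here.
  console.log(`Worker ${process.pid} started`);
  process.exit(0);
}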

Here’s how we adapt our entry point code to support worker_threads:

if (isMainThread) {
  const pipeline = new ProductDataPipeline("chocolate.csv", 5);
  const workers = [];

  for (const url of listOfUrls) {
    workers.push(
      new Promise((resolve, reject) => {
        const worker = new Worker(__filename, {
          workerData: { startUrl: url }
        });
        console.log("Worker created", worker.threadId, url);

        worker.on("message", (product) => {
          pipeline.addProduct(product);
        });

        worker.on("error", reject);
        worker.on("exit", (code) => {
          if (code !== 0) {
            reject(new Error(`Worker stopped with exit code ${code}`));
          } else {
            console.log("Worker exited");
            resolve();
          }
        });
      })
    );
  }

  Promise.all(workers)
    .then(() => pipeline.close())
    .then(() => console.log("Pipeline closed"));
} else {
  const { startUrl } = workerData;

  const handleWork = async (workUrl) => {
    const { nextUrl, products } = await scrape(workUrl);
    for (const product of products) {
      parentPort.postMessage(product);
    }

    if (nextUrl) {
      console.log("Worker working on", nextUrl);
      await handleWork(nextUrl);
    }
  };

  handleWork(startUrl).then(() => console.log("Worker finished"));
}
  • We used isMainThread to determine if the code is running in the main thread or a worker thread.
  • For each URL, a worker thread is created and associated with a Promise.
  • Workers send found products back to the pipeline. Errors are handled by rejecting the Promise, and successful exits resolve the Promise.
  • We wait for all worker Promises to complete and then close the pipeline.
  • The handleWork function recursively scrapes URLs and sends products back to the main thread using parentPort.
  • If additional URLs are found, they are recursively processed.

This approach may seem complex compared to threading in languages like Python, but it's straightforward once you grasp the concepts.
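One note before the complete code: listOfUrls only contains a single start URL, so we spawn a single worker. If you had many start URLs, you could cap how many workers run at once by processing them in batches. runWorker and runInBatches below are hypothetical helpers and not part of the complete code:

// A hedged sketch of capping concurrency when there are many start URLs.
// (Product/message handling is omitted for brevity.)
const { Worker } = require('worker_threads');

function runWorker(url) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(__filename, { workerData: { startUrl: url } });
    worker.on('error', reject);
    worker.on('exit', (code) =>
      code === 0 ? resolve() : reject(new Error(`Worker stopped with exit code ${code}`))
    );
  });
}

async function runInBatches(urls, batchSize = 3) {
  for (let i = 0; i < urls.length; i += batchSize) {
    // Wait for each batch of workers to finish before starting the next.
    await Promise.all(urls.slice(i, i + batchSize).map(runWorker));
  }
}

// Usage (sketch): await runInBatches(listOfUrls, 3);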


Complete Code

With all these improvements, our scraper is now more scalable and robust. It’s better at handling errors and can efficiently manage concurrent requests. Here is the complete code:

const { chromium } = require('playwright');
const fs = require('fs');
const { Worker, isMainThread, parentPort, workerData } = require('worker_threads');

class Product {
  constructor(name, priceStr, url, conversionRate = 1.32) {
    this.name = this.cleanName(name);
    this.priceGb = this.cleanPrice(priceStr);
    this.priceUsd = this.convertPriceToUsd(this.priceGb, conversionRate);
    this.url = this.createAbsoluteUrl(url);
  }

  cleanName(name) {
    return name?.trim() || "missing";
  }

  cleanPrice(priceStr) {
    if (!priceStr?.trim()) {
      return 0.0;
    }

    const cleanedPrice = priceStr
      .replace(/Sale priceFrom £|Sale price£/g, "")
      .trim();

    return cleanedPrice ? parseFloat(cleanedPrice) : 0.0;
  }

  convertPriceToUsd(priceGb, conversionRate) {
    return priceGb * conversionRate;
  }

  createAbsoluteUrl(url) {
    return (url?.trim()) ? `https://www.chocolate.co.uk${url.trim()}` : "missing";
  }
}

class ProductDataPipeline {
  constructor(csvFilename = "", storageQueueLimit = 5) {
    this.seenProducts = new Set();
    this.storageQueue = [];
    this.csvFilename = csvFilename;
    this.csvFileOpen = false;
    this.storageQueueLimit = storageQueueLimit;
  }

  saveToCsv() {
    this.csvFileOpen = true;
    const fileExists = fs.existsSync(this.csvFilename);
    const file = fs.createWriteStream(this.csvFilename, { flags: "a" });
    if (!fileExists) {
      file.write("name,priceGb,priceUsd,url\n");
    }
    for (const product of this.storageQueue) {
      file.write(
        `${product.name},${product.priceGb},${product.priceUsd},${product.url}\n`
      );
    }
    file.end();
    this.storageQueue = [];
    this.csvFileOpen = false;
  }

  cleanRawProduct(rawProduct) {
    return new Product(rawProduct.name, rawProduct.price, rawProduct.url);
  }

  isDuplicateProduct(product) {
    if (!this.seenProducts.has(product.url)) {
      this.seenProducts.add(product.url);
      return false;
    }
    return true;
  }

  addProduct(rawProduct) {
    const product = this.cleanRawProduct(rawProduct);
    if (!this.isDuplicateProduct(product)) {
      this.storageQueue.push(product);
      if (
        this.storageQueue.length >= this.storageQueueLimit &&
        !this.csvFileOpen
      ) {
        this.saveToCsv();
      }
    }
  }

  async close() {
    while (this.csvFileOpen) {
      // Wait for the file to be written
      await new Promise((resolve) => setTimeout(resolve, 1000));
    }
    if (this.storageQueue.length > 0) {
      this.saveToCsv();
    }
  }
}

const listOfUrls = ["https://www.chocolate.co.uk/collections/all"];

async function makeRequest(page, url, retries = 3, antiBotCheck = false) {
  for (let i = 0; i < retries; i++) {
    try {
      const response = await page.goto(url);
      const status = response.status();
      if ([200, 404].includes(status)) {
        if (antiBotCheck && status === 200) {
          const content = await page.content();
          if (content.includes("<title>Robot or human?</title>")) {
            return null;
          }
        }
        return response;
      }
    } catch (e) {
      console.log(`Failed to fetch ${url}, retrying...`);
    }
  }
  return null;
}

async function scrape(url) {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();

  const response = await makeRequest(page, url);
  if (!response) {
    await browser.close();
    return { nextUrl: null, products: [] };
  }

  const productItems = await page.$$eval("product-item", items =>
    items.map(item => {
      const titleElement = item.querySelector(".product-item-meta__title");
      const priceElement = item.querySelector(".price");
      return {
        name: titleElement ? titleElement.textContent.trim() : null,
        price: priceElement ? priceElement.textContent.trim() : null,
        url: titleElement ? titleElement.getAttribute("href") : null
      };
    })
  );

  const nextUrl = await nextPage(page);
  await browser.close();

  return {
    nextUrl: nextUrl,
    products: productItems.filter(item => item.name && item.price && item.url)
  };
}

async function nextPage(page) {
  let nextUrl = null;
  try {
    nextUrl = await page.$eval("a.pagination__nav-item:nth-child(4)", item => item.href);
  } catch (error) {
    console.log('Last Page Reached');
  }
  return nextUrl;
}

if (isMainThread) {
  const pipeline = new ProductDataPipeline("chocolate.csv", 5);
  const workers = [];

  for (const url of listOfUrls) {
    workers.push(
      new Promise((resolve, reject) => {
        const worker = new Worker(__filename, {
          workerData: { startUrl: url }
        });
        console.log("Worker created", worker.threadId, url);

        worker.on("message", (product) => {
          pipeline.addProduct(product);
        });

        worker.on("error", reject);
        worker.on("exit", (code) => {
          if (code !== 0) {
            reject(new Error(`Worker stopped with exit code ${code}`));
          } else {
            console.log("Worker exited");
            resolve();
          }
        });
      })
    );
  }

  Promise.all(workers)
    .then(() => pipeline.close())
    .then(() => console.log("Pipeline closed"));
} else {
  const { startUrl } = workerData;

  const handleWork = async (workUrl) => {
    const { nextUrl, products } = await scrape(workUrl);
    for (const product of products) {
      parentPort.postMessage(product);
    }

    if (nextUrl) {
      console.log("Worker working on", nextUrl);
      await handleWork(nextUrl);
    }
  };

  handleWork(startUrl).then(() => console.log("Worker finished"));
}

// Worker created 1 https://www.chocolate.co.uk/collections/all
// Worker working on https://www.chocolate.co.uk/collections/all?page=2
// Worker working on https://www.chocolate.co.uk/collections/all?page=3
// Last Page Reached
// Worker finished
// Worker exited
// Pipeline closed

Next Steps

In this guide, we aimed to give you a solid understanding of why retrying requests and using concurrency are crucial for effective web scraping. We've covered how retry logic operates, how to detect anti-bot measures, and how to manage concurrency.

In the next tutorial, we’ll focus on preparing our scraper for production by managing User Agents and IPs to prevent getting blocked. Stay tuned for Part 5!