NodeJS Puppeteer Beginners Series Part 4 - Managing Retries & Concurrency

So far in this Node.js Puppeteer 6-Part Beginner Series, we learned how to build a basic web scraper in Part 1, clean up unruly data and handle edge cases in Part 2, and then save the scraped data to files and databases in Part 3.

In this tutorial, we will dive into strategies for optimizing your web scraping process by implementing retry logic and concurrency. These techniques will make your scraper more robust and efficient, especially when dealing with large datasets.

Node.js Puppeteer 6-Part Beginner Series

  • Part 1: Basic Node.js Puppeteer Scraper - We'll learn the fundamentals of web scraping with Node.js and build our first scraper using Node.js Puppeteer. (Part 1)

  • Part 2: Cleaning Unruly Data & Handling Edge Cases - Web data can be messy and unpredictable. In this part, we'll create a robust scraper using data structures and cleaning techniques to handle these challenges. (Part 2)

  • Part 3: Storing Scraped Data in AWS S3, MySQL & Postgres DBs - Explore various options for storing your scraped data, including databases like MySQL or Postgres, cloud storage like AWS S3, and file formats like CSV and JSON. We'll discuss their pros, cons, and suitable use cases. (Part 3)

  • Part 4: Managing Retries & Concurrency - Enhance your scraper's reliability and scalability by handling failed requests and utilizing concurrency. (This article)

  • Part 5: Faking User-Agents & Browser Headers - Learn how to create a production-ready scraper by simulating real users through user-agent and browser header manipulation. (Part 5)

  • Part 6: Using Proxies To Avoid Getting Blocked - Discover how to use proxies to bypass anti-bot systems by disguising your real IP address and location. (Part 6)

Need help scraping the web?

Then check out ScrapeOps, the complete toolkit for web scraping.


Understanding Scraper Performance Bottlenecks

In web scraping projects, network delays are often the main bottleneck. Scraping requires sending multiple requests to a website and handling their responses. While each request and response only takes a fraction of a second, these delays add up, significantly slowing down the process, especially when scraping a large number of pages (e.g., 5,000 pages).

While human users visiting a few pages wouldn't notice such minor delays, scraping tools sending hundreds or thousands of requests can face delays that stretch into hours. Additionally, network delay is just one factor affecting scraping speed.

The scraper must not only send and receive requests but also parse the extracted data, identify relevant information, and store or process it. These additional steps are CPU-intensive and can significantly slow down scraping.


The Importance of Retrying Requests and Using Concurrency

When web scraping, retrying requests and using concurrency are crucial for several reasons. Retrying requests helps handle temporary network glitches, server errors, rate limits, or connection timeouts, increasing the chances of a successful response.

Common status codes that indicate a retry is worth attempting:

  • 429: Too many requests
  • 500: Internal server error
  • 502: Bad gateway
  • 503: Service unavailable
  • 504: Gateway timeout

Websites often implement rate limits to control traffic. Retrying with delays can help you stay within these limits and avoid getting blocked. Additionally, pages with dynamically loaded content may require multiple attempts and retries at intervals to retrieve all elements.
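
For example, a minimal sketch of retrying with an increasing delay might look like the following. The names requestWithBackoff, requestFn, and sleep are illustrative helpers, not part of the scraper we build later in this article; requestFn stands for any function that performs the request and resolves to a response with a status() method, as Puppeteer responses do.

const RETRYABLE_STATUS_CODES = [429, 500, 502, 503, 504];

// Promise-based sleep helper so we can pause between attempts.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function requestWithBackoff(requestFn, maxRetries = 3, baseDelayMs = 1000) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const response = await requestFn();
      if (!RETRYABLE_STATUS_CODES.includes(response.status())) {
        return response; // success or a non-retryable status, so stop retrying
      }
    } catch (error) {
      console.error(`Attempt ${attempt + 1} failed: ${error}`);
    }
    // Wait longer before each new attempt: 1s, 2s, 4s, ...
    if (attempt < maxRetries - 1) {
      await sleep(baseDelayMs * 2 ** attempt);
    }
  }
  throw new Error(`Request failed after ${maxRetries} attempts`);
}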

Now let’s talk about concurrency. When making sequential requests, you make one request at a time, wait for the response, and then make the next one. This approach is inefficient for scraping large datasets because it involves a lot of waiting.

In the diagram below, the blue boxes show the time when your program is actively working, while the red boxes show when it's paused waiting for an I/O operation, such as downloading data from the website, reading data from files, or writing data to files, to complete.

[Diagram: sequential execution timeline, with blue blocks of active work separated by red I/O waits. Source: Real Python]

Concurrency allows your program to handle multiple open requests to websites simultaneously, significantly improving performance and efficiency, particularly for time-consuming tasks.

By concurrently sending these requests, your program overlaps the waiting times for responses, reducing the overall waiting time and getting the final results faster.

[Diagram: concurrent execution timeline, with overlapping requests so the I/O waits are shared. Source: Real Python]
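
To make the difference concrete, here is a toy example that uses simulated delays instead of real page requests (fakeDownload is just an illustrative stand-in). Three one-second "downloads" run back to back take roughly three seconds, while the same three started together with Promise.all finish in roughly one second.

const fakeDownload = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function sequential() {
  console.time('sequential');
  await fakeDownload(1000);
  await fakeDownload(1000);
  await fakeDownload(1000);
  console.timeEnd('sequential'); // ~3000ms
}

async function concurrent() {
  console.time('concurrent');
  await Promise.all([fakeDownload(1000), fakeDownload(1000), fakeDownload(1000)]);
  console.timeEnd('concurrent'); // ~1000ms
}

sequential().then(concurrent);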


Retry Logic Mechanism

Let's examine how we'll implement retry logic within our scraper. Instead of calling page.goto directly, we'll route every request through a dedicated class that retries failed requests for us.

Implementing Retry Logic

We will first create a RetryLogic class whose makeRequest method attempts the HTTP request multiple times before giving up.

The RetryLogic Class

The RetryLogic class will handle the retry functionality. Here’s an overview of how it works:

  1. Initialization: The constructor initializes instance variables like retryLimit and antiBotCheck.
  2. Making Requests: The makeRequest method attempts to make the HTTP request up to retryLimit times.
  3. Anti-Bot Check: If enabled, checks for anti-bot mechanisms in the response.

Initialization

class RetryLogic {
  constructor(retryLimit = 5, antiBotCheck = false) {
    this.retryLimit = retryLimit;
    this.antiBotCheck = antiBotCheck;
  }
}

Making Requests

async makeRequest(page, url) {
  for (let i = 0; i < this.retryLimit; i++) {
    try {
      const response = await page.goto(url, { waitUntil: 'networkidle2' });
      const status = response.status();
      if (status === 200 || status === 404) {
        if (this.antiBotCheck && status === 200 && !(await this.passedAntiBotCheck(page))) {
          return { success: false, response };
        }
        return { success: true, response };
      }
    } catch (error) {
      console.error(`Attempt ${i + 1} failed: ${error}`);
    }
  }
  return { success: false, response: null };
}

Anti-Bot Check

async passedAntiBotCheck(page) {
  const content = await page.content();
  return !content.includes('<title>Robot or human?</title>');
}

Complete Code for the Retry Logic

class RetryLogic {
  constructor(retryLimit = 5, antiBotCheck = false) {
    this.retryLimit = retryLimit;
    this.antiBotCheck = antiBotCheck;
  }

  async makeRequest(page, url) {
    for (let i = 0; i < this.retryLimit; i++) {
      try {
        const response = await page.goto(url, { waitUntil: 'networkidle2' });
        const status = response.status();
        if (status === 200 || status === 404) {
          if (this.antiBotCheck && status === 200 && !(await this.passedAntiBotCheck(page))) {
            return { success: false, response };
          }
          return { success: true, response };
        }
      } catch (error) {
        console.error(`Attempt ${i + 1} failed: ${error}`);
      }
    }
    return { success: false, response: null };
  }

  async passedAntiBotCheck(page) {
    const content = await page.content();
    return !content.includes('<title>Robot or human?</title>');
  }
}
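
Here's a quick sketch of how you might use this class with a Puppeteer page. The URL is the chocolate.co.uk collection page used throughout this series; the rest is an illustrative example rather than part of the final scraper.

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // 3 retries, with the anti-bot check enabled
  const retryLogic = new RetryLogic(3, true);
  const { success, response } = await retryLogic.makeRequest(
    page,
    'https://www.chocolate.co.uk/collections/all'
  );

  console.log(`Success: ${success}, status: ${response ? response.status() : 'none'}`);
  await browser.close();
})();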

Concurrency Management

Concurrency refers to the ability to make progress on multiple tasks at the same time. It enables efficient utilization of system resources and can significantly speed up program execution. Node.js runs your JavaScript on a single thread, but its non-blocking, event-driven I/O model lets it keep many requests in flight at once while each one waits on the network.

In practice, you manage this with promises: using the Promise.all method, you can start several asynchronous operations together and wait for them all to complete.

Here’s a simple code snippet to add concurrency to your scraper:

async function startConcurrentScrape(urls, numThreads = 5) {
  while (urls.length > 0) {
    const tasks = [];
    for (let i = 0; i < numThreads && urls.length > 0; i++) {
      const url = urls.shift();
      tasks.push(scrapePage(url));
    }
    await Promise.all(tasks);
  }
}
  • The startConcurrentScrape function processes the URLs in batches of numThreads at a time.
  • For each batch it creates an array of promises (tasks) for scraping pages and uses Promise.all to wait for all of them to complete before moving to the next batch of URLs (a worker-pool variation is sketched below).
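
One trade-off of this batch approach is that each batch waits for its slowest page before the next batch starts. A possible variation, not part of this tutorial's code, is a small worker pool in which a fixed number of workers keep pulling URLs from a shared queue (startWorkerPoolScrape is an illustrative name):

async function startWorkerPoolScrape(urls, numWorkers = 5) {
  const queue = [...urls];

  // Each worker repeatedly takes the next URL and scrapes it until the queue is empty.
  async function worker() {
    while (queue.length > 0) {
      const url = queue.shift();
      await scrapePage(url); // scrapePage as defined in the complete code below
    }
  }

  // Start the workers and wait for all of them to drain the queue.
  await Promise.all(Array.from({ length: numWorkers }, () => worker()));
}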

Complete Code

Here's the complete code for implementing retry logic and concurrency in your Puppeteer scraper:

const puppeteer = require('puppeteer');
const fs = require('fs');

class RetryLogic {
  constructor(retryLimit = 5, antiBotCheck = false) {
    this.retryLimit = retryLimit;
    this.antiBotCheck = antiBotCheck;
  }

  async makeRequest(page, url) {
    for (let i = 0; i < this.retryLimit; i++) {
      try {
        const response = await page.goto(url, { waitUntil: 'networkidle2' });
        const status = response.status();
        if (status === 200 || status === 404) {
          if (this.antiBotCheck && status === 200 && !(await this.passedAntiBotCheck(page))) {
            return { success: false, response };
          }
          return { success: true, response };
        }
      } catch (error) {
        console.error(`Attempt ${i + 1} failed: ${error}`);
      }
    }
    return { success: false, response: null };
  }

  async passedAntiBotCheck(page) {
    const content = await page.content();
    return !content.includes('<title>Robot or human?</title>');
  }
}

async function scrapePage(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const retryLogic = new RetryLogic(3, true);
  const { success, response } = await retryLogic.makeRequest(page, url);

  if (success) {
    const data = await extractData(page);
    await saveData(data);
  }

  await browser.close();
}

async function extractData(page) {
  return await page.evaluate(() => {
    const items = document.querySelectorAll('.product-item');
    return Array.from(items).map(item => ({
      name: item.querySelector('.product-item-meta__title').innerText.trim(),
      price: item.querySelector('.price').innerText.trim().replace(/[^0-9\.]+/g, ''),
      url: item.querySelector('.product-item-meta a').getAttribute('href')
    }));
  });
}

async function saveData(data) {
  fs.appendFileSync('data.json', JSON.stringify(data, null, 2));
}

async function startConcurrentScrape(urls, numThreads = 5) {
  while (urls.length > 0) {
    const tasks = [];
    for (let i = 0; i < numThreads && urls.length > 0; i++) {
      const url = urls.shift();
      tasks.push(scrapePage(url));
    }
    await Promise.all(tasks);
  }
}

const urls = [
  'https://www.chocolate.co.uk/collections/all'
];

startConcurrentScrape(urls, 10);

Next Steps

We hope you now have a good understanding of why you need to retry requests and use concurrency when web scraping, how the retry logic works, how to check for anti-bot responses, and how to manage concurrency with Promise.all.

In the next tutorial, we'll cover how to make our scraper production-ready by managing our user agents and IPs to avoid getting blocked.