NodeJS Puppeteer Beginners Series Part 4 - Managing Retries & Concurrency
So far in this Node.js Puppeteer 6-Part Beginner Series, we learned how to build a basic web scraper in Part 1, clean and structure the messy scraped data in Part 2, and save that data to files and databases in Part 3.
In this tutorial, we will dive into strategies for optimizing your web scraping process by implementing retry logic and concurrency. These techniques will make your scraper more robust and efficient, especially when dealing with large datasets.
- Understanding Scraper Performance Bottlenecks
- Retry Requests and Concurrency Importance
- Retry Logic Mechanism
- Concurrency Management
Node.js Puppeteer 6-Part Beginner Series
- Part 1: Basic Node.js Puppeteer Scraper - We'll learn the fundamentals of web scraping with Node.js and build our first scraper using Node.js Puppeteer. (Part 1)
- Part 2: Cleaning Unruly Data & Handling Edge Cases - Web data can be messy and unpredictable. In this part, we'll create a robust scraper using data structures and cleaning techniques to handle these challenges. (Part 2)
- Part 3: Storing Scraped Data in AWS S3, MySQL & Postgres DBs - Explore various options for storing your scraped data, including databases like MySQL or Postgres, cloud storage like AWS S3, and file formats like CSV and JSON. We'll discuss their pros, cons, and suitable use cases. (Part 3)
- Part 4: Managing Retries & Concurrency - Enhance your scraper's reliability and scalability by handling failed requests and utilizing concurrency. (This article)
- Part 5: Faking User-Agents & Browser Headers - Learn how to create a production-ready scraper by simulating real users through user-agent and browser header manipulation. (Part 5)
- Part 6: Using Proxies To Avoid Getting Blocked - Discover how to use proxies to bypass anti-bot systems by disguising your real IP address and location. (Part 6)
Understanding Scraper Performance Bottlenecks
In web scraping projects, network delays are often the main bottleneck. Scraping requires sending multiple requests to a website and handling their responses. While each request and response only takes a fraction of a second, these delays add up, significantly slowing down the process, especially when scraping a large number of pages (e.g., 5,000 pages).
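For a rough sense of scale: if each request spends about half a second on network round trips, 5,000 sequential requests spend roughly 5,000 × 0.5 s ≈ 42 minutes doing nothing but waiting on the network, before any parsing or storage work.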
While human users visiting a few pages wouldn't notice such minor delays, scraping tools sending hundreds or thousands of requests can face delays that stretch into hours. Additionally, network delay is just one factor affecting scraping speed.
The scraper must not only send and receive requests but also parse the extracted data, identify relevant information, and store or process it. These additional steps are CPU-intensive and can significantly slow down scraping.
Retry Requests and Concurrency Importance
When web scraping, retrying requests and using concurrency are crucial for several reasons. Retrying requests helps handle temporary network glitches, server errors, rate limits, or connection timeouts, increasing the chances of a successful response.
Common status codes that indicate a retry is worth attempting:
- 429: Too many requests
- 500: Internal server error
- 502: Bad gateway
- 503: Service unavailable
- 504: Gateway timeout
Websites often implement rate limits to control traffic. Retrying with delays can help you stay within these limits and avoid getting blocked. Additionally, pages with dynamically loaded content may require multiple attempts and retries at intervals to retrieve all elements.
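As a concrete illustration of retrying with delays, here is a minimal, self-contained sketch that retries on the status codes listed above with an exponential backoff. `fetchWithRetry` is an illustrative helper (it uses the global `fetch` available in Node.js 18+), not part of the scraper we build below.

```javascript
// Minimal sketch: retry a request with exponential backoff (Node.js 18+ global fetch).
// The status codes and delays are illustrative; tune them for your target site.
const RETRYABLE_STATUS_CODES = new Set([429, 500, 502, 503, 504]);

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchWithRetry(url, maxRetries = 3) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      const response = await fetch(url);
      if (!RETRYABLE_STATUS_CODES.has(response.status)) {
        return response; // success, or an error that retrying won't fix (e.g. 404)
      }
      console.warn(`Got status ${response.status}, retrying...`);
    } catch (error) {
      console.error(`Attempt ${attempt + 1} failed: ${error}`);
    }
    if (attempt < maxRetries) {
      // Back off 1s, 2s, 4s, ... before the next attempt.
      await sleep(1000 * 2 ** attempt);
    }
  }
  throw new Error(`Request to ${url} failed after ${maxRetries + 1} attempts`);
}
```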
Now let’s talk about concurrency. When making sequential requests, you make one request at a time, wait for the response, and then make the next one. This approach is inefficient for scraping large datasets because it involves a lot of waiting.
In the diagram below, the blue boxes show the time when your program is actively working, while the red boxes show when it's paused waiting for an I/O operation, such as downloading data from the website, reading data from files, or writing data to files, to complete.
Source: Real Python
Concurrency allows your program to handle multiple open requests to websites simultaneously, significantly improving performance and efficiency, particularly for time-consuming tasks.
By concurrently sending these requests, your program overlaps the waiting times for responses, reducing the overall waiting time and getting the final results faster.
Source: Real Python
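To see the same overlap in code rather than a diagram, here is a tiny self-contained comparison. `pretendRequest` is a made-up stand-in for a real network call that takes one second.

```javascript
// Tiny illustration of sequential vs. concurrent waiting.
const pretendRequest = () => new Promise((resolve) => setTimeout(resolve, 1000));

async function demo() {
  console.time('sequential');
  for (let i = 0; i < 3; i++) {
    await pretendRequest(); // each wait finishes before the next starts: ~3 seconds total
  }
  console.timeEnd('sequential');

  console.time('concurrent');
  // All three waits overlap: ~1 second total.
  await Promise.all([pretendRequest(), pretendRequest(), pretendRequest()]);
  console.timeEnd('concurrent');
}

demo();
```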
Retry Logic Mechanism
Let's examine how we'll implement retry logic within our scraper. This involves adding a `RetryLogic` class and having our scraping function route every request through it, so failed requests are retried automatically.
Implementing Retry Logic
We will first create a `makeRequest` method that attempts the HTTP request multiple times.
The `RetryLogic` Class
The `RetryLogic` class will handle the retry functionality. Here's an overview of how it works:
- Initialization: The constructor initializes instance variables like `retryLimit` and `antiBotCheck`.
- Making Requests: The `makeRequest` method attempts to make the HTTP request up to `retryLimit` times.
- Anti-Bot Check: If enabled, the class checks the response for anti-bot mechanisms.
Initialization
```javascript
class RetryLogic {
  constructor(retryLimit = 5, antiBotCheck = false) {
    this.retryLimit = retryLimit;
    this.antiBotCheck = antiBotCheck;
  }
}
```
Making Requests
```javascript
async makeRequest(page, url) {
  for (let i = 0; i < this.retryLimit; i++) {
    try {
      const response = await page.goto(url, { waitUntil: 'networkidle2' });
      const status = response.status();
      if (status === 200 || status === 404) {
        if (this.antiBotCheck && status === 200 && !(await this.passedAntiBotCheck(page))) {
          return { success: false, response };
        }
        return { success: true, response };
      }
    } catch (error) {
      console.error(`Attempt ${i + 1} failed: ${error}`);
    }
  }
  return { success: false, response: null };
}
```
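Note that a 404 is treated as a "successful" outcome here: the server responded and the page simply doesn't exist, so retrying the same URL won't change anything. Any other status code, or a thrown error, causes another pass through the loop, up to `retryLimit` attempts.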
Anti-Bot Check
```javascript
async passedAntiBotCheck(page) {
  const content = await page.content();
  return !content.includes('<title>Robot or human?</title>');
}
```
Complete Code for the Retry Logic
```javascript
class RetryLogic {
  constructor(retryLimit = 5, antiBotCheck = false) {
    this.retryLimit = retryLimit;
    this.antiBotCheck = antiBotCheck;
  }

  async makeRequest(page, url) {
    for (let i = 0; i < this.retryLimit; i++) {
      try {
        const response = await page.goto(url, { waitUntil: 'networkidle2' });
        const status = response.status();
        if (status === 200 || status === 404) {
          if (this.antiBotCheck && status === 200 && !(await this.passedAntiBotCheck(page))) {
            return { success: false, response };
          }
          return { success: true, response };
        }
      } catch (error) {
        console.error(`Attempt ${i + 1} failed: ${error}`);
      }
    }
    return { success: false, response: null };
  }

  async passedAntiBotCheck(page) {
    const content = await page.content();
    return !content.includes('<title>Robot or human?</title>');
  }
}
```
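Before moving on, here's a quick sketch of how the class might be used on its own, assuming you already have a Puppeteer `page` open and are inside an async function (the full scraper later in this article wires this up properly):

```javascript
// Quick usage sketch: retry a single page load up to 3 times with the anti-bot check enabled.
const retryLogic = new RetryLogic(3, true);
const { success, response } = await retryLogic.makeRequest(page, 'https://www.chocolate.co.uk/collections/all');

if (success) {
  console.log(`Got ${response.url()} with status ${response.status()}`);
} else {
  console.log('Request failed after all retry attempts');
}
```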
Concurrency Management
Concurrency refers to the ability to make progress on multiple tasks at the same time. It enables efficient utilization of system resources and can significantly speed up program execution. Node.js achieves concurrency primarily through its asynchronous, event-driven model, which lets I/O-bound work such as network requests overlap on a single thread (the `worker_threads` module exists for CPU-bound work, but we don't need it here).

For our scraper, the key tool is `Promise.all`, which lets you start several asynchronous operations at once and wait for all of them to complete.
Here’s a simple code snippet to add concurrency to your scraper:
```javascript
async function startConcurrentScrape(urls, numThreads = 5) {
  while (urls.length > 0) {
    const tasks = [];
    // Take up to numThreads URLs and start scraping them concurrently.
    for (let i = 0; i < numThreads && urls.length > 0; i++) {
      const url = urls.shift();
      tasks.push(scrapePage(url));
    }
    // Wait for the whole batch to finish before starting the next one.
    await Promise.all(tasks);
  }
}
```
- The `startConcurrentScrape` function processes the URLs in concurrent batches.
- It creates an array of promises (`tasks`) for scraping pages and uses `Promise.all` to wait for all of them to complete before moving to the next batch of URLs. One trade-off is that each batch waits for its slowest page; a simple worker-pool alternative is sketched below.
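Because `Promise.all` waits for the entire batch, a single slow page holds up the start of the next batch. If that becomes a problem, a small worker pool keeps a fixed number of scrapes in flight at all times. The sketch below reuses the `scrapePage` function from this article; `startPooledScrape` and its `concurrency` parameter are illustrative names, not part of the original code.

```javascript
// A minimal worker-pool sketch (an alternative to fixed batches).
// The pool keeps `concurrency` scrapes in flight and pulls the next URL
// from the queue as soon as any one of them finishes.
async function startPooledScrape(urls, concurrency = 5) {
  const queue = [...urls];

  async function worker() {
    while (queue.length > 0) {
      const url = queue.shift();
      try {
        await scrapePage(url);
      } catch (error) {
        console.error(`Failed to scrape ${url}: ${error}`);
      }
    }
  }

  // Start `concurrency` workers and wait for all of them to drain the queue.
  await Promise.all(Array.from({ length: concurrency }, () => worker()));
}
```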
Complete Code
Here's the complete code for implementing retry logic and concurrency in your Puppeteer scraper:
```javascript
const puppeteer = require('puppeteer');
const fs = require('fs');

class RetryLogic {
  constructor(retryLimit = 5, antiBotCheck = false) {
    this.retryLimit = retryLimit;
    this.antiBotCheck = antiBotCheck;
  }

  async makeRequest(page, url) {
    for (let i = 0; i < this.retryLimit; i++) {
      try {
        const response = await page.goto(url, { waitUntil: 'networkidle2' });
        const status = response.status();
        if (status === 200 || status === 404) {
          if (this.antiBotCheck && status === 200 && !(await this.passedAntiBotCheck(page))) {
            return { success: false, response };
          }
          return { success: true, response };
        }
      } catch (error) {
        console.error(`Attempt ${i + 1} failed: ${error}`);
      }
    }
    return { success: false, response: null };
  }

  async passedAntiBotCheck(page) {
    const content = await page.content();
    return !content.includes('<title>Robot or human?</title>');
  }
}

// Launches a browser, requests the URL with retries, then extracts and saves the data.
async function scrapePage(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const retryLogic = new RetryLogic(3, true);
  const { success, response } = await retryLogic.makeRequest(page, url);
  if (success) {
    const data = await extractData(page);
    await saveData(data);
  }
  await browser.close();
}

// Pulls the name, price, and URL of every product on the page.
async function extractData(page) {
  return await page.evaluate(() => {
    const items = document.querySelectorAll('.product-item');
    return Array.from(items).map(item => ({
      name: item.querySelector('.product-item-meta__title').innerText.trim(),
      price: item.querySelector('.price').innerText.trim().replace(/[^0-9\.]+/g, ''),
      url: item.querySelector('.product-item-meta a').getAttribute('href')
    }));
  });
}

// Appends each batch of results to data.json as a JSON array.
async function saveData(data) {
  fs.appendFileSync('data.json', JSON.stringify(data, null, 2));
}

async function startConcurrentScrape(urls, numThreads = 5) {
  while (urls.length > 0) {
    const tasks = [];
    for (let i = 0; i < numThreads && urls.length > 0; i++) {
      const url = urls.shift();
      tasks.push(scrapePage(url));
    }
    await Promise.all(tasks);
  }
}

const urls = [
  'https://www.chocolate.co.uk/collections/all'
];

startConcurrentScrape(urls, 10);
```
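If you are following along, this script assumes Puppeteer is installed in your project (for example via `npm install puppeteer`). Save it as something like `scraper.js` and run it with `node scraper.js`; the scraped products will be appended to `data.json`.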
Next Steps
We hope you now have a good understanding of why you need to retry requests and use concurrency when web scraping, how the retry logic works, how to check for anti-bot responses, and how to manage concurrency.
In the next tutorial, we'll cover how to make our scraper production-ready by managing our user agents and IPs to avoid getting blocked.