
Part 1 - Building Your First Scraper

NodeJS Puppeteer Beginners Series Part 1: How To Build Your First Puppeteer Scraper

This guide is your comprehensive, step-by-step journey to building a production-ready web scraper with Node.js and Puppeteer.

While many tutorials cover only the basics, this six-part series goes further, leading you through the creation of a well-structured scraper using object-oriented programming (OOP) principles.

This 6-part Node.js Puppeteer Beginner Series will walk you through building a web scraping project from scratch, covering everything from creating the scraper to deployment and scheduling.

You'll learn not just how to scrape data but also how to store and clean it, handle errors and retries, and optimize performance with Node.js concurrency modules. By the end of this guide, you'll be equipped to create a robust, efficient, and scalable web scraper.

Node.js Puppeteer 6-Part Beginner Series

  • Part 1: Basic Node.js Puppeteer Scraper - We'll learn the fundamentals of web scraping with Node.js and build your first scraper using Puppeteer. (This article)

  • Part 2: Cleaning Unruly Data & Handling Edge Cases - Web data can be messy and unpredictable. In this part, we'll create a robust scraper using data structures and cleaning techniques to handle these challenges. (Part 2)

  • Part 3: Storing Scraped Data in AWS S3, MySQL & Postgres DBs - Explore various options for storing your scraped data, including databases like MySQL or Postgres, cloud storage like AWS S3, and file formats like CSV and JSON. We'll discuss their pros, cons, and suitable use cases. (Part 3)

  • Part 4: Managing Retries & Concurrency - Enhance your scraper's reliability and scalability by handling failed requests and utilizing concurrency. (Part 4)

  • Part 5: Faking User-Agents & Browser Headers - Learn how to create a production-ready scraper by simulating real users through user-agent and browser header manipulation. (Part 5)

  • Part 6: Using Proxies To Avoid Getting Blocked - Discover how to use proxies to bypass anti-bot systems by disguising your real IP address and location. (Part 6)

For this beginner series, we'll focus on a simple scraping structure. We'll build a single scraper that takes a starting URL, fetches the website, parses and cleans data from the HTML response, and stores the extracted information - all within the same process.

This approach is ideal for personal projects and small-scale scraping tasks. However, larger-scale scraping, especially for business-critical data, may require more complex architectures.

Need help scraping the web?

Then check out ScrapeOps, the complete toolkit for web scraping.


Part 1: Basic Node.js Scraper

In Part 1, we'll start by building a basic web scraper that extracts data from webpages using CSS selectors and saves it in CSV format.

In the following sections, we'll expand on this foundation, adding more features and functionality.

For this series, we will be scraping products from an e-commerce website, chocolate.co.uk, using Puppeteer for its ability to handle JavaScript-heavy pages. Let's get started!

chocolate.co.uk Products Page


Our Puppeteer Web Scraping Stack

When it comes to web scraping stacks, two key components are necessary:

  1. HTTP Client: Sends a request to the website to retrieve the HTML/JSON response.
  2. Browser Automation Tool: Used to navigate and interact with web pages.

For our purposes, we will use Puppeteer, a Node.js library that provides a high-level API to control headless Chrome or Chromium. Puppeteer is particularly useful for scraping dynamic content that requires JavaScript to render.

Using Puppeteer, you can simulate a real user navigating through a website. This includes clicking on buttons, filling out forms, and waiting for dynamic content to load.

This makes it a powerful tool for web scraping, especially for modern websites that rely heavily on JavaScript.
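For example, a minimal sketch of these interactions might look like the following (the .load-more and input[name="q"] selectors are purely hypothetical placeholders, not selectors from the site we'll scrape):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.goto('https://example.com', { waitUntil: 'networkidle2' });

  // Click a hypothetical "load more" button, then wait for the new content to render
  // await page.click('.load-more');
  // await page.waitForSelector('.new-content');

  // Type into a hypothetical search box
  // await page.type('input[name="q"]', 'dark chocolate');

  await browser.close();
})();

The interaction lines are commented out because example.com has no such elements; the point is simply that page.click(), page.type(), and page.waitForSelector() cover most of the actions you would otherwise perform by hand in the browser.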


How to Set Up Our Node.js Environment

Let's start by setting up our Node.js environment.

Step 1 - Set Up Your Node.js Environment

Ensure you have Node.js installed on your machine. You can download it from nodejs.org.

Once installed, set up a new project and initialize a package.json file:

$ mkdir puppeteer_scraper
$ cd puppeteer_scraper
$ npm init -y

This creates a new directory for our project and initializes it with a default package.json file.

Step 2 - Install Puppeteer

Install Puppeteer using npm:

$ npm install puppeteer

Puppeteer will download a recent version of Chromium by default, which ensures that your scraper works out of the box with a known good version of the browser.
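If you want to confirm the install worked before writing any scraping code, a quick sanity check is to launch the bundled browser and print its version (the file name check_puppeteer.js is just an example):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  console.log(await browser.version()); // prints something like "HeadlessChrome/..."
  await browser.close();
})();

$ node check_puppeteer.js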


Creating Our Scraper Project

Now that we have our environment set up, we can start building our Puppeteer scraper. First, create a new file called chocolate_scraper.js in our project folder:

puppeteer_scraper
└── chocolate_scraper.js

This chocolate_scraper.js file will contain all the code we use to scrape the e-commerce website.


Laying Out Our Puppeteer Scraper

First, let's lay out the basic structure of our scraper.

const puppeteer = require('puppeteer');

const urls = [
  'https://www.chocolate.co.uk/collections/all'
];

let scrapedData = [];

const startScrape = async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  for (const url of urls) {
    await page.goto(url);

    // Parse Data

    // Add to Data Output
  }

  await browser.close();
  console.log(scrapedData);
};

startScrape();
  • We imported Puppeteer, which provides an API to control a browser programmatically.
  • Then, we defined a list of URLs (urls) to scrape.
  • Next, we initialized an empty array (scrapedData) to store the data that will be scraped from the website.
  • Finally, we set up a basic function to start our scraping process.

If we run this script now, startScrape() should log an empty array as output.
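To try it, save the file and run it with Node from the project folder:

$ node chocolate_scraper.js
[]

The empty array is expected at this stage - we haven't parsed anything out of the pages yet.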


Retrieving The HTML From The Website

The first step every web scraper must do is retrieve the HTML/JSON response from the target website so that it can extract the data from the response.

Let's update our scraper to navigate to the target URLs and retrieve the HTML content:

const puppeteer = require('puppeteer');

const urls = [
  'https://www.chocolate.co.uk/collections/all'
];

let scrapedData = [];

const startScrape = async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  for (const url of urls) {
    await page.goto(url, { waitUntil: 'networkidle2' });

    const html = await page.content();
    console.log(html); // print the raw HTML for debugging
  }

  await browser.close();
};

startScrape();
  • Here, we navigate to each URL in our list using page.goto(url, { waitUntil: 'networkidle2' }).
    • The networkidle2 option ensures that Puppeteer waits until there are no more than two network connections for at least 500 ms.
    • This is particularly useful for pages that load additional content dynamically.
  • We then retrieve the HTML content of the page with page.content() and print it out for debugging purposes.
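If networkidle2 ever proves too slow or too permissive for a particular page, a more targeted alternative (a sketch based on our own assumption about the page, not a requirement of the site) is to wait for the specific element you intend to scrape:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.goto('https://www.chocolate.co.uk/collections/all');
  // Wait for the product cards themselves rather than general network quiet.
  // '.product-item' is the selector we identify in the next section.
  await page.waitForSelector('.product-item', { timeout: 30000 });

  const html = await page.content();
  console.log(html.length); // quick check that we received a full page

  await browser.close();
})();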

Extracting Data From HTML

Now that our scraper can retrieve HTML content, we need to extract the data we want.

This will be done using Puppeteer's page.evaluate() function, which allows us to execute JavaScript in the context of the page.

Find Product CSS Selectors

To identify the correct CSS selectors for parsing product details, start by opening the website in your browser.

Then, right-click anywhere on the page and select "Inspect" to open the developer tools console.

Product CSS Selectors

Using the element inspector, hover over an item and look at the IDs and classes on the individual products.

In this case, we can see that each product card has its own component with the class product-item.

We can use this class to reference our products (see the image above).
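Before wiring these selectors into the scraper, it's worth sanity-checking them directly in the browser's developer tools console on the products page (these lines run in DevTools, not in Node):

// Run in the DevTools console on the products page
document.querySelectorAll('.product-item').length;             // should be greater than 0
document.querySelector('.product-item-meta__title').innerText; // first product's name
document.querySelector('.price').innerText;                    // first product's price

If the first line returns 0, the class names have likely changed and the selectors below will need updating.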

const puppeteer = require('puppeteer');

const urls = [
  'https://www.chocolate.co.uk/collections/all'
];

let scrapedData = [];

const startScrape = async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  for (const url of urls) {
    await page.goto(url, { waitUntil: 'networkidle2' });

    const products = await page.evaluate(() => {
      const items = document.querySelectorAll('.product-item');
      return Array.from(items).map(item => ({
        name: item.querySelector('.product-item-meta__title').innerText,
        price: item.querySelector('.price').innerText.replace('\nSale price', '').trim(),
        url: item.querySelector('.product-item-meta a').href
      }));
    });

    scrapedData.push(...products);
  }

  await browser.close();
  console.log(scrapedData);
};

startScrape();
  • In this code, we use page.evaluate() to execute a function in the context of the page.
  • This function selects all elements with the class product-item and maps them to an array of objects, each containing the product's name, price, and URL.
  • We then append this array to our scrapedData array.
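One thing to be aware of: innerText will throw if a selector matches nothing. If you want to guard against the occasional product card that is missing a field (we'll handle messy data properly in Part 2), a defensive drop-in replacement for the page.evaluate() call above could look like this (a sketch, not a required change):

const products = await page.evaluate(() => {
  const items = document.querySelectorAll('.product-item');
  return Array.from(items).map(item => {
    // Optional chaining returns undefined instead of throwing when an element is missing,
    // so absent fields become empty strings rather than crashing the scrape.
    const name = item.querySelector('.product-item-meta__title')?.innerText ?? '';
    const priceEl = item.querySelector('.price');
    const price = priceEl ? priceEl.innerText.replace('\nSale price', '').trim() : '';
    const url = item.querySelector('.product-item-meta a')?.href ?? '';
    return { name, price, url };
  });
});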

Saving Data to CSV

In Part 3 of this beginner series, we go through in much more detail how to save data to various file formats and databases.

However, as a simple example for Part 1 of this series, we're going to save the data we've scraped and stored in scrapedData to a CSV file once the scrape has completed.

To do this, we need to install the csv-writer package:

$ npm install csv-writer

Now, update our scraper to include the CSV writing functionality:

const puppeteer = require('puppeteer');
const createCsvWriter = require('csv-writer').createObjectCsvWriter;

const urls = [
  'https://www.chocolate.co.uk/collections/all'
];

let scrapedData = [];

const startScrape = async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  for (const url of urls) {
    await page.goto(url, { waitUntil: 'networkidle2' });

    const products = await page.evaluate(() => {
      const items = document.querySelectorAll('.product-item');
      return Array.from(items).map(item => ({
        name: item.querySelector('.product-item-meta__title').innerText,
        price: item.querySelector('.price').innerText.replace('\nSale price', '').trim(),
        url: item.querySelector('.product-item-meta a').href
      }));
    });

    scrapedData.push(...products);
  }

  await browser.close();
  saveToCSV(scrapedData);
};

const saveToCSV = (data) => {
  const csvWriter = createCsvWriter({
    path: 'scraped_data.csv',
    header: [
      { id: 'name', title: 'Name' },
      { id: 'price', title: 'Price' },
      { id: 'url', title: 'URL' }
    ]
  });

  csvWriter.writeRecords(data).then(() => {
    console.log('CSV file was written successfully');
  });
};

startScrape();
  • In this code, we define a saveToCSV function that takes an array of data and writes it to a CSV file using csv-writer.
  • The csvWriter object is configured with the path to the output file and the headers for the CSV columns.
  • After scraping the data, we call saveToCSV(scrapedData) to save the data to a file.
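Since writeRecords() returns a promise, you can also write saveToCSV with async/await and basic error handling - a purely stylistic variation, not a required change:

const saveToCSV = async (data) => {
  const csvWriter = createCsvWriter({
    path: 'scraped_data.csv',
    header: [
      { id: 'name', title: 'Name' },
      { id: 'price', title: 'Price' },
      { id: 'url', title: 'URL' }
    ]
  });

  try {
    await csvWriter.writeRecords(data);
    console.log('CSV file was written successfully');
  } catch (err) {
    console.error('Failed to write CSV file:', err);
  }
};

If you use this version, call it with await saveToCSV(scrapedData) inside startScrape so any write errors surface before the function finishes.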

Navigating to the Next Page

So far, the code is working well, but we're only getting the products from the first page of the site - the URL we defined in the urls list.

The next logical step is to go to the next page, if there is one, and scrape the item data from that too. Here's how we do that.

To handle pagination, we need to find the CSS selector for the "next page" button and scrape each page iteratively until there are no more pages.

document.querySelector('a[rel="next"]')

Now, we just need to update our scraper to extract this next page URL and keep scraping until there are no more pages.

const puppeteer = require('puppeteer');
const createCsvWriter = require('csv-writer').createObjectCsvWriter;

const baseURL = 'https://www.chocolate.co.uk/collections/all';
let scrapedData = [];

const startScrape = async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  let nextPageExists = true;
  let currentPage = baseURL;

  while (nextPageExists) {
    await page.goto(currentPage, { waitUntil: 'networkidle2' });

    const products = await page.evaluate(() => {
      const items = document.querySelectorAll('.product-item');
      return Array.from(items).map(item => ({
        name: item.querySelector('.product-item-meta__title').innerText,
        price: item.querySelector('.price').innerText.replace('\nSale price', '').trim(),
        url: item.querySelector('.product-item-meta a').href
      }));
    });

    scrapedData.push(...products);

    nextPageExists = await page.evaluate(() => {
      const nextPage = document.querySelector('a[rel="next"]');
      return nextPage ? nextPage.href : null;
    });

    if (nextPageExists) {
      currentPage = nextPageExists;
    }
  }

  await browser.close();
  saveToCSV(scrapedData);
};

const saveToCSV = (data) => {
  const csvWriter = createCsvWriter({
    path: 'scraped_data.csv',
    header: [
      { id: 'name', title: 'Name' },
      { id: 'price', title: 'Price' },
      { id: 'url', title: 'URL' }
    ]
  });

  csvWriter.writeRecords(data).then(() => {
    console.log('CSV file was written successfully');
  });
};

startScrape();
  • In this updated code, we use a while loop to keep scraping until there are no more pages.
  • We check for the presence of a "next page" link using page.evaluate(), and if it exists, we set currentPage to the URL of the next page.
  • This process repeats until nextPageExists is null, indicating there are no more pages to scrape.
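One small safeguard worth adding (our own precaution, not something the site requires) is an upper bound on the number of pages, so a changed selector or an unexpected redirect can never send the loop crawling indefinitely. Sketched against the code above:

const MAX_PAGES = 50; // hypothetical cap - adjust to the size of the site
let pagesScraped = 0;

while (nextPageExists && pagesScraped < MAX_PAGES) {
  // ... same scraping and next-page logic as above ...
  pagesScraped++;
}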

Next Steps

With this guide, you should be able to set up a basic web scraper with Puppeteer that can handle dynamic content.

In Part 2 of the series, we will explore handling data cleaning and dealing with edge cases to make our scraper more robust. Stay tuned!