

NodeJS Playwright Beginner Series Part 1: How To Build Your First Playwright Scraper

This guide is your comprehensive, step-by-step journey to building a production-ready web scraper with Node.js and Playwright.

While many tutorials cover only the basics, this six-part series goes further, leading you through the creation of a well-structured scraper using object-oriented programming (OOP) principles.

This 6-part Node.js Playwright Beginner Series will walk you through building a web scraping project from scratch, covering everything from creating the scraper to deployment and scheduling.

You'll learn not just how to scrape data but also how to store and clean it, handle errors and retries, and optimize performance with Node.js concurrency modules. By the end of this guide, you'll be equipped to create a robust, efficient, and scalable web scraper.

Node.js Playwright 6-Part Beginner Series

  • Part 1: Basic Node.js Playwright Scraper - We'll learn the fundamentals of web scraping with Node.js and build your first scraper using Playwright. (This article)

  • Part 2: Cleaning Unruly Data & Handling Edge Cases - Web data can be messy and unpredictable. In this part, we'll create a robust scraper using data structures and cleaning techniques to handle these challenges. (Part 2)

  • Part 3: Storing Scraped Data in AWS S3, MySQL & Postgres DBs - Explore various options for storing your scraped data, including databases like MySQL or Postgres, cloud storage like AWS S3, and file formats like CSV and JSON. We'll discuss their pros, cons, and suitable use cases. (Part 3)

  • Part 4: Managing Retries & Concurrency - Enhance your scraper's reliability and scalability by handling failed requests and utilizing concurrency. (Part 4)

  • Part 5: Faking User-Agents & Browser Headers - Learn how to create a production-ready scraper by simulating real users through user-agent and browser header manipulation. (Part 5)

  • Part 6: Using Proxies To Avoid Getting Blocked - Discover how to use proxies to bypass anti-bot systems by disguising your real IP address and location. (Part 6)

For this beginner series, we'll focus on a simple scraping structure. We'll build a single scraper that takes a starting URL, fetches the website, parses and cleans data from the HTML response, and stores the extracted information - all within the same process.

This approach is ideal for personal projects and small-scale scraping tasks. However, larger-scale scraping, especially for business-critical data, may require more complex architectures.

Need help scraping the web?

Then check out ScrapeOps, the complete toolkit for web scraping.


Part 1: Basic Node.js Scraper

In Part 1, we'll start by building a basic web scraper that extracts data from webpages using CSS selectors and saves it in CSV format. In the following sections, we'll expand on this foundation, adding more features and functionality.

Commercial web scrapers are typically used to collect data from e-commerce sites for competitive analysis.

To demonstrate this, we've chosen chocolate.co.uk—a simple, multi-page website that perfectly suits our needs.

chocolate.co.uk HomePage

Let's dive in!


How to Set Up Our Environment

Before we start writing our code, we need to set up our environment. The requirements are:

  • The latest version of Node.js
  • Playwright
  • A Chromium browser compatible with Playwright

To install Node.js, visit the official website and download the appropriate version for your operating system.

Node.js comes bundled with npm, a package manager that we will use to install Playwright and other dependencies. After installing Node.js, verify the installation with the following commands:

node -v && npm -v

These commands will print the version numbers of Node.js and npm, confirming that they were successfully installed.

For example, it might print:

v22.6.0
10.8.2

Your versions might be different, but that’s okay. Next, let's create a working directory named chocolateScraper and open it in your terminal.

Run the following command to create a package.json file, which will act as a configuration file and keep track of our project dependencies and other information:

npm init -y

After that, install Playwright using this command:

npm i playwright

This command installs the Playwright library.

It's important to note that Playwright doesn’t come with any browser pre-bundled, unlike Puppeteer. You’ll need to install Chromium, Firefox, or WebKit separately.

For our scraper, we’ll use Chromium, so let's install it with this command:

npx playwright install chromium
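
If you later want Firefox or WebKit as well, running the same command without a browser name installs all of Playwright's supported browsers (optional for this series):

npx playwright install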

Creating Your Playwright Scraper Project

Now that we’ve finished setting up our environment, let's create an entry point to start writing our code.

Create a file named chocolateScraper.js, where our code will go.

After this, your directory structure will look like this:

chocolateScraper/
├── node_modules/ # Installed dependencies
├── chocolateScraper.js # Main entry point
├── package.json # Project metadata and dependencies
├── package-lock.json # Lockfile for dependencies (auto-generated)

Laying Out Our Playwright Scraper

In this section, we'll outline how to structure our scraper. Let’s take a look at the following code:

const { chromium } = require("playwright");

const listOfUrls = ["https://www.chocolate.co.uk/collections/all"];
const scrapedData = [];

async function scrape() {
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext();
  const page = await context.newPage();

  for (let url of listOfUrls) {
    console.log(`Scraping: ${url}`);
  }

  await browser.close();
}

(async () => {
  await scrape();
})();

// Scraping: https://www.chocolate.co.uk/collections/all

In the code above, we:

  • Imported the Chromium browser from the Playwright library.
  • Created two data structures: listOfUrls to store the URLs that need to be scraped, and scrapedData to store the important data we extract.
  • Defined a function scrape() to hold our core scraping logic.
  • Inside the function, we looped through each URL and simply printed it out.
  • Then, we closed the browser using browser.close().
  • Finally, we called the scrape() function using an Immediately Invoked Function Expression (IIFE), ensuring it runs as soon as the script is executed.
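
To run the script at any point, execute the file with Node from the project directory; the expected output is shown as a comment at the bottom of the snippet above:

node chocolateScraper.js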

Retrieving HTML From The Website

One of the most essential methods in Playwright is the page.goto() method, which is used to navigate to a given URL.

Let's extract the HTML content from the URLs in our listOfUrls array using the page.content() method:

const { chromium } = require("playwright");

const listOfUrls = ["https://www.chocolate.co.uk/collections/all"];
const scrapedData = [];

async function scrape() {
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext();
  const page = await context.newPage();

  for (let url of listOfUrls) {
    await page.goto(url);

    const html = await page.content();
    console.log(html);
  }

  await browser.close();
}

(async () => {
  await scrape();
})();

The code above successfully fetches the HTML from the web page and displays it in the console. The output will look something like this:

<!DOCTYPE html>
<html class="js" lang="en" dir="ltr"
  style="--announcement-bar-height: 30px; --header-height: 218.453125px; --header-height-without-bottom-nav: 152px; --window-height: 720px;">

<head>
  <meta charset="utf-8">
  <meta name="viewport"
    content="width=device-width, initial-scale=1.0, height=device-height, minimum-scale=1.0, maximum-scale=1.0">
  <meta name="theme-color" content="#682464">

  <title>Products</title>
  ... More

Extracting Data from HTML

Unlike Cheerio, Playwright doesn’t provide methods for querying plain HTML directly. Instead, you have to use methods that operate within the browser context, executed as client-side JavaScript.

Here are some important methods that you should become familiar with:

  • $eval(selector, callback): Selects a single element that matches a CSS selector and runs a callback function on it. The function’s result is returned.

  • $$eval(selector, callback): Similar to $eval, but it selects all elements that match the CSS selector, applies a callback function to each, and returns an array of results.

  • querySelector(selector): A standard DOM method that selects the first element matching a given CSS selector.

  • textContent: Retrieves the text content of an element, including all its descendants. It returns the text inside the element as a string.

  • getAttribute: Retrieves the value of a specified attribute on an element. If the attribute doesn’t exist, it returns null.

For example:

const title = await page.$eval('.product-title', element => element.textContent.trim());
console.log(title);
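
And to collect an attribute from every matching element, you can combine $$eval() with getAttribute(), as in this minimal sketch that assumes a hypothetical .product-link selector:

const links = await page.$$eval('.product-link', elements =>
  // runs inside the browser context; map each element to its href attribute
  elements.map(element => element.getAttribute('href'))
);
console.log(links);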

We will use these methods in the upcoming sections, where you will see how to apply them in practice.


Find Product CSS Selectors

To identify the CSS selectors for target elements within the DOM, you can utilize the Inspect tool in Google Chrome's Developer Tools (or any browser of your choice).

Begin by opening the desired URL, then right-click on the page and select "Inspect". This will open the Inspect tab, which displays the HTML structure of the webpage.

Inspect Tab

Within this tab, you can hover over or click on DOM elements to reveal their associated IDs, classes, and other attributes.

For example, if you're interested in identifying product items, you might find that their class is ".product-item".

Now, let's put this information to use by counting how many product elements, or product cards, exist on the webpage. The following code demonstrates how to do this using Playwright:

const { chromium } = require("playwright");

const listOfUrls = ["https://www.chocolate.co.uk/collections/all"];
const scrapedData = [];

async function scrape() {
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext();
  const page = await context.newPage();

  for (let url of listOfUrls) {
    console.log(`Scraping: ${url}`);
    await page.goto(url);

    const productItems = await page.$$eval('.product-item', items => items.length);
    console.log(productItems);
  }

  await browser.close();
}

(async () => {
  await scrape();
})();

// Scraping: https://www.chocolate.co.uk/collections/all
// 24

In this code, we use the $$eval() function to select all elements matching the ".product-item" CSS selector and then run a callback function on the returned array to print its length, effectively counting the number of product cards on the page.
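
As a side note, newer Playwright code often does the same count with the locator API; a roughly equivalent sketch (not required for this series) would be:

const productCount = await page.locator(".product-item").count();
console.log(productCount);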


Extract Product Details

This is where the actual data extraction begins. Up to this point, we’ve just set up the structure of our code and explored some basic Playwright concepts. Now, we’ll focus on extracting the key details we need about each product.

Since we’ve already selected all the product elements, we can now use query selectors to extract specific information such as the product title, price, and URL.

But first, we need to use the inspector again to find the CSS selectors for these values, which we identified as follows:

  • title: ".product-item-meta__title"
  • price: ".price"

Let’s take a look at the code:

const { chromium } = require("playwright");

const listOfUrls = ["https://www.chocolate.co.uk/collections/all"];
const scrapedData = [];

async function scrape() {
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext();
  const page = await context.newPage();

  for (let url of listOfUrls) {
    console.log(`Scraping: ${url}`);
    await page.goto(url);

    const productItems = await page.$$eval(".product-item", items =>
      items.map(item => {
        const titleElement = item.querySelector(".product-item-meta__title");
        const priceElement = item.querySelector(".price");
        return {
          title: titleElement ? titleElement.textContent : null,
          price: priceElement ? priceElement.textContent : null,
          url: titleElement ? titleElement.getAttribute("href") : null
        };
      })
    );

    scrapedData.push(...productItems);
  }

  await browser.close();
  console.log(scrapedData);
}

(async () => {
  await scrape();
})();

[
  {
    title: '100% Dark Hot Chocolate Flakes',
    price: '\n Sale price£9.95',
    url: '/products/100-dark-hot-chocolate-flakes'
  },
  {
    title: '2.5kg Bulk 41% Milk Hot Chocolate Drops',
    price: '\n Sale price£45.00',
    url: '/products/2-5kg-bulk-of-our-41-milk-hot-chocolate-drops'
  },
  ... More
]

  • In this code, we use querySelector() to locate the title and price elements within each product element using their CSS selectors.
  • We then extract their text content, along with the URL from the title element, which is a clickable link (using getAttribute('href')).
  • If any of this information is missing, we return null instead.
  • Finally, we push all the extracted data into our scrapedData array and print it out.

One thing you might have noticed in the output of the above code is that the price value isn't very clean.

It contains a newline character at the beginning and includes the text Sale price£, which we don't need as it could interfere with numeric calculations.

//price: '\n              Sale price£45.00',

We'll cover how to clean this data thoroughly by implementing a proper Product class in Part 2.

But for now, let's apply a quick fix using the trim() method to remove the surrounding whitespace and the replace() method to eliminate the Sale price£ text.

return {
  title: titleElement ? titleElement.textContent.trim() : null,
  price: priceElement ? priceElement.textContent.replace("Sale price£", "").trim() : null,
  url: titleElement ? titleElement.getAttribute("href") : null
};

Saving Data to CSV

Extracted data is most valuable when stored properly on your local disk. While there are several formats you can use—such as CSV, JSON, NoSQL databases like MongoDB, or SQL databases like PostgreSQL—we'll focus on saving data in CSV format for now. We'll explore other formats in Part 3 of this guide.

CSV (Comma-Separated Values) files organize data in a simple text format where each column is separated by a comma (,), and each row is separated by a newline (\n).

Here's an example:

title,price,url
Almost Perfect,3,/products/almost-perfect

To save data as a CSV file in Node.js, we can use the fs module. Let's walk through an example:

const { chromium } = require("playwright");
const fs = require("fs");

const listOfUrls = ["https://www.chocolate.co.uk/collections/all"];
const scrapedData = [];

async function scrape() {
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext();
  const page = await context.newPage();

  for (let url of listOfUrls) {
    console.log(`Scraping: ${url}`);
    await page.goto(url);

    const productItems = await page.$$eval(".product-item", items =>
      items.map(item => {
        const titleElement = item.querySelector(".product-item-meta__title");
        const priceElement = item.querySelector(".price");
        return {
          title: titleElement ? titleElement.textContent.trim() : null,
          price: priceElement ? priceElement.textContent.replace("Sale price£", "").trim() : null,
          url: titleElement ? titleElement.getAttribute("href") : null
        };
      })
    );

    scrapedData.push(...productItems);
  }

  await browser.close();
  saveAsCSV(scrapedData, 'scraped_data.csv');
}

function saveAsCSV(data, filename) {
  if (data.length === 0) {
    console.log("No data to save.");
    return;
  }

  const header = Object.keys(data[0]).join(",");
  const csv = [header, ...data.map((obj) => Object.values(obj).join(","))].join("\n");
  fs.writeFileSync(filename, csv);
  console.log(`Data saved to ${filename}`);
}

(async () => {
  await scrape();
})();

// Scraping: https://www.chocolate.co.uk/collections/all
// Data saved to scraped_data.csv

In this code, we've created a new function saveAsCSV() that takes the scraped data and a filename as input.

  • The data is formatted according to the CSV structure (with commas separating columns and newline characters separating rows) and then written to a file using the writeFileSync() method.
  • The writeFileSync() method is synchronous, meaning it will wait until all the data has been written to the file before proceeding to the next command (a non-blocking alternative is sketched below).
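
If you'd rather not block the event loop, the same write can be done with the promise-based fs API. Below is a minimal alternative sketch using a hypothetical saveAsCSVAsync() helper; the rest of this series sticks with writeFileSync():

const fsPromises = require("fs/promises");

async function saveAsCSVAsync(data, filename) {
  if (data.length === 0) return;

  // Build the CSV string exactly as before
  const header = Object.keys(data[0]).join(",");
  const csv = [header, ...data.map((obj) => Object.values(obj).join(","))].join("\n");

  // Non-blocking write; resolves once the file has been written
  await fsPromises.writeFile(filename, csv);
  console.log(`Data saved to ${filename}`);
}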

Here’s how the CSV file would look after running the scraper:

title,price,url
100% Dark Hot Chocolate Flakes,9.95,/products/100-dark-hot-chocolate-flakes
2.5kg Bulk 41% Milk Hot Chocolate Drops,45.00,/products/2-5kg-bulk-of-our-41-milk-hot-chocolate-drops
2.5kg Bulk 61% Dark Hot Chocolate Drops,45.00,/products/2-5kg-of-our-best-selling-61-dark-hot-chocolate-drops
... More

Data Quality

As you might have noticed in the above CSV file, we have a data quality issue with the price of the "Almost Perfect" product. We will deal with this in Part 2: Data Cleaning & Edge Cases.


Navigating to the Next Page

In this section, we’ll enhance our scraper to handle pagination and scrape every page of the website. In our case, the "Next Page (→)" button is indicated by an arrow symbol and is a link element whose href attribute contains the URL of the next page.

We can extract this URL, add it to our listOfUrls array, and let our loop process it in subsequent iterations. When the "Next Page" button is no longer present, it indicates that we've reached the last page.

The CSS selector for the "Next Page (→)" button is "a.pagination__nav-item:nth-child(4)".

Below is the updated code, which includes a new asynchronous function nextPage() to handle pagination. The function is asynchronous because page.$eval() returns a promise that must be awaited.

const { chromium } = require("playwright");
const fs = require("fs");

const listOfUrls = ["https://www.chocolate.co.uk/collections/all"];
const scrapedData = [];

async function scrape() {
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext();
  const page = await context.newPage();

  for (let url of listOfUrls) {
    console.log(`Scraping: ${url}`);
    await page.goto(url);

    const productItems = await page.$$eval(".product-item", items =>
      items.map(item => ({
        title: item.querySelector(".product-item-meta__title")?.textContent.trim() || null,
        price: item.querySelector(".price")?.textContent.replace("Sale price£", "").trim() || null,
        url: item.querySelector(".product-item-meta__title")?.getAttribute("href") || null
      }))
    );

    scrapedData.push(...productItems);
    await nextPage(page);
  }

  await browser.close();
  saveAsCSV(scrapedData, 'scraped_data.csv');
}

function saveAsCSV(data, filename) {
  if (data.length === 0) {
    console.log("No data to save.");
    return;
  }

  const header = Object.keys(data[0]).join(",");
  const csv = [header, ...data.map((obj) => Object.values(obj).join(","))].join("\n");
  fs.writeFileSync(filename, csv);
  console.log(`Data saved to ${filename}`);
}

async function nextPage(page) {
  let nextUrl;
  try {
    nextUrl = await page.$eval("a.pagination__nav-item:nth-child(4)", item => item.href);
  } catch (error) {
    console.log('Last Page Reached');
    return;
  }
  listOfUrls.push(nextUrl);
}

(async () => {
  await scrape();
})();


Scraping: https://www.chocolate.co.uk/collections/all
Scraping: https://www.chocolate.co.uk/collections/all?page=2
Scraping: https://www.chocolate.co.uk/collections/all?page=3
Last Page Reached
Data saved to scraped_data.csv

Here's a summary of the changes:

  • The nextPage() function locates the "Next Page (→)" button using page.$eval() (not $$eval(), since it’s a single element). It retrieves the href attribute and adds the URL to the listOfUrls array.
  • The function uses a try-catch block to handle the absence of the "Next Page" button. If the button isn’t found, an error is thrown, indicating that we've reached the last page, and the function exits gracefully.
  • As new URLs are added to listOfUrls, the loop in scrape() continues to navigate through all available pages, scraping data from each one.

This complete code sets up the scraper to automatically handle pagination, ensuring that all relevant data is collected across multiple pages.


Next Steps

With Part 1 laying the groundwork for our scraper—covering data extraction, CSV storage, and pagination—we’re ready to take the next step.

In Part 2, we’ll enhance our scraper by refining the data processing. We’ll clean up any inconsistencies in the extracted data, such as messy price values and extraneous text, to ensure accuracy.