Node.js Playwright Beginner Series Part 2: Cleaning Dirty Data & Dealing With Edge Cases

In Part 1 of this Node.js Playwright Beginners Series, we learned the basics of scraping with Node.js and built our first Node.js scraper.

Data on the web is often messy or incomplete, which means we need to clean it up and handle missing information to keep our scraper running smoothly.

In Part 2 of our Node.js Playwright Beginner Series, we’ll explore how to structure data using a dedicated Product class and enhance our scraper's flexibility with a ProductDataPipeline for managing tasks like deduplication and periodic data storage.

Node.js Playwright 6-Part Beginner Series

  • Part 1: Basic Node.js Playwright Scraper - We'll learn the fundamentals of web scraping with Node.js and build our first scraper using Playwright. (Part 1)

  • Part 2: Cleaning Unruly Data & Handling Edge Cases - Web data can be messy and unpredictable. In this part, we'll create a robust scraper using data structures and cleaning techniques to handle these challenges. (This article)

  • Part 3: Storing Scraped Data in AWS S3, MySQL & Postgres DBs - Explore various options for storing your scraped data, including databases like MySQL or Postgres, cloud storage like AWS S3, and file formats like CSV and JSON. We'll discuss their pros, cons, and suitable use cases. (Part 3)

  • Part 4: Managing Retries & Concurrency - Enhance your scraper's reliability and scalability by handling failed requests and utilizing concurrency. (Part 4)

  • Part 5: Faking User-Agents & Browser Headers - Learn how to create a production-ready scraper by simulating real users through user-agent and browser header manipulation. (Part 5)

  • Part 6: Using Proxies To Avoid Getting Blocked - Discover how to use proxies to bypass anti-bot systems by disguising your real IP address and location. (Part 6)



Strategies to Deal With Edge Cases

In Part 1 of this series, we used basic trim() and replace() methods to clean data on the fly and returned null when the title or price was missing.

While this worked in the short term, it lacked a solid structure and missed several important factors.

If we inspect the data from the chocolate.co.uk website that we’re scraping for this series, we can see several issues. For example:

  • Unclean Price Data: Prices may include extra prefixes like "Sale price" or "Sale priceFrom" that need to be removed.
  • Currency Conversion: Prices are provided in British pounds (GBP), but we need them in US dollars (USD).
  • Relative URLs: Scraped URLs are relative, so we need to convert them into absolute URLs for direct use.
  • Missing Data: The name, price, or URL might be missing, and we need to handle these cases.

Here’s a look at some problematic entries from the CSV file generated in Part 1:

Messy Data

Here are several strategies to handle situations like this:

  • Try/Catch: Wrap parts of your parsers in try/catch blocks. If an error occurs when scraping a specific field, the scraper will switch to an alternative parser.
  • Conditional Parsing: Set up your scraper to check the HTML response for certain DOM elements, and apply different parsers based on the situation.
  • Data Classes: Use data classes to create structured containers, making your code clearer, reducing repetitive boilerplate, and simplifying data manipulation.
  • Data Pipelines: Implement data pipelines to design a series of post-processing steps that clean, manipulate, and validate your data before storing it.
  • Clean During Data Analysis: Parse all relevant fields first, then clean and process the data during the analysis phase.

Each method has its own advantages and drawbacks, so it’s important to be familiar with all of them. This allows you to choose the most suitable approach for your specific scenario.
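
To make the first two options more concrete, here's a rough sketch of how try/catch fallbacks and conditional parsing can look with Playwright. The fallback selector ".price--sale" is a hypothetical placeholder, not one we'll use later in this series:

// Sketch of the try/catch strategy: if the primary selector fails,
// fall back to an alternative parser instead of letting the scraper crash.
// The ".price--sale" fallback selector is hypothetical.
async function extractPrice(page) {
  try {
    return await page.$eval(".price", (el) => el.textContent.trim());
  } catch (error) {
    return await page.$eval(".price--sale", (el) => el.textContent.trim());
  }
}

// Sketch of conditional parsing: check whether a DOM element exists,
// then decide which parser to apply.
async function extractTitle(page) {
  if (await page.$(".product-item-meta__title")) {
    return await page.$eval(".product-item-meta__title", (el) => el.textContent.trim());
  }
  return null;
}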

For this project, we’ll focus on Data Classes and Data Pipelines as they offer the most structured and efficient way to process the data we scrape with Playwright.

Here’s a system diagram that maps out our code structure, including the Product and ProductDataPipeline classes:

System Design Diagram


Structure Your Scraped Data with Data Classes

In Part 1, we scraped data (name, price, and URL) and stored it directly in a dictionary without any formal structure.

In this section, however, we'll implement data classes to create a structured Product class. The Product class will help turn raw, unstructured data from the website into a clean and structured object. Instances of this class will contain sanitized data that can be easily converted into formats like CSV, JSON, or others for local storage.

Data classes, which we'll emulate here with a plain JavaScript class, provide an efficient way to structure and manage data in your web scraping tasks. They streamline the process by organizing scraped elements into clean, reusable data structures.

This approach eliminates repetitive code, enhances readability, and simplifies the handling of common tasks such as parsing and validation of scraped data.

Here's how a new instance will be created by passing unclean raw data to the Product class:

new Product(rawProduct.name, rawProduct.price, rawProduct.url);

While we're passing three parameters, the resulting instance will have four key properties:

  • name: The product name, cleaned of any unwanted characters
  • priceGb: The price in British pounds (GBP)
  • priceUsd: The price converted to US dollars (USD)
  • url: The absolute URL that you can navigate to directly

Here's a look at the Product class structure:

class Product {
  constructor(name, priceStr, url, conversionRate = 1.32) {
    this.name = this.cleanName(name);
    this.priceGb = this.cleanPrice(priceStr);
    this.priceUsd = this.convertPriceToUsd(this.priceGb, conversionRate);
    this.url = this.createAbsoluteUrl(url);
  }

  cleanName(name) {}

  cleanPrice(priceStr) {}

  convertPriceToUsd(priceGb, conversionRate) {}

  createAbsoluteUrl(url) {}
}

We’re introducing a fourth parameter to the Product class: conversionRate, which defaults to 1.32, the GBP-to-USD exchange rate at the time of writing. You can update this value as needed or use an API like ExchangeRate-API for dynamic rate updates.

Since it’s a default parameter, you don’t need to specify it when creating an instance of the Product class unless you want to override the default rate.
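
If you'd prefer not to hard-code the rate, here's a minimal sketch of fetching it at runtime. It assumes Node.js 18+ (for the built-in fetch) and an ExchangeRate-API-style endpoint that returns JSON with a rates object; check your provider's docs for the exact URL and response shape:

// Minimal sketch: fetch the current GBP -> USD rate at runtime.
// Assumes Node.js 18+ (built-in fetch) and an endpoint that returns
// JSON shaped like { rates: { USD: 1.32, ... } }.
async function getGbpToUsdRate(fallbackRate = 1.32) {
  try {
    const response = await fetch("https://api.exchangerate-api.com/v4/latest/GBP");
    const data = await response.json();
    return data.rates.USD ?? fallbackRate;
  } catch (error) {
    // Network or parsing error: fall back to the hard-coded rate
    return fallbackRate;
  }
}

// Usage (inside an async function):
// const rate = await getGbpToUsdRate();
// const product = new Product(rawName, rawPrice, rawUrl, rate);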

You'll notice the use of several methods that we'll define in the upcoming sections. Each method is responsible for handling specific tasks, leading to a cleaner, more modular codebase.

Here's a quick overview of what each method does:

  • cleanName(): Cleans up the product name.
  • cleanPrice(): Strips unwanted characters from the price string.
  • convertPriceToUsd(): Converts the GBP price to USD.
  • createAbsoluteUrl(): Converts relative URLs to absolute ones.

Clean the Price

The cleanPrice() method performs several checks to ensure the price data is valid and clean:

  • If the price data is missing or contains only empty spaces, it returns 0.0.
  • If the price exists, it removes unnecessary prefixes (e.g. "Sale price£" and "Sale priceFrom £") and trims any extra spaces.
  • Finally, it attempts to convert the cleaned price string to a floating-point number. If the conversion fails, it returns 0.0.

Here’s the method:

cleanPrice(priceStr) {
  if (!priceStr?.trim()) {
    return 0.0;
  }

  const cleanedPrice = priceStr
    .replace(/Sale priceFrom £|Sale price£/g, "")
    .trim();

  return cleanedPrice ? parseFloat(cleanedPrice) : 0.0;
}

In the script above:

  • We used optional chaining (?.) in !priceStr?.trim(), which ensures that trim() is only called if priceStr exists. This modern JavaScript feature, available in Node.js, prevents errors when accessing properties of null or undefined.

  • The replace(/Sale priceFrom £|Sale price£/g, "") uses regular expressions to detect and remove the unwanted prefixes ("Sale priceFrom £" and "Sale price£") from the price string.

  • The parseFloat() function is used because the price value extracted from the web is a string, so it needs to be converted into a floating-point number for numeric calculations.

  • The conditional return cleanedPrice ? parseFloat(cleanedPrice) : 0.0 ensures that if the cleaned price string is empty or non-numeric, the method returns 0.0 instead of attempting an invalid conversion.

The optional chaining (?.) operator accesses an object's property or calls a function. If the object accessed or function called using this operator is undefined or null, the expression short circuits and evaluates to undefined instead of throwing an error - (Source: MDN)
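
Here's a quick illustration of that short-circuiting with a missing price value:

const missingPrice = null;

// Without optional chaining, missingPrice.trim() would throw:
// TypeError: Cannot read properties of null (reading 'trim')

// With optional chaining, the expression short-circuits to undefined,
// so the guard in cleanPrice() evaluates to true and we return 0.0:
console.log(missingPrice?.trim());  // undefined
console.log(!missingPrice?.trim()); // true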


Convert the Price

The convertPriceToUsd() method takes the price in GBP and converts it to USD using the current exchange rate (1.32 in our case).

Here's how:

convertPriceToUsd(priceGb, conversionRate) {
  return priceGb * conversionRate;
}

Clean the Name

The cleanName() method performs the following checks:

  • If the name is missing or contains only spaces, it returns "missing".
  • Otherwise, it returns the trimmed and cleaned name.

cleanName(name) {
  return name?.trim() || "missing";
}

Convert Relative to Absolute URL

The createAbsoluteUrl() method performs the following checks:

  • If the URL is missing or consists only of empty spaces, it returns "missing".
  • Otherwise, it returns the trimmed URL prefixed with https://www.chocolate.co.uk.

createAbsoluteUrl(url) {
  return (url?.trim()) ? `https://www.chocolate.co.uk${url.trim()}` : "missing";
}

This code will convert "/products/almost-perfect" to "https://www.chocolate.co.uk/products/almost-perfect", giving us a navigable link.
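
As a side note, Node.js also ships with the built-in WHATWG URL class, which handles this joining for you and copes with URLs that are already absolute. A variant of createAbsoluteUrl() using it might look like this:

// Alternative sketch using the built-in URL class:
// new URL(relative, base) resolves the relative path against the base
// and leaves already-absolute URLs untouched.
createAbsoluteUrl(url) {
  if (!url?.trim()) {
    return "missing";
  }
  return new URL(url.trim(), "https://www.chocolate.co.uk").href;
}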

Here’s a snapshot of the data that will be returned from the Product data class. It consists of name, priceGb, priceUsd, and url.

Structured Data


Complete Code for the Data Class

Now that we've defined all our methods, let's take a look at the complete code for the Product class.

class Product {
  constructor(name, priceStr, url, conversionRate = 1.32) {
    this.name = this.cleanName(name);
    this.priceGb = this.cleanPrice(priceStr);
    this.priceUsd = this.convertPriceToUsd(this.priceGb, conversionRate);
    this.url = this.createAbsoluteUrl(url);
  }

  cleanName(name) {
    return name?.trim() || "missing";
  }

  cleanPrice(priceStr) {
    if (!priceStr?.trim()) {
      return 0.0;
    }

    const cleanedPrice = priceStr
      .replace(/Sale priceFrom £|Sale price£/g, "")
      .trim();

    return cleanedPrice ? parseFloat(cleanedPrice) : 0.0;
  }

  convertPriceToUsd(priceGb, conversionRate) {
    return priceGb * conversionRate;
  }

  createAbsoluteUrl(url) {
    return (url?.trim()) ? `https://www.chocolate.co.uk${url.trim()}` : "missing";
  }
}

Let's test if our Product class works as expected by creating a new instance with some messy data and checking if it cleans it up:

const p = new Product(
  "Almost Perfect",
  "Sale priceFrom £3.00",
  "/products/almost-perfect"
);

console.log(p);

// Product {
//   name: 'Almost Perfect',
//   priceGb: 3,
//   priceUsd: 3.96,
//   url: 'https://www.chocolate.co.uk/products/almost-perfect'
// }

This output is exactly what we anticipated. Next, we'll dive into the ProductDataPipeline class, where we'll implement the core logic.


Process and Store Scraped Data with Data Pipeline

A Pipeline refers to a sequence of steps where data moves through various stages, getting transformed and processed at each step. It’s a common pattern in programming for organizing tasks efficiently.

Here’s how our ProductDataPipeline will operate:

  1. Take raw product data
  2. Clean and structure the data
  3. Filter out duplicates
  4. Queue the product for storage
  5. Save data to CSV
  6. Perform final cleanup

Let's take a look at the overall structure of ProductDataPipeline:

class ProductDataPipeline {
  constructor(csvFilename = "", storageQueueLimit = 5) {
    this.seenProducts = new Set();
    this.storageQueue = [];
    this.storageQueueLimit = storageQueueLimit;
    this.csvFilename = csvFilename;
    this.csvFileOpen = false;
  }

  saveToCsv() {}

  cleanRawProduct(rawProduct) {}

  isDuplicateProduct(product) {}

  addProduct(rawProduct) {}

  async close() {}
}

The class above requires only two parameters, but it defines five properties, each serving a distinct purpose that will become clearer as we proceed. Here’s an overview of these properties:

  • seenProducts: A Set used to track products we've already seen; a Set only stores unique values, so repeated entries are ignored.
  • storageQueue: A Queue that temporarily holds products until the storageQueueLimit is reached.
  • storageQueueLimit: An integer representing the maximum number of products allowed in the storageQueue. This value is passed as an argument when creating an instance of the class.
  • csvFilename: The name of the CSV file where the product data will be stored. This value is also passed as an argument when creating an instance of the class.
  • csvFileOpen: A boolean flag to track whether the CSV file is currently open or closed, which will be useful in the addProduct() and saveToCsv() methods you'll see in later sections.

Similarly, there are five key methods that process and store our data as it moves through the pipeline. Here’s a brief overview of each:

  • saveToCsv(): Periodically writes the products stored in the storageQueue to a CSV file once the storageQueueLimit is reached.
  • cleanRawProduct(): Cleans the raw data extracted from the web and converts it into a Product instance to structure and sanitize it.
  • isDuplicateProduct(): Checks if the product already exists in the seenProducts set to avoid duplicate entries.
  • addProduct(): Cleans the raw product, checks for duplicates, and adds it to the pipeline. If the queue limit is reached, it saves the data to CSV.
  • close(): Async method that ensures any remaining queued data is saved to the file before closing the pipeline.

Clean the Product Data

We’ve already covered how to clean data using the Product class. Here, we simply apply that by taking the raw data and creating an instance of the Product class:

cleanRawProduct(rawProduct) {
  return new Product(rawProduct.name, rawProduct.price, rawProduct.url);
}

Add the Product

The addProduct() method processes each product in a structured way:

  • First, it cleans the raw product data by converting it into a Product instance using the cleanRawProduct() method.
  • Then, it checks if the product is a duplicate using the isDuplicateProduct() method, and if it isn't, the product is added to the storageQueue.
  • If the storageQueue reaches its defined limit and the CSV file isn't already open, the saveToCsv() method is triggered to save the queued data.

Here is the code:

addProduct(rawProduct) {
  const product = this.cleanRawProduct(rawProduct);
  if (!this.isDuplicateProduct(product)) {
    this.storageQueue.push(product);
    if (
      this.storageQueue.length >= this.storageQueueLimit &&
      !this.csvFileOpen
    ) {
      this.saveToCsv();
    }
  }
}

Check for Duplicate Product

To ensure we don't add duplicate products to the storageQueue, we need a way to uniquely identify each product.

We'll use the URL of the products for this purpose, as it is unique to each product—even if two products have the same price.

Here’s how it works:

  • When adding a product, its URL is added to the seenProducts set.
  • The isDuplicateProduct() method checks if the product's URL is already in the seenProducts set.
  • If the URL is not found, it indicates that the product is new, and we add the URL to the set and return false.
  • If the URL is found, it means the product is a duplicate, so we return true.

isDuplicateProduct(product) {
  if (!this.seenProducts.has(product.url)) {
    this.seenProducts.add(product.url);
    return false;
  }
  return true;
}

Periodically Save Data to CSV

Saving all the data to a CSV file at once could result in data loss if an error or interruption occurs during processing.

To mitigate this risk, we use a periodic approach where data is saved to the CSV file as soon as the storageQueue reaches its default limit of 5 items.

This way, if something goes wrong, only the latest batch of data is at risk, not the entire dataset. This method improves efficiency and data integrity.

In the saveToCsv() method:

  • We determine if the CSV file already exists. If it does, the headers are assumed to be present.
  • If the file does not exist, we write the headers ("name,priceGb,priceUsd,url\n") since headers should only be written once at the top of the file.
  • Then we add the product data from the storageQueue to the CSV file using the file.write() method.
  • After writing all data, we close the file with the file.end() method and set csvFileOpen to false to indicate that the CSV operations are complete.

Here’s the code for saveToCsv():

saveToCsv() {
  this.csvFileOpen = true;
  const fileExists = fs.existsSync(this.csvFilename);
  const file = fs.createWriteStream(this.csvFilename, { flags: "a" });
  if (!fileExists) {
    file.write("name,priceGb,priceUsd,url\n");
  }
  for (const product of this.storageQueue) {
    file.write(
      `${product.name},${product.priceGb},${product.priceUsd},${product.url}\n`
    );
  }
  file.end();
  this.storageQueue = [];
  this.csvFileOpen = false;
}

In the code above, we utilized four methods from Node.js' fs module:

  • existsSync(filename): This method checks if a file exists synchronously, returning true if the file is found, and false otherwise.
  • createWriteStream(filename, { flags: "a" }): Opens a writable stream in append mode ({ flags: "a" }), ensuring new content is added without overwriting existing data.
  • write(data): Writes data to the stream, allowing content to be appended line by line when working with file streams.
  • end(): Closes the writable stream, ensuring that all buffered data is flushed to the file and the file is properly closed. This should be called when no more data will be written.
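
One caveat with this hand-rolled CSV writer: if a product name ever contains a comma, quote, or newline, it will break the column layout. The chocolate.co.uk product names we're scraping don't, but if you want to guard against it, a small quoting helper like the hypothetical one below could be dropped into saveToCsv():

// Hypothetical helper: quote a field if it contains a comma, quote, or
// newline, doubling embedded quotes per the usual CSV convention.
function escapeCsvField(value) {
  const str = String(value);
  if (/[",\n]/.test(str)) {
    return `"${str.replace(/"/g, '""')}"`;
  }
  return str;
}

// Example usage inside saveToCsv():
// file.write(
//   [product.name, product.priceGb, product.priceUsd, product.url]
//     .map(escapeCsvField)
//     .join(",") + "\n"
// );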

Closing the Pipeline

When the close() method is called, it ensures that the pipeline completes all of its tasks. However, there might still be some products left in the storageQueue, which haven’t been saved to the CSV file yet.

We handle this by writing any remaining data to the CSV before closing.

async close() {
  while (this.csvFileOpen) {
    // Wait for the file to be written
    await new Promise((resolve) => setTimeout(resolve, 1000));
  }
  if (this.storageQueue.length > 0) {
    this.saveToCsv();
  }
}

Full Data Pipeline Code

Here, we’ve combined all the methods we defined in the previous sections. This is how our complete ProductDataPipeline class looks:

class ProductDataPipeline {
  constructor(csvFilename = "", storageQueueLimit = 5) {
    this.seenProducts = new Set();
    this.storageQueue = [];
    this.csvFilename = csvFilename;
    this.csvFileOpen = false;
    this.storageQueueLimit = storageQueueLimit;
  }

  saveToCsv() {
    this.csvFileOpen = true;
    const fileExists = fs.existsSync(this.csvFilename);
    const file = fs.createWriteStream(this.csvFilename, { flags: "a" });
    if (!fileExists) {
      file.write("name,priceGb,priceUsd,url\n");
    }
    for (const product of this.storageQueue) {
      file.write(
        `${product.name},${product.priceGb},${product.priceUsd},${product.url}\n`
      );
    }
    file.end();
    this.storageQueue = [];
    this.csvFileOpen = false;
  }

  cleanRawProduct(rawProduct) {
    return new Product(rawProduct.name, rawProduct.price, rawProduct.url);
  }

  isDuplicateProduct(product) {
    if (!this.seenProducts.has(product.url)) {
      this.seenProducts.add(product.url);
      return false;
    }
    return true;
  }

  addProduct(rawProduct) {
    const product = this.cleanRawProduct(rawProduct);
    if (!this.isDuplicateProduct(product)) {
      this.storageQueue.push(product);
      if (
        this.storageQueue.length >= this.storageQueueLimit &&
        !this.csvFileOpen
      ) {
        this.saveToCsv();
      }
    }
  }

  async close() {
    while (this.csvFileOpen) {
      // Wait for the file to be written
      await new Promise((resolve) => setTimeout(resolve, 1000));
    }
    if (this.storageQueue.length > 0) {
      this.saveToCsv();
    }
  }
}

Now, let's test our pipeline to see if it works as expected.

We'll manually add the data extracted in Part 1 of this series, and after passing it through our pipeline, we'll save it to a file named "chocolate.csv":

const fs = require("fs");

class Product {
  // Code for Product class
}

class ProductDataPipeline {
  // Code for ProductDataPipeline
}

const pipeline = new ProductDataPipeline("chocolate.csv", 5);

// Add to data pipeline
pipeline.addProduct({
  name: "Lovely Chocolate",
  price: "Sale priceFrom £1.50",
  url: "/products/100-dark-hot-chocolate-flakes",
});

// Add to data pipeline
pipeline.addProduct({
  name: "My Nice Chocolate",
  price: "Sale priceFrom £4",
  url: "/products/nice-chocolate-flakes",
});

// Add duplicate product to data pipeline
pipeline.addProduct({
  name: "Lovely Chocolate",
  price: "Sale priceFrom £1.50",
  url: "/products/100-dark-hot-chocolate-flakes",
});

// Close pipeline when finished - saves data to CSV
pipeline.close();

CSV file output:

name,priceGb,priceUsd,url
Lovely Chocolate,1.5,1.98,https://www.chocolate.co.uk/products/100-dark-hot-chocolate-flakes
My Nice Chocolate,4,5.28,https://www.chocolate.co.uk/products/nice-chocolate-flakes

In the above example, we:

  • Imported the fs module.
  • Defined the Product and ProductDataPipeline classes.
  • Created a new pipeline instance.
  • Added three unclean products, the third being a duplicate of the first, to test the pipeline's handling of duplicates.
  • Closed the pipeline to finish processing.

The output shows that the pipeline successfully cleaned the data, ignored duplicates, and saved the cleaned data to a file named "chocolate.csv" in our current directory.


Testing Our Data Processing

Now, let’s bring everything together by testing the complete code from Part 1 and Part 2 to ensure it scrapes, cleans, and stores all the data from chocolate.co.uk without any errors.

Below is the full code, including the scrape() and nextPage() methods from Part 1.

The scrape() method has been slightly modified to use the Product and ProductDataPipeline classes, but the changes are self-explanatory, so we won’t dive into the details here:

const { chromium } = require('playwright');
const fs = require('fs');

class Product {
  constructor(name, priceStr, url, conversionRate = 1.32) {
    this.name = this.cleanName(name);
    this.priceGb = this.cleanPrice(priceStr);
    this.priceUsd = this.convertPriceToUsd(this.priceGb, conversionRate);
    this.url = this.createAbsoluteUrl(url);
  }

  cleanName(name) {
    return name?.trim() || "missing";
  }

  cleanPrice(priceStr) {
    if (!priceStr?.trim()) {
      return 0.0;
    }

    const cleanedPrice = priceStr
      .replace(/Sale priceFrom £|Sale price£/g, "")
      .trim();

    return cleanedPrice ? parseFloat(cleanedPrice) : 0.0;
  }

  convertPriceToUsd(priceGb, conversionRate) {
    return priceGb * conversionRate;
  }

  createAbsoluteUrl(url) {
    return (url?.trim()) ? `https://www.chocolate.co.uk${url.trim()}` : "missing";
  }
}

class ProductDataPipeline {
  constructor(csvFilename = "", storageQueueLimit = 5) {
    this.seenProducts = new Set();
    this.storageQueue = [];
    this.csvFilename = csvFilename;
    this.csvFileOpen = false;
    this.storageQueueLimit = storageQueueLimit;
  }

  saveToCsv() {
    this.csvFileOpen = true;
    const fileExists = fs.existsSync(this.csvFilename);
    const file = fs.createWriteStream(this.csvFilename, { flags: "a" });
    if (!fileExists) {
      file.write("name,priceGb,priceUsd,url\n");
    }
    for (const product of this.storageQueue) {
      file.write(
        `${product.name},${product.priceGb},${product.priceUsd},${product.url}\n`
      );
    }
    file.end();
    this.storageQueue = [];
    this.csvFileOpen = false;
  }

  cleanRawProduct(rawProduct) {
    return new Product(rawProduct.name, rawProduct.price, rawProduct.url);
  }

  isDuplicateProduct(product) {
    if (!this.seenProducts.has(product.url)) {
      this.seenProducts.add(product.url);
      return false;
    }
    return true;
  }

  addProduct(rawProduct) {
    const product = this.cleanRawProduct(rawProduct);
    if (!this.isDuplicateProduct(product)) {
      this.storageQueue.push(product);
      if (
        this.storageQueue.length >= this.storageQueueLimit &&
        !this.csvFileOpen
      ) {
        this.saveToCsv();
      }
    }
  }

  async close() {
    while (this.csvFileOpen) {
      // Wait for the file to be written
      await new Promise((resolve) => setTimeout(resolve, 1000));
    }
    if (this.storageQueue.length > 0) {
      this.saveToCsv();
    }
  }
}

const listOfUrls = ["https://www.chocolate.co.uk/collections/all"];

async function scrape() {
  const pipeline = new ProductDataPipeline("chocolate.csv", 5);
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext();
  const page = await context.newPage();

  for (let url of listOfUrls) {
    console.log(`Scraping: ${url}`);
    await page.goto(url);

    const productItems = await page.$$eval("product-item", items =>
      items.map(item => {
        const titleElement = item.querySelector(".product-item-meta__title");
        const priceElement = item.querySelector(".price");
        return {
          title: titleElement ? titleElement.textContent.trim() : null,
          price: priceElement ? priceElement.textContent.trim() : null,
          url: titleElement ? titleElement.getAttribute("href") : null
        };
      })
    );

    for (const rawProduct of productItems) {
      if (rawProduct.title && rawProduct.price && rawProduct.url) {
        pipeline.addProduct({
          name: rawProduct.title,
          price: rawProduct.price,
          url: rawProduct.url
        });
      }
    }

    await nextPage(page);
  }

  await pipeline.close();
  await browser.close();
}

async function nextPage(page) {
  let nextUrl;
  try {
    nextUrl = await page.$eval("a.pagination__nav-item:nth-child(4)", item => item.href);
  } catch (error) {
    console.log('Last Page Reached');
    return;
  }
  listOfUrls.push(nextUrl);
}

(async () => {
  await scrape();
})();

// Scraping: https://www.chocolate.co.uk/collections/all
// Scraping: https://www.chocolate.co.uk/collections/all?page=2
// Scraping: https://www.chocolate.co.uk/collections/all?page=3
// Last Page Reached

After running the code, we should see all the pages from chocolate.co.uk being scraped, with prices displayed in both GBP and USD. The relative URLs are converted to absolute URLs after passing through our Product class, and the data pipeline has successfully removed any duplicates and saved the clean data into the CSV file.

Here’s a screenshot of the fully cleaned and structured data:

CSV Data


Next Steps

We hope you've gained a solid understanding of the basics of data classes, data pipelines, and periodic data storage in CSV files. If you have any questions, feel free to leave them in the comments, and we’ll be happy to help!

In Part 3 of this series, we’ll explore how to store our data in different formats, such as JSON, and how to use databases like PostgreSQL, MySQL, and Amazon S3 buckets for storage.

We’ll also dive into the pros and cons of each storage method. Stay tuned!