
NodeJS Puppeteer Beginners Series Part 2: Cleaning Dirty Data & Dealing With Edge Cases

In Part 1 of this Node.js Puppeteer Beginners Series, we learned the basics of scraping with Node.js and built our first Node.js scraper.

In Part 2 of the series, we'll explore how to structure scraped data using a dedicated Product class and enhance our scraper's flexibility with a ProductDataPipeline that manages tasks like deduplication and periodic data storage.

Node.js Puppeteer 6-Part Beginner Series

  • Part 1: Basic Node.js Puppeteer Scraper - We'll learn the fundamentals of web scraping with Node.js and build our first scraper using Node.js Puppeteer. (Part 1)

  • Part 2: Cleaning Unruly Data & Handling Edge Cases - Web data can be messy and unpredictable. In this part, we'll create a robust scraper using data structures and cleaning techniques to handle these challenges. (This article)

  • Part 3: Storing Scraped Data in AWS S3, MySQL & Postgres DBs - Explore various options for storing your scraped data, including databases like MySQL or Postgres, cloud storage like AWS S3, and file formats like CSV and JSON. We'll discuss their pros, cons, and suitable use cases. (Part 3)

  • Part 4: Managing Retries & Concurrency - Enhance your scraper's reliability and scalability by handling failed requests and utilizing concurrency. (Part 4)

  • Part 5: Faking User-Agents & Browser Headers - Learn how to create a production-ready scraper by simulating real users through user-agent and browser header manipulation. (Part 5)

  • Part 6: Using Proxies To Avoid Getting Blocked - Discover how to use proxies to bypass anti-bot systems by disguising your real IP address and location. (Part 6)



Strategies to Deal With Edge Cases

Web data is often messy and incomplete, which makes web scraping a bit more complicated for us. For example, when scraping e-commerce sites, most products follow a specific data structure. However, sometimes, things are displayed differently:

  • Some items have both a regular price and a sale price.
  • Prices might include sales taxes or VAT in some cases but not others.
  • If a product is sold out, its price might be missing.
  • Product descriptions can vary, with some in paragraphs and others in bullet points.

Dealing with these edge cases is part of the web scraping process, so we need to come up with a way to handle them.

In the case of the e-commerce website we're scraping, if we inspect the data, we can see several issues:

  • Some prices are missing, either because the item is out of stock or the price wasn't listed.
  • The prices are currently shown in British Pounds (GBP), but we need them in US Dollars (USD).
  • Product URLs are relative and would be preferable as absolute URLs for easier tracking and accessibility.
  • Some products are listed multiple times.

Messy Data

There are several options to deal with situations like this:

  • Try/Catch: Wrap parts of your parsers in try/catch blocks so that an error scraping a particular field can be handled gracefully.
  • Conditional Parsing: Have your scraper check the HTML response for particular DOM elements and use specific parsers depending on the situation.
  • JavaScript Classes: Use classes to define structured data containers, leading to clearer code and easier manipulation.
  • Data Pipelines: Design a series of post-processing steps to clean, manipulate, and validate your data before storing it.
  • Clean During Analysis: Parse every relevant field as-is, then clean the data later in your data analysis pipeline.

Each strategy has its own advantages and disadvantages, so it's worth understanding all of them. That way, you can choose the best option for your specific situation when the need arises.
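
To make the first option concrete, here's a minimal sketch of the try/catch strategy. It assumes a Puppeteer page object is in scope, and the .price selector is illustrative:

// Minimal try/catch sketch: page.$eval throws if the selector matches nothing,
// so a missing price is caught and returned as null instead of crashing the scraper.
async function scrapePrice(page) {
  try {
    return await page.$eval('.price', el => el.innerText);
  } catch (err) {
    return null; // price missing for this product; handle downstream
  }
}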

In this project, we're going to focus on using JavaScript Classes and Data Pipelines as they are the most powerful options available to structure and process data.


Structure Your Scraped Data with JavaScript Classes

In Part 1, we scraped data (name, price, and URL) and stored it directly in an array without proper structuring.

In this part, we'll use JavaScript classes to define a structured class called Product and directly pass the scraped data into its instances.

JavaScript classes offer a convenient way of structuring and managing data effectively. They can handle methods for cleaning and processing data, making your scraping code more modular and maintainable.

Defining the Product Class

In the following code snippet, we pass the scraped data directly to the Product class so it is properly structured and managed. This class accepts three parameters:

  • name: the product's name.
  • priceString: a string representing the product's price in GBP (e.g., "£10.99").
  • url: a relative URL for the product.

Using the Product class, we're going to do the following:

  • cleanName(name): Cleans up product names by stripping leading and trailing whitespace. If a name is empty, it's set to "missing".
  • cleanPrice(priceString): Cleans up price strings by removing anything that's not a numeric character or a period, then converting the cleaned string to a float. If a price string is empty, the price is set to 0.0.
  • convertPriceToUSD(): Converts the price from British Pounds to US Dollars using a fixed exchange rate (1.21 in our case).
  • createAbsoluteURL(relativeURL): Creates absolute URLs for products by appending their relative URLs to the base URL.

Clean the Name

  • This method trims leading and trailing whitespace from the name and returns it.
  • If the name is empty or just spaces, it defaults to "missing".
class Product {
  constructor(name, priceString, url) {
    this.name = this.cleanName(name);
    this.priceGBP = this.cleanPrice(priceString);
    this.priceUSD = this.convertPriceToUSD();
    this.url = this.createAbsoluteURL(url);
  }

  cleanName(name) {
    return name.trim() || "missing";
  }
}

Clean the Price

  • This method removes any non-numeric characters (except for periods) from the price string, leaving only the numeric part.
  • It then converts this cleaned string into a floating-point number using parseFloat().
  • If the price string is empty or invalid, it defaults to 0.0.
class Product {
  constructor(name, priceString, url) {
    this.name = this.cleanName(name);
    this.priceGBP = this.cleanPrice(priceString);
    this.priceUSD = this.convertPriceToUSD();
    this.url = this.createAbsoluteURL(url);
  }

  cleanName(name) {
    return name.trim() || "missing";
  }

  cleanPrice(priceString) {
    if (!priceString) return 0.0;
    priceString = priceString.replace(/[^0-9\.]+/g, '');
    return parseFloat(priceString) || 0.0;
  }
}
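
To see exactly what the regex does, here's a quick standalone check (the sample string is illustrative):

// The character class [^0-9\.] matches anything that isn't a digit or a period
const raw = "Sale priceFrom £1.50";
const cleaned = raw.replace(/[^0-9\.]+/g, ''); // "1.50"
console.log(parseFloat(cleaned)); // 1.5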

Convert the Price

  • This method converts the price in GBP to USD using a fixed exchange rate of 1.21.
  • It multiplies this.priceGBP by 1.21 and returns the price in USD.
class Product {
  constructor(name, priceString, url) {
    this.name = this.cleanName(name);
    this.priceGBP = this.cleanPrice(priceString);
    this.priceUSD = this.convertPriceToUSD();
    this.url = this.createAbsoluteURL(url);
  }

  cleanName(name) {
    return name.trim() || "missing";
  }

  cleanPrice(priceString) {
    if (!priceString) return 0.0;
    priceString = priceString.replace(/[^0-9\.]+/g, '');
    return parseFloat(priceString) || 0.0;
  }

  convertPriceToUSD() {
    const exchangeRate = 1.21;
    return this.priceGBP * exchangeRate;
  }
}

Convert Relative to Absolute URL

  • This method creates an absolute URL by appending the relative URL to the base URL https://www.chocolate.co.uk
  • If no relative URL is provided, it defaults to "missing".
class Product {
  constructor(name, priceString, url) {
    this.name = this.cleanName(name);
    this.priceGBP = this.cleanPrice(priceString);
    this.priceUSD = this.convertPriceToUSD();
    this.url = this.createAbsoluteURL(url);
  }

  cleanName(name) {
    return name.trim() || "missing";
  }

  cleanPrice(priceString) {
    if (!priceString) return 0.0;
    priceString = priceString.replace(/[^0-9\.]+/g, '');
    return parseFloat(priceString) || 0.0;
  }

  convertPriceToUSD() {
    const exchangeRate = 1.21;
    return this.priceGBP * exchangeRate;
  }

  createAbsoluteURL(relativeURL) {
    const baseURL = "https://www.chocolate.co.uk";
    return relativeURL ? `${baseURL}${relativeURL}` : "missing";
  }
}

The Product class helps us effectively structure and manage the messy data we've scraped. It handles edge cases, removes irrelevant text, and cleans up the information before passing the data on to the data pipeline for further processing.

Here's a snapshot of the data returned by the Product class, which includes the name, priceGBP, priceUSD, and url fields.

Structured Data

Here's the complete code for the Product class.

class Product {
  constructor(name, priceString, url) {
    this.name = this.cleanName(name);
    this.priceGBP = this.cleanPrice(priceString);
    this.priceUSD = this.convertPriceToUSD();
    this.url = this.createAbsoluteURL(url);
  }

  // Strip surrounding whitespace; default to "missing" if the name is empty
  cleanName(name) {
    return name.trim() || "missing";
  }

  // Remove everything except digits and periods, then parse as a float
  cleanPrice(priceString) {
    if (!priceString) return 0.0;
    priceString = priceString.replace(/[^0-9\.]+/g, '');
    return parseFloat(priceString) || 0.0;
  }

  // Convert GBP to USD using a fixed exchange rate
  convertPriceToUSD() {
    const exchangeRate = 1.21;
    return this.priceGBP * exchangeRate;
  }

  // Prepend the site's base URL to the relative product URL
  createAbsoluteURL(relativeURL) {
    const baseURL = "https://www.chocolate.co.uk";
    return relativeURL ? `${baseURL}${relativeURL}` : "missing";
  }
}
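
To see the class in action, here's a quick usage sketch (the input values are illustrative):

const product = new Product(
  "  Lovely Chocolate ",
  "Sale priceFrom £1.50",
  "/products/100-dark-hot-chocolate-flakes"
);

console.log(product.name);     // "Lovely Chocolate"
console.log(product.priceGBP); // 1.5
console.log(product.priceUSD); // ≈ 1.815
console.log(product.url);      // "https://www.chocolate.co.uk/products/100-dark-hot-chocolate-flakes"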

Process and Store Scraped Data with a Data Pipeline

Now that we have our clean data, we'll use a data pipeline to process this data before saving it. The pipeline will guide the data through several steps, ultimately storing it in a CSV file.

Using data pipelines, we're going to do the following:

  • Identify and remove any duplicate items.
  • Add the processed data to the storage queue.
  • Periodically save the processed data to the CSV file.

Let's first examine the ProductDataPipeline class and its constructor. We define five methods in this ProductDataPipeline class:

  • saveToCSV: Periodically saves the products stored in the pipeline to a CSV file.
  • cleanRawProduct: Cleans scraped data and returns a Product object.
  • isDuplicate: Checks if a product is a duplicate based on its name.
  • addProduct: Cleans the raw scraped data, checks for duplicates, adds unique products to the storage queue, and triggers a save to CSV when the queue is full.
  • closePipeline: Saves any products remaining in the storage queue to the CSV file before the pipeline shuts down.

Within the constructor, five variables are defined, each serving a distinct purpose:

  • namesSeen: This array stores the names of products already processed and is used to check for duplicates.
  • storageQueue: This queue holds products temporarily until a specified storage limit is reached.
  • storageQueueLimit: This variable defines the maximum number of products that can reside in the storageQueue.
  • csvFilename: This variable stores the name of the CSV file used for product data storage.
  • csvFileOpen: This boolean variable tracks whether the CSV file is currently open or closed.

Full Data Pipeline Code

Here's the complete code for the ProductDataPipeline class.

const fs = require('fs');

class ProductDataPipeline {
  constructor(csvFilename = '', storageQueueLimit = 5) {
    this.namesSeen = [];
    this.storageQueue = [];
    this.storageQueueLimit = storageQueueLimit;
    this.csvFilename = csvFilename;
    this.csvFileOpen = false;
  }

  saveToCSV() {
    // Nothing queued; return before touching the file or the open flag
    if (this.storageQueue.length === 0) return;

    this.csvFileOpen = true;
    const productsToSave = [...this.storageQueue];
    this.storageQueue = [];

    const headers = Object.keys(productsToSave[0]);
    const fileExists = fs.existsSync(this.csvFilename);

    // Append to the CSV file, writing the header row only when the file is first created
    const csvWriter = fs.createWriteStream(this.csvFilename, { flags: 'a' });
    if (!fileExists) {
      csvWriter.write(headers.join(',') + '\n');
    }

    productsToSave.forEach(product => {
      const row = headers.map(header => product[header]).join(',');
      csvWriter.write(row + '\n');
    });

    csvWriter.end();
    this.csvFileOpen = false;
  }

  cleanRawProduct(scrapedData) {
    return new Product(
      scrapedData.name || '',
      scrapedData.price || '',
      scrapedData.url || ''
    );
  }

  isDuplicate(product) {
    if (this.namesSeen.includes(product.name)) {
      console.log(`Duplicate item found: ${product.name}. Item dropped.`);
      return true;
    }
    this.namesSeen.push(product.name);
    return false;
  }

  addProduct(scrapedData) {
    const product = this.cleanRawProduct(scrapedData);
    if (!this.isDuplicate(product)) {
      this.storageQueue.push(product);
      // Flush to disk once the queue reaches its limit
      if (this.storageQueue.length >= this.storageQueueLimit && !this.csvFileOpen) {
        this.saveToCSV();
      }
    }
  }

  closePipeline() {
    // If a save is in progress, retry shortly; otherwise flush any leftovers now
    if (this.csvFileOpen) {
      setTimeout(() => this.saveToCSV(), 3000);
    } else if (this.storageQueue.length > 0) {
      this.saveToCSV();
    }
  }
}
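
One caveat worth noting: saveToCSV writes field values verbatim, so a product name containing a comma would break its row. If you hit that case, a small quoting helper along the lines of RFC 4180 (a sketch, not part of the original pipeline) could be applied to each field before joining:

// Hypothetical helper: quote any field containing commas, quotes, or newlines,
// doubling embedded quotes, so the value survives the comma join in saveToCSV
const csvField = (value) => {
  const str = String(value);
  return /[",\n]/.test(str) ? `"${str.replace(/"/g, '""')}"` : str;
};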

Let's test our ProductDataPipeline class:

const dataPipeline = new ProductDataPipeline('product_data.csv');

// Add products to the data pipeline
dataPipeline.addProduct({
  name: 'Lovely Chocolate',
  price: 'Sale priceFrom £1.50',
  url: '/products/100-dark-hot-chocolate-flakes'
});

dataPipeline.addProduct({
  name: 'My Nice Chocolate',
  price: 'Sale priceFrom £4',
  url: '/products/nice-chocolate-flakes'
});

dataPipeline.addProduct({
  name: 'Lovely Chocolate',
  price: 'Sale priceFrom £1.50',
  url: '/products/100-dark-hot-chocolate-flakes'
});

// Close the pipeline when finished - saves data to CSV
dataPipeline.closePipeline();

Here we:

  1. Initialize The Data Pipeline: Creates an instance of ProductDataPipeline with a specified CSV filename.
  2. Add To Data Pipeline: Adds three products to the data pipeline, each with a name, price, and URL. Two products are unique and one is a duplicate.
  3. Close Pipeline When Finished: Closes the pipeline, ensuring all pending data is saved to the CSV file.

The output CSV file will look like this:

name,priceGBP,priceUSD,url
Lovely Chocolate,1.5,1.815,https://www.chocolate.co.uk/products/100-dark-hot-chocolate-flakes
My Nice Chocolate,4,4.84,https://www.chocolate.co.uk/products/nice-chocolate-flakes

Testing Our Data Processing

When we run our code, we should see all the chocolates being crawled, with the price now displaying in both GBP and USD. The relative URL is converted to an absolute URL after our Product class has cleaned the data. The data pipeline has dropped any duplicates and saved the data to the CSV file.

Here’s the snapshot of the completely cleaned and structured data:

CSV Data

Here is the full code with the Product class and the ProductDataPipeline integrated:

const puppeteer = require('puppeteer');
const fs = require('fs');

class Product {
  constructor(name, priceString, url) {
    this.name = this.cleanName(name);
    this.priceGBP = this.cleanPrice(priceString);
    this.priceUSD = this.convertPriceToUSD();
    this.url = this.createAbsoluteURL(url);
  }

  cleanName(name) {
    return name.trim() || "missing";
  }

  cleanPrice(priceString) {
    if (!priceString) return 0.0;
    priceString = priceString.replace(/[^0-9\.]+/g, '');
    return parseFloat(priceString) || 0.0;
  }

  convertPriceToUSD() {
    const exchangeRate = 1.21;
    return this.priceGBP * exchangeRate;
  }

  createAbsoluteURL(relativeURL) {
    const baseURL = "https://www.chocolate.co.uk";
    return relativeURL ? `${baseURL}${relativeURL}` : "missing";
  }
}

class ProductDataPipeline {
  constructor(csvFilename = '', storageQueueLimit = 5) {
    this.namesSeen = [];
    this.storageQueue = [];
    this.storageQueueLimit = storageQueueLimit;
    this.csvFilename = csvFilename;
    this.csvFileOpen = false;
  }

  saveToCSV() {
    if (this.storageQueue.length === 0) return;

    this.csvFileOpen = true;
    const productsToSave = [...this.storageQueue];
    this.storageQueue = [];

    const headers = Object.keys(productsToSave[0]);
    const fileExists = fs.existsSync(this.csvFilename);

    const csvWriter = fs.createWriteStream(this.csvFilename, { flags: 'a' });
    if (!fileExists) {
      csvWriter.write(headers.join(',') + '\n');
    }

    productsToSave.forEach(product => {
      const row = headers.map(header => product[header]).join(',');
      csvWriter.write(row + '\n');
    });

    csvWriter.end();
    this.csvFileOpen = false;
  }

  cleanRawProduct(scrapedData) {
    return new Product(
      scrapedData.name || '',
      scrapedData.price || '',
      scrapedData.url || ''
    );
  }

  isDuplicate(product) {
    if (this.namesSeen.includes(product.name)) {
      console.log(`Duplicate item found: ${product.name}. Item dropped.`);
      return true;
    }
    this.namesSeen.push(product.name);
    return false;
  }

  addProduct(scrapedData) {
    const product = this.cleanRawProduct(scrapedData);
    if (!this.isDuplicate(product)) {
      this.storageQueue.push(product);
      if (this.storageQueue.length >= this.storageQueueLimit && !this.csvFileOpen) {
        this.saveToCSV();
      }
    }
  }

  closePipeline() {
    if (this.csvFileOpen) {
      setTimeout(() => this.saveToCSV(), 3000);
    } else if (this.storageQueue.length > 0) {
      this.saveToCSV();
    }
  }
}

const startScrape = async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const baseURL = 'https://www.chocolate.co.uk/collections/all';
  const dataPipeline = new ProductDataPipeline('product_data.csv');
  let nextPageExists = true;
  let currentPage = baseURL;

  while (nextPageExists) {
    await page.goto(currentPage, { waitUntil: 'networkidle2' });

    // Extract the raw name, price, and relative URL for every product on the page
    const products = await page.evaluate(() => {
      const items = document.querySelectorAll('.product-item');
      return Array.from(items).map(item => ({
        name: item.querySelector('.product-item-meta__title').innerText,
        price: item.querySelector('.price').innerText,
        url: item.querySelector('.product-item-meta a').getAttribute('href')
      }));
    });

    // Clean, deduplicate, and queue each product for storage
    products.forEach(product => dataPipeline.addProduct(product));

    // Returns the absolute URL of the next page, or null on the last page
    nextPageExists = await page.evaluate(() => {
      const nextPage = document.querySelector('a[rel="next"]');
      return nextPage ? nextPage.href : null;
    });

    if (nextPageExists) {
      currentPage = nextPageExists;
    }
  }

  await browser.close();
  dataPipeline.closePipeline();
};

startScrape();

Next Steps

We hope you've gained a solid understanding of the basics of data classes, data pipelines, and periodic data storage in CSV files. If you have any questions, please leave them in the comments below, and we'll do our best to assist you!

In Part 3 of the series, we'll work on storing our data. There are many different ways we can store the data that we scrape, from databases and CSV files to JSON format and S3 buckets.

We'll explore several different ways to store the data and discuss their pros and cons, and in which situations you would use them.