NodeJS Playwright Beginner Series Part 2: Cleaning Dirty Data & Dealing With Edge Cases
In Part 1 of this Node.js Playwright Beginners Series, we learned the basics of scraping with Node.js and built our first Node.js scraper.
Data on the web is often messy or incomplete, which means we need to clean it up and handle missing information to keep our scraper running smoothly.
In Part 2 of our Node.js Playwright Beginner Series, we'll explore how to structure data using a dedicated Product class and enhance our scraper's flexibility with a ProductDataPipeline for managing tasks like scheduling and data storage.
- Strategies to Deal With Edge Cases:
- Structure your Scraped Data with Data Classes
- Process and Store Scraped Data with Data Pipeline
- Testing Our Data Processing
- Next Steps
Node.js Playwright 6-Part Beginner Series
- Part 1: Basic Node.js Playwright Scraper - We'll learn the fundamentals of web scraping with Node.js and build our first scraper using Playwright. (Part 1)
- Part 2: Cleaning Unruly Data & Handling Edge Cases - Web data can be messy and unpredictable. In this part, we'll create a robust scraper using data structures and cleaning techniques to handle these challenges. (This article)
- Part 3: Storing Scraped Data in AWS S3, MySQL & Postgres DBs - Explore various options for storing your scraped data, including databases like MySQL or Postgres, cloud storage like AWS S3, and file formats like CSV and JSON. We'll discuss their pros, cons, and suitable use cases. (Part 3)
- Part 4: Managing Retries & Concurrency - Enhance your scraper's reliability and scalability by handling failed requests and utilizing concurrency. (Part 4)
- Part 5: Faking User-Agents & Browser Headers - Learn how to create a production-ready scraper by simulating real users through user-agent and browser header manipulation. (Part 5)
- Part 6: Using Proxies To Avoid Getting Blocked - Discover how to use proxies to bypass anti-bot systems by disguising your real IP address and location. (Part 6)
Need help scraping the web?
Then check out ScrapeOps, the complete toolkit for web scraping.
Strategies to Deal With Edge Cases
In Part 1 of this series, we used basic trim() and replace() methods to clean data on the fly and returned null when the title or price was missing. While this worked in the short term, it lacked a solid structure and overlooked several important edge cases.
In the case of the chocolate.co.uk website that we’re scraping for this series, if we inspect the data we can see a couple of issues. For example:
- Unclean Price Data: Prices may include extra prefixes like "Sale price" or "Sale priceFrom" that need to be removed.
- Currency Conversion: Prices are provided in British pounds (GBP), but we need them in US dollars (USD).
- Relative URLs: Scraped URLs are relative, so we need to convert them into absolute URLs for direct use.
- Missing Data: The name, price, or URL might be missing, and we need to handle these cases.
Here's a look at some problematic entries from the CSV file generated in Part 1:
Here are several strategies to handle situations like this:
Option | Description |
---|---|
Try/Catch | Wrap parts of your parsers in try/catch blocks. If an error occurs when scraping a specific field, the scraper will switch to an alternative parser. |
Conditional Parsing | Set up your scraper to check the HTML response for certain DOM elements, and apply different parsers based on the situation. |
Data Classes | Use data classes to create structured containers, making your code clearer, reducing repetitive boilerplate, and simplifying data manipulation. |
Data Pipelines | Implement data pipelines to design a series of post-processing steps that clean, manipulate, and validate your data before storing it. |
Clean During Data Analysis | Parse all relevant fields first, then clean and process the data during the analysis phase. |
Each method has its own advantages and drawbacks, so it’s important to be familiar with all of them. This allows you to choose the most suitable approach for your specific scenario.
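To make the first two options in the table more concrete, here's a small illustrative sketch (not part of our final scraper; the selectors and the fallback element are hypothetical) showing a try/catch fallback parser and a conditional check before parsing:

// Illustrative only - selectors and the fallback element are hypothetical.
async function parsePrice(productLocator) {
  try {
    // Primary parser: read the usual price element.
    return await productLocator.locator(".price").innerText();
  } catch (error) {
    // Fallback parser: try an alternative element if the first lookup fails.
    return await productLocator.locator(".price--sale").innerText();
  }
}

async function parseProduct(productLocator) {
  // Conditional parsing: only read the badge if it exists in the DOM.
  const hasBadge = (await productLocator.locator(".badge").count()) > 0;
  const badge = hasBadge
    ? await productLocator.locator(".badge").innerText()
    : null;
  return { price: await parsePrice(productLocator), badge };
}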
For this project, we’ll focus on Data Classes and Data Pipelines as they offer the most structured and efficient way to process data using Playwright.
Here's a system diagram that maps out our code structure, including the Product and ProductDataPipeline classes:
Structure Your Scraped Data with Data Classes
In Part 1, we scraped data (name, price, and URL) and stored it directly in a plain JavaScript object without any formal structure.
In this section, however, we'll implement data classes to create a structured Product
class. The Product
class will help turn raw, unstructured data from the website into a clean and structured object. Instances of this class will contain sanitized data that can be easily converted into formats like CSV, JSON, or others for local storage.
Data classes, implemented in Node.js as plain JavaScript classes, provide an efficient way to structure and manage data in your web scraping tasks. They help streamline the process by organizing scraped elements into clean, reusable data structures.
This approach eliminates repetitive code, enhances readability, and simplifies the handling of common tasks such as parsing and validation of scraped data.
Here's how a new instance will be created by passing unclean raw data to the Product
class:
new Product(rawProduct.name, rawProduct.price, rawProduct.url);
While we're passing three parameters, the resulting instance will have four key properties:
- name: The product name, cleaned of any unwanted characters
- priceGb: The price in British pounds (GBP)
- priceUsd: The price converted to US dollars (USD)
- url: The absolute URL that you can navigate to directly
Here's a look at the Product
class structure:
class Product {
constructor(name, priceStr, url, conversionRate = 1.32) {
this.name = this.cleanName(name);
this.priceGb = this.cleanPrice(priceStr);
this.priceUsd = this.convertPriceToUsd(this.priceGb, conversionRate);
this.url = this.createAbsoluteUrl(url);
}
cleanName(name) {}
cleanPrice(priceStr) {}
convertPriceToUsd(priceGb, conversionRate) {}
createAbsoluteUrl(url) {}
}
We're introducing a fourth parameter to the Product class: conversionRate, which defaults to 1.32, an approximate GBP-to-USD exchange rate at the time of writing. You can update this value as needed or use an API like ExchangeRate-API for dynamic rate updates.
Since it’s a default parameter, you don’t need to specify it when creating an instance of the Product
class unless you want to override the default rate.
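If you do want a live rate instead of the hard-coded 1.32, here's a minimal sketch of how that could look, assuming Node.js 18+ (for the global fetch) and ExchangeRate-API's open endpoint; the exact URL and response shape are assumptions, so check the provider's documentation before relying on it:

// Hypothetical helper: fetch the current GBP -> USD rate, falling back to a
// static value if the request fails. Assumes Node.js 18+ (global fetch) and
// a response shaped like { rates: { USD: <number> } }.
async function getGbpToUsdRate(fallbackRate = 1.32) {
  try {
    const response = await fetch("https://open.er-api.com/v6/latest/GBP");
    const data = await response.json();
    return data?.rates?.USD ?? fallbackRate;
  } catch (error) {
    console.log("Falling back to static rate:", error.message);
    return fallbackRate;
  }
}

// Usage: pass the fetched rate as the fourth constructor argument.
// const rate = await getGbpToUsdRate();
// const product = new Product(rawName, rawPrice, rawUrl, rate);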
You'll notice the use of several methods that we'll define in the upcoming sections. Each method is responsible for handling specific tasks, leading to a cleaner, more modular codebase.
Here's a quick overview of what each method does:
- cleanName(): Cleans up the product name.
- cleanPrice(): Strips unwanted characters from the price string.
- convertPriceToUsd(): Converts the GBP price to USD.
- createAbsoluteUrl(): Converts relative URLs to absolute ones.
Clean the Price
The cleanPrice()
method performs several checks to ensure the price data is valid and clean:
- If the price data is missing or contains only empty spaces, it returns 0.0.
- If the price exists, it removes unnecessary prefixes (e.g. "Sale price£" and "Sale priceFrom £") and trims any extra spaces.
- Finally, it attempts to convert the cleaned price string to a floating-point number. If the conversion fails, it returns 0.0.
Here’s the method:
cleanPrice(priceStr) {
if (!priceStr?.trim()) {
return 0.0;
}
const cleanedPrice = priceStr
.replace(/Sale priceFrom £|Sale price£/g, "")
.trim();
return cleanedPrice ? parseFloat(cleanedPrice) : 0.0;
}
In the script above:
- We used optional chaining (?.) in !priceStr?.trim(), which ensures that trim() is only called if priceStr exists. This feature, available in modern JavaScript (and therefore Node.js), prevents errors when accessing properties of null or undefined.
- The replace(/Sale priceFrom £|Sale price£/g, "") call uses a regular expression to detect and remove the unwanted prefixes ("Sale priceFrom £" and "Sale price£") from the price string.
- The parseFloat() function is used because the price value extracted from the web is a string, so it needs to be converted into a floating-point number for numeric calculations.
- The conditional return cleanedPrice ? parseFloat(cleanedPrice) : 0.0 ensures that if the cleaned price string is empty or non-numeric, the method returns 0.0 instead of attempting an invalid conversion.
The optional chaining (?.) operator accesses an object's property or calls a function. If the object accessed or function called using this operator is undefined or null, the expression short circuits and evaluates to undefined instead of throwing an error - (Source: MDN)
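If you want to sanity-check the cleaning logic on its own before wiring it into the class, a quick standalone snippet like this (the sample values are made up) mirrors what cleanPrice() does:

// Quick standalone check of the same cleaning logic used by cleanPrice().
const samples = ["Sale priceFrom £3.00", "Sale price£9.95", "   ", null];
for (const s of samples) {
  const cleaned = s?.replace(/Sale priceFrom £|Sale price£/g, "").trim();
  console.log(s, "->", cleaned ? parseFloat(cleaned) : 0.0);
}
// "Sale priceFrom £3.00" -> 3, "Sale price£9.95" -> 9.95, "   " -> 0, null -> 0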
Convert the Price
The convertPriceToUsd()
method takes the price in GBP and converts it to USD using the current exchange rate (1.32 in our case).
Here's how:
convertPriceToUsd(priceGb, conversionRate) {
return priceGb * conversionRate;
}
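Keep in mind that floating-point multiplication can leave you with long decimal tails. If you'd prefer the USD value rounded to two decimal places, an optional variant (not used in the rest of this article) could look like this:

// Optional variant: round the converted price to two decimal places so the
// CSV values stay tidy. Not used elsewhere in this article.
convertPriceToUsd(priceGb, conversionRate) {
  return Math.round(priceGb * conversionRate * 100) / 100;
}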
Clean the Name
The cleanName()
method performs the following checks:
- If the name is missing or contains only spaces, it returns "missing".
- Otherwise, it returns the trimmed and cleaned name.
cleanName(name) {
return name?.trim() || "missing";
}
Convert Relative to Absolute URL
The createAbsoluteUrl()
method performs the following checks:
- If the URL is missing or consists only of empty spaces, it returns "missing".
- Otherwise, it returns the trimmed URL prefixed with https://www.chocolate.co.uk
createAbsoluteUrl(url) {
return (url?.trim()) ? `https://www.chocolate.co.uk${url.trim()}` : "missing";
}
This code will convert "/products/almost-perfect" to "https://www.chocolate.co.uk/products/almost-perfect", providing a navigable link.
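As an aside, Node's built-in URL class can do the same job and also copes with hrefs that are already absolute. A possible variant (again, not the version used in this article's code) would be:

// Alternative sketch using the WHATWG URL class built into Node.js. It
// handles both relative paths and already-absolute URLs.
createAbsoluteUrl(url) {
  if (!url?.trim()) {
    return "missing";
  }
  return new URL(url.trim(), "https://www.chocolate.co.uk").href;
}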
Here's a snapshot of the data that will be returned from the Product data class. It consists of name, priceGb, priceUsd, and url.
Complete Code for the Data Class
Now that we've defined all our methods, let's take a look at the complete code for Product
class.
class Product {
constructor(name, priceStr, url, conversionRate = 1.32) {
this.name = this.cleanName(name);
this.priceGb = this.cleanPrice(priceStr);
this.priceUsd = this.convertPriceToUsd(this.priceGb, conversionRate);
this.url = this.createAbsoluteUrl(url);
}
cleanName(name) {
return name?.trim() || "missing";
}
cleanPrice(priceStr) {
if (!priceStr?.trim()) {
return 0.0;
}
const cleanedPrice = priceStr
.replace(/Sale priceFrom £|Sale price£/g, "")
.trim();
return cleanedPrice ? parseFloat(cleanedPrice) : 0.0;
}
convertPriceToUsd(priceGb, conversionRate) {
return priceGb * conversionRate;
}
createAbsoluteUrl(url) {
return (url?.trim()) ? `https://www.chocolate.co.uk${url.trim()}` : "missing";
}
}
Let's test if our Product
class works as expected by creating a new instance with some messy data and checking if it cleans it up:
const p = new Product(
"Almost Perfect",
"Sale priceFrom £3.00",
"/products/almost-perfect");
console.log(p);
// Product {
// name: 'Almost Perfect',
// priceGb: 3,
// priceUsd: 3.96,
// url: 'https://www.chocolate.co.uk/products/almost-perfect'
// }
This output is exactly what we anticipated. Next, we'll dive into the ProductDataPipeline class, where we'll implement the core logic.
Process and Store Scraped Data with Data Pipeline
A Pipeline refers to a sequence of steps where data moves through various stages, getting transformed and processed at each step. It’s a common pattern in programming for organizing tasks efficiently.
Here’s how our ProductDataPipeline
will operate:
- Take raw product data
- Clean and structure the data
- Filter out duplicates
- Queue the product for storage
- Save data to CSV
- Perform final cleanup
Let's take a look at the overall structure of ProductDataPipeline:
class ProductDataPipeline {
constructor(csvFilename = "", storageQueueLimit = 5) {
this.seenProducts = new Set();
this.storageQueue = [];
this.storageQueueLimit = storageQueueLimit;
this.csvFilename = csvFilename;
this.csvFileOpen = false;
}
saveToCsv() {}
cleanRawProduct(rawProduct) {}
isDuplicateProduct(product) {}
addProduct(rawProduct) {}
async close() {}
}
The class above requires only two parameters, but it defines five properties, each serving a distinct purpose that will become clearer as we proceed. Here's an overview of these properties:
- seenProducts: A Set that checks for duplicates, as a set automatically rejects any repeated values.
- storageQueue: A queue that temporarily holds products until the storageQueueLimit is reached.
- storageQueueLimit: An integer representing the maximum number of products allowed in the storageQueue. This value is passed as an argument when creating an instance of the class.
- csvFilename: The name of the CSV file where the product data will be stored. This value is also passed as an argument when creating an instance of the class.
- csvFileOpen: A boolean flag to track whether the CSV file is currently open or closed, which will be useful in the addProduct() and saveToCsv() methods you'll see in later sections.
Similarly, there are five key methods that process and store our data as it moves through the pipeline. Here's a brief overview of each:
- saveToCsv(): Periodically writes the products stored in the storageQueue to a CSV file once the storageQueueLimit is reached.
- cleanRawProduct(): Cleans the raw data extracted from the web and converts it into a Product instance to structure and sanitize it.
- isDuplicateProduct(): Checks if the product already exists in the seenProducts set to avoid duplicate entries.
- addProduct(): Cleans, checks for duplicates, and adds the product to the pipeline. If the queue limit is reached, it saves the data to CSV.
- close(): An async method that ensures any remaining queued data is saved to the file before closing the pipeline.
Clean the Product Data
We’ve already covered how to clean data using the Product
class. Here, we simply apply that by taking the raw data and creating an instance of the Product
class:
cleanRawProduct(rawProduct) {
return new Product(rawProduct.name, rawProduct.price, rawProduct.url);
}
Add the Product
The addProduct()
method processes each product in a structured way:
- First, it cleans the raw product data by converting it into a Product instance using the cleanRawProduct() method.
- Then, it checks if the product is a duplicate using the isDuplicateProduct() method, and if it isn't, the product is added to the storageQueue.
- If the storageQueue reaches its defined limit and the CSV file isn't already open, the saveToCsv() method is triggered to save the queued data.
Here is the code:
addProduct(rawProduct) {
const product = this.cleanRawProduct(rawProduct);
if (!this.isDuplicateProduct(product)) {
this.storageQueue.push(product);
if (
this.storageQueue.length >= this.storageQueueLimit &&
!this.csvFileOpen
) {
this.saveToCsv();
}
}
}
Check for Duplicate Product
To ensure we don't add duplicate products to the storageQueue
, we need a way to uniquely identify each product.
We'll use the URL of the products for this purpose, as it is unique to each product—even if two products have the same price.
Here’s how it works:
- When adding a product, its URL is added to the seenProducts set.
- The isDuplicateProduct() method checks if the product's URL is already in the seenProducts set.
- If the URL is not found, it indicates that the product is new, and we add the URL to the set and return false.
- If the URL is found, it means the product is a duplicate, so we return true.
isDuplicateProduct(product) {
if (!this.seenProducts.has(product.url)) {
this.seenProducts.add(product.url);
return false;
}
return true;
}
Periodically Save Data to CSV
Saving all the data to a CSV file at once could result in data loss if an error or interruption occurs during processing.
To mitigate this risk, we use a periodic approach where data is saved to the CSV file as soon as the storageQueue
reaches its default limit of 5 items.
This way, if something goes wrong, only the latest batch of data is at risk, not the entire dataset. This method improves efficiency and data integrity.
In the saveToCsv()
method:
- We determine if the CSV file already exists. If it does, the headers are assumed to be present.
- If the file does not exist, we write the headers ("name,priceGb,priceUsd,url\n") since headers should only be written once at the top of the file.
- Then we add the product data from the storageQueue to the CSV file using the file.write() method.
- After writing all the data, we close the file with the file.end() method and set csvFileOpen to false to indicate that the CSV operations are complete.
Here's the code for saveToCsv():
saveToCsv() {
this.csvFileOpen = true;
const fileExists = fs.existsSync(this.csvFilename);
const file = fs.createWriteStream(this.csvFilename, { flags: "a" });
if (!fileExists) {
file.write("name,priceGb,priceUsd,url\n");
}
for (const product of this.storageQueue) {
file.write(
`${product.name},${product.priceGb},${product.priceUsd},${product.url}\n`
);
}
file.end();
this.storageQueue = [];
this.csvFileOpen = false;
}
In the code above, we utilized four methods from Node.js' fs
module:
- existsSync(filename): This method checks if a file exists synchronously, returning true if the file is found, and false otherwise.
- createWriteStream(filename, { flags: "a" }): Opens a writable stream with the option to append data ({ flags: "a" }), ensuring new content is added without overwriting existing data.
- write(data): Writes data to the stream, allowing content to be appended line by line when working with file streams.
- end(): Closes the writable stream, ensuring that all buffered data is flushed to the file and the file is properly closed. This should be called when no more data will be written.
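One subtlety worth knowing: createWriteStream() buffers its writes, and file.end() returns before the data is guaranteed to be flushed, yet saveToCsv() sets csvFileOpen back to false immediately. That's fine for this tutorial, but if you want the flag to flip only once the stream has actually finished, a variant like the following sketch (an alternative, not the version used in the rest of this article) passes a callback to end():

// Variant sketch: flip csvFileOpen back to false only after the stream has
// flushed its buffered data. Not the version used elsewhere in this article.
saveToCsv() {
  this.csvFileOpen = true;
  const fileExists = fs.existsSync(this.csvFilename);
  const file = fs.createWriteStream(this.csvFilename, { flags: "a" });
  if (!fileExists) {
    file.write("name,priceGb,priceUsd,url\n");
  }
  for (const product of this.storageQueue) {
    file.write(
      `${product.name},${product.priceGb},${product.priceUsd},${product.url}\n`
    );
  }
  this.storageQueue = [];
  // The callback passed to end() runs once all buffered data has been flushed.
  file.end(() => {
    this.csvFileOpen = false;
  });
}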
Closing the Pipeline
When the close()
method is called, it ensures that the pipeline completes all of its tasks. However, there might still be some products left in the storageQueue
, which haven’t been saved to the CSV file yet.
We handle this by writing any remaining data to the CSV before closing.
async close() {
while (this.csvFileOpen) {
// Wait for the file to be written
await new Promise((resolve) => setTimeout(resolve, 1000));
}
if (this.storageQueue.length > 0) {
this.saveToCsv();
}
}
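Because close() is async, it should be awaited from an async context so your program doesn't move on (or exit) before the final batch is written. A minimal usage sketch:

// Minimal usage sketch: await close() so the last batch is flushed before
// the program moves on.
async function run() {
  const pipeline = new ProductDataPipeline("chocolate.csv", 5);
  // ...pipeline.addProduct(...) calls go here...
  await pipeline.close();
}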
Full Data Pipeline Code
Here, we’ve combined all the methods we defined in the previous sections. This is how our complete ProductDataPipeline
class looks:
class ProductDataPipeline {
constructor(csvFilename = "", storageQueueLimit = 5) {
this.seenProducts = new Set();
this.storageQueue = [];
this.csvFilename = csvFilename;
this.csvFileOpen = false;
this.storageQueueLimit = storageQueueLimit;
}
saveToCsv() {
this.csvFileOpen = true;
const fileExists = fs.existsSync(this.csvFilename);
const file = fs.createWriteStream(this.csvFilename, { flags: "a" });
if (!fileExists) {
file.write("name,priceGb,priceUsd,url\n");
}
for (const product of this.storageQueue) {
file.write(
`${product.name},${product.priceGb},${product.priceUsd},${product.url}\n`
);
}
file.end();
this.storageQueue = [];
this.csvFileOpen = false;
}
cleanRawProduct(rawProduct) {
return new Product(rawProduct.name, rawProduct.price, rawProduct.url);
}
isDuplicateProduct(product) {
if (!this.seenProducts.has(product.url)) {
this.seenProducts.add(product.url);
return false;
}
return true;
}
addProduct(rawProduct) {
const product = this.cleanRawProduct(rawProduct);
if (!this.isDuplicateProduct(product)) {
this.storageQueue.push(product);
if (
this.storageQueue.length >= this.storageQueueLimit &&
!this.csvFileOpen
) {
this.saveToCsv();
}
}
}
async close() {
while (this.csvFileOpen) {
// Wait for the file to be written
await new Promise((resolve) => setTimeout(resolve, 1000));
}
if (this.storageQueue.length > 0) {
this.saveToCsv();
}
}
}
Now, let's test our pipeline to see if it works as expected.
We'll manually add the data extracted in Part 1 of this series, and after passing it through our pipeline, we'll save it to a file named "chocolate.csv":
const fs = require("fs");
class Product {
// Code for Product class
}
class ProductDataPipeline {
// Code for ProductDataPipeline
}
const pipeline = new ProductDataPipeline("chocolate.csv", 5);
// Add to data pipeline
pipeline.addProduct({
name: "Lovely Chocolate",
price: "Sale priceFrom £1.50",
url: "/products/100-dark-hot-chocolate-flakes",
});
// Add to data pipeline
pipeline.addProduct({
name: "My Nice Chocolate",
price: "Sale priceFrom £4",
url: "/products/nice-chocolate-flakes",
});
// Add to duplicate data pipeline
pipeline.addProduct({
name: "Lovely Chocolate",
price: "Sale priceFrom £1.50",
url: "/products/100-dark-hot-chocolate-flakes",
});
// Close pipeline when finished - saves data to CSV
pipeline.close();
CSV file output:
name,priceGb,priceUsd,url
Lovely Chocolate,1.5,1.98,https://www.chocolate.co.uk/products/100-dark-hot-chocolate-flakes
My Nice Chocolate,4,5.28,https://www.chocolate.co.uk/products/nice-chocolate-flakes
In the above example, we:
- Imported the fs module.
- Defined the Product and ProductDataPipeline classes.
- Created a new pipeline instance.
- Added three unclean products, two of which are duplicates, to test the pipeline's handling of duplicates.
- Closed the pipeline to finish processing.
The output shows that the pipeline successfully cleaned the data, ignored duplicates, and saved the cleaned data to a file named "chocolate.csv" in our current directory.
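One caveat before moving on: saveToCsv() builds each row with simple string interpolation, so a product name that happens to contain a comma would shift the CSV columns. A small optional hardening sketch (not used in this article's code) would quote each field:

// Optional hardening sketch: quote fields and escape embedded quotes so
// values containing commas don't break the CSV columns.
function toCsvField(value) {
  return `"${String(value).replace(/"/g, '""')}"`;
}

// e.g. inside saveToCsv():
// file.write(
//   [product.name, product.priceGb, product.priceUsd, product.url]
//     .map(toCsvField)
//     .join(",") + "\n"
// );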
Testing Our Data Processing
Now, let’s bring everything together by testing the complete code from Part 1 and Part 2 to ensure it scrapes, cleans, and stores all the data from chocolate.co.uk without any errors.
Below is the full code, including the scrape()
and nextPage()
methods from Part 1.
The scrape()
method has been slightly modified to reflect the Product
and ProductDataPipeline
classes, but the changes are self-explanatory, so we won’t dive into the details here:
const { chromium } = require('playwright');
const fs = require('fs');
class Product {
constructor(name, priceStr, url, conversionRate = 1.32) {
this.name = this.cleanName(name);
this.priceGb = this.cleanPrice(priceStr);
this.priceUsd = this.convertPriceToUsd(this.priceGb, conversionRate);
this.url = this.createAbsoluteUrl(url);
}
cleanName(name) {
return name?.trim() || "missing";
}
cleanPrice(priceStr) {
if (!priceStr?.trim()) {
return 0.0;
}
const cleanedPrice = priceStr
.replace(/Sale priceFrom £|Sale price£/g, "")
.trim();
return cleanedPrice ? parseFloat(cleanedPrice) : 0.0;
}
convertPriceToUsd(priceGb, conversionRate) {
return priceGb * conversionRate;
}
createAbsoluteUrl(url) {
return (url?.trim()) ? `https://www.chocolate.co.uk${url.trim()}` : "missing";
}
}
class ProductDataPipeline {
constructor(csvFilename = "", storageQueueLimit = 5) {
this.seenProducts = new Set();
this.storageQueue = [];
this.csvFilename = csvFilename;
this.csvFileOpen = false;
this.storageQueueLimit = storageQueueLimit;
}
saveToCsv() {
this.csvFileOpen = true;
const fileExists = fs.existsSync(this.csvFilename);
const file = fs.createWriteStream(this.csvFilename, { flags: "a" });
if (!fileExists) {
file.write("name,priceGb,priceUsd,url\n");
}
for (const product of this.storageQueue) {
file.write(
`${product.name},${product.priceGb},${product.priceUsd},${product.url}\n`
);
}
file.end();
this.storageQueue = [];
this.csvFileOpen = false;
}
cleanRawProduct(rawProduct) {
return new Product(rawProduct.name, rawProduct.price, rawProduct.url);
}
isDuplicateProduct(product) {
if (!this.seenProducts.has(product.url)) {
this.seenProducts.add(product.url);
return false;
}
return true;
}
addProduct(rawProduct) {
const product = this.cleanRawProduct(rawProduct);
if (!this.isDuplicateProduct(product)) {
this.storageQueue.push(product);
if (
this.storageQueue.length >= this.storageQueueLimit &&
!this.csvFileOpen
) {
this.saveToCsv();
}
}
}
async close() {
while (this.csvFileOpen) {
// Wait for the file to be written
await new Promise((resolve) => setTimeout(resolve, 1000));
}
if (this.storageQueue.length > 0) {
this.saveToCsv();
}
}
}
const listOfUrls = ["https://www.chocolate.co.uk/collections/all"];
async function scrape() {
const pipeline = new ProductDataPipeline("chocolate.csv", 5);
const browser = await chromium.launch({ headless: true });
const context = await browser.newContext();
const page = await context.newPage();
for (let url of listOfUrls) {
console.log(`Scraping: ${url}`);
await page.goto(url);
const productItems = await page.$$eval("product-item", items =>
items.map(item => {
const titleElement = item.querySelector(".product-item-meta__title");
const priceElement = item.querySelector(".price");
return {
title: titleElement ? titleElement.textContent.trim() : null,
price: priceElement ? priceElement.textContent.trim() : null,
url: titleElement ? titleElement.getAttribute("href") : null
};
})
);
for (const rawProduct of productItems) {
if (rawProduct.title && rawProduct.price && rawProduct.url) {
pipeline.addProduct({
name: rawProduct.title,
price: rawProduct.price,
url: rawProduct.url
});
}
}
await nextPage(page);
}
await pipeline.close();
await browser.close();
}
async function nextPage(page) {
let nextUrl;
try {
nextUrl = await page.$eval("a.pagination__nav-item:nth-child(4)", item => item.href);
} catch (error) {
console.log('Last Page Reached');
return;
}
listOfUrls.push(nextUrl);
}
(async () => {
await scrape();
})();
// Scraping: https://www.chocolate.co.uk/collections/all
// Scraping: https://www.chocolate.co.uk/collections/all?page=2
// Scraping: https://www.chocolate.co.uk/collections/all?page=3
// Last Page Reached
After running the code, we should see all the pages from chocolate.co.uk being scraped, with prices displayed in both GBP and USD. The relative URLs are converted to absolute URLs after passing through our Product class, and the data pipeline has successfully removed any duplicates and saved the clean data into the CSV file.
Here’s a screenshot of the fully cleaned and structured data:
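One final note on nextPage(): the a.pagination__nav-item:nth-child(4) selector is tied to the site's current pagination markup and may break if the layout changes. A more defensive sketch (the a[rel="next"] selector is an assumption and should be verified against the live page) could look like this:

// Hypothetical, more defensive variant of nextPage() - verify the selector
// against the live pagination markup before using it.
async function nextPage(page) {
  const nextLink = page.locator('a[rel="next"]');
  if ((await nextLink.count()) > 0) {
    // Reading .href on the element gives the absolute URL, as in the original.
    listOfUrls.push(await nextLink.first().evaluate((a) => a.href));
  } else {
    console.log("Last Page Reached");
  }
}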
Next Steps
We hope you've gained a solid understanding of the basics of data classes, data pipelines, and periodic data storage in CSV files. If you have any questions, feel free to leave them in the comments, and we’ll be happy to help!
In Part 3 of this series, we’ll explore how to store our data in different formats, such as JSON, and how to use databases like PostgreSQL, MySQL, and Amazon S3 buckets for storage.
We’ll also dive into the pros and cons of each storage method. Stay tuned!