Node.js Playwright Beginners Series Part 3: Storing Data
In Part 1 and Part 2 of our Node.js Playwright Beginner Series, we built a scraper to extract data and learned how to clean, structure, and store it in CSV format using data classes and pipelines.
In Part 3, we’ll expand on that by showing how to store data in other formats like JSON, and databases such as PostgreSQL, MySQL, and cloud storage solutions like AWS S3 buckets.
- Saving Data to a JSON File
- Saving Data to Amazon S3 Storage
- Saving Data to MySQL Database
- Saving Data to Postgres Database
- Complete Code
- Next Steps
Node.js Playwright 6-Part Beginner Series
- Part 1: Basic Node.js Playwright Scraper - We’ll learn the fundamentals of web scraping with Node.js and build your first scraper using Playwright. (Part 1)
- Part 2: Cleaning Unruly Data & Handling Edge Cases - Web data can be messy and unpredictable. In this part, we’ll create a robust scraper using data structures and cleaning techniques to handle these challenges. (Part 2)
- Part 3: Storing Scraped Data in AWS S3, MySQL & Postgres DBs - Explore various options for storing your scraped data, including databases like MySQL or Postgres, cloud storage like AWS S3, and file formats like CSV and JSON. We’ll discuss their pros, cons, and suitable use cases. (This Article)
- Part 4: Managing Retries & Concurrency - Enhance your scraper's reliability and scalability by handling failed requests and utilizing concurrency. (Part 4)
- Part 5: Faking User-Agents & Browser Headers - Learn how to create a production-ready scraper by simulating real users through user-agent and browser header manipulation. (Part 5)
- Part 6: Using Proxies To Avoid Getting Blocked - Discover how to use proxies to bypass anti-bot systems by disguising your real IP address and location. (Part 6)
Saving Data to a JSON File
💡 Note: All methods for storing data in JSON files, databases, or S3 buckets will be implemented within the ProductDataPipeline class, which was created in Part 2 of this series.
To save our data in JSON format, we’ll create a saveToJson() method similar to the saveToCsv() method in Part 2. Here’s how it works:
- First, we check if the storageQueue is empty and return early if there’s no data to add.
- Just like in saveToCsv(), we check if the file already exists because we don’t want to overwrite existing data.
- If the file exists, we read it using readFileSync() from the fs module, parse its content, and store it in a variable called existingData.
- Then, we merge the existing data with the new data in storageQueue using the spread operator (...), which combines both arrays into a new array, mergedData.
- Finally, we write the combined data to the JSON file using writeFileSync().
Here’s the complete code:
saveToJson() {
  // Nothing to save if the storage queue is empty
  if (this.storageQueue.length <= 0) {
    return;
  }
  // Read any previously saved products so we don't overwrite them
  const fileExists = fs.existsSync(this.jsonFileName);
  let existingData = [];
  if (fileExists) {
    const fileContent = fs.readFileSync(this.jsonFileName, "utf8");
    existingData = JSON.parse(fileContent);
  }
  // Merge old and new records, then write the combined array back out
  const mergedData = [...existingData, ...this.storageQueue];
  fs.writeFileSync(this.jsonFileName, JSON.stringify(mergedData, null, 2));
}
Here’s what our JSON file will look like after incorporating this method into our ProductDataPipeline class and running the code:
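Each entry in chocolate.json mirrors the fields of our Product class (name, priceGb, priceUsd, url). With made-up placeholder values rather than real scraped output, an entry looks roughly like this:
[
  {
    "name": "Example Chocolate Bar",
    "priceGb": 5.5,
    "priceUsd": 7.26,
    "url": "https://www.chocolate.co.uk/products/example-chocolate-bar"
  }
]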
Saving Data to Amazon S3 Storage
Amazon S3 (Simple Storage Service) allows you to store data as objects in buckets. Each S3 object contains the data file along with metadata, making it a flexible storage solution.
In this section, we’ll demonstrate how to upload the CSV file we created in Part 2, "chocolate.csv", to an Amazon S3 bucket.
Before we begin, you'll need to set up an S3 bucket. If you're unfamiliar with how to do this, AWS provides comprehensive instructions on creating an S3 bucket.
Once your bucket is ready, you’ll also receive access keys that will allow your scraper to authenticate and interact with AWS.
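Hardcoding the keys is fine for a quick test, but the AWS SDK for JavaScript v3 can also pick up credentials on its own from the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables or from the shared ~/.aws/credentials file. A minimal sketch of that approach, assuming the environment variables are already set:
const { S3Client } = require("@aws-sdk/client-s3");

// No credentials passed explicitly: the SDK falls back to its default provider
// chain (environment variables, shared credentials file, IAM role, etc.)
const client = new S3Client({ region: "us-east-1" });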
Next, you need to install the AWS SDK to interact with your S3 bucket. Run the following command in your Node.js project:
npm i @aws-sdk/client-s3
This will install the necessary modules for connecting to S3 and uploading files.
We’ll create a new method, saveToS3Bucket(), to handle uploading our CSV data to S3. Here’s how the method works:
- We’ll use the S3Client and PutObjectCommand from the AWS SDK to manage our connection and upload process.
- The client is initialized with the region and credentials (such as accessKeyId and secretAccessKey), which you can find in your AWS S3 dashboard.
- We’ll check if the "chocolate.csv" file exists locally before proceeding. If it doesn't, we wait for a second to ensure all data has been written to the file.
- Once the file is ready, we upload it using the send() method of the S3Client. The params object includes:
  - Bucket: The name of your S3 bucket.
  - Key: The name of the file we are uploading ("chocolate.csv").
  - Body: The file content, which is read using the fs.createReadStream() method.
- A retry mechanism is included to ensure the file is available before the upload process begins.
Here’s the complete code:
async saveToS3Bucket() {
  // Create an S3 client for the target region using your access keys
  const client = new S3Client({
    region: "us-east-1",
    credentials: {
      accessKeyId: "YOUR_ACCESS_KEY_ID",
      secretAccessKey: "YOUR_SECRET_ACCESS_KEY",
    },
  });
  // Wait (up to 3 seconds) for the CSV file to appear on disk before uploading
  let retry = 3;
  while (retry > 0 && !fs.existsSync(this.csvFilename)) {
    await new Promise((resolve) => setTimeout(resolve, 1000));
    retry -= 1;
  }
  // Upload the CSV file to the configured bucket
  const params = {
    Bucket: this.s3Bucket,
    Key: this.csvFilename,
    Body: fs.createReadStream(this.csvFilename),
  };
  await client.send(new PutObjectCommand(params));
}
Once the data has been uploaded, you can confirm it in the AWS S3 dashboard. The file will be listed with its name, size, and any associated metadata.
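If you’d rather confirm the upload from code than from the console, a small sketch using HeadObjectCommand from the same @aws-sdk/client-s3 package can do it. The bucket name and key below match the values used elsewhere in this guide, and checkUpload() is just a hypothetical helper:
const { S3Client, HeadObjectCommand } = require("@aws-sdk/client-s3");

// Hypothetical helper: confirms chocolate.csv now exists in the bucket
async function checkUpload() {
  const client = new S3Client({ region: "us-east-1" });
  // HeadObject returns the object's metadata without downloading the file itself
  const response = await client.send(
    new HeadObjectCommand({ Bucket: "chocolate-bucket", Key: "chocolate.csv" })
  );
  console.log(`Size: ${response.ContentLength} bytes, uploaded: ${response.LastModified}`);
}

checkUpload().catch(console.error);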
Saving Data to MySQL Database
MySQL is an open-source relational database, now maintained by Oracle, that uses SQL (Structured Query Language) to interact with and manage data. If you've worked with databases before, SQL is likely already familiar to you.
Downloading MySQL
To get started, you need to install MySQL depending on your operating system. Use the following links for the installation guide:
- Windows: MySQL Installation
- Mac: MySQL Installation
- Ubuntu: MySQL Installation
Once MySQL is installed, you can access it from your terminal by running:
mysql -u username -p
Make sure to replace username with your MySQL username (e.g., root).
Creating MySQL Table
After entering your password, create a new database for storing our product data:
CREATE DATABASE chocolate;
To switch to this new database, use:
USE chocolate;
Now, let’s create a table to store the product data. Execute the following SQL query in the terminal:
CREATE TABLE IF NOT EXISTS chocolate_products (
id INT AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(255),
price_gb FLOAT,
price_usd FLOAT,
url TEXT
);
This query will create a table named chocolate_products with the following columns:
- id: An auto-incremented integer, serving as the unique identifier for each product.
- name: A VARCHAR field to store the product name (up to 255 characters).
- price_gb: A FLOAT field to store the product's price in GBP.
- price_usd: A FLOAT field for the price in USD.
- url: A TEXT field for the product’s URL.
Connecting Scraper To MySQL Database
To interact with MySQL from Node.js, you need to install the MySQL client library (mysql). Run the following command to install it:
npm i mysql
This package provides us with essential methods for database interactions, such as:
- createConnection(): Establishes a connection to the MySQL database.
- connect(): Opens the connection to MySQL.
- query(): Executes SQL queries.
- end(): Closes the connection to the database.
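Before wiring these into the pipeline, it can be worth confirming the connection works on its own. Here’s a minimal sketch; the credentials are placeholders, so swap in whatever you configured when installing MySQL:
const mysql = require("mysql");

// Placeholder credentials: adjust to match your local MySQL setup
const connection = mysql.createConnection({
  host: "localhost",
  user: "root",
  password: "password",
  database: "chocolate", // the database created above
});

connection.connect((err) => {
  if (err) {
    console.error("Error connecting to database: ", err);
    return;
  }
  // A trivial query just to prove the connection is alive
  connection.query("SELECT 1 + 1 AS result", (err, results) => {
    if (!err) {
      console.log(results[0].result); // 2
    }
    connection.end();
  });
});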
Now we’ll create a new method, saveToMySQL(), in our ProductDataPipeline class. This method will insert product data from the storageQueue into the MySQL database we set up earlier. Here’s how it works:
- If storageQueue is empty, there is no data to save, so we simply return.
- We use createConnection() and connect() to connect to the MySQL database, passing the host, username, password, and database name.
- We build an SQL INSERT query that adds the product data (name, price in GBP, price in USD, and URL) to the chocolate_products table. This query is executed with connection.query().
- Since the MySQL library doesn’t support async/await natively, we wrap the method in a Promise to make it compatible with the rest of our async code.
- After the data is inserted, the connection to the database is closed.
Here’s the full implementation:
saveToMySQL() {
  if (this.storageQueue.length <= 0) {
    return;
  }
  // Wrap the callback-based mysql API in a Promise so it works with async/await
  return new Promise((resolve, reject) => {
    const connection = mysql.createConnection({
      host: "localhost",
      user: "root",
      password: "password",
      database: this.mysqlDbName,
    });
    connection.connect((err) => {
      if (err) {
        console.error("Error connecting to database: ", err);
        reject(err);
        return;
      }
      // Bulk-insert every queued product in a single query
      const query = "INSERT INTO chocolate_products (name, price_gb, price_usd, url) VALUES ?";
      const values = this.storageQueue.map((product) => [
        product.name,
        product.priceGb,
        product.priceUsd,
        product.url,
      ]);
      connection.query(query, [values], (err, results) => {
        if (err) {
          console.error("Error inserting data into database: ", err);
          connection.end();
          reject(err);
        } else {
          connection.end();
          resolve(results);
        }
      });
    });
  });
}
After running the above method, you’ll see that your product data has been successfully inserted into the chocolate_products table.
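You can double-check from the mysql shell with a quick query; the LIMIT just keeps the output short:
SELECT * FROM chocolate_products LIMIT 5;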
Saving Data to Postgres Database
PostgreSQL is an open-source relational database that supports advanced data types and performance optimization features.
Downloading Postgres
If you don’t already have PostgreSQL set up, you’ll need to install it and create a table in your database to store product data.
- Windows: PostgreSQL Installation
- Mac: PostgreSQL Installation
- Ubuntu: PostgreSQL Installation
Creating Postgres Table
First, log in to your PostgreSQL database from the terminal by running:
psql -U username -d databasename
Replace username with your PostgreSQL username (e.g., postgres) and databasename with your database name. If the chocolate_products table doesn’t exist, create it by executing the following SQL query:
CREATE TABLE IF NOT EXISTS chocolate_products (
id SERIAL PRIMARY KEY,
name VARCHAR(255),
price_gb FLOAT,
price_usd FLOAT,
url TEXT
);
This will create the table with the following structure:
- id: A unique identifier for each product, automatically incremented.
- name: A string column (max 255 characters) for the product's name.
- price_gb: A FLOAT column to store the price in GBP.
- price_usd: A FLOAT column for the price in USD.
- url: A TEXT column for the product’s URL.
Connecting Scraper To Postgres Database
To interact with PostgreSQL in Node.js, install the pg package by running:
npm install pg
This package allows us to connect to PostgreSQL, run queries, and manage data.
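As with MySQL, it can be worth confirming the connection works before touching the pipeline. The sketch below uses the same placeholder credentials (postgres / mysecretpassword on localhost:5432) that appear in the saveToPostgres() method later in this section, so adjust them to your own setup:
const { Client } = require("pg");

async function testConnection() {
  // Placeholder credentials: adjust to match your local PostgreSQL setup
  const client = new Client({
    user: "postgres",
    host: "localhost",
    database: "chocolate_db",
    password: "mysecretpassword",
    port: 5432,
  });
  await client.connect();
  const res = await client.query("SELECT NOW() AS connected_at");
  console.log(res.rows[0]);
  await client.end();
}

testConnection().catch(console.error);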
The saveToPostgres() method will handle saving product data from the storageQueue to the PostgreSQL database. Here’s how it works:
- Similar to MySQL, we first check if there’s any data in the storageQueue. If not, the method simply returns.
- We use the Client class from the pg package to create a connection to the PostgreSQL database. The connection details (username, password, host, etc.) are passed in as part of the configuration.
- Since the pg package doesn’t provide a bulk insert out of the box, we iterate through the storageQueue and insert each product one by one, using parameterized queries ($1, $2, etc.) to avoid SQL injection.
- We wrap the logic in a try/catch block to handle any potential errors during database interactions.
- Once all the data is inserted, we close the database connection.
Here’s the complete implementation:
async saveToPostgres() {
  if (this.storageQueue.length <= 0) {
    return;
  }
  // Connection details for the local PostgreSQL instance
  const client = new Client({
    user: "postgres",
    host: "localhost",
    database: this.pgDbName,
    password: "mysecretpassword",
    port: 5432,
  });
  try {
    await client.connect();
    // Parameterized query: values are passed separately to avoid SQL injection
    const query =
      "INSERT INTO chocolate_products (name, price_gb, price_usd, url) VALUES ($1, $2, $3, $4)";
    for (const product of this.storageQueue) {
      await client.query(query, [
        product.name,
        product.priceGb,
        product.priceUsd,
        product.url,
      ]);
    }
  } catch (error) {
    console.error("Error inserting data into database: ", error);
  } finally {
    // Always close the connection, even if an insert failed
    await client.end();
  }
}
After running the above code, the data from the storageQueue will be inserted into the chocolate_products table in your PostgreSQL database. You’ll see each product neatly stored with its name, price_gb, price_usd, and url.
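A quick query from psql will confirm the rows are there, for example:
SELECT id, name, price_gb, price_usd, url FROM chocolate_products LIMIT 5;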
Complete Code
Now, let’s combine the storage methods we’ve created in this guide with the code from Part 2 to see how the entire solution comes together:
const { chromium } = require('playwright');
const fs = require('fs');
const { S3Client, PutObjectCommand } = require("@aws-sdk/client-s3");
const mysql = require("mysql");
const { Client } = require("pg");
class Product {
constructor(name, priceStr, url, conversionRate = 1.32) {
this.name = this.cleanName(name);
this.priceGb = this.cleanPrice(priceStr);
this.priceUsd = this.convertPriceToUsd(this.priceGb, conversionRate);
this.url = this.createAbsoluteUrl(url);
}
cleanName(name) {
return name?.trim() || "missing";
}
cleanPrice(priceStr) {
if (!priceStr?.trim()) {
return 0.0;
}
const cleanedPrice = priceStr
.replace(/Sale priceFrom £|Sale price£/g, "")
.trim();
return cleanedPrice ? parseFloat(cleanedPrice) : 0.0;
}
convertPriceToUsd(priceGb, conversionRate) {
return priceGb * conversionRate;
}
createAbsoluteUrl(url) {
return (url?.trim()) ? `https://www.chocolate.co.uk${url.trim()}` : "missing";
}
}
class ProductDataPipeline {
constructor(
csvFilename = "",
jsonFileName = "",
s3Bucket = "",
mysqlDbName = "",
pgDbName = "",
storageQueueLimit = 5
) {
this.seenProducts = new Set();
this.storageQueue = [];
this.csvFilename = csvFilename;
this.csvFileOpen = false;
this.jsonFileName = jsonFileName;
this.s3Bucket = s3Bucket;
this.mysqlDbName = mysqlDbName;
this.pgDbName = pgDbName;
this.storageQueueLimit = storageQueueLimit;
}
saveToCsv() {
this.csvFileOpen = true;
const fileExists = fs.existsSync(this.csvFilename);
const file = fs.createWriteStream(this.csvFilename, { flags: "a" });
if (!fileExists) {
file.write("name,priceGb,priceUsd,url\n");
}
for (const product of this.storageQueue) {
file.write(
`${product.name},${product.priceGb},${product.priceUsd},${product.url}\n`
);
}
file.end();
this.csvFileOpen = false;
}
saveToJson() {
if (this.storageQueue.length <= 0) {
return;
}
const fileExists = fs.existsSync(this.jsonFileName);
let existingData = [];
if (fileExists) {
const fileContent = fs.readFileSync(this.jsonFileName, "utf8");
existingData = JSON.parse(fileContent);
}
const mergedData = [...existingData, ...this.storageQueue];
fs.writeFileSync(this.jsonFileName, JSON.stringify(mergedData, null, 2));
}
async saveToS3Bucket() {
const client = new S3Client({
region: "us-east-1",
credentials: {
accessKeyId: "YOUR_ACCESS_KEY_ID",
secretAccessKey: "YOUR_SECRET_ACCESS_KEY",
},
});
// Wait (up to 3 seconds) for the CSV file to appear on disk before uploading
let retry = 3;
while (retry > 0 && !fs.existsSync(this.csvFilename)) {
await new Promise((resolve) => setTimeout(resolve, 1000));
retry -= 1;
}
const params = {
Bucket: this.s3Bucket,
Key: this.csvFilename,
Body: fs.createReadStream(this.csvFilename),
};
await client.send(new PutObjectCommand(params));
}
saveToMySQL() {
if (this.storageQueue.length <= 0) {
return;
}
return new Promise((resolve, reject) => {
const connection = mysql.createConnection({
host: "localhost",
user: "root",
password: "password",
database: this.mysqlDbName,
});
connection.connect((err) => {
if (err) {
console.error("Error connecting to database: ", err);
reject(err);
return;
}
const query =
"INSERT INTO chocolate_products (name, price_gb, price_usd, url) VALUES ?";
const values = this.storageQueue.map((product) => [
product.name,
product.priceGb,
product.priceUsd,
product.url,
]);
connection.query(query, [values], (err, results) => {
if (err) {
console.error("Error inserting data into database: ", err);
connection.end();
reject(err);
} else {
connection.end();
resolve(results);
}
});
});
});
}
async saveToPostgres() {
if (this.storageQueue.length <= 0) {
return;
}
const client = new Client({
user: "postgres",
host: "localhost",
database: this.pgDbName,
password: "mysecretpassword",
port: 5432,
});
try {
await client.connect();
const query =
"INSERT INTO chocolate_products (name, price_gb, price_usd, url) VALUES ($1, $2, $3, $4)";
for (const product of this.storageQueue) {
await client.query(query, [
product.name,
product.priceGb,
product.priceUsd,
product.url,
]);
}
} catch (error) {
console.error("Error inserting data into database: ", error);
} finally {
await client.end();
}
}
cleanRawProduct(rawProduct) {
return new Product(rawProduct.name, rawProduct.price, rawProduct.url);
}
isDuplicateProduct(product) {
if (!this.seenProducts.has(product.url)) {
this.seenProducts.add(product.url);
return false;
}
return true;
}
async addProduct(rawProduct) {
const product = this.cleanRawProduct(rawProduct);
if (!this.isDuplicateProduct(product)) {
this.storageQueue.push(product);
if (
this.storageQueue.length >= this.storageQueueLimit &&
!this.csvFileOpen
) {
this.saveToCsv();
this.saveToJson();
await this.saveToMySQL();
await this.saveToPostgres();
this.storageQueue = [];
}
}
}
async close() {
while (this.csvFileOpen) {
// Wait for the file to be written
await new Promise((resolve) => setTimeout(resolve, 1000));
}
if (this.storageQueue.length > 0) {
this.saveToCsv();
}
}
}
const listOfUrls = ["https://www.chocolate.co.uk/collections/all"];
async function scrape() {
const pipeline = new ProductDataPipeline(
"chocolate.csv",
"chocolate.json",
"chocolate-bucket",
"chocolate_db",
"chocolate_db",
5
);
const browser = await chromium.launch({ headless: true });
const context = await browser.newContext();
const page = await context.newPage();
for (let url of listOfUrls) {
console.log(`Scraping: ${url}`);
await page.goto(url);
const productItems = await page.$$eval("product-item", items =>
items.map(item => {
const titleElement = item.querySelector(".product-item-meta__title");
const priceElement = item.querySelector(".price");
return {
title: titleElement ? titleElement.textContent.trim() : null,
price: priceElement ? priceElement.textContent.trim() : null,
url: titleElement ? titleElement.getAttribute("href") : null
};
})
);
for (const rawProduct of productItems) {
if (rawProduct.title && rawProduct.price && rawProduct.url) {
await pipeline.addProduct({
name: rawProduct.title,
price: rawProduct.price,
url: rawProduct.url
});
}
}
await nextPage(page);
}
await pipeline.close();
await pipeline.saveToS3Bucket();
await browser.close();
}
async function nextPage(page) {
let nextUrl;
try {
nextUrl = await page.$eval("a.pagination__nav-item:nth-child(4)", item => item.href);
} catch (error) {
console.log('Last Page Reached');
return;
}
listOfUrls.push(nextUrl);
}
(async () => {
await scrape();
})();
When you execute the above code, each batch of products is written to the CSV and JSON files and inserted into the MySQL and PostgreSQL databases.
Once scraping finishes, the CSV file is uploaded to AWS S3, giving you multiple backups and storage options for your scraped data.
Next Steps
So far, in Part 1, Part 2, and Part 3, we've built an awesome scraper. It extracts data, cleans it, and stores it across multiple storage options.
In Part 4 of this series, we’ll enhance our scraper to be more robust and scalable. We’ll implement a retry mechanism to handle failed requests and use Node.js's concurrency features to deploy multiple worker threads, accelerating both the extraction and storage processes.