Node.js Puppeteer Beginners Series Part 3: Storing Data
In Part 1 and Part 2 of this Node.js Puppeteer 6-Part Beginner Series, we learned how to build a basic web scraper and extract data from websites, as well as how to clean the scraped data.
In Part 3, we will cover various methods for saving the data scraped using Puppeteer. We'll implement these methods within the ProductDataPipeline class, which we created in Part 2 of this series.
Specifically, we'll explore saving data to a JSON file, Amazon S3 storage, a MySQL database, and a PostgreSQL database.
- Saving Data to a JSON File
- Saving Data to Amazon S3 Storage
- Saving Data to MySQL Database
- Saving Data to PostgreSQL Database
Node.js Puppeteer 6-Part Beginner Series
- Part 1: Basic Node.js Puppeteer Scraper - We'll learn the fundamentals of web scraping with Node.js and build our first scraper using Node.js Puppeteer. (Part 1)
- Part 2: Cleaning Unruly Data & Handling Edge Cases - Web data can be messy and unpredictable. In this part, we'll create a robust scraper using data structures and cleaning techniques to handle these challenges. (Part 2)
- Part 3: Storing Scraped Data in AWS S3, MySQL & Postgres DBs - Explore various options for storing your scraped data, including databases like MySQL or Postgres, cloud storage like AWS S3, and file formats like CSV and JSON. We'll discuss their pros, cons, and suitable use cases. (This article)
- Part 4: Managing Retries & Concurrency - Enhance your scraper's reliability and scalability by handling failed requests and utilizing concurrency. (Part 4)
- Part 5: Faking User-Agents & Browser Headers - Learn how to create a production-ready scraper by simulating real users through user-agent and browser header manipulation. (Part 5)
- Part 6: Using Proxies To Avoid Getting Blocked - Discover how to use proxies to bypass anti-bot systems by disguising your real IP address and location. (Part 6)
Saving Data to a JSON File
💡Note: All methods for storing data in JSON files, databases, or S3 buckets will be implemented within the ProductDataPipeline class, which was created in Part 2 of this series.
Saving data to a JSON file is a straightforward and widely used method for storing structured data.
JSON, or JavaScript Object Notation, is a lightweight data interchange format that's easy for humans to read and write and easy for machines to parse and generate. This format is particularly useful for web applications and APIs, making it a good choice for our web scraping project.
Storing data in a JSON file has several benefits:
- Human-Readable: JSON files are easy to read and understand, which makes them ideal for debugging and manual inspection.
- Lightweight: JSON is a lightweight data format, making it efficient for storage and transfer.
- Interoperable: JSON is widely supported across different programming languages and platforms, ensuring interoperability.
- Flexible: JSON can easily handle complex nested data structures, making it versatile for various data representation needs.
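To make that concrete, here's a minimal sketch of how Node.js moves between JavaScript objects and JSON text using the built-in JSON methods (the sample object mirrors one of the products scraped in this series):

// Round-trip a product object through JSON text
const product = {
  name: 'Lovely Chocolate',
  priceGBP: 1.5,
  priceUSD: 1.815,
  url: 'https://www.chocolate.co.uk/products/100-dark-hot-chocolate-flakes'
};
const asText = JSON.stringify(product, null, 2); // serialize for storage or transfer
const backAgain = JSON.parse(asText);            // parse back into a JavaScript object
console.log(backAgain.name);                     // "Lovely Chocolate"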
To save data to a JSON file, we will create a method called saveToJson() within the ProductDataPipeline class. This method will perform the following actions:
- Drain the Storage Queue: Copy the products currently waiting in the storageQueue into a local productsToSave array and clear the queue. If there is nothing to save, the method returns early.
- Read Existing Data: Read the JSON file specified by this.jsonFilename with fs.readFileSync() and parse it with JSON.parse() into an existingData array. If the file doesn't exist yet, existingData starts out as an empty array.
- Combine Data: Append the new products to the end of the existingData array, combining both datasets.
- Write to JSON File: Write the combined data back to the file using fs.writeFileSync() and JSON.stringify().
Here's how you can implement the saveToJson() method:
const fs = require('fs');
const path = require('path');
class ProductDataPipeline {
constructor(csvFilename, jsonFilename, storageQueueLimit = 5) {
this.namesSeen = [];
this.storageQueue = [];
this.storageQueueLimit = storageQueueLimit;
this.csvFilename = csvFilename;
this.jsonFilename = jsonFilename;
this.csvFileOpen = false;
}
saveToJson() {
// Take a snapshot of the queued products and clear the queue
const productsToSave = [...this.storageQueue];
this.storageQueue = [];
if (!productsToSave.length) return;
// Load any data already in the file; a missing file (ENOENT) just means we start fresh
let existingData = [];
try {
existingData = JSON.parse(fs.readFileSync(this.jsonFilename, 'utf8'));
} catch (error) {
if (error.code !== 'ENOENT') throw error;
}
// Append the new batch and write everything back, pretty-printed
existingData.push(...productsToSave);
fs.writeFileSync(this.jsonFilename, JSON.stringify(existingData, null, 2), 'utf8');
}
}
After running this code, the JSON file will look something like this:
[
{
"name": "Lovely Chocolate",
"price_gb": 1.5,
"price_usd": 1.82,
"url": "https://www.chocolate.co.uk/products/100-dark-hot-chocolate-flakes"
},
{
"name": "My Nice Chocolate",
"price_gb": 4.0,
"price_usd": 4.84,
"url": "https://www.chocolate.co.uk/products/nice-chocolate-flakes"
}
]
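If you want to exercise the method on its own before wiring it into the full scraper, a minimal sketch looks like this (the sample objects are hypothetical and are pushed straight onto the queue for brevity, rather than going through addProduct()):

const pipeline = new ProductDataPipeline('product_data.csv', 'product_data.json');

// Normally addProduct() fills the queue; here we push sample objects directly for a quick test
pipeline.storageQueue.push(
  { name: 'Lovely Chocolate', priceGBP: 1.5, priceUSD: 1.815, url: 'https://www.chocolate.co.uk/products/100-dark-hot-chocolate-flakes' },
  { name: 'My Nice Chocolate', priceGBP: 4, priceUSD: 4.84, url: 'https://www.chocolate.co.uk/products/nice-chocolate-flakes' }
);

pipeline.saveToJson(); // writes (or appends to) product_data.json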
Saving Data to Amazon S3 Storage
Amazon S3 (Simple Storage Service) is a scalable, high-speed, web-based cloud storage service designed for online backup and archiving of data and applications on Amazon Web Services (AWS).
Saving your data to Amazon S3 can be particularly useful for making it accessible across multiple devices and for ensuring redundancy and durability of your data.
Storing data in Amazon S3 has several benefits:
- Scalability: S3 automatically scales storage capacity to accommodate your growing data needs.
- Durability: S3 is designed to provide 99.999999999% durability by redundantly storing objects across multiple devices in multiple facilities.
- Accessibility: Data stored in S3 can be accessed from anywhere with an internet connection, making it ideal for distributed applications.
- Security: S3 offers robust security features, including access control, encryption, and auditing capabilities.
We've already covered how to export the data to a CSV file in Part 2 of this series. Nonetheless, we've included that code again in the complete code at the end of this article.
Now, let's save the created CSV file directly to an Amazon S3 bucket (you'll need to have one set up already).
If you don't have one, you can follow the instructions on the AWS website.
To save the created CSV file directly to an Amazon S3 bucket, we will use the aws-sdk package, the official AWS SDK for JavaScript. It lets Node.js applications interact with Amazon services such as S3.
Once you have a bucket, start by installing the aws-sdk package using npm:
$ npm install aws-sdk
We will create a method called saveToS3Bucket() within the ProductDataPipeline class. This method will perform the following actions:
- Set Up S3 Client: Create an S3 client using the aws-sdk library. This client will be used to interact with the S3 service.
- Upload File: Use the upload method of the S3 client to upload the file product_data.csv (this.csvFilename) to the specified S3 bucket (this.awsS3Bucket) under the key chocolate_data.csv.
Here is the implementation:
const fs = require('fs');
const AWS = require('aws-sdk');
class ProductDataPipeline {
constructor(csvFilename, jsonFilename, awsS3Bucket, storageQueueLimit = 5) {
this.namesSeen = [];
this.storageQueue = [];
this.storageQueueLimit = storageQueueLimit;
this.csvFilename = csvFilename;
this.jsonFilename = jsonFilename;
this.awsS3Bucket = awsS3Bucket;
}
saveToS3Bucket() {
const s3 = new AWS.S3({
accessKeyId: process.env.AWS_ACCESS_KEY_ID,
secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY
});
const params = {
Bucket: this.awsS3Bucket,
Key: 'chocolate_data.csv',
Body: fs.createReadStream(this.csvFilename)
};
s3.upload(params, function(err, data) {
if (err) {
console.log("Error", err);
}
if (data) {
console.log("Upload Success", data.Location);
}
});
}
}
You'll need to supply your own AWS credentials by setting the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables before running the script. Also, pass in your own bucket name and the correct file path.
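If you'd rather not reference the keys in your code at all, the aws-sdk v2 client can also resolve credentials automatically through its default provider chain (environment variables or the shared ~/.aws/credentials file), so a sketch like this works too (the region shown is just an example):

const AWS = require('aws-sdk');
// No explicit keys: the SDK resolves credentials from the environment or ~/.aws/credentials
const s3 = new AWS.S3({ region: 'us-east-1' });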
Here’s the snapshot showing that the file has been uploaded successfully:
Note: When saving data to AWS S3, employ delayed file uploads. This means that the file is first temporarily saved locally on the machine running the scraper, and then uploaded to AWS S3 once the scraper's job is complete.
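In practice that pattern looks something like the sketch below (the bucket name is a placeholder, and the scraping/CSV-writing steps are the pipeline methods shown elsewhere in this series):

// Hypothetical end-of-run sequence: build the CSV locally during the scrape, upload it last
const pipeline = new ProductDataPipeline('product_data.csv', 'product_data.json', 'your-bucket-name');
// ... run the scraper here, calling pipeline.addProduct(...) for each product ...
pipeline.saveToS3Bucket(); // only after the scrape is done does the CSV leave the machine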
Saving Data to MySQL Database
MySQL is a popular open-source relational database management system that uses structured query language (SQL).
Saving data to a MySQL database is beneficial for applications that require complex querying and data manipulation capabilities.
Storing data in a MySQL database has several benefits:
- Structured Data: MySQL uses a structured format, which is ideal for applications that require complex queries and data manipulation.
- Performance: MySQL is optimized for read-heavy operations, making it suitable for applications with high query loads.
- Flexibility: MySQL supports a wide range of data types and indexing options, providing flexibility in how data is stored and retrieved.
- Community Support: MySQL has a large and active community, offering extensive documentation, tutorials, and support resources.
Downloading MySQL Database
Ensure you have MySQL installed on your machine and have created a database. If you don't have MySQL installed, you can follow the installation instructions on the official MySQL website.
Creating the MySQL Table
Ensure you have a table called chocolate_products in your database. If not, you can log in to your database and run the following command to create the table:
CREATE TABLE IF NOT EXISTS chocolate_products (
id INT AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(255),
price_gb DECIMAL(10, 2),
price_usd DECIMAL(10, 2),
url TEXT
);
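If you prefer to create the table from Node.js rather than the MySQL shell, a minimal sketch with mysql2 looks like this (it assumes the same local credentials and chocolatedb database used later in this article):

const mysql = require('mysql2/promise');

(async () => {
  // Connect with the same settings used by saveToMySQL() and create the table once
  const connection = await mysql.createConnection({ host: 'localhost', user: 'root', password: 'password', database: 'chocolatedb' });
  await connection.query(`CREATE TABLE IF NOT EXISTS chocolate_products (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255),
    price_gb DECIMAL(10, 2),
    price_usd DECIMAL(10, 2),
    url TEXT
  )`);
  await connection.end();
})();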
Connecting Scraper To MySQL Database
To interact with your MySQL database from Node.js, you'll need the appropriate MySQL connector. You can easily install it using npm:
$ npm install mysql2
We will create a method called saveToMySQL() within the ProductDataPipeline class. This method will perform the following actions:
- Establish Connection: Establish a connection to your MySQL database.
- Insert Data: Insert the product data into the chocolate_products table.
Here is the implementation:
const mysql = require('mysql2/promise');
class ProductDataPipeline {
constructor(csvFilename, jsonFilename, mysqlDbName, storageQueueLimit = 5) {
this.namesSeen = [];
this.storageQueue = [];
this.storageQueueLimit = storageQueueLimit;
this.csvFilename = csvFilename;
this.jsonFilename = jsonFilename;
this.mysqlDbName = mysqlDbName;
}
async saveToMySQL() {
const productsToSave = [...this.storageQueue];
this.storageQueue = [];
if (!productsToSave.length) return;
const connection = await mysql.createConnection({
host: 'localhost',
user: 'root',
password: 'password',
database: this.mysqlDbName
});
const query = `INSERT INTO chocolate_products (name, price_gb, price_usd, url) VALUES (?, ?, ?, ?)`;
for (const product of productsToSave) {
await connection.execute(query, [product.name, product.priceGBP, product.priceUSD, product.url]);
}
await connection.end();
}
}
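To double-check the inserts from Node.js itself, you can query the rows straight back (again assuming the local credentials used above):

const mysql = require('mysql2/promise');

(async () => {
  const connection = await mysql.createConnection({ host: 'localhost', user: 'root', password: 'password', database: 'chocolatedb' });
  // execute() resolves to [rows, fields]; console.table prints the rows as a grid
  const [rows] = await connection.execute('SELECT name, price_gb, price_usd, url FROM chocolate_products');
  console.table(rows);
  await connection.end();
})();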
Here’s the snapshot of the data stored in the MySQL database:
Saving Data to PostgreSQL Database
PostgreSQL is a powerful, open-source relational database management system known for its robustness, extensibility, and compliance with SQL standards.
Saving data to a PostgreSQL database is beneficial for applications that require complex querying, data integrity, and support for advanced data types.
Storing data in a PostgreSQL database has several benefits:
- Advanced Features: PostgreSQL supports advanced features such as full-text search, custom data types, and complex queries.
- Data Integrity: PostgreSQL offers robust data integrity features, including foreign keys, constraints, and transactional integrity.
- Extensibility: PostgreSQL is highly extensible, allowing developers to add custom functions, operators, and data types.
- Compliance: PostgreSQL adheres closely to SQL standards, ensuring compatibility with a wide range of SQL-based applications.
Downloading PostgreSQL
Ensure you have PostgreSQL installed on your machine and have created a database. For setting up a Postgres database, check out the resources on the official PostgreSQL website.
Creating the PostgreSQL Table
Ensure you have a table called chocolate_products in your database. If not, you can log in to your database and run the following command to create the table:
CREATE TABLE IF NOT EXISTS chocolate_products (
id SERIAL PRIMARY KEY,
name VARCHAR(255),
price_gb DECIMAL(10, 2),
price_usd DECIMAL(10, 2),
url TEXT
);
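As with MySQL, you can also create the table from Node.js with pg instead of psql (a sketch assuming the same local credentials and chocolatedb database used later in this article):

const { Client } = require('pg');

(async () => {
  // Connect with the same settings used by saveToPostgreSQL() and create the table once
  const client = new Client({ host: 'localhost', user: 'postgres', password: 'password', database: 'chocolatedb' });
  await client.connect();
  await client.query(`CREATE TABLE IF NOT EXISTS chocolate_products (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255),
    price_gb DECIMAL(10, 2),
    price_usd DECIMAL(10, 2),
    url TEXT
  )`);
  await client.end();
})();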
Installing PostgreSQL Connector
To interact with your PostgreSQL database from Node.js, you'll need the appropriate PostgreSQL connector. You can easily install it using npm:
$ npm install pg
We will create a method called saveToPostgreSQL() within the ProductDataPipeline class. This method will perform the following actions:
- Establish Connection: Establish a connection to your PostgreSQL database.
- Insert Data: Insert the product data into the chocolate_products table.
Here is the implementation:
const { Client } = require('pg');
class ProductDataPipeline {
constructor(csvFilename, jsonFilename, postgresDbName, storageQueueLimit = 5) {
this.namesSeen = [];
this.storageQueue = [];
this.storageQueueLimit = storageQueueLimit;
this.csvFilename = csvFilename;
this.jsonFilename = jsonFilename;
this.postgresDbName = postgresDbName;
}
async saveToPostgreSQL() {
const productsToSave = [...this.storageQueue];
this.storageQueue = [];
if (!productsToSave.length) return;
const client = new Client({
host: 'localhost',
user: 'postgres',
password: 'password',
database: this.postgresDbName
});
await client.connect();
const query = `INSERT INTO chocolate_products (name, price_gb, price_usd, url) VALUES ($1, $2, $3, $4)`;
for (const product of productsToSave) {
await client.query(query, [product.name, product.priceGBP, product.priceUSD, product.url]);
}
await client.end();
}
}
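And, just as with MySQL, you can read the rows back from Node.js to confirm the inserts (same assumed local credentials):

const { Client } = require('pg');

(async () => {
  const client = new Client({ host: 'localhost', user: 'postgres', password: 'password', database: 'chocolatedb' });
  await client.connect();
  // result.rows holds the matching records as plain JavaScript objects
  const result = await client.query('SELECT name, price_gb, price_usd, url FROM chocolate_products');
  console.table(result.rows);
  await client.end();
})();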
Here’s the snapshot of the data stored in the PostgreSQL database:
Complete Code
When you execute the code, it will create two files, product_data.csv and product_data.json. It will then store the data in MySQL, PostgreSQL, and the Amazon S3 bucket.
Execute the following command to see the data in your MySQL and PostgreSQL databases:
SELECT * FROM chocolate_products;
Here’s the complete code integrating all the methods for saving data to different storage options:
const fs = require('fs');
const AWS = require('aws-sdk');
const mysql = require('mysql2/promise');
const { Client } = require('pg');
const puppeteer = require('puppeteer');
class Product {
constructor(name, priceString, url) {
this.name = this.cleanName(name);
this.priceGBP = this.cleanPrice(priceString);
this.priceUSD = this.convertPriceToUSD();
this.url = this.createAbsoluteURL(url);
}
cleanName(name) {
return name.trim() || "missing";
}
cleanPrice(priceString) {
if (!priceString) return 0.0;
priceString = priceString.replace(/[^0-9\.]+/g, '');
return parseFloat(priceString) || 0.0;
}
convertPriceToUSD() {
const exchangeRate = 1.21;
return this.priceGBP * exchangeRate;
}
createAbsoluteURL(relativeURL) {
const baseURL = "https://www.chocolate.co.uk";
return relativeURL ? `${baseURL}${relativeURL}` : "missing";
}
}
class ProductDataPipeline {
constructor(csvFilename, jsonFilename, mysqlDbName, postgresDbName, awsS3Bucket, storageQueueLimit = 5) {
this.namesSeen = [];
this.storageQueue = [];
this.storageQueueLimit = storageQueueLimit;
this.csvFilename = csvFilename;
this.jsonFilename = jsonFilename;
this.mysqlDbName = mysqlDbName;
this.postgresDbName = postgresDbName;
this.awsS3Bucket = awsS3Bucket;
}
saveToCsv() {
// Copy the queued products; the caller clears the queue once every save method has run
const productsToSave = [...this.storageQueue];
if (!productsToSave.length) return;
const headers = Object.keys(productsToSave[0]);
const fileExists = fs.existsSync(this.csvFilename);
const csvWriter = fs.createWriteStream(this.csvFilename, { flags: 'a' });
if (!fileExists) {
csvWriter.write(headers.join(',') + '\n');
}
productsToSave.forEach(product => {
const row = headers.map(header => product[header]).join(',');
csvWriter.write(row + '\n');
});
csvWriter.end();
}
saveToJson() {
// Queue is cleared by the caller after all save methods have run
const productsToSave = [...this.storageQueue];
if (!productsToSave.length) return;
let existingData = [];
try {
existingData = JSON.parse(fs.readFileSync(this.jsonFilename, 'utf8'));
} catch (error) {
if (error.code !== 'ENOENT') throw error;
}
existingData.push(...productsToSave);
fs.writeFileSync(this.jsonFilename, JSON.stringify(existingData, null, 2), 'utf8');
}
async saveToMySQL() {
// Queue is cleared by the caller after all save methods have run
const productsToSave = [...this.storageQueue];
if (!productsToSave.length) return;
const connection = await mysql.createConnection({
host: 'localhost',
user: 'root',
password: 'password',
database: this.mysqlDbName
});
const query = `INSERT INTO chocolate_products (name, price_gb, price_usd, url) VALUES (?, ?, ?, ?)`;
for (const product of productsToSave) {
await connection.execute(query, [product.name, product.priceGBP, product.priceUSD, product.url]);
}
await connection.end();
}
async saveToPostgreSQL() {
// Queue is cleared by the caller after all save methods have run
const productsToSave = [...this.storageQueue];
if (!productsToSave.length) return;
const client = new Client({
host: 'localhost',
user: 'postgres',
password: 'password',
database: this.postgresDbName
});
await client.connect();
const query = `INSERT INTO chocolate_products (name, price_gb, price_usd, url) VALUES ($1, $2, $3, $4)`;
for (const product of productsToSave) {
await client.query(query, [product.name, product.priceGBP, product.priceUSD, product.url]);
}
await client.end();
}
saveToS3Bucket() {
const s3 = new AWS.S3({
accessKeyId: process.env.AWS_ACCESS_KEY_ID,
secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY
});
const params = {
Bucket: this.awsS3Bucket,
Key: 'chocolate_data.csv',
Body: fs.createReadStream(this.csvFilename)
};
s3.upload(params,
function(err, data) {
if (err) {
console.log("Error", err);
}
if (data) {
console.log("Upload Success", data.Location);
}
});
}
cleanRawProduct(scrapedData) {
return new Product(
scrapedData.name || '',
scrapedData.price || '',
scrapedData.url || ''
);
}
isDuplicate(product) {
if (this.namesSeen.includes(product.name)) {
console.log(`Duplicate item found: ${product.name}. Item dropped.`);
return true;
}
this.namesSeen.push(product.name);
return false;
}
async addProduct(scrapedData) {
const product = this.cleanRawProduct(scrapedData);
if (!this.isDuplicate(product)) {
this.storageQueue.push(product);
if (this.storageQueue.length >= this.storageQueueLimit) {
// Give every storage target a chance to read the queued batch, then clear the queue
this.saveToCsv();
this.saveToJson();
await this.saveToMySQL();
await this.saveToPostgreSQL();
this.storageQueue = [];
}
}
}
async closePipeline() {
if (this.storageQueue.length > 0) {
this.saveToCsv();
this.saveToJson();
await this.saveToMySQL();
await this.saveToPostgreSQL();
this.storageQueue = [];
}
}
}
const startScrape = async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
const baseURL = 'https://www.chocolate.co.uk/collections/all';
const dataPipeline = new ProductDataPipeline('product_data.csv', 'product_data.json', 'chocolatedb', 'chocolatedb', 'chocolate-products');
let nextPageExists = true;
let currentPage = baseURL;
while (nextPageExists) {
await page.goto(currentPage, { waitUntil: 'networkidle2' });
const products = await page.evaluate(() => {
const items = document.querySelectorAll('.product-item');
return Array.from(items).map(item => ({
name: item.querySelector('.product-item-meta__title').innerText,
price: item.querySelector('.price').innerText,
url: item.querySelector('.product-item-meta a').getAttribute('href')
}));
});
for (const product of products) {
await dataPipeline.addProduct(product);
}
nextPageExists = await page.evaluate(() => {
const nextPage = document.querySelector('a[rel="next"]');
return nextPage ? nextPage.href : null;
});
if (nextPageExists) {
currentPage = nextPageExists;
}
}
await browser.close();
await dataPipeline.closePipeline();
console.log("Congratulations! Data saved successfully on MySQL, PostgreSQL, JSON, and CSV.");
dataPipeline.saveToS3Bucket();
};
startScrape();
Next Steps
We hope you now have a good understanding of how to save the data you've scraped into the file or database you need! If you have any questions, leave them in the comments below and we'll do our best to help out!
In the next tutorial, we'll cover how to make our scraper more robust and scalable by handling failed requests and using concurrency. Stay tuned!