Part 3 - Storing Data in AWS S3, MySQL & Postgres DBs Using Puppeteer

Node.js Puppeteer Beginners Series Part 3: Storing Data

In Part 1 and Part 2 of this Node.js Puppeteer 6-Part Beginner Series, we learned how to build a basic web scraper and extract data from websites, as well as how to clean the scraped data.

In Part 3, we will cover various methods for saving the data scraped using Puppeteer. We'll implement these methods within the ProductDataPipeline class, which we created in Part 2 of this series.

Specifically, we'll explore saving data to a JSON file, Amazon S3 storage, a MySQL database, and a PostgreSQL database.

Node.js Puppeteer 6-Part Beginner Series

  • Part 1: Basic Node.js Puppeteer Scraper - We'll learn the fundamentals of web scraping with Node.js and build our first scraper using Node.js Puppeteer. (Part 1)

  • Part 2: Cleaning Unruly Data & Handling Edge Cases - Web data can be messy and unpredictable. In this part, we'll create a robust scraper using data structures and cleaning techniques to handle these challenges. (Part 2)

  • Part 3: Storing Scraped Data in AWS S3, MySQL & Postgres DBs - Explore various options for storing your scraped data, including databases like MySQL or Postgres, cloud storage like AWS S3, and file formats like CSV and JSON. We'll discuss their pros, cons, and suitable use cases. (This article)

  • Part 4: Managing Retries & Concurrency - Enhance your scraper's reliability and scalability by handling failed requests and utilizing concurrency. (Part 4)

  • Part 5: Faking User-Agents & Browser Headers - Learn how to create a production-ready scraper by simulating real users through user-agent and browser header manipulation. (Part 5)

  • Part 6: Using Proxies To Avoid Getting Blocked - Discover how to use proxies to bypass anti-bot systems by disguising your real IP address and location. (Part 6)



Saving Data to a JSON File

💡Note: All methods for storing data in JSON files, databases, or S3 buckets will be implemented within the ProductDataPipeline class, which was created in Part 2 of this series.

Saving data to a JSON file is a straightforward and widely used method for storing structured data.

JSON, or JavaScript Object Notation, is a lightweight data interchange format that's easy for humans to read and write and easy for machines to parse and generate. This format is particularly useful for web applications and APIs, making it a good choice for our web scraping project.

Storing data in a JSON file has several benefits:

  • Human-Readable: JSON files are easy to read and understand, which makes them ideal for debugging and manual inspection.
  • Lightweight: JSON is a lightweight data format, making it efficient for storage and transfer.
  • Interoperable: JSON is widely supported across different programming languages and platforms, ensuring interoperability.
  • Flexible: JSON can easily handle complex nested data structures, making it versatile for various data representation needs.

To save data to a JSON file, we will create a method called saveToJson() within the ProductDataPipeline class. This method will perform the following actions:

  1. Drain the Storage Queue: The method copies the products currently held in the storage queue into a productsToSave array and empties the queue. If there is nothing to save, it returns early.

  2. Read Existing Data: The method reads the JSON file specified by this.jsonFilename and parses it with JSON.parse(), loading any previously saved records into the existingData array. If the file doesn't exist yet, existingData starts as an empty array.

  3. Combine Data: The newly scraped products are appended to the end of the existingData array, combining both datasets.

  4. Write to JSON File: Finally, the method serializes the combined data with JSON.stringify() and writes it back to the file using fs.writeFileSync().

Here's how you can implement the saveToJson() method:

const fs = require('fs');

class ProductDataPipeline {
  constructor(csvFilename, jsonFilename, storageQueueLimit = 5) {
    this.namesSeen = [];
    this.storageQueue = [];
    this.storageQueueLimit = storageQueueLimit;
    this.csvFilename = csvFilename;
    this.jsonFilename = jsonFilename;
    this.csvFileOpen = false;
  }

  saveToJson() {
    const productsToSave = [...this.storageQueue];
    this.storageQueue = [];

    if (!productsToSave.length) return;

    let existingData = [];

    try {
      existingData = JSON.parse(fs.readFileSync(this.jsonFilename, 'utf8'));
    } catch (error) {
      if (error.code !== 'ENOENT') throw error;
    }

    existingData.push(...productsToSave);

    fs.writeFileSync(this.jsonFilename, JSON.stringify(existingData, null, 2), 'utf8');
  }
}

After running this code, the JSON file will look something like this:

[
  {
    "name": "Lovely Chocolate",
    "priceGBP": 1.5,
    "priceUSD": 1.82,
    "url": "https://www.chocolate.co.uk/products/100-dark-hot-chocolate-flakes"
  },
  {
    "name": "My Nice Chocolate",
    "priceGBP": 4.0,
    "priceUSD": 4.84,
    "url": "https://www.chocolate.co.uk/products/nice-chocolate-flakes"
  }
]
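
Once the file has been written, you can load it back into any script for a quick check. Here's a minimal sketch (assuming the product_data.json filename used in this series and the property names shown above):

const fs = require('fs');

// Load the saved JSON file back into memory.
const savedProducts = JSON.parse(fs.readFileSync('product_data.json', 'utf8'));

// Quick sanity check: count products and list any priced above 5 USD.
const expensive = savedProducts.filter(product => product.priceUSD > 5);
console.log(`Loaded ${savedProducts.length} products, ${expensive.length} priced over $5.`);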

Saving Data to Amazon S3 Storage

Amazon S3 (Simple Storage Service) is a scalable, high-speed, web-based cloud storage service designed for online backup and archiving of data and applications on Amazon Web Services (AWS).

Saving your data to Amazon S3 can be particularly useful for making it accessible across multiple devices and for ensuring redundancy and durability of your data.

Storing data in Amazon S3 has several benefits:

  • Scalability: S3 automatically scales storage capacity to accommodate your growing data needs.
  • Durability: S3 is designed to provide 99.999999999% durability by redundantly storing objects across multiple devices in multiple facilities.
  • Accessibility: Data stored in S3 can be accessed from anywhere with an internet connection, making it ideal for distributed applications.
  • Security: S3 offers robust security features, including access control, encryption, and auditing capabilities.

We've already covered how to export the data to a CSV file in Part 2 of this series, but the code for that is also included in the complete code at the end of this article.

Now, let's save the created CSV file directly to an Amazon S3 bucket (you'll need to have one set up already).

If you don't have one, you can follow the instructions on the AWS website.

To save the created CSV file directly to an Amazon S3 bucket, we will use the aws-sdk package, the Amazon Web Services (AWS) SDK for JavaScript. This library allows Node.js developers to write software that makes use of Amazon services like S3 and others.

Once you have a bucket, we'll start by installing the aws-sdk package using npm:

$ npm install aws-sdk

We will create a method called saveToS3Bucket() within the ProductDataPipeline class. This method will perform the following actions:

  1. Set Up S3 Client: Create an S3 client using the aws-sdk library. This client will be used to interact with the S3 service.
  2. Upload File: Use the upload method of the S3 client to upload the file product_data.csv to the specified S3 bucket (this.awsS3Bucket) under the key chocolate_data.csv.

Here is the implementation:

const fs = require('fs');
const AWS = require('aws-sdk');

class ProductDataPipeline {
  constructor(csvFilename, jsonFilename, awsS3Bucket, storageQueueLimit = 5) {
    this.namesSeen = [];
    this.storageQueue = [];
    this.storageQueueLimit = storageQueueLimit;
    this.csvFilename = csvFilename;
    this.jsonFilename = jsonFilename;
    this.awsS3Bucket = awsS3Bucket;
  }

  saveToS3Bucket() {
    const s3 = new AWS.S3({
      accessKeyId: process.env.AWS_ACCESS_KEY_ID,
      secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY
    });

    const params = {
      Bucket: this.awsS3Bucket,
      Key: 'chocolate_data.csv',
      Body: fs.createReadStream(this.csvFilename)
    };

    s3.upload(params, function (err, data) {
      if (err) {
        console.log("Error", err);
      }
      if (data) {
        console.log("Upload Success", data.Location);
      }
    });
  }
}

You'll need to supply your own AWS key and secret through the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables, and pass in your own bucket name and the correct file path.
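
If you want the scraper to fail fast when those environment variables are missing, you can add a small guard before calling the upload (a minimal sketch; it only checks the variables the method reads):

// Optional guard: warn early if the AWS credentials aren't available.
const hasAwsCredentials =
  Boolean(process.env.AWS_ACCESS_KEY_ID) && Boolean(process.env.AWS_SECRET_ACCESS_KEY);

if (!hasAwsCredentials) {
  console.warn('AWS credentials are not set; skipping the S3 upload.');
}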

Here’s the snapshot showing that the file has been uploaded successfully:

AWS S3 bucket

Note: When saving data to AWS S3, employ delayed file uploads. This means that the file is first temporarily saved locally on the machine running the scraper, and then uploaded to AWS S3 once the scraper's job is complete.
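
In code, that pattern is just a matter of ordering: close the pipeline so all local files are finished, then upload. A minimal sketch, assuming a pipeline instance like the ProductDataPipeline from the complete code at the end of this article:

// Delayed upload: write everything locally while scraping, then push the
// finished CSV to S3 in one go once the scraper's job is complete.
const finishRun = async (pipeline) => {
  await pipeline.closePipeline();   // flush any remaining products to local files
  pipeline.saveToS3Bucket();        // upload the completed CSV to S3
};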


Saving Data to MySQL Database

MySQL is a popular open-source relational database management system that uses structured query language (SQL).

Saving data to a MySQL database is beneficial for applications that require complex querying and data manipulation capabilities.

Storing data in a MySQL database has several benefits:

  • Structured Data: MySQL uses a structured format, which is ideal for applications that require complex queries and data manipulation.
  • Performance: MySQL is optimized for read-heavy operations, making it suitable for applications with high query loads.
  • Flexibility: MySQL supports a wide range of data types and indexing options, providing flexibility in how data is stored and retrieved.
  • Community Support: MySQL has a large and active community, offering extensive documentation, tutorials, and support resources.

Downloading MySQL Database

Ensure you have MySQL installed on your machine and have created a database. If you don't have MySQL installed, you can follow the official MySQL installation guide for your operating system.

Creating the MySQL Table

Ensure you have a table called chocolate_products in your database. If not, you can log in to your database and run the following command to create the table:

CREATE TABLE IF NOT EXISTS chocolate_products (
  id INT AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(255),
  price_gb DECIMAL(10, 2),
  price_usd DECIMAL(10, 2),
  url TEXT
);

Connecting Scraper To MySQL Database

To interact with your MySQL database from Node.js, you'll need the appropriate MySQL connector. You can easily install it using npm:

$ npm install mysql2
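
Before wiring the database into the pipeline, it can be worth confirming that Node.js can actually reach it. Here's a small, hypothetical connection test (the host, user, password, and database name are placeholders; swap in your own settings):

const mysql = require('mysql2/promise');

// Quick connection test: run a trivial query, print the result, and disconnect.
const testMySQLConnection = async () => {
  const connection = await mysql.createConnection({
    host: 'localhost',        // placeholder connection details
    user: 'root',
    password: 'password',
    database: 'chocolatedb'
  });
  const [rows] = await connection.execute('SELECT 1 AS ok');
  console.log('MySQL connection OK:', rows[0].ok === 1);
  await connection.end();
};

testMySQLConnection().catch(err => console.error('MySQL connection failed:', err.message));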

We will create a method called saveToMySQL() within the ProductDataPipeline class. This method will perform the following actions:

  1. Establish Connection: Establish a connection to your MySQL database.
  2. Insert Data: Insert the product data into the chocolate_products table.

Here is the implementation:

const mysql = require('mysql2/promise');

class ProductDataPipeline {
  constructor(csvFilename, jsonFilename, mysqlDbName, storageQueueLimit = 5) {
    this.namesSeen = [];
    this.storageQueue = [];
    this.storageQueueLimit = storageQueueLimit;
    this.csvFilename = csvFilename;
    this.jsonFilename = jsonFilename;
    this.mysqlDbName = mysqlDbName;
  }

  async saveToMySQL() {
    const productsToSave = [...this.storageQueue];
    this.storageQueue = [];

    if (!productsToSave.length) return;

    const connection = await mysql.createConnection({
      host: 'localhost',
      user: 'root',
      password: 'password',
      database: this.mysqlDbName
    });

    const query = `INSERT INTO chocolate_products (name, price_gb, price_usd, url) VALUES (?, ?, ?, ?)`;

    for (const product of productsToSave) {
      await connection.execute(query, [product.name, product.priceGBP, product.priceUSD, product.url]);
    }

    await connection.end();
  }
}

Here’s the snapshot of the data stored in the MySQL database:

MySQL Data


Saving Data to PostgreSQL Database

PostgreSQL is a powerful, open-source relational database management system known for its robustness, extensibility, and compliance with SQL standards.

Saving data to a PostgreSQL database is beneficial for applications that require complex querying, data integrity, and support for advanced data types.

Storing data in a PostgreSQL database has several benefits:

  • Advanced Features: PostgreSQL supports advanced features such as full-text search, custom data types, and complex queries.
  • Data Integrity: PostgreSQL offers robust data integrity features, including foreign keys, constraints, and transactional integrity.
  • Extensibility: PostgreSQL is highly extensible, allowing developers to add custom functions, operators, and data types.
  • Compliance: PostgreSQL adheres closely to SQL standards, ensuring compatibility with a wide range of SQL-based applications.

Downloading PostgreSQL

Ensure you have PostgreSQL installed on your machine and have created a database. If you need to set one up, the official PostgreSQL documentation covers installation for each operating system.

Creating the PostgreSQL Table

Ensure you have a table called chocolate_products in your database. If not, you can log in to your database and run the following command to create the table:

CREATE TABLE IF NOT EXISTS chocolate_products (
  id SERIAL PRIMARY KEY,
  name VARCHAR(255),
  price_gb DECIMAL(10, 2),
  price_usd DECIMAL(10, 2),
  url TEXT
);

Installing PostgreSQL Connector

To interact with your PostgreSQL database from Node.js, you'll need the appropriate PostgreSQL connector. You can easily install it using npm:

$ npm install pg

We will create a method called saveToPostgreSQL() within the ProductDataPipeline class. This method will perform the following actions:

  1. Establish Connection: Establish a connection to your PostgreSQL database.
  2. Insert Data: Insert the product data into the chocolate_products table.

Here is the implementation:

const { Client } = require('pg');

class ProductDataPipeline {
  constructor(csvFilename, jsonFilename, postgresDbName, storageQueueLimit = 5) {
    this.namesSeen = [];
    this.storageQueue = [];
    this.storageQueueLimit = storageQueueLimit;
    this.csvFilename = csvFilename;
    this.jsonFilename = jsonFilename;
    this.postgresDbName = postgresDbName;
  }

  async saveToPostgreSQL() {
    const productsToSave = [...this.storageQueue];
    this.storageQueue = [];

    if (!productsToSave.length) return;

    const client = new Client({
      host: 'localhost',
      user: 'postgres',
      password: 'password',
      database: this.postgresDbName
    });

    await client.connect();

    const query = `INSERT INTO chocolate_products (name, price_gb, price_usd, url) VALUES ($1, $2, $3, $4)`;

    for (const product of productsToSave) {
      await client.query(query, [product.name, product.priceGBP, product.priceUSD, product.url]);
    }

    await client.end();
  }
}

Here’s the snapshot of the data stored in the PostgreSQL database:

PostgreSQL Data


Complete Code

When you execute the code, it will create two files, product_data.csv and product_data.json. It will then store the data in MySQL, PostgreSQL, and the Amazon S3 bucket.

Execute the following command to see the data in your MySQL and PostgreSQL databases:

SELECT * FROM chocolate_products;
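
If you'd rather check from Node.js instead of a database shell, a short script like this prints a few rows back from PostgreSQL (a sketch using the pg client; the connection details are placeholders):

const { Client } = require('pg');

// Read a handful of rows back to verify the inserts worked.
const previewRows = async () => {
  const client = new Client({
    host: 'localhost',        // placeholder connection details
    user: 'postgres',
    password: 'password',
    database: 'chocolatedb'
  });
  await client.connect();
  const result = await client.query('SELECT name, price_gb, price_usd FROM chocolate_products LIMIT 5');
  console.table(result.rows);
  await client.end();
};

previewRows().catch(err => console.error('Query failed:', err.message));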

Here's the complete code integrating all the methods for saving data to different storage options. In this combined pipeline, addProduct() and closePipeline() drain the storage queue once per batch (via a small flushQueue() helper) and pass the same products to each save method, so every destination receives the full dataset:

const fs = require('fs');
const AWS = require('aws-sdk');
const mysql = require('mysql2/promise');
const { Client } = require('pg');
const puppeteer = require('puppeteer');

class Product {
  constructor(name, priceString, url) {
    this.name = this.cleanName(name);
    this.priceGBP = this.cleanPrice(priceString);
    this.priceUSD = this.convertPriceToUSD();
    this.url = this.createAbsoluteURL(url);
  }

  cleanName(name) {
    return name.trim() || "missing";
  }

  cleanPrice(priceString) {
    if (!priceString) return 0.0;
    priceString = priceString.replace(/[^0-9\.]+/g, '');
    return parseFloat(priceString) || 0.0;
  }

  convertPriceToUSD() {
    const exchangeRate = 1.21;
    return this.priceGBP * exchangeRate;
  }

  createAbsoluteURL(relativeURL) {
    const baseURL = "https://www.chocolate.co.uk";
    return relativeURL ? `${baseURL}${relativeURL}` : "missing";
  }
}

class ProductDataPipeline {
  constructor(csvFilename, jsonFilename, mysqlDbName, postgresDbName, awsS3Bucket, storageQueueLimit = 5) {
    this.namesSeen = [];
    this.storageQueue = [];
    this.storageQueueLimit = storageQueueLimit;
    this.csvFilename = csvFilename;
    this.jsonFilename = jsonFilename;
    this.mysqlDbName = mysqlDbName;
    this.postgresDbName = postgresDbName;
    this.awsS3Bucket = awsS3Bucket;
  }

  saveToCsv(productsToSave) {
    if (!productsToSave.length) return;

    const headers = Object.keys(productsToSave[0]);
    const fileExists = fs.existsSync(this.csvFilename);

    const csvWriter = fs.createWriteStream(this.csvFilename, { flags: 'a' });
    if (!fileExists) {
      csvWriter.write(headers.join(',') + '\n');
    }

    productsToSave.forEach(product => {
      const row = headers.map(header => product[header]).join(',');
      csvWriter.write(row + '\n');
    });

    csvWriter.end();
  }

  saveToJson(productsToSave) {
    if (!productsToSave.length) return;

    let existingData = [];

    try {
      existingData = JSON.parse(fs.readFileSync(this.jsonFilename, 'utf8'));
    } catch (error) {
      if (error.code !== 'ENOENT') throw error;
    }

    existingData.push(...productsToSave);

    fs.writeFileSync(this.jsonFilename, JSON.stringify(existingData, null, 2), 'utf8');
  }

  async saveToMySQL(productsToSave) {
    if (!productsToSave.length) return;

    const connection = await mysql.createConnection({
      host: 'localhost',
      user: 'root',
      password: 'password',
      database: this.mysqlDbName
    });

    const query = `INSERT INTO chocolate_products (name, price_gb, price_usd, url) VALUES (?, ?, ?, ?)`;

    for (const product of productsToSave) {
      await connection.execute(query, [product.name, product.priceGBP, product.priceUSD, product.url]);
    }

    await connection.end();
  }

  async saveToPostgreSQL(productsToSave) {
    if (!productsToSave.length) return;

    const client = new Client({
      host: 'localhost',
      user: 'postgres',
      password: 'password',
      database: this.postgresDbName
    });

    await client.connect();

    const query = `INSERT INTO chocolate_products (name, price_gb, price_usd, url) VALUES ($1, $2, $3, $4)`;

    for (const product of productsToSave) {
      await client.query(query, [product.name, product.priceGBP, product.priceUSD, product.url]);
    }

    await client.end();
  }

  saveToS3Bucket() {
    const s3 = new AWS.S3({
      accessKeyId: process.env.AWS_ACCESS_KEY_ID,
      secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY
    });

    const params = {
      Bucket: this.awsS3Bucket,
      Key: 'chocolate_data.csv',
      Body: fs.createReadStream(this.csvFilename)
    };

    s3.upload(params, function (err, data) {
      if (err) {
        console.log("Error", err);
      }
      if (data) {
        console.log("Upload Success", data.Location);
      }
    });
  }

  cleanRawProduct(scrapedData) {
    return new Product(
      scrapedData.name || '',
      scrapedData.price || '',
      scrapedData.url || ''
    );
  }

  isDuplicate(product) {
    if (this.namesSeen.includes(product.name)) {
      console.log(`Duplicate item found: ${product.name}. Item dropped.`);
      return true;
    }
    this.namesSeen.push(product.name);
    return false;
  }

  async flushQueue() {
    // Drain the queue once and hand the same batch to every storage target,
    // so the CSV, JSON, MySQL, and PostgreSQL outputs all receive the data.
    const productsToSave = [...this.storageQueue];
    this.storageQueue = [];

    if (!productsToSave.length) return;

    this.saveToCsv(productsToSave);
    this.saveToJson(productsToSave);
    await this.saveToMySQL(productsToSave);
    await this.saveToPostgreSQL(productsToSave);
  }

  async addProduct(scrapedData) {
    const product = this.cleanRawProduct(scrapedData);
    if (!this.isDuplicate(product)) {
      this.storageQueue.push(product);
      if (this.storageQueue.length >= this.storageQueueLimit) {
        await this.flushQueue();
      }
    }
  }

  async closePipeline() {
    if (this.storageQueue.length > 0) {
      await this.flushQueue();
    }
  }
}

const startScrape = async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const baseURL = 'https://www.chocolate.co.uk/collections/all';
  const dataPipeline = new ProductDataPipeline('product_data.csv', 'product_data.json', 'chocolatedb', 'chocolatedb', 'chocolate-products');
  let nextPageExists = true;
  let currentPage = baseURL;

  while (nextPageExists) {
    await page.goto(currentPage, { waitUntil: 'networkidle2' });

    const products = await page.evaluate(() => {
      const items = document.querySelectorAll('.product-item');
      return Array.from(items).map(item => ({
        name: item.querySelector('.product-item-meta__title').innerText,
        price: item.querySelector('.price').innerText,
        url: item.querySelector('.product-item-meta a').getAttribute('href')
      }));
    });

    for (const product of products) {
      await dataPipeline.addProduct(product);
    }

    nextPageExists = await page.evaluate(() => {
      const nextPage = document.querySelector('a[rel="next"]');
      return nextPage ? nextPage.href : null;
    });

    if (nextPageExists) {
      currentPage = nextPageExists;
    }
  }

  await browser.close();
  await dataPipeline.closePipeline();
  console.log("Congratulations! Data saved successfully on MySQL, PostgreSQL, JSON, and CSV.");
  dataPipeline.saveToS3Bucket();
};

startScrape();

Next Steps

We hope you now have a good understanding of how to save the data you've scraped into the file or database you need! If you have any questions, leave them in the comments below and we'll do our best to help out!

In the next tutorial, we'll cover how to make our scraper more robust and scalable by handling failed requests and using concurrency. Stay tuned!