Node.js Axios/CheerioJS Beginners Series Part 3: Storing Data
In Part 1 and Part 2 of this Node.js Axios/CheerioJS Beginners Series, we learned how to build a basic web scraper and extract data from websites, as well as how to clean the scraped data.
In Part 3, we'll explore various methods for saving the data in formats suitable for common use cases. We'll cover saving data to JSON files, as well as storing the data in databases or AWS S3 buckets.
- Saving Data to a JSON File
- Saving Data to Amazon S3 Storage
- Saving Data to MySQL Database
- Saving Data to Postgres Database
- Complete Code
- Next Steps
Node.js Axios/CheerioJS 6-Part Beginner Series
This 6-part Node.js Axios/CheerioJS Beginner Series will walk you through building a web scraping project from scratch, covering everything from creating the scraper to deployment and scheduling.
- Part 1: Basic Node.js Cheerio Scraper - You'll learn the fundamentals of web scraping with Node.js and build your first scraper using Cheerio. (Part 1)
- Part 2: Cleaning Unruly Data & Handling Edge Cases - Web data can be messy and unpredictable. In this part, we'll create a robust scraper using data structures and cleaning techniques to handle these challenges. (Part 2)
- Part 3: Storing Scraped Data - Explore various options for storing your scraped data, including databases like MySQL or Postgres, cloud storage like AWS S3, and file formats like CSV and JSON. We'll discuss their pros, cons, and suitable use cases. (This article)
- Part 4: Managing Retries & Concurrency - Enhance your scraper's reliability and scalability by handling failed requests and utilizing concurrency. (Part 4)
- Part 5: Mimicking User Behavior - Learn how to create a production-ready scraper by simulating real users through user-agent and browser header manipulation. (Part 5)
- Part 6: Avoiding Detection with Proxies - Discover how to use proxies to bypass anti-bot systems by disguising your real IP address and location. (Coming soon)
The code for this project is available on GitHub.
Saving Data to a JSON File
💡 Note: All methods for storing data in JSON files, databases, or S3 buckets will be implemented within the ProductDataPipeline class, which was created in Part 2 of this series.
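For orientation, here's a rough sketch of how that class might be shaped once this part is complete. The constructor arguments and field names are inferred from how they're used in the methods below (this.csvFilename, this.jsonFileName, this.s3Bucket, this.storageQueue); your version from Part 2 may differ slightly.

class ProductDataPipeline {
  constructor(csvFilename = "", jsonFileName = "", s3Bucket = "") {
    this.csvFilename = csvFilename; // local CSV file (created in Part 2)
    this.jsonFileName = jsonFileName; // local JSON file (added in this part)
    this.s3Bucket = s3Bucket; // name of the S3 bucket (added in this part)
    this.storageQueue = []; // cleaned products waiting to be written out
  }

  // Part 2 added saveToCsv() and the data-cleaning logic.
  // This part adds saveToJson(), saveToS3Bucket(), and the MySQL/Postgres methods covered below.
}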
To save data to the JSON file, we'll create a method called saveToJson(). The method performs the following actions:
- It checks whether the given jsonFileName already exists; as with our earlier saveToCsv() method, we don't want to overwrite existing data. If the file exists, we load its content into the existingData array.
- Then we use the JavaScript spread operator (...) to merge the existing data (if there is any) with the current storageQueue values.
- Finally, we write the merged array to the file with the given jsonFileName.
saveToJson() {
  // Nothing to write if the queue is empty
  if (this.storageQueue.length <= 0) {
    return;
  }

  // Load any data already saved to the JSON file so we don't overwrite it
  const fileExists = fs.existsSync(this.jsonFileName);
  let existingData = [];
  if (fileExists) {
    const fileContent = fs.readFileSync(this.jsonFileName, "utf8");
    existingData = JSON.parse(fileContent);
  }

  // Merge the existing data with the products currently in the queue and write it back out
  const mergedData = [...existingData, ...this.storageQueue];
  fs.writeFileSync(this.jsonFileName, JSON.stringify(mergedData, null, 2));
}
Which produces JSON along the following lines (the entries here are illustrative; the actual fields mirror whatever your Product class from Part 2 stores):
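[
  {
    "name": "Lovely Chocolate Bar",
    "priceGb": 1.5,
    "priceUsd": 1.89,
    "url": "https://www.chocolate.co.uk/products/lovely-chocolate-bar"
  },
  {
    "name": "Dark Chocolate Truffles",
    "priceGb": 9.95,
    "priceUsd": 12.54,
    "url": "https://www.chocolate.co.uk/products/dark-chocolate-truffles"
  }
]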
Saving Data to Amazon S3 Storage
We've already covered how to export the data to a CSV file in Part 2 of this series, and that code is also included in the Complete Code section at the end of this article. Now, let's save the generated CSV file directly to an Amazon S3 bucket (you'll need to have one set up already).
If you don't have a bucket yet, you can follow AWS's instructions on how to set one up.
Once you've made a bucket, start by installing the @aws-sdk/client-s3 package to use AWS S3 from Node.js:
npm i @aws-sdk/client-s3
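The upload method below relies on the S3Client and PutObjectCommand classes from this package, so make sure they're imported at the top of the file alongside fs. With CommonJS requires that looks like this (use the equivalent ESM import if your project uses ES modules):

const fs = require("fs");
const { S3Client, PutObjectCommand } = require("@aws-sdk/client-s3");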
To get started, set up the credentials needed to connect to Amazon S3. With these, the method creates an S3 client using the library we just installed. Next, it checks that the CSV file exists; because our program can finish quite quickly and Node is asynchronous, we want to be sure the file has actually been written before trying to upload it. Finally, it sends a PutObjectCommand with the bucket name, key, and contents of the CSV file.
Within the bucket, the Key serves as the file name used in Amazon S3, and the data to be uploaded is passed in the Body parameter.
async saveToS3Bucket() {
  const client = new S3Client({
    region: "us-east-1",
    credentials: {
      accessKeyId: "YOUR_ACCESS_KEY_ID",
      secretAccessKey: "YOUR_SECRET_ACCESS_KEY",
    },
  });

  // Wait (up to ~3 seconds) for the CSV file to appear on disk before
  // uploading, since the scraper may still be writing it.
  let retry = 3;
  while (retry > 0 && !fs.existsSync(this.csvFilename)) {
    await new Promise((resolve) => setTimeout(resolve, 1000));
    retry -= 1;
  }

  // Upload the CSV file; the Key becomes the object name inside the bucket.
  const params = {
    Bucket: this.s3Bucket,
    Key: this.csvFilename,
    Body: fs.createReadStream(this.csvFilename),
  };
  await client.send(new PutObjectCommand(params));
}
Of course, you'll need to change YOUR_ACCESS_KEY_ID and YOUR_SECRET_ACCESS_KEY to your own credentials, and make sure a bucket with the given name exists.
After running the scraper, you should see the uploaded CSV file listed among the objects in your S3 bucket.
Note: When saving data to AWS S3, employ delayed file uploads. This means that the file is first temporarily saved locally on the machine running the scraper, and then uploaded to AWS S3 once the scraper's job is complete.
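One way to wire this up is to flush everything to local files when the scraper finishes and only then trigger the S3 upload. The sketch below is an assumption rather than code from the series: the method name close() and the call order are illustrative, and if your saveToCsv() from Part 2 is asynchronous you should await it as well.

async close() {
  // Write any remaining queued products to the local CSV and JSON files first...
  this.saveToCsv();
  this.saveToJson();
  // ...then upload the finished CSV to S3 in a single request.
  await this.saveToS3Bucket();
}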
Saving Data to MySQL Database
Here, we'll show you how to save data to a MySQL database. To follow along, we assume you already have a database named chocolate_db set up.
Downloading MySQL
For setting up a MySQL database, check out the following resources: