How to Scrape Reddit With Puppeteer
Reddit is a platform where people can share news, opinions, ideas and all sorts of other things. In fact, Reddit has a reputation as the "Front Page of the Internet".
Today, we're going to learn how to scrape data from Reddit.
- [TLDR: How to Scrape Reddit](#tldr---how-to-scrape-reddit)
- How To Architect Our Reddit Scraper
- Understanding How To Scrape Reddit
- Setting Up Our Reddit Scraper
- Build A Reddit Crawler
- Build A Reddit Post Scraper
- Legal and Ethical Considerations
- Conclusion
- More Cool Articles
TLDR: How to Scrape Reddit
We can actually fetch a batch of posts from Reddit by adding `.json` to the end of our URL. In the example below, we've got a production-ready Reddit scraper:
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = "YOUR-SUPER-SECRET-API-KEY";
async function writeToCsv(data, outputFile) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);
const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))
const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
} catch (e) {
throw new Error("Failed to write to csv");
}
}
async function readCsv(inputFile) {
const results = [];
const parser = fs.createReadStream(inputFile).pipe(csvParse.parse({
columns: true,
delimiter: ",",
trim: true,
skip_empty_lines: true
}));
for await (const record of parser) {
results.push(record);
}
return results;
}
function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}
async function getPosts(browser, feed, limit=10, retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const page = await browser.newPage();
const namesSeen = [];
try {
const url = `https://www.reddit.com/r/${feed}.json?limit=${limit}`;
await page.goto(getScrapeOpsUrl(url));
success = true;
const jsonText = await page.$eval("pre", pre => pre.textContent);
const resp = JSON.parse(jsonText);
if (resp) {
const children = resp.data.children;
for (const child of children) {
const data = child.data;
const articleData = {
name: data.title,
author: data.author,
permalink: data.permalink,
upvoteRatio: data.upvote_ratio
}
if (!namesSeen.includes(articleData.name)) {
try {
await writeToCsv([articleData], `./${feed}.csv`);
namesSeen.push(articleData.name);
} catch {
throw new Error("failed to write csv file:", articleData);
}
}
}
}
} catch (e) {
console.log(`ERROR: ${e}`);
tries++;
} finally {
await page.close();
}
}
}
async function processPost(browser, postObject, location="us", retries=3) {
let tries = 0;
let success = false;
const r_url = `https://www.reddit.com${postObject.permalink}.json`;
const linkArray = postObject.permalink.split("/");
const fileName = linkArray[linkArray.length-2].replace(" ", "-");
while (tries <= retries && !success) {
const page = await browser.newPage();
try {
await page.goto(getScrapeOpsUrl(r_url), {timeout: 30000});
const commentData = await page.$eval("pre", pre => pre.textContent);
if (!commentData) {
throw new Error(`No comment data found: ${fileName}`);
}
const comments = JSON.parse(commentData);
const commentsList = comments[1].data.children;
if (commentsList.length === 0) {
return;
}
for (const comment of commentsList) {
if (comment.kind !== "more") {
const data = comment.data;
const commentData = {
name: data.author,
body: data.body,
upvotes: data.ups
}
await writeToCsv([commentData], `${fileName}.csv`);
success = true;
}
}
} catch (e) {
await page.screenshot({path: `ERROR-${fileName}.png`});
console.log(`Error fetching comments for ${fileName}, retries left: ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
if (!success) {
console.log(`Max retries exceeded for: ${postObject.permalink}`);
return;
}
return;
}
async function processPosts(browser, inputFile, concurrencyLimit, location="us", retries=3) {
const posts = await readCsv(inputFile);
while (posts.length > 0) {
const currentBatch = posts.splice(0, concurrencyLimit);
const tasks = currentBatch.map(post => processPost(browser, post, location, retries));
try {
await Promise.all(tasks);
} catch (e) {
console.log("Failed to process batch");
}
}
}
async function main() {
const FEEDS = ["news"];
const RETRIES = 4;
const BATCH_SIZE = 40;
const concurrencyLimit = 10;
const AGGREGATED_FEEDS = [];
const browser = await puppeteer.launch();
for (const feed of FEEDS) {
await getPosts(browser, feed, BATCH_SIZE, RETRIES);
AGGREGATED_FEEDS.push(`${feed}.csv`);
}
for (const individualFile of AGGREGATED_FEEDS) {
await processPosts(browser, individualFile, concurrencyLimit, "us", RETRIES);
}
await browser.close();
}
main();
This scraper is production ready.
- To customize your feeds, simply add the subreddits you want to the `FEEDS` array.
- If you want the top 100 posts instead of 10, change `BATCH_SIZE` to 100.
- While Reddit content doesn't really change by location, you can pass other locations into the `location` parameter and the ScrapeOps API will route your requests through a server in that location.
- `concurrencyLimit` is the limit on the number of pages you want Puppeteer to open at once. If you notice performance issues, reduce your `concurrencyLimit`.
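For example, here is a hypothetical `main()` configured for two subreddits with a smaller concurrency limit. The subreddit names and numbers are just placeholders; the rest of the script above stays the same.

```javascript
// Hypothetical configuration: two feeds, 100 posts each, lower concurrency.
async function main() {
    const FEEDS = ["news", "worldnews"];
    const RETRIES = 4;
    const BATCH_SIZE = 100;
    const concurrencyLimit = 5;
    const AGGREGATED_FEEDS = [];

    const browser = await puppeteer.launch();
    for (const feed of FEEDS) {
        // Crawl each feed and save its posts to a CSV named after the feed.
        await getPosts(browser, feed, BATCH_SIZE, RETRIES);
        AGGREGATED_FEEDS.push(`${feed}.csv`);
    }
    for (const individualFile of AGGREGATED_FEEDS) {
        // Scrape comments for every post listed in each CSV.
        await processPosts(browser, individualFile, concurrencyLimit, "us", RETRIES);
    }
    await browser.close();
}
```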
How To Architect Our Reddit Scraper
Our Reddit scraper needs to be able to do the following things:
- Fetch a batch of Reddit posts
- Retrieve data about each individual post
- Clean our data
- Save our data to a CSV file
We're going to go through the process of building a scraper that does all of these things, from start to finish, using Puppeteer. While Puppeteer isn't exactly lightweight, it is a highly optimized headless browser with more than enough power and speed to handle this simple JSON parsing.
We can scrape Reddit with Puppeteer and gather a ton of data, fast.
Understanding How To Scrape Reddit
Step 1: How To Request Reddit Pages
When you look up a feed on Reddit, you're given content in small batches. Whenever you scroll down, the site automatically fetches more content for you to read.
Infinite scrollers like this often present challenges when trying to scrape. While Puppeteer does give us the ability to scroll the page, there's actually a faster way. We'll dig into that soon enough.
Take a look at the image below.
As you can see, the URL looks like this:
https://www.reddit.com/r/news/?rdt=51809
By tweaking just a couple of things, we can completely change the result. Let's change the URL to:
https://www.reddit.com/r/news.json
Take a look at the result below.
By simply changing `/?rdt=51809` to `.json`, we've turned Reddit into a full-blown JSON feed!
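As a quick illustration, here is a minimal sketch of fetching that JSON feed with Puppeteer. Browsers render raw JSON responses inside a `pre` tag, which is what we rely on throughout this article; without a proxy or custom user agent, Reddit may occasionally block the request.

```javascript
const puppeteer = require("puppeteer");

// Minimal sketch: fetch the r/news JSON feed and print the first post's title.
async function quickPeek() {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto("https://www.reddit.com/r/news.json");
    // The raw JSON response is rendered inside a <pre> tag.
    const jsonText = await page.$eval("pre", pre => pre.textContent);
    const resp = JSON.parse(jsonText);
    console.log(resp.data.children[0].data.title);
    await browser.close();
}

quickPeek();
```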
Step 2: How To Extract Data From Reddit Feeds
Since our data comes back as JSON, retrieving it just requires knowing how to handle simple key-value pairs on JSON objects.
Take the following example below:
const person = {name: "John Doe", age: 30};
If we want to access this data, we could do the following:
console.log(`Name: ${person.name}`);
console.log(`Age: ${person.age}`);
In order to index a JSON object, we simply use its keys. Our entire list of content comes in `resp`, our response object. Once we have our response object, we can use dot notation to access different parts of it.
Below, you'll see how we access individual articles from this JSON.
- First Article: `resp.data.children[0]`
- Second Article: `resp.data.children[1]`
- Third Article: `resp.data.children[2]`
We can follow this pattern all the way up to the last child, and then we'll be finished collecting articles.
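Rather than indexing each article by hand, a simple loop covers them all. This is a sketch, assuming `resp` is the parsed feed object from above.

```javascript
// Sketch: iterate every article in the parsed feed instead of indexing by hand.
// `resp` is assumed to be the object returned by JSON.parse() on the feed text.
for (const child of resp.data.children) {
    console.log(child.data.title, child.data.author);
}
```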
Step 3: How To Control Pagination
Think back to the link we looked at previously, `https://www.reddit.com/r/news.json`. We can add a `limit` parameter to this URL for finer control of our results.
If we want 100 news results, our URL would look like this:
https://www.reddit.com/r/news.json?limit=100
This doesn't give us actual pages to sort through, but we do get long lists of results that we can control. All we have to do is pass this text into `JSON.parse()`.
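For example, here is a sketch of building that URL with a template literal. The `feed` and `limit` values are placeholders.

```javascript
// Sketch: build a feed URL with a limit parameter.
const feed = "news";
const limit = 100;
const url = `https://www.reddit.com/r/${feed}.json?limit=${limit}`;
console.log(url); // https://www.reddit.com/r/news.json?limit=100
// After fetching and parsing the page text, resp.data.children should
// contain at most 100 posts.
```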
Setting Up Our Reddit Scraper Project
Let's get started with our project. First, we'll make a new project folder. I'll call mine `reddit-scraper`.
mkdir reddit-scraper
Hop into our new folder.
cd reddit-scraper
Next, we'll activate this folder as a new JavaScript project.
npm init -y
We have three dependencies.
Puppeteer
npm install puppeteer
csv-writer
npm install csv-writer
csv-parse
npm install csv-parse
Build A Reddit Crawler
When scraping Reddit, we actually need to build two scrapers.
- We need a crawler to identify all of the different posts.
- This crawler will fetch lists of posts, extract data from them, and save that data to a CSV file.
- After our data is saved into the CSV file, each row will be read by an individual post scraper.
- The post scraper will go and fetch individual posts along with comments and some other metadata.
Step 1: Create A Reddit Data Parser
First, we need a simple parser for our Reddit data. In the script below:
- We have one function, `getPosts()`. This function takes three arguments: `browser`, `feed`, and `retries` (an optional parameter that defaults to 3).
- While we still have retries left, we try to fetch the JSON feed from Reddit.
- If we receive an error while we still have retries left, we take a screenshot of the page and try again.
- If we run out of retries, we give up and the loop exits.
- Our JSON data comes nested inside of a `pre` tag, so we simply use Puppeteer to find the `pre` tag with `page.$eval()` before loading our text.
- In our JSON response, we have an array of JSON objects called `children`.
- Each item in this array represents an individual Reddit post. Each post contains a `data` field.
- From that `data` field, we pull the `title`, `author`, `permalink`, and `upvote_ratio`.
- Later on, these items will make up the data we wish to save from our search, but we'll just print them for now.
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = "YOUR-SUPER-SECRET-API-KEY";
const DEFAULT_USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36";
async function getPosts(browser, feed, retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const page = await browser.newPage();
const namesSeen = [];
try {
const url = `https://www.reddit.com/r/${feed}.json?`;
await page.setUserAgent(DEFAULT_USER_AGENT);
await page.goto(url);
success = true;
const jsonText = await page.$eval("pre", pre => pre.textContent);
const resp = JSON.parse(jsonText);
if (resp) {
const children = resp.data.children;
for (const child of children) {
const data = child.data;
const articleData = {
name: data.title,
author: data.author,
permalink: data.permalink,
upvoteRatio: data.upvote_ratio
}
if (!namesSeen.includes(articleData.name)) {
console.log(articleData);
namesSeen.push(articleData.name);
}
}
}
} catch (e) {
await page.screenshot({path: "error.png"});
console.log(`ERROR: ${e}`);
tries++;
} finally {
await page.close();
}
}
}
async function main() {
const FEEDS = ["news"];
const RETRIES = 4;
const BATCH_SIZE = 10;
const AGGREGATED_FEEDS = [];
const browser = await puppeteer.launch();
for (const feed of FEEDS) {
await getPosts(browser, feed, RETRIES);
AGGREGATED_FEEDS.push(`${feed}.csv`);
}
await browser.close();
}
main();
If you run this script, you should get an output similar to this:
As you can see in the image above, we extract the following from each post:
- Name
- Author
- Permalink
- Upvote Ratio
Step 2: Add Pagination
Now that we're getting results, we need finer control over them. If we want 100 results, we should get 100. If we only want 10, we should get 10. We can accomplish this by adding the `limit` parameter to our URL.
Let's refactor our `getPosts()` function to take an additional parameter, `limit`. Taking our `limit` into account, our URL will now look like this:
https://www.reddit.com/r/{feed}.json?limit={limit}
Here is the updated script:
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = "YOUR-SUPER-SECRET-API-KEY";
const DEFAULT_USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36";
async function getPosts(browser, feed, limit=10, retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const page = await browser.newPage();
const namesSeen = [];
try {
const url = `https://www.reddit.com/r/${feed}.json?limit=${limit}`;
await page.setUserAgent(DEFAULT_USER_AGENT);
await page.goto(url);
success = true;
const jsonText = await page.$eval("pre", pre => pre.textContent);
const resp = JSON.parse(jsonText);
if (resp) {
const children = resp.data.children;
for (const child of children) {
const data = child.data;
const articleData = {
name: data.title,
author: data.author,
permalink: data.permalink,
upvoteRatio: data.upvote_ratio
}
if (!namesSeen.includes(articleData.name)) {
console.log(articleData);
namesSeen.push(articleData.name);
}
}
}
} catch (e) {
await page.screenshot({path: "error.png"});
console.log(`ERROR: ${e}`);
tries++;
} finally {
await page.close();
}
}
}
async function main() {
const FEEDS = ["news"];
const RETRIES = 4;
const BATCH_SIZE = 10;
const AGGREGATED_FEEDS = [];
const browser = await puppeteer.launch();
for (const feed of FEEDS) {
await getPosts(browser, feed, BATCH_SIZE, RETRIES);
AGGREGATED_FEEDS.push(`${feed}.csv`);
}
await browser.close();
}
main();
In the code above, we now add a `limit` parameter to `getPosts()`. We pass our `BATCH_SIZE` into `getPosts()` to control the size of our results. Feel free to try changing the batch size and examining your results. `limit` is incredibly important.
We don't want to scrape through hundreds of results if we only need 10... and we certainly don't want to try scraping hundreds of results when we're only limited to 10!
This limit is the foundation of all the data we're going to scrape.
Step 3: Storing the Scraped Data
Now that we're retrieving the proper data, we need to be able to store it. To store this data:
- First, we need to be able to write it to a CSV file.
- Second, we need the ability to filter out duplicates.
We'll create a `writeToCsv()` function, and our `namesSeen` array will filter repeats out of the data we choose to save.
In this example, we use `namesSeen` to filter out repeat data (just like we have been, but now it's particularly important). All data that we haven't seen goes straight into the CSV file.
We write each object to the CSV individually. We do this for safety: in the event of a crash, our scraper loses at most an individual result rather than a whole batch. Everything scraped up until that point will already be in our CSV.
Also, pay attention to the `fileExists` variable in the `writeToCsv()` function. This is a simple boolean value: if it's true, we append to the file; if the file doesn't exist, we create it.
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = "YOUR-SUPER-SECRET-API-KEY";
const DEFAULT_USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36";
async function writeToCsv(data, outputFile) {
if (!data || data.length === 0) {
return;
}
const fileExists = fs.existsSync(outputFile);
const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))
const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
console.log(`successfully wrote data to ${outputFile}`);
} catch (e) {
console.log(`failed to write to csv: ${e}`);
}
}
async function getPosts(browser, feed, limit=10, retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const page = await browser.newPage();
const namesSeen = [];
try {
const url = `https://www.reddit.com/r/${feed}.json?limit=${limit}`;
await page.setUserAgent(DEFAULT_USER_AGENT);
await page.goto(url);
success = true;
const jsonText = await page.$eval("pre", pre => pre.textContent);
const resp = JSON.parse(jsonText);
if (resp) {
const children = resp.data.children;
for (const child of children) {
const data = child.data;
const articleData = {
name: data.title,
author: data.author,
permalink: data.permalink,
upvoteRatio: data.upvote_ratio
}
if (!namesSeen.includes(articleData.name)) {
await writeToCsv([articleData], `./${feed}.csv`);
namesSeen.push(articleData.name);
}
}
}
} catch (e) {
console.log(`ERROR: ${e}`);
tries++;
} finally {
await page.close();
}
}
}
async function main() {
const FEEDS = ["news"];
const RETRIES = 4;
const BATCH_SIZE = 10;
const AGGREGATED_FEEDS = [];
const browser = await puppeteer.launch();
for (const feed of FEEDS) {
await getPosts(browser, feed, BATCH_SIZE, RETRIES);
AGGREGATED_FEEDS.push(`${feed}.csv`);
}
await browser.close();
}
main();
In the code example above, we use `writeToCsv()` to write each object to a CSV file as it comes. This allows us to write each object as soon as we find it; objects don't hang in memory while we complete the for loop.
We also use a boolean, `fileExists`, which tells the writer whether to create or append to our CSV file.
Step 4: Bypassing Anti-Bots
Anti-bots are often used to detect malicious software. While our crawler is not malicious, we are requesting JSON data in custom batches, and that does make us look a bit abnormal.
To avoid getting blocked, we're going to pass all of these `page.goto()` requests through the ScrapeOps Proxy API Aggregator. This API gives us the benefit of rotating IP addresses, and it always selects the best proxy available.
In the code snippet below, we create a really small function, `getScrapeOpsUrl()`. This function takes in a regular URL and uses simple string formatting to create a proxied URL.
function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}
Here is the full code example.
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = "YOUR-SUPER-SECRET-API-KEYS";
async function writeToCsv(data, outputFile) {
if (!data || data.length === 0) {
return;
}
const fileExists = fs.existsSync(outputFile);
const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))
const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
console.log(`successfully wrote data to ${outputFile}`);
} catch (e) {
console.log(`failed to write to csv: ${e}`);
}
}
function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}
async function getPosts(browser, feed, limit=10, retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const page = await browser.newPage();
const namesSeen = [];
try {
const url = `https://www.reddit.com/r/${feed}.json?limit=${limit}`;
await page.goto(getScrapeOpsUrl(url));
success = true;
const jsonText = await page.$eval("pre", pre => pre.textContent);
const resp = JSON.parse(jsonText);
if (resp) {
const children = resp.data.children;
for (const child of children) {
const data = child.data;
const articleData = {
name: data.title,
author: data.author,
permalink: data.permalink,
upvoteRatio: data.upvote_ratio
}
if (!namesSeen.includes(articleData.name)) {
await writeToCsv([articleData], `./${feed}.csv`);
namesSeen.push(articleData.name);
}
}
}
} catch (e) {
console.log(`ERROR: ${e}`);
tries++;
} finally {
await page.close();
}
}
}
async function main() {
const FEEDS = ["news"];
const RETRIES = 4;
const BATCH_SIZE = 10;
const AGGREGATED_FEEDS = [];
const browser = await puppeteer.launch();
for (const feed of FEEDS) {
await getPosts(browser, feed, BATCH_SIZE, RETRIES);
AGGREGATED_FEEDS.push(`${feed}.csv`);
}
await browser.close();
}
main();
In the example above, we create a proxied URL by passing our `url` into `getScrapeOpsUrl()`. We then pass the result directly into `page.goto()` so Puppeteer takes us to the new proxied URL instead of the regular one.
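Here is a small usage sketch showing what the proxied URL looks like. The API key is a placeholder, and `URLSearchParams` takes care of encoding the target URL.

```javascript
// Sketch: the target URL is wrapped as a query parameter on the proxy endpoint.
const target = "https://www.reddit.com/r/news.json?limit=10";
console.log(getScrapeOpsUrl(target, "us"));
// https://proxy.scrapeops.io/v1/?api_key=YOUR-SUPER-SECRET-API-KEY&url=https%3A%2F%2Fwww.reddit.com%2Fr%2Fnews.json%3Flimit%3D10&country=us
```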
Step 5: Production Run
Now that we've got a working crawler, let's give it a production run. I'm going to change our batch size to 100.
async function main() {
const FEEDS = ["news"];
const RETRIES = 4;
const BATCH_SIZE = 100;
const AGGREGATED_FEEDS = [];
const browser = await puppeteer.launch();
for (const feed of FEEDS) {
await getPosts(browser, feed, BATCH_SIZE, RETRIES);
AGGREGATED_FEEDS.push(`${feed}.csv`);
}
await browser.close();
}
Now let's take a look at the output file.
Build A Reddit Post Scraper
Now it's time to build our post scraper. The goal of this scraper is quite simple. It needs to use concurrency to do the following:
- Read a row from a CSV file
- Fetch the individual post data from each row in the CSV
- Extract relevant data from the post
- Save that data to a new CSV file... a file unique to each post that we're scraping
While we don't get native threading support from NodeJS, we do have first-class async support. We can use this to get optimal concurrency even though we're still only running on one thread.
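As a tiny sketch of what that async concurrency buys us (the tasks and delays here are made up):

```javascript
// Sketch: three tasks awaited together overlap their waiting time,
// so the whole batch finishes in roughly one second instead of three.
function fakeTask(id) {
    return new Promise(resolve => setTimeout(() => resolve(`task ${id} done`), 1000));
}

async function runBatch() {
    const results = await Promise.all([fakeTask(1), fakeTask(2), fakeTask(3)]);
    console.log(results);
}

runBatch();
```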
Step 1: Create Simple Reddit Post Data Parser
Here is our parsing function for posts. We're once again retrieving JSON blobs and extracting important information from them. This function takes the `permalink` from the post objects we created earlier in our crawler.
We're not ready to run this code yet, because we need to be able to read the CSV file we created earlier. If we can't read the file, our scraper won't know which posts to process.
async function processPost(browser, postObject, location="us", retries=3) {
let tries = 0;
let success = false;
const r_url = `https://www.reddit.com${postObject.permalink}.json`;
const linkArray = postObject.permalink.split("/");
const fileName = linkArray[linkArray.length-2].replace(" ", "-");
while (tries <= retries && !success) {
const page = await browser.newPage();
const namesSeen = [];
try {
await page.setUserAgent(DEFAULT_USER_AGENT);
await page.goto(r_url);
const commentData = await page.$eval("pre", pre => pre.textContent);
if (!commentData) {
throw new Error(`No comment data found: ${fileName}`);
}
const comments = JSON.parse(commentData);
const commentsList = comments[1].data.children;
for (const comment of commentsList) {
if (comment.kind !== "more") {
const data = comment.data;
const commentData = {
name: data.author,
body: data.body,
upvotes: data.ups
}
console.log("Comment Data:", commentData);
success = true;
}
}
} catch (e) {
await page.screenshot({path: "error.png"});
console.log(`Error fetching comments for ${fileName}`);
tries++;
} finally {
await page.close();
}
}
}
As long as our comment data comes back in the form of a list, we can then go through and parse the comments. If a comment's `kind` is not `"more"`, we assume it's a comment we want to process.
We pull the `author`, `body`, and upvotes (`ups`) for each individual comment. If someone wants to look at this data at a larger scope, they can then compare accurately to see which types of comments get the best reactions from people.
Step 2: Loading URLs To Scrape
In order to use the parsing function we just created, we need to read the data from our CSV. To do this, we'll use `readCsv()`, which allows us to read individual rows from the CSV file. We'll call `processPost()` on each row we read from the file.
Here is the full code example that reads rows from the CSV file and processes them. We have an additional function, `processPosts()`. It uses a for loop as a placeholder for now, but later on, this function will be rewritten for better concurrency with `async` support.
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = "YOUR-SUPER-SECRET-API-KEY";
const DEFAULT_USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36";
async function writeToCsv(data, outputFile) {
if (!data || data.length === 0) {
return;
}
const fileExists = fs.existsSync(outputFile);
const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))
const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
console.log(`successfully wrote data to ${outputFile}`);
} catch (e) {
console.log(`failed to write to csv: ${e}`);
}
}
async function readCsv(inputFile) {
const results = [];
const parser = fs.createReadStream(inputFile).pipe(csvParse.parse({
columns: true,
delimiter: ",",
trim: true,
skip_empty_lines: true
}));
for await (const record of parser) {
results.push(record);
}
return results;
}
function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}
async function getPosts(browser, feed, limit=10, retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const page = await browser.newPage();
const namesSeen = [];
try {
const url = `https://www.reddit.com/r/${feed}.json?limit=${limit}`;
await page.setUserAgent(DEFAULT_USER_AGENT);
await page.goto(getScrapeOpsUrl(url));
success = true;
const jsonText = await page.$eval("pre", pre => pre.textContent);
const resp = JSON.parse(jsonText);
if (resp) {
const children = resp.data.children;
for (const child of children) {
const data = child.data;
const articleData = {
name: data.title,
author: data.author,
permalink: data.permalink,
upvoteRatio: data.upvote_ratio
}
if (!namesSeen.includes(articleData.name)) {
await writeToCsv([articleData], `./${feed}.csv`);
namesSeen.push(articleData.name);
}
}
}
} catch (e) {
console.log(`ERROR: ${e}`);
tries++;
} finally {
await page.close();
}
}
}
async function processPost(browser, postObject, location="us", retries=3) {
let tries = 0;
let success = false;
const r_url = `https://www.reddit.com${postObject.permalink}.json`;
const linkArray = postObject.permalink.split("/");
const fileName = linkArray[linkArray.length-2].replace(" ", "-");
while (tries <= retries && !success) {
const page = await browser.newPage();
const namesSeen = [];
try {
await page.setUserAgent(DEFAULT_USER_AGENT);
await page.goto(r_url);
const commentData = await page.$eval("pre", pre => pre.textContent);
if (!commentData) {
throw new Error(`No comment data found: ${fileName}`);
}
const comments = JSON.parse(commentData);
const commentsList = comments[1].data.children;
for (const comment of commentsList) {
if (comment.kind !== "more") {
const data = comment.data;
const commentData = {
name: data.author,
body: data.body,
upvotes: data.ups
}
console.log("Comment Data:", commentData);
success = true;
}
}
} catch (e) {
await page.screenshot({path: "error.png"});
console.log(`Error fetching comments for ${fileName}`);
tries++;
} finally {
await page.close();
}
}
}
async function processPosts(browser, inputFile, location="us", retries=3) {
const posts = await readCsv(inputFile);
for (const post of posts) {
await processPost(browser, post);
}
}
async function main() {
const FEEDS = ["news"];
const RETRIES = 4;
const BATCH_SIZE = 1;
const AGGREGATED_FEEDS = [];
const browser = await puppeteer.launch();
for (const feed of FEEDS) {
await getPosts(browser, feed, BATCH_SIZE, RETRIES);
AGGREGATED_FEEDS.push(`${feed}.csv`);
}
for (const individualFile of AGGREGATED_FEEDS) {
await processPosts(browser, individualFile, "us", RETRIES);
}
await browser.close();
}
main();
In the code above, we call `processPosts()` to read all the data from a subreddit CSV. This function runs `processPost()` on each individual post so we can extract important comment data from the post.
Step 3: Storing the Scraped Data
We've already done most of the work as far as data storage goes. We just need to create a new `commentData` object and pass it into our `writeToCsv()` function. Just like before, we write each object as soon as we've extracted its data. This allows us to save as much data as possible in the event of a crash.
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = "YOUR-SUPER-SECRET-API-KEY";
async function writeToCsv(data, outputFile) {
if (!data || data.length === 0) {
return;
}
const fileExists = fs.existsSync(outputFile);
const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))
const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
console.log(`successfully wrote data to ${outputFile}`);
} catch (e) {
console.log(`failed to write to csv: ${e}`);
}
}
async function readCsv(inputFile) {
const results = [];
const parser = fs.createReadStream(inputFile).pipe(csvParse.parse({
columns: true,
delimiter: ",",
trim: true,
skip_empty_lines: true
}));
for await (const record of parser) {
results.push(record);
}
return results;
}
function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}
async function getPosts(browser, feed, limit=10, retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const page = await browser.newPage();
const namesSeen = [];
try {
const url = `https://www.reddit.com/r/${feed}.json?limit=${limit}`;
await page.setUserAgent(DEFAULT_USER_AGENT);
await page.goto(getScrapeOpsUrl(url));
success = true;
const jsonText = await page.$eval("pre", pre => pre.textContent);
const resp = JSON.parse(jsonText);
if (resp) {
const children = resp.data.children;
for (const child of children) {
const data = child.data;
const articleData = {
name: data.title,
author: data.author,
permalink: data.permalink,
upvoteRatio: data.upvote_ratio
}
if (!namesSeen.includes(articleData.name)) {
await writeToCsv([articleData], `./${feed}.csv`);
namesSeen.push(articleData.name);
}
}
}
} catch (e) {
console.log(`ERROR: ${e}`);
tries++;
} finally {
await page.close();
}
}
}
async function processPost(browser, postObject, location="us", retries=3) {
let tries = 0;
let success = false;
const r_url = `https://www.reddit.com${postObject.permalink}.json`;
const linkArray = postObject.permalink.split("/");
const fileName = linkArray[linkArray.length-2].replace(" ", "-");
while (tries <= retries && !success) {
const page = await browser.newPage();
const namesSeen = [];
try {
await page.goto(r_url);
const commentData = await page.$eval("pre", pre => pre.textContent);
if (!commentData) {
throw new Error(`No comment data found: ${fileName}`);
}
const comments = JSON.parse(commentData);
const commentsList = comments[1].data.children;
for (const comment of commentsList) {
if (comment.kind !== "more") {
const data = comment.data;
const commentData = {
name: data.author,
body: data.body,
upvotes: data.ups
}
await writeToCsv([commentData], `${fileName}.csv`);
success = true;
}
}
} catch (e) {
await page.screenshot({path: "error.png"});
console.log(`Error fetching comments for ${fileName}`);
tries++;
} finally {
await page.close();
}
}
}
async function processPosts(browser, inputFile, location="us", retries=3) {
const posts = await readCsv(inputFile);
for (const post of posts) {
await processPost(browser, post);
}
}
async function main() {
const FEEDS = ["news"];
const RETRIES = 4;
const BATCH_SIZE = 1;
const AGGREGATED_FEEDS = [];
const browser = await puppeteer.launch();
for (const feed of FEEDS) {
await getPosts(browser, feed, BATCH_SIZE, RETRIES);
AGGREGATED_FEEDS.push(`${feed}.csv`);
}
for (const individualFile of AGGREGATED_FEEDS) {
await processPosts(browser, individualFile, "us", RETRIES);
}
await browser.close();
}
main();
In this code, we find all of our relevant information and create `commentData` objects out of it. We then write these objects to CSV files, just like we did before when we were parsing the articles themselves from the subreddit feed. This code might get you blocked.
Puppeteer is faster than any human could possibly be, and Reddit will notice abnormalities. Once we've added concurrency, we're going to add proxy support to our scraper as well.
Step 4: Adding Concurrency
To add concurrency, we're going to use `Promise.all()` in combination with `async`/`await` support. This makes us exponentially faster, but it also increases our likelihood of getting blocked, so adding proxy support in the next section is super important!
Here is our new `processPosts()`:
async function processPosts(browser, inputFile, concurrencyLimit, location="us", retries=3) {
const posts = await readCsv(inputFile);
while (posts.length > 0) {
const currentBatch = posts.splice(0, concurrencyLimit);
const tasks = currentBatch.map(post => processPost(browser, post, location, retries));
try {
await Promise.all(tasks);
} catch (e) {
console.log("Failed to process batch");
}
}
}
First, we read all of our posts into an array. When reading a large file, this can use up a pretty substantial amount of memory (enough to heavily impact our performance).
While we still have posts, we `splice()` a chunk out of our array and run `processPost()` on each row in that chunk. This shrinks our array as we go and frees more memory for us. Instead of getting bogged down by the end of the operation, our scraper is actually running more efficiently than it was at the beginning.
Once our `posts` array is empty, we can exit the function.
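Here is a tiny illustration of that `splice()` pattern on a plain array (placeholder values):

```javascript
// Sketch: splice() removes a batch from the front, shrinking the array as we go.
const posts = ["a", "b", "c", "d", "e"];
const batch = posts.splice(0, 2);
console.log(batch); // ["a", "b"]
console.log(posts); // ["c", "d", "e"]
```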
Step 5: Bypassing Anti-Bots
We already created our function for proxied URLs earlier. To add a proxy to `processPost()`, we only need to change one line: `await page.goto(getScrapeOpsUrl(r_url), {timeout: 30000});`. We once again pass the result of `getScrapeOpsUrl()` directly into `page.goto()` so our scraper navigates straight to the proxied URL.
Here is our final script that makes full use of both the crawler and scraper.
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = "YOUR-SUPER-SECRET-API-KEY";
async function writeToCsv(data, outputFile) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);
const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))
const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
} catch (e) {
throw new Error("Failed to write to csv");
}
}
async function readCsv(inputFile) {
const results = [];
const parser = fs.createReadStream(inputFile).pipe(csvParse.parse({
columns: true,
delimiter: ",",
trim: true,
skip_empty_lines: true
}));
for await (const record of parser) {
results.push(record);
}
return results;
}
function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}
async function getPosts(browser, feed, limit=10, retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const page = await browser.newPage();
const namesSeen = [];
try {
const url = `https://www.reddit.com/r/${feed}.json?limit=${limit}`;
await page.goto(getScrapeOpsUrl(url));
success = true;
const jsonText = await page.$eval("pre", pre => pre.textContent);
const resp = JSON.parse(jsonText);
if (resp) {
const children = resp.data.children;
for (const child of children) {
const data = child.data;
const articleData = {
name: data.title,
author: data.author,
permalink: data.permalink,
upvoteRatio: data.upvote_ratio
}
if (!namesSeen.includes(articleData.name)) {
try {
await writeToCsv([articleData], `./${feed}.csv`);
namesSeen.push(articleData.name);
} catch {
throw new Error("failed to write csv file:", articleData);
}
}
}
}
} catch (e) {
console.log(`ERROR: ${e}`);
tries++;
} finally {
await page.close();
}
}
}
async function processPost(browser, postObject, location="us", retries=3) {
let tries = 0;
let success = false;
const r_url = `https://www.reddit.com${postObject.permalink}.json`;
const linkArray = postObject.permalink.split("/");
const fileName = linkArray[linkArray.length-2].replace(" ", "-");
while (tries <= retries && !success) {
const page = await browser.newPage();
try {
await page.goto(getScrapeOpsUrl(r_url), {timeout: 30000});
const commentData = await page.$eval("pre", pre => pre.textContent);
if (!commentData) {
throw new Error(`No comment data found: ${fileName}`);
}
const comments = JSON.parse(commentData);
const commentsList = comments[1].data.children;
if (commentsList.length === 0) {
return;
}
for (const comment of commentsList) {
if (comment.kind !== "more") {
const data = comment.data;
const commentData = {
name: data.author,
body: data.body,
upvotes: data.ups
}
await writeToCsv([commentData], `${fileName}.csv`);
success = true;
}
}
} catch (e) {
await page.screenshot({path: `ERROR-${fileName}.png`});
console.log(`Error fetching comments for ${fileName}, retries left: ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
if (!success) {
console.log(`Max retries exceeded for: ${postObject.permalink}`);
return;
}
return;
}
async function processPosts(browser, inputFile, concurrencyLimit, location="us", retries=3) {
const posts = await readCsv(inputFile);
while (posts.length > 0) {
const currentBatch = posts.splice(0, concurrencyLimit);
const tasks = currentBatch.map(post => processPost(browser, post, location, retries));
try {
await Promise.all(tasks);
} catch (e) {
console.log("Failed to process batch");
}
}
}
async function main() {
const FEEDS = ["news"];
const RETRIES = 4;
const BATCH_SIZE = 100;
const concurrencyLimit = 20;
const AGGREGATED_FEEDS = [];
const browser = await puppeteer.launch();
for (const feed of FEEDS) {
await getPosts(browser, feed, BATCH_SIZE, RETRIES);
AGGREGATED_FEEDS.push(`${feed}.csv`);
}
for (const individualFile of AGGREGATED_FEEDS) {
await processPosts(browser, individualFile, concurrencyLimit, "us", RETRIES);
}
await browser.close();
}
main();
Step 6: Production Run
Take a look at the constants in this block:
async function main() {
const FEEDS = ["news"];
const RETRIES = 4;
const BATCH_SIZE = 100;
const concurrencyLimit = 20;
const AGGREGATED_FEEDS = [];
const browser = await puppeteer.launch();
for (const feed of FEEDS) {
await getPosts(browser, feed, BATCH_SIZE, RETRIES);
AGGREGATED_FEEDS.push(`${feed}.csv`);
}
for (const individualFile of AGGREGATED_FEEDS) {
await processPosts(browser, individualFile, concurrencyLimit, "us", RETRIES);
}
await browser.close();
}
To change your output, you can change any of these constants. If you'd like to scrape a different subreddit, just add it to the `FEEDS` array. If you'd like to change the `concurrencyLimit`, feel free to do so.
In testing, we found that 20 pages was optimal; beyond that, we began to see bad results more often. Remember, your `concurrencyLimit` is quite literally the number of pages you have open in the browser.
If you set the limit to 100, Puppeteer will attempt to do all of this work with 100 pages open, and you will probably run into both performance issues and issues with the ScrapeOps API, since you can only have so many concurrent pages open with the ScrapeOps API.
In the production run, we generated 100 CSV files, each full of processed comments and metadata. It took 1 minute and 41 seconds to create our article list and generate all 100 of the reports. That is lightning fast. During peak hours (when Reddit and ScrapeOps are being accessed more often), this same script has taken up to 3 minutes and 40 seconds.
Bear in mind that the speed of our responses depends on both the Reddit server and the ScrapeOps server, so your results will probably vary. If you notice your scraper getting stuck or moving too slowly, decrease your `concurrencyLimit`.
Legal and Ethical Considerations
When scraping, always pay attention to a site's Terms of Service and robots.txt. You can view Reddit's User Agreement on their site, and their robots.txt at https://www.reddit.com/robots.txt. Reddit reserves the right to block, ban, or delete your account if they believe you are responsible for malicious activity.
It's typically legal to collect public data. Public data is data that is not gated behind a login. If you don't have to log in to view it, you are generally alright to scrape the data.
If you have concerns or aren't sure whether it's legal to scrape the data you're after, consult an attorney. Attorneys are best equipped to give you legal advice on the data you're scraping.
Conclusion
You've made it to the end! Go build something! You now know how to extract JSON data using Puppeteer, and you have a solid grasp on how to retrieve specific items from JSON blobs. You understand that scraping Reddit requires a crawler to gather a list of posts as well as an individual post scraper for gathering specific data about each post.
The software used in this article: Puppeteer, csv-writer, and csv-parse.
More NodeJS Web Scraping Guides
Wanna learn more? Here at ScrapeOps, we have loads of resources for you to learn from. Check out our extensive NodeJS Puppeteer Web Scraping Playbook or take a look at some of these other ScrapeOps Guides.