How to Scrape Yelp With Puppeteer
If you've ever looked up restaurant reviews online, you've most likely used Yelp. Business owners depend heavily on Yelp reviews, and reviewers tend to be brutally honest. On top of this, Yelp has been extremely popular since it launched in 2004, which gives us a huge dataset to work with.
In this detailed tutorial, we'll go over how to scrape Yelp with NodeJS Puppeteer.
- TLDR: How to Scrape Yelp
- How To Architect Our Scraper
- Understanding How To Scrape Yelp
- Setting Up Our Yelp Scraper
- Build A Yelp Search Crawler
- Build A Yelp Scraper
- Legal and Ethical Considerations
- Conclusion
- More Puppeteer Web Scraping Guides
Need help scraping the web?
Then check out ScrapeOps, the complete toolkit for web scraping.
TLDR - How to Scrape Yelp
For those of you who don't have time to read, here is a ready-to-go restaurant scraper.
All you need to do is create a new JavaScript project and add a `config.json` with your ScrapeOps API key to the project folder.
Yelp is very good at blocking scrapers, but you don't need to worry about that because this one comes pre-built with support for the ScrapeOps Residential Proxy!
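A minimal `config.json` only needs the `api_key` field that the script reads (the value below is a placeholder for your own ScrapeOps API key):

```json
{
  "api_key": "YOUR-SCRAPEOPS-API-KEY"
}
```

With that file in place, here is the full scraper: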
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;
console.log("api key:", API_KEY);
async function writeToCsv(data, outputFile) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);
const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))
const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
} catch (e) {
throw new Error("Failed to write to csv");
}
}
async function readCsv(inputFile) {
const results = [];
const parser = fs.createReadStream(inputFile).pipe(csvParse.parse({
columns: true,
delimiter: ",",
trim: true,
skip_empty_lines: true
}));
for await (const record of parser) {
results.push(record);
}
return results;
}
function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}
function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location,
residential: true,
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}
async function scrapeSearchResults(browser, keyword, pageNumber, location="us", retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const formattedKeyword = keyword.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.yelp.com/search?find_desc=${formattedKeyword}&find_loc=${location}&start=${pageNumber*10}`;
const proxyUrl = getScrapeOpsUrl(url, location);
await page.goto(proxyUrl);
console.log(`Successfully fetched: ${url}`);
const divCards = await page.$$("div[data-testid='serp-ia-card']");
for (const divCard of divCards) {
const cardText = await page.evaluate(element => element.textContent, divCard);
const img = await divCard.$("img");
const name = await page.evaluate(element => element.getAttribute("alt"), img);
const nameRemoved = cardText.replace(name, "");
let sponsored = isNaN(nameRemoved[0]);
let rank = 0;
if (!sponsored) {
const rankString = nameRemoved.split(".");
rank = Number(rankString[0]);
}
let rating = 0.0;
const hasRating = await divCard.$("div span[data-font-weight='semibold']");
if (hasRating) {
const ratingText = await page.evaluate(element => element.textContent, hasRating);
if (ratingText.length > 0) {
rating = Number(ratingText);
}
}
let reviewCount = "0";
if (cardText.includes("review")) {
reviewCount = cardText.split("(")[1].split(")")[0].split(" ")[0];
}
const aElement = await divCard.$("a");
const link = await page.evaluate(element => element.getAttribute("href"), aElement);
const yelpUrl = `https://www.yelp.com${link.replace("https://proxy.scrapeops.io", "")}`
const searchData = {
name: name,
sponsored: sponsored,
stars: rating,
rank: rank,
review_count: reviewCount,
url: yelpUrl
}
await writeToCsv([searchData], `${keyword.replace(" ", "-")}.csv`);
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}
async function startScrape(keyword, pages, location, concurrencyLimit, retries) {
const pageList = range(0, pages);
const browser = await puppeteer.launch()
while (pageList.length > 0) {
const currentBatch = pageList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, location, retries));
try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();
}
async function processBusiness(browser, row, location, retries = 3) {
const url = row.url;
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const page = await browser.newPage();
try {
await page.goto(getScrapeOpsUrl(url, location), { timeout: 60000 });
const infoSectionElement = await page.$("script[type='application/ld+json']");
const infoText = await page.evaluate(element => element.textContent, infoSectionElement);
const infoSection = JSON.parse(infoText);
const listElements = infoSection.itemListElement;
let anonCount = 1;
for (const element of listElements) {
let name = element.author.name;
if (name === "Unknown User") {
name = `${name}${anonCount}`;
anonCount++;
}
const familyFriendly = element.isFamilyFriendly;
const date = element.uploadDate;
const position = element.position;
const reviewData = {
name: name,
family_friendly: familyFriendly,
date: date,
position: position
}
await writeToCsv([reviewData], `${row.name.replace(" ", "-")}.csv`);
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left: ${retries-tries}, url: ${getScrapeOpsUrl(url)}`);
tries++;
} finally {
await page.close();
}
}
}
async function processResults(csvFile, location, concurrencyLimit, retries) {
const businesses = await readCsv(csvFile);
const browser = await puppeteer.launch();
while (businesses.length > 0) {
const currentBatch = businesses.splice(0, concurrencyLimit);
const tasks = currentBatch.map(business => processBusiness(browser, business, location, retries));
try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();
}
async function main() {
const keywords = ["restaurants"];
const concurrencyLimit = 4;
const pages = 5;
const location = "uk";
const retries = 3;
const aggregateFiles = [];
for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, pages, location, concurrencyLimit, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
console.log("Starting scrape");
for (const file of aggregateFiles) {
await processResults(file, location, concurrencyLimit, retries);
}
console.log("Scrape complete");
}
main();
If you'd like to tweak this scraper, feel free to change any of the following constants:
- `concurrencyLimit`: Limits the number of simultaneous tasks (browser pages) that can run concurrently.
- `pages`: Determines how many pages of search results to scrape for each keyword.
- `location`: Specifies the geographic location for the search.
- `retries`: Sets the number of retry attempts if a scraping task fails due to an error (e.g., network issues or proxy blocks).
You can change the `keywords`, but be cautious when doing so. Yelp uses different CSS and page layouts for different types of businesses.
If you add `online bank` to the keywords, it will break the scraper. If you decide to change the `keywords`, make sure you inspect the page for your search and adjust the parsing function to fit the new layout.
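If you want to adjust a run, these are the only lines you need to touch inside `main()`. For example, a hypothetical run that crawls three pages of US results with five concurrent pages might use values like these:

```javascript
// Illustrative values only -- tune these to suit your own run.
const keywords = ["restaurants"]; // changing this may require new parsing logic
const concurrencyLimit = 5;       // simultaneous pages per batch
const pages = 3;                  // pages of search results per keyword
const location = "us";            // country used for find_loc and proxy routing
const retries = 3;                // attempts per page before giving up
```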
How To Architect Our Yelp Scraper
When we scrape Yelp, there are quite a few steps involved in the process, and it's actually a two-part project.
In part one, we build a crawler. In part two, we'll build a scraper.
The job of a crawler is relatively straightforward:
- Perform a search and parse results
- Paginate the search so that we can control our batches of results.
- Store the data that was extracted when parsing the page.
- Run steps 1 through 3 concurrently so we can crawl multiple result pages at the same time.
- Proxy Integration with the ScrapeOps API so that we don't need to worry about getting blocked.
In part 2 of this project, our scraper will need to:
- Read the data we stored in part 1
- Look up the URL of each business from the CSV file and parse its page.
- Store the parsed data from each business.
- Run steps 2 and 3 for each business concurrently.
- Once again, use proxy integration to get around any potential roadblocks that may be in our way.
Understanding How To Scrape Yelp
Before we write our scraping code, we need to understand exactly how to get our information and how to extract it from the page.
We'll go through these next few steps in order to plan out how to build our scraper.
Step 1: How To Request Yelp Pages
Search URLs on Yelp look like this:
https://www.yelp.com/search?find_desc={formatted_keyword}&find_loc={location}
If we want to look up `restaurants` in the `us`, we would use the description parameter `find_desc=restaurants` and the location parameter `find_loc=us`.
So our complete URL would be:
https://www.yelp.com/search?find_desc=restaurants&find_loc=us
You can see how it looks in the browser below.
All Yelp business pages contain `/biz/` followed by the name of the business. These URLs aren't much of a concern because we pull them straight from the search results.
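As a quick sketch of how those links are used (the business slug here is made up for illustration), the relative `href` pulled from a search card simply gets the Yelp domain prepended to it:

```javascript
// "some-restaurant-london" is a made-up slug, purely for illustration.
const link = "/biz/some-restaurant-london";       // href pulled from a search card
const yelpUrl = `https://www.yelp.com${link}`;    // full business page URL
// -> https://www.yelp.com/biz/some-restaurant-london
```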
You can take a look at a Yelp business page below.
Step 2: How To Extract Data From Yelp Results and Pages
When we extract data from Yelp, we use a combination of strategies. To parse a results page, we pull each element directly from the HTML.
When we parse a business page, we can get our data from a JSON blob located within a `script` tag.
Yelp's restaurant result cards all carry a `data-testid` of `serp-ia-card`. You can see it in the image below.
Here is the JSON from the business page.
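As a rough sketch of the blob's shape (field names are the ones our parsing code uses later; everything else in the real blob is omitted here), it looks something like this:

```javascript
// A trimmed-down sketch of the ld+json blob on a business page.
// Only the fields our scraper reads are shown; real blobs contain much more.
const infoSection = {
  itemListElement: [
    {
      position: 1,                     // where the review appears on the page
      author: { name: "Jane D." },     // reviewer name ("Unknown User" when anonymous)
      uploadDate: "2024-01-15",        // date the review was posted
      isFamilyFriendly: true           // whether the business is family friendly
    }
  ]
};
```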
Step 3: How To Control Pagination
Without pagination, we won't get very far. With pagination, we can fetch our results in batches. To paginate our search URL, we add the `start` parameter. This parameter doesn't take a page number, it takes a result offset.
Yelp gives us 10 results per page:
- Page 0 (`start=0`) gives us results 1 through 10.
- Page 1 (`start=10`) gives us results 11 through 20.
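Here is a minimal sketch of how page numbers map onto the `start` offset (the keyword and location are just example values):

```javascript
// Build a paginated Yelp search URL from a zero-based page number.
// "restaurants" and "us" are example values, not requirements.
function buildSearchUrl(keyword, location, pageNumber) {
  const formattedKeyword = keyword.replace(" ", "+");
  return `https://www.yelp.com/search?find_desc=${formattedKeyword}&find_loc=${location}&start=${pageNumber * 10}`;
}

console.log(buildSearchUrl("restaurants", "us", 0)); // start=0  -> results 1 through 10
console.log(buildSearchUrl("restaurants", "us", 1)); // start=10 -> results 11 through 20
```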
Step 4: Geolocated Data
For geolocation, we'll use a combination of the ScrapeOps API and the `find_loc` parameter.
If we pass `us` to the ScrapeOps API, we get routed through a server in the US. If we pass `us` into the `find_loc` parameter, Yelp gives us restaurants in the US.
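Here is a minimal sketch of how a single `location` value feeds both parameters; it mirrors the `getScrapeOpsUrl()` helper used throughout this guide:

```javascript
// One location value drives both Yelp's find_loc and the proxy's country routing.
// Assumes API_KEY has already been loaded from config.json as shown earlier.
const location = "us";

const yelpUrl = `https://www.yelp.com/search?find_desc=restaurants&find_loc=${location}`;

const params = new URLSearchParams({
  api_key: API_KEY,
  url: yelpUrl,
  country: location,   // route the request through a server in this country
  residential: true
});
const proxiedUrl = `https://proxy.scrapeops.io/v1/?${params.toString()}`;
```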
Setting Up Our Yelp Scraper Project
Let's get started. You can run the following commands to get setup.
Create a New Project Folder
mkdir yelp-scraper
cd yelp-scraper
Create a New JavaScript Project
npm init --y
Install Our Dependencies
npm install puppeteer
npm install csv-writer
npm install csv-parse
There's no need to install `fs`; it's built into NodeJS.
Build A Yelp Search Crawler
As mentioned before, we need to build a crawler first. Our design will include parsing, pagination, data storage, concurrency and proxy integration.
In the coming sections, we'll go through step by step and add these into our design.
Step 1: Create Simple Search Data Parser
For starters, we need to be able to parse our data. The code below holds our basic structure: logging, retry logic, reading the API key from a file, and parsing a page.
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;
console.log("api key:", API_KEY);
async function scrapeSearchResults(browser, keyword, location="us", retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const formattedKeyword = keyword.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.yelp.com/search?find_desc=${formattedKeyword}&find_loc=${location}`;
await page.goto(url);
console.log(`Successfully fetched: ${url}`);
const divCards = await page.$$("div[data-testid='serp-ia-card']");
for (const divCard of divCards) {
const cardText = await page.evaluate(element => element.textContent, divCard);
const img = await divCard.$("img");
const name = await page.evaluate(element => element.getAttribute("alt"), img);
const nameRemoved = cardText.replace(name, "");
let sponsored = isNaN(nameRemoved[0]);
let rank = 0;
if (!sponsored) {
const rankString = nameRemoved.split(".");
rank = Number(rankString[0]);
}
let rating = 0.0;
const hasRating = await divCard.$("div span[data-font-weight='semibold']");
if (hasRating) {
const ratingText = await page.evaluate(element => element.textContent, hasRating);
if (ratingText.length > 0) {
rating = Number(ratingText);
}
}
let reviewCount = "0";
if (cardText.includes("review")) {
reviewCount = cardText.split("(")[1].split(")")[0].split(" ")[0];
}
const aElement = await divCard.$("a");
const link = await page.evaluate(element => element.getAttribute("href"), aElement);
const yelpUrl = `https://www.yelp.com${link.replace("https://proxy.scrapeops.io", "")}`
const searchData = {
name: name,
sponsored: sponsored,
stars: rating,
rank: rank,
review_count: reviewCount,
url: yelpUrl
}
console.log(searchData);
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}
async function startScrape(keyword, location, retries) {
const browser = await puppeteer.launch()
await scrapeSearchResults(browser, keyword, location, retries);
await browser.close();
}
async function main() {
const keywords = ["restaurants"];
const concurrencyLimit = 4;
const pages = 1;
const location = "uk";
const retries = 3;
const aggregateFiles = [];
for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, location, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
}
main();
Aside from the overall program structure, there are a few things to pay attention to in the parsing function, `scrapeSearchResults()`, in particular.
- First, we find all of our business cards on the page using their CSS selector:
await page.$$("div[data-testid='serp-ia-card']")
- For each card, we do the following:
  - Retrieve its text: `await page.evaluate(element => element.textContent, divCard)`.
  - Find its image: `await divCard.$("img")`.
  - Pull the name of the business from the image's `alt` attribute: `await page.evaluate(element => element.getAttribute("alt"), img)`.
  - `cardText.replace(name, "")` gives us the card text without the business name.
  - We check whether the card is a sponsored ad with `let sponsored = isNaN(nameRemoved[0])`. All real results start with a ranking number, so if the card text doesn't start with a number, the card is a sponsored ad.
  - If the card is not sponsored, we split the string at `.` and convert the first piece to a number, `Number(rankString[0])`. This gives us the ranking of the result.
  - Then we check for the rating's CSS selector, `div span[data-font-weight='semibold']`, and if a rating is present, we pull it from the card.
  - We use some string splitting to pull the review count, similar to how we pulled the rank number.
  - Finally, we grab the `aElement` and extract its `href` to get the Yelp page for the business.
Step 2: Add Pagination
Adding pagination is really simple: we add `pageNumber * 10` to the `start` parameter of our URL.
Take a look at the code below. We added a `range()` function and tweaked the `startScrape()` function to support scraping multiple pages.
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;
console.log("api key:", API_KEY);
function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}
async function scrapeSearchResults(browser, keyword, pageNumber, location="us", retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const formattedKeyword = keyword.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.yelp.com/search?find_desc=${formattedKeyword}&find_loc=${location}&start=${pageNumber*10}`;
await page.goto(url);
console.log(`Successfully fetched: ${url}`);
const divCards = await page.$$("div[data-testid='serp-ia-card']");
for (const divCard of divCards) {
const cardText = await page.evaluate(element => element.textContent, divCard);
const img = await divCard.$("img");
const name = await page.evaluate(element => element.getAttribute("alt"), img);
const nameRemoved = cardText.replace(name, "");
let sponsored = isNaN(nameRemoved[0]);
let rank = 0;
if (!sponsored) {
const rankString = nameRemoved.split(".");
rank = Number(rankString[0]);
}
let rating = 0.0;
const hasRating = await divCard.$("div span[data-font-weight='semibold']");
if (hasRating) {
const ratingText = await page.evaluate(element => element.textContent, hasRating);
if (ratingText.length > 0) {
rating = Number(ratingText);
}
}
let reviewCount = "0";
if (cardText.includes("review")) {
reviewCount = cardText.split("(")[1].split(")")[0].split(" ")[0];
}
const aElement = await divCard.$("a");
const link = await page.evaluate(element => element.getAttribute("href"), aElement);
const yelpUrl = `https://www.yelp.com${link.replace("https://proxy.scrapeops.io", "")}`
const searchData = {
name: name,
sponsored: sponsored,
stars: rating,
rank: rank,
review_count: reviewCount,
url: yelpUrl
}
console.log(searchData);
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}
async function startScrape(keyword, pages, location, retries) {
const pageList = range(0, pages);
const browser = await puppeteer.launch()
for (const page of pageList) {
await scrapeSearchResults(browser, keyword, page, location, retries);
}
await browser.close();
}
async function main() {
const keywords = ["restaurants"];
const concurrencyLimit = 4;
const pages = 1;
const location = "uk";
const retries = 3;
const aggregateFiles = [];
for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, pages, location, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
}
main();
We can now control our batching with the modified URL: https://www.yelp.com/search?find_desc=${formattedKeyword}&find_loc=${location}&start=${pageNumber*10}.
Step 3: Storing the Scraped Data
Scraping is pointless if we don't store the data, so we'll write a function that saves to a CSV file. It's very important that this function opens the file in `append` mode if the file already exists.
If the file doesn't exist, the function needs to create it. It also takes an array of JSON objects, so whenever we pass something into this function, we need to pass it in as an array.
Here is `writeToCsv()`.
async function writeToCsv(data, outputFile) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);
const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))
const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
} catch (e) {
throw new Error("Failed to write to csv");
}
}
Here is our full code up to this point.
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;
console.log("api key:", API_KEY);
async function writeToCsv(data, outputFile) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);
const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))
const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
} catch (e) {
throw new Error("Failed to write to csv");
}
}
function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}
async function scrapeSearchResults(browser, keyword, pageNumber, location="us", retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const formattedKeyword = keyword.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.yelp.com/search?find_desc=${formattedKeyword}&find_loc=${location}&start=${pageNumber*10}`;
await page.goto(url);
console.log(`Successfully fetched: ${url}`);
const divCards = await page.$$("div[data-testid='serp-ia-card']");
for (const divCard of divCards) {
const cardText = await page.evaluate(element => element.textContent, divCard);
const img = await divCard.$("img");
const name = await page.evaluate(element => element.getAttribute("alt"), img);
const nameRemoved = cardText.replace(name, "");
let sponsored = isNaN(nameRemoved[0]);
let rank = 0;
if (!sponsored) {
const rankString = nameRemoved.split(".");
rank = Number(rankString[0]);
}
let rating = 0.0;
const hasRating = await divCard.$("div span[data-font-weight='semibold']");
if (hasRating) {
const ratingText = await page.evaluate(element => element.textContent, hasRating);
if (ratingText.length > 0) {
rating = Number(ratingText);
}
}
let reviewCount = "0";
if (cardText.includes("review")) {
reviewCount = cardText.split("(")[1].split(")")[0].split(" ")[0];
}
const aElement = await divCard.$("a");
const link = await page.evaluate(element => element.getAttribute("href"), aElement);
const yelpUrl = `https://www.yelp.com${link.replace("https://proxy.scrapeops.io", "")}`
const searchData = {
name: name,
sponsored: sponsored,
stars: rating,
rank: rank,
review_count: reviewCount,
url: yelpUrl
}
await writeToCsv([searchData], `${keyword.replace(" ", "-")}.csv`);
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}
async function startScrape(keyword, pages, location, retries) {
const pageList = range(0, pages);
const browser = await puppeteer.launch()
for (const page of pageList) {
await scrapeSearchResults(browser, keyword, page, location, retries);
}
await browser.close();
}
async function main() {
const keywords = ["restaurants"];
const concurrencyLimit = 4;
const pages = 1;
const location = "uk";
const retries = 3;
const aggregateFiles = [];
for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, pages, location, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
}
main();
Instead of printing each result to the console, we now write it to a CSV file.
Step 4: Adding Concurrency
Time to add concurrency. Our crawler can now crawl multiple pages, but it needs to crawl them at the same time. To do this, we'll change `startScrape()`.
Instead of a `for` loop, we shrink our `pageList` array by splicing off up to `concurrencyLimit` pages at a time and running `scrapeSearchResults()` on each page in the batch.
We then `await` all of these tasks and `splice()` again, repeating until the array shrinks down to nothing.
Here is our refactored function.
async function startScrape(keyword, pages, location, concurrencyLimit, retries) {
const pageList = range(0, pages);
const browser = await puppeteer.launch()
while (pageList.length > 0) {
const currentBatch = pageList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, location, retries));
try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();
}
Here is our fully updated code.
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;
console.log("api key:", API_KEY);
async function writeToCsv(data, outputFile) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);
const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))
const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
} catch (e) {
throw new Error("Failed to write to csv");
}
}
function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}
async function scrapeSearchResults(browser, keyword, pageNumber, location="us", retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const formattedKeyword = keyword.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.yelp.com/search?find_desc=${formattedKeyword}&find_loc=${location}&start=${pageNumber*10}`;
await page.goto(url);
console.log(`Successfully fetched: ${url}`);
const divCards = await page.$$("div[data-testid='serp-ia-card']");
for (const divCard of divCards) {
const cardText = await page.evaluate(element => element.textContent, divCard);
const img = await divCard.$("img");
const name = await page.evaluate(element => element.getAttribute("alt"), img);
const nameRemoved = cardText.replace(name, "");
let sponsored = isNaN(nameRemoved[0]);
let rank = 0;
if (!sponsored) {
const rankString = nameRemoved.split(".");
rank = Number(rankString[0]);
}
let rating = 0.0;
const hasRating = await divCard.$("div span[data-font-weight='semibold']");
if (hasRating) {
const ratingText = await page.evaluate(element => element.textContent, hasRating);
if (ratingText.length > 0) {
rating = Number(ratingText);
}
}
let reviewCount = "0";
if (cardText.includes("review")) {
reviewCount = cardText.split("(")[1].split(")")[0].split(" ")[0];
}
const aElement = await divCard.$("a");
const link = await page.evaluate(element => element.getAttribute("href"), aElement);
const yelpUrl = `https://www.yelp.com${link.replace("https://proxy.scrapeops.io", "")}`
const searchData = {
name: name,
sponsored: sponsored,
stars: rating,
rank: rank,
review_count: reviewCount,
url: yelpUrl
}
await writeToCsv([searchData], `${keyword.replace(" ", "-")}.csv`);
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}
async function startScrape(keyword, pages, location, concurrencyLimit, retries) {
const pageList = range(0, pages);
const browser = await puppeteer.launch()
while (pageList.length > 0) {
const currentBatch = pageList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, location, retries));
try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();
}
async function main() {
const keywords = ["restaurants"];
const concurrencyLimit = 4;
const pages = 1;
const location = "uk";
const retries = 3;
const aggregateFiles = [];
for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, pages, location, concurrencyLimit, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
}
main();
Some key points you should notice from this code:
- `const currentBatch = pageList.splice(0, concurrencyLimit);` pulls the next batch of pages, and `currentBatch.map(...)` turns that batch into a list of tasks to run.
- We wait for the batch of tasks to finish with `await Promise.all(tasks);`.
- We repeat this process until `pageList` is empty.
Adding concurrency greatly increases the efficiency of our crawler.
Step 5: Bypassing Anti-Bots
Anti-bots are the arch enemy of many developers. While they're designed to block malicious traffic, they tend to block non-malicious scrapers as well.
Luckily, we have exactly the right tools to bypass them. The ScrapeOps Residential Proxy does wonders when trying to get results from Yelp.
The function below takes a URL and a location and returns a proxied URL. This holds the key to everything.
function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location,
residential: true,
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}
- `api_key`: holds our ScrapeOps API key.
- `url`: the URL of the website we'd like to scrape.
- `country`: the country we'd like to be routed through.
- `residential`: tells ScrapeOps to use a residential IP address. This greatly increases our chances of getting through anti-bots because we're not showing up from a data center IP.
Here is our production ready code.
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;
console.log("api key:", API_KEY);
async function writeToCsv(data, outputFile) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);
const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))
const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
} catch (e) {
throw new Error("Failed to write to csv");
}
}
function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}
function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location,
residential: true,
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}
async function scrapeSearchResults(browser, keyword, pageNumber, location="us", retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const formattedKeyword = keyword.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.yelp.com/search?find_desc=${formattedKeyword}&find_loc=${location}&start=${pageNumber*10}`;
const proxyUrl = getScrapeOpsUrl(url, location);
await page.goto(proxyUrl);
console.log(`Successfully fetched: ${url}`);
const divCards = await page.$$("div[data-testid='serp-ia-card']");
for (const divCard of divCards) {
const cardText = await page.evaluate(element => element.textContent, divCard);
const img = await divCard.$("img");
const name = await page.evaluate(element => element.getAttribute("alt"), img);
const nameRemoved = cardText.replace(name, "");
let sponsored = isNaN(nameRemoved[0]);
let rank = 0;
if (!sponsored) {
const rankString = nameRemoved.split(".");
rank = Number(rankString[0]);
}
let rating = 0.0;
const hasRating = await divCard.$("div span[data-font-weight='semibold']");
if (hasRating) {
const ratingText = await page.evaluate(element => element.textContent, hasRating);
if (ratingText.length > 0) {
rating = Number(ratingText);
}
}
let reviewCount = "0";
if (cardText.includes("review")) {
reviewCount = cardText.split("(")[1].split(")")[0].split(" ")[0];
}
const aElement = await divCard.$("a");
const link = await page.evaluate(element => element.getAttribute("href"), aElement);
const yelpUrl = `https://www.yelp.com${link.replace("https://proxy.scrapeops.io", "")}`
const searchData = {
name: name,
sponsored: sponsored,
stars: rating,
rank: rank,
review_count: reviewCount,
url: yelpUrl
}
await writeToCsv([searchData], `${keyword.replace(" ", "-")}.csv`);
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}
async function startScrape(keyword, pages, location, concurrencyLimit, retries) {
const pageList = range(0, pages);
const browser = await puppeteer.launch()
while (pageList.length > 0) {
const currentBatch = pageList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, location, retries));
try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();
}
async function main() {
const keywords = ["restaurants"];
const concurrencyLimit = 4;
const pages = 1;
const location = "uk";
const retries = 3;
const aggregateFiles = [];
for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, pages, location, concurrencyLimit, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
}
main();
Step 6: Production Run
Let's test this thing out. I'll run `main()` on 5 pages of results.
async function main() {
const keywords = ["restaurants"];
const concurrencyLimit = 4;
const pages = 5;
const location = "uk";
const retries = 3;
const aggregateFiles = [];
for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, pages, location, concurrencyLimit, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
}
Here are the results.
We finished 5 pages of results in roughly 27.67 seconds. This comes out to 5.54 seconds per page.
Build A Yelp Scraper
Time for part 2. Now, we'll build a scraper that reads our CSV file and then scrapes each business from the CSV file. Here are the steps laid out:
- Read the CSV file.
- Lookup and parse each business from the file.
- Save the parsed data to a CSV file.
- Run steps 2 and 3 concurrently for each business.
- Integrate with a proxy to get past any roadblocks.
Step 1: Create Simple Business Data Parser
Just like before, we'll get started with a parsing function. It's very similar to the parsing function from before, but this time we find a single `script` tag and extract JSON from it.
async function processBusiness(browser, row, location, retries = 3) {
const url = row.url;
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const page = await browser.newPage();
try {
await page.goto(url, { timeout: 60000 });
const infoSectionElement = await page.$("script[type='application/ld+json']");
const infoText = await page.evaluate(element => element.textContent, infoSectionElement);
const infoSection = JSON.parse(infoText);
const listElements = infoSection.itemListElement;
let anonCount = 1;
for (const element of listElements) {
let name = element.author.name;
if (name === "Unknown User") {
name = `${name}${anonCount}`;
anonCount++;
}
const familyFriendly = element.isFamilyFriendly;
const date = element.uploadDate;
const position = element.position;
const reviewData = {
name: name,
family_friendly: familyFriendly,
date: date,
position: position
}
console.log(reviewData);
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left: ${retries-tries}, url: ${getScrapeOpsUrl(url)}`);
tries++;
} finally {
await page.close();
}
}
}
- First, we find the `script` tag that holds our JSON: `await page.$("script[type='application/ld+json']");`
- Once we've got our JSON, we pull the following from each review:
  - `name`: the name of the reviewer (`element.author.name`).
  - `familyFriendly`: whether or not the restaurant is family friendly.
  - `date`: the date the review was uploaded.
  - `position`: the position at which the review shows up on the page.
Step 2: Loading URLs To Scrape
Before we parse our data, we need to be able to read our CSV file, and we also need to be able to run our parsing function on multiple rows.
Here is our `readCsv()` function.
async function readCsv(inputFile) {
const results = [];
const parser = fs.createReadStream(inputFile).pipe(csvParse.parse({
columns: true,
delimiter: ",",
trim: true,
skip_empty_lines: true
}));
for await (const record of parser) {
results.push(record);
}
return results;
}
Here is our `processResults()` function.
async function processResults(csvFile, location, retries) {
const businesses = await readCsv(csvFile);
const browser = await puppeteer.launch();
for (const business of businesses) {
await processBusiness(browser, business, location, retries)
}
await browser.close();
}
Here is our fully updated code.
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;
console.log("api key:", API_KEY);
async function writeToCsv(data, outputFile) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);
const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))
const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
} catch (e) {
throw new Error("Failed to write to csv");
}
}
async function readCsv(inputFile) {
const results = [];
const parser = fs.createReadStream(inputFile).pipe(csvParse.parse({
columns: true,
delimiter: ",",
trim: true,
skip_empty_lines: true
}));
for await (const record of parser) {
results.push(record);
}
return results;
}
function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}
function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location,
residential: true,
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}
async function scrapeSearchResults(browser, keyword, pageNumber, location="us", retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const formattedKeyword = keyword.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.yelp.com/search?find_desc=${formattedKeyword}&find_loc=${location}&start=${pageNumber*10}`;
const proxyUrl = getScrapeOpsUrl(url, location);
await page.goto(proxyUrl);
console.log(`Successfully fetched: ${url}`);
const divCards = await page.$$("div[data-testid='serp-ia-card']");
for (const divCard of divCards) {
const cardText = await page.evaluate(element => element.textContent, divCard);
const img = await divCard.$("img");
const name = await page.evaluate(element => element.getAttribute("alt"), img);
const nameRemoved = cardText.replace(name, "");
let sponsored = isNaN(nameRemoved[0]);
let rank = 0;
if (!sponsored) {
const rankString = nameRemoved.split(".");
rank = Number(rankString[0]);
}
let rating = 0.0;
const hasRating = await divCard.$("div span[data-font-weight='semibold']");
if (hasRating) {
const ratingText = await page.evaluate(element => element.textContent, hasRating);
if (ratingText.length > 0) {
rating = Number(ratingText);
}
}
let reviewCount = "0";
if (cardText.includes("review")) {
reviewCount = cardText.split("(")[1].split(")")[0].split(" ")[0];
}
const aElement = await divCard.$("a");
const link = await page.evaluate(element => element.getAttribute("href"), aElement);
const yelpUrl = `https://www.yelp.com${link.replace("https://proxy.scrapeops.io", "")}`
const searchData = {
name: name,
sponsored: sponsored,
stars: rating,
rank: rank,
review_count: reviewCount,
url: yelpUrl
}
await writeToCsv([searchData], `${keyword.replace(" ", "-")}.csv`);
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}
async function startScrape(keyword, pages, location, concurrencyLimit, retries) {
const pageList = range(0, pages);
const browser = await puppeteer.launch()
while (pageList.length > 0) {
const currentBatch = pageList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, location, retries));
try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();
}
async function processBusiness(browser, row, location, retries = 3) {
const url = row.url;
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const page = await browser.newPage();
try {
await page.goto(url, { timeout: 60000 });
const infoSectionElement = await page.$("script[type='application/ld+json']");
const infoText = await page.evaluate(element => element.textContent, infoSectionElement);
const infoSection = JSON.parse(infoText);
const listElements = infoSection.itemListElement;
let anonCount = 1;
for (const element of listElements) {
let name = element.author.name;
if (name === "Unknown User") {
name = `${name}${anonCount}`;
anonCount++;
}
const familyFriendly = element.isFamilyFriendly;
const date = element.uploadDate;
const position = element.position;
const reviewData = {
name: name,
family_friendly: familyFriendly,
date: date,
position: position
}
console.log(reviewData);
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left: ${retries-tries}, url: ${getScrapeOpsUrl(url)}`);
tries++;
} finally {
await page.close();
}
}
}
async function processResults(csvFile, location, retries) {
const businesses = await readCsv(csvFile);
const browser = await puppeteer.launch();
for (const business of businesses) {
await processBusiness(browser, business, location, retries)
}
await browser.close();
}
async function main() {
const keywords = ["restaurants"];
const concurrencyLimit = 4;
const pages = 5;
const location = "uk";
const retries = 3;
const aggregateFiles = [];
for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, pages, location, concurrencyLimit, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
console.log("Starting scrape");
for (const file of aggregateFiles) {
await processResults(file, location, retries);
}
console.log("Scrape complete");
}
main();
- `readCsv()` reads our CSV file into an array of JSON objects.
- `processResults()` runs `processBusiness()` on every single row from the CSV file.
Step 3: Storing the Scraped Data
Storing our data will be really simple. We already have a function that writes to CSV; all we need to do is replace the `console.log(reviewData)` call in `processBusiness()` with one line.
await writeToCsv([reviewData], `${row.name.replace(" ", "-")}.csv`);
Here is our full script at the moment.
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;
console.log("api key:", API_KEY);
async function writeToCsv(data, outputFile) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);
const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))
const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
} catch (e) {
throw new Error("Failed to write to csv");
}
}
async function readCsv(inputFile) {
const results = [];
const parser = fs.createReadStream(inputFile).pipe(csvParse.parse({
columns: true,
delimiter: ",",
trim: true,
skip_empty_lines: true
}));
for await (const record of parser) {
results.push(record);
}
return results;
}
function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}
function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location,
residential: true,
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}
async function scrapeSearchResults(browser, keyword, pageNumber, location="us", retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const formattedKeyword = keyword.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.yelp.com/search?find_desc=${formattedKeyword}&find_loc=${location}&start=${pageNumber*10}`;
const proxyUrl = getScrapeOpsUrl(url, location);
await page.goto(proxyUrl);
console.log(`Successfully fetched: ${url}`);
const divCards = await page.$$("div[data-testid='serp-ia-card']");
for (const divCard of divCards) {
const cardText = await page.evaluate(element => element.textContent, divCard);
const img = await divCard.$("img");
const name = await page.evaluate(element => element.getAttribute("alt"), img);
const nameRemoved = cardText.replace(name, "");
let sponsored = isNaN(nameRemoved[0]);
let rank = 0;
if (!sponsored) {
const rankString = nameRemoved.split(".");
rank = Number(rankString[0]);
}
let rating = 0.0;
const hasRating = await divCard.$("div span[data-font-weight='semibold']");
if (hasRating) {
const ratingText = await page.evaluate(element => element.textContent, hasRating);
if (ratingText.length > 0) {
rating = Number(ratingText);
}
}
let reviewCount = "0";
if (cardText.includes("review")) {
reviewCount = cardText.split("(")[1].split(")")[0].split(" ")[0];
}
const aElement = await divCard.$("a");
const link = await page.evaluate(element => element.getAttribute("href"), aElement);
const yelpUrl = `https://www.yelp.com${link.replace("https://proxy.scrapeops.io", "")}`
const searchData = {
name: name,
sponsored: sponsored,
stars: rating,
rank: rank,
review_count: reviewCount,
url: yelpUrl
}
await writeToCsv([searchData], `${keyword.replace(" ", "-")}.csv`);
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}
async function startScrape(keyword, pages, location, concurrencyLimit, retries) {
const pageList = range(0, pages);
const browser = await puppeteer.launch()
while (pageList.length > 0) {
const currentBatch = pageList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, location, retries));
try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();
}
async function processBusiness(browser, row, location, retries = 3) {
const url = row.url;
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const page = await browser.newPage();
try {
await page.goto(url, { timeout: 60000 });
const infoSectionElement = await page.$("script[type='application/ld+json']");
const infoText = await page.evaluate(element => element.textContent, infoSectionElement);
const infoSection = JSON.parse(infoText);
const listElements = infoSection.itemListElement;
let anonCount = 1;
for (const element of listElements) {
let name = element.author.name;
if (name === "Unknown User") {
name = `${name}${anonCount}`;
anonCount++;
}
const familyFriendly = element.isFamilyFriendly;
const date = element.uploadDate;
const position = element.position;
const reviewData = {
name: name,
family_friendly: familyFriendly,
date: date,
position: position
}
await writeToCsv([reviewData], `${row.name.replace(" ", "-")}.csv`);
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left: ${retries-tries}, url: ${getScrapeOpsUrl(url)}`);
tries++;
} finally {
await page.close();
}
}
}
async function processResults(csvFile, location, retries) {
const businesses = await readCsv(csvFile);
const browser = await puppeteer.launch();
for (const business of businesses) {
await processBusiness(browser, business, location, retries)
}
await browser.close();
}
async function main() {
const keywords = ["restaurants"];
const concurrencyLimit = 4;
const pages = 5;
const location = "uk";
const retries = 3;
const aggregateFiles = [];
for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, pages, location, concurrencyLimit, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
console.log("Starting scrape");
for (const file of aggregateFiles) {
await processResults(file, location, retries);
}
console.log("Scrape complete");
}
main();
Step 4: Adding Concurrency
To add concurrency, all we need to do is tweak our `processResults()` function the same way we added concurrency earlier.
async function processResults(csvFile, location, concurrencyLimit, retries) {
const businesses = await readCsv(csvFile);
const browser = await puppeteer.launch();
while (businesses.length > 0) {
const currentBatch = businesses.splice(0, concurrencyLimit);
const tasks = currentBatch.map(business => processBusiness(browser, business, location, retries));
try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();
}
Step 5: Bypassing Anti-Bots
To bypass anti-bots, all we need to do is change one more line. We already have `getScrapeOpsUrl()`; we simply need to use it in one more spot.
await page.goto(getScrapeOpsUrl(url, location), { timeout: 60000 });
Here is our production ready script.
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;
console.log("api key:", API_KEY);
async function writeToCsv(data, outputFile) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);
const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))
const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
} catch (e) {
throw new Error("Failed to write to csv");
}
}
async function readCsv(inputFile) {
const results = [];
const parser = fs.createReadStream(inputFile).pipe(csvParse.parse({
columns: true,
delimiter: ",",
trim: true,
skip_empty_lines: true
}));
for await (const record of parser) {
results.push(record);
}
return results;
}
function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}
function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location,
residential: true,
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}
async function scrapeSearchResults(browser, keyword, pageNumber, location="us", retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const formattedKeyword = keyword.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.yelp.com/search?find_desc=${formattedKeyword}&find_loc=${location}&start=${pageNumber*10}`;
const proxyUrl = getScrapeOpsUrl(url, location);
await page.goto(proxyUrl);
console.log(`Successfully fetched: ${url}`);
const divCards = await page.$$("div[data-testid='serp-ia-card']");
for (const divCard of divCards) {
const cardText = await page.evaluate(element => element.textContent, divCard);
const img = await divCard.$("img");
const name = await page.evaluate(element => element.getAttribute("alt"), img);
const nameRemoved = cardText.replace(name, "");
let sponsored = isNaN(nameRemoved[0]);
let rank = 0;
if (!sponsored) {
const rankString = nameRemoved.split(".");
rank = Number(rankString[0]);
}
let rating = 0.0;
const hasRating = await divCard.$("div span[data-font-weight='semibold']");
if (hasRating) {
const ratingText = await page.evaluate(element => element.textContent, hasRating);
if (ratingText.length > 0) {
rating = Number(ratingText);
}
}
let reviewCount = "0";
if (cardText.includes("review")) {
reviewCount = cardText.split("(")[1].split(")")[0].split(" ")[0];
}
const aElement = await divCard.$("a");
const link = await page.evaluate(element => element.getAttribute("href"), aElement);
const yelpUrl = `https://www.yelp.com${link.replace("https://proxy.scrapeops.io", "")}`
const searchData = {
name: name,
sponsored: sponsored,
stars: rating,
rank: rank,
review_count: reviewCount,
url: yelpUrl
}
await writeToCsv([searchData], `${keyword.replace(" ", "-")}.csv`);
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}
async function startScrape(keyword, pages, location, concurrencyLimit, retries) {
const pageList = range(0, pages);
const browser = await puppeteer.launch()
while (pageList.length > 0) {
const currentBatch = pageList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, location, retries));
try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();
}
async function processBusiness(browser, row, location, retries = 3) {
const url = row.url;
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const page = await browser.newPage();
try {
await page.goto(getScrapeOpsUrl(url, location), { timeout: 60000 });
const infoSectionElement = await page.$("script[type='application/ld+json']");
const infoText = await page.evaluate(element => element.textContent, infoSectionElement);
const infoSection = JSON.parse(infoText);
const listElements = infoSection.itemListElement;
let anonCount = 1;
for (const element of listElements) {
let name = element.author.name;
if (name === "Unknown User") {
name = `${name}${anonCount}`;
anonCount++;
}
const familyFriendly = element.isFamilyFriendly;
const date = element.uploadDate;
const position = element.position;
const reviewData = {
name: name,
family_friendly: familyFriendly,
date: date,
position: position
}
await writeToCsv([reviewData], `${row.name.replace(" ", "-")}.csv`);
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left: ${retries-tries}, url: ${getScrapeOpsUrl(url)}`);
tries++;
} finally {
await page.close();
}
}
}
async function processResults(csvFile, location, concurrencyLimit, retries) {
const businesses = await readCsv(csvFile);
const browser = await puppeteer.launch();
while (businesses.length > 0) {
const currentBatch = businesses.splice(0, concurrencyLimit);
const tasks = currentBatch.map(business => processBusiness(browser, business, location, retries));
try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();
}
async function main() {
const keywords = ["restaurants"];
const concurrencyLimit = 4;
const pages = 5;
const location = "uk";
const retries = 3;
const aggregateFiles = [];
for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, pages, location, concurrencyLimit, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
console.log("Starting scrape");
for (const file of aggregateFiles) {
await processResults(file, location, concurrencyLimit, retries);
}
console.log("Scrape complete");
}
main();
Step 6: Production Run
Here is our final `main()` function. We're running on 5 pages again.
async function main() {
const keywords = ["restaurants"];
const concurrencyLimit = 4;
const pages = 5;
const location = "uk";
const retries = 3;
const aggregateFiles = [];
for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, pages, location, concurrencyLimit, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
console.log("Starting scrape");
for (const file of aggregateFiles) {
await processResults(file, location, concurrencyLimit, retries);
}
console.log("Scrape complete");
}
Here are the results.
The run finished in 3 minutes 19 seconds, or 199 seconds total. Earlier, the crawl alone took roughly 27 seconds. That leaves approximately 172 seconds for 50 results: 172 seconds / 50 results = 3.44 seconds per result.
Compared to other frameworks, this is lightning fast!
Legal and Ethical Considerations
Whenever you choose to interact with a website, you are subject to their Terms of Service.
Violating terms of service on any site will likely get you suspended or banned. Yelp's terms are available to read here.
When using any sort of bot, such as a scraper, you also need to take a look at their `robots.txt`, available here.
It's typically legal to scrape data as long as it's publicly available. Public data is any data that's not gated behind a login. If you need to log in to view the data, it's considered private data.
If you have questions about the legality of a scraping job, you should consult an attorney.
Conclusion
You've finally finished the tutorial! You now have an understanding of how to use Puppeteer and how to incorporate parsing, pagination, data storage, concurrency and proxy integration into your design.
You also know how to parse different HTML elements and how to extract data from nested JSON.
More Puppeteer Web Scraping Guides
Here at ScrapeOps, we've got all sorts of learning material. Whether you're new to scraping or you're a seasoned dev, we have something for you to add to your toolbox. Check out our Puppeteer Web Scraping Playbook.
If you enjoyed this article, check out a couple more from our How To Scrape series.