

How to Scrape Yelp With Puppeteer

If you've ever looked up restaurant reviews online, you've most likely used Yelp. Business owners depend heavily on their Yelp reviews, and reviewers tend to be brutally honest. On top of that, Yelp has been around since 2004, which gives us a huge dataset to work with.

In this detailed tutorial, we'll go over how to scrape Yelp with NodeJS Puppeteer.

Need help scraping the web?

Then check out ScrapeOps, the complete toolkit for web scraping.


TLDR - How to Scrape Yelp

For those of you who don't have time to read the full article, here is a restaurant scraper that's ready to go.

All you need to do is create a new JavaScript project and add a config.json file containing your ScrapeOps API key to the project folder.

Yelp is very good at blocking scrapers, but you don't need to worry about that here: this scraper comes prebuilt with support for the ScrapeOps Residential Proxy.
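The config.json file only needs to hold your API key (the script reads it with JSON.parse(fs.readFileSync("config.json")).api_key). A minimal example with a placeholder key:

{
    "api_key": "YOUR-SUPER-SECRET-API-KEY"
}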

const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");

const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;

console.log("api key:", API_KEY);

async function writeToCsv(data, outputFile) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);

const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))

const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
} catch (e) {
throw new Error("Failed to write to csv");
}
}

async function readCsv(inputFile) {
const results = [];
const parser = fs.createReadStream(inputFile).pipe(csvParse.parse({
columns: true,
delimiter: ",",
trim: true,
skip_empty_lines: true
}));

for await (const record of parser) {
results.push(record);
}
return results;
}

function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}

function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location,
residential: true,
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}

async function scrapeSearchResults(browser, keyword, pageNumber, location="us", retries=3) {
let tries = 0;
let success = false;

while (tries <= retries && !success) {

const formattedKeyword = keyword.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.yelp.com/search?find_desc=${formattedKeyword}&find_loc=${location}&start=${pageNumber*10}`;

const proxyUrl = getScrapeOpsUrl(url, location);

await page.goto(proxyUrl);
console.log(`Successfully fetched: ${url}`);

const divCards = await page.$$("div[data-testid='serp-ia-card']");

for (const divCard of divCards) {

const cardText = await page.evaluate(element => element.textContent, divCard);
const img = await divCard.$("img");
const name = await page.evaluate(element => element.getAttribute("alt"), img);
const nameRemoved = cardText.replace(name, "");

let sponsored = isNaN(nameRemoved[0]);

let rank = 0;
if (!sponsored) {
const rankString = nameRemoved.split(".");
rank = Number(rankString[0]);
}

let rating = 0.0;
const hasRating = await divCard.$("div span[data-font-weight='semibold']");
if (hasRating) {
const ratingText = await page.evaluate(element => element.textContent, hasRating);
if (ratingText.length > 0) {
rating = Number(ratingText);
}
}

let reviewCount = "0";
if (cardText.includes("review")) {
reviewCount = cardText.split("(")[1].split(")")[0].split(" ")[0];
}

const aElement = await divCard.$("a");
const link = await page.evaluate(element => element.getAttribute("href"), aElement);
const yelpUrl = `https://www.yelp.com${link.replace("https://proxy.scrapeops.io", "")}`

const searchData = {
name: name,
sponsored: sponsored,
stars: rating,
rank: rank,
review_count: reviewCount,
url: yelpUrl
}

await writeToCsv([searchData], `${keyword.replace(" ", "-")}.csv`);
}


success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}

async function startScrape(keyword, pages, location, concurrencyLimit, retries) {
const pageList = range(0, pages);

const browser = await puppeteer.launch()

while (pageList.length > 0) {
const currentBatch = pageList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, location, retries));

try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}

await browser.close();
}

async function processBusiness(browser, row, location, retries = 3) {
const url = row.url;
let tries = 0;
let success = false;


while (tries <= retries && !success) {
const page = await browser.newPage();

try {
await page.goto(getScrapeOpsUrl(url, location), { timeout: 60000 });
const infoSectionElement = await page.$("script[type='application/ld+json']");
const infoText = await page.evaluate(element => element.textContent, infoSectionElement);
const infoSection = JSON.parse(infoText);
const listElements = infoSection.itemListElement;

let anonCount = 1;
for (const element of listElements) {
let name = element.author.name;
if (name === "Unknown User") {
name = `${name}${anonCount}`;
anonCount++;
}
const familyFriendly = element.isFamilyFriendly;
const date = element.uploadDate;
const position = element.position;

const reviewData = {
name: name,
family_friendly: familyFriendly,
date: date,
position: position
}
await writeToCsv([reviewData], `${row.name.replace(" ", "-")}.csv`);
}

success = true;


} catch (err) {
console.log(`Error: ${err}, tries left: ${retries-tries}, url: ${getScrapeOpsUrl(url)}`);
tries++;
} finally {
await page.close();
}
}
}

async function processResults(csvFile, location, concurrencyLimit, retries) {
const businesses = await readCsv(csvFile);
const browser = await puppeteer.launch();

while (businesses.length > 0) {
const currentBatch = businesses.splice(0, concurrencyLimit);
const tasks = currentBatch.map(business => processBusiness(browser, business, location, retries));

try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();

}

async function main() {
const keywords = ["restaurants"];
const concurrencyLimit = 4;
const pages = 5;
const location = "uk";
const retries = 3;
const aggregateFiles = [];

for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, pages, location, concurrencyLimit, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}


console.log("Starting scrape");
for (const file of aggregateFiles) {
await processResults(file, location, concurrencyLimit, retries);
}
console.log("Scrape complete");
}


main();

If you'd like to tweak this scraper, feel free to change any of the following constants:

  • concurrencyLimit: Limits the number of simultaneous tasks (or browser instances/pages) that can run concurrently.
  • pages: Determines how many pages of search results to scrape for each keyword.
  • location: Specifies the geographic location for the search.
  • retries: Sets the number of retry attempts if a scraping task fails due to an error (e.g., network issues or proxy blocks).

You can change the keywords, but be cautious when doing so: Yelp uses different CSS and page layouts for different types of businesses.

For example, adding online bank to the keywords will break the scraper. If you do change the keywords, inspect the results page for your new search and adjust the parsing function to fit that layout.


How To Architect Our Yelp Scraper

When we scrape Yelp, there are quite a few steps involved, and it's actually a two-part project.

In part one, we build a crawler. In part two, we'll build a scraper.

The job of a crawler is relatively straightforward:

  1. Perform a search and parse results
  2. Paginate the search so that we can control our batches of results.
  3. Store the data that was extracted when parsing the page.
  4. Run steps 1 through 3 concurrently so we can crawl multiple result pages at the same time.
  5. Proxy Integration with the ScrapeOps API so that we don't need to worry about getting blocked.

In part 2 of this project, our scraper will need to:

  1. Read the data we stored in part 1
  2. Look up the URL of each business from the CSV file and parse its page.
  3. Store the parsed data from each business.
  4. Run steps 2 and 3 for each business concurrently.
  5. Once again, use proxy integration to get around any potential roadblocks that may be in our way.

Understanding How To Scrape Yelp

Before we write our scraping code, we need to understand exactly how to get our information and how to extract it from the page.

We'll go through these next few steps in order to plan out how to build our scraper.


Step 1: How To Request Yelp Pages

Search URLs on Yelp look like this:

https://www.yelp.com/search?find_desc={formatted_keyword}&find_loc={location}

If we want to look up restaurants in the US, we would use the description parameter find_desc=restaurants and the location parameter find_loc=us.

So our complete URL would be:

https://www.yelp.com/search?find_desc=restaurants&find_loc=us
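If you'd rather build this URL in code than by hand, URLSearchParams (the same utility our proxy function uses later in this article) handles the query-string encoding for you. Here's a small sketch:

// Sketch: build a Yelp search URL for a given keyword and location.
const params = new URLSearchParams({
    find_desc: "restaurants",
    find_loc: "us"
});
const url = `https://www.yelp.com/search?${params.toString()}`;
console.log(url); // https://www.yelp.com/search?find_desc=restaurants&find_loc=us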

You can see how it looks in the browser below.

Yelp Search Results

All Yelp business pages contain /biz/, and then the name of the business. The URLs here aren't too much of a concern because we're pulling them from the search pages.

You can take a look at a Yelp business page below.

Yelp Business Pages


Step 2: How To Extract Data From Yelp Results and Pages

When we extract data from Yelp, we have to use a combination of strategies. To parse a results page, we parse the HTML and pull each element from the page.

When we parse a business page, we can get our data from a JSON blob located inside a script tag.

Yelp's restaurant results all contain a data-testid of serp-ia-card. You can see it in the image below.

Yelp HTML Inspection Search Results

Here is the JSON from the business page.

Yelp HTML Inspection Reviews
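To give you an idea of how that JSON blob is pulled out, here is a minimal Puppeteer sketch, assuming a page is already open on a business URL (the full version appears in the scraper later on):

// Grab the ld+json script tag from the business page and parse its contents.
const infoSectionElement = await page.$("script[type='application/ld+json']");
const infoText = await page.evaluate(element => element.textContent, infoSectionElement);
const infoSection = JSON.parse(infoText);
console.log(`Found ${infoSection.itemListElement.length} reviews in the JSON blob`);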


Step 3: How To Control Pagination

Without pagination, we won't get very far. With pagination, we can get our results in batches. To paginate our search URL, we add the start parameter. This parameter doesn't take a page number; it takes a result number.

Yelp gives us 10 results per page.

  • So if we request page 0, start is 0 and we get results 1 through 10.
  • Page 1 sets start to 10, which gives us results 11 through 20 (the mapping is sketched below).
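Concretely, the pagination math is just pageNumber * 10 fed into the start parameter. Here's a tiny sketch (buildSearchUrl is a hypothetical helper, not part of the finished scraper, which inlines this logic):

// Hypothetical helper: maps a zero-based page number to a Yelp search URL.
function buildSearchUrl(keyword, location, pageNumber) {
    const formattedKeyword = keyword.replace(" ", "+");
    return `https://www.yelp.com/search?find_desc=${formattedKeyword}&find_loc=${location}&start=${pageNumber * 10}`;
}

console.log(buildSearchUrl("restaurants", "us", 0)); // ...&start=0  -> results 1 through 10
console.log(buildSearchUrl("restaurants", "us", 1)); // ...&start=10 -> results 11 through 20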

Step 4: Geolocated Data

For geolocation, we'll use a combination of the ScrapeOps API and the find_loc parameter.

If we pass us into the ScrapeOps API, we will be routed through a server in the US. If we pass us into the find_loc parameter, Yelp will give us restaurants in the US.
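In the finished scraper, the same location string ends up in both places. Here's a trimmed sketch of the URL-building logic used later, assuming API_KEY has already been loaded from config.json:

const location = "us";

// find_loc tells Yelp where to search for businesses...
const yelpUrl = `https://www.yelp.com/search?find_desc=restaurants&find_loc=${location}`;

// ...while country tells the ScrapeOps proxy which country to route the request through.
const proxyParams = new URLSearchParams({
    api_key: API_KEY,
    url: yelpUrl,
    country: location,
    residential: true
});
const proxyUrl = `https://proxy.scrapeops.io/v1/?${proxyParams.toString()}`;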


Setting Up Our Yelp Scraper Project

Let's get started. You can run the following commands to get set up.

Create a New Project Folder

mkdir yelp-scraper

cd yelp-scraper

Create a New JavaScript Project

npm init -y

Install Our Dependencies

npm install puppeteer
npm install csv-writer
npm install csv-parse

fs ships with NodeJS, so there's nothing extra to install for it.

Build A Yelp Search Crawler

As mentioned before, we need to build a crawler first. Our design will include parsing, pagination, data storage, concurrency and proxy integration.

In the coming sections, we'll go through step by step and add these into our design.


Step 1: Create Simple Search Data Parser

For starters, we need to be able to parse our data. The code below holds our basic structure: logging, retry logic, reading the API key from a file, and parsing a page.

const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");

const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;

console.log("api key:", API_KEY);


async function scrapeSearchResults(browser, keyword, location="us", retries=3) {
let tries = 0;
let success = false;

while (tries <= retries && !success) {

const formattedKeyword = keyword.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.yelp.com/search?find_desc=${formattedKeyword}&find_loc=${location}`;

await page.goto(url);
console.log(`Successfully fetched: ${url}`);

const divCards = await page.$$("div[data-testid='serp-ia-card']");

for (const divCard of divCards) {

const cardText = await page.evaluate(element => element.textContent, divCard);
const img = await divCard.$("img");
const name = await page.evaluate(element => element.getAttribute("alt"), img);
const nameRemoved = cardText.replace(name, "");

let sponsored = isNaN(nameRemoved[0]);

let rank = 0;
if (!sponsored) {
const rankString = nameRemoved.split(".");
rank = Number(rankString[0]);
}

let rating = 0.0;
const hasRating = await divCard.$("div span[data-font-weight='semibold']");
if (hasRating) {
const ratingText = await page.evaluate(element => element.textContent, hasRating);
if (ratingText.length > 0) {
rating = Number(ratingText);
}
}

let reviewCount = "0";
if (cardText.includes("review")) {
reviewCount = cardText.split("(")[1].split(")")[0].split(" ")[0];
}

const aElement = await divCard.$("a");
const link = await page.evaluate(element => element.getAttribute("href"), aElement);
const yelpUrl = `https://www.yelp.com${link.replace("https://proxy.scrapeops.io", "")}`

const searchData = {
name: name,
sponsored: sponsored,
stars: rating,
rank: rank,
review_count: reviewCount,
url: yelpUrl
}

console.log(searchData);
}


success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}

async function startScrape(keyword, location, retries) {
const browser = await puppeteer.launch()

await scrapeSearchResults(browser, keyword, location, retries);

await browser.close();
}


async function main() {
const keywords = ["restaurants"];
const concurrencyLimit = 4;
const pages = 1;
const location = "uk";
const retries = 3;
const aggregateFiles = [];

for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, location, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}

}


main();

Aside from the basic structure of the overall program, pay particular attention to a few things in the parsing function, scrapeSearchResults().

  • First, we find all of our business cards on the page using their CSS selector: await page.$$("div[data-testid='serp-ia-card']")
  • For each card, we do the following:
    • Retrieve its text: await page.evaluate(element => element.textContent, divCard).
    • Find its image: await divCard.$("img").
    • Pull the name of the business using the alt for the image: await page.evaluate(element => element.getAttribute("alt"), img).
    • cardText.replace(name, ""); creates a copy of the card text with the business name removed.
    • We can check if the business card is a sponsored ad using let sponsored = isNaN(nameRemoved[0]). All actual results come with a ranking number. If there is no rank in the title, the card is a sponsored ad.
    • If the card is not sponsored, we split the string at . and convert it to a number, Number(rankString[0]). This gives us the ranking number of the result.
    • Then we check for the CSS selector of the rating, div span[data-font-weight='semibold'] and if there's a rating present, we pull it from the card.
    • We use some string splitting to pull our review count similar to how we did earlier with the rank number.
    • Finally, we pull the aElement and extract its href in order to get the Yelp page for the business.

Step 2: Add Pagination

Adding pagination is really simple: we add pageNumber * 10 to the start param of our URL.

Take a look at the code below. We added a range() function and tweaked startScrape() to support scraping multiple pages.

const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");

const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;

console.log("api key:", API_KEY);


function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}

async function scrapeSearchResults(browser, keyword, pageNumber, location="us", retries=3) {
let tries = 0;
let success = false;

while (tries <= retries && !success) {

const formattedKeyword = keyword.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.yelp.com/search?find_desc=${formattedKeyword}&find_loc=${location}&start=${pageNumber*10}`;

await page.goto(url);
console.log(`Successfully fetched: ${url}`);

const divCards = await page.$$("div[data-testid='serp-ia-card']");

for (const divCard of divCards) {

const cardText = await page.evaluate(element => element.textContent, divCard);
const img = await divCard.$("img");
const name = await page.evaluate(element => element.getAttribute("alt"), img);
const nameRemoved = cardText.replace(name, "");

let sponsored = isNaN(nameRemoved[0]);

let rank = 0;
if (!sponsored) {
const rankString = nameRemoved.split(".");
rank = Number(rankString[0]);
}

let rating = 0.0;
const hasRating = await divCard.$("div span[data-font-weight='semibold']");
if (hasRating) {
const ratingText = await page.evaluate(element => element.textContent, hasRating);
if (ratingText.length > 0) {
rating = Number(ratingText);
}
}

let reviewCount = "0";
if (cardText.includes("review")) {
reviewCount = cardText.split("(")[1].split(")")[0].split(" ")[0];
}

const aElement = await divCard.$("a");
const link = await page.evaluate(element => element.getAttribute("href"), aElement);
const yelpUrl = `https://www.yelp.com${link.replace("https://proxy.scrapeops.io", "")}`

const searchData = {
name: name,
sponsored: sponsored,
stars: rating,
rank: rank,
review_count: reviewCount,
url: yelpUrl
}

console.log(searchData);
}


success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}

async function startScrape(keyword, pages, location, retries) {
const pageList = range(0, pages);

const browser = await puppeteer.launch()

for (const page of pageList) {
await scrapeSearchResults(browser, keyword, page, location, retries);
}

await browser.close();
}


async function main() {
const keywords = ["restaurants"];
const concurrencyLimit = 4;
const pages = 1;
const location = "uk";
const retries = 3;
const aggregateFiles = [];

for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, pages, location, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}

}


main();

We can now control our batching with our modified url: https://www.yelp.com/search?find_desc=${formattedKeyword}&find_loc=${location}&start=${pageNumber*10}.


Step 3: Storing the Scraped Data

Scraping is pointless if we don't store the data. To store our data, we'll write a function that saves it to a CSV file. It's very important that this function opens the file in append mode when the file already exists.

If the file doesn't exist, our function needs to create it. It also takes an array of JSON objects, so whenever we pass something into this function, we need to pass it in as an array.

Here is writeToCsv().

async function writeToCsv(data, outputFile) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);

const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))

const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
} catch (e) {
throw new Error("Failed to write to csv");
}
}
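As a quick usage sketch (called from inside an async function; the object shape mirrors the searchData objects our crawler builds):

// Example usage: append one row to restaurants.csv.
// The file and its headers are created automatically on the first call.
const searchData = {
    name: "Example Bistro",      // illustrative values, not real Yelp data
    sponsored: false,
    stars: 4.5,
    rank: 1,
    review_count: "120",
    url: "https://www.yelp.com/biz/example-bistro"
};
await writeToCsv([searchData], "restaurants.csv");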

Here is our full code up to this point.

const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");

const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;

console.log("api key:", API_KEY);

async function writeToCsv(data, outputFile) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);

const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))

const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
} catch (e) {
throw new Error("Failed to write to csv");
}
}


function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}

async function scrapeSearchResults(browser, keyword, pageNumber, location="us", retries=3) {
let tries = 0;
let success = false;

while (tries <= retries && !success) {

const formattedKeyword = keyword.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.yelp.com/search?find_desc=${formattedKeyword}&find_loc=${location}&start=${pageNumber*10}`;

await page.goto(url);
console.log(`Successfully fetched: ${url}`);

const divCards = await page.$$("div[data-testid='serp-ia-card']");

for (const divCard of divCards) {

const cardText = await page.evaluate(element => element.textContent, divCard);
const img = await divCard.$("img");
const name = await page.evaluate(element => element.getAttribute("alt"), img);
const nameRemoved = cardText.replace(name, "");

let sponsored = isNaN(nameRemoved[0]);

let rank = 0;
if (!sponsored) {
const rankString = nameRemoved.split(".");
rank = Number(rankString[0]);
}

let rating = 0.0;
const hasRating = await divCard.$("div span[data-font-weight='semibold']");
if (hasRating) {
const ratingText = await page.evaluate(element => element.textContent, hasRating);
if (ratingText.length > 0) {
rating = Number(ratingText);
}
}

let reviewCount = "0";
if (cardText.includes("review")) {
reviewCount = cardText.split("(")[1].split(")")[0].split(" ")[0];
}

const aElement = await divCard.$("a");
const link = await page.evaluate(element => element.getAttribute("href"), aElement);
const yelpUrl = `https://www.yelp.com${link.replace("https://proxy.scrapeops.io", "")}`

const searchData = {
name: name,
sponsored: sponsored,
stars: rating,
rank: rank,
review_count: reviewCount,
url: yelpUrl
}

await writeToCsv([searchData], `${keyword.replace(" ", "-")}.csv`);
}


success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}

async function startScrape(keyword, pages, location, retries) {
const pageList = range(0, pages);

const browser = await puppeteer.launch()

for (const page of pageList) {
await scrapeSearchResults(browser, keyword, page, location, retries);
}

await browser.close();
}


async function main() {
const keywords = ["restaurants"];
const concurrencyLimit = 4;
const pages = 1;
const location = "uk";
const retries = 3;
const aggregateFiles = [];

for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, pages, location, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}

}


main();

Instead of printing each result to the console, we now write it to a CSV file.


Step 4: Adding Concurrency

Time to add concurrency. Our crawler is now crawling multiple pages, but it needs to crawl multiple pages at the same time. To do this, we'll change startScrape().

Instead of a for loop, we're going to shrink our pageList array by splicing off up to concurrencyLimit pages at a time and running scrapeSearchResults() on each page in the spliced batch.

We then await all of these tasks and splice() again. We continue this process until the array shrinks down to nothing.
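If the splice-and-await pattern is new to you, here is a stripped-down sketch of the same idea with plain numbers (runTask is a hypothetical stand-in for scrapeSearchResults()):

// Hypothetical stand-in for a real scraping task.
async function runTask(page) {
    console.log(`processing page ${page}`);
}

async function runInBatches(pages, concurrencyLimit) {
    while (pages.length > 0) {
        // splice() removes the first `concurrencyLimit` items from the array...
        const currentBatch = pages.splice(0, concurrencyLimit);
        // ...and we run that whole batch concurrently before pulling the next one.
        await Promise.all(currentBatch.map(page => runTask(page)));
    }
}

runInBatches([0, 1, 2, 3, 4], 2); // batches: [0, 1], [2, 3], [4]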

Here is our refactored function.

async function startScrape(keyword, pages, location, concurrencyLimit, retries) {
const pageList = range(0, pages);

const browser = await puppeteer.launch()

while (pageList.length > 0) {
const currentBatch = pageList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, location, retries));

try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}

await browser.close();
}

Here is our fully updated code.

const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");

const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;

console.log("api key:", API_KEY);

async function writeToCsv(data, outputFile) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);

const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))

const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
} catch (e) {
throw new Error("Failed to write to csv");
}
}


function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}

async function scrapeSearchResults(browser, keyword, pageNumber, location="us", retries=3) {
let tries = 0;
let success = false;

while (tries <= retries && !success) {

const formattedKeyword = keyword.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.yelp.com/search?find_desc=${formattedKeyword}&find_loc=${location}&start=${pageNumber*10}`;

await page.goto(url);
console.log(`Successfully fetched: ${url}`);

const divCards = await page.$$("div[data-testid='serp-ia-card']");

for (const divCard of divCards) {

const cardText = await page.evaluate(element => element.textContent, divCard);
const img = await divCard.$("img");
const name = await page.evaluate(element => element.getAttribute("alt"), img);
const nameRemoved = cardText.replace(name, "");

let sponsored = isNaN(nameRemoved[0]);

let rank = 0;
if (!sponsored) {
const rankString = nameRemoved.split(".");
rank = Number(rankString[0]);
}

let rating = 0.0;
const hasRating = await divCard.$("div span[data-font-weight='semibold']");
if (hasRating) {
const ratingText = await page.evaluate(element => element.textContent, hasRating);
if (ratingText.length > 0) {
rating = Number(ratingText);
}
}

let reviewCount = "0";
if (cardText.includes("review")) {
reviewCount = cardText.split("(")[1].split(")")[0].split(" ")[0];
}

const aElement = await divCard.$("a");
const link = await page.evaluate(element => element.getAttribute("href"), aElement);
const yelpUrl = `https://www.yelp.com${link.replace("https://proxy.scrapeops.io", "")}`

const searchData = {
name: name,
sponsored: sponsored,
stars: rating,
rank: rank,
review_count: reviewCount,
url: yelpUrl
}

await writeToCsv([searchData], `${keyword.replace(" ", "-")}.csv`);
}


success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}

async function startScrape(keyword, pages, location, concurrencyLimit, retries) {
const pageList = range(0, pages);

const browser = await puppeteer.launch()

while (pageList.length > 0) {
const currentBatch = pageList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, location, retries));

try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}

await browser.close();
}


async function main() {
const keywords = ["restaurants"];
const concurrencyLimit = 4;
const pages = 1;
const location = "uk";
const retries = 3;
const aggregateFiles = [];

for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, pages, location, concurrencyLimit, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}

}


main();

Some key points you should notice from this code:

  • const currentBatch = pageList.splice(0, concurrencyLimit); pulls the next batch of pages out of pageList, and .map() turns that batch into a list of tasks to run.
  • We wait for the list of tasks to finish with await Promise.all(tasks);
  • We repeat this process until pageList is empty.

Adding concurrency will greatly increase the efficiency of our crawler.


Step 5: Bypassing Anti-Bots

Anti-bots are the arch enemy of many developers. While they're designed to block malicious traffic, they tend to block non-malicious scrapers as well.

Luckily, we have exactly the right tools to bypass them. The ScrapeOps Residential Proxy does wonders when trying to get results from Yelp.

The function below takes in a URL and a location and returns a proxied URL. This holds the keys to everything.

function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location,
residential: true,
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}
  • api_key: holds our ScrapeOps API key.
  • url: is the url of the website we'd like to scrape.
  • country: is the country you'd like to be routed through.
  • residential: allows us to use a residential IP address. This greatly increases our chances of getting past anti-bots because we're not showing up from a data center IP. (An example of the resulting proxied URL is shown below.)
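As a rough usage sketch, wrapping a Yelp search URL looks like this (with a placeholder API key; URLSearchParams percent-encodes the target URL):

const proxyUrl = getScrapeOpsUrl("https://www.yelp.com/search?find_desc=restaurants&find_loc=uk", "uk");
console.log(proxyUrl);
// -> https://proxy.scrapeops.io/v1/?api_key=YOUR-API-KEY&url=https%3A%2F%2Fwww.yelp.com%2Fsearch%3Ffind_desc%3Drestaurants%26find_loc%3Duk&country=uk&residential=true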

Here is our production ready code.

const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");

const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;

console.log("api key:", API_KEY);

async function writeToCsv(data, outputFile) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);

const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))

const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
} catch (e) {
throw new Error("Failed to write to csv");
}
}

function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}

function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location,
residential: true,
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}

async function scrapeSearchResults(browser, keyword, pageNumber, location="us", retries=3) {
let tries = 0;
let success = false;

while (tries <= retries && !success) {

const formattedKeyword = keyword.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.yelp.com/search?find_desc=${formattedKeyword}&find_loc=${location}&start=${pageNumber*10}`;

const proxyUrl = getScrapeOpsUrl(url, location);

await page.goto(proxyUrl);
console.log(`Successfully fetched: ${url}`);

const divCards = await page.$$("div[data-testid='serp-ia-card']");

for (const divCard of divCards) {

const cardText = await page.evaluate(element => element.textContent, divCard);
const img = await divCard.$("img");
const name = await page.evaluate(element => element.getAttribute("alt"), img);
const nameRemoved = cardText.replace(name, "");

let sponsored = isNaN(nameRemoved[0]);

let rank = 0;
if (!sponsored) {
const rankString = nameRemoved.split(".");
rank = Number(rankString[0]);
}

let rating = 0.0;
const hasRating = await divCard.$("div span[data-font-weight='semibold']");
if (hasRating) {
const ratingText = await page.evaluate(element => element.textContent, hasRating);
if (ratingText.length > 0) {
rating = Number(ratingText);
}
}

let reviewCount = "0";
if (cardText.includes("review")) {
reviewCount = cardText.split("(")[1].split(")")[0].split(" ")[0];
}

const aElement = await divCard.$("a");
const link = await page.evaluate(element => element.getAttribute("href"), aElement);
const yelpUrl = `https://www.yelp.com${link.replace("https://proxy.scrapeops.io", "")}`

const searchData = {
name: name,
sponsored: sponsored,
stars: rating,
rank: rank,
review_count: reviewCount,
url: yelpUrl
}

await writeToCsv([searchData], `${keyword.replace(" ", "-")}.csv`);
}


success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}

async function startScrape(keyword, pages, location, concurrencyLimit, retries) {
const pageList = range(0, pages);

const browser = await puppeteer.launch()

while (pageList.length > 0) {
const currentBatch = pageList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, location, retries));

try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}

await browser.close();
}


async function main() {
const keywords = ["restaurants"];
const concurrencyLimit = 4;
const pages = 1;
const location = "uk";
const retries = 3;
const aggregateFiles = [];

for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, pages, location, concurrencyLimit, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}

}


main();

Step 6: Production Run

Let's test this thing out. I'll run main() on 5 pages of results.

async function main() {
const keywords = ["restaurants"];
const concurrencyLimit = 4;
const pages = 5;
const location = "uk";
const retries = 3;
const aggregateFiles = [];

for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, pages, location, concurrencyLimit, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}

}

Here are the results. Crawler Performance

We finished 5 pages of results in roughly 27.67 seconds. This comes out to about 5.53 seconds per page.


Build A Yelp Scraper

Time for part 2. Now we'll build a scraper that reads our CSV file and then scrapes each business listed in it. Here are the steps laid out:

  1. Read the CSV file.
  2. Lookup and parse each business from the file.
  3. Save the parsed data to a CSV file.
  4. Run steps 2 and 3 concurrently on each business.
  5. Integrate with a proxy to get past any roadblocks.

Step 1: Create Simple Business Data Parser

Just like before, we'll get started with a parsing function. It's very similar to the parsing function from before, but this time we find a single script tag and extract JSON from it.

async function processBusiness(browser, row, location, retries = 3) {
const url = row.url;
let tries = 0;
let success = false;


while (tries <= retries && !success) {
const page = await browser.newPage();

try {
await page.goto(url, { timeout: 60000 });
const infoSectionElement = await page.$("script[type='application/ld+json']");
const infoText = await page.evaluate(element => element.textContent, infoSectionElement);
const infoSection = JSON.parse(infoText);
const listElements = infoSection.itemListElement;

let anonCount = 1;
for (const element of listElements) {
let name = element.author.name;
if (name === "Unknown User") {
name = `${name}${anonCount}`;
anonCount++;
}
const familyFriendly = element.isFamilyFriendly;
const date = element.uploadDate;
const position = element.position;

const reviewData = {
name: name,
family_friendly: familyFriendly,
date: date,
position: position
}
console.log(reviewData);
}

success = true;


} catch (err) {
console.log(`Error: ${err}, tries left: ${retries-tries}, url: ${getScrapeOpsUrl(url)}`);
tries++;
} finally {
await page.close();
}
}
}
  • First we find the script tag that holds our JSON, await page.$("script[type='application/ld+json']");
  • Once we've got our JSON, we pull the following (a trimmed example of the JSON's shape is sketched after this list):
    • name: the name of the reviewer (anonymous reviewers show up as Unknown User and get numbered).
    • familyFriendly: whether or not the restaurant is family friendly.
    • date: the date the review was uploaded.
    • position: the position that the review shows up on the page.
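For context, here is a heavily trimmed, illustrative sketch of the shape that ld+json blob takes on a business page; only the fields our scraper reads are shown, and the values are made up:

{
    "itemListElement": [
        {
            "author": { "name": "Unknown User" },
            "isFamilyFriendly": true,
            "uploadDate": "2024-01-15",
            "position": 1
        }
    ]
}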

Step 2: Loading URLs To Scrape

Before we parse our data, we need to be able to read our CSV file, and we also need to be able to run our parsing function on multiple rows.

Here is our readCsv() function.

async function readCsv(inputFile) {
const results = [];
const parser = fs.createReadStream(inputFile).pipe(csvParse.parse({
columns: true,
delimiter: ",",
trim: true,
skip_empty_lines: true
}));

for await (const record of parser) {
results.push(record);
}
return results;
}
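As a quick usage sketch (called from inside an async function; the column names match the searchData objects the crawler wrote):

// Read the crawler's output back into an array of row objects.
const rows = await readCsv("restaurants.csv");
console.log(rows[0].name, rows[0].url); // each row exposes name, sponsored, stars, rank, review_count and url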

Here is our processResults() function.

async function processResults(csvFile, location, retries) {
const businesses = await readCsv(csvFile);
const browser = await puppeteer.launch();

for (const business of businesses) {
await processBusiness(browser, business, location, retries)
}
await browser.close();

}

Here is our fully updated code.

const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");

const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;

console.log("api key:", API_KEY);

async function writeToCsv(data, outputFile) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);

const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))

const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
} catch (e) {
throw new Error("Failed to write to csv");
}
}

async function readCsv(inputFile) {
const results = [];
const parser = fs.createReadStream(inputFile).pipe(csvParse.parse({
columns: true,
delimiter: ",",
trim: true,
skip_empty_lines: true
}));

for await (const record of parser) {
results.push(record);
}
return results;
}

function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}

function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location,
residential: true,
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}

async function scrapeSearchResults(browser, keyword, pageNumber, location="us", retries=3) {
let tries = 0;
let success = false;

while (tries <= retries && !success) {

const formattedKeyword = keyword.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.yelp.com/search?find_desc=${formattedKeyword}&find_loc=${location}&start=${pageNumber*10}`;

const proxyUrl = getScrapeOpsUrl(url, location);

await page.goto(proxyUrl);
console.log(`Successfully fetched: ${url}`);

const divCards = await page.$$("div[data-testid='serp-ia-card']");

for (const divCard of divCards) {

const cardText = await page.evaluate(element => element.textContent, divCard);
const img = await divCard.$("img");
const name = await page.evaluate(element => element.getAttribute("alt"), img);
const nameRemoved = cardText.replace(name, "");

let sponsored = isNaN(nameRemoved[0]);

let rank = 0;
if (!sponsored) {
const rankString = nameRemoved.split(".");
rank = Number(rankString[0]);
}

let rating = 0.0;
const hasRating = await divCard.$("div span[data-font-weight='semibold']");
if (hasRating) {
const ratingText = await page.evaluate(element => element.textContent, hasRating);
if (ratingText.length > 0) {
rating = Number(ratingText);
}
}

let reviewCount = "0";
if (cardText.includes("review")) {
reviewCount = cardText.split("(")[1].split(")")[0].split(" ")[0];
}

const aElement = await divCard.$("a");
const link = await page.evaluate(element => element.getAttribute("href"), aElement);
const yelpUrl = `https://www.yelp.com${link.replace("https://proxy.scrapeops.io", "")}`

const searchData = {
name: name,
sponsored: sponsored,
stars: rating,
rank: rank,
review_count: reviewCount,
url: yelpUrl
}

await writeToCsv([searchData], `${keyword.replace(" ", "-")}.csv`);
}


success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}

async function startScrape(keyword, pages, location, concurrencyLimit, retries) {
const pageList = range(0, pages);

const browser = await puppeteer.launch()

while (pageList.length > 0) {
const currentBatch = pageList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, location, retries));

try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}

await browser.close();
}

async function processBusiness(browser, row, location, retries = 3) {
const url = row.url;
let tries = 0;
let success = false;


while (tries <= retries && !success) {
const page = await browser.newPage();

try {
await page.goto(url, { timeout: 60000 });
const infoSectionElement = await page.$("script[type='application/ld+json']");
const infoText = await page.evaluate(element => element.textContent, infoSectionElement);
const infoSection = JSON.parse(infoText);
const listElements = infoSection.itemListElement;

let anonCount = 1;
for (const element of listElements) {
let name = element.author.name;
if (name === "Unknown User") {
name = `${name}${anonCount}`;
anonCount++;
}
const familyFriendly = element.isFamilyFriendly;
const date = element.uploadDate;
const position = element.position;

const reviewData = {
name: name,
family_friendly: familyFriendly,
date: date,
position: position
}
console.log(reviewData);
}

success = true;


} catch (err) {
console.log(`Error: ${err}, tries left: ${retries-tries}, url: ${getScrapeOpsUrl(url)}`);
tries++;
} finally {
await page.close();
}
}
}

async function processResults(csvFile, location, retries) {
const businesses = await readCsv(csvFile);
const browser = await puppeteer.launch();

for (const business of businesses) {
await processBusiness(browser, business, location, retries)
}
await browser.close();

}

async function main() {
const keywords = ["restaurants"];
const concurrencyLimit = 4;
const pages = 5;
const location = "uk";
const retries = 3;
const aggregateFiles = [];

for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, pages, location, concurrencyLimit, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}


console.log("Starting scrape");
for (const file of aggregateFiles) {
await processResults(file, location, retries);
}
console.log("Scrape complete");
}


main();
  • readCsv() reads our CSV file into an array of JSON objects.
  • processResults() runs processBusiness() on every single one of the rows from the CSV file.

Step 3: Storing the Scraped Data

Storing our data will be really simple. We already have a function that writes to CSV; all we need to do is replace the console.log(reviewData) call in processBusiness() with a write.

await writeToCsv([reviewData], `${row.name.replace(" ", "-")}.csv`);

Here is our full script at the moment.

const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");

const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;

console.log("api key:", API_KEY);

async function writeToCsv(data, outputFile) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);

const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))

const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
} catch (e) {
throw new Error("Failed to write to csv");
}
}

async function readCsv(inputFile) {
const results = [];
const parser = fs.createReadStream(inputFile).pipe(csvParse.parse({
columns: true,
delimiter: ",",
trim: true,
skip_empty_lines: true
}));

for await (const record of parser) {
results.push(record);
}
return results;
}

function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}

function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location,
residential: true,
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}

async function scrapeSearchResults(browser, keyword, pageNumber, location="us", retries=3) {
let tries = 0;
let success = false;

while (tries <= retries && !success) {

const formattedKeyword = keyword.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.yelp.com/search?find_desc=${formattedKeyword}&find_loc=${location}&start=${pageNumber*10}`;

const proxyUrl = getScrapeOpsUrl(url, location);

await page.goto(proxyUrl);
console.log(`Successfully fetched: ${url}`);

const divCards = await page.$$("div[data-testid='serp-ia-card']");

for (const divCard of divCards) {

const cardText = await page.evaluate(element => element.textContent, divCard);
const img = await divCard.$("img");
const name = await page.evaluate(element => element.getAttribute("alt"), img);
const nameRemoved = cardText.replace(name, "");

let sponsored = isNaN(nameRemoved[0]);

let rank = 0;
if (!sponsored) {
const rankString = nameRemoved.split(".");
rank = Number(rankString[0]);
}

let rating = 0.0;
const hasRating = await divCard.$("div span[data-font-weight='semibold']");
if (hasRating) {
const ratingText = await page.evaluate(element => element.textContent, hasRating);
if (ratingText.length > 0) {
rating = Number(ratingText);
}
}

let reviewCount = "0";
if (cardText.includes("review")) {
reviewCount = cardText.split("(")[1].split(")")[0].split(" ")[0];
}

const aElement = await divCard.$("a");
const link = await page.evaluate(element => element.getAttribute("href"), aElement);
const yelpUrl = `https://www.yelp.com${link.replace("https://proxy.scrapeops.io", "")}`

const searchData = {
name: name,
sponsored: sponsored,
stars: rating,
rank: rank,
review_count: reviewCount,
url: yelpUrl
}

await writeToCsv([searchData], `${keyword.replace(" ", "-")}.csv`);
}


success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}

async function startScrape(keyword, pages, location, concurrencyLimit, retries) {
const pageList = range(0, pages);

const browser = await puppeteer.launch()

while (pageList.length > 0) {
const currentBatch = pageList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, location, retries));

try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}

await browser.close();
}

async function processBusiness(browser, row, location, retries = 3) {
const url = row.url;
let tries = 0;
let success = false;


while (tries <= retries && !success) {
const page = await browser.newPage();

try {
await page.goto(url, { timeout: 60000 });
const infoSectionElement = await page.$("script[type='application/ld+json']");
const infoText = await page.evaluate(element => element.textContent, infoSectionElement);
const infoSection = JSON.parse(infoText);
const listElements = infoSection.itemListElement;

let anonCount = 1;
for (const element of listElements) {
let name = element.author.name;
if (name === "Unknown User") {
name = `${name}${anonCount}`;
anonCount++;
}
const familyFriendly = element.isFamilyFriendly;
const date = element.uploadDate;
const position = element.position;

const reviewData = {
name: name,
family_friendly: familyFriendly,
date: date,
position: position
}
await writeToCsv([reviewData], `${row.name.replace(" ", "-")}.csv`);
}

success = true;


} catch (err) {
console.log(`Error: ${err}, tries left: ${retries-tries}, url: ${getScrapeOpsUrl(url)}`);
tries++;
} finally {
await page.close();
}
}
}

async function processResults(csvFile, location, retries) {
const businesses = await readCsv(csvFile);
const browser = await puppeteer.launch();

for (const business of businesses) {
await processBusiness(browser, business, location, retries)
}
await browser.close();

}

async function main() {
const keywords = ["restaurants"];
const concurrencyLimit = 4;
const pages = 5;
const location = "uk";
const retries = 3;
const aggregateFiles = [];

for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, pages, location, concurrencyLimit, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}


console.log("Starting scrape");
for (const file of aggregateFiles) {
await processResults(file, location, retries);
}
console.log("Scrape complete");
}


main();

Step 4: Adding Concurrency

To add concurrency, all we need to do is tweak our processResults() function the same way we added concurrency earlier.

async function processResults(csvFile, location, concurrencyLimit, retries) {
const businesses = await readCsv(csvFile);
const browser = await puppeteer.launch();

while (businesses.length > 0) {
const currentBatch = businesses.splice(0, concurrencyLimit);
const tasks = currentBatch.map(business => processBusiness(browser, business, location, retries));

try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();

}

Step 5: Bypassing Anti-Bots

To bypass anti-bots, all we need to do is change one more line. We already have getScrapeOpsUrl(); we simply need to use it in one more spot.

await page.goto(getScrapeOpsUrl(url, location), { timeout: 60000 });

Here is our production ready script.

const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");

const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;

console.log("api key:", API_KEY);

async function writeToCsv(data, outputFile) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);

const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))

const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
} catch (e) {
throw new Error("Failed to write to csv");
}
}

async function readCsv(inputFile) {
const results = [];
const parser = fs.createReadStream(inputFile).pipe(csvParse.parse({
columns: true,
delimiter: ",",
trim: true,
skip_empty_lines: true
}));

for await (const record of parser) {
results.push(record);
}
return results;
}

function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}

function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location,
residential: true,
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}

async function scrapeSearchResults(browser, keyword, pageNumber, location="us", retries=3) {
let tries = 0;
let success = false;

while (tries <= retries && !success) {

const formattedKeyword = keyword.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.yelp.com/search?find_desc=${formattedKeyword}&find_loc=${location}&start=${pageNumber*10}`;

const proxyUrl = getScrapeOpsUrl(url, location);

await page.goto(proxyUrl);
console.log(`Successfully fetched: ${url}`);

const divCards = await page.$$("div[data-testid='serp-ia-card']");

for (const divCard of divCards) {

const cardText = await page.evaluate(element => element.textContent, divCard);
const img = await divCard.$("img");
const name = await page.evaluate(element => element.getAttribute("alt"), img);
const nameRemoved = cardText.replace(name, "");

let sponsored = isNaN(nameRemoved[0]);

let rank = 0;
if (!sponsored) {
const rankString = nameRemoved.split(".");
rank = Number(rankString[0]);
}

let rating = 0.0;
const hasRating = await divCard.$("div span[data-font-weight='semibold']");
if (hasRating) {
const ratingText = await page.evaluate(element => element.textContent, hasRating);
if (ratingText.length > 0) {
rating = Number(ratingText);
}
}

let reviewCount = "0";
if (cardText.includes("review")) {
reviewCount = cardText.split("(")[1].split(")")[0].split(" ")[0];
}

const aElement = await divCard.$("a");
const link = await page.evaluate(element => element.getAttribute("href"), aElement);
const yelpUrl = `https://www.yelp.com${link.replace("https://proxy.scrapeops.io", "")}`

const searchData = {
name: name,
sponsored: sponsored,
stars: rating,
rank: rank,
review_count: reviewCount,
url: yelpUrl
}

await writeToCsv([searchData], `${keyword.replace(" ", "-")}.csv`);
}


success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}

async function startScrape(keyword, pages, location, concurrencyLimit, retries) {
const pageList = range(0, pages);

const browser = await puppeteer.launch()

while (pageList.length > 0) {
const currentBatch = pageList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, location, retries));

try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}

await browser.close();
}

async function processBusiness(browser, row, location, retries = 3) {
const url = row.url;
let tries = 0;
let success = false;


while (tries <= retries && !success) {
const page = await browser.newPage();

try {
await page.goto(getScrapeOpsUrl(url, location), { timeout: 60000 });
const infoSectionElement = await page.$("script[type='application/ld+json']");
const infoText = await page.evaluate(element => element.textContent, infoSectionElement);
const infoSection = JSON.parse(infoText);
const listElements = infoSection.itemListElement;

let anonCount = 1;
for (const element of listElements) {
let name = element.author.name;
if (name === "Unknown User") {
name = `${name}${anonCount}`;
anonCount++;
}
const familyFriendly = element.isFamilyFriendly;
const date = element.uploadDate;
const position = element.position;

const reviewData = {
name: name,
family_friendly: familyFriendly,
date: date,
position: position
}
await writeToCsv([reviewData], `${row.name.replace(" ", "-")}.csv`);
}

success = true;


} catch (err) {
console.log(`Error: ${err}, tries left: ${retries-tries}, url: ${getScrapeOpsUrl(url)}`);
tries++;
} finally {
await page.close();
}
}
}

async function processResults(csvFile, location, concurrencyLimit, retries) {
const businesses = await readCsv(csvFile);
const browser = await puppeteer.launch();

while (businesses.length > 0) {
const currentBatch = businesses.splice(0, concurrencyLimit);
const tasks = currentBatch.map(business => processBusiness(browser, business, location, retries));

try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();

}

async function main() {
const keywords = ["restaurants"];
const concurrencyLimit = 4;
const pages = 5;
const location = "uk";
const retries = 3;
const aggregateFiles = [];

for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, pages, location, concurrencyLimit, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}


console.log("Starting scrape");
for (const file of aggregateFiles) {
await processResults(file, location, concurrencyLimit, retries);
}
console.log("Scrape complete");
}


main();

Step 6: Production Run

Here is our final main function. We're running on 5 pages again.

async function main() {
const keywords = ["restaurants"];
const concurrencyLimit = 4;
const pages = 5;
const location = "uk";
const retries = 3;
const aggregateFiles = [];

for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, pages, location, concurrencyLimit, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}


console.log("Starting scrape");
for (const file of aggregateFiles) {
await processResults(file, location, concurrencyLimit, retries);
}
console.log("Scrape complete");
}

Here are the results.

Scraper Performance

The run finished in 3 minutes 19 seconds, or 199 seconds total. Earlier, it took 27 seconds to perform the crawl. That leaves us with approximately 172 seconds for 50 results, 172 seconds / 50 results = 3.44 seconds per result.

Compared to other frameworks, this is lightning fast!


Legal and Ethical Considerations

Whenever you choose to interact with a website, you are subject to their Terms of Service.

Violating terms of service on any site will likely get you suspended or banned. Yelp's terms are available to read here.

When using any sort of bot such as a scraper, you also need to take a look at their robots.txt here.

It's typically legal to scrape data as long as it's publicly available. Public data is any data that isn't gated behind a login. If you need to log in to view the data, it's private data.

If you have questions about the legality of a scraping job, you should consult an attorney.


Conclusion

You've finally finished the tutorial! You now have an understanding of how to use Puppeteer and how to incorporate parsing, pagination, data storage, concurrency, and proxy integration into your design.

You also know how to parse different HTML elements and how to extract data from nested JSON.


More Puppeteer Web Scraping Guides

Here at ScrapeOps, we've got all sorts of learning material. Whether you're new to scraping or you're a seasoned dev, we have something for you to add to your toolbox. Check out our Puppeteer Web Scraping Playbook.

If you enjoyed this article, check out a couple more from our How To Scrape series.