

How to Scrape G2 with Puppeteer

G2 is one of the leading websites for detailed reviews of different businesses. If you're looking to get a real feel for a company, G2 is definitely the place to do it.

Once you finish this tutorial, you'll be able to retrieve all sorts of data from G2, and you'll pick up skills you can reuse whenever you build scrapers in the future.


TLDR - How to Scrape G2

The biggest pain when scraping G2 is pulling the data from the page. G2 data is nested extremely deeply inside the HTML elements and CSS classes. Lucky for you, we've got a production-ready G2 scraper right here in the TLDR.

To run this scraper, simply create a config.json file with your ScrapeOps API key and place it in the same folder as this script.
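The script reads the api_key field from that file, so your config.json should look something like this (with a placeholder value):

{
    "api_key": "YOUR-SCRAPEOPS-API-KEY"
}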

const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");

const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;

console.log("api key:", API_KEY);

async function writeToCsv(data, outputFile) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);

const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))

const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
} catch (e) {
throw new Error("Failed to write to csv");
}
}

async function readCsv(inputFile) {
const results = [];
const parser = fs.createReadStream(inputFile).pipe(csvParse.parse({
columns: true,
delimiter: ",",
trim: true,
skip_empty_lines: true
}));

for await (const record of parser) {
results.push(record);
}
return results;
}

function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}

function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}

async function scrapeSearchResults(browser, keyword, pageNumber, location="us", retries=3) {
let tries = 0;
let success = false;

while (tries <= retries && !success) {

const formattedKeyword = keyword.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.g2.com/search?page=${pageNumber+1}&query=${formattedKeyword}`;

const proxyUrl = getScrapeOpsUrl(url, location);

console.log(proxyUrl)
await page.goto(proxyUrl);

console.log(`Successfully fetched: ${url}`);

const divCards = await page.$$("div[class='product-listing mb-1 border-bottom']");

for (const divCard of divCards) {

const nameElement = await divCard.$("div[class='product-listing__product-name']");
const name = await page.evaluate(element => element.textContent, nameElement);

const g2UrlElement = await nameElement.$("a");
const g2Url = await page.evaluate(element => element.getAttribute("href"), g2UrlElement);

let rating = 0.0;
const ratingElement = await divCard.$("span[class='fw-semibold']");
if (ratingElement) {
rating = await page.evaluate(element => element.textContent, ratingElement);
}

const descriptionElement = await divCard.$("p");
const description = await page.evaluate(element => element.textContent, descriptionElement)

const businessInfo = {
name: name,
stars: rating,
g2_url: g2Url,
description: description
};

await writeToCsv([businessInfo], `${keyword.replace(" ", "-")}.csv`);
}


success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}

async function startScrape(keyword, pages, location, concurrencyLimit, retries) {
const pageList = range(0, pages);

const browser = await puppeteer.launch()

while (pageList.length > 0) {
const currentBatch = pageList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, location, retries));

try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}

await browser.close();
}

async function processBusiness(browser, row, location, retries = 3) {
const url = row.g2_url;
let tries = 0;
let success = false;


while (tries <= retries && !success) {
const page = await browser.newPage();

try {
await page.goto(getScrapeOpsUrl(url, location), { timeout: 60000 });

const reviewCards = await page.$$("div[class='paper paper--white paper--box mb-2 position-relative border-bottom']");
let anonCount = 0;

for (const reviewCard of reviewCards) {
const reviewDateElement = await reviewCard.$("time");
const reviewTextElement = await reviewCard.$("div[itemprop='reviewBody']");

if (reviewDateElement && reviewTextElement) {
const date = await page.evaluate(element => element.getAttribute("datetime"), reviewDateElement);
const reviewBody = await page.evaluate(element => element.textContent, reviewTextElement);

const nameElement = await reviewCard.$("a[class='link--header-color']");
let name;
if (nameElement) {
name = await page.evaluate(element => element.textContent, nameElement);
} else {
name = `anonymous-${anonCount}`;
anonCount++;
}

const jobTitleElement = await reviewCard.$("div[class='mt-4th']");
let jobTitle;
if (jobTitleElement) {
jobTitle = await page.evaluate(element => element.textContent, jobTitleElement);
} else {
jobTitle = "n/a";
}

const ratingContainer = await reviewCard.$("div[class='f-1 d-f ai-c mb-half-small-only']");
const ratingDiv = await ratingContainer.$("div");
const ratingClass = await page.evaluate(element => element.getAttribute("class"), ratingDiv);
const ratingArray = ratingClass.split("-");
const rating = Number(ratingArray[ratingArray.length-1])/2;

const infoContainer = await reviewCard.$("div[class='tags--teal']");
const incentivesDirty = await infoContainer.$$("div");
const incentivesClean = [];

let source = "";
for (const incentive of incentivesDirty) {
const text = await page.evaluate(element => element.textContent, incentive);
if (!incentivesClean.includes(text)) {
if (text.includes("Review source:")) {
const textArray = text.split(": ");
source = textArray[textArray.length-1];
} else {
incentivesClean.push(text);
}
}
}
const validated = incentivesClean.includes("Validated Reviewer");
const incentivized = incentivesClean.includes("Incentivized Review");

const reviewData = {
name: name,
date: date,
job_title: jobTitle,
rating: rating,
full_review: reviewBody,
review_source: source,
validated: validated,
incentivized: incentivized
}
await writeToCsv([reviewData], `${row.name.replace(" ", "-")}.csv`);
}
}

success = true;
} catch (err) {
console.log(`Error: ${err}, tries left: ${retries-tries}, url: ${getScrapeOpsUrl(url)}`);
tries++;
} finally {
await page.close();
}
}
}

async function processResults(csvFile, location, concurrencyLimit, retries) {
const businesses = await readCsv(csvFile);
const browser = await puppeteer.launch();

while (businesses.length > 0) {
const currentBatch = businesses.splice(0, concurrencyLimit);
const tasks = currentBatch.map(business => processBusiness(browser, business, location, retries));

try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();

}

async function main() {
const keywords = ["online bank"];
const concurrencyLimit = 5;
const pages = 1;
const location = "us";
const retries = 3;
const aggregateFiles = [];

for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, pages, location, concurrencyLimit, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}


console.log("Starting scrape");
for (const file of aggregateFiles) {
await processResults(file, location, concurrencyLimit, retries);
}
console.log("Scrape complete");
}


main();

If you'd like to tweak this scraper, feel free to change any of the following constants in the main function (a quick example follows the list):

  • keywords: Contains a list of keywords to be searched and scraped.
  • retries: Specifies the number of times the scraper will retry fetching a page if it encounters an error.
  • concurrencyLimit: Defines the maximum number of pages scraped concurrently (each task gets its own browser tab).
  • pages: Specifies the number of pages to scrape for each keyword.
  • location: Defines the geographic location from which the scraping requests appear to originate.
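
For example, a hypothetical tweak that crawls three pages each for two keywords would only touch these constants in main():

// Hypothetical configuration: two keywords, three pages each
const keywords = ["online bank", "crm software"];
const concurrencyLimit = 5;
const pages = 3;
const location = "us";
const retries = 3;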

How To Architect Our G2 Scraper

When we scrape G2, we need to build two different scrapers.

  1. The first one, our crawler, is designed to perform a search using keywords. The crawler then takes the results from the search and creates a CSV report from the results.
  2. After the crawler, we build our scraper. Once we've generated search results, our scraper reads the CSV file we just wrote. For each business in the CSV file, the scraper then pulls up their individual G2 page and extracts all the review information.

The crawler generates a detailed list of businesses. The scraper then gets detailed reviews for each business.

To ensure these scrapers are both performant and stable, each of our scrapers will use the following:

  • Parsing: so we can pull proper information from a page.
  • Pagination: so we can pull up different pages and be more selective about our data.
  • Data Storage: to store our data in a safe, efficient and readable way.
  • Concurrency: to scrape multiple pages at once.
  • Proxy Integration: when scraping anything at scale, we often face the issue of getting blocked. Proxies allow us a redundant connection and reduce our likelihood of getting blocked by different websites.

Understanding How To Scrape G2

Step 1: How To Request G2 Pages

A typical URL from G2 looks like this:

https://www.g2.com/search?query=online+bank

https://www.g2.com/search holds the domain and the search endpoint of our URL. The query goes on the end, after the ?: query=online+bank. We can also add more parameters with &.
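
For instance, here's a minimal sketch (using Node's built-in URLSearchParams, the same helper getScrapeOpsUrl() uses later) of how extra parameters get tacked on:

// Sketch: building a G2 search URL with an extra page parameter
const params = new URLSearchParams({
    query: "online bank",
    page: 2
});
console.log(`https://www.g2.com/search?${params.toString()}`);
// prints: https://www.g2.com/search?query=online+bank&page=2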

Take a look at the search below for online bank.

G2 Search Results

Along with a report of search results, we need to create a report on each individual business as well. The URL for each business looks like this:

https://www.g2.com/products/name-of-business/reviews

Below is a screenshot of one of G2's individual business pages.

G2 Business Details Page


Step 2: How To Extract Data From G2 Results and Pages

G2 data gets nested very deeply within the page. The screenshot below shows the name of a business buried inside the HTML.

The results page isn't too difficult to parse, and we're only going to be taking 4 pieces of data from each result.

g2 HTML Inspection

However, extracting data from the individual pages is much tougher. Take a look at the screenshot below:

g2 HTML Inspection Business Page

Look at stars-8 at the end of the class name.

  • Our rating number is actually masked and held inside of this CSS class.
  • The stars-number is actually double the number of our rating. stars-10 would be a 5 star review.
  • stars-9 would be 4.5. stars-8 would be 4. You get the idea.
  • Divide the stars-number by two and you get your rating, as shown in the sketch below.
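
Here's a minimal sketch of that conversion, using a made-up class value for illustration:

// Sketch: convert a "stars-N" CSS class into a star rating
const ratingClass = "stars-8";   // hypothetical class pulled from the rating div
const ratingArray = ratingClass.split("-");
const rating = Number(ratingArray[ratingArray.length - 1]) / 2;
console.log(rating);             // 4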

Step 3: How To Control Pagination

A paginated URL looks like this:

https://www.g2.com/search?page={page_number+1}&query={formatted_keyword}

Once our results are paginated, we can get batches of results. This allows us to get finer control over the data we're receiving. Without pagination, we'd be stuck on page 1!

The individual URLs look like this:

https://www.g2.com/products/name-of-business/reviews

Now that we know how to get our data, it's almost time to fetch it and pull it from the webpages.


Step 4: Geolocated Data

To handle geolocated data, we'll be using the ScrapeOps Proxy API. If we want to appear in Great Britain, we simply set our country parameter to "uk"; if we want to appear in the US, we set this param to "us".

When we pass our country into the ScrapeOps API, ScrapeOps will actually route our requests through a server in that country, so even if the site checks our geolocation, our geolocation will show up correctly!
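
As a quick sketch, here's what that looks like with the getScrapeOpsUrl() helper from the TLDR; the country value is the only thing that changes:

// Sketch: the same G2 URL routed through different countries via the ScrapeOps proxy
const searchUrl = "https://www.g2.com/search?query=online+bank";
const ukUrl = getScrapeOpsUrl(searchUrl, "uk");
const usUrl = getScrapeOpsUrl(searchUrl, "us");
// Both point at https://proxy.scrapeops.io/v1/ with country=uk or country=us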


Setting Up Our G2 Scraper Project

Let's get started. You can run the following commands to get set up.

Create a New Project Folder

mkdir g2-scraper

cd g2-scraper

Create A New JavaScript Project

npm init -y

Install Our Dependencies

npm install puppeteer
npm install csv-writer
npm install csv-parse

(fs ships with Node.js, so there's no need to install it separately.)

Build A G2 Search Crawler

Step 1: Create Simple Search Data Parser

When we're extracting our data from a search page, we're parsing that page. Let's get set up with a basic script that parses our page for us.

In the code below, we do the following:

  • while we still have retries left and the operation hasn't succeeded:
    • await page.goto(url) fetches the site
    • We then pull the name with await page.evaluate(element => element.textContent, nameElement)
    • await page.evaluate(element => element.getAttribute("href"), g2UrlElement) gets the link to the business, g2_url
    • If there is a rating present on the page, we pull it from the page. If there is no rating present, we set a default of 0.0
    • await page.evaluate(element => element.textContent, descriptionElement) gives us the description of the business
    • Afterward, we log our businessInfo to the console
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");

const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;

function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}

async function scrapeSearchResults(browser, keyword, location="us", retries=3) {
let tries = 0;
let success = false;

while (tries <= retries && !success) {

const formattedKeyword = keyword.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.g2.com/search?query=${formattedKeyword}`;

await page.goto(url);

console.log(`Successfully fetched: ${url}`);

const divCards = await page.$$("div[class='product-listing mb-1 border-bottom']");

for (const divCard of divCards) {

const nameElement = await divCard.$("div[class='product-listing__product-name']");
const name = await page.evaluate(element => element.textContent, nameElement);

const g2UrlElement = await nameElement.$("a");
const g2Url = await page.evaluate(element => element.getAttribute("href"), g2UrlElement);

let rating = 0.0;
const ratingElement = await divCard.$("span[class='fw-semibold']");
if (ratingElement) {
rating = await page.evaluate(element => element.textContent, ratingElement);
}

const descriptionElement = await divCard.$("p");
const description = await page.evaluate(element => element.textContent, descriptionElement);

const businessInfo = {
name: name,
stars: rating,
g2_url: g2Url,
description: description
};

console.log(businessInfo);
}


success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}

async function startScrape(keyword, location, retries) {

const browser = await puppeteer.launch()

await scrapeSearchResults(browser, keyword, location, retries)
await browser.close();
}

async function main() {
const keywords = ["online bank"];
const concurrencyLimit = 5;
const pages = 1;
const location = "us";
const retries = 3;
const aggregateFiles = [];

for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, location, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
}


main();

For each business in the results, we find the following: name, stars, g2_url, and description. This data allows us to create objects that represent each business.

Everything we do from here depends on the data we pull with this parsing function.


Step 2: Add Pagination

Before we start storing our data, we need to start fetching it in batches. To get our batches, we need to add pagination to our crawler. Here is a paginated URL:

https://www.g2.com/search?page={page_number+1}&query={formatted_keyword}

We use page_number+1 because startScrape() begins counting at zero.
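
As a quick illustration (reusing the range() helper from earlier), the zero-based list maps onto G2's one-based page parameter like this:

// Sketch: zero-based numbers from range() become 1-based pages in the URL
const formattedKeyword = "online+bank";
for (const pageNumber of range(0, 3)) {
    console.log(`https://www.g2.com/search?page=${pageNumber + 1}&query=${formattedKeyword}`);
}
// page=1, page=2, page=3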

Take a look at the updated code below:

const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");

const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;

function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}

async function scrapeSearchResults(browser, keyword, pageNumber, location="us", retries=3) {
let tries = 0;
let success = false;

while (tries <= retries && !success) {

const formattedKeyword = keyword.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.g2.com/search?page=${pageNumber+1}&query=${formattedKeyword}`;

await page.goto(url);

console.log(`Successfully fetched: ${url}`);

const divCards = await page.$$("div[class='product-listing mb-1 border-bottom']");

for (const divCard of divCards) {

const nameElement = await divCard.$("div[class='product-listing__product-name']");
const name = await page.evaluate(element => element.textContent, nameElement);

const g2UrlElement = await nameElement.$("a");
const g2Url = await page.evaluate(element => element.getAttribute("href"), g2UrlElement);

let rating = 0.0;
const ratingElement = await divCard.$("span[class='fw-semibold']");
if (ratingElement) {
rating = await page.evaluate(element => element.textContent, ratingElement);
}

const descriptionElement = await divCard.$("p");
const description = await page.evaluate(element => element.textContent, descriptionElement)

const businessInfo = {
name: name,
stars: rating,
g2_url: g2Url,
description: description
};

console.log(businessInfo);
}


success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}

async function startScrape(keyword, pages, location, retries) {
const pageList = range(0, pages);

const browser = await puppeteer.launch()

for (const page of pageList) {
await scrapeSearchResults(browser, keyword, page, location, retries);
}

await browser.close();
}

async function main() {
const keywords = ["online bank"];
const concurrencyLimit = 5;
const pages = 1;
const location = "us";
const retries = 3;
const aggregateFiles = [];

for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, pages, location, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
}


main();

We've added pageNumber to scrapeSearchResults(). We also added some functionality to startScrape().

It now creates a list of pages to scrape. It then iterates through the list and runs scrapeSearchResults() on each of the pages from the list.


Step 3: Storing the Scraped Data

To store our data, we're going to use the writeToCsv() function. You can look at it in the snippet below:

async function writeToCsv(data, outputFile) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);

const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))

const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
} catch (e) {
throw new Error("Failed to write to csv");
}
}

This function takes an array of JSON objects, and an outputFile. If outputFile already exists, we open it in append mode so we don't overwrite any important data. If the file doesn't exist yet, this function will create it.
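
Here's a small usage sketch with a hypothetical file name; the second call appends instead of overwriting because the file already exists by then:

// Hypothetical usage inside an async function: example.csv is created, then appended to
await writeToCsv([{ name: "Some Bank", stars: 4.5 }], "example.csv");
await writeToCsv([{ name: "Another Bank", stars: 4.0 }], "example.csv");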

Here is the full code after it's been updated to write our information to a CSV.

const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");

const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;

async function writeToCsv(data, outputFile) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);

const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))

const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
} catch (e) {
throw new Error("Failed to write to csv");
}
}


function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}

async function scrapeSearchResults(browser, keyword, pageNumber, location="us", retries=3) {
let tries = 0;
let success = false;

while (tries <= retries && !success) {

const formattedKeyword = keyword.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.g2.com/search?page=${pageNumber+1}&query=${formattedKeyword}`;

await page.goto(url);

console.log(`Successfully fetched: ${url}`);

const divCards = await page.$$("div[class='product-listing mb-1 border-bottom']");

for (const divCard of divCards) {

const nameElement = await divCard.$("div[class='product-listing__product-name']");
const name = await page.evaluate(element => element.textContent, nameElement);

const g2UrlElement = await nameElement.$("a");
const g2Url = await page.evaluate(element => element.getAttribute("href"), g2UrlElement);

let rating = 0.0;
const ratingElement = await divCard.$("span[class='fw-semibold']");
if (ratingElement) {
rating = await page.evaluate(element => element.textContent, ratingElement);
}

const descriptionElement = await divCard.$("p");
const description = await page.evaluate(element => element.textContent, descriptionElement)

const businessInfo = {
name: name,
stars: rating,
g2_url: g2Url,
description: description
};

await writeToCsv([businessInfo], `${keyword.replace(" ", "-")}.csv`);
}


success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}

async function startScrape(keyword, pages, location, retries) {
const pageList = range(0, pages);

const browser = await puppeteer.launch()

for (const page of pageList) {
await scrapeSearchResults(browser, keyword, page, location, retries);
}

await browser.close();
}

async function main() {
const keywords = ["online bank"];
const concurrencyLimit = 5;
const pages = 1;
const location = "us";
const retries = 3;
const aggregateFiles = [];

for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, pages, location, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
}


main();
  • writeToCsv() takes an array of objects and writes them to a CSV file
  • As we parse the page, we pass each object in as soon as we've parsed it. In the event of a crash, this allows us to save as much data as possible.

Step 4: Adding Concurrency

A for loop isn't good enough for a production-level crawler. Let's refactor startScrape() to use a concurrencyLimit and take advantage of multiple open pages inside the browser.

async function startScrape(keyword, pages, location, concurrencyLimit, retries) {
const pageList = range(0, pages);

const browser = await puppeteer.launch()

while (pageList.length > 0) {
const currentBatch = pageList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, location, retries));

try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}

await browser.close();
}

Let's break this down:

  • As before, we create a list of pages to scrape
  • while pageList.length is greater than zero, we splice() a batch from index zero up to our concurrencyLimit
  • We run scrapeSearchResults() on each page in the batch simultaneously and then await the results of the batch
  • As this process continues, our pageList shrinks all the way down to zero. Each time a batch finishes, the list gets smaller and frees up memory until there's nothing left to scrape.

Here is the fully updated code.

const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");

const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;

async function writeToCsv(data, outputFile) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);

const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))

const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
} catch (e) {
throw new Error("Failed to write to csv");
}
}


function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}

async function scrapeSearchResults(browser, keyword, pageNumber, location="us", retries=3) {
let tries = 0;
let success = false;

while (tries <= retries && !success) {

const formattedKeyword = keyword.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.g2.com/search?page=${pageNumber+1}&query=${formattedKeyword}`;

await page.goto(url);

console.log(`Successfully fetched: ${url}`);

const divCards = await page.$$("div[class='product-listing mb-1 border-bottom']");

for (const divCard of divCards) {

const nameElement = await divCard.$("div[class='product-listing__product-name']");
const name = await page.evaluate(element => element.textContent, nameElement);

const g2UrlElement = await nameElement.$("a");
const g2Url = await page.evaluate(element => element.getAttribute("href"), g2UrlElement);

let rating = 0.0;
const ratingElement = await divCard.$("span[class='fw-semibold']");
if (ratingElement) {
rating = await page.evaluate(element => element.textContent, ratingElement);
}

const descriptionElement = await divCard.$("p");
const description = await page.evaluate(element => element.textContent, descriptionElement)

const businessInfo = {
name: name,
stars: rating,
g2_url: g2Url,
description: description
};

await writeToCsv([businessInfo], `${keyword.replace(" ", "-")}.csv`);
}


success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}

async function startScrape(keyword, pages, location, concurrencyLimit, retries) {
const pageList = range(0, pages);

const browser = await puppeteer.launch()

while (pageList.length > 0) {
const currentBatch = pageList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, location, retries));

try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}

await browser.close();
}

async function main() {
const keywords = ["online bank"];
const concurrencyLimit = 5;
const pages = 1;
const location = "us";
const retries = 3;
const aggregateFiles = [];

for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, pages, location, concurrencyLimit, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
}


main();

Now that we can process pages concurrently, we can get that sweet data much, much faster!


Step 5: Bypassing Anti-Bots

Anti-bots are used all over the web to detect and block malicious traffic. They protect against things like DDoS attacks and lots of other bad behavior. Our crawler isn't malicious, but it looks really weird compared to a normal user: it can make dozens of requests in under a second... not very human at all. To get past anti-bot software, we'll use the ScrapeOps Proxy API.

The function below uses simple string formatting and converts any regular URL into a proxied one using the ScrapeOps Proxy API.

function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}

The ScrapeOps Proxy API rotates our IP addresses and always gives us a server located in the country we choose. Each time we do something, that request comes from a different IP address!

Instead of looking like one bizarre, really fast user, our crawler looks like a bunch of normal users.

Our code barely changes at all here, but it's now at a production-ready level. Take a look at the full code example below.

const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");

const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;

async function writeToCsv(data, outputFile) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);

const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))

const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
} catch (e) {
throw new Error("Failed to write to csv");
}
}


function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}

function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}

async function scrapeSearchResults(browser, keyword, pageNumber, location="us", retries=3) {
let tries = 0;
let success = false;

while (tries <= retries && !success) {

const formattedKeyword = keyword.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.g2.com/search?page=${pageNumber+1}&query=${formattedKeyword}`;

const proxyUrl = getScrapeOpsUrl(url, location);
await page.goto(proxyUrl);

console.log(`Successfully fetched: ${url}`);

const divCards = await page.$$("div[class='product-listing mb-1 border-bottom']");

for (const divCard of divCards) {

const nameElement = await divCard.$("div[class='product-listing__product-name']");
const name = await page.evaluate(element => element.textContent, nameElement);

const g2UrlElement = await nameElement.$("a");
const g2Url = await page.evaluate(element => element.getAttribute("href"), g2UrlElement);

let rating = 0.0;
const ratingElement = await divCard.$("span[class='fw-semibold']");
if (ratingElement) {
rating = await page.evaluate(element => element.textContent, ratingElement);
}

const descriptionElement = await divCard.$("p");
const description = await page.evaluate(element => element.textContent, descriptionElement)

const businessInfo = {
name: name,
stars: rating,
g2_url: g2Url,
description: description
};

await writeToCsv([businessInfo], `${keyword.replace(" ", "-")}.csv`);
}


success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}

async function startScrape(keyword, pages, location, concurrencyLimit, retries) {
const pageList = range(0, pages);

const browser = await puppeteer.launch()

while (pageList.length > 0) {
const currentBatch = pageList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, location, retries));

try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}

await browser.close();
}

async function main() {
const keywords = ["online bank"];
const concurrencyLimit = 5;
const pages = 1;
const location = "us";
const retries = 3;
const aggregateFiles = [];

for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, pages, location, concurrencyLimit, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
}


main();

Step 6: Production Run

Let's run this baby in production! Take a look at the main function below.

async function main() {
const keywords = ["online bank"];
const concurrencyLimit = 5;
const pages = 10;
const location = "us";
const retries = 3;
const aggregateFiles = [];

for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, pages, location, concurrencyLimit, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
}

pages is set to 10, location gets set to "us", and concurrencyLimit is set to 5. Now, we need to process 10 pages of data.

Here are the results:

Crawler Results

This run took 25 seconds to process 10 pages of results...so 2.5 seconds per page.


Build A G2 Scraper

We now have a crawler that builds the reports we want based on different search criteria. Time to utilize these reports. Our scraper will read the search report and run an individual scraping job on every single business from the report.

Our scraper needs to be able to do the following:

  1. Open the report we created
  2. Get the pages from that report
  3. Pull information from these pages
  4. Create an individual report for each of the businesses we've looked up

Our review scraper is going to use some things you're already familiar with by now: parsing, storage, concurrency, and proxy integration.


Step 1: Create Simple Business Data Parser

Time to build a (not so) simple parser that takes in a row from our CSV file and scrapes reviews for the business in that row.

async function processBusiness(browser, row, location, retries = 3) {
const url = row.g2_url;
let tries = 0;
let success = false;


while (tries <= retries && !success) {
const page = await browser.newPage();

try {
await page.goto(url);

const reviewCards = await page.$$("div[class='paper paper--white paper--box mb-2 position-relative border-bottom']");
let anonCount = 0;

for (const reviewCard of reviewCards) {
const reviewDateElement = await reviewCard.$("time");
const reviewTextElement = await reviewCard.$("div[itemprop='reviewBody']");

if (reviewDateElement && reviewTextElement) {
const date = await page.evaluate(element => element.getAttribute("datetime"), reviewDateElement);
const reviewBody = await page.evaluate(element => element.textContent, reviewTextElement);

const nameElement = await reviewCard.$("a[class='link--header-color']");
let name;
if (nameElement) {
name = await page.evaluate(element => element.textContent, nameElement);
} else {
name = `anonymous-${anonCount}`;
anonCount++;
}

const jobTitleElement = await reviewCard.$("div[class='mt-4th']");
let jobTitle;
if (jobTitleElement) {
jobTitle = await page.evaluate(element => element.textContent, jobTitleElement);
} else {
jobTitle = "n/a";
}

const ratingContainer = await reviewCard.$("div[class='f-1 d-f ai-c mb-half-small-only']");
const ratingDiv = await ratingContainer.$("div");
const ratingClass = await page.evaluate(element => element.getAttribute("class"), ratingDiv);
const ratingArray = ratingClass.split("-");
const rating = Number(ratingArray[ratingArray.length-1])/2;

const infoContainer = await reviewCard.$("div[class='tags--teal']");
const incentivesDirty = await infoContainer.$$("div");
const incentivesClean = [];

let source = "";
for (const incentive of incentivesDirty) {
const text = await page.evaluate(element => element.textContent, incentive);
if (!incentivesClean.includes(text)) {
if (text.includes("Review source:")) {
const textArray = text.split(": ");
source = textArray[textArray.length-1];
} else {
incentivesClean.push(text);
}
}
}
const validated = incentivesClean.includes("Validated Reviewer");
const incentivized = incentivesClean.includes("Incentivized Review");

const reviewData = {
name: name,
date: date,
job_title: jobTitle,
rating: rating,
full_review: reviewBody,
review_source: source,
validated: validated,
incentivized: incentivized
}
console.log(reviewData);
}
}

success = true;
} catch (err) {
console.log(`Error: ${err}, tries left: ${retries-tries}`);
tries++;
} finally {
await page.close();
}
}
}
  • Each review has a date. From each review, we pull the date with await page.evaluate(element => element.getAttribute("datetime"), reviewDateElement);
  • Then, we check if the reviewer's name is present. If it's not, we name the reviewer "anonymous" and give them a number. This prevents different anonymous reviews from getting filtered out as duplicates
  • await reviewCard.$("div[class='mt-4th']") checks if the job_title is present. If it is not, we give it a default value of "n/a". Otherwise we pull the user's job_title from the post.
  • await page.evaluate(element => element.getAttribute("class"), ratingDiv) pulls the CSS class from our rating. We then split("-") to separate the number of stars from the CSS class. After splitting the stars, we divide them by 2 to get the actual rating.
  • await page.evaluate(element => element.textContent, reviewTextElement) gives us the actual review
  • We created an incentivesDirty list to hold all of the incentive tags from the review. If "Review source:" is in the text of an incentive item, we split(": ") to separate the source name and pull it. All other non-duplicate items get pushed into the incentivesClean list.
  • If "Validated Reviewer" or "Incentivized Review" is inside the incentivesClean list, we set those variables to true

Our parsing function takes in a row from our CSV file. It then opens the business's g2_url with page.goto(). Now that we can pull the right data from the site, we're ready to read our CSV file and scrape each business it contains.


Step 2: Loading URLs To Scrape

In order to use processBusiness(), we need to read the CSV that the crawler creates. Let's update our code so our new function, processResults(), can handle this.

Take a look at the new function below:

async function processResults(csvFile, location, retries) {
const businesses = await readCsv(csvFile);
const browser = await puppeteer.launch();

for (const business of businesses) {
await processBusiness(browser, business, location, retries);
}
await browser.close();

}

Here is the readCsv() function as well. It takes in a CSV file and spits out an array of JSON objects.

async function readCsv(inputFile) {
const results = [];
const parser = fs.createReadStream(inputFile).pipe(csvParse.parse({
columns: true,
delimiter: ",",
trim: true,
skip_empty_lines: true
}));

for await (const record of parser) {
results.push(record);
}
return results;
}

processResults() reads the CSV file and then converts all the rows into an array. We then iterate through this array and pass each row into processBusiness(). You can view the updated code below.

const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");

const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;

async function writeToCsv(data, outputFile) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);

const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))

const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
} catch (e) {
throw new Error("Failed to write to csv");
}
}

async function readCsv(inputFile) {
const results = [];
const parser = fs.createReadStream(inputFile).pipe(csvParse.parse({
columns: true,
delimiter: ",",
trim: true,
skip_empty_lines: true
}));

for await (const record of parser) {
results.push(record);
}
return results;
}

function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}

function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}

async function scrapeSearchResults(browser, keyword, pageNumber, location="us", retries=3) {
let tries = 0;
let success = false;

while (tries <= retries && !success) {

const formattedKeyword = keyword.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.g2.com/search?page=${pageNumber+1}&query=${formattedKeyword}`;

const proxyUrl = getScrapeOpsUrl(url, location);
await page.goto(proxyUrl);

console.log(`Successfully fetched: ${url}`);

const divCards = await page.$$("div[class='product-listing mb-1 border-bottom']");

for (const divCard of divCards) {

const nameElement = await divCard.$("div[class='product-listing__product-name']");
const name = await page.evaluate(element => element.textContent, nameElement);

const g2UrlElement = await nameElement.$("a");
const g2Url = await page.evaluate(element => element.getAttribute("href"), g2UrlElement);

let rating = 0.0;
const ratingElement = await divCard.$("span[class='fw-semibold']");
if (ratingElement) {
rating = await page.evaluate(element => element.textContent, ratingElement);
}

const descriptionElement = await divCard.$("p");
const description = await page.evaluate(element => element.textContent, descriptionElement)

const businessInfo = {
name: name,
stars: rating,
g2_url: g2Url,
description: description
};

await writeToCsv([businessInfo], `${keyword.replace(" ", "-")}.csv`);
}


success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}

async function startScrape(keyword, pages, location, concurrencyLimit, retries) {
const pageList = range(0, pages);

const browser = await puppeteer.launch()

while (pageList.length > 0) {
const currentBatch = pageList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, location, retries));

try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}

await browser.close();
}

async function processBusiness(browser, row, location, retries = 3) {
const url = row.g2_url;
let tries = 0;
let success = false;


while (tries <= retries && !success) {
const page = await browser.newPage();

try {
await page.goto(url);

const reviewCards = await page.$$("div[class='paper paper--white paper--box mb-2 position-relative border-bottom']");
let anonCount = 0;

for (const reviewCard of reviewCards) {
const reviewDateElement = await reviewCard.$("time");
const reviewTextElement = await reviewCard.$("div[itemprop='reviewBody']");

if (reviewDateElement && reviewTextElement) {
const date = await page.evaluate(element => element.getAttribute("datetime"), reviewDateElement);
const reviewBody = await page.evaluate(element => element.textContent, reviewTextElement);

const nameElement = await reviewCard.$("a[class='link--header-color']");
let name;
if (nameElement) {
name = await page.evaluate(element => element.textContent, nameElement);
} else {
name = `anonymous-${anonCount}`;
anonCount++;
}

const jobTitleElement = await reviewCard.$("div[class='mt-4th']");
let jobTitle;
if (jobTitleElement) {
jobTitle = await page.evaluate(element => element.textContent, jobTitleElement);
} else {
jobTitle = "n/a";
}

const ratingContainer = await reviewCard.$("div[class='f-1 d-f ai-c mb-half-small-only']");
const ratingDiv = await ratingContainer.$("div");
const ratingClass = await page.evaluate(element => element.getAttribute("class"), ratingDiv);
const ratingArray = ratingClass.split("-");
const rating = Number(ratingArray[ratingArray.length-1])/2;

const infoContainer = await reviewCard.$("div[class='tags--teal']");
const incentivesDirty = await infoContainer.$$("div");
const incentivesClean = [];

let source = "";
for (const incentive of incentivesDirty) {
const text = await page.evaluate(element => element.textContent, incentive);
if (!incentivesClean.includes(text)) {
if (text.includes("Review source:")) {
const textArray = text.split(": ");
source = textArray[textArray.length-1];
} else {
incentivesClean.push(text);
}
}
}
const validated = incentivesClean.includes("Validated Reviewer");
const incentivized = incentivesClean.includes("Incentivized Review");

const reviewData = {
name: name,
date: date,
job_title: jobTitle,
rating: rating,
full_review: reviewBody,
review_source: source,
validated: validated,
incentivized: incentivized
}
console.log(reviewData);
}
}

success = true;
} catch (err) {
console.log(`Error: ${err}, tries left: ${retries-tries}`);
tries++;
} finally {
await page.close();
}
}
}

async function processResults(csvFile, location, retries) {
const businesses = await readCsv(csvFile);
const browser = await puppeteer.launch();

for (const business of businesses) {
await processBusiness(browser, business, location, retries);
}
await browser.close();

}

async function main() {
const keywords = ["online bank"];
const concurrencyLimit = 5;
const pages = 1;
const location = "us";
const retries = 3;
const aggregateFiles = [];

for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, pages, location, concurrencyLimit, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}

for (const file of aggregateFiles) {
await processResults(file, location, retries);
}
}


main();

In our updated code, processResults() reads our CSV file. Then, it passes each row into processBusiness(). processBusiness() extracts our data and then prints it to the console.


Step 3: Storing the Scraped Data

Once again, now that we've scraped our data we need to store it. Because of our existing code, we can do this by removing console.log() and replacing it with the following line:

await writeToCsv([reviewData], `${row.name.replace(" ", "-")}.csv`);

Our reviewData object uses the following fields to represent reviews from the webpage:

  • name
  • date
  • job_title
  • rating
  • full_review
  • review_source
  • validated
  • incentivized

In the updated code below, we pass our reviewData into writeToCsv(). Just like before, we pass each object in as soon as it's been processed so we can save as much data as possible in the event of a crash.

const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");

const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;

async function writeToCsv(data, outputFile) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);

const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))

const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
} catch (e) {
throw new Error("Failed to write to csv");
}
}

async function readCsv(inputFile) {
const results = [];
const parser = fs.createReadStream(inputFile).pipe(csvParse.parse({
columns: true,
delimiter: ",",
trim: true,
skip_empty_lines: true
}));

for await (const record of parser) {
results.push(record);
}
return results;
}

function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}

function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}

async function scrapeSearchResults(browser, keyword, pageNumber, location="us", retries=3) {
let tries = 0;
let success = false;

while (tries <= retries && !success) {

const formattedKeyword = keyword.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.g2.com/search?page=${pageNumber+1}&query=${formattedKeyword}`;

const proxyUrl = getScrapeOpsUrl(url, location);
await page.goto(proxyUrl);

console.log(`Successfully fetched: ${url}`);

const divCards = await page.$$("div[class='product-listing mb-1 border-bottom']");

for (const divCard of divCards) {

const nameElement = await divCard.$("div[class='product-listing__product-name']");
const name = await page.evaluate(element => element.textContent, nameElement);

const g2UrlElement = await nameElement.$("a");
const g2Url = await page.evaluate(element => element.getAttribute("href"), g2UrlElement);

let rating = 0.0;
const ratingElement = await divCard.$("span[class='fw-semibold']");
if (ratingElement) {
rating = await page.evaluate(element => element.textContent, ratingElement);
}

const descriptionElement = await divCard.$("p");
const description = await page.evaluate(element => element.textContent, descriptionElement)

const businessInfo = {
name: name,
stars: rating,
g2_url: g2Url,
description: description
};

await writeToCsv([businessInfo], `${keyword.replace(" ", "-")}.csv`);
}


success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}

async function startScrape(keyword, pages, location, concurrencyLimit, retries) {
const pageList = range(0, pages);

const browser = await puppeteer.launch()

while (pageList.length > 0) {
const currentBatch = pageList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, location, retries));

try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}

await browser.close();
}

async function processBusiness(browser, row, location, retries = 3) {
const url = row.g2_url;
let tries = 0;
let success = false;


while (tries <= retries && !success) {
const page = await browser.newPage();

try {
await page.goto(url);

const reviewCards = await page.$$("div[class='paper paper--white paper--box mb-2 position-relative border-bottom']");
let anonCount = 0;

for (const reviewCard of reviewCards) {
const reviewDateElement = await reviewCard.$("time");
const reviewTextElement = await reviewCard.$("div[itemprop='reviewBody']");

if (reviewDateElement && reviewTextElement) {
const date = await page.evaluate(element => element.getAttribute("datetime"), reviewDateElement);
const reviewBody = await page.evaluate(element => element.textContent, reviewTextElement);

const nameElement = await reviewCard.$("a[class='link--header-color']");
let name;
if (nameElement) {
name = await page.evaluate(element => element.textContent, nameElement);
} else {
name = `anonymous-${anonCount}`;
anonCount++;
}

const jobTitleElement = await reviewCard.$("div[class='mt-4th']");
let jobTitle;
if (jobTitleElement) {
jobTitle = await page.evaluate(element => element.textContent, jobTitleElement);
} else {
jobTitle = "n/a";
}

const ratingContainer = await reviewCard.$("div[class='f-1 d-f ai-c mb-half-small-only']");
const ratingDiv = await ratingContainer.$("div");
const ratingClass = await page.evaluate(element => element.getAttribute("class"), ratingDiv);
const ratingArray = ratingClass.split("-");
const rating = Number(ratingArray[ratingArray.length-1])/2;

const infoContainer = await reviewCard.$("div[class='tags--teal']");
const incentivesDirty = await infoContainer.$$("div");
const incentivesClean = [];

let source = "";
for (const incentive of incentivesDirty) {
const text = await page.evaluate(element => element.textContent, incentive);
if (!incentivesClean.includes(text)) {
if (text.includes("Review source:")) {
const textArray = text.split(": ");
source = textArray[textArray.length-1];
} else {
incentivesClean.push(text);
}
}
}
const validated = incentivesClean.includes("Validated Reviewer");
const incentivized = incentivesClean.includes("Incentivized Review");

const reviewData = {
name: name,
date: date,
job_title: jobTitle,
rating: rating,
full_review: reviewBody,
review_source: source,
validated: validated,
incentivized: incentivized
}
await writeToCsv([reviewData], `${row.name.replace(" ", "-")}.csv`);
}
}

success = true;
} catch (err) {
console.log(`Error: ${err}, tries left: ${retries-tries}`);
tries++;
} finally {
await page.close();
}
}
}

async function processResults(csvFile, location, retries) {
const businesses = await readCsv(csvFile);
const browser = await puppeteer.launch();

for (const business of businesses) {
await processBusiness(browser, business, location, retries);
}
await browser.close();

}

async function main() {
const keywords = ["online bank"];
const concurrencyLimit = 5;
const pages = 1;
const location = "us";
const retries = 3;
const aggregateFiles = [];

for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, pages, location, concurrencyLimit, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}

for (const file of aggregateFiles) {
await processResults(file, location, retries);
}
}


main();

Step 4: Adding Concurrency

Just like before, we use a concurrencyLimit to open many pages simultaneously. The largest difference here is our array. Instead of an array of page numbers, we have a much larger array of CSV rows.

async function processResults(csvFile, location, concurrencyLimit, retries) {
const businesses = await readCsv(csvFile);
const browser = await puppeteer.launch();

while (businesses.length > 0) {
const currentBatch = businesses.splice(0, concurrencyLimit);
const tasks = currentBatch.map(business => processBusiness(browser, business, location, retries));

try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();

}

We removed the for loop and replaced it with concurrent batches that use async and await. Aside from the changes here, the rest of our code remains basically the same!


Step 5: Bypassing Anti-Bots

Now, we need to add proxy support again. We've already got our getScrapeOpsUrl() function. We just need to change one line.

await page.goto(getScrapeOpsUrl(url, location));

Here is the fully updated code:

const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");

const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;

console.log("api key:", API_KEY);

async function writeToCsv(data, outputFile) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);

const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))

const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
} catch (e) {
throw new Error("Failed to write to csv");
}
}

async function readCsv(inputFile) {
const results = [];
const parser = fs.createReadStream(inputFile).pipe(csvParse.parse({
columns: true,
delimiter: ",",
trim: true,
skip_empty_lines: true
}));

for await (const record of parser) {
results.push(record);
}
return results;
}

function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}

function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}

async function scrapeSearchResults(browser, keyword, pageNumber, location="us", retries=3) {
let tries = 0;
let success = false;

while (tries <= retries && !success) {

const formattedKeyword = keyword.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.g2.com/search?page=${pageNumber+1}&query=${formattedKeyword}`;

const proxyUrl = getScrapeOpsUrl(url, location);

console.log(proxyUrl)
await page.goto(proxyUrl);

console.log(`Successfully fetched: ${url}`);

const divCards = await page.$$("div[class='product-listing mb-1 border-bottom']");

for (const divCard of divCards) {

const nameElement = await divCard.$("div[class='product-listing__product-name']");
const name = await page.evaluate(element => element.textContent, nameElement);

const g2UrlElement = await nameElement.$("a");
const g2Url = await page.evaluate(element => element.getAttribute("href"), g2UrlElement);

let rating = 0.0;
const ratingElement = await divCard.$("span[class='fw-semibold']");
if (ratingElement) {
rating = await page.evaluate(element => element.textContent, ratingElement);
}

const descriptionElement = await divCard.$("p");
const description = await page.evaluate(element => element.textContent, descriptionElement)

const businessInfo = {
name: name,
stars: rating,
g2_url: g2Url,
description: description
};

await writeToCsv([businessInfo], `${keyword.replace(" ", "-")}.csv`);
}


success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}

async function startScrape(keyword, pages, location, concurrencyLimit, retries) {
const pageList = range(0, pages);

const browser = await puppeteer.launch()

while (pageList.length > 0) {
const currentBatch = pageList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, location, retries));

try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}

await browser.close();
}

async function processBusiness(browser, row, location, retries = 3) {
const url = row.g2_url;
let tries = 0;
let success = false;


while (tries <= retries && !success) {
const page = await browser.newPage();

try {
await page.goto(getScrapeOpsUrl(url, location), { timeout: 60000 });

const reviewCards = await page.$$("div[class='paper paper--white paper--box mb-2 position-relative border-bottom']");
let anonCount = 0;

for (const reviewCard of reviewCards) {
const reviewDateElement = await reviewCard.$("time");
const reviewTextElement = await reviewCard.$("div[itemprop='reviewBody']");

if (reviewDateElement && reviewTextElement) {
const date = await page.evaluate(element => element.getAttribute("datetime"), reviewDateElement);
const reviewBody = await page.evaluate(element => element.textContent, reviewTextElement);

const nameElement = await reviewCard.$("a[class='link--header-color']");
let name;
if (nameElement) {
name = await page.evaluate(element => element.textContent, nameElement);
} else {
name = `anonymous-${anonCount}`;
anonCount++;
}

const jobTitleElement = await reviewCard.$("div[class='mt-4th']");
let jobTitle;
if (jobTitleElement) {
jobTitle = await page.evaluate(element => element.textContent, jobTitleElement);
} else {
jobTitle = "n/a";
}

const ratingContainer = await reviewCard.$("div[class='f-1 d-f ai-c mb-half-small-only']");
const ratingDiv = await ratingContainer.$("div");
const ratingClass = await page.evaluate(element => element.getAttribute("class"), ratingDiv);
const ratingArray = ratingClass.split("-");
const rating = Number(ratingArray[ratingArray.length-1])/2;

const infoContainer = await reviewCard.$("div[class='tags--teal']");
const incentivesDirty = await infoContainer.$$("div");
const incentivesClean = [];

let source = "";
for (const incentive of incentivesDirty) {
const text = await page.evaluate(element => element.textContent, incentive);
if (!incentivesClean.includes(text)) {
if (text.includes("Review source:")) {
const textArray = text.split(": ");
source = textArray[textArray.length-1];
} else {
incentivesClean.push(text);
}
}
}
const validated = incentivesClean.includes("Validated Reviewer");
const incentivized = incentivesClean.includes("Incentivized Review");

const reviewData = {
name: name,
date: date,
job_title: jobTitle,
rating: rating,
full_review: reviewBody,
review_source: source,
validated: validated,
incentivized: incentivized
}
await writeToCsv([reviewData], `${row.name.replace(" ", "-")}.csv`);
}
}

success = true;
} catch (err) {
console.log(`Error: ${err}, tries left: ${retries-tries}, url: ${getScrapeOpsUrl(url)}`);
tries++;
} finally {
await page.close();
}
}
}

async function processResults(csvFile, location, concurrencyLimit, retries) {
const businesses = await readCsv(csvFile);
const browser = await puppeteer.launch();

while (businesses.length > 0) {
const currentBatch = businesses.splice(0, concurrencyLimit);
const tasks = currentBatch.map(business => processBusiness(browser, business, location, retries));

try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();

}

async function main() {
const keywords = ["online bank"];
const concurrencyLimit = 5;
const pages = 1;
const location = "us";
const retries = 3;
const aggregateFiles = [];

for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, pages, location, concurrencyLimit, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}


console.log("Starting scrape");
for (const file of aggregateFiles) {
await processResults(file, location, concurrencyLimit, retries);
}
console.log("Scrape complete");
}


main();

Step 6: Production Run

Now, it's time to run both our scraper and crawler in production together! Just like before, we'll run a scrape job on 10 pages of search results.

async function main() {
const keywords = ["online bank"];
const concurrencyLimit = 5;
const pages = 10;
const location = "us";
const retries = 3;
const aggregateFiles = [];

for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, pages, location, concurrencyLimit, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}

for (const file of aggregateFiles) {
await processResults(file, location, concurrencyLimit, retries);
}
}

Just like before, we set pages to 10, location to "us", and concurrencyLimit to 5. Here are the results.

Scraper Results

In total, it took just over 690 seconds (including the time it took to create our initial report) to generate a full report and process all the results (197 rows). The average speed per page is about 3.5 seconds.


Legal and Ethical Considerations

Whenever you're doing a scrape job, there will always be legal and ethical questions to consider. You should always comply with a site's Terms of Use and robots.txt.

You can view G2's terms here and their robots.txt is available here.

Always be careful about the information you extract and don't scrape private or confidential data. If a website is hidden behind a login, that is generally considered private data.

If your data does not require a login, it is generally considered to be public data. If you have questions about the legality of your scraping job, it is best to consult an attorney familiar with the laws and localities you're dealing with.


Conclusion

You now know how to build G2 scrapers and parsers with Puppeteer. You should have a decent understanding of the following concepts: parsing, pagination, data storage, concurrency, and proxy integration.

You should also have a decent understanding of the page.$(), page.$$() and page.evaluate() methods from Puppeteer, and you should understand some pretty useful string operations for extracting data, such as split(), replace() and includes().

To learn more about the tools we used in this article, check out the links below:


More Web Scraping Guides

Build something and practice your new skills. Whatever you do... go build something! Here at ScrapeOps, we've got loads of resources for you to learn from.

If you're in the mood to learn more, check out our Puppeteer Web Scraping Playbook or take a look at the articles below: