How to Scrape G2 with Puppeteer
G2 is one of the leading websites for getting detailed reviews about different businesses. If you want to get a real feel for a company, G2 is definitely the place to do it.
Once you finish this tutorial, you'll be able to retrieve all sorts of data from G2 and you'll learn how to do the following when building scrapers in the future.
- [TLDR - How to Scrape G2](#tldr---how-to-scrape-g2)
- How To Architect Our Scraper
- Understanding How To Scrape G2
- Setting Up Our G2 Scraper
- Build a G2 Search Crawler
- Build a G2 Scraper
- Legal and Ethical Considerations
- Conclusion
- More Web Scraping Guides
TLDR - How to Scrape G2
The biggest pain when scraping G2 is pulling the data from the page. G2 data is nested extremely deeply inside the HTML elements and CSS classes. Lucky for you, we've got a production-ready G2 scraper right here in the TLDR.
To run this scraper, simply create a `config.json` file with your ScrapeOps API key and place it in the same folder as this script.
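The script only reads a single `api_key` field from that file, so a minimal `config.json` looks something like this (swap in your own key):

{
    "api_key": "YOUR-SCRAPEOPS-API-KEY"
}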
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;
console.log("api key:", API_KEY);
async function writeToCsv(data, outputFile) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);
const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))
const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
} catch (e) {
throw new Error("Failed to write to csv");
}
}
async function readCsv(inputFile) {
const results = [];
const parser = fs.createReadStream(inputFile).pipe(csvParse.parse({
columns: true,
delimiter: ",",
trim: true,
skip_empty_lines: true
}));
for await (const record of parser) {
results.push(record);
}
return results;
}
function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}
function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}
async function scrapeSearchResults(browser, keyword, pageNumber, location="us", retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const formattedKeyword = keyword.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.g2.com/search?page=${pageNumber+1}&query=${formattedKeyword}`;
const proxyUrl = getScrapeOpsUrl(url, location);
console.log(proxyUrl)
await page.goto(proxyUrl);
console.log(`Successfully fetched: ${url}`);
const divCards = await page.$$("div[class='product-listing mb-1 border-bottom']");
for (const divCard of divCards) {
const nameElement = await divCard.$("div[class='product-listing__product-name']");
const name = await page.evaluate(element => element.textContent, nameElement);
const g2UrlElement = await nameElement.$("a");
const g2Url = await page.evaluate(element => element.getAttribute("href"), g2UrlElement);
let rating = 0.0;
const ratingElement = await divCard.$("span[class='fw-semibold']");
if (ratingElement) {
rating = await page.evaluate(element => element.textContent, ratingElement);
}
const descriptionElement = await divCard.$("p");
const description = await page.evaluate(element => element.textContent, descriptionElement)
const businessInfo = {
name: name,
stars: rating,
g2_url: g2Url,
description: description
};
await writeToCsv([businessInfo], `${keyword.replace(" ", "-")}.csv`);
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}
async function startScrape(keyword, pages, location, concurrencyLimit, retries) {
const pageList = range(0, pages);
const browser = await puppeteer.launch()
while (pageList.length > 0) {
const currentBatch = pageList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, location, retries));
try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();
}
async function processBusiness(browser, row, location, retries = 3) {
const url = row.g2_url;
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const page = await browser.newPage();
try {
await page.goto(getScrapeOpsUrl(url, location), { timeout: 60000 });
const reviewCards = await page.$$("div[class='paper paper--white paper--box mb-2 position-relative border-bottom']");
let anonCount = 0;
for (const reviewCard of reviewCards) {
const reviewDateElement = await reviewCard.$("time");
const reviewTextElement = await reviewCard.$("div[itemprop='reviewBody']");
if (reviewDateElement && reviewTextElement) {
const date = await page.evaluate(element => element.getAttribute("datetime"), reviewDateElement);
const reviewBody = await page.evaluate(element => element.textContent, reviewTextElement);
const nameElement = await reviewCard.$("a[class='link--header-color']");
let name;
if (nameElement) {
name = await page.evaluate(element => element.textContent, nameElement);
} else {
name = `anonymous-${anonCount}`;
anonCount++;
}
const jobTitleElement = await reviewCard.$("div[class='mt-4th']");
let jobTitle;
if (jobTitleElement) {
jobTitle = await page.evaluate(element => element.textContent, jobTitleElement);
} else {
jobTitle = "n/a";
}
const ratingContainer = await reviewCard.$("div[class='f-1 d-f ai-c mb-half-small-only']");
const ratingDiv = await ratingContainer.$("div");
const ratingClass = await page.evaluate(element => element.getAttribute("class"), ratingDiv);
const ratingArray = ratingClass.split("-");
const rating = Number(ratingArray[ratingArray.length-1])/2;
const infoContainer = await reviewCard.$("div[class='tags--teal']");
const incentivesDirty = await infoContainer.$$("div");
const incentivesClean = [];
let source = "";
for (const incentive of incentivesDirty) {
const text = await page.evaluate(element => element.textContent, incentive);
if (!incentivesClean.includes(text)) {
if (text.includes("Review source:")) {
const textArray = text.split(": ");
source = textArray[textArray.length-1];
} else {
incentivesClean.push(text);
}
}
}
const validated = incentivesClean.includes("Validated Reviewer");
const incentivized = incentivesClean.includes("Incentivized Review");
const reviewData = {
name: name,
date: date,
job_title: jobTitle,
rating: rating,
full_review: reviewBody,
review_source: source,
validated: validated,
incentivized: incentivized
}
await writeToCsv([reviewData], `${row.name.replace(" ", "-")}.csv`);
}
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left: ${retries-tries}, url: ${getScrapeOpsUrl(url)}`);
tries++;
} finally {
await page.close();
}
}
}
async function processResults(csvFile, location, concurrencyLimit, retries) {
const businesses = await readCsv(csvFile);
const browser = await puppeteer.launch();
while (businesses.length > 0) {
const currentBatch = businesses.splice(0, concurrencyLimit);
const tasks = currentBatch.map(business => processBusiness(browser, business, location, retries));
try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();
}
async function main() {
const keywords = ["online bank"];
const concurrencyLimit = 5;
const pages = 1;
const location = "us";
const retries = 3;
const aggregateFiles = [];
for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, pages, location, concurrencyLimit, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
console.log("Starting scrape");
for (const file of aggregateFiles) {
await processResults(file, location, concurrencyLimit, retries);
}
console.log("Scrape complete");
}
main();
If you'd like to tweak this scraper, feel free to change any of the following in the `main()` function (an example follows the list):
- `keywords`: Contains a list of keywords to be searched and scraped.
- `retries`: Specifies the number of times the scraper will retry fetching a page if it encounters an error.
- `concurrencyLimit`: Defines the maximum number of pages to be scraped concurrently.
- `pages`: Specifies the number of pages to scrape for each keyword.
- `location`: Defines the geographic location from which the scraping requests appear to originate.
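For example, a hypothetical run that crawls three pages of results for two keywords could configure `main()` like this (the second keyword is purely illustrative):

async function main() {
    // Illustrative settings: two keywords, three search pages each, routed through the US
    const keywords = ["online bank", "crm software"];
    const concurrencyLimit = 5;
    const pages = 3;
    const location = "us";
    const retries = 3;
    const aggregateFiles = [];
    for (const keyword of keywords) {
        console.log("Crawl starting");
        await startScrape(keyword, pages, location, concurrencyLimit, retries);
        console.log("Crawl complete");
        aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
    }
    console.log("Starting scrape");
    for (const file of aggregateFiles) {
        await processResults(file, location, concurrencyLimit, retries);
    }
    console.log("Scrape complete");
}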
How To Architect Our G2 Scraper
When we scrape G2, we need to build two different scrapers.
- The first one, our crawler, is designed to perform a search using keywords. The crawler then takes the results from the search and creates a CSV report from the results.
- After the crawler, we build our scraper. Once we've generated search results, our scraper reads the CSV file we just wrote. For each business in the CSV file, the scraper then pulls up their individual G2 page and extracts all the review information.
The crawler generates a detailed list of businesses. The scraper then gets detailed reviews for each business.
To ensure these scrapers are both performant and stable, each of our scrapers will use the following:
- Parsing: so we can pull proper information from a page.
- Pagination: so we can pull up different pages and be more selective about our data.
- Data Storage: to store our data in a safe, efficient and readable way.
- Concurrency: to scrape multiple pages at once.
- Proxy Integration: when scraping anything at scale, we often face the issue of getting blocked. Proxies give us redundant connections and reduce our likelihood of getting blocked by different websites.
Understanding How To Scrape G2
Step 1: How To Request G2 Pages
A typical URL from G2 looks like this:
https://www.g2.com/search?query=online+bank
`https://www.g2.com/search` holds the actual domain and endpoint of our URL. The query is on the end: `query=online+bank`. We can also add more parameters with `&`.
Take a look at the search below for online bank.
Along with a report of search results, we need to create a report on each individual business as well. The URL for each business looks like this
https://www.g2.com/products/name-of-business/reviews
Below is a screenshot of one of G2's individual business pages.
Step 2: How To Extract Data From G2 Results and Pages
G2 data gets very deeply nested within the page. The screenshot below shows the `name` of a business nested within the page.
The results page isn't too difficult to parse, and we're only going to be taking 4 pieces of data from each result.
However, extracting data from the individual pages is much tougher. Take a look at the screenshot below:
Look at `stars-8` at the end of the class name.
- Our rating number is actually masked and held inside of this CSS class.
- The stars number is actually double the rating: `stars-10` would be a 5-star review, `stars-9` would be 4.5, and `stars-8` would be 4. You get the idea.
- Divide the stars number by two and you get your rating (see the snippet below).
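Here's a minimal sketch of that conversion in plain JavaScript. The class string below is a made-up example with the same `stars-N` shape G2 uses; the real scraper pulls the class via `page.evaluate()`:

// Hypothetical class string in the same "stars-N" shape G2 uses
const ratingClass = "stars stars__wrapper stars-9";
const ratingArray = ratingClass.split("-");
const rating = Number(ratingArray[ratingArray.length - 1]) / 2;
console.log(rating); // 4.5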
Step 3: How To Control Pagination
A paginated URL looks like this:
https://www.g2.com/search?page=${pageNumber+1}&query=${formattedKeyword}
Once our results are paginated, we can get batches of results. This allows us to get finer control over the data we're receiving. Without pagination, we'd be stuck on page 1!
The individual URLs look like this:
https://www.g2.com/products/name-of-business/reviews
Now that we know how to get our data, it's almost time to fetch it and pull it from the webpages.
Step 4: Geolocated Data
To handle geolocated data, we'll be using the ScrapeOps Proxy API. If we want to appear in Great Britain, we simply set our `country` parameter to `"uk"`; if we want to appear in the US, we set this param to `"us"`.
When we pass our `country` into the ScrapeOps API, ScrapeOps will actually route our requests through a server in that country, so even if the site checks our geolocation, our geolocation will show up correctly!
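Here's a small sketch of how that request gets built. It mirrors the `getScrapeOpsUrl()` helper we'll add later in this guide, and the only thing that changes between locations is the `country` value:

// Build a proxied URL that appears to come from Great Britain
// (API_KEY is read from config.json at the top of our script)
const params = new URLSearchParams({
    api_key: API_KEY,
    url: "https://www.g2.com/search?query=online+bank",
    country: "uk"
});
const proxyUrl = `https://proxy.scrapeops.io/v1/?${params.toString()}`;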
Setting Up Our G2 Scraper Project
Let's get started. You can run the following commands to get set up.
Create a New Project Folder
mkdir g2-scraper
cd g2-scraper
Create A New JavaScript Project
npm init -y
Install Our Dependencies
npm install puppeteer
npm install csv-writer
npm install csv-parse
We don't need to install `fs`; it comes built into Node.js.
Build A G2 Search Crawler
Step 1: Create Simple Search Data Parser
When we're extracting our data from a search page, we're parsing that page. Let's get set up with a basic script that parses our page for us.
In the code below, we do the following:
- `while` we still have `retries` left and the operation hasn't succeeded:
- `await page.goto(url)` fetches the site
- We then pull the `name` with `await page.evaluate(element => element.textContent, nameElement)`
- `await page.evaluate(element => element.getAttribute("href"), g2UrlElement)` gets the link to the business, `g2_url`
- If there is a `rating` present on the page, we pull it from the page. If there is no rating present, we set a default of 0.0
- `await page.evaluate(element => element.textContent, descriptionElement)` gives us the description of the business
- Afterward, we log our `businessInfo` to the console
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;
function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}
async function scrapeSearchResults(browser, keyword, location="us", retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const formattedKeyword = keyword.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.g2.com/search?query=${formattedKeyword}`;
await page.goto(url);
console.log(`Successfully fetched: ${url}`);
const divCards = await page.$$("div[class='product-listing mb-1 border-bottom']");
for (const divCard of divCards) {
const nameElement = await divCard.$("div[class='product-listing__product-name']");
const name = await page.evaluate(element => element.textContent, nameElement);
const g2UrlElement = await nameElement.$("a");
const g2Url = await page.evaluate(element => element.getAttribute("href"), g2UrlElement);
let rating = 0.0;
const ratingElement = await divCard.$("span[class='fw-semibold']");
if (ratingElement) {
rating = await page.evaluate(element => element.textContent, ratingElement);
}
const descriptionElement = await divCard.$("p");
const description = await page.evaluate(element => element.textContent, descriptionElement);
const businessInfo = {
name: name,
stars: rating,
g2_url: g2Url,
description: description
};
console.log(businessInfo);
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}
async function startScrape(keyword, location, retries) {
const browser = await puppeteer.launch()
await scrapeSearchResults(browser, keyword, location, retries)
await browser.close();
}
async function main() {
const keywords = ["online bank"];
const concurrencyLimit = 5;
const pages = 1;
const location = "us";
const retries = 3;
const aggregateFiles = [];
for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, location, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
}
main();
For each business in the results, we find the following: `name`, `stars`, `g2_url`, and `description`. This data allows us to create objects that represent each business.
Everything we do from here depends on the data we pull with this parsing function.
Step 2: Add Pagination
Before we start storing our data, we need to start fetching it in batches. To get our batches, we need to add pagination to our crawler. Here is a paginated URL:
https://www.g2.com/search?page=${pageNumber+1}&query=${formattedKeyword}
We use `pageNumber+1` because `startScrape()` begins counting pages at zero.
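For example, if `pages` were set to 3, the page indexes `[0, 1, 2]` from `range(0, pages)` would map to the following search URLs:

// Page indexes from range() start at 0, but G2's URLs start at page=1
for (const pageNumber of [0, 1, 2]) {
    console.log(`https://www.g2.com/search?page=${pageNumber + 1}&query=online+bank`);
}
// https://www.g2.com/search?page=1&query=online+bank
// https://www.g2.com/search?page=2&query=online+bank
// https://www.g2.com/search?page=3&query=online+bank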
Take a look at the updated code below:
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;
function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}
async function scrapeSearchResults(browser, keyword, pageNumber, location="us", retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const formattedKeyword = keyword.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.g2.com/search?page=${pageNumber+1}&query=${formattedKeyword}`;
await page.goto(url);
console.log(`Successfully fetched: ${url}`);
const divCards = await page.$$("div[class='product-listing mb-1 border-bottom']");
for (const divCard of divCards) {
const nameElement = await divCard.$("div[class='product-listing__product-name']");
const name = await page.evaluate(element => element.textContent, nameElement);
const g2UrlElement = await nameElement.$("a");
const g2Url = await page.evaluate(element => element.getAttribute("href"), g2UrlElement);
let rating = 0.0;
const ratingElement = await divCard.$("span[class='fw-semibold']");
if (ratingElement) {
rating = await page.evaluate(element => element.textContent, ratingElement);
}
const descriptionElement = await divCard.$("p");
const description = await page.evaluate(element => element.textContent, descriptionElement)
const businessInfo = {
name: name,
stars: rating,
g2_url: g2Url,
description: description
};
console.log(businessInfo);
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}
async function startScrape(keyword, pages, location, retries) {
const pageList = range(0, pages);
const browser = await puppeteer.launch()
for (const page of pageList) {
await scrapeSearchResults(browser, keyword, page, location, retries);
}
await browser.close();
}
async function main() {
const keywords = ["online bank"];
const concurrencyLimit = 5;
const pages = 1;
const location = "us";
const retries = 3;
const aggregateFiles = [];
for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, pages, location, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
}
main();
We've added `pageNumber` to `scrapeSearchResults()`. We also added some functionality to `startScrape()`: it now creates a list of pages to scrape, then iterates through the list and runs `scrapeSearchResults()` on each of the pages from the list.
Step 3: Storing the Scraped Data
To store our data, we're going to use the `writeToCsv()` function. You can look at it in the snippet below:
async function writeToCsv(data, outputFile) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);
const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))
const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
} catch (e) {
throw new Error("Failed to write to csv");
}
}
This function takes an array of JSON objects and an `outputFile`. If `outputFile` already exists, we open it in append mode so we don't overwrite any important data. If the file doesn't exist yet, this function will create it.
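As a quick illustration, here's roughly how the crawler calls it, one business object at a time. The values below are placeholders, and the call sits inside an async function:

// Appends one row to online-bank.csv, creating the file on the first call
await writeToCsv([{
    name: "Example Bank",
    stars: "4.5",
    g2_url: "https://www.g2.com/products/example-bank/reviews",
    description: "A placeholder description."
}], "online-bank.csv");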
Here is the full code after it's been updated to write our information to a CSV.
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;
async function writeToCsv(data, outputFile) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);
const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))
const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
} catch (e) {
throw new Error("Failed to write to csv");
}
}
function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}
async function scrapeSearchResults(browser, keyword, pageNumber, location="us", retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const formattedKeyword = keyword.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.g2.com/search?page=${pageNumber+1}&query=${formattedKeyword}`;
await page.goto(url);
console.log(`Successfully fetched: ${url}`);
const divCards = await page.$$("div[class='product-listing mb-1 border-bottom']");
for (const divCard of divCards) {
const nameElement = await divCard.$("div[class='product-listing__product-name']");
const name = await page.evaluate(element => element.textContent, nameElement);
const g2UrlElement = await nameElement.$("a");
const g2Url = await page.evaluate(element => element.getAttribute("href"), g2UrlElement);
let rating = 0.0;
const ratingElement = await divCard.$("span[class='fw-semibold']");
if (ratingElement) {
rating = await page.evaluate(element => element.textContent, ratingElement);
}
const descriptionElement = await divCard.$("p");
const description = await page.evaluate(element => element.textContent, descriptionElement)
const businessInfo = {
name: name,
stars: rating,
g2_url: g2Url,
description: description
};
await writeToCsv([businessInfo], `${keyword.replace(" ", "-")}.csv`);
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}
async function startScrape(keyword, pages, location, retries) {
const pageList = range(0, pages);
const browser = await puppeteer.launch()
for (const page of pageList) {
await scrapeSearchResults(browser, keyword, page, location, retries);
}
await browser.close();
}
async function main() {
const keywords = ["online bank"];
const concurrencyLimit = 5;
const pages = 1;
const location = "us";
const retries = 3;
const aggregateFiles = [];
for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, pages, location, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
}
main();
- `writeToCsv()` takes an array of objects and writes them to a CSV file
- As we parse the page, we pass each object in as soon as we've parsed it. In the event of a crash, this allows us to save as much data as possible.
Step 4: Adding Concurrency
A `for` loop isn't good enough for a production-level crawler. We refactored `startScrape()` to use a `concurrencyLimit` and take advantage of multiple pages inside the browser.
async function startScrape(keyword, pages, location, concurrencyLimit, retries) {
const pageList = range(0, pages);
const browser = await puppeteer.launch()
while (pageList.length > 0) {
const currentBatch = pageList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, location, retries));
try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();
}
Let's break this down:
- As before, we create a list of pages to scrape
- `while` `pageList.length` is greater than zero, we `splice()` a batch from index zero up to our `concurrencyLimit` (see the small example after this list)
- We run `scrapeSearchResults()` on each page in the batch simultaneously and then `await` the results of the batch
- As this process continues, our `pageList` shrinks all the way down to zero. Each time a batch is processed, the list gets smaller and frees up more memory.
Here is the fully updated code.
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;
async function writeToCsv(data, outputFile) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);
const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))
const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
} catch (e) {
throw new Error("Failed to write to csv");
}
}
function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}
async function scrapeSearchResults(browser, keyword, pageNumber, location="us", retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const formattedKeyword = keyword.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.g2.com/search?page=${pageNumber+1}&query=${formattedKeyword}`;
await page.goto(url);
console.log(`Successfully fetched: ${url}`);
const divCards = await page.$$("div[class='product-listing mb-1 border-bottom']");
for (const divCard of divCards) {
const nameElement = await divCard.$("div[class='product-listing__product-name']");
const name = await page.evaluate(element => element.textContent, nameElement);
const g2UrlElement = await nameElement.$("a");
const g2Url = await page.evaluate(element => element.getAttribute("href"), g2UrlElement);
let rating = 0.0;
const ratingElement = await divCard.$("span[class='fw-semibold']");
if (ratingElement) {
rating = await page.evaluate(element => element.textContent, ratingElement);
}
const descriptionElement = await divCard.$("p");
const description = await page.evaluate(element => element.textContent, descriptionElement)
const businessInfo = {
name: name,
stars: rating,
g2_url: g2Url,
description: description
};
await writeToCsv([businessInfo], `${keyword.replace(" ", "-")}.csv`);
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}
async function startScrape(keyword, pages, location, concurrencyLimit, retries) {
const pageList = range(0, pages);
const browser = await puppeteer.launch()
while (pageList.length > 0) {
const currentBatch = pageList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, location, retries));
try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();
}
async function main() {
const keywords = ["online bank"];
const concurrencyLimit = 5;
const pages = 1;
const location = "us";
const retries = 3;
const aggregateFiles = [];
for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, pages, location, concurrencyLimit, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
}
main();
Now that we can process pages concurrently, we can get that sweet data much, much faster!
Step 5: Bypassing Anti-Bots
Anti-bots are used all over the web to detect and block malicious traffic. They protect against things like DDOS attacks and lots of other bad stuff. Our crawler isn't malicious, but it looks really weird compared to a normal user. It can make dozens of requests in under a second... not very human at all. To get past anti-bot software, we need the ScrapeOps API.
The function below uses simple string formatting and converts any regular URL into a proxied one using the ScrapeOps Proxy API.
function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}
The ScrapeOps Proxy API rotates our IP addresses and always gives us a server located in the `country` we choose. Each request comes from a different IP address, so instead of looking like one bizarre, really fast user, our crawler looks like a bunch of normal users.
Our code barely changes at all here, but it's now at a production-ready level. Take a look at the full code example below.
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;
async function writeToCsv(data, outputFile) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);
const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))
const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
} catch (e) {
throw new Error("Failed to write to csv");
}
}
function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}
function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}
async function scrapeSearchResults(browser, keyword, pageNumber, location="us", retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const formattedKeyword = keyword.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.g2.com/search?page=${pageNumber+1}&query=${formattedKeyword}`;
const proxyUrl = getScrapeOpsUrl(url, location);
await page.goto(proxyUrl);
console.log(`Successfully fetched: ${url}`);
const divCards = await page.$$("div[class='product-listing mb-1 border-bottom']");
for (const divCard of divCards) {
const nameElement = await divCard.$("div[class='product-listing__product-name']");
const name = await page.evaluate(element => element.textContent, nameElement);
const g2UrlElement = await nameElement.$("a");
const g2Url = await page.evaluate(element => element.getAttribute("href"), g2UrlElement);
let rating = 0.0;
const ratingElement = await divCard.$("span[class='fw-semibold']");
if (ratingElement) {
rating = await page.evaluate(element => element.textContent, ratingElement);
}
const descriptionElement = await divCard.$("p");
const description = await page.evaluate(element => element.textContent, descriptionElement)
const businessInfo = {
name: name,
stars: rating,
g2_url: g2Url,
description: description
};
await writeToCsv([businessInfo], `${keyword.replace(" ", "-")}.csv`);
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}
async function startScrape(keyword, pages, location, concurrencyLimit, retries) {
const pageList = range(0, pages);
const browser = await puppeteer.launch()
while (pageList.length > 0) {
const currentBatch = pageList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, location, retries));
try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();
}
async function main() {
const keywords = ["online bank"];
const concurrencyLimit = 5;
const pages = 1;
const location = "us";
const retries = 3;
const aggregateFiles = [];
for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, pages, location, concurrencyLimit, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
}
main();
Step 6: Production Run
Let's run this baby in production! Take a look at the `main()` function below.
async function main() {
const keywords = ["online bank"];
const concurrencyLimit = 5;
const pages = 10;
const location = "us";
const retries = 3;
const aggregateFiles = [];
for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, pages, location, concurrencyLimit, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
}
`pages` is set to `10`, `location` gets set to `"us"`, and `concurrencyLimit` is set to 5. Now, we need to process 10 pages of data.
Here are the results:
This run took 25 seconds to process 10 pages of results...so 2.5 seconds per page.
Build A G2 Scraper
We now have a crawler that builds the reports we want based on different search criteria. Time to utilize these reports. Our scraper will read the search report and run an individual scraping job on every single business from the report.
Our scraper needs to be able to do the following:
- Open the report we created
- Get the pages from that report
- Pull information from these pages
- Create an individual report for each of the businesses we've looked up
Our review scraper is going to use some things you're already familiar with by now: parsing, storage, concurrency, and proxy integration.
Step 1: Create Simple Business Data Parser
Time to build a (not so) simple parser that takes in a row from our CSV file and scrapes reviews for the business in that row.
async function processBusiness(browser, row, location, retries = 3) {
const url = row.g2_url;
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const page = await browser.newPage();
try {
await page.goto(url);
const reviewCards = await page.$$("div[class='paper paper--white paper--box mb-2 position-relative border-bottom']");
let anonCount = 0;
for (const reviewCard of reviewCards) {
const reviewDateElement = await reviewCard.$("time");
const reviewTextElement = await reviewCard.$("div[itemprop='reviewBody']");
if (reviewDateElement && reviewTextElement) {
const date = await page.evaluate(element => element.getAttribute("datetime"), reviewDateElement);
const reviewBody = await page.evaluate(element => element.textContent, reviewTextElement);
const nameElement = await reviewCard.$("a[class='link--header-color']");
let name;
if (nameElement) {
name = await page.evaluate(element => element.textContent, nameElement);
} else {
name = `anonymous-${anonCount}`;
anonCount++;
}
const jobTitleElement = await reviewCard.$("div[class='mt-4th']");
let jobTitle;
if (jobTitleElement) {
jobTitle = await page.evaluate(element => element.textContent, jobTitleElement);
} else {
jobTitle = "n/a";
}
const ratingContainer = await reviewCard.$("div[class='f-1 d-f ai-c mb-half-small-only']");
const ratingDiv = await ratingContainer.$("div");
const ratingClass = await page.evaluate(element => element.getAttribute("class"), ratingDiv);
const ratingArray = ratingClass.split("-");
const rating = Number(ratingArray[ratingArray.length-1])/2;
const infoContainer = await reviewCard.$("div[class='tags--teal']");
const incentivesDirty = await infoContainer.$$("div");
const incentivesClean = [];
let source = "";
for (const incentive of incentivesDirty) {
const text = await page.evaluate(element => element.textContent, incentive);
if (!incentivesClean.includes(text)) {
if (text.includes("Review source:")) {
const textArray = text.split(": ");
source = textArray[textArray.length-1];
} else {
incentivesClean.push(text);
}
}
}
const validated = incentivesClean.includes("Validated Reviewer");
const incentivized = incentivesClean.includes("Incentivized Review");
const reviewData = {
name: name,
date: date,
job_title: jobTitle,
rating: rating,
full_review: reviewBody,
review_source: source,
validated: validated,
incentivized: incentivized
}
console.log(reviewData);
}
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left: ${retries-tries}`);
tries++;
} finally {
await page.close();
}
}
}
- Each review has a `date`. From each review, we pull the `date` with `await page.evaluate(element => element.getAttribute("datetime"), reviewDateElement)`
- Then, we check if the reviewer's name is present. If it's not, we name the reviewer `anonymous` and give them a number. This prevents different anonymous reviews from getting filtered out as duplicates
- `await reviewCard.$("div[class='mt-4th']")` checks if the `job_title` is present. If it is not, we give it a default value of `"n/a"`. Otherwise, we pull the user's `job_title` from the post.
- `await page.evaluate(element => element.getAttribute("class"), ratingDiv)` pulls the CSS class from our rating. We then `split("-")` to separate the number of stars from the CSS class. After splitting the stars out, we divide them by 2 to get the actual rating.
- `await page.evaluate(element => element.textContent, reviewTextElement)` gives us the actual review
- We pull all of the incentive tag elements into `incentivesDirty`. If `"Review source:"` is in the text of an incentive item, we `split(": ")` to separate the source name and pull it. All other non-duplicate items get pushed into the `incentivesClean` list.
- If `"Validated Reviewer"` or `"Incentivized Review"` is inside the `incentivesClean` list, we set those variables to `true`

Our parsing function takes in a `row` from our CSV file. It then fetches the `g2_url` for that business using `page.goto()`. Now that we can get the right data from the page, we're ready to read our CSV file and pull all this important data.
Step 2: Loading URLs To Scrape
In order to use `processBusiness()`, we need to read the CSV that the crawler creates. Let's update our code so our new function, `processResults()`, can handle this. Take a look at the new function below:
async function processResults(csvFile, location, retries) {
const businesses = await readCsv(csvFile);
const browser = await puppeteer.launch();
for (const business of businesses) {
await processBusiness(browser, business, location, retries);
}
await browser.close();
}
Here is the `readCsv()` function as well. It takes in a CSV file and spits out an array of JSON objects.
async function readCsv(inputFile) {
const results = [];
const parser = fs.createReadStream(inputFile).pipe(csvParse.parse({
columns: true,
delimiter: ",",
trim: true,
skip_empty_lines: true
}));
for await (const record of parser) {
results.push(record);
}
return results;
}
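Because we pass `columns: true`, each record comes back as an object keyed by the headers our crawler wrote, so a parsed row looks roughly like this (placeholder values; note that CSV values are read back as strings):

{
    name: "Example Bank",
    stars: "4.5",
    g2_url: "https://www.g2.com/products/example-bank/reviews",
    description: "A placeholder description."
}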
`processResults()` reads the CSV file and then converts all the rows into an array. We then iterate through this array and pass each row into `processBusiness()`. You can view the updated code below.
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;
async function writeToCsv(data, outputFile) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);
const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))
const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
} catch (e) {
throw new Error("Failed to write to csv");
}
}
async function readCsv(inputFile) {
const results = [];
const parser = fs.createReadStream(inputFile).pipe(csvParse.parse({
columns: true,
delimiter: ",",
trim: true,
skip_empty_lines: true
}));
for await (const record of parser) {
results.push(record);
}
return results;
}
function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}
function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}
async function scrapeSearchResults(browser, keyword, pageNumber, location="us", retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const formattedKeyword = keyword.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.g2.com/search?page=${pageNumber+1}&query=${formattedKeyword}`;
const proxyUrl = getScrapeOpsUrl(url, location);
await page.goto(proxyUrl);
console.log(`Successfully fetched: ${url}`);
const divCards = await page.$$("div[class='product-listing mb-1 border-bottom']");
for (const divCard of divCards) {
const nameElement = await divCard.$("div[class='product-listing__product-name']");
const name = await page.evaluate(element => element.textContent, nameElement);
const g2UrlElement = await nameElement.$("a");
const g2Url = await page.evaluate(element => element.getAttribute("href"), g2UrlElement);
let rating = 0.0;
const ratingElement = await divCard.$("span[class='fw-semibold']");
if (ratingElement) {
rating = await page.evaluate(element => element.textContent, ratingElement);
}
const descriptionElement = await divCard.$("p");
const description = await page.evaluate(element => element.textContent, descriptionElement)
const businessInfo = {
name: name,
stars: rating,
g2_url: g2Url,
description: description
};
await writeToCsv([businessInfo], `${keyword.replace(" ", "-")}.csv`);
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}
async function startScrape(keyword, pages, location, concurrencyLimit, retries) {
const pageList = range(0, pages);
const browser = await puppeteer.launch()
while (pageList.length > 0) {
const currentBatch = pageList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, location, retries));
try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();
}
async function processBusiness(browser, row, location, retries = 3) {
const url = row.g2_url;
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const page = await browser.newPage();
try {
await page.goto(url);
const reviewCards = await page.$$("div[class='paper paper--white paper--box mb-2 position-relative border-bottom']");
let anonCount = 0;
for (const reviewCard of reviewCards) {
const reviewDateElement = await reviewCard.$("time");
const reviewTextElement = await reviewCard.$("div[itemprop='reviewBody']");
if (reviewDateElement && reviewTextElement) {
const date = await page.evaluate(element => element.getAttribute("datetime"), reviewDateElement);
const reviewBody = await page.evaluate(element => element.textContent, reviewTextElement);
const nameElement = await reviewCard.$("a[class='link--header-color']");
let name;
if (nameElement) {
name = await page.evaluate(element => element.textContent, nameElement);
} else {
name = `anonymous-${anonCount}`;
anonCount++;
}
const jobTitleElement = await reviewCard.$("div[class='mt-4th']");
let jobTitle;
if (jobTitleElement) {
jobTitle = await page.evaluate(element => element.textContent, jobTitleElement);
} else {
jobTitle = "n/a";
}
const ratingContainer = await reviewCard.$("div[class='f-1 d-f ai-c mb-half-small-only']");
const ratingDiv = await ratingContainer.$("div");
const ratingClass = await page.evaluate(element => element.getAttribute("class"), ratingDiv);
const ratingArray = ratingClass.split("-");
const rating = Number(ratingArray[ratingArray.length-1])/2;
const infoContainer = await reviewCard.$("div[class='tags--teal']");
const incentivesDirty = await infoContainer.$$("div");
const incentivesClean = [];
let source = "";
for (const incentive of incentivesDirty) {
const text = await page.evaluate(element => element.textContent, incentive);
if (!incentivesClean.includes(text)) {
if (text.includes("Review source:")) {
const textArray = text.split(": ");
source = textArray[textArray.length-1];
} else {
incentivesClean.push(text);
}
}
}
const validated = incentivesClean.includes("Validated Reviewer");
const incentivized = incentivesClean.includes("Incentivized Review");
const reviewData = {
name: name,
date: date,
job_title: jobTitle,
rating: rating,
full_review: reviewBody,
review_source: source,
validated: validated,
incentivized: incentivized
}
console.log(reviewData);
}
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left: ${retries-tries}`);
tries++;
} finally {
await page.close();
}
}
}
async function processResults(csvFile, location, retries) {
const businesses = await readCsv(csvFile);
const browser = await puppeteer.launch();
for (const business of businesses) {
await processBusiness(browser, business, location, retries);
}
await browser.close();
}
async function main() {
const keywords = ["online bank"];
const concurrencyLimit = 5;
const pages = 1;
const location = "us";
const retries = 3;
const aggregateFiles = [];
for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, pages, location, concurrencyLimit, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
for (const file of aggregateFiles) {
await processResults(file, location, retries);
}
}
main();
In our updated code, `processResults()` reads our CSV file. Then, it passes each row into `processBusiness()`. `processBusiness()` extracts our data and then prints it to the console.
Step 3: Storing the Scraped Data
Once again, now that we've scraped our data, we need to store it. Because of our existing code, we can do this by removing `console.log()` and replacing it with the following line:
await writeToCsv([reviewData], `${row.name.replace(" ", "-")}.csv`);
Our `reviewData` object uses the following fields to represent reviews from the webpage (an example object follows the list):
- `name`
- `date`
- `job_title`
- `rating`
- `full_review`
- `review_source`
- `validated`
- `incentivized`
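Put together, a single review row written to a business's CSV might look something like this; every value here is made up purely for illustration:

const reviewData = {
    name: "anonymous-0",
    date: "2024-01-15",
    job_title: "Small-Business Owner",
    rating: 4.5,
    full_review: "Placeholder review text...",
    review_source: "Organic",
    validated: true,
    incentivized: false
};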
In the updated code below, we pass our `reviewData` into `writeToCsv()`. Just like before, we pass each object in as soon as it's been processed so we can save as much data as possible in the event of a crash.
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;
async function writeToCsv(data, outputFile) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);
const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))
const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
} catch (e) {
throw new Error("Failed to write to csv");
}
}
async function readCsv(inputFile) {
const results = [];
const parser = fs.createReadStream(inputFile).pipe(csvParse.parse({
columns: true,
delimiter: ",",
trim: true,
skip_empty_lines: true
}));
for await (const record of parser) {
results.push(record);
}
return results;
}
function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}
function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}
async function scrapeSearchResults(browser, keyword, pageNumber, location="us", retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const formattedKeyword = keyword.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.g2.com/search?page=${pageNumber+1}&query=${formattedKeyword}`;
const proxyUrl = getScrapeOpsUrl(url, location);
await page.goto(proxyUrl);
console.log(`Successfully fetched: ${url}`);
const divCards = await page.$$("div[class='product-listing mb-1 border-bottom']");
for (const divCard of divCards) {
const nameElement = await divCard.$("div[class='product-listing__product-name']");
const name = await page.evaluate(element => element.textContent, nameElement);
const g2UrlElement = await nameElement.$("a");
const g2Url = await page.evaluate(element => element.getAttribute("href"), g2UrlElement);
let rating = 0.0;
const ratingElement = await divCard.$("span[class='fw-semibold']");
if (ratingElement) {
rating = await page.evaluate(element => element.textContent, ratingElement);
}
const descriptionElement = await divCard.$("p");
const description = await page.evaluate(element => element.textContent, descriptionElement)
const businessInfo = {
name: name,
stars: rating,
g2_url: g2Url,
description: description
};
await writeToCsv([businessInfo], `${keyword.replace(" ", "-")}.csv`);
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}
async function startScrape(keyword, pages, location, concurrencyLimit, retries) {
const pageList = range(0, pages);
const browser = await puppeteer.launch()
while (pageList.length > 0) {
const currentBatch = pageList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, location, retries));
try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();
}
async function processBusiness(browser, row, location, retries = 3) {
const url = row.g2_url;
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const page = await browser.newPage();
try {
await page.goto(url);
const reviewCards = await page.$$("div[class='paper paper--white paper--box mb-2 position-relative border-bottom']");
let anonCount = 0;
for (const reviewCard of reviewCards) {
const reviewDateElement = await reviewCard.$("time");
const reviewTextElement = await reviewCard.$("div[itemprop='reviewBody']");
if (reviewDateElement && reviewTextElement) {
const date = await page.evaluate(element => element.getAttribute("datetime"), reviewDateElement);
const reviewBody = await page.evaluate(element => element.textContent, reviewTextElement);
const nameElement = await reviewCard.$("a[class='link--header-color']");
let name;
if (nameElement) {
name = await page.evaluate(element => element.textContent, nameElement);
} else {
name = `anonymous-${anonCount}`;
anonCount++;
}
const jobTitleElement = await reviewCard.$("div[class='mt-4th']");
let jobTitle;
if (jobTitleElement) {
jobTitle = await page.evaluate(element => element.textContent, jobTitleElement);
} else {
jobTitle = "n/a";
}
const ratingContainer = await reviewCard.$("div[class='f-1 d-f ai-c mb-half-small-only']");
const ratingDiv = await ratingContainer.$("div");
const ratingClass = await page.evaluate(element => element.getAttribute("class"), ratingDiv);
const ratingArray = ratingClass.split("-");
const rating = Number(ratingArray[ratingArray.length-1])/2;
const infoContainer = await reviewCard.$("div[class='tags--teal']");
const incentivesDirty = await infoContainer.$$("div");
const incentivesClean = [];
let source = "";
for (const incentive of incentivesDirty) {
const text = await page.evaluate(element => element.textContent, incentive);
if (!incentivesClean.includes(text)) {
if (text.includes("Review source:")) {
const textArray = text.split(": ");
source = textArray[textArray.length-1];
} else {
incentivesClean.push(text);
}
}
}
const validated = incentivesClean.includes("Validated Reviewer");
const incentivized = incentivesClean.includes("Incentivized Review");
const reviewData = {
name: name,
date: date,
job_title: jobTitle,
rating: rating,
full_review: reviewBody,
review_source: source,
validated: validated,
incentivized: incentivized
}
await writeToCsv([reviewData], `${row.name.replace(" ", "-")}.csv`);
}
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left: ${retries-tries}`);
tries++;
} finally {
await page.close();
}
}
}
async function processResults(csvFile, location, retries) {
const businesses = await readCsv(csvFile);
const browser = await puppeteer.launch();
for (const business of businesses) {
await processBusiness(browser, business, location, retries);
}
await browser.close();
}
async function main() {
const keywords = ["online bank"];
const concurrencyLimit = 5;
const pages = 1;
const location = "us";
const retries = 3;
const aggregateFiles = [];
for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, pages, location, concurrencyLimit, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
for (const file of aggregateFiles) {
await processResults(file, location, retries);
}
}
main();
Step 4: Adding Concurrency
Just like before, we use a `concurrencyLimit` to open many pages simultaneously. The largest difference here is our array: instead of an array of page numbers, we have a much larger array of CSV rows.
async function processResults(csvFile, location, concurrencyLimit, retries) {
const businesses = await readCsv(csvFile);
const browser = await puppeteer.launch();
while (businesses.length > 0) {
const currentBatch = businesses.splice(0, concurrencyLimit);
const tasks = currentBatch.map(business => processBusiness(browser, business, location, retries));
try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();
}
We removed the `for` loop and replaced it with concurrent batches that use `async` and `await`. Aside from the changes here, the rest of our code remains basically the same!
Step 5: Bypassing Anti-Bots
Now, we need to add proxy support again. We've already got our `getScrapeOpsUrl()` function; we just need to change one line.
await page.goto(getScrapeOpsUrl(url, location));
Here is the fully updated code:
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;
console.log("api key:", API_KEY);
async function writeToCsv(data, outputFile) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);
const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))
const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
} catch (e) {
throw new Error("Failed to write to csv");
}
}
async function readCsv(inputFile) {
const results = [];
const parser = fs.createReadStream(inputFile).pipe(csvParse.parse({
columns: true,
delimiter: ",",
trim: true,
skip_empty_lines: true
}));
for await (const record of parser) {
results.push(record);
}
return results;
}
function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}
function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}
async function scrapeSearchResults(browser, keyword, pageNumber, location="us", retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const formattedKeyword = keyword.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.g2.com/search?page=${pageNumber+1}&query=${formattedKeyword}`;
const proxyUrl = getScrapeOpsUrl(url, location);
console.log(proxyUrl)
await page.goto(proxyUrl);
console.log(`Successfully fetched: ${url}`);
const divCards = await page.$$("div[class='product-listing mb-1 border-bottom']");
for (const divCard of divCards) {
const nameElement = await divCard.$("div[class='product-listing__product-name']");
const name = await page.evaluate(element => element.textContent, nameElement);
const g2UrlElement = await nameElement.$("a");
const g2Url = await page.evaluate(element => element.getAttribute("href"), g2UrlElement);
let rating = 0.0;
const ratingElement = await divCard.$("span[class='fw-semibold']");
if (ratingElement) {
rating = await page.evaluate(element => element.textContent, ratingElement);
}
const descriptionElement = await divCard.$("p");
const description = await page.evaluate(element => element.textContent, descriptionElement)
const businessInfo = {
name: name,
stars: rating,
g2_url: g2Url,
description: description
};
await writeToCsv([businessInfo], `${keyword.replace(" ", "-")}.csv`);
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}
async function startScrape(keyword, pages, location, concurrencyLimit, retries) {
const pageList = range(0, pages);
const browser = await puppeteer.launch()
while (pageList.length > 0) {
const currentBatch = pageList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, location, retries));
try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();
}
async function processBusiness(browser, row, location, retries = 3) {
const url = row.g2_url;
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const page = await browser.newPage();
try {
await page.goto(getScrapeOpsUrl(url, location), { timeout: 60000 });
const reviewCards = await page.$$("div[class='paper paper--white paper--box mb-2 position-relative border-bottom']");
let anonCount = 0;
for (const reviewCard of reviewCards) {
const reviewDateElement = await reviewCard.$("time");
const reviewTextElement = await reviewCard.$("div[itemprop='reviewBody']");
if (reviewDateElement && reviewTextElement) {
const date = await page.evaluate(element => element.getAttribute("datetime"), reviewDateElement);
const reviewBody = await page.evaluate(element => element.textContent, reviewTextElement);
const nameElement = await reviewCard.$("a[class='link--header-color']");
let name;
if (nameElement) {
name = await page.evaluate(element => element.textContent, nameElement);
} else {
name = `anonymous-${anonCount}`;
anonCount++;
}
const jobTitleElement = await reviewCard.$("div[class='mt-4th']");
let jobTitle;
if (jobTitleElement) {
jobTitle = await page.evaluate(element => element.textContent, jobTitleElement);
} else {
jobTitle = "n/a";
}
const ratingContainer = await reviewCard.$("div[class='f-1 d-f ai-c mb-half-small-only']");
const ratingDiv = await ratingContainer.$("div");
const ratingClass = await page.evaluate(element => element.getAttribute("class"), ratingDiv);
const ratingArray = ratingClass.split("-");
const rating = Number(ratingArray[ratingArray.length-1])/2;
const infoContainer = await reviewCard.$("div[class='tags--teal']");
const incentivesDirty = await infoContainer.$$("div");
const incentivesClean = [];
let source = "";
for (const incentive of incentivesDirty) {
const text = await page.evaluate(element => element.textContent, incentive);
if (!incentivesClean.includes(text)) {
if (text.includes("Review source:")) {
const textArray = text.split(": ");
source = textArray[textArray.length-1];
} else {
incentivesClean.push(text);
}
}
}
const validated = incentivesClean.includes("Validated Reviewer");
const incentivized = incentivesClean.includes("Incentivized Review");
const reviewData = {
name: name,
date: date,
job_title: jobTitle,
rating: rating,
full_review: reviewBody,
review_source: source,
validated: validated,
incentivized: incentivized
}
await writeToCsv([reviewData], `${row.name.replace(" ", "-")}.csv`);
}
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left: ${retries-tries}, url: ${getScrapeOpsUrl(url)}`);
tries++;
} finally {
await page.close();
}
}
}
async function processResults(csvFile, location, concurrencyLimit, retries) {
const businesses = await readCsv(csvFile);
const browser = await puppeteer.launch();
while (businesses.length > 0) {
const currentBatch = businesses.splice(0, concurrencyLimit);
const tasks = currentBatch.map(business => processBusiness(browser, business, location, retries));
try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();
}
async function main() {
const keywords = ["online bank"];
const concurrencyLimit = 5;
const pages = 1;
const location = "us";
const retries = 3;
const aggregateFiles = [];
for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, pages, location, concurrencyLimit, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
console.log("Starting scrape");
for (const file of aggregateFiles) {
await processResults(file, location, concurrencyLimit, retries);
}
console.log("Scrape complete");
}
main();
Step 6: Production Run
Now, it's time to run both our scraper and crawler in production together! Just like before, we'll run a scrape job on 10 pages of search results.
async function main() {
const keywords = ["online bank"];
const concurrencyLimit = 5;
const pages = 10;
const location = "us";
const retries = 3;
const aggregateFiles = [];
for (const keyword of keywords) {
console.log("Crawl starting");
await startScrape(keyword, pages, location, concurrencyLimit, retries);
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
for (const file of aggregateFiles) {
await processResults(file, location, concurrencyLimit, retries);
}
}
Just like before, I set our `pages` to `10`, our `location` to `"us"`, and our `concurrencyLimit` to 5. Here are the results.
In total, it took just over 690 seconds (including the time it took to create our initial report) to generate a full report and process all the results (197 rows). The average speed per page is about 3.5 seconds.
Legal and Ethical Considerations
Whenever you're doing a scrape job, there will always be legal and ethical questions to consider. You should always comply with a site's Terms of Use and `robots.txt`.
You can view G2's terms here and their `robots.txt` is available here.
Always be careful about the information you extract and don't scrape private or confidential data. If a website is hidden behind a login, that is generally considered private data.
If your data does not require a login, it is generally considered to be public data. If you have questions about the legality of your scraping job, it is best to consult an attorney familiar with the laws and localities you're dealing with.
Conclusion
You now know how to build G2 scrapers and parsers with Puppeteer. You should have a decent understanding of the following terms: parsing, pagination, data storage, concurrency, and proxy integration.
You should also have a decent understanding of the `page.$()`, `page.$$()`, and `page.evaluate()` methods from Puppeteer, and you should understand some pretty complex string operations for extracting data, such as `split()`, `replace()`, and `includes()`.
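As a quick refresher, here's how a couple of those string operations worked together to pull the review source out of G2's tag text; the tag string below is a stand-in for what `page.evaluate()` returns:

const tagText = "Review source: Organic";   // hypothetical tag text
if (tagText.includes("Review source:")) {
    const textArray = tagText.split(": ");
    const source = textArray[textArray.length - 1];
    console.log(source); // Organic
}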
To learn more about the tools we used in this article, check out the links below:
More Web Scraping Guides
Build something and practice your new skills. Whatever you do... go build something! Here at ScrapeOps, we've got loads of resources for you to learn from. If you're in the mood to learn more, check out our Puppeteer Web Scraping Playbook or take a look at the articles below: