
How to Scrape LinkedIn Profiles With Puppeteer

LinkedIn was created in 2003, and over the course of its existence, it has accumulated an enormous amount of professional data.

In today's guide, we'll scrape LinkedIn profiles and walk through the process in detail. While the profiles are very difficult to scrape, if you know what to do, you can get past LinkedIn's seemingly unbeatable system of redirects.

Need help scraping the web?

Then check out ScrapeOps, the complete toolkit for web scraping.


TLDR - How to Scrape LinkedIn Profiles

For those of you without time to read, we've got a prebuilt scraper you can use.

  • It first runs a crawl and generates a report based on our search results.
  • Once we've got a report generated, our scraper will read the report and scrape each individual profile discovered during the crawl.
  1. Start by creating a new project folder with a config.json file.
  2. Inside your config file, add your ScrapeOps API key: {"api_key": "your-super-secret-api-key"}.
  3. Then, copy and paste the code below into a JavaScript file.
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");

const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;


async function crawlProfiles(browser, keyword, location="us", retries=3) {
let tries = 0;
let success = false;

while (tries <= retries && !success) {

const firstName = keyword.split(" ")[0];
const lastName = keyword.split(" ")[1]


const page = await browser.newPage();
try {
const url = `https://www.linkedin.com/pub/dir?firstName=${firstName}&lastName=${lastName}&trk=people-guest_people-search-bar_search-submit`;

await page.goto(url);

console.log(`Successfully fetched: ${url}`);

const divCards = await page.$$("div[class='base-search-card__info']");

for (const divCard of divCards) {

const link = await page.evaluate(element => element.parentElement.getAttribute("href"), divCard);
const splitLink = link.split("/")
const name = splitLink[splitLink.length-1].split("?")[0];

const displayNameElement = await divCard.$("h3[class='base-search-card__title']");
const displayName = await page.evaluate(element => element.textContent, displayNameElement);

const locationElement = await page.$("p[class='people-search-card__location']");
const location = await page.evaluate(element => element.textContent, locationElement);

let companies = "n/a";

const hasCompanies = await page.$("span[class='entity-list-meta__entities-list']");

if (hasCompanies) {
companies = await page.evaluate(element => element.textContent, hasCompanies);
}


const searchData = {
name: name.trim(),
display_name: displayName.trim(),
url: link.trim(),
location: location.trim(),
companies: companies.trim()
};

console.log(searchData);

}

success = true;

} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;

} finally {
await page.close();
}
}
}

async function startCrawl(keywordList, location, retries) {

const browser = await puppeteer.launch();

for (const keyword of keywordList) {
await crawlProfiles(browser, keyword, location, retries);
}

await browser.close();
}


async function main() {
const keywords = ["bill gates", "elon musk"];
const concurrencyLimit = 5;
const location = "us";
const retries = 3;
const aggregateFiles = [];

console.log("Crawl starting");
console.time("startCrawl");
for (const keyword of keywords) {
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
await startCrawl(keywords, location, retries);
console.timeEnd("startCrawl");
console.log("Crawl complete");
}


main();

To change your results, you can change any of the following constants in our main() function (see the example below):

  • keywords
  • concurrencyLimit
  • location
  • retries
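For instance, a hypothetical tweak might look like this. The names and values here are just placeholders; "uk" is one of the supported country codes covered later in this guide.

// Hypothetical example: search for different people from the UK,
// with a smaller batch size and an extra retry.
const keywords = ["jeff bezos", "satya nadella"];
const concurrencyLimit = 3;
const location = "uk";
const retries = 4;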

How To Architect Our LinkedIn Profiles Scraper

LinkedIn is difficult to scrape. When you navigate to their site from your browser, if you're not logged in, you get redirected and prompted to sign in. If you're new to scraping, their anti-bot system can seem impassable. With some due diligence, we can get around all of this. We're going to write a search crawler and a profile scraper.

Our crawler takes in a keyword and searches for it. If we want to search for Bill Gates, our crawler will run that search and then it'll save each Bill Gates that it finds from the results.

Afterward, it'll be time for our profile scraper. The profile scraper starts right where the crawler leaves off. It reads the CSV and then scrapes each individual profile found in the CSV file.

At a high level, our profile crawler needs to:

  1. Perform a search and parse the search results.
  2. Store those parsed results.
  3. Concurrently run steps 1 and 2 on multiple searches.
  4. Use proxy integration to get past LinkedIn's anti-bots.

Our profile scraper needs to perform the following steps:

  1. Read the crawler's report into an array.
  2. Parse a row from the array.
  3. Store parsed profile data.
  4. Run steps 2 and 3 on multiple pages concurrently.
  5. Utilize a proxy to bypass anti-bots.

Understanding How To Scrape LinkedIn Profiles

We can't just start building our scrapers; first, we need to understand exactly where our data is and plan out how to extract it from the page. We'll use the ScrapeOps Proxy Aggregator API to handle our geolocation and bypass anti-bots.

These next few sections will highlight our requirements when building the crawler and the scraper.

Step 1: How To Request LinkedIn Profiles Pages

We need to know how to GET our webpages from LinkedIn: both the search results page and the individual profile pages.

Look at the images below so you can gain a better understanding of these types of pages.

First, we'll look at our search results page, then we'll examine the individual profile page.

You can view a search for Bill Gates in the shot below. Our URL is:

https://www.linkedin.com/pub/dir?firstName=bill&lastName=gates&trk=people-guest_people-search-bar_search-submit

We're prompted to sign in as soon as we get to the page, but this isn't really an issue because our full page is still intact under the prompt.

Our final URL format looks like this:

https://www.linkedin.com/pub/dir?firstName={first_name}&lastName={last_name}&trk=people-guest_people-search-bar_search-submit

LinkedIn Search Results

To scrape individual profiles, we need a better feel for the profile layout. Here's a look at the profile of Bill Gates. We're once again prompted to sign in, but the underlying page is intact.

Our URL is:

https://www.linkedin.com/in/williamhgates?trk=people-guest_people_search-card

All of our profile links look like this:

https://www.linkedin.com/in/{name_of_profile}

We remove the queries at the end because (for some unknown reason) anti-bots are less likely to block us when we format the URL this way.
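If you want to build these URLs in code, a small sketch like the one below mirrors the two formats above. The helpers buildSearchUrl() and cleanProfileUrl() are illustrative only; they aren't part of the final scraper.

// Minimal sketch of the URL formats described above.
function buildSearchUrl(keyword) {
    const [firstName, lastName] = keyword.split(" ");
    return `https://www.linkedin.com/pub/dir?firstName=${firstName}&lastName=${lastName}&trk=people-guest_people-search-bar_search-submit`;
}

function cleanProfileUrl(link) {
    // Strip the query string, e.g. "?trk=people-guest_people_search-card",
    // so the profile URL looks like https://www.linkedin.com/in/{name_of_profile}
    return link.split("?")[0];
}

console.log(buildSearchUrl("bill gates"));
console.log(cleanProfileUrl("https://www.linkedin.com/in/williamhgates?trk=people-guest_people_search-card"));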

Bill Gates LinkedIn Profile


Step 2: How To Extract Data From LinkedIn Profiles Results and Pages

Time to figure out how to get our data. If you look at our search results, each one is a div with a class of 'base-search-card__info'. For individual profiles, we pull our data from a JSON blob inside the head of the page.

Look at each result. It's a div element with the class base-search-card__info.

HTML Inspection LinkedIn Search Results Page

In the image below, you can see a profile page. As you can see, there is a ton of data inside the JSON blob.

HTML Inspection LinkedIn Profile Page
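As a rough sketch of where these selectors point, the lookups boil down to something like this. It assumes you already have a Puppeteer page loaded with the relevant URL; inspectPage() is an illustrative helper, not part of the final scraper.

// Sketch only: `page` is assumed to be a Puppeteer Page that has already
// loaded the relevant LinkedIn URL.
async function inspectPage(page) {
    // On a search results page, each result card is one of these divs:
    const cards = await page.$$("div[class='base-search-card__info']");
    console.log(`Found ${cards.length} result cards`);

    // On a profile page, the data lives in a JSON blob inside the <head>:
    const script = await page.$("head script[type='application/ld+json']");
    if (script) {
        const jsonText = await page.evaluate(el => el.textContent, script);
        console.log(JSON.parse(jsonText));
    }
}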


Step 3: Geolocated Data

With the ScrapeOps Proxy Aggregator, we can choose which country we want to appear in.

The ScrapeOps API allows us to pass a country parameter. ScrapeOps then reads this parameter and routes our request through the corresponding country.

  • If we want to appear in the US, we can pass "country": "us".
  • If we want to appear in the UK, we can pass "country": "uk".

You can view the full list of supported countries in the ScrapeOps documentation.

ScrapeOps gives great geotargeting support at no additional charge. There are other proxy providers that charge you extra API credits to use their geotargeting.
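In practice, geotargeting is just one extra query parameter on the proxied request. The snippet below previews the getScrapeOpsUrl() helper we build later in this guide; the API key shown is a placeholder.

// Preview: the "country" param controls which country we appear in.
const params = new URLSearchParams({
    api_key: "your-super-secret-api-key",
    url: "https://www.linkedin.com/pub/dir?firstName=bill&lastName=gates",
    country: "us"   // or "uk", etc.
});
console.log(`https://proxy.scrapeops.io/v1/?${params.toString()}`);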


Setting Up Our LinkedIn Profiles Scraper Project

Time to start building. We need to create a new project folder and initialize it as a NodeJS project. Then we'll install Puppeteer and a few other basic dependencies.

Create a New Project Folder

mkdir linkedin-profiles-scraper

cd linkedin-profiles-scraper

Turn it into a JavaScript Project

npm init -y

Install Our Dependencies

npm install puppeteer
npm install csv-writer
npm install csv-parse

The fs module ships with NodeJS, so there's nothing extra to install for it.

We're all set to begin coding.


Build A LinkedIn Profiles Search Crawler

We've already outlined the requirements for our crawler. Time to go about building our crawler and adding these features in. As previously mentioned, our whole project starts with our crawler.

Our crawler will run a search, parse the results, and then save our data to a CSV file. Once our crawler can do these tasks, we'll need to add concurrency and proxy support.

In the coming sections, we'll go through step by step and build all of these features into our crawler.


Step 1: Create Simple Search Data Parser

Everything stems from our parsing function.

In the script below, we'll handle our imports, retries and, of course, parsing logic. Everything built afterward will be on top of this basic design. Take a look at our parsing function, crawlProfiles().

As we discovered earlier, we need to find all of our target div elements. Once we've got them, we'll iterate through them with a for loop and extract their data.

const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");

const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;


async function crawlProfiles(browser, keyword, location="us", retries=3) {
let tries = 0;
let success = false;

while (tries <= retries && !success) {

const firstName = keyword.split(" ")[0];
const lastName = keyword.split(" ")[1]


const page = await browser.newPage();
try {
const url = `https://www.linkedin.com/pub/dir?firstName=${firstName}&lastName=${lastName}&trk=people-guest_people-search-bar_search-submit`;

await page.goto(url);

console.log(`Successfully fetched: ${url}`);

const divCards = await page.$$("div[class='base-search-card__info']");

for (const divCard of divCards) {

const link = await page.evaluate(element => element.parentElement.getAttribute("href"), divCard);
const splitLink = link.split("/")
const name = splitLink[splitLink.length-1].split("?")[0];

const displayNameElement = await divCard.$("h3[class='base-search-card__title']");
const displayName = await page.evaluate(element => element.textContent, displayNameElement);

const locationElement = await page.$("p[class='people-search-card__location']");
const location = await page.evaluate(element => element.textContent, locationElement);

let companies = "n/a";

const hasCompanies = await page.$("span[class='entity-list-meta__entities-list']");

if (hasCompanies) {
companies = await page.evaluate(element => element.textContent, hasCompanies);
}


const searchData = {
name: name.trim(),
display_name: displayName.trim(),
url: link.trim(),
location: location.trim(),
companies: companies.trim()
};

console.log(searchData);

}

success = true;

} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;

} finally {
await page.close();
}
}
}

async function startCrawl(keywordList, location, retries) {

const browser = await puppeteer.launch();

for (const keyword of keywordList) {
await crawlProfiles(browser, keyword, location, retries);
}

await browser.close();
}


async function main() {
const keywords = ["bill gates", "elon musk"];
const concurrencyLimit = 5;
const location = "us";
const retries = 3;
const aggregateFiles = [];

console.log("Crawl starting");
console.time("startCrawl");
for (const keyword of keywords) {
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
await startCrawl(keywords, location, retries);
console.timeEnd("startCrawl");
console.log("Crawl complete");
}


main();
  • await page.$$("div[class='base-search-card__info']"); returns all of the profile cards we're looking for.
  • As we iterate through the profile cards:
    • await page.evaluate(element => element.parentElement.getAttribute("href"), divCard) finds our link.
    • await divCard.$("h3[class='base-search-card__title']") yields our displayNameElement.
      • We extract its text with await page.evaluate(element => element.textContent, displayNameElement).
    • await page.$("p[class='people-search-card__location']") gives us the locationElement.
      • We extract its text the same way we extracted the text from our displayNameElement.
    • We check the span elements to see if there are companies present and if there are companies, we extract them. If there are no companies, we assign a default value of "n/a".

Step 2: Storing the Scraped Data

We need to store our extracted data. Without a way to store it, this extracted data is useless. In this section, we'll write a function that takes in an array of JSON objects and writes the array to a CSV file. We should craft this function carefully.

This function should check to see if a file exists.

  • If the file already exists, we should open it in append mode, otherwise, we need to create a new one. It should also check if our data is an array.
  • If the data isn't an array, we need to convert it to one. Also, it shouldn't exit until the CSV file has been written. Storage failure shouldn't be an option.

Here is writeToCsv().

async function writeToCsv(data, outputFile) {
    let success = false;
    while (!success) {

        if (!data || data.length === 0) {
            throw new Error("No data to write!");
        }
        const fileExists = fs.existsSync(outputFile);

        if (!(data instanceof Array)) {
            data = [data];
        }

        const headers = Object.keys(data[0]).map(key => ({id: key, title: key}));

        const csvWriter = createCsvWriter({
            path: outputFile,
            header: headers,
            append: fileExists
        });
        try {
            await csvWriter.writeRecords(data);
            success = true;
        } catch (e) {
            console.log("Failed data", data);
            throw new Error("Failed to write to csv");
        }
    }
}
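As a quick usage example, here's how a single record gets written. The field names mirror the searchData object we build in the crawler; the values are placeholders.

// Example usage: appends one row to bill-gates.csv, creating the file if it doesn't exist.
(async () => {
    await writeToCsv([{
        name: "williamhgates",
        display_name: "Bill Gates",
        url: "https://www.linkedin.com/in/williamhgates",
        location: "n/a",
        companies: "n/a"
    }], "bill-gates.csv");
})();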

Now that we have data storage, our code looks like this.

const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");

const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;

async function writeToCsv(data, outputFile) {
let success = false;
while (!success) {

if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);

if (!(data instanceof Array)) {
data = [data]
}

const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))

const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
success = true;
} catch (e) {
console.log("Failed data", data);
throw new Error("Failed to write to csv");
}
}
}


async function crawlProfiles(browser, keyword, location="us", retries=3) {
let tries = 0;
let success = false;

while (tries <= retries && !success) {

const firstName = keyword.split(" ")[0];
const lastName = keyword.split(" ")[1]


const page = await browser.newPage();
try {
const url = `https://www.linkedin.com/pub/dir?firstName=${firstName}&lastName=${lastName}&trk=people-guest_people-search-bar_search-submit`;

await page.goto(url);

console.log(`Successfully fetched: ${url}`);

const divCards = await page.$$("div[class='base-search-card__info']");

for (const divCard of divCards) {

const link = await page.evaluate(element => element.parentElement.getAttribute("href"), divCard);
const splitLink = link.split("/")
const name = splitLink[splitLink.length-1].split("?")[0];

const displayNameElement = await divCard.$("h3[class='base-search-card__title']");
const displayName = await page.evaluate(element => element.textContent, displayNameElement);

const locationElement = await page.$("p[class='people-search-card__location']");
const location = await page.evaluate(element => element.textContent, locationElement);

let companies = "n/a";

const hasCompanies = await page.$("span[class='entity-list-meta__entities-list']");

if (hasCompanies) {
companies = await page.evaluate(element => element.textContent, hasCompanies);
}


const searchData = {
name: name.trim(),
display_name: displayName.trim(),
url: link.trim(),
location: location.trim(),
companies: companies.trim()
};

await writeToCsv([searchData], `${keyword.replace(" ", "-")}.csv`);
}

success = true;

} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;

} finally {
await page.close();
}
}
}

async function startCrawl(keywordList, location, retries) {

const browser = await puppeteer.launch();

for (const keyword of keywordList) {
await crawlProfiles(browser, keyword, location, retries);
}

await browser.close();
}


async function main() {
const keywords = ["bill gates", "elon musk"];
const concurrencyLimit = 5;
const location = "us";
const retries = 3;
const aggregateFiles = [];

console.log("Crawl starting");
console.time("startCrawl");
for (const keyword of keywords) {
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
await startCrawl(keywords, location, retries);
console.timeEnd("startCrawl");
console.log("Crawl complete");
}


main();
  • Like earlier, we use our extracted data to create a searchData object.
  • We pass our searchData into writeToCsv() and store it to a CSV file.

Step 3: Adding Concurrency

When deploying a scraper to production, it should be fast and efficient. Now that we have a working scraper, we need to make it faster and more efficient. NodeJS is designed to run in a single-threaded environment.

However, we don't need multithreading to scrape pages concurrently. We just need to rewrite startCrawl() to run on multiple pages simultaneously.

To accomplish this, we're going to take advantage of JavaScript's async support. Take a look at the example below.

async function startCrawl(keywordList, location, concurrencyLimit, retries) {

    const browser = await puppeteer.launch();

    while (keywordList.length > 0) {
        const currentBatch = keywordList.splice(0, concurrencyLimit);
        const tasks = currentBatch.map(keyword => crawlProfiles(browser, keyword, location, retries));

        try {
            await Promise.all(tasks);
        } catch (err) {
            console.log(`Failed to process batch: ${err}`);
        }
    }

    await browser.close();
}

We no longer have to depend on a for loop. Instead, we create a list of async tasks and we use Promise.all() to wait for them all to resolve.

When we search for bill gates and elon musk, both of these pages get fetched and parsed concurrently. We wait for both of them to resolve before closing the browser and exiting the function.
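If the batching pattern is new to you, here's the same idea in isolation with generic values. runInBatches() is an illustrative helper, not part of the scraper itself.

// Generic illustration of the splice() batching pattern used in startCrawl().
async function runInBatches(items, limit, worker) {
    while (items.length > 0) {
        const batch = items.splice(0, limit);                // take up to `limit` items off the front
        await Promise.all(batch.map(item => worker(item)));  // run the whole batch concurrently
    }
}

// Example: process 5 fake "keywords" two at a time.
runInBatches(["a", "b", "c", "d", "e"], 2, async (item) => {
    console.log("processing", item);
});

With that pattern in mind, here is our full code up to this point.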

const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");

const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;

async function writeToCsv(data, outputFile) {
let success = false;
while (!success) {

if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);

if (!(data instanceof Array)) {
data = [data]
}

const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))

const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
success = true;
} catch (e) {
console.log("Failed data", data);
throw new Error("Failed to write to csv");
}
}
}


async function crawlProfiles(browser, keyword, location="us", retries=3) {
let tries = 0;
let success = false;

while (tries <= retries && !success) {

const firstName = keyword.split(" ")[0];
const lastName = keyword.split(" ")[1]


const page = await browser.newPage();
try {
const url = `https://www.linkedin.com/pub/dir?firstName=${firstName}&lastName=${lastName}&trk=people-guest_people-search-bar_search-submit`;

await page.goto(url);

console.log(`Successfully fetched: ${url}`);

const divCards = await page.$$("div[class='base-search-card__info']");

for (const divCard of divCards) {

const link = await page.evaluate(element => element.parentElement.getAttribute("href"), divCard);
const splitLink = link.split("/")
const name = splitLink[splitLink.length-1].split("?")[0];

const displayNameElement = await divCard.$("h3[class='base-search-card__title']");
const displayName = await page.evaluate(element => element.textContent, displayNameElement);

const locationElement = await page.$("p[class='people-search-card__location']");
const location = await page.evaluate(element => element.textContent, locationElement);

let companies = "n/a";

const hasCompanies = await page.$("span[class='entity-list-meta__entities-list']");

if (hasCompanies) {
companies = await page.evaluate(element => element.textContent, hasCompanies);
}


const searchData = {
name: name.trim(),
display_name: displayName.trim(),
url: link.trim(),
location: location.trim(),
companies: companies.trim()
};

await writeToCsv([searchData], `${keyword.replace(" ", "-")}.csv`);
}

success = true;

} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;

} finally {
await page.close();
}
}
}

async function startCrawl(keywordList, location, concurrencyLimit, retries) {

const browser = await puppeteer.launch();

while (keywordList.length > 0) {
const currentBatch = keywordList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(keyword => crawlProfiles(browser, keyword, location, retries));

try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}

await browser.close();
}


async function main() {
const keywords = ["bill gates", "elon musk"];
const concurrencyLimit = 5;
const location = "us";
const retries = 3;
const aggregateFiles = [];

console.log("Crawl starting");
console.time("startCrawl");
for (const keyword of keywords) {
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
await startCrawl(keywords, location, concurrencyLimit, retries);
console.timeEnd("startCrawl");
console.log("Crawl complete");
}


main();

Step 4: Bypassing Anti-Bots

Like we mentioned previously, we'll use the ScrapeOps Proxy Aggregator to bypass anti-bots.

This one function will unlock the power of the ScrapeOps Proxy. It needs to take in a URL, wrap it up with our api_key and location using some URL encoding, and then return a new ScrapeOps proxied URL.

When we talk to the ScrapeOps API, the country param tells ScrapeOps our location of choice. ScrapeOps then routes us through a server based in that location.

There are many other options we can use, such as residential and mobile, but typically our country parameter is enough.

function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}
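For instance, wrapping one of our search URLs (with a placeholder API key) produces a proxied URL along these lines.

// Example: wrap a LinkedIn search URL with the proxy.
// With API_KEY = "your-super-secret-api-key", this logs something like:
// https://proxy.scrapeops.io/v1/?api_key=your-super-secret-api-key&url=https%3A%2F%2Fwww.linkedin.com%2Fpub%2Fdir%3FfirstName%3Dbill%26lastName%3Dgates&country=us
console.log(getScrapeOpsUrl("https://www.linkedin.com/pub/dir?firstName=bill&lastName=gates", "us"));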

Here is our finished crawler.

const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");

const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;

async function writeToCsv(data, outputFile) {
let success = false;
while (!success) {

if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);

if (!(data instanceof Array)) {
data = [data]
}

const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))

const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
success = true;
} catch (e) {
console.log("Failed data", data);
throw new Error("Failed to write to csv");
}
}
}



function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}

async function crawlProfiles(browser, keyword, location="us", retries=3) {
let tries = 0;
let success = false;

while (tries <= retries && !success) {

const firstName = keyword.split(" ")[0];
const lastName = keyword.split(" ")[1]


const page = await browser.newPage();
try {
const url = `https://www.linkedin.com/pub/dir?firstName=${firstName}&lastName=${lastName}&trk=people-guest_people-search-bar_search-submit`;

const proxyUrl = getScrapeOpsUrl(url, location);
await page.goto(proxyUrl, { timeout: 0 });

console.log(`Successfully fetched: ${url}`);

const divCards = await page.$$("div[class='base-search-card__info']");

for (const divCard of divCards) {

const link = await page.evaluate(element => element.parentElement.getAttribute("href"), divCard);
const splitLink = link.split("/")
const name = splitLink[splitLink.length-1].split("?")[0];

const displayNameElement = await divCard.$("h3[class='base-search-card__title']");
const displayName = await page.evaluate(element => element.textContent, displayNameElement);

const locationElement = await page.$("p[class='people-search-card__location']");
const location = await page.evaluate(element => element.textContent, locationElement);

let companies = "n/a";

const hasCompanies = await page.$("span[class='entity-list-meta__entities-list']");

if (hasCompanies) {
companies = await page.evaluate(element => element.textContent, hasCompanies);
}


const searchData = {
name: name.trim(),
display_name: displayName.trim(),
url: link.trim(),
location: location.trim(),
companies: companies.trim()
};

await writeToCsv([searchData], `${keyword.replace(" ", "-")}.csv`);
}

success = true;

} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;

} finally {
await page.close();
}
}
}

async function startCrawl(keywordList, location, concurrencyLimit, retries) {

const browser = await puppeteer.launch();

while (keywordList.length > 0) {
const currentBatch = keywordList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(keyword => crawlProfiles(browser, keyword, location, retries));

try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}

await browser.close();
}


async function main() {
const keywords = ["bill gates", "elon musk"];
const concurrencyLimit = 5;
const location = "us";
const retries = 3;
const aggregateFiles = [];

console.log("Crawl starting");
console.time("startCrawl");
for (const keyword of keywords) {
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
await startCrawl(keywords, location, concurrencyLimit, retries);
console.timeEnd("startCrawl");
console.log("Crawl complete");
}


main();

Step 5: Production Run

It's finally time to test out the performance of our crawler. Feel free to change any of the following from the main() function.

  • keywords
  • concurrencyLimit
  • location
  • retries

Crawler Performance

As you can see in the screenshot above, we crawled two names in 20.244 seconds. 20.244 / 2 = 10.122 seconds per search. This isn't lightning fast, but it's not bad at all.


Build A LinkedIn Profile Scraper

Now that we're getting a crawler report, we need to read that report and scrape the profiles from it. Our next step involves building a scraper.

Our scraper will read the report from our crawler and scrape each individual profile that we extracted during the crawl. We'll add each feature with iterative building, just like we did with the crawler.


Step 1: Create Simple Profile Data Parser

To start, we're going to write another parsing function. We'll give it retry logic and error handling, and we'll reuse the basic structure from the beginning of this article.

processProfile() fetches a profile. We find the head of the page. From inside the head, we find the JSON blob that contains all of our profile data.

async function processProfile(browser, row, location, retries = 3) {
const url = row.url;
let tries = 0;
let success = false;

while (tries <= retries && !success) {
const page = await browser.newPage();

try {
const response = await page.goto(url);
if (!response || response.status() !== 200) {
throw new Error("Failed to fetch page, status:", response.status());
}

const head = await page.$("head");
const scriptElement = await head.$("script[type='application/ld+json']");
const jsonText = await page.evaluate(element => element.textContent, scriptElement);

const jsonDataGraph = JSON.parse(jsonText)["@graph"];
let jsonData = {};
for (const element of jsonDataGraph) {
if (element["@type"] === "Person") {
jsonData = element;
break;
}
}

let company = "n/a";
let companyProfile = "n/a";
let jobTitle = "n/a";

if ("jobTitle" in jsonData && Array.isArray(jsonData.jobTitle) && jsonData.jobTitle.length > 0) {
jobTitle = jsonData.jobTitle[0];
}

const hasCompany = "worksFor" in jsonData && jsonData.worksFor.length > 0;

if (hasCompany) {
company = jsonData.worksFor[0].name;
const hasCompanyUrl = "url" in jsonData.worksFor[0];
if (hasCompanyUrl) {
companyProfile = jsonData.worksFor[0].url
}
}

const hasInteractions = "interactionStatistic" in jsonData;
let followers = 0;
if (hasInteractions) {
const stats = jsonData.interactionStatistic;
if (stats.name === "Follows" && stats["@type"] === "InteractionCounter") {
followers = stats.userInteractionCount;
}
}

const profileData = {
name: row.name,
company: company,
company_profile: companyProfile,
job_title: jobTitle,
followers: followers
}

console.log(profileData);

success = true;
console.log("Successfully parsed", row.url);


} catch (err) {
tries++;
console.log(`Error: ${err}, tries left: ${retries-tries}, url: ${getScrapeOpsUrl(url)}`);

} finally {
await page.close();
}
}
}
  • First, we find the head of the page: await page.$("head").
  • await head.$("script[type='application/ld+json']") finds the JSON blob inside the head.
  • We load the JSON and iterate through the "@graph" array until we find an entry whose "@type" is "Person". We use this "Person" entry to extract our data.
  • We attempt to extract the following and set defaults just in case something is not found (see the sketch after this list):
    • company: the company that a person works for.
    • company_profile: the company's LinkedIn profile.
    • job_title: the person's official job title.
    • followers: the number of other people following this person.
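To make those field names concrete, the relevant slice of the "Person" entry looks roughly like this. The values are placeholders and the real blob contains far more fields; only the keys our parser reads are shown.

// Rough shape of the "Person" entry inside the "@graph" array (placeholder values).
const personExample = {
    "@type": "Person",
    "name": "Example Person",
    "jobTitle": ["Example Title"],
    "worksFor": [
        { "name": "Example Company", "url": "https://www.linkedin.com/company/example" }
    ],
    "interactionStatistic": {
        "@type": "InteractionCounter",
        "name": "Follows",
        "userInteractionCount": 12345
    }
};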

Step 2: Loading URLs To Scrape

Our parsing function takes in a row and uses its url to look up a profile. Here, we'll write another function called processResults(). The goal here is simple: read our CSV file into an array of JSON objects, then run processProfile() on each profile from the array.

We set this function up a lot like the startCrawl() function from earlier. You might notice that we take a concurrencyLimit as one of our arguments.

We don't do anything with it now, but we'll use it when we add concurrency later.

async function processResults(csvFile, location, concurrencyLimit, retries) {
    const rows = await readCsv(csvFile);
    const browser = await puppeteer.launch();

    for (const row of rows) {
        await processProfile(browser, row, location, retries);
    }
    await browser.close();
}

As you can see above, our function begins by reading a CSV file. We also write a function to do that.

async function readCsv(inputFile) {
    const results = [];
    const parser = fs.createReadStream(inputFile).pipe(csvParse.parse({
        columns: true,
        delimiter: ",",
        trim: true,
        skip_empty_lines: true
    }));

    for await (const record of parser) {
        results.push(record);
    }
    return results;
}
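For reference, readCsv() resolves to an array of plain objects keyed by the CSV headers. A quick usage example with placeholder values:

// Read the crawler report back into memory. Each row is a plain object keyed
// by the CSV headers, e.g. (illustrative values):
// { name: "williamhgates", display_name: "Bill Gates", url: "https://www.linkedin.com/in/williamhgates", location: "n/a", companies: "n/a" }
(async () => {
    const rows = await readCsv("bill-gates.csv");
    console.log(`${rows.length} rows loaded`);
})();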

You can see how everything fits together in our code below.

const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");

const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;

async function writeToCsv(data, outputFile) {
let success = false;
while (!success) {

if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);

if (!(data instanceof Array)) {
data = [data]
}

const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))

const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
success = true;
} catch (e) {
console.log("Failed data", data);
throw new Error("Failed to write to csv");
}
}
}

async function readCsv(inputFile) {
const results = [];
const parser = fs.createReadStream(inputFile).pipe(csvParse.parse({
columns: true,
delimiter: ",",
trim: true,
skip_empty_lines: true
}));

for await (const record of parser) {
results.push(record);
}
return results;
}


function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}

async function crawlProfiles(browser, keyword, location="us", retries=3) {
let tries = 0;
let success = false;

while (tries <= retries && !success) {

const firstName = keyword.split(" ")[0];
const lastName = keyword.split(" ")[1]


const page = await browser.newPage();
try {
const url = `https://www.linkedin.com/pub/dir?firstName=${firstName}&lastName=${lastName}&trk=people-guest_people-search-bar_search-submit`;

const proxyUrl = getScrapeOpsUrl(url, location);
await page.goto(proxyUrl, { timeout: 0 });

console.log(`Successfully fetched: ${url}`);

const divCards = await page.$$("div[class='base-search-card__info']");

for (const divCard of divCards) {

const link = await page.evaluate(element => element.parentElement.getAttribute("href"), divCard);
const splitLink = link.split("/")
const name = splitLink[splitLink.length-1].split("?")[0];

const displayNameElement = await divCard.$("h3[class='base-search-card__title']");
const displayName = await page.evaluate(element => element.textContent, displayNameElement);

const locationElement = await page.$("p[class='people-search-card__location']");
const location = await page.evaluate(element => element.textContent, locationElement);

let companies = "n/a";

const hasCompanies = await page.$("span[class='entity-list-meta__entities-list']");

if (hasCompanies) {
companies = await page.evaluate(element => element.textContent, hasCompanies);
}


const searchData = {
name: name.trim(),
display_name: displayName.trim(),
url: link.trim(),
location: location.trim(),
companies: companies.trim()
};

await writeToCsv([searchData], `${keyword.replace(" ", "-")}.csv`);
}

success = true;

} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;

} finally {
await page.close();
}
}
}

async function startCrawl(keywordList, location, concurrencyLimit, retries) {

const browser = await puppeteer.launch();

while (keywordList.length > 0) {
const currentBatch = keywordList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(keyword => crawlProfiles(browser, keyword, location, retries));

try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}

await browser.close();
}

async function processProfile(browser, row, location, retries = 3) {
const url = row.url;
let tries = 0;
let success = false;

while (tries <= retries && !success) {
const page = await browser.newPage();

try {
const response = await page.goto(url);
if (!response || response.status() !== 200) {
throw new Error("Failed to fetch page, status:", response.status());
}

const head = await page.$("head");
const scriptElement = await head.$("script[type='application/ld+json']");
const jsonText = await page.evaluate(element => element.textContent, scriptElement);

const jsonDataGraph = JSON.parse(jsonText)["@graph"];
let jsonData = {};
for (const element of jsonDataGraph) {
if (element["@type"] === "Person") {
jsonData = element;
break;
}
}

let company = "n/a";
let companyProfile = "n/a";
let jobTitle = "n/a";

if ("jobTitle" in jsonData && Array.isArray(jsonData.jobTitle) && jsonData.jobTitle.length > 0) {
jobTitle = jsonData.jobTitle[0];
}

const hasCompany = "worksFor" in jsonData && jsonData.worksFor.length > 0;

if (hasCompany) {
company = jsonData.worksFor[0].name;
const hasCompanyUrl = "url" in jsonData.worksFor[0];
if (hasCompanyUrl) {
companyProfile = jsonData.worksFor[0].url
}
}

const hasInteractions = "interactionStatistic" in jsonData;
let followers = 0;
if (hasInteractions) {
const stats = jsonData.interactionStatistic;
if (stats.name === "Follows" && stats["@type"] === "InteractionCounter") {
followers = stats.userInteractionCount;
}
}

const profileData = {
name: row.name,
company: company,
company_profile: companyProfile,
job_title: jobTitle,
followers: followers
}

console.log(profileData);

success = true;
console.log("Successfully parsed", row.url);


} catch (err) {
tries++;
console.log(`Error: ${err}, tries left: ${retries-tries}, url: ${getScrapeOpsUrl(url)}`);

} finally {
await page.close();
}
}
}

async function processResults(csvFile, location, concurrencyLimit, retries) {
const rows = await readCsv(csvFile);
const browser = await puppeteer.launch();

for (const row of rows) {
await processProfile(browser, row, location, retries);
}
await browser.close();

}

async function main() {
const keywords = ["bill gates", "elon musk"];
const concurrencyLimit = 5;
const location = "us";
const retries = 3;
const aggregateFiles = [];

console.log("Crawl starting");
console.time("startCrawl");
for (const keyword of keywords) {
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
await startCrawl(keywords, location, concurrencyLimit, retries);
console.timeEnd("startCrawl");
console.log("Crawl complete");


console.log("Starting scrape");
for (const file of aggregateFiles) {
console.log(file)
console.time("processResults");
await processResults(file, location, concurrencyLimit, retries);
console.timeEnd("processResults");
}
console.log("Scrape complete");
}


main();
  • processProfile() extracts data from individual profiles.
  • processResults() reads our CSV file and runs processProfile() on all of the profiles from our CSV.

Step 3: Storing the Scraped Data

writeToCsv() already gives us the ability to write JSON objects to a CSV file. We also already convert our extracted data into a JSON object.

Instead of printing our JSON object to the console, we need to pass it into writeToCsv(). That's the only line that changes here.
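Concretely, the console.log(profileData) call from the previous step becomes:

await writeToCsv([profileData], `${row.name.replace(" ", "-")}.csv`);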

Here's our fully updated code.

const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");

const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;

async function writeToCsv(data, outputFile) {
let success = false;
while (!success) {

if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);

if (!(data instanceof Array)) {
data = [data]
}

const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))

const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
success = true;
} catch (e) {
console.log("Failed data", data);
throw new Error("Failed to write to csv");
}
}
}

async function readCsv(inputFile) {
const results = [];
const parser = fs.createReadStream(inputFile).pipe(csvParse.parse({
columns: true,
delimiter: ",",
trim: true,
skip_empty_lines: true
}));

for await (const record of parser) {
results.push(record);
}
return results;
}


function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}

async function crawlProfiles(browser, keyword, location="us", retries=3) {
let tries = 0;
let success = false;

while (tries <= retries && !success) {

const firstName = keyword.split(" ")[0];
const lastName = keyword.split(" ")[1]


const page = await browser.newPage();
try {
const url = `https://www.linkedin.com/pub/dir?firstName=${firstName}&lastName=${lastName}&trk=people-guest_people-search-bar_search-submit`;

const proxyUrl = getScrapeOpsUrl(url, location);
await page.goto(proxyUrl, { timeout: 0 });

console.log(`Successfully fetched: ${url}`);

const divCards = await page.$$("div[class='base-search-card__info']");

for (const divCard of divCards) {

const link = await page.evaluate(element => element.parentElement.getAttribute("href"), divCard);
const splitLink = link.split("/")
const name = splitLink[splitLink.length-1].split("?")[0];

const displayNameElement = await divCard.$("h3[class='base-search-card__title']");
const displayName = await page.evaluate(element => element.textContent, displayNameElement);

const locationElement = await page.$("p[class='people-search-card__location']");
const location = await page.evaluate(element => element.textContent, locationElement);

let companies = "n/a";

const hasCompanies = await page.$("span[class='entity-list-meta__entities-list']");

if (hasCompanies) {
companies = await page.evaluate(element => element.textContent, hasCompanies);
}


const searchData = {
name: name.trim(),
display_name: displayName.trim(),
url: link.trim(),
location: location.trim(),
companies: companies.trim()
};

await writeToCsv([searchData], `${keyword.replace(" ", "-")}.csv`);
}

success = true;

} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;

} finally {
await page.close();
}
}
}

async function startCrawl(keywordList, location, concurrencyLimit, retries) {

const browser = await puppeteer.launch();

while (keywordList.length > 0) {
const currentBatch = keywordList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(keyword => crawlProfiles(browser, keyword, location, retries));

try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}

await browser.close();
}

async function processProfile(browser, row, location, retries = 3) {
const url = row.url;
let tries = 0;
let success = false;

while (tries <= retries && !success) {
const page = await browser.newPage();

try {
const response = await page.goto(url);
if (!response || response.status() !== 200) {
throw new Error("Failed to fetch page, status:", response.status());
}

const head = await page.$("head");
const scriptElement = await head.$("script[type='application/ld+json']");
const jsonText = await page.evaluate(element => element.textContent, scriptElement);

const jsonDataGraph = JSON.parse(jsonText)["@graph"];
let jsonData = {};
for (const element of jsonDataGraph) {
if (element["@type"] === "Person") {
jsonData = element;
break;
}
}

let company = "n/a";
let companyProfile = "n/a";
let jobTitle = "n/a";

if ("jobTitle" in jsonData && Array.isArray(jsonData.jobTitle) && jsonData.jobTitle.length > 0) {
jobTitle = jsonData.jobTitle[0];
}

const hasCompany = "worksFor" in jsonData && jsonData.worksFor.length > 0;

if (hasCompany) {
company = jsonData.worksFor[0].name;
const hasCompanyUrl = "url" in jsonData.worksFor[0];
if (hasCompanyUrl) {
companyProfile = jsonData.worksFor[0].url
}
}

const hasInteractions = "interactionStatistic" in jsonData;
let followers = 0;
if (hasInteractions) {
const stats = jsonData.interactionStatistic;
if (stats.name === "Follows" && stats["@type"] === "InteractionCounter") {
followers = stats.userInteractionCount;
}
}

const profileData = {
name: row.name,
company: company,
company_profile: companyProfile,
job_title: jobTitle,
followers: followers
}

await writeToCsv([profileData], `${row.name.replace(" ", "-")}.csv`);

success = true;
console.log("Successfully parsed", row.url);


} catch (err) {
tries++;
console.log(`Error: ${err}, tries left: ${retries-tries}, url: ${getScrapeOpsUrl(url)}`);

} finally {
await page.close();
}
}
}

async function processResults(csvFile, location, concurrencyLimit, retries) {
const rows = await readCsv(csvFile);
const browser = await puppeteer.launch();

for (const row of rows) {
await processProfile(browser, row, location, retries);
}
await browser.close();

}

async function main() {
const keywords = ["bill gates", "elon musk"];
const concurrencyLimit = 5;
const location = "us";
const retries = 3;
const aggregateFiles = [];

console.log("Crawl starting");
console.time("startCrawl");
for (const keyword of keywords) {
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
await startCrawl(keywords, location, concurrencyLimit, retries);
console.timeEnd("startCrawl");
console.log("Crawl complete");


console.log("Starting scrape");
for (const file of aggregateFiles) {
console.log(file)
console.time("processResults");
await processResults(file, location, concurrencyLimit, retries);
console.timeEnd("processResults");
}
console.log("Scrape complete");
}


main();

We now pass our profileData into writeToCsv(). This stores our extracted data safely.


Step 4: Adding Concurrency

Remember when we mentioned the concurrencyLimit before?

Now it's time to actually use it. Here, we'll once again use splice() to cut our array into chunks.

We convert each chunk into an array of async tasks. Then we await our tasks using Promise.all() so each task can resolve.

async function processResults(csvFile, location, concurrencyLimit, retries) {
    const rows = await readCsv(csvFile);
    const browser = await puppeteer.launch();

    while (rows.length > 0) {
        const currentBatch = rows.splice(0, concurrencyLimit);
        const tasks = currentBatch.map(row => processProfile(browser, row, location, retries));

        try {
            await Promise.all(tasks);
        } catch (err) {
            console.log(`Failed to process batch: ${err}`);
        }
    }
    await browser.close();
}

Step 5: Bypassing Anti-Bots

Our crawler has already been integrated with the ScrapeOps Proxy Aggregator using getScrapeOpsUrl(). Now we need our scraper to get past any anti-bots on the profile pages as well.

We're going to change one line in our parsing function: the await page.goto() call.

const response = await page.goto(getScrapeOpsUrl(url, location), { timeout: 0 });

With that, we've unlocked the power of the proxy.

Our finished profile scraper is available below.

const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");

const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;

async function writeToCsv(data, outputFile) {
let success = false;
while (!success) {

if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);

if (!(data instanceof Array)) {
data = [data]
}

const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))

const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
success = true;
} catch (e) {
console.log("Failed data", data);
throw new Error("Failed to write to csv");
}
}
}

async function readCsv(inputFile) {
const results = [];
const parser = fs.createReadStream(inputFile).pipe(csvParse.parse({
columns: true,
delimiter: ",",
trim: true,
skip_empty_lines: true
}));

for await (const record of parser) {
results.push(record);
}
return results;
}


function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}

async function crawlProfiles(browser, keyword, location="us", retries=3) {
let tries = 0;
let success = false;

while (tries <= retries && !success) {

const firstName = keyword.split(" ")[0];
const lastName = keyword.split(" ")[1]


const page = await browser.newPage();
try {
const url = `https://www.linkedin.com/pub/dir?firstName=${firstName}&lastName=${lastName}&trk=people-guest_people-search-bar_search-submit`;

const proxyUrl = getScrapeOpsUrl(url, location);
await page.goto(proxyUrl, { timeout: 0 });

console.log(`Successfully fetched: ${url}`);

const divCards = await page.$$("div[class='base-search-card__info']");

for (const divCard of divCards) {

const link = await page.evaluate(element => element.parentElement.getAttribute("href"), divCard);
const splitLink = link.split("/")
const name = splitLink[splitLink.length-1].split("?")[0];

const displayNameElement = await divCard.$("h3[class='base-search-card__title']");
const displayName = await page.evaluate(element => element.textContent, displayNameElement);

const locationElement = await page.$("p[class='people-search-card__location']");
const location = await page.evaluate(element => element.textContent, locationElement);

let companies = "n/a";

const hasCompanies = await page.$("span[class='entity-list-meta__entities-list']");

if (hasCompanies) {
companies = await page.evaluate(element => element.textContent, hasCompanies);
}


const searchData = {
name: name.trim(),
display_name: displayName.trim(),
url: link.trim(),
location: location.trim(),
companies: companies.trim()
};

await writeToCsv([searchData], `${keyword.replace(" ", "-")}.csv`);
}

success = true;

} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;

} finally {
await page.close();
}
}
}

async function startCrawl(keywordList, location, concurrencyLimit, retries) {

const browser = await puppeteer.launch();

while (keywordList.length > 0) {
const currentBatch = keywordList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(keyword => crawlProfiles(browser, keyword, location, retries));

try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}

await browser.close();
}

async function processProfile(browser, row, location, retries = 3) {
const url = row.url;
let tries = 0;
let success = false;

while (tries <= retries && !success) {
const page = await browser.newPage();

try {
const response = await page.goto(getScrapeOpsUrl(url, location), { timeout: 0 });
if (!response || response.status() !== 200) {
throw new Error("Failed to fetch page, status:", response.status());
}

const head = await page.$("head");
const scriptElement = await head.$("script[type='application/ld+json']");
const jsonText = await page.evaluate(element => element.textContent, scriptElement);

const jsonDataGraph = JSON.parse(jsonText)["@graph"];
let jsonData = {};
for (const element of jsonDataGraph) {
if (element["@type"] === "Person") {
jsonData = element;
break;
}
}

let company = "n/a";
let companyProfile = "n/a";
let jobTitle = "n/a";

if ("jobTitle" in jsonData && Array.isArray(jsonData.jobTitle) && jsonData.jobTitle.length > 0) {
jobTitle = jsonData.jobTitle[0];
}

const hasCompany = "worksFor" in jsonData && jsonData.worksFor.length > 0;

if (hasCompany) {
company = jsonData.worksFor[0].name;
const hasCompanyUrl = "url" in jsonData.worksFor[0];
if (hasCompanyUrl) {
companyProfile = jsonData.worksFor[0].url
}
}

const hasInteractions = "interactionStatistic" in jsonData;
let followers = 0;
if (hasInteractions) {
const stats = jsonData.interactionStatistic;
if (stats.name === "Follows" && stats["@type"] === "InteractionCounter") {
followers = stats.userInteractionCount;
}
}

const profileData = {
name: row.name,
company: company,
company_profile: companyProfile,
job_title: jobTitle,
followers: followers
}

await writeToCsv([profileData], `${row.name.replace(" ", "-")}.csv`);

success = true;
console.log("Successfully parsed", row.url);


} catch (err) {
tries++;
console.log(`Error: ${err}, tries left: ${retries-tries}, url: ${getScrapeOpsUrl(url)}`);

} finally {
await page.close();
}
}
}

async function processResults(csvFile, location, concurrencyLimit, retries) {
const rows = await readCsv(csvFile);
const browser = await puppeteer.launch();

while (rows.length > 0) {
const currentBatch = rows.splice(0, concurrencyLimit);
const tasks = currentBatch.map(row => processProfile(browser, row, location, retries));

try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();

}

async function main() {
const keywords = ["bill gates", "elon musk"];
const concurrencyLimit = 5;
const location = "us";
const retries = 3;
const aggregateFiles = [];

console.log("Crawl starting");
console.time("startCrawl");
for (const keyword of keywords) {
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
await startCrawl(keywords, location, concurrencyLimit, retries);
console.timeEnd("startCrawl");
console.log("Crawl complete");


console.log("Starting scrape");
for (const file of aggregateFiles) {
console.log(file)
console.time("processResults");
await processResults(file, location, concurrencyLimit, retries);
console.timeEnd("processResults");
}
console.log("Scrape complete");
}


main();

Step 6: Production Run

Now, we'll test the full script out in production. Like before, feel free to change any of the following:

  • keywords
  • concurrencyLimit
  • location
  • retries

This time, our crawl took 73.6 seconds. You can see a screenshot of our full results below.

Scraper Performance

This time around, we generated two crawl reports with a total of 78 results. It took 327.482 seconds to scrape all of the bill gates results. The elon musk scrape took 248.634 seconds.

Our total time spent scraping is 576.116 seconds. 576.116 seconds / 78 results = 7.386 seconds per result. We're scraping pages even faster than we crawled them!

Our overall program is running pretty well!


Legal and Ethical Considerations

According to precedent set by numerous court cases, scraping the public web (including LinkedIn) is perfectly legal. In this tutorial, we made sure to only scrape public data from LinkedIn.

When scraping private data (data behind a login), that's a completely different story and you're subject to a completely different set of rules and regulations.

Although our crawler and scraper here were completely legal, we definitely violated LinkedIn's terms of service and robots.txt. You can view their terms here and you may view their robots.txt here.

Failure to comply with these policies can result in suspension or even permanent removal of your LinkedIn account.

If you're unsure whether your scraper is legal or not, consult an attorney.


Conclusion

LinkedIn profiles are among the most difficult pages to scrape on the web. The ScrapeOps Proxy Aggregator pushes right past their anti-bots and gets us to the data we need.

By this point, you've completed the tutorial and you should have a solid grasp on how to use Puppeteer to extract data from LinkedIn. You now should also have a solid grasp of parsing, data storage, concurrency, and proxy integration.

You can dig deeper into the tech we used by clicking the links below.


More NodeJS Web Scraping Guides

We always have something for you here at ScrapeOps. Whether you're just learning how to code, or you're a seasoned dev, you can gain something from our tutorials.

Check out our Puppeteer Web Scraping Playbook.

If you want to learn how to scrape another tricky site, check out the links below!