How to Scrape LinkedIn Profiles With Puppeteer
LinkedIn was created in 2003, and over the course of its existence it has accumulated an enormous amount of professional data.
In today's guide, we'll scrape LinkedIn profiles and walk through the process in detail. While the profiles are very difficult to scrape, if you know what to do, you can get past their seemingly unbeatable system of redirects.
- TLDR: How to Scrape LinkedIn Profiles
- How To Architect Our Scraper
- Understanding How To Scrape LinkedIn Profiles
- Setting Up Our LinkedIn Profiles Scraper
- Build A LinkedIn Profiles Search Crawler
- Build A LinkedIn Profile Scraper
- Legal and Ethical Considerations
- Conclusion
- More Puppeteer Web Scraping Guides
Need help scraping the web?
Then check out ScrapeOps, the complete toolkit for web scraping.
TLDR - How to Scrape LinkedIn Profiles
For those of you without time to read, we've got a prebuilt scraper you can use.
- It first runs a crawl and generates a report based on our search results.
- Once we've got a report generated, our scraper will read the report and scrape each individual profile discovered during the crawl.
- Start by creating a new project folder with a config.json file.
- Inside your config file, add your ScrapeOps API key: {"api_key": "your-super-secret-api-key"}.
- Then, copy and paste the code below into a JavaScript file.
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;
async function crawlProfiles(browser, keyword, location="us", retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const firstName = keyword.split(" ")[0];
const lastName = keyword.split(" ")[1]
const page = await browser.newPage();
try {
const url = `https://www.linkedin.com/pub/dir?firstName=${firstName}&lastName=${lastName}&trk=people-guest_people-search-bar_search-submit`;
await page.goto(url);
console.log(`Successfully fetched: ${url}`);
const divCards = await page.$$("div[class='base-search-card__info']");
for (const divCard of divCards) {
const link = await page.evaluate(element => element.parentElement.getAttribute("href"), divCard);
const splitLink = link.split("/")
const name = splitLink[splitLink.length-1].split("?")[0];
const displayNameElement = await divCard.$("h3[class='base-search-card__title']");
const displayName = await page.evaluate(element => element.textContent, displayNameElement);
const locationElement = await page.$("p[class='people-search-card__location']");
const location = await page.evaluate(element => element.textContent, locationElement);
let companies = "n/a";
const hasCompanies = await page.$("span[class='entity-list-meta__entities-list']");
if (hasCompanies) {
companies = await page.evaluate(element => element.textContent, hasCompanies);
}
const searchData = {
name: name.trim(),
display_name: displayName.trim(),
url: link.trim(),
location: location.trim(),
companies: companies.trim()
};
console.log(searchData);
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}
async function startCrawl(keywordList, location, retries) {
const browser = await puppeteer.launch();
for (const keyword of keywordList) {
await crawlProfiles(browser, keyword, location, retries);
}
await browser.close();
}
async function main() {
const keywords = ["bill gates", "elon musk"];
const concurrencyLimit = 5;
const location = "us";
const retries = 3;
const aggregateFiles = [];
console.log("Crawl starting");
console.time("startCrawl");
for (const keyword of keywords) {
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
await startCrawl(keywords, location, retries);
console.timeEnd("startCrawl");
console.log("Crawl complete");
}
main();
To change your results, you can change any of the following constants in main():
- keywords
- concurrencyLimit
- location
- retries
How To Architect Our LinkedIn Profiles Scraper
LinkedIn is difficult to scrape. When you navigate to their site from your browser, if you're not logged in, you get redirected and prompted to sign in. If you're new to scraping, their anti-bot system can seem impassable. With some due diligence, we can get around all of this. We're going to write a search crawler and a profile scraper.
Our crawler takes in a keyword and searches for it. If we want to search for Bill Gates, our crawler will run that search and then it'll save each Bill Gates that it finds from the results.
Afterward, it'll be time for our profile scraper. The profile scraper starts right where the crawler leaves off. It reads the CSV and then scrapes each individual profile found in the CSV file.
At a high level, our profile crawler needs to:
- Perform a search and parse the search results.
- Store those parsed results.
- Concurrently run steps 1 and 2 on multiple searches.
- Use proxy integration to get past LinkedIn's anti-bots.
Our profile scraper needs to perform the following steps:
- Read the crawler's report into an array.
- Parse a row from the array.
- Store parsed profile data.
- Run steps 2 and 3 on multiple pages concurrently.
- Utilize a proxy to bypass anti-bots.
Understanding How To Scrape LinkedIn Profiles
We can't just start building our scrapers; we need to understand exactly where our data is and plan out how to extract it from the page. We'll use the ScrapeOps Proxy Aggregator API to handle our geolocation and bypass anti-bots.
These next few sections will highlight our requirements when building the crawler and the scraper.
Step 1: How To Request LinkedIn Profiles Pages
We need to know how to GET our pages from LinkedIn: both the search results and the individual profile pages.
Look at the images below so you can gain a better understanding of these two page types. First, we'll look at our search results page, then we'll examine the individual profile page.
You can view a search for Bill Gates in the shot below. Our URL is:
https://www.linkedin.com/pub/dir?firstName=bill&lastName=gates&trk=people-guest_people-search-bar_search-submit
We're prompted to sign in as soon as we get to the page, but this isn't really an issue because our full page is still intact under the prompt.
Our final URL format looks like this:
https://www.linkedin.com/pub/dir?firstName={first_name}&lastName={last_name}&trk=people-guest_people-search-bar_search-submit
To scrape individual profiles, we need a better feel for the profile layout. Here's a look at the profile of Bill Gates. We're once again prompted to sign in, but the underlying page is intact.
Our URL is:
https://www.linkedin.com/in/williamhgates?trk=people-guest_people_search-card
All of our profile links look like this:
https://www.linkedin.com/in/{name_of_profile}
We remove the query string at the end because, for some unknown reason, anti-bots are less likely to block us when we format the URL this way.
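As a rough sketch, here's how we might build these URLs in JavaScript. The helper names buildSearchUrl and cleanProfileUrl are purely illustrative and aren't part of the final scraper:

// Build the people-search URL from a "first last" keyword
function buildSearchUrl(keyword) {
    const firstName = keyword.split(" ")[0];
    const lastName = keyword.split(" ")[1];
    return `https://www.linkedin.com/pub/dir?firstName=${firstName}&lastName=${lastName}&trk=people-guest_people-search-bar_search-submit`;
}

// Strip the query string so a profile link matches the cleaner format above
function cleanProfileUrl(link) {
    return link.split("?")[0];
}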
Step 2: How To Extract Data From LinkedIn Profiles Results and Pages
Time to figure out how to get our data. If you look at our search results, each result is a div element with a class of 'base-search-card__info'. For individual profiles, we pull our data from a JSON blob inside the head of the page.
In the image below, you can see a profile page. There is a ton of data inside this JSON blob.
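If you want to peek at that blob yourself, a minimal Puppeteer sketch looks something like this. It simply prints the raw JSON, assumes `page` is already pointed at a profile, and the full parsing logic comes later in this guide:

// Grab the ld+json script from the page head and print its contents
const scriptElement = await page.$("head script[type='application/ld+json']");
const jsonText = await page.evaluate(element => element.textContent, scriptElement);
console.log(JSON.parse(jsonText));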
Step 3: Geolocated Data
With the ScrapeOps Proxy Aggregator, we can choose which country we want to appear in.
The ScrapeOps API allows us to pass a country parameter. ScrapeOps reads this parameter and routes our request through the corresponding country.
- If we want to appear in the US, we can pass "country": "us".
- If we want to appear in the UK, we can pass "country": "uk".
You can view the full list of supported countries on this page.
ScrapeOps gives great geotargeting support at no additional charge. There are other proxy providers that charge you extra API credits to use their geotargeting.
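Here's a rough preview of what this looks like in code. It's essentially the getScrapeOpsUrl() helper we'll build later in this guide; API_KEY and targetUrl stand in for your key and the LinkedIn page you actually want:

const params = new URLSearchParams({
    api_key: API_KEY,   // your ScrapeOps API key
    url: targetUrl,     // the LinkedIn page you actually want
    country: "us"       // appear in the US; swap for "uk", etc.
});
const proxiedUrl = `https://proxy.scrapeops.io/v1/?${params.toString()}`;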
Setting Up Our LinkedIn Profiles Scraper Project
Time to start building. We need to create a new project folder and initialize it as a NodeJS project. Then we'll install Puppeteer and a few other basic dependencies.
Create a New Project Folder
mkdir linkedin-profiles-scraper
cd linkedin-profiles-scraper
Turn it into a JavaScript Project
npm init -y
Install Our Dependencies
npm install puppeteer
npm install csv-writer
npm install csv-parse
(The fs module is built into Node.js, so there's no need to install it separately.)
We're all set to begin coding.
Build A LinkedIn Profiles Search Crawler
We've already outlined the requirements for our crawler. Time to go about building our crawler and adding these features in. As previously mentioned, our whole project starts with our crawler.
Our crawler will run a search, parse the results, and then save our data to a CSV file. Once our crawler can do these tasks, we'll need to add concurrency and proxy support.
In the coming sections, we'll go through step by step and build all of these features into our crawler.
Step 1: Create Simple Search Data Parser
Everything stems from our parsing function.
In the script below, we'll handle our imports, retries, and, of course, our parsing logic. Everything built afterward will sit on top of this basic design. Take a look at our parsing function, crawlProfiles().
As we discovered earlier, we need to find all of our target div elements. Once we've got them, we'll iterate through them with a for loop and extract their data.
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;
async function crawlProfiles(browser, keyword, location="us", retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const firstName = keyword.split(" ")[0];
const lastName = keyword.split(" ")[1]
const page = await browser.newPage();
try {
const url = `https://www.linkedin.com/pub/dir?firstName=${firstName}&lastName=${lastName}&trk=people-guest_people-search-bar_search-submit`;
await page.goto(url);
console.log(`Successfully fetched: ${url}`);
const divCards = await page.$$("div[class='base-search-card__info']");
for (const divCard of divCards) {
const link = await page.evaluate(element => element.parentElement.getAttribute("href"), divCard);
const splitLink = link.split("/")
const name = splitLink[splitLink.length-1].split("?")[0];
const displayNameElement = await divCard.$("h3[class='base-search-card__title']");
const displayName = await page.evaluate(element => element.textContent, displayNameElement);
const locationElement = await page.$("p[class='people-search-card__location']");
const location = await page.evaluate(element => element.textContent, locationElement);
let companies = "n/a";
const hasCompanies = await page.$("span[class='entity-list-meta__entities-list']");
if (hasCompanies) {
companies = await page.evaluate(element => element.textContent, hasCompanies);
}
const searchData = {
name: name.trim(),
display_name: displayName.trim(),
url: link.trim(),
location: location.trim(),
companies: companies.trim()
};
console.log(searchData);
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}
async function startCrawl(keywordList, location, retries) {
const browser = await puppeteer.launch();
for (const keyword of keywordList) {
await crawlProfiles(browser, keyword, location, retries);
}
await browser.close();
}
async function main() {
const keywords = ["bill gates", "elon musk"];
const concurrencyLimit = 5;
const location = "us";
const retries = 3;
const aggregateFiles = [];
console.log("Crawl starting");
console.time("startCrawl");
for (const keyword of keywords) {
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
await startCrawl(keywords, location, retries);
console.timeEnd("startCrawl");
console.log("Crawl complete");
}
main();
- await page.$$("div[class='base-search-card__info']") returns all of the profile cards we're looking for.
- As we iterate through the profile cards:
  - await page.evaluate(element => element.parentElement.getAttribute("href"), divCard) finds our link.
  - await divCard.$("h3[class='base-search-card__title']") yields our displayNameElement.
  - We extract its text with await page.evaluate(element => element.textContent, displayNameElement).
  - await page.$("p[class='people-search-card__location']") gives us the locationElement.
  - We extract its text the same way we extracted the text from our displayNameElement.
  - We check the span elements to see if there are companies present. If there are, we extract them; if not, we assign a default value of "n/a".
Step 2: Storing the Scraped Data
We need to store our extracted data. Without a way to store it, this extracted data is useless. In this section, we'll write a function that takes in an array of JSON objects and writes the array to a CSV file. We should craft this function carefully.
This function should:
- Check whether the output file exists. If it already exists, we open it in append mode; otherwise, we create a new one.
- Check whether our data is an array. If it isn't, we convert it to one.
- Not exit until the CSV file has been written. Storage failure shouldn't be an option.
Here is writeToCsv().
async function writeToCsv(data, outputFile) {
let success = false;
while (!success) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);
if (!(data instanceof Array)) {
data = [data]
}
const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))
const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
success = true;
} catch (e) {
console.log("Failed data", data);
throw new Error("Failed to write to csv");
}
}
}
Now that we have data storage, our code looks like this.
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;
async function writeToCsv(data, outputFile) {
let success = false;
while (!success) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);
if (!(data instanceof Array)) {
data = [data]
}
const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))
const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
success = true;
} catch (e) {
console.log("Failed data", data);
throw new Error("Failed to write to csv");
}
}
}
async function crawlProfiles(browser, keyword, location="us", retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const firstName = keyword.split(" ")[0];
const lastName = keyword.split(" ")[1]
const page = await browser.newPage();
try {
const url = `https://www.linkedin.com/pub/dir?firstName=${firstName}&lastName=${lastName}&trk=people-guest_people-search-bar_search-submit`;
await page.goto(url);
console.log(`Successfully fetched: ${url}`);
const divCards = await page.$$("div[class='base-search-card__info']");
for (const divCard of divCards) {
const link = await page.evaluate(element => element.parentElement.getAttribute("href"), divCard);
const splitLink = link.split("/")
const name = splitLink[splitLink.length-1].split("?")[0];
const displayNameElement = await divCard.$("h3[class='base-search-card__title']");
const displayName = await page.evaluate(element => element.textContent, displayNameElement);
const locationElement = await page.$("p[class='people-search-card__location']");
const location = await page.evaluate(element => element.textContent, locationElement);
let companies = "n/a";
const hasCompanies = await page.$("span[class='entity-list-meta__entities-list']");
if (hasCompanies) {
companies = await page.evaluate(element => element.textContent, hasCompanies);
}
const searchData = {
name: name.trim(),
display_name: displayName.trim(),
url: link.trim(),
location: location.trim(),
companies: companies.trim()
};
await writeToCsv([searchData], `${keyword.replace(" ", "-")}.csv`);
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}
async function startCrawl(keywordList, location, retries) {
const browser = await puppeteer.launch();
for (const keyword of keywordList) {
await crawlProfiles(browser, keyword, location, retries);
}
await browser.close();
}
async function main() {
const keywords = ["bill gates", "elon musk"];
const concurrencyLimit = 5;
const location = "us";
const retries = 3;
const aggregateFiles = [];
console.log("Crawl starting");
console.time("startCrawl");
for (const keyword of keywords) {
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
await startCrawl(keywords, location, retries);
console.timeEnd("startCrawl");
console.log("Crawl complete");
}
main();
- Like earlier, we use our extracted data to create a searchData object.
- We pass our searchData into writeToCsv() and store it in a CSV file.
Step 3: Adding Concurrency
When deploying a scraper to production, it should be fast and efficient. Now that we have a working scraper, we need to make it faster and more efficient. NodeJS runs JavaScript on a single thread, but we don't need multithreading to scrape pages concurrently. We just need to rewrite startCrawl() to run on multiple pages simultaneously.
To accomplish this, we're going to take advantage of JavaScript's async support. Take a look at the example below.
async function startCrawl(keywordList, location, concurrencyLimit, retries) {
const browser = await puppeteer.launch();
while (keywordList.length > 0) {
const currentBatch = keywordList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(keyword => crawlProfiles(browser, keyword, location, retries));
try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();
}
We no longer depend on a for loop. Instead, we create a list of async tasks and use Promise.all() to wait for them all to resolve.
When we search for bill gates and elon musk, both of these pages get fetched and parsed concurrently. We wait for both to resolve before closing the browser and exiting the function.
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;
async function writeToCsv(data, outputFile) {
let success = false;
while (!success) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);
if (!(data instanceof Array)) {
data = [data]
}
const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))
const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
success = true;
} catch (e) {
console.log("Failed data", data);
throw new Error("Failed to write to csv");
}
}
}
async function crawlProfiles(browser, keyword, location="us", retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const firstName = keyword.split(" ")[0];
const lastName = keyword.split(" ")[1]
const page = await browser.newPage();
try {
const url = `https://www.linkedin.com/pub/dir?firstName=${firstName}&lastName=${lastName}&trk=people-guest_people-search-bar_search-submit`;
await page.goto(url);
console.log(`Successfully fetched: ${url}`);
const divCards = await page.$$("div[class='base-search-card__info']");
for (const divCard of divCards) {
const link = await page.evaluate(element => element.parentElement.getAttribute("href"), divCard);
const splitLink = link.split("/")
const name = splitLink[splitLink.length-1].split("?")[0];
const displayNameElement = await divCard.$("h3[class='base-search-card__title']");
const displayName = await page.evaluate(element => element.textContent, displayNameElement);
const locationElement = await page.$("p[class='people-search-card__location']");
const location = await page.evaluate(element => element.textContent, locationElement);
let companies = "n/a";
const hasCompanies = await page.$("span[class='entity-list-meta__entities-list']");
if (hasCompanies) {
companies = await page.evaluate(element => element.textContent, hasCompanies);
}
const searchData = {
name: name.trim(),
display_name: displayName.trim(),
url: link.trim(),
location: location.trim(),
companies: companies.trim()
};
await writeToCsv([searchData], `${keyword.replace(" ", "-")}.csv`);
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}
async function startCrawl(keywordList, location, concurrencyLimit, retries) {
const browser = await puppeteer.launch();
while (keywordList.length > 0) {
const currentBatch = keywordList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(keyword => crawlProfiles(browser, keyword, location, retries));
try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();
}
async function main() {
const keywords = ["bill gates", "elon musk"];
const concurrencyLimit = 5;
const location = "us";
const retries = 3;
const aggregateFiles = [];
console.log("Crawl starting");
console.time("startCrawl");
for (const keyword of keywords) {
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
await startCrawl(keywords, location, concurrencyLimit, retries);
console.timeEnd("startCrawl");
console.log("Crawl complete");
}
main();
Step 4: Bypassing Anti-Bots
As we mentioned previously, we'll use the ScrapeOps Proxy Aggregator to bypass anti-bots.
One small function unlocks the power of the ScrapeOps Proxy. It takes in a URL, wraps it up with our api_key and location using URL encoding, and returns a new ScrapeOps proxied URL.
When we talk to the ScrapeOps API, the country param tells ScrapeOps our location of choice. ScrapeOps then routes us through a server based in that location.
There are many other options we can use, such as residential and mobile, but typically the country parameter is enough.
function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}
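As a quick usage example, wrapping the Bill Gates search URL and fetching it through the proxy looks something like this:

// Wrap the target URL, then navigate to the proxied version
const searchUrl = "https://www.linkedin.com/pub/dir?firstName=bill&lastName=gates&trk=people-guest_people-search-bar_search-submit";
const proxyUrl = getScrapeOpsUrl(searchUrl, "us");
await page.goto(proxyUrl, { timeout: 0 });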
Here is our finished crawler.
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;
async function writeToCsv(data, outputFile) {
let success = false;
while (!success) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);
if (!(data instanceof Array)) {
data = [data]
}
const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))
const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
success = true;
} catch (e) {
console.log("Failed data", data);
throw new Error("Failed to write to csv");
}
}
}
function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}
async function crawlProfiles(browser, keyword, location="us", retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const firstName = keyword.split(" ")[0];
const lastName = keyword.split(" ")[1]
const page = await browser.newPage();
try {
const url = `https://www.linkedin.com/pub/dir?firstName=${firstName}&lastName=${lastName}&trk=people-guest_people-search-bar_search-submit`;
const proxyUrl = getScrapeOpsUrl(url, location);
await page.goto(proxyUrl, { timeout: 0 });
console.log(`Successfully fetched: ${url}`);
const divCards = await page.$$("div[class='base-search-card__info']");
for (const divCard of divCards) {
const link = await page.evaluate(element => element.parentElement.getAttribute("href"), divCard);
const splitLink = link.split("/")
const name = splitLink[splitLink.length-1].split("?")[0];
const displayNameElement = await divCard.$("h3[class='base-search-card__title']");
const displayName = await page.evaluate(element => element.textContent, displayNameElement);
const locationElement = await page.$("p[class='people-search-card__location']");
const location = await page.evaluate(element => element.textContent, locationElement);
let companies = "n/a";
const hasCompanies = await page.$("span[class='entity-list-meta__entities-list']");
if (hasCompanies) {
companies = await page.evaluate(element => element.textContent, hasCompanies);
}
const searchData = {
name: name.trim(),
display_name: displayName.trim(),
url: link.trim(),
location: location.trim(),
companies: companies.trim()
};
await writeToCsv([searchData], `${keyword.replace(" ", "-")}.csv`);
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}
async function startCrawl(keywordList, location, concurrencyLimit, retries) {
const browser = await puppeteer.launch();
while (keywordList.length > 0) {
const currentBatch = keywordList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(keyword => crawlProfiles(browser, keyword, location, retries));
try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();
}
async function main() {
const keywords = ["bill gates", "elon musk"];
const concurrencyLimit = 5;
const location = "us";
const retries = 3;
const aggregateFiles = [];
console.log("Crawl starting");
console.time("startCrawl");
for (const keyword of keywords) {
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
await startCrawl(keywords, location, concurrencyLimit, retries);
console.timeEnd("startCrawl");
console.log("Crawl complete");
}
main();
Step 5: Production Run
It's finally time to test the performance of our crawler. Feel free to change any of the following in the main() function, then run the script as shown below.
- keywords
- concurrencyLimit
- location
- retries
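With those set, run the crawler with Node. We're assuming here that you saved the script as crawler.js; use whatever filename you chose:

node crawler.js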
As you can see in the screenshot above, we crawled two names in 20.244 seconds. 20.244 / 2 = 10.122 seconds per search. This isn't lightning fast, but it's not bad at all.
Build A LinkedIn Profile Scraper
Now that we're getting a crawler report, we need to read that report and scrape the profiles from it. Our next step involves building a scraper.
Our scraper will read the report from our crawler and scrape each individual profile that we extracted during the crawl. We'll add each feature iteratively, just like we did with the crawler.
Step 1: Create Simple Profile Data Parser
To start, we're going to write another parsing function. We'll give it retry logic and error handling, and we'll use the basic structure from the beginning of this article.
processProfile() fetches a profile. We find the head of the page. From inside the head, we find the JSON blob that contains all of our profileData.
async function processProfile(browser, row, location, retries = 3) {
const url = row.url;
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const page = await browser.newPage();
try {
const response = await page.goto(url);
if (!response || response.status() !== 200) {
throw new Error(`Failed to fetch page, status: ${response ? response.status() : "no response"}`);
}
const head = await page.$("head");
const scriptElement = await head.$("script[type='application/ld+json']");
const jsonText = await page.evaluate(element => element.textContent, scriptElement);
const jsonDataGraph = JSON.parse(jsonText)["@graph"];
let jsonData = {};
for (const element of jsonDataGraph) {
if (element["@type"] === "Person") {
jsonData = element;
break;
}
}
let company = "n/a";
let companyProfile = "n/a";
let jobTitle = "n/a";
if ("jobTitle" in jsonData && Array.isArray(jsonData.jobTitle) && jsonData.jobTitle.length > 0) {
jobTitle = jsonData.jobTitle[0];
}
const hasCompany = "worksFor" in jsonData && jsonData.worksFor.length > 0;
if (hasCompany) {
company = jsonData.worksFor[0].name;
const hasCompanyUrl = "url" in jsonData.worksFor[0];
if (hasCompanyUrl) {
companyProfile = jsonData.worksFor[0].url
}
}
const hasInteractions = "interactionStatistic" in jsonData;
let followers = 0;
if (hasInteractions) {
const stats = jsonData.interactionStatistic;
if (stats.name === "Follows" && stats["@type"] === "InteractionCounter") {
followers = stats.userInteractionCount;
}
}
const profileData = {
name: row.name,
company: company,
company_profile: companyProfile,
job_title: jobTitle,
followers: followers
}
console.log(profileData);
success = true;
console.log("Successfully parsed", row.url);
} catch (err) {
tries++;
console.log(`Error: ${err}, tries left: ${retries-tries}, url: ${getScrapeOpsUrl(url)}`);
} finally {
await page.close();
}
}
}
- First, we find the
head
of the page:await page.$("head")
. await head.$("script[type='application/ld+json']")
finds the JSON blob inside thehead
.- We load the JSON and iterate through the
"@graph"
element until we find a field called"Person"
. We use this"Person"
field to extract our data. - We attempt to extract the following and set defaults just in case something is not found:
company
: the company that a person works for.company_profile
: the company's LinkedIn profile.job_title
: the person's official job title.followers
: the amount of other people following this person.
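To make that concrete, here's a trimmed, hypothetical example of the "Person" node the parser expects inside "@graph". The field names match what the code reads; the values are made up:

{
    "@type": "Person",
    "jobTitle": ["Co-chair"],
    "worksFor": [
        { "name": "Example Foundation", "url": "https://www.linkedin.com/company/example-foundation" }
    ],
    "interactionStatistic": {
        "@type": "InteractionCounter",
        "name": "Follows",
        "userInteractionCount": 12345
    }
}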
Step 2: Loading URLs To Scrape
Our parsing function takes in a row and uses its url to look up a profile. Here, we'll write another function called processResults(). The goal is simple: read our CSV file into an array of JSON objects, then run processProfile() on each profile in the array.
We set this function up a lot like the startCrawl() function from earlier. You might notice that we take a concurrencyLimit as one of our arguments.
We don't do anything with it now, but we'll use it when we add concurrency later.
async function processResults(csvFile, location, concurrencyLimit, retries) {
const rows = await readCsv(csvFile);
const browser = await puppeteer.launch();
for (const row of rows) {
await processProfile(browser, row, location, retries);
}
await browser.close();
}
As you can see above, our function begins by reading a CSV file. We also write a function to do that.
async function readCsv(inputFile) {
const results = [];
const parser = fs.createReadStream(inputFile).pipe(csvParse.parse({
columns: true,
delimiter: ",",
trim: true,
skip_empty_lines: true
}));
for await (const record of parser) {
results.push(record);
}
return results;
}
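As a quick usage example (run inside an async function, and assuming the crawl already produced a bill-gates.csv report):

// Load the crawl report back into memory
const rows = await readCsv("bill-gates.csv");
console.log(`Loaded ${rows.length} rows from the crawl report`);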
You can see how everything fits together in our code below.
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;
async function writeToCsv(data, outputFile) {
let success = false;
while (!success) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);
if (!(data instanceof Array)) {
data = [data]
}
const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))
const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
success = true;
} catch (e) {
console.log("Failed data", data);
throw new Error("Failed to write to csv");
}
}
}
async function readCsv(inputFile) {
const results = [];
const parser = fs.createReadStream(inputFile).pipe(csvParse.parse({
columns: true,
delimiter: ",",
trim: true,
skip_empty_lines: true
}));
for await (const record of parser) {
results.push(record);
}
return results;
}
function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}
async function crawlProfiles(browser, keyword, location="us", retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const firstName = keyword.split(" ")[0];
const lastName = keyword.split(" ")[1]
const page = await browser.newPage();
try {
const url = `https://www.linkedin.com/pub/dir?firstName=${firstName}&lastName=${lastName}&trk=people-guest_people-search-bar_search-submit`;
const proxyUrl = getScrapeOpsUrl(url, location);
await page.goto(proxyUrl, { timeout: 0 });
console.log(`Successfully fetched: ${url}`);
const divCards = await page.$$("div[class='base-search-card__info']");
for (const divCard of divCards) {
const link = await page.evaluate(element => element.parentElement.getAttribute("href"), divCard);
const splitLink = link.split("/")
const name = splitLink[splitLink.length-1].split("?")[0];
const displayNameElement = await divCard.$("h3[class='base-search-card__title']");
const displayName = await page.evaluate(element => element.textContent, displayNameElement);
const locationElement = await page.$("p[class='people-search-card__location']");
const location = await page.evaluate(element => element.textContent, locationElement);
let companies = "n/a";
const hasCompanies = await page.$("span[class='entity-list-meta__entities-list']");
if (hasCompanies) {
companies = await page.evaluate(element => element.textContent, hasCompanies);
}
const searchData = {
name: name.trim(),
display_name: displayName.trim(),
url: link.trim(),
location: location.trim(),
companies: companies.trim()
};
await writeToCsv([searchData], `${keyword.replace(" ", "-")}.csv`);
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}
async function startCrawl(keywordList, location, concurrencyLimit, retries) {
const browser = await puppeteer.launch();
while (keywordList.length > 0) {
const currentBatch = keywordList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(keyword => crawlProfiles(browser, keyword, location, retries));
try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();
}
async function processProfile(browser, row, location, retries = 3) {
const url = row.url;
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const page = await browser.newPage();
try {
const response = await page.goto(url);
if (!response || response.status() !== 200) {
throw new Error(`Failed to fetch page, status: ${response ? response.status() : "no response"}`);
}
const head = await page.$("head");
const scriptElement = await head.$("script[type='application/ld+json']");
const jsonText = await page.evaluate(element => element.textContent, scriptElement);
const jsonDataGraph = JSON.parse(jsonText)["@graph"];
let jsonData = {};
for (const element of jsonDataGraph) {
if (element["@type"] === "Person") {
jsonData = element;
break;
}
}
let company = "n/a";
let companyProfile = "n/a";
let jobTitle = "n/a";
if ("jobTitle" in jsonData && Array.isArray(jsonData.jobTitle) && jsonData.jobTitle.length > 0) {
jobTitle = jsonData.jobTitle[0];
}
const hasCompany = "worksFor" in jsonData && jsonData.worksFor.length > 0;
if (hasCompany) {
company = jsonData.worksFor[0].name;
const hasCompanyUrl = "url" in jsonData.worksFor[0];
if (hasCompanyUrl) {
companyProfile = jsonData.worksFor[0].url
}
}
const hasInteractions = "interactionStatistic" in jsonData;
let followers = 0;
if (hasInteractions) {
const stats = jsonData.interactionStatistic;
if (stats.name === "Follows" && stats["@type"] === "InteractionCounter") {
followers = stats.userInteractionCount;
}
}
const profileData = {
name: row.name,
company: company,
company_profile: companyProfile,
job_title: jobTitle,
followers: followers
}
console.log(profileData);
success = true;
console.log("Successfully parsed", row.url);
} catch (err) {
tries++;
console.log(`Error: ${err}, tries left: ${retries-tries}, url: ${getScrapeOpsUrl(url)}`);
} finally {
await page.close();
}
}
}
async function processResults(csvFile, location, concurrencyLimit, retries) {
const rows = await readCsv(csvFile);
const browser = await puppeteer.launch();
for (const row of rows) {
await processProfile(browser, row, location, retries);
}
await browser.close();
}
async function main() {
const keywords = ["bill gates", "elon musk"];
const concurrencyLimit = 5;
const location = "us";
const retries = 3;
const aggregateFiles = [];
console.log("Crawl starting");
console.time("startCrawl");
for (const keyword of keywords) {
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
await startCrawl(keywords, location, concurrencyLimit, retries);
console.timeEnd("startCrawl");
console.log("Crawl complete");
console.log("Starting scrape");
for (const file of aggregateFiles) {
console.log(file)
console.time("processResults");
await processResults(file, location, concurrencyLimit, retries);
console.timeEnd("processResults");
}
console.log("Scrape complete");
}
main();
- processProfile() extracts data from individual profiles.
- processResults() reads our CSV file and runs processProfile() on all of the profiles from our CSV.
Step 3: Storing the Scraped Data
writeToCsv() already gives us the ability to write JSON objects to a CSV file. We also already convert our extracted data into a JSON object.
Instead of printing our JSON object to the console, we need to pass it into writeToCsv(). That's the only line that changes here.
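For reference, the updated line inside processProfile() looks like this:

await writeToCsv([profileData], `${row.name.replace(" ", "-")}.csv`);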
Here's our fully updated code.
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;
async function writeToCsv(data, outputFile) {
let success = false;
while (!success) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);
if (!(data instanceof Array)) {
data = [data]
}
const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))
const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
success = true;
} catch (e) {
console.log("Failed data", data);
throw new Error("Failed to write to csv");
}
}
}
async function readCsv(inputFile) {
const results = [];
const parser = fs.createReadStream(inputFile).pipe(csvParse.parse({
columns: true,
delimiter: ",",
trim: true,
skip_empty_lines: true
}));
for await (const record of parser) {
results.push(record);
}
return results;
}
function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}
async function crawlProfiles(browser, keyword, location="us", retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const firstName = keyword.split(" ")[0];
const lastName = keyword.split(" ")[1]
const page = await browser.newPage();
try {
const url = `https://www.linkedin.com/pub/dir?firstName=${firstName}&lastName=${lastName}&trk=people-guest_people-search-bar_search-submit`;
const proxyUrl = getScrapeOpsUrl(url, location);
await page.goto(proxyUrl, { timeout: 0 });
console.log(`Successfully fetched: ${url}`);
const divCards = await page.$$("div[class='base-search-card__info']");
for (const divCard of divCards) {
const link = await page.evaluate(element => element.parentElement.getAttribute("href"), divCard);
const splitLink = link.split("/")
const name = splitLink[splitLink.length-1].split("?")[0];
const displayNameElement = await divCard.$("h3[class='base-search-card__title']");
const displayName = await page.evaluate(element => element.textContent, displayNameElement);
const locationElement = await page.$("p[class='people-search-card__location']");
const location = await page.evaluate(element => element.textContent, locationElement);
let companies = "n/a";
const hasCompanies = await page.$("span[class='entity-list-meta__entities-list']");
if (hasCompanies) {
companies = await page.evaluate(element => element.textContent, hasCompanies);
}
const searchData = {
name: name.trim(),
display_name: displayName.trim(),
url: link.trim(),
location: location.trim(),
companies: companies.trim()
};
await writeToCsv([searchData], `${keyword.replace(" ", "-")}.csv`);
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}
async function startCrawl(keywordList, location, concurrencyLimit, retries) {
const browser = await puppeteer.launch();
while (keywordList.length > 0) {
const currentBatch = keywordList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(keyword => crawlProfiles(browser, keyword, location, retries));
try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();
}
async function processProfile(browser, row, location, retries = 3) {
const url = row.url;
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const page = await browser.newPage();
try {
const response = await page.goto(url);
if (!response || response.status() !== 200) {
throw new Error(`Failed to fetch page, status: ${response ? response.status() : "no response"}`);
}
const head = await page.$("head");
const scriptElement = await head.$("script[type='application/ld+json']");
const jsonText = await page.evaluate(element => element.textContent, scriptElement);
const jsonDataGraph = JSON.parse(jsonText)["@graph"];
let jsonData = {};
for (const element of jsonDataGraph) {
if (element["@type"] === "Person") {
jsonData = element;
break;
}
}
let company = "n/a";
let companyProfile = "n/a";
let jobTitle = "n/a";
if ("jobTitle" in jsonData && Array.isArray(jsonData.jobTitle) && jsonData.jobTitle.length > 0) {
jobTitle = jsonData.jobTitle[0];
}
const hasCompany = "worksFor" in jsonData && jsonData.worksFor.length > 0;
if (hasCompany) {
company = jsonData.worksFor[0].name;
const hasCompanyUrl = "url" in jsonData.worksFor[0];
if (hasCompanyUrl) {
companyProfile = jsonData.worksFor[0].url
}
}
const hasInteractions = "interactionStatistic" in jsonData;
let followers = 0;
if (hasInteractions) {
const stats = jsonData.interactionStatistic;
if (stats.name === "Follows" && stats["@type"] === "InteractionCounter") {
followers = stats.userInteractionCount;
}
}
const profileData = {
name: row.name,
company: company,
company_profile: companyProfile,
job_title: jobTitle,
followers: followers
}
await writeToCsv([profileData], `${row.name.replace(" ", "-")}.csv`);
success = true;
console.log("Successfully parsed", row.url);
} catch (err) {
tries++;
console.log(`Error: ${err}, tries left: ${retries-tries}, url: ${getScrapeOpsUrl(url)}`);
} finally {
await page.close();
}
}
}
async function processResults(csvFile, location, concurrencyLimit, retries) {
const rows = await readCsv(csvFile);
const browser = await puppeteer.launch();
for (const row of rows) {
await processProfile(browser, row, location, retries);
}
await browser.close();
}
async function main() {
const keywords = ["bill gates", "elon musk"];
const concurrencyLimit = 5;
const location = "us";
const retries = 3;
const aggregateFiles = [];
console.log("Crawl starting");
console.time("startCrawl");
for (const keyword of keywords) {
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
await startCrawl(keywords, location, concurrencyLimit, retries);
console.timeEnd("startCrawl");
console.log("Crawl complete");
console.log("Starting scrape");
for (const file of aggregateFiles) {
console.log(file)
console.time("processResults");
await processResults(file, location, concurrencyLimit, retries);
console.timeEnd("processResults");
}
console.log("Scrape complete");
}
main();
We now pass our profileData into writeToCsv(). This stores our extracted data safely.
Step 4: Adding Concurrency
Remember when we mentioned the concurrencyLimit before?
Now it's time to actually use it. Here, we'll once again use splice() to cut our array into chunks.
We convert each chunk into an array of async tasks. Then we await our tasks with Promise.all() so each task can resolve.
async function processResults(csvFile, location, concurrencyLimit, retries) {
const rows = await readCsv(csvFile);
const browser = await puppeteer.launch();
while (rows.length > 0) {
const currentBatch = rows.splice(0, concurrencyLimit);
const tasks = currentBatch.map(row => processProfile(browser, row, location, retries));
try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();
}
Step 5: Bypassing Anti-Bots
Our crawler has already been integrated with the ScrapeOps Proxy Aggregator via getScrapeOpsUrl(). We need it to get past any anti-bots LinkedIn uses on the profile pages as well.
We're going to change one line in our parsing function, the await page.goto() call.
const response = await page.goto(getScrapeOpsUrl(url, location), { timeout: 0 });
We have unlocked the power of the proxy.
Our finished profile scraper is available below.
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;
async function writeToCsv(data, outputFile) {
let success = false;
while (!success) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);
if (!(data instanceof Array)) {
data = [data]
}
const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))
const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
success = true;
} catch (e) {
console.log("Failed data", data);
throw new Error("Failed to write to csv");
}
}
}
async function readCsv(inputFile) {
const results = [];
const parser = fs.createReadStream(inputFile).pipe(csvParse.parse({
columns: true,
delimiter: ",",
trim: true,
skip_empty_lines: true
}));
for await (const record of parser) {
results.push(record);
}
return results;
}
function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}
async function crawlProfiles(browser, keyword, location="us", retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const firstName = keyword.split(" ")[0];
const lastName = keyword.split(" ")[1]
const page = await browser.newPage();
try {
const url = `https://www.linkedin.com/pub/dir?firstName=${firstName}&lastName=${lastName}&trk=people-guest_people-search-bar_search-submit`;
const proxyUrl = getScrapeOpsUrl(url, location);
await page.goto(proxyUrl, { timeout: 0 });
console.log(`Successfully fetched: ${url}`);
const divCards = await page.$$("div[class='base-search-card__info']");
for (const divCard of divCards) {
const link = await page.evaluate(element => element.parentElement.getAttribute("href"), divCard);
const splitLink = link.split("/")
const name = splitLink[splitLink.length-1].split("?")[0];
const displayNameElement = await divCard.$("h3[class='base-search-card__title']");
const displayName = await page.evaluate(element => element.textContent, displayNameElement);
const locationElement = await page.$("p[class='people-search-card__location']");
const location = await page.evaluate(element => element.textContent, locationElement);
let companies = "n/a";
const hasCompanies = await page.$("span[class='entity-list-meta__entities-list']");
if (hasCompanies) {
companies = await page.evaluate(element => element.textContent, hasCompanies);
}
const searchData = {
name: name.trim(),
display_name: displayName.trim(),
url: link.trim(),
location: location.trim(),
companies: companies.trim()
};
await writeToCsv([searchData], `${keyword.replace(" ", "-")}.csv`);
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}
async function startCrawl(keywordList, location, concurrencyLimit, retries) {
const browser = await puppeteer.launch();
while (keywordList.length > 0) {
const currentBatch = keywordList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(keyword => crawlProfiles(browser, keyword, location, retries));
try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();
}
async function processProfile(browser, row, location, retries = 3) {
const url = row.url;
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const page = await browser.newPage();
try {
const response = await page.goto(getScrapeOpsUrl(url, location), { timeout: 0 });
if (!response || response.status() !== 200) {
throw new Error(`Failed to fetch page, status: ${response ? response.status() : "no response"}`);
}
const head = await page.$("head");
const scriptElement = await head.$("script[type='application/ld+json']");
const jsonText = await page.evaluate(element => element.textContent, scriptElement);
const jsonDataGraph = JSON.parse(jsonText)["@graph"];
let jsonData = {};
for (const element of jsonDataGraph) {
if (element["@type"] === "Person") {
jsonData = element;
break;
}
}
let company = "n/a";
let companyProfile = "n/a";
let jobTitle = "n/a";
if ("jobTitle" in jsonData && Array.isArray(jsonData.jobTitle) && jsonData.jobTitle.length > 0) {
jobTitle = jsonData.jobTitle[0];
}
const hasCompany = "worksFor" in jsonData && jsonData.worksFor.length > 0;
if (hasCompany) {
company = jsonData.worksFor[0].name;
const hasCompanyUrl = "url" in jsonData.worksFor[0];
if (hasCompanyUrl) {
companyProfile = jsonData.worksFor[0].url
}
}
const hasInteractions = "interactionStatistic" in jsonData;
let followers = 0;
if (hasInteractions) {
const stats = jsonData.interactionStatistic;
if (stats.name === "Follows" && stats["@type"] === "InteractionCounter") {
followers = stats.userInteractionCount;
}
}
const profileData = {
name: row.name,
company: company,
company_profile: companyProfile,
job_title: jobTitle,
followers: followers
}
await writeToCsv([profileData], `${row.name.replace(" ", "-")}.csv`);
success = true;
console.log("Successfully parsed", row.url);
} catch (err) {
tries++;
console.log(`Error: ${err}, tries left: ${retries-tries}, url: ${getScrapeOpsUrl(url)}`);
} finally {
await page.close();
}
}
}
async function processResults(csvFile, location, concurrencyLimit, retries) {
const rows = await readCsv(csvFile);
const browser = await puppeteer.launch();
while (rows.length > 0) {
const currentBatch = rows.splice(0, concurrencyLimit);
const tasks = currentBatch.map(row => processProfile(browser, row, location, retries));
try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();
}
async function main() {
const keywords = ["bill gates", "elon musk"];
const concurrencyLimit = 5;
const location = "us";
const retries = 3;
const aggregateFiles = [];
console.log("Crawl starting");
console.time("startCrawl");
for (const keyword of keywords) {
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
await startCrawl(keywords, location, concurrencyLimit, retries);
console.timeEnd("startCrawl");
console.log("Crawl complete");
console.log("Starting scrape");
for (const file of aggregateFiles) {
console.log(file)
console.time("processResults");
await processResults(file, location, concurrencyLimit, retries);
console.timeEnd("processResults");
}
console.log("Scrape complete");
}
main();
Step 6: Production Run
Now, we'll test the full script out in production. Like before, feel free to change any of the following:
- keywords
- concurrencyLimit
- location
- retries
This time, our crawl took 73.6 seconds. You can see a screenshot of our full results below.
This time around, we generated two crawl reports with a total of 78 results. It took 327.482 seconds to scrape all of the bill gates results. The elon musk scrape took 248.634 seconds.
Our total time spent scraping profiles is 576.116 seconds. 576.116 seconds / 78 results = roughly 7.4 seconds per result. We're scraping pages even faster than we crawled them!
Our overall program is running pretty well!
Legal and Ethical Considerations
According to precedent set by numerous court cases, scraping the public web (including LinkedIn) is generally legal. In this tutorial, we made sure to only scrape public data from LinkedIn.
When scraping private data (data behind a login), that's a completely different story and you're subject to a completely different set of rules and regulations.
Although our crawler and scraper here only collected public data, we almost certainly violated LinkedIn's terms of service and robots.txt. You can view their terms here and you may view their robots.txt here.
Failure to comply with these policies can result in suspension or even permanent removal of your LinkedIn account.
If you're unsure whether your scraper is legal or not, consult an attorney.
Conclusion
LinkedIn Profiles are among the most difficult pages to scrape on the web. The ScrapeOps Proxy Aggregator easily pushes through their anti-bots and gets us through to the data we need.
By this point, you've completed the tutorial and should have a solid grasp of how to use Puppeteer to extract data from LinkedIn, along with parsing, data storage, concurrency, and proxy integration.
You can dig deeper into the tech we used by clicking the links below.
More Puppeteer Web Scraping Guides
We always have something for you here at ScrapeOps. Whether you're just learning how to code, or you're a seasoned dev, you can gain something from our tutorials.
Check out our Puppeteer Web Scraping Playbook.
If you want to learn how to scrape another tricky site, check out the links below!