How to Scrape Linkedin Jobs With Puppeteer
LinkedIn was founded in 2003 and has been a powerhouse ever since, hosting millions of job postings. LinkedIn was not built with scrapers in mind, and the site makes an active attempt to stop them. Even so, if you know what you're doing, the vast majority of their data is still publicly available. You just need to know where to look!
In this tutorial, we'll build a LinkedIn jobs scraper from start to finish.
- TLDR: How to Scrape LinkedIn Jobs
- How To Architect Our Scraper
- Understanding How To Scrape LinkedIn Jobs
- Setting Up Our LinkedIn Jobs Scraper
- Build A LinkedIn Jobs Search Crawler
- Build A LinkedIn Jobs Scraper
- Legal and Ethical Considerations
- Conclusion
- More Puppeteer Web Scraping Guides
TLDR - How to Scrape LinkedIn Jobs
Wanna skip the article and just scrape LinkedIn jobs? You can use our prebuilt scraper!
- Create a new NodeJS project and add a config.json file to it.
- Add your ScrapeOps API key to the config file: {"api_key": "your-super-secret-api-key"}.
- Then copy and paste the code below into a new JavaScript file.
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;
async function writeToCsv(data, outputFile) {
let success = false;
while (!success) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);
if (!(data instanceof Array)) {
data = [data]
}
const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))
const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
success = true;
} catch (e) {
console.log("Failed data", data);
throw new Error("Failed to write to csv");
}
}
}
async function readCsv(inputFile) {
const results = [];
const parser = fs.createReadStream(inputFile).pipe(csvParse.parse({
columns: true,
delimiter: ",",
trim: true,
skip_empty_lines: true
}));
for await (const record of parser) {
results.push(record);
}
return results;
}
function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}
function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}
async function scrapeSearchResults(browser, keyword, pageNumber, locality, location="us", retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const formattedKeyword = keyword.replace(" ", "+");
const formattedLocality = locality.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=${formattedKeyword}&location=${formattedLocality}&original_referer=&start=${pageNumber*10}`;
const proxyUrl = getScrapeOpsUrl(url, location);
await page.goto(proxyUrl, { timeout: 0 });
console.log(`Successfully fetched: ${url}`);
const divCards = await page.$$("div[class='base-search-card__info']");
for (const divCard of divCards) {
const nameElement = await divCard.$("h4[class='base-search-card__subtitle']");
const name = await page.evaluate(element => element.textContent, nameElement);
const jobTitleElement = await divCard.$("h3[class='base-search-card__title']");
const jobTitle = await page.evaluate(element => element.textContent, jobTitleElement);
const parentElement = await page.evaluateHandle(element => element.parentElement, divCard);
const aTag = await parentElement.$("a");
const link = await page.evaluate(element => element.getAttribute("href"), aTag);
const jobLocationElement = await divCard.$("span[class='job-search-card__location']");
const jobLocation = await page.evaluate(element => element.textContent, jobLocationElement);
const searchData = {
name: name.trim(),
job_title: jobTitle.trim(),
url: link.trim(),
location: jobLocation.trim()
};
await writeToCsv([searchData], `${keyword.replace(" ", "-")}.csv`);
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}
async function startCrawl(keyword, pages, locality, location, concurrencyLimit, retries) {
const pageList = range(0, pages);
const browser = await puppeteer.launch();
while (pageList.length > 0) {
const currentBatch = pageList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, locality, location, retries));
try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();
}
async function processJob(browser, row, location, retries = 3) {
const url = row.url;
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const page = await browser.newPage();
try {
const response = await page.goto(getScrapeOpsUrl(url, location), { timeout: 0 });
if (!response || response.status() !== 200) {
throw new Error(`Failed to fetch page, status: ${response ? response.status() : "no response"}`);
}
const jobCriteria = await page.$$("li[class='description__job-criteria-item']");
if (jobCriteria.length < 4) {
throw new Error("Job Criteria Not Found!");
}
const seniority = (await page.evaluate(element => element.textContent, jobCriteria[0])).replace("Seniority level", "");
const positionType = (await page.evaluate(element => element.textContent, jobCriteria[1])).replace("Employment type", "");
const jobFunction = (await page.evaluate(element => element.textContent, jobCriteria[2])).replace("Job function", "");
const industry = (await page.evaluate(element => element.textContent, jobCriteria[3])).replace("Industries", "");
const jobData = {
name: row.name,
seniority: seniority.trim(),
position_type: positionType.trim(),
job_function: jobFunction.trim(),
industry: industry.trim()
}
await writeToCsv([jobData], `${row.name.replace(" ", "-")}-${row.job_title.replace(" ", "-")}.csv`);
success = true;
console.log("Successfully parsed", row.url);
} catch (err) {
tries++;
console.log(`Error: ${err}, tries left: ${retries-tries}, url: ${getScrapeOpsUrl(url)}`);
} finally {
await page.close();
}
}
}
async function processResults(csvFile, location, concurrencyLimit, retries) {
const rows = await readCsv(csvFile);
const browser = await puppeteer.launch();
while (rows.length > 0) {
const currentBatch = rows.splice(0, concurrencyLimit);
const tasks = currentBatch.map(row => processJob(browser, row, location, retries));
try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();
}
async function main() {
const keywords = ["software engineer"];
const concurrencyLimit = 5;
const pages = 1;
const location = "us";
const locality = "United States";
const retries = 3;
const aggregateFiles = [];
for (const keyword of keywords) {
console.log("Crawl starting");
console.time("startCrawl");
await startCrawl(keyword, pages, locality, location, concurrencyLimit, retries);
console.timeEnd("startCrawl");
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
console.log("Starting scrape");
for (const file of aggregateFiles) {
console.time("processResults");
await processResults(file, location, concurrencyLimit, retries);
console.timeEnd("processResults");
}
console.log("Scrape complete");
}
main();
You can change any of the following in main() to fine-tune your results:
- keywords: An array of job titles or terms to be used as search queries on LinkedIn.
- concurrencyLimit: The maximum number of pages or tasks processed concurrently.
- pages: The number of pages of search results to crawl for each keyword.
- location: A two-letter country code (e.g., "us") specifying the country for the search results.
- locality: The human-readable location name (e.g., "United States") used in the search query.
- retries: The number of retry attempts allowed for failed tasks (e.g., failed page loads or data extractions).
node name-of-your-script or node name-of-your-script.js will run the scraper. Modern NodeJS doesn't require the file extension in the command.
Once it's done running, you'll get a CSV named after your search (e.g., software-engineer.csv) containing all of your search data: name, job_title, url, and location. You get an individual report generated for each job listing as well. These individual files contain more detailed information about each posting: name, seniority, position_type, job_function, and industry.
How To Architect Our LinkedIn Jobs Scraper
If we want to scrape LinkedIn jobs thoroughly, we need a result crawler and a job scraper. Our crawler does a keyword search and saves our results. Once our crawl is finished, our job scraper reads the report from the crawler. Then, it looks up every individual listing from the CSV and collects more data on each one.
If you perform a search for Software Engineer, the crawler will extract and save all the Software Engineer jobs from the search. Then, the scraper will look up each individual job posting and generate a special report for each posting it looks up.
At this point, this might sound a little intimidating. We need to take our larger project and break it into smaller pieces. Step by step, we'll define exactly what we want from our crawler. Then, we'll identify the steps we need to take when building our scraper.
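Before we break this down into steps, here's a rough sketch of the overall flow we're aiming for. The function names (crawlSearchResults, readReport, scrapeJobPosting) are placeholders for illustration only; the real implementations come later in this article.

// Placeholder stubs so the sketch runs end to end; the real versions are built later in this article.
async function crawlSearchResults(keyword, reportFile) { /* crawl search pages and write rows to reportFile */ }
async function readReport(reportFile) { return []; /* read the CSV rows back into an array */ }
async function scrapeJobPosting(url) { /* scrape extra details from one job posting */ }

async function pipeline() {
    // Stage 1: run the keyword search and save each result to a CSV report.
    await crawlSearchResults("software engineer", "software-engineer.csv");
    // Stage 2: read the report back and scrape more detail for every posting in it.
    const rows = await readReport("software-engineer.csv");
    for (const row of rows) {
        await scrapeJobPosting(row.url);
    }
}

pipeline();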
Here are the steps to building the crawler:
- Write a search results parser to extract our data.
- Add pagination so we get more results and finer control over them.
- Add data storage and use it to save our parsed results.
- Use async batches and Promise.all() to add concurrency.
- Write a function for proxy integration and use it to bypass LinkedIn's anti-bot system.
Now, take a look at what we need to build the scraper.
- Write a parser to pull information from individual job postings.
- Give our scraper the ability to read a CSV file.
- Add data storage and build it into our parsing function.
- Use Promise.all() to scrape posting data concurrently.
- Use our proxy function from earlier to bypass anti-bots.
Understanding How To Scrape LinkedIn Jobs
As much as you might want to, we can't just start coding.
- We need to see how all this works from a high level.
- We need to request specific pages.
- We need to know where our data is located on the page and come up with a method for extracting it.
- To get control over our results, we also need pagination and geolocation support.
In the next few sections, we'll explore all these concepts in finer detail. By the time we write our code, we'll know exactly what we want it to do.
Step 1: How To Request LinkedIn Jobs Pages
Whenever you go to a page on the web, it begins with a simple GET request.
- If you look at LinkedIn from your browser, the browser sends a GET to LinkedIn.
- LinkedIn sends an HTML response back to your browser.
- Your browser then reads and renders the HTML.
- When scraping, we don't actually need to render the page.
- We need to pick through the HTML and pull our data from it. This allows us to search much faster and more efficiently than a human could.
You can view our URL format here:
https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=${formattedKeyword}&location=${formattedLocality}&original_referer=&start=
For the Software Engineer search we mentioned earlier, our URL looks like this:
https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=software+engineer&location=United+States&original_referer=
Look closer at the base URL:
https://www.linkedin.com/jobs-guest/jobs/api
We have an endpoint, /api, inside of it. Our requests are actually going to their API.
Surprisingly, this API endpoint doesn't respond with JSON or XML; it gives us straight HTML. In years of web development and scraping, LinkedIn is the only place I've ever seen this.
The screenshot below gives us a barebones HTML page without any styling whatsoever, but it is in fact a webpage. When you're viewing data from the main page, the page fetches this HTML and uses it to update your screen.
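If you want to verify this for yourself, here's a minimal sketch (assuming Puppeteer is installed) that fetches one page of search results and prints the start of the raw HTML the endpoint returns. Keep in mind that unproxied requests like this may get blocked after a few tries, which is exactly why we add a proxy later.

const puppeteer = require("puppeteer");

async function peekAtSearchHtml() {
    const url = "https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=software+engineer&location=United+States&original_referer=&start=0";
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(url);
    // page.content() returns the full HTML of the response: no JSON or XML here.
    const html = await page.content();
    console.log(html.slice(0, 500));
    await browser.close();
}

peekAtSearchHtml();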
Once we're finished searching, we'll scrape individual listing data. Look at the screenshot below. This is the basic layout for any job posted on LinkedIn. We don't need to worry about the URLs for these. We'll find these URLs when we crawl the search results.
Step 2: How To Extract Data From LinkedIn Jobs Results and Pages
We know which pages we're scraping. Now we need to figure out exactly where our data is located. Our search results hold a bunch of div elements. Each one we want has a class name of base-search-card__info.
For individual job pages, we look for li elements with a class of description__job-criteria-item.
In the image below, you can see a div with the class name base-search-card__info. This is one of our search results. To extract this data, we need to find each div matching this class.
The next shot holds the li element we want to scrape. Each li element has the class name description__job-criteria-item. For these, we'll extract all li elements matching our target class.
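Here's a minimal sketch of those two selectors in Puppeteer. Each helper assumes page is a Puppeteer Page that has already navigated to the relevant URL (a search results page for the first, an individual job page for the second).

// Search results page: one <div class="base-search-card__info"> per job card.
async function getResultCards(page) {
    return await page.$$("div[class='base-search-card__info']");
}

// Individual job page: one <li class="description__job-criteria-item"> per criterion.
async function getCriteriaItems(page) {
    return await page.$$("li[class='description__job-criteria-item']");
}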
Step 3: How To Control Pagination
If you want a lot of data, you need to paginate your results. Pagination allows us to get our results in batches.
We'll have to add one more param to our URL: &start=${pageNumber*10}. For the first page (page 0) of the Software Engineer search, we get this URL:
https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=software+engineer&location=United+States&original_referer=&start=0
We use pageNumber*10
because we begin counting at 0 and each request yields 10 results. Page 0 (0 * 10) yields results 1 through 10. Page 1 yields 11 through 20 and so on and so forth.
Look below to see how our fully formatted URL looks:
`https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=${formattedKeyword}&location=${formattedLocality}&original_referer=&start=${pageNumber*10}`
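As a quick sanity check, here's how the start offsets work out for the first few pages:

// pageNumber * 10 becomes the `start` offset sent to LinkedIn.
for (const pageNumber of [0, 1, 2]) {
    console.log(`page ${pageNumber} -> &start=${pageNumber * 10}`);
}
// page 0 -> &start=0   (results 1-10)
// page 1 -> &start=10  (results 11-20)
// page 2 -> &start=20  (results 21-30)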
Step 4: Geolocated Data
The ScrapeOps Proxy Aggregator gives us excellent geotargeting support. This API takes in all sorts of arguments, but the one we want is called country.
- If we want to appear in the US, we can pass "country": "us" into the API.
- If we want to appear in the UK, we'd pass "country": "uk".
You can find a full list of ScrapeOps supported countries here.
Some other providers charge extra for geotargeting; we don't.
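Here's a small sketch of how the country parameter fits into a ScrapeOps proxy URL; it's the same pattern our getScrapeOpsUrl() function uses later in this article (the API key below is a placeholder).

const API_KEY = "your-super-secret-api-key"; // placeholder: use the key from your config.json

function proxiedUrl(url, country) {
    // country controls which country ScrapeOps routes the request through.
    const params = new URLSearchParams({ api_key: API_KEY, url: url, country: country });
    return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}

console.log(proxiedUrl("https://www.linkedin.com/jobs", "us")); // appear in the US
console.log(proxiedUrl("https://www.linkedin.com/jobs", "uk")); // appear in the UK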
Setting Up Our LinkedIn Jobs Scraper Project
Let's get started. We need to make a new NodeJS project. Then we need to install our dependencies. You can run the following commands to get set up.
Create a New Project Folder
mkdir linkedin-jobs-scraper
cd linkedin-jobs-scraper
Create a New NodeJS Project
npm init --y
Install Our Dependencies
npm install puppeteer
npm install csv-writer
npm install csv-parse
The fs module ships with NodeJS, so it doesn't need to be installed separately.
We've almost finished setting everything up. Every script in this tutorial reads its API key from a config.json file in the project root, so create one now with the same format shown in the TLDR section: {"api_key": "your-super-secret-api-key"}. Time to start coding.
Build A LinkedIn Jobs Search Crawler
We're past the boring stuff. It's finally time to start building. We'll start on our crawler. Each time we implement one of the steps below, we'll build a new iteration of our crawler. Iterative building is a great way to simplify your development process.
- First, we're going to build a basic script with error handling, retry logic, and our basic parser.
- Next, we'll add pagination.
- Once we're getting proper result batches, we need to create a couple classes and use them for data storage.
- Then, we'll add concurrency to scrape multiple pages simultaneously.
- Finally, we'll use the ScrapeOps Proxy Aggregator to get past any roadblocks that might get in our way.
Step 1: Create Simple Search Data Parser
We won't get very far if we can't parse a page.
In our code below, we'll write our parsing function for the crawler. Everything else we add will be built on top of this basic script. We've got our imports and retry logic, but you need to pay close attention to our parsing function.
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;
async function scrapeSearchResults(browser, keyword, locality, location="us", retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const formattedKeyword = keyword.replace(" ", "+");
const formattedLocality = locality.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=${formattedKeyword}&location=${formattedLocality}&original_referer=`;
await page.goto(url);
console.log(`Successfully fetched: ${url}`);
const divCards = await page.$$("div[class='base-search-card__info']");
for (const divCard of divCards) {
const nameElement = await divCard.$("h4[class='base-search-card__subtitle']");
const name = await page.evaluate(element => element.textContent, nameElement);
const jobTitleElement = await divCard.$("h3[class='base-search-card__title']");
const jobTitle = await page.evaluate(element => element.textContent, jobTitleElement);
const parentElement = await page.evaluateHandle(element => element.parentElement, divCard);
const aTag = await parentElement.$("a");
const link = await page.evaluate(element => element.getAttribute("href"), aTag);
const jobLocationElement = await divCard.$("span[class='job-search-card__location']");
const jobLocation = await page.evaluate(element => element.textContent, jobLocationElement);
const searchData = {
name: name.trim(),
job_title: jobTitle.trim(),
url: link.trim(),
location: jobLocation.trim()
};
console.log(searchData);
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}
async function startCrawl(keyword, locality, location, retries) {
const browser = await puppeteer.launch();
await scrapeSearchResults(browser, keyword, locality, location, retries);
await browser.close();
}
async function main() {
const keywords = ["software engineer"];
const concurrencyLimit = 5;
const pages = 1;
const location = "us";
const locality = "United States";
const retries = 3;
const aggregateFiles = [];
for (const keyword of keywords) {
console.log("Crawl starting");
console.time("startCrawl");
await startCrawl(keyword, locality, location, retries);
console.timeEnd("startCrawl");
console.log("Crawl complete");
}
}
main();
- In our main(), we call startCrawl(). At the moment, this function opens a browser and passes it into our parsing function, scrapeSearchResults().
  - await puppeteer.launch(); launches the browser.
  - We pass it into our parser with scrapeSearchResults(browser, keyword, locality, location, retries).
  - Once the parsing function has finished, we close the browser: await browser.close();
- The real magic happens from inside scrapeSearchResults().
  - We find all of our divCards with await page.$$("div[class='base-search-card__info']");.
  - When we extract text from the page elements, we use page.evaluate(): await page.evaluate(element => element.textContent, nameElement). This method is used for the name, jobTitle, link, and jobLocation.
  - We then save these inside of a searchData object and remove the whitespace and any newline characters with the trim() method.
  - Once we've got our searchData, we print it to the console.
Step 2: Add Pagination
Adding pagination is a pretty easy job.
- We append &start=${pageNumber*10} to the end of our URL.
- We also need to alter startCrawl() to scrape multiple pages.
- We add a simple for loop that allows us to do this. This is only temporary; later on, we'll replace it with some more powerful code that performs our search concurrently.
Here is our URL format with pagination support.
`https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=${formattedKeyword}&location=${formattedLocality}&original_referer=&start=${pageNumber*10}`
This next function isn't a requirement, but it makes our code easier to write.
Here's a homemade range() function similar to the one from Python.
function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}
This next little snippet includes our rewritten startCrawl(). It uses a simple for loop to iterate through our pages.
async function startCrawl(keyword, pages, locality, location, retries) {
const pageList = range(0, pages);
const browser = await puppeteer.launch();
for (const page of pageList) {
await scrapeSearchResults(browser, keyword, page, locality, location, retries)
}
await browser.close();
}
Below, you can see how everything fits together now.
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;
function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}
async function scrapeSearchResults(browser, keyword, pageNumber, locality, location="us", retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const formattedKeyword = keyword.replace(" ", "+");
const formattedLocality = locality.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=${formattedKeyword}&location=${formattedLocality}&original_referer=&start=${pageNumber*10}`;
await page.goto(url);
console.log(`Successfully fetched: ${url}`);
const divCards = await page.$$("div[class='base-search-card__info']");
for (const divCard of divCards) {
const nameElement = await divCard.$("h4[class='base-search-card__subtitle']");
const name = await page.evaluate(element => element.textContent, nameElement);
const jobTitleElement = await divCard.$("h3[class='base-search-card__title']");
const jobTitle = await page.evaluate(element => element.textContent, jobTitleElement);
const parentElement = await page.evaluateHandle(element => element.parentElement, divCard);
const aTag = await parentElement.$("a");
const link = await page.evaluate(element => element.getAttribute("href"), aTag);
const jobLocationElement = await divCard.$("span[class='job-search-card__location']");
const jobLocation = await page.evaluate(element => element.textContent, jobLocationElement);
const searchData = {
name: name.trim(),
job_title: jobTitle.trim(),
url: link.trim(),
location: jobLocation.trim()
};
console.log(searchData);
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}
async function startCrawl(keyword, pages, locality, location, retries) {
const pageList = range(0, pages);
const browser = await puppeteer.launch();
for (const page of pageList) {
await scrapeSearchResults(browser, keyword, page, locality, location, retries)
}
await browser.close();
}
async function main() {
const keywords = ["software engineer"];
const concurrencyLimit = 5;
const pages = 1;
const location = "us";
const locality = "United States";
const retries = 3;
const aggregateFiles = [];
for (const keyword of keywords) {
console.log("Crawl starting");
console.time("startCrawl");
await startCrawl(keyword, pages, locality, location, retries);
console.timeEnd("startCrawl");
console.log("Crawl complete");
}
}
main();
- &start=${pageNumber*10} allows us to control our pagination. We use pageNumber*10 because we get 10 results per page and our results start at zero.
- With range() and startCrawl(), we can now scrape an array of pages.
Step 3: Storing the Scraped Data
When you're scraping, you need to be able to store your data. Without storage, our data is gone as soon as the program exits.
In this section, we'll create a writeToCsv() function.
- This function can take in either a JSON object or an array and write it to a CSV file. We need to write it carefully though.
- If the file already exists, we should append to it. This will prevent us from overwriting valuable data.
Here is writeToCsv().
- We start by creating a success variable and setting it to false.
- While the operation hasn't succeeded, we check to see if the file exists. We set append to the fileExists variable.
- This way, if the file already exists, we append to it instead of writing a new file. If our data isn't an array, we convert it to one.
- We use await csvWriter.writeRecords(data); to write our data to the CSV file.
- Once the write has finished, we set success to true.
- If the operation fails, we log the failed data and throw an error.
async function writeToCsv(data, outputFile) {
let success = false;
while (!success) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);
if (!(data instanceof Array)) {
data = [data]
}
const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))
const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
success = true;
} catch (e) {
console.log("Failed data", data);
throw new Error("Failed to write to csv");
}
}
}
Here is our newest iteration. Aside from the new function, not much has changed. Instead of printing to the screen, we write our data to a CSV file.
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;
async function writeToCsv(data, outputFile) {
let success = false;
while (!success) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);
if (!(data instanceof Array)) {
data = [data]
}
const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))
const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
success = true;
} catch (e) {
console.log("Failed data", data);
throw new Error("Failed to write to csv");
}
}
}
function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}
async function scrapeSearchResults(browser, keyword, pageNumber, locality, location="us", retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const formattedKeyword = keyword.replace(" ", "+");
const formattedLocality = locality.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=${formattedKeyword}&location=${formattedLocality}&original_referer=&start=${pageNumber*10}`;
await page.goto(url);
console.log(`Successfully fetched: ${url}`);
const divCards = await page.$$("div[class='base-search-card__info']");
for (const divCard of divCards) {
const nameElement = await divCard.$("h4[class='base-search-card__subtitle']");
const name = await page.evaluate(element => element.textContent, nameElement);
const jobTitleElement = await divCard.$("h3[class='base-search-card__title']");
const jobTitle = await page.evaluate(element => element.textContent, jobTitleElement);
const parentElement = await page.evaluateHandle(element => element.parentElement, divCard);
const aTag = await parentElement.$("a");
const link = await page.evaluate(element => element.getAttribute("href"), aTag);
const jobLocationElement = await divCard.$("span[class='job-search-card__location']");
const jobLocation = await page.evaluate(element => element.textContent, jobLocationElement);
const searchData = {
name: name.trim(),
job_title: jobTitle.trim(),
url: link.trim(),
location: jobLocation.trim()
};
await writeToCsv([searchData], `${keyword.replace(" ", "-")}.csv`);
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}
async function startCrawl(keyword, pages, locality, location, retries) {
const pageList = range(0, pages);
const browser = await puppeteer.launch();
for (const page of pageList) {
await scrapeSearchResults(browser, keyword, page, locality, location, retries)
}
await browser.close();
}
async function main() {
const keywords = ["software engineer"];
const concurrencyLimit = 5;
const pages = 1;
const location = "us";
const locality = "United States";
const retries = 3;
const aggregateFiles = [];
for (const keyword of keywords) {
console.log("Crawl starting");
console.time("startCrawl");
await startCrawl(keyword, pages, locality, location, retries);
console.timeEnd("startCrawl");
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
}
main();
Now, when we scrape objects from the page, we write them to a CSV file.
Step 4: Adding Concurrency
NodeJS is built to run in a single-threaded environment. This makes it seem like concurrency would be difficult to handle; however, it's not. We can harness NodeJS's first-class async support to scrape concurrently. We'll rewrite startCrawl() to handle this.
Here is our final startCrawl() function.
- Instead of using a for loop, we create a list of tasks by splicing from our pageList up to our concurrencyLimit.
- We then await all these tasks to resolve with Promise.all().
- If we set our concurrencyLimit to 5, we'll scrape up to 5 pages at a time.
- Be careful when setting your concurrency limit. Each task opens a browser page inside of Puppeteer. You don't want too many tasks running at once because this can overwhelm your machine.
- You also need to be careful because most proxy providers (ScrapeOps included) give you a concurrency limit with their API.
async function startCrawl(keyword, pages, locality, location, concurrencyLimit, retries) {
const pageList = range(0, pages);
const browser = await puppeteer.launch();
while (pageList.length > 0) {
const currentBatch = pageList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, locality, location, retries));
try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();
}
Our full code now looks like this.
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;
async function writeToCsv(data, outputFile) {
let success = false;
while (!success) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);
if (!(data instanceof Array)) {
data = [data]
}
const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))
const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
success = true;
} catch (e) {
console.log("Failed data", data);
throw new Error("Failed to write to csv");
}
}
}
function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}
async function scrapeSearchResults(browser, keyword, pageNumber, locality, location="us", retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const formattedKeyword = keyword.replace(" ", "+");
const formattedLocality = locality.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=${formattedKeyword}&location=${formattedLocality}&original_referer=&start=${pageNumber*10}`;
await page.goto(url);
console.log(`Successfully fetched: ${url}`);
const divCards = await page.$$("div[class='base-search-card__info']");
for (const divCard of divCards) {
const nameElement = await divCard.$("h4[class='base-search-card__subtitle']");
const name = await page.evaluate(element => element.textContent, nameElement);
const jobTitleElement = await divCard.$("h3[class='base-search-card__title']");
const jobTitle = await page.evaluate(element => element.textContent, jobTitleElement);
const parentElement = await page.evaluateHandle(element => element.parentElement, divCard);
const aTag = await parentElement.$("a");
const link = await page.evaluate(element => element.getAttribute("href"), aTag);
const jobLocationElement = await divCard.$("span[class='job-search-card__location']");
const jobLocation = await page.evaluate(element => element.textContent, jobLocationElement);
const searchData = {
name: name.trim(),
job_title: jobTitle.trim(),
url: link.trim(),
location: jobLocation.trim()
};
await writeToCsv([searchData], `${keyword.replace(" ", "-")}.csv`);
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}
async function startCrawl(keyword, pages, locality, location, concurrencyLimit, retries) {
const pageList = range(0, pages);
const browser = await puppeteer.launch();
while (pageList.length > 0) {
const currentBatch = pageList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, locality, location, retries));
try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();
}
async function main() {
const keywords = ["software engineer"];
const concurrencyLimit = 5;
const pages = 1;
const location = "us";
const locality = "United States";
const retries = 3;
const aggregateFiles = [];
for (const keyword of keywords) {
console.log("Crawl starting");
console.time("startCrawl");
await startCrawl(keyword, pages, locality, location, concurrencyLimit, retries);
console.timeEnd("startCrawl");
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
}
main();
We can now crawl multiple pages simultaneously.
Step 5: Bypassing Anti-Bots
While proxy port integration is possible, the best way to use the ScrapeOps Proxy Aggregator is through the API. With the Proxy Aggregator API, we get really fine control over our proxy connection by passing simple parameters.
There are all sorts of things we can use to customize our connection, but today we only need an api_key, a url, and a country.
Let's explain these a little better.
- api_key: This is literally a key to our ScrapeOps account. Your API key is used to authenticate your account when making requests.
- url: This is the URL of the site we want to scrape. ScrapeOps will fetch this site and send the result back to us.
- country: We pass a country code in for this parameter. ScrapeOps reads our country code and routes our request through a server in the country we chose.
function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}
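As a quick example of what this produces (the url parameter is percent-encoded by URLSearchParams, and YOUR-API-KEY stands in for whatever is in your config.json):

console.log(getScrapeOpsUrl("https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=software+engineer", "us"));
// https://proxy.scrapeops.io/v1/?api_key=YOUR-API-KEY&url=https%3A%2F%2Fwww.linkedin.com%2Fjobs-guest%2Fjobs%2Fapi%2FseeMoreJobPostings%2Fsearch%3Fkeywords%3Dsoftware%2Bengineer&country=us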
Our full production crawler is available below.
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;
async function writeToCsv(data, outputFile) {
let success = false;
while (!success) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);
if (!(data instanceof Array)) {
data = [data]
}
const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))
const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
success = true;
} catch (e) {
console.log("Failed data", data);
throw new Error("Failed to write to csv");
}
}
}
function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}
function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}
async function scrapeSearchResults(browser, keyword, pageNumber, locality, location="us", retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const formattedKeyword = keyword.replace(" ", "+");
const formattedLocality = locality.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=${formattedKeyword}&location=${formattedLocality}&original_referer=&start=${pageNumber*10}`;
const proxyUrl = getScrapeOpsUrl(url, location);
await page.goto(proxyUrl, { timeout: 0 });
console.log(`Successfully fetched: ${url}`);
const divCards = await page.$$("div[class='base-search-card__info']");
for (const divCard of divCards) {
const nameElement = await divCard.$("h4[class='base-search-card__subtitle']");
const name = await page.evaluate(element => element.textContent, nameElement);
const jobTitleElement = await divCard.$("h3[class='base-search-card__title']");
const jobTitle = await page.evaluate(element => element.textContent, jobTitleElement);
const parentElement = await page.evaluateHandle(element => element.parentElement, divCard);
const aTag = await parentElement.$("a");
const link = await page.evaluate(element => element.getAttribute("href"), aTag);
const jobLocationElement = await divCard.$("span[class='job-search-card__location']");
const jobLocation = await page.evaluate(element => element.textContent, jobLocationElement);
const searchData = {
name: name.trim(),
job_title: jobTitle.trim(),
url: link.trim(),
location: jobLocation.trim()
};
await writeToCsv([searchData], `${keyword.replace(" ", "-")}.csv`);
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}
async function startCrawl(keyword, pages, locality, location, concurrencyLimit, retries) {
const pageList = range(0, pages);
const browser = await puppeteer.launch();
while (pageList.length > 0) {
const currentBatch = pageList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, locality, location, retries));
try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();
}
async function main() {
const keywords = ["software engineer"];
const concurrencyLimit = 5;
const pages = 1;
const location = "us";
const locality = "United States";
const retries = 3;
const aggregateFiles = [];
for (const keyword of keywords) {
console.log("Crawl starting");
console.time("startCrawl");
await startCrawl(keyword, pages, locality, location, concurrencyLimit, retries);
console.timeEnd("startCrawl");
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
}
main();
Step 6: Production Run
Next, we need to run this thing in production. We're going to crawl 3 pages with a concurrencyLimit of 5.
Feel free to change any of the following in the main() function.
- keywords
- concurrencyLimit
- pages
- location
- locality
- retries
Here is our full main() if you'd like to review it.
async function main() {
const keywords = ["software engineer"];
const concurrencyLimit = 5;
const pages = 3;
const location = "us";
const locality = "United States";
const retries = 3;
const aggregateFiles = [];
for (const keyword of keywords) {
console.log("Crawl starting");
console.time("startCrawl");
await startCrawl(keyword, pages, locality, location, concurrencyLimit, retries);
console.timeEnd("startCrawl");
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
}
Take a look at our results.
As you can see, we crawled 3 pages in 42.08 seconds. This comes out to an average of 14.02 seconds per page.
Build A LinkedIn Jobs Scraper
Now, for the second part of our project. Our crawler generates a report, and we need a scraper that reads it. After reading the report, it goes through and scrapes individual details about each job posting.
We'll build this scraper in several iterations, just like we did with the crawler.
Step 1: Create Simple Job Data Parser
We'll start with our parsing function. Just like earlier, we'll add some error handling and retries, but our parsing logic is most important.
Take a look at processJob(). We check for bad responses and throw an Error if we don't receive the correct response. If we get a good response, we continue on and parse the page.
async function processJob(browser, row, location, retries = 3) {
const url = row.url;
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const page = await browser.newPage();
try {
const response = await page.goto(url);
if (!response || response.status() !== 200) {
throw new Error(`Failed to fetch page, status: ${response ? response.status() : "no response"}`);
}
const jobCriteria = await page.$$("li[class='description__job-criteria-item']");
if (jobCriteria.length < 4) {
throw new Error("Job Criteria Not Found!");
}
const seniority = (await page.evaluate(element => element.textContent, jobCriteria[0])).replace("Seniority level", "");
const positionType = (await page.evaluate(element => element.textContent, jobCriteria[1])).replace("Employment type", "");
const jobFunction = (await page.evaluate(element => element.textContent, jobCriteria[2])).replace("Job function", "");
const industry = (await page.evaluate(element => element.textContent, jobCriteria[3])).replace("Industries", "");
const jobData = {
name: row.name,
seniority: seniority.trim(),
position_type: positionType.trim(),
job_function: jobFunction.trim(),
industry: industry.trim()
}
console.log(jobData)
success = true;
console.log("Successfully parsed", row.url);
} catch (err) {
tries++;
console.log(`Error: ${err}, tries left: ${retries-tries}, url: ${getScrapeOpsUrl(url)}`);
} finally {
await page.close();
}
}
}
- jobCriteria = await page.$$("li[class='description__job-criteria-item']"); finds the items from our criteria list.
- The criteria list goes as follows:
  - const seniority = (await page.evaluate(element => element.textContent, jobCriteria[0])).replace("Seniority level", ""); : seniority level
  - const positionType = (await page.evaluate(element => element.textContent, jobCriteria[1])).replace("Employment type", ""); : position type
  - const jobFunction = (await page.evaluate(element => element.textContent, jobCriteria[2])).replace("Job function", ""); : job function
  - const industry = (await page.evaluate(element => element.textContent, jobCriteria[3])).replace("Industries", ""); : industry
- We use page.evaluate() to pull the text from each element we find.
Step 2: Loading URLs To Scrape
Our parsing function takes a row as an argument. To give it a row, we need to read the rows from our CSV file. We'll read our file into an array, and then we'll use a for loop to scrape details from every posting we found.
Here is our first iteration of processResults(). Later on, we'll rewrite it and add concurrency support. It's pretty similar to our startCrawl() function from earlier in this tutorial.
async function processResults(csvFile, location, retries) {
const rows = await readCsv(csvFile);
const browser = await puppeteer.launch();
for (const row of rows) {
await processJob(browser, row, location, retries)
}
await browser.close();
}
When we fit it into our script, here's how everything should look.
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;
async function writeToCsv(data, outputFile) {
let success = false;
while (!success) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);
if (!(data instanceof Array)) {
data = [data]
}
const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))
const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
success = true;
} catch (e) {
console.log("Failed data", data);
throw new Error("Failed to write to csv");
}
}
}
async function readCsv(inputFile) {
const results = [];
const parser = fs.createReadStream(inputFile).pipe(csvParse.parse({
columns: true,
delimiter: ",",
trim: true,
skip_empty_lines: true
}));
for await (const record of parser) {
results.push(record);
}
return results;
}
function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}
function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}
async function scrapeSearchResults(browser, keyword, pageNumber, locality, location="us", retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const formattedKeyword = keyword.replace(" ", "+");
const formattedLocality = locality.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=${formattedKeyword}&location=${formattedLocality}&original_referer=&start=${pageNumber*10}`;
const proxyUrl = getScrapeOpsUrl(url, location);
await page.goto(proxyUrl, { timeout: 0 });
console.log(`Successfully fetched: ${url}`);
const divCards = await page.$$("div[class='base-search-card__info']");
for (const divCard of divCards) {
const nameElement = await divCard.$("h4[class='base-search-card__subtitle']");
const name = await page.evaluate(element => element.textContent, nameElement);
const jobTitleElement = await divCard.$("h3[class='base-search-card__title']");
const jobTitle = await page.evaluate(element => element.textContent, jobTitleElement);
const parentElement = await page.evaluateHandle(element => element.parentElement, divCard);
const aTag = await parentElement.$("a");
const link = await page.evaluate(element => element.getAttribute("href"), aTag);
const jobLocationElement = await divCard.$("span[class='job-search-card__location']");
const jobLocation = await page.evaluate(element => element.textContent, jobLocationElement);
const searchData = {
name: name.trim(),
job_title: jobTitle.trim(),
url: link.trim(),
location: jobLocation.trim()
};
await writeToCsv([searchData], `${keyword.replace(" ", "-")}.csv`);
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}
async function startCrawl(keyword, pages, locality, location, concurrencyLimit, retries) {
const pageList = range(0, pages);
const browser = await puppeteer.launch();
while (pageList.length > 0) {
const currentBatch = pageList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, locality, location, retries));
try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();
}
async function processJob(browser, row, location, retries = 3) {
const url = row.url;
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const page = await browser.newPage();
try {
const response = await page.goto(url);
if (!response || response.status() !== 200) {
throw new Error(`Failed to fetch page, status: ${response ? response.status() : "no response"}`);
}
const jobCriteria = await page.$$("li[class='description__job-criteria-item']");
if (jobCriteria.length < 4) {
throw new Error("Job Criteria Not Found!");
}
const seniority = (await page.evaluate(element => element.textContent, jobCriteria[0])).replace("Seniority level", "");
const positionType = (await page.evaluate(element => element.textContent, jobCriteria[1])).replace("Employment type", "");
const jobFunction = (await page.evaluate(element => element.textContent, jobCriteria[2])).replace("Job function", "");
const industry = (await page.evaluate(element => element.textContent, jobCriteria[3])).replace("Industries", "");
const jobData = {
name: row.name,
seniority: seniority.trim(),
position_type: positionType.trim(),
job_function: jobFunction.trim(),
industry: industry.trim()
}
console.log(jobData)
success = true;
console.log("Successfully parsed", row.url);
} catch (err) {
tries++;
console.log(`Error: ${err}, tries left: ${retries-tries}, url: ${getScrapeOpsUrl(url)}`);
} finally {
await page.close();
}
}
}
async function processResults(csvFile, location, retries) {
const rows = await readCsv(csvFile);
const browser = await puppeteer.launch();
for (const row of rows) {
await processJob(browser, row, location, retries)
}
await browser.close();
}
async function main() {
const keywords = ["software engineer"];
const concurrencyLimit = 5;
const pages = 1;
const location = "us";
const locality = "United States";
const retries = 3;
const aggregateFiles = [];
for (const keyword of keywords) {
console.log("Crawl starting");
console.time("startCrawl");
await startCrawl(keyword, pages, locality, location, concurrencyLimit, retries);
console.timeEnd("startCrawl");
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
console.log("Starting scrape");
for (const file of aggregateFiles) {
console.time("processResults");
await processResults(file, location, retries);
console.timeEnd("processResults");
}
console.log("Scrape complete");
}
main();
Step 3: Storing the Scraped Data
Just like we did earlier, we need to store our scraped data. In our parsing function, we're already creating a jobData object, and we already have a writeToCsv() function. Instead of logging our jobData to the console, we just need to store it.
In the code below, we're going to do exactly that.
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;
async function writeToCsv(data, outputFile) {
let success = false;
while (!success) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);
if (!(data instanceof Array)) {
data = [data]
}
const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))
const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
success = true;
} catch (e) {
console.log("Failed data", data);
throw new Error("Failed to write to csv");
}
}
}
async function readCsv(inputFile) {
const results = [];
const parser = fs.createReadStream(inputFile).pipe(csvParse.parse({
columns: true,
delimiter: ",",
trim: true,
skip_empty_lines: true
}));
for await (const record of parser) {
results.push(record);
}
return results;
}
function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}
function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}
async function scrapeSearchResults(browser, keyword, pageNumber, locality, location="us", retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const formattedKeyword = keyword.replace(" ", "+");
const formattedLocality = locality.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=${formattedKeyword}&location=${formattedLocality}&original_referer=&start=${pageNumber*10}`;
const proxyUrl = getScrapeOpsUrl(url, location);
await page.goto(proxyUrl, { timeout: 0 });
console.log(`Successfully fetched: ${url}`);
const divCards = await page.$$("div[class='base-search-card__info']");
for (const divCard of divCards) {
const nameElement = await divCard.$("h4[class='base-search-card__subtitle']");
const name = await page.evaluate(element => element.textContent, nameElement);
const jobTitleElement = await divCard.$("h3[class='base-search-card__title']");
const jobTitle = await page.evaluate(element => element.textContent, jobTitleElement);
const parentElement = await page.evaluateHandle(element => element.parentElement, divCard);
const aTag = await parentElement.$("a");
const link = await page.evaluate(element => element.getAttribute("href"), aTag);
const jobLocationElement = await divCard.$("span[class='job-search-card__location']");
const jobLocation = await page.evaluate(element => element.textContent, jobLocationElement);
const searchData = {
name: name.trim(),
job_title: jobTitle.trim(),
url: link.trim(),
location: jobLocation.trim()
};
await writeToCsv([searchData], `${keyword.replace(" ", "-")}.csv`);
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}
async function startCrawl(keyword, pages, locality, location, concurrencyLimit, retries) {
const pageList = range(0, pages);
const browser = await puppeteer.launch();
while (pageList.length > 0) {
const currentBatch = pageList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, locality, location, retries));
try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();
}
async function processJob(browser, row, location, retries = 3) {
const url = row.url;
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const page = await browser.newPage();
try {
const response = await page.goto(url);
if (!response || response.status() !== 200) {
throw new Error(`Failed to fetch page, status: ${response ? response.status() : "no response"}`);
}
const jobCriteria = await page.$$("li[class='description__job-criteria-item']");
if (jobCriteria.length < 4) {
throw new Error("Job Criteria Not Found!");
}
const seniority = (await page.evaluate(element => element.textContent, jobCriteria[0])).replace("Seniority level", "");
const positionType = (await page.evaluate(element => element.textContent, jobCriteria[1])).replace("Employment type", "");
const jobFunction = (await page.evaluate(element => element.textContent, jobCriteria[2])).replace("Job function", "");
const industry = (await page.evaluate(element => element.textContent, jobCriteria[3])).replace("Industries", "");
const jobData = {
name: row.name,
seniority: seniority.trim(),
position_type: positionType.trim(),
job_function: jobFunction.trim(),
industry: industry.trim()
}
await writeToCsv([jobData], `${row.name.replace(" ", "-")}-${row.job_title.replace(" ", "-")}.csv`);
success = true;
console.log("Successfully parsed", row.url);
} catch (err) {
tries++;
console.log(`Error: ${err}, tries left: ${retries-tries}, url: ${getScrapeOpsUrl(url)}`);
} finally {
await page.close();
}
}
}
async function processResults(csvFile, location, retries) {
const rows = await readCsv(csvFile);
const browser = await puppeteer.launch();
for (const row of rows) {
await processJob(browser, row, location, retries)
}
await browser.close();
}
async function main() {
const keywords = ["software engineer"];
const concurrencyLimit = 5;
const pages = 1;
const location = "us";
const locality = "United States";
const retries = 3;
const aggregateFiles = [];
for (const keyword of keywords) {
console.log("Crawl starting");
console.time("startCrawl");
await startCrawl(keyword, pages, locality, location, concurrencyLimit, retries);
console.timeEnd("startCrawl");
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
console.log("Starting scrape");
for (const file of aggregateFiles) {
console.time("processResults");
await processResults(file, location, retries);
console.timeEnd("processResults");
}
console.log("Scrape complete");
}
main();
- jobData holds the data we pull from the page.
- We pass our jobData into writeToCsv() and it then gets saved to a CSV file.
Step 4: Adding Concurrency
Adding concurrency here will be done almost exactly the same way we did it earlier.
- We first read our file into an array. We'll make an array of tasks by splicing our rows by our concurrencyLimit.
- Then, we'll await everything to resolve using Promise.all().
- This allows us to fetch and scrape multiple pages simultaneously.
- Like before, if we set our concurrencyLimit to 5, we'll be processing the rows in batches of 5.
async function processResults(csvFile, location, concurrencyLimit, retries) {
const rows = await readCsv(csvFile);
    const browser = await puppeteer.launch();
while (rows.length > 0) {
const currentBatch = rows.splice(0, concurrencyLimit);
const tasks = currentBatch.map(row => processJob(browser, row, location, retries));
try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();
}
- `await readCsv(csvFile);`: This returns all the rows from the CSV file in an array.
- `rows.splice(0, concurrencyLimit);` shrinks the `rows` array and gives us a chunk to work with.
- `currentBatch.map(row => processJob(browser, row, location, retries))` runs `processJob()` on each element in the chunk.
- `await Promise.all(tasks);` waits for each one of our `tasks` to resolve. Since `processJob()` catches its own errors and retries internally, a single failed job won't reject the whole batch.
- This process repeats until our `rows` array is completely empty.
Step 5: Bypassing Anti-Bots
We're almost finished with the project. However, there is one thing we still need to add: proxy support. We've already written a function that handles this; we just need to use it in the right place. We're only going to change one line of code here.
const response = await page.goto(getScrapeOpsUrl(url, location), { timeout: 0 });
- We set `{ timeout: 0 }` to tell Puppeteer not to time out. When dealing with a proxy and a site as difficult as LinkedIn, pages sometimes take a while to come back to us.
- Now that our `location` is getting passed into our proxy function, we're actually routed through a server in the country of our choice (sketched below).
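If it helps to see the change in isolation, here is a rough sketch of what happens to a job URL before we hand it to `page.goto()`. The job URL below is a placeholder, and the expanded query string is only an approximation of what `URLSearchParams` produces with your real API key.

```javascript
// Placeholder job URL -- for illustration only.
const jobUrl = "https://www.linkedin.com/jobs/view/1234567890";

// getScrapeOpsUrl() (defined earlier) wraps the target URL with our API key and country code.
const proxiedUrl = getScrapeOpsUrl(jobUrl, "us");
// Roughly: https://proxy.scrapeops.io/v1/?api_key=...&url=https%3A%2F%2Fwww.linkedin.com%2Fjobs%2Fview%2F1234567890&country=us

// The one changed line inside processJob(): navigate through the proxy with no timeout.
const response = await page.goto(proxiedUrl, { timeout: 0 });
```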
Take a look at the finished scraper.
const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");
const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;
async function writeToCsv(data, outputFile) {
let success = false;
while (!success) {
if (!data || data.length === 0) {
throw new Error("No data to write!");
}
const fileExists = fs.existsSync(outputFile);
if (!(data instanceof Array)) {
data = [data]
}
const headers = Object.keys(data[0]).map(key => ({id: key, title: key}))
const csvWriter = createCsvWriter({
path: outputFile,
header: headers,
append: fileExists
});
try {
await csvWriter.writeRecords(data);
success = true;
} catch (e) {
console.log("Failed data", data);
throw new Error("Failed to write to csv");
}
}
}
async function readCsv(inputFile) {
const results = [];
const parser = fs.createReadStream(inputFile).pipe(csvParse.parse({
columns: true,
delimiter: ",",
trim: true,
skip_empty_lines: true
}));
for await (const record of parser) {
results.push(record);
}
return results;
}
function range(start, end) {
const array = [];
for (let i=start; i<end; i++) {
array.push(i);
}
return array;
}
function getScrapeOpsUrl(url, location="us") {
const params = new URLSearchParams({
api_key: API_KEY,
url: url,
country: location
});
return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}
async function scrapeSearchResults(browser, keyword, pageNumber, locality, location="us", retries=3) {
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const formattedKeyword = keyword.replace(" ", "+");
const formattedLocality = locality.replace(" ", "+");
const page = await browser.newPage();
try {
const url = `https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=${formattedKeyword}&location=${formattedLocality}&original_referer=&start=${pageNumber*10}`;
const proxyUrl = getScrapeOpsUrl(url, location);
await page.goto(proxyUrl, { timeout: 0 });
console.log(`Successfully fetched: ${url}`);
const divCards = await page.$$("div[class='base-search-card__info']");
for (const divCard of divCards) {
const nameElement = await divCard.$("h4[class='base-search-card__subtitle']");
const name = await page.evaluate(element => element.textContent, nameElement);
const jobTitleElement = await divCard.$("h3[class='base-search-card__title']");
const jobTitle = await page.evaluate(element => element.textContent, jobTitleElement);
const parentElement = await page.evaluateHandle(element => element.parentElement, divCard);
const aTag = await parentElement.$("a");
const link = await page.evaluate(element => element.getAttribute("href"), aTag);
const jobLocationElement = await divCard.$("span[class='job-search-card__location']");
const jobLocation = await page.evaluate(element => element.textContent, jobLocationElement);
const searchData = {
name: name.trim(),
job_title: jobTitle.trim(),
url: link.trim(),
location: jobLocation.trim()
};
await writeToCsv([searchData], `${keyword.replace(" ", "-")}.csv`);
}
success = true;
} catch (err) {
console.log(`Error: ${err}, tries left ${retries - tries}`);
tries++;
} finally {
await page.close();
}
}
}
async function startCrawl(keyword, pages, locality, location, concurrencyLimit, retries) {
const pageList = range(0, pages);
const browser = await puppeteer.launch();
while (pageList.length > 0) {
const currentBatch = pageList.splice(0, concurrencyLimit);
const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, locality, location, retries));
try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();
}
async function processJob(browser, row, location, retries = 3) {
const url = row.url;
let tries = 0;
let success = false;
while (tries <= retries && !success) {
const page = await browser.newPage();
try {
const response = await page.goto(getScrapeOpsUrl(url, location), { timeout: 0 });
if (!response || response.status() !== 200) {
throw new Error("Failed to fetch page, status:", response.status());
}
const jobCriteria = await page.$$("li[class='description__job-criteria-item']");
if (jobCriteria.length < 4) {
throw new Error("Job Criteria Not Found!");
}
const seniority = (await page.evaluate(element => element.textContent, jobCriteria[0])).replace("Seniority level", "");
const positionType = (await page.evaluate(element => element.textContent, jobCriteria[1])).replace("Employment type", "");
const jobFunction = (await page.evaluate(element => element.textContent, jobCriteria[2])).replace("Job function", "");
const industry = (await page.evaluate(element => element.textContent, jobCriteria[3])).replace("Industries", "");
const jobData = {
name: row.name,
seniority: seniority.trim(),
position_type: positionType.trim(),
job_function: jobFunction.trim(),
industry: industry.trim()
}
await writeToCsv([jobData], `${row.name.replace(" ", "-")}-${row.job_title.replace(" ", "-")}.csv`);
success = true;
console.log("Successfully parsed", row.url);
} catch (err) {
tries++;
console.log(`Error: ${err}, tries left: ${retries-tries}, url: ${getScrapeOpsUrl(url)}`);
} finally {
await page.close();
}
}
}
async function processResults(csvFile, location, concurrencyLimit, retries) {
const rows = await readCsv(csvFile);
    const browser = await puppeteer.launch();
while (rows.length > 0) {
const currentBatch = rows.splice(0, concurrencyLimit);
const tasks = currentBatch.map(row => processJob(browser, row, location, retries));
try {
await Promise.all(tasks);
} catch (err) {
console.log(`Failed to process batch: ${err}`);
}
}
await browser.close();
}
async function main() {
const keywords = ["software engineer"];
const concurrencyLimit = 5;
const pages = 1;
const location = "us";
const locality = "United States";
const retries = 3;
const aggregateFiles = [];
for (const keyword of keywords) {
console.log("Crawl starting");
console.time("startCrawl");
await startCrawl(keyword, pages, locality, location, concurrencyLimit, retries);
console.timeEnd("startCrawl");
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
console.log("Starting scrape");
for (const file of aggregateFiles) {
console.time("processResults");
await processResults(file, location, concurrencyLimit, retries);
console.timeEnd("processResults");
}
console.log("Scrape complete");
}
main();
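To try the finished scraper yourself, save it in the same folder as your `config.json` (any filename works; `linkedin-jobs.js` is just an example) and run it with `node linkedin-jobs.js`. Just make sure `puppeteer`, `csv-writer`, and `csv-parse` are installed in the project first.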
Step 6: Production Run
Time for our final run. As we did earlier, we use a concurrency limit of 5 to crawl 3 pages of results. Then, we scrape each job from our search results.
If you need a refresher, take a look at our `main()` below. As we mentioned earlier, you can change the following to tweak your results:
- `keywords`
- `concurrencyLimit`
- `pages`
- `location`
- `locality`
- `retries`
async function main() {
const keywords = ["software engineer"];
const concurrencyLimit = 5;
const pages = 3;
const location = "us";
const locality = "United States";
const retries = 3;
const aggregateFiles = [];
for (const keyword of keywords) {
console.log("Crawl starting");
console.time("startCrawl");
await startCrawl(keyword, pages, locality, location, concurrencyLimit, retries);
console.timeEnd("startCrawl");
console.log("Crawl complete");
aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
}
console.log("Starting scrape");
for (const file of aggregateFiles) {
console.time("processResults");
await processResults(file, location, concurrencyLimit, retries);
console.timeEnd("processResults");
}
console.log("Scrape complete");
}
Here are the results.
If you remember, our earlier 3-page crawl took 33.694 seconds. This time, the crawl took much longer (over a minute) and gave us a CSV with 30 results.
The scrape took a total of 4 minutes and 19.347 seconds, or 259.347 seconds. 259.347 seconds / 30 pages = 8.645 seconds per page.
As our scrape gets larger, the average time per page tends to drop. This is due to our concurrency: we're always processing jobs in batches rather than one at a time.
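As a quick sanity check on those numbers (using the timings from this particular run), the per-page math works out like this:

```javascript
// Sanity check on the run above: total scrape time divided by pages scraped.
const totalSeconds = 4 * 60 + 19.347;   // 259.347 seconds
const pagesScraped = 30;
console.log(`${(totalSeconds / pagesScraped).toFixed(3)} seconds per page`); // ~8.645
```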
Legal and Ethical Considerations
Don't scrape private data. Private data is any data that's gated behind a login page. When we scrape LinkedIn jobs, we're not logging in and we're scraping publicly available data. You should do the same.
If your scraper is legally questionable, you need to consult an attorney. Laws are different all over the world, but it is generally legal to scrape public data. It's not much different than taking a picture of a public billboard.
You also need to take ethics into account when scraping the web (especially LinkedIn). Because we never log in or agree to anything, we're arguably not bound by LinkedIn's terms of service or their `robots.txt`, but LinkedIn takes these policies very seriously.
Their terms are available here and their `robots.txt` is here. As stated at the top of their `robots.txt`, crawling LinkedIn is explicitly prohibited.
By scraping LinkedIn, you can have your account suspended, banned, or even deleted.
Conclusion
You've seen it (and possibly done it) yourself! It is completely possible to scrape LinkedIn.
At this point, you should have a pretty solid grasp of how to use Puppeteer for basic scraping operations. You also understand our iterative build process for the following features: parsing, pagination, data storage, concurrency and proxy integration.
If you want to know more about the tech stack from this article, check out the links below!
More Puppeteer Web Scraping Guides
At ScrapeOps, we love to scrape the web. We wrote the playbook on scraping with NodeJS Puppeteer. Whether you're brand new, or an experienced dev, we've got something for you.
If you'd like to read more of our "How To Scrape" series, take a look at the links below.