Then check out ScrapeOps, the complete toolkit for web scraping.
To use this code, you'll need a config.json file that holds your ScrapeOps API key:

```json
{"api_key": "your-super-secret-api-key"}
```
Then, paste the code below into a new Python file.

```python
import os
import csv
import requests
import json
import logging
import time
from urllib.parse import urlencode
from bs4 import BeautifulSoup
import concurrent.futures
from dataclasses import dataclass, field, fields, asdict

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]


def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location,
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url


## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@dataclass
class SearchData:
    name: str = ""
    job_title: str = ""
    url: str = ""
    location: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())


@dataclass
class JobData:
    name: str = ""
    seniority: str = ""
    position_type: str = ""
    job_function: str = ""
    industry: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())


class DataPipeline:
    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)
            if not file_exists:
                writer.writeheader()
            for item in data_to_save:
                writer.writerow(asdict(item))

        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if self.is_duplicate(scraped_data) == False:
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)
        if len(self.storage_queue) > 0:
            self.save_to_csv()


def scrape_search_results(keyword, location, locality, page_number, data_pipeline=None, retries=3):
    formatted_keyword = keyword.replace(" ", "+")
    formatted_locality = locality.replace(" ", "+")
    url = f"https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords={formatted_keyword}&location={formatted_locality}&original_referer=&start={page_number*10}"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            scrapeops_proxy_url = get_scrapeops_url(url, location=location)
            response = requests.get(scrapeops_proxy_url)
            logger.info(f"Received [{response.status_code}] from: {url}")
            if response.status_code != 200:
                raise Exception(f"Failed request, Status Code {response.status_code}")

            soup = BeautifulSoup(response.text, "html.parser")
            div_cards = soup.find_all("div", class_="base-search-card__info")

            for div_card in div_cards:
                company_name = div_card.find("h4", class_="base-search-card__subtitle").text
                job_title = div_card.find("h3", class_="base-search-card__title").text
                link = div_card.parent.find("a")
                job_link = link.get("href")
                location = div_card.find("span", class_="job-search-card__location").text

                search_data = SearchData(
                    name=company_name,
                    job_title=job_title,
                    url=job_link,
                    location=location
                )
                data_pipeline.add_data(search_data)

            logger.info(f"Successfully parsed data from: {url}")
            success = True

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")


def start_scrape(keyword, pages, location, locality, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_search_results,
            [keyword] * pages,
            [location] * pages,
            [locality] * pages,
            range(pages),
            [data_pipeline] * pages,
            [retries] * pages
        )


def process_posting(row, location, retries=3):
    url = row["url"]
    tries = 0
    success = False

    while tries <= retries and not success:
        response = requests.get(get_scrapeops_url(url, location=location))
        try:
            if response.status_code != 200:
                raise Exception(f"Failed Request, status code: {response.status_code}")
            logger.info(f"Status: {response.status_code}")

            soup = BeautifulSoup(response.text, "html.parser")
            job_pipeline = DataPipeline(csv_filename=f"{row['name'].replace(' ', '-')}.csv")
            job_criteria = soup.find_all("li", class_="description__job-criteria-item")
            seniority = job_criteria[0].text.replace("Seniority level", "")
            position_type = job_criteria[1].text.replace("Employment type", "")
            job_function = job_criteria[2].text.replace("Job function", "")
            industry = job_criteria[3].text.replace("Industries", "")

            job_data = JobData(
                name=row["name"],
                seniority=seniority,
                position_type=position_type,
                job_function=job_function,
                industry=industry
            )
            job_pipeline.add_data(job_data)
            job_pipeline.close_pipeline()
            success = True

        except Exception as e:
            logger.error(f"Exception thrown: {e}")
            logger.warning(f"Failed to process page: {row['url']}, retries left: {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")
    else:
        logger.info(f"Successfully parsed: {row['url']}")


def process_results(csv_file, location, max_threads=5, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
            executor.map(
                process_posting,
                reader,
                [location] * len(reader),
                [retries] * len(reader)
            )


if __name__ == "__main__":
    MAX_RETRIES = 3
    MAX_THREADS = 5
    PAGES = 3
    LOCATION = "us"
    LOCALITY = "United States"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["software engineer"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")
        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, PAGES, LOCATION, LOCALITY, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")

    for file in aggregate_files:
        process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
```
You can customize your results by changing any of these constants from our main:

- MAX_RETRIES: Defines the maximum number of times the script will attempt to retrieve a webpage if the initial request fails (e.g., due to network issues or rate limiting).
- MAX_THREADS: Sets the maximum number of threads that the script will use concurrently during scraping.
- PAGES: The number of pages of job listings to scrape for each keyword.
- LOCATION: The country code or identifier for the region from which job listings should be scraped (e.g., "us" for the United States).
- LOCALITY: The textual representation of the location where the jobs are being scraped (e.g., "United States").
- keyword_list: A list of keywords representing job titles or roles to search for on LinkedIn (e.g., ["software engineer"]).

To run the scraper, use python name_of_your_script.py. You'll get a CSV named after the keyword you searched. Then, you'll get an individual CSV report on each job as well.

In this build, we use:

- ThreadPoolExecutor to add support for multithreading and therefore concurrency.
- ThreadPoolExecutor to scrape posting data concurrently.

Our search URLs are laid out like this:

https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords={formatted_keyword}&location={formatted_locality}&original_referer=
If we search for software engineer jobs, the URL looks like this:

https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=software+engineer&location={formatted_locality}&original_referer=

On the search results page, each job is embedded in a div card with the class name base-search-card__info. On an individual job posting page, each piece of job criteria is an li element with the class name description__job-criteria-item. These base-search-card__info cards and li items hold the data we want to extract.

To paginate our results, we add &start={page_number*10} to the URL. Our full URL for the first page (page 0) of the Software Engineer search would look like this:

https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=software+engineer&location=United+States&original_referer=&start=0

We use page_number*10 because we begin counting at 0 and each request yields 10 results. Page 0 (0 * 10) gives us results 1 through 10. Page 1 gives us 11 through 20, and so on and so forth.

Inside our Python code, the URL looks like this:

```python
f"https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords={formatted_keyword}&location={formatted_locality}&original_referer=&start={page_number*10}"
```
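If the arithmetic is easier to see in code, here is a quick sketch (not part of the scraper itself) of how page numbers map onto the start parameter:

```python
# Quick illustration of the page_number * 10 offset described above.
for page_number in range(3):
    start = page_number * 10
    print(f"page {page_number}: start={start} -> results {start + 1} to {start + 10}")

# page 0: start=0 -> results 1 to 10
# page 1: start=10 -> results 11 to 20
# page 2: start=20 -> results 21 to 30
```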
The ScrapeOps Proxy API gives us control over our geolocation through the country parameter. If we want to appear in the US, we pass "country": "us" into the API. If we want to appear in the UK, we pass "country": "uk".
mkdir linkedin-jobs-scraper
cd linkedin-jobs-scraper
python -m venv venv
source venv/bin/activate
pip install requests
pip install beautifulsoup4
We'll start by writing a parsing function, scrape_search_results().

```python
import os
import csv
import requests
import json
import logging
from urllib.parse import urlencode
from bs4 import BeautifulSoup
import concurrent.futures
from dataclasses import dataclass, field, fields, asdict

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]


## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def scrape_search_results(keyword, location, locality, retries=3):
    formatted_keyword = keyword.replace(" ", "+")
    formatted_locality = locality.replace(" ", "+")
    url = f"https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords={formatted_keyword}&location={formatted_locality}&original_referer="
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            response = requests.get(url)
            logger.info(f"Received [{response.status_code}] from: {url}")
            if response.status_code != 200:
                raise Exception(f"Failed request, Status Code {response.status_code}")

            soup = BeautifulSoup(response.text, "html.parser")
            div_cards = soup.find_all("div", class_="base-search-card__info")

            for div_card in div_cards:
                company_name = div_card.find("h4", class_="base-search-card__subtitle").text
                job_title = div_card.find("h3", class_="base-search-card__title").text
                link = div_card.parent.find("a")
                job_link = link.get("href")
                location = div_card.find("span", class_="job-search-card__location").text

                search_data = {
                    "name": company_name,
                    "job_title": job_title,
                    "url": job_link,
                    "location": location
                }
                print(search_data)

            logger.info(f"Successfully parsed data from: {url}")
            success = True

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")


if __name__ == "__main__":
    MAX_RETRIES = 3
    MAX_THREADS = 5
    PAGES = 3
    LOCATION = "us"
    LOCALITY = "United States"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["software engineer"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        scrape_search_results(keyword, LOCATION, LOCALITY)
    logger.info(f"Crawl complete.")
```
soup.find_all("div", class_="base-search-card__info")
to find all of our base result cards.div_card.find("h4", class_="base-search-card__subtitle").text
finds our company_name
.h3
, so we use div_card.find("h3", class_="base-search-card__title").text
to find it.div_card.parent.find("a")
.href
from the link element with link.get("href")
.div_card.find("span", class_="job-search-card__location").text
gets the job location from the card.start={page_number*10}
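If you want to see these selectors in isolation, here is a minimal sketch that runs them against a made-up HTML fragment (real LinkedIn markup contains far more than this; only the class names come from the tutorial):

```python
from bs4 import BeautifulSoup

# A fake result card that mimics the structure described above.
html = """
<div class="base-card">
  <a href="https://www.linkedin.com/jobs/view/example-123"></a>
  <div class="base-search-card__info">
    <h3 class="base-search-card__title">Software Engineer</h3>
    <h4 class="base-search-card__subtitle">Example Corp</h4>
    <span class="job-search-card__location">United States</span>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
card = soup.find("div", class_="base-search-card__info")
print(card.find("h4", class_="base-search-card__subtitle").text)   # Example Corp
print(card.find("h3", class_="base-search-card__title").text)      # Software Engineer
print(card.parent.find("a").get("href"))                           # the job link
print(card.find("span", class_="job-search-card__location").text)  # United States
```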
To paginate our results, we add start={page_number*10} to the end of our URL. We also need a function that allows us to scrape multiple pages; we'll call it start_scrape(). Our fully paginated URLs are laid out in the snippet you see below.

```python
url = f"https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords={formatted_keyword}&location={formatted_locality}&original_referer=&start={page_number*10}"
```
start_scrape() is in our next snippet. At the moment, it's just a simple for loop that parses pages using iteration. Later on, we'll make some improvements to it.

```python
def start_scrape(keyword, pages, location, locality, retries=3):
    for page in range(pages):
        scrape_search_results(keyword, location, locality, page, retries=retries)
```
Here is our full code up to this point:

```python
import os
import csv
import requests
import json
import logging
from urllib.parse import urlencode
from bs4 import BeautifulSoup
import concurrent.futures
from dataclasses import dataclass, field, fields, asdict

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]


## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def scrape_search_results(keyword, location, locality, page_number, retries=3):
    formatted_keyword = keyword.replace(" ", "+")
    formatted_locality = locality.replace(" ", "+")
    url = f"https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords={formatted_keyword}&location={formatted_locality}&original_referer=&start={page_number*10}"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            response = requests.get(url)
            logger.info(f"Received [{response.status_code}] from: {url}")
            if response.status_code != 200:
                raise Exception(f"Failed request, Status Code {response.status_code}")

            soup = BeautifulSoup(response.text, "html.parser")
            div_cards = soup.find_all("div", class_="base-search-card__info")

            for div_card in div_cards:
                company_name = div_card.find("h4", class_="base-search-card__subtitle").text
                job_title = div_card.find("h3", class_="base-search-card__title").text
                link = div_card.parent.find("a")
                job_link = link.get("href")
                location = div_card.find("span", class_="job-search-card__location").text

                search_data = {
                    "name": company_name,
                    "job_title": job_title,
                    "url": job_link,
                    "location": location
                }
                print(search_data)

            logger.info(f"Successfully parsed data from: {url}")
            success = True

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")


def start_scrape(keyword, pages, location, locality, retries=3):
    for page in range(pages):
        scrape_search_results(keyword, location, locality, page, retries=retries)


if __name__ == "__main__":
    MAX_RETRIES = 3
    MAX_THREADS = 5
    PAGES = 3
    LOCATION = "us"
    LOCALITY = "United States"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["software engineer"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        start_scrape(keyword, PAGES, LOCATION, LOCALITY, retries=MAX_RETRIES)
    logger.info(f"Crawl complete.")
```
- start={page_number*10} gives us the ability to control pagination inside our URL.
- start_scrape() allows us to parse a list of pages.

To store our data, we need a dataclass called SearchData and a DataPipeline. SearchData simply needs to represent individual search items. DataPipeline needs to open a pipe to a CSV file and store SearchData objects inside our CSV.

Here is our SearchData. It holds the name, job_title, url, and location that we find during the parse.

```python
@dataclass
class SearchData:
    name: str = ""
    job_title: str = ""
    url: str = ""
    location: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())
```
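Here is a quick look (assuming the class above is already defined) at what __post_init__ does to the fields:

```python
item = SearchData(name="  Example Corp  ", job_title="Software Engineer", url="", location="United States")
print(item.name)  # "Example Corp" -- surrounding whitespace is stripped
print(item.url)   # "No url" -- empty strings get default text
```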
Once we have our SearchData, it gets passed into the DataPipeline you see below. The DataPipeline first checks to see if our CSV file exists. If it exists, we append to it. If the file doesn't exist, we create one. This approach stops us from accidentally destroying important data. This class also filters out duplicates using the name attribute.

```python
class DataPipeline:
    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)
            if not file_exists:
                writer.writeheader()
            for item in data_to_save:
                writer.writerow(asdict(item))

        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if self.is_duplicate(scraped_data) == False:
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)
        if len(self.storage_queue) > 0:
            self.save_to_csv()
```
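As a small usage sketch (assuming the SearchData and DataPipeline classes above), the pipeline is used like this; note that the second item is dropped because it shares a name with the first:

```python
pipeline = DataPipeline(csv_filename="example-output.csv")
pipeline.add_data(SearchData(name="Example Corp", job_title="Software Engineer",
                             url="https://www.linkedin.com/jobs/view/example-1", location="United States"))
pipeline.add_data(SearchData(name="Example Corp", job_title="Backend Engineer",
                             url="https://www.linkedin.com/jobs/view/example-2", location="United States"))
pipeline.close_pipeline()  # flush whatever is still in the storage queue to the CSV
```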
Here is our full code up to this point:

```python
import os
import csv
import requests
import json
import logging
import time
from urllib.parse import urlencode
from bs4 import BeautifulSoup
import concurrent.futures
from dataclasses import dataclass, field, fields, asdict

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]


## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@dataclass
class SearchData:
    name: str = ""
    job_title: str = ""
    url: str = ""
    location: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())


class DataPipeline:
    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)
            if not file_exists:
                writer.writeheader()
            for item in data_to_save:
                writer.writerow(asdict(item))

        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if self.is_duplicate(scraped_data) == False:
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)
        if len(self.storage_queue) > 0:
            self.save_to_csv()


def scrape_search_results(keyword, location, locality, page_number, data_pipeline=None, retries=3):
    formatted_keyword = keyword.replace(" ", "+")
    formatted_locality = locality.replace(" ", "+")
    url = f"https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords={formatted_keyword}&location={formatted_locality}&original_referer=&start={page_number*10}"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            response = requests.get(url)
            logger.info(f"Received [{response.status_code}] from: {url}")
            if response.status_code != 200:
                raise Exception(f"Failed request, Status Code {response.status_code}")

            soup = BeautifulSoup(response.text, "html.parser")
            div_cards = soup.find_all("div", class_="base-search-card__info")

            for div_card in div_cards:
                company_name = div_card.find("h4", class_="base-search-card__subtitle").text
                job_title = div_card.find("h3", class_="base-search-card__title").text
                link = div_card.parent.find("a")
                job_link = link.get("href")
                location = div_card.find("span", class_="job-search-card__location").text

                search_data = SearchData(
                    name=company_name,
                    job_title=job_title,
                    url=job_link,
                    location=location
                )
                data_pipeline.add_data(search_data)

            logger.info(f"Successfully parsed data from: {url}")
            success = True

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")


def start_scrape(keyword, pages, location, locality, data_pipeline=None, retries=3):
    for page in range(pages):
        scrape_search_results(keyword, location, locality, page, data_pipeline=data_pipeline, retries=retries)


if __name__ == "__main__":
    MAX_RETRIES = 3
    MAX_THREADS = 5
    PAGES = 3
    LOCATION = "us"
    LOCALITY = "United States"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["software engineer"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")
        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, PAGES, LOCATION, LOCALITY, data_pipeline=crawl_pipeline, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")
```
- SearchData is used to represent individual results from our search results page.
- DataPipeline is used to store these objects in a safe and efficient way.

Next, we're going to add ThreadPoolExecutor and we're going to remove our for loop from start_scrape(). ThreadPoolExecutor allows us to open a pool with max_threads. If we want to use 4 threads, we pass max_threads=4.

```python
def start_scrape(keyword, pages, location, locality, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_search_results,
            [keyword] * pages,
            [location] * pages,
            [locality] * pages,
            range(pages),
            [data_pipeline] * pages,
            [retries] * pages
        )
```
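If executor.map() is new to you, here is a toy example (separate from the scraper) showing how it pairs up the argument lists, taking one element from each list per call:

```python
import concurrent.futures

def fake_scrape(keyword, page):
    # Stand-in for scrape_search_results(); just shows which arguments each call receives.
    print(f"scraping '{keyword}', page {page}")

pages = 3
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    executor.map(fake_scrape, ["software engineer"] * pages, range(pages))
```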
The arguments to executor.map() go as follows:

- scrape_search_results: the function we want to call on all these available threads.
- All of the other arguments get passed in as lists, one element per page.

Here is our full code up to this point:

```python
import os
import csv
import requests
import json
import logging
import time
from urllib.parse import urlencode
from bs4 import BeautifulSoup
import concurrent.futures
from dataclasses import dataclass, field, fields, asdict

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]


## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@dataclass
class SearchData:
    name: str = ""
    job_title: str = ""
    url: str = ""
    location: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())


class DataPipeline:
    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)
            if not file_exists:
                writer.writeheader()
            for item in data_to_save:
                writer.writerow(asdict(item))

        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if self.is_duplicate(scraped_data) == False:
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)
        if len(self.storage_queue) > 0:
            self.save_to_csv()


def scrape_search_results(keyword, location, locality, page_number, data_pipeline=None, retries=3):
    formatted_keyword = keyword.replace(" ", "+")
    formatted_locality = locality.replace(" ", "+")
    url = f"https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords={formatted_keyword}&location={formatted_locality}&original_referer=&start={page_number*10}"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            response = requests.get(url)
            logger.info(f"Received [{response.status_code}] from: {url}")
            if response.status_code != 200:
                raise Exception(f"Failed request, Status Code {response.status_code}")

            soup = BeautifulSoup(response.text, "html.parser")
            div_cards = soup.find_all("div", class_="base-search-card__info")

            for div_card in div_cards:
                company_name = div_card.find("h4", class_="base-search-card__subtitle").text
                job_title = div_card.find("h3", class_="base-search-card__title").text
                link = div_card.parent.find("a")
                job_link = link.get("href")
                location = div_card.find("span", class_="job-search-card__location").text

                search_data = SearchData(
                    name=company_name,
                    job_title=job_title,
                    url=job_link,
                    location=location
                )
                data_pipeline.add_data(search_data)

            logger.info(f"Successfully parsed data from: {url}")
            success = True

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")


def start_scrape(keyword, pages, location, locality, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_search_results,
            [keyword] * pages,
            [location] * pages,
            [locality] * pages,
            range(pages),
            [data_pipeline] * pages,
            [retries] * pages
        )


if __name__ == "__main__":
    MAX_RETRIES = 3
    MAX_THREADS = 5
    PAGES = 3
    LOCATION = "us"
    LOCALITY = "United States"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["software engineer"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")
        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, PAGES, LOCATION, LOCALITY, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")
```
Next, we'll add proxy support. Our proxy function needs to take in a url and a location. Along with these, the function will handle some set parameters and spit out a ScrapeOps proxied URL. Take a look at get_scrapeops_url().

```python
def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location,
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url
```
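Assuming a valid API key in config.json, wrapping a target URL looks roughly like this (the query-string order follows the payload dict):

```python
target_url = "https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=software+engineer"
print(get_scrapeops_url(target_url, location="us"))
# https://proxy.scrapeops.io/v1/?api_key=<your-key>&url=https%3A%2F%2Fwww.linkedin.com%2F...&country=us
```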
It converts any regular URL into a ScrapeOps proxied URL by wrapping these parameters into a payload:

- "api_key": our ScrapeOps API key.
- "url": the url we want to scrape.
- "country": the country we want to appear in.

Here is our full code up to this point:

```python
import os
import csv
import requests
import json
import logging
import time
from urllib.parse import urlencode
from bs4 import BeautifulSoup
import concurrent.futures
from dataclasses import dataclass, field, fields, asdict

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]


def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location,
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url


## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@dataclass
class SearchData:
    name: str = ""
    job_title: str = ""
    url: str = ""
    location: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())


class DataPipeline:
    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)
            if not file_exists:
                writer.writeheader()
            for item in data_to_save:
                writer.writerow(asdict(item))

        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if self.is_duplicate(scraped_data) == False:
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)
        if len(self.storage_queue) > 0:
            self.save_to_csv()


def scrape_search_results(keyword, location, locality, page_number, data_pipeline=None, retries=3):
    formatted_keyword = keyword.replace(" ", "+")
    formatted_locality = locality.replace(" ", "+")
    url = f"https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords={formatted_keyword}&location={formatted_locality}&original_referer=&start={page_number*10}"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            scrapeops_proxy_url = get_scrapeops_url(url, location=location)
            response = requests.get(scrapeops_proxy_url)
            logger.info(f"Received [{response.status_code}] from: {url}")
            if response.status_code != 200:
                raise Exception(f"Failed request, Status Code {response.status_code}")

            soup = BeautifulSoup(response.text, "html.parser")
            div_cards = soup.find_all("div", class_="base-search-card__info")

            for div_card in div_cards:
                company_name = div_card.find("h4", class_="base-search-card__subtitle").text
                job_title = div_card.find("h3", class_="base-search-card__title").text
                link = div_card.parent.find("a")
                job_link = link.get("href")
                location = div_card.find("span", class_="job-search-card__location").text

                search_data = SearchData(
                    name=company_name,
                    job_title=job_title,
                    url=job_link,
                    location=location
                )
                data_pipeline.add_data(search_data)

            logger.info(f"Successfully parsed data from: {url}")
            success = True

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")


def start_scrape(keyword, pages, location, locality, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_search_results,
            [keyword] * pages,
            [location] * pages,
            [locality] * pages,
            range(pages),
            [data_pipeline] * pages,
            [retries] * pages
        )


if __name__ == "__main__":
    MAX_RETRIES = 3
    MAX_THREADS = 5
    PAGES = 3
    LOCATION = "us"
    LOCALITY = "United States"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["software engineer"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")
        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, PAGES, LOCATION, LOCALITY, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")
```
You can change any of these constants in main to tweak your crawl:

- MAX_RETRIES: Defines the maximum number of times the script will attempt to retrieve a webpage if the initial request fails (e.g., due to network issues or rate limiting).
- MAX_THREADS: Sets the maximum number of threads that the script will use concurrently during scraping.
- PAGES: The number of pages of job listings to scrape for each keyword.
- LOCATION: The country code or identifier for the region from which job listings should be scraped (e.g., "us" for the United States).
- LOCALITY: The textual representation of the location where the jobs are being scraped (e.g., "United States").
- keyword_list: A list of keywords representing job titles or roles to search for on LinkedIn (e.g., ["software engineer"]).

```python
if __name__ == "__main__":
    MAX_RETRIES = 3
    MAX_THREADS = 5
    PAGES = 3
    LOCATION = "us"
    LOCALITY = "United States"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["software engineer"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")
        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, PAGES, LOCATION, LOCALITY, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")
```
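For example, to crawl five pages of data engineer listings from the UK instead (hypothetical values chosen only for illustration), you would only change these constants:

```python
MAX_RETRIES = 3
MAX_THREADS = 5
PAGES = 5
LOCATION = "uk"
LOCALITY = "United Kingdom"
keyword_list = ["data engineer"]
```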
Next, we write another parsing function, process_posting(). Like before, pay close attention to our parsing logic.

```python
def process_posting(row, location, retries=3):
    url = row["url"]
    tries = 0
    success = False

    while tries <= retries and not success:
        response = requests.get(url)
        try:
            if response.status_code != 200:
                raise Exception(f"Failed Request, status code: {response.status_code}")
            logger.info(f"Status: {response.status_code}")

            soup = BeautifulSoup(response.text, "html.parser")
            job_criteria = soup.find_all("li", class_="description__job-criteria-item")
            seniority = job_criteria[0].text.replace("Seniority level", "")
            position_type = job_criteria[1].text.replace("Employment type", "")
            job_function = job_criteria[2].text.replace("Job function", "")
            industry = job_criteria[3].text.replace("Industries", "")

            job_data = {
                "name": row["name"],
                "seniority": seniority,
                "position_type": position_type,
                "job_function": job_function,
                "industry": industry
            }
            print(job_data)
            success = True

        except Exception as e:
            logger.error(f"Exception thrown: {e}")
            logger.warning(f"Failed to process page: {row['url']}, retries left: {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")
    else:
        logger.info(f"Successfully parsed: {row['url']}")
```
soup.find_all("li", class_="description__job-criteria-item")
finds all of our criteria pieces.job_criteria[0]
: senority leveljob_criteria[1]
: position typejob_criteria[2]
: job functionjob_criteria[3]
: industryfor
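The index-based approach above assumes LinkedIn always lists these four criteria in the same order. As a more defensive alternative (a sketch, not part of the tutorial code), you could match each li by its label text instead, reusing the class name and labels from the parser above:

```python
CRITERIA_LABELS = ["Seniority level", "Employment type", "Job function", "Industries"]

def parse_job_criteria(soup):
    criteria = {}
    for item in soup.find_all("li", class_="description__job-criteria-item"):
        text = item.text.strip()
        for label in CRITERIA_LABELS:
            if label in text:
                criteria[label] = text.replace(label, "").strip()
    return criteria
```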
We also need a for loop to call process_posting() on each row from the file. Here is our first iteration of process_results(). Later on, we'll rewrite it and add multithreading support.

```python
def process_results(csv_file, location, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        for row in reader:
            process_posting(row, location, retries=retries)
```
Here is our full code up to this point:

```python
import os
import csv
import requests
import json
import logging
import time
from urllib.parse import urlencode
from bs4 import BeautifulSoup
import concurrent.futures
from dataclasses import dataclass, field, fields, asdict

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]


def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location,
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url


## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@dataclass
class SearchData:
    name: str = ""
    job_title: str = ""
    url: str = ""
    location: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())


class DataPipeline:
    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)
            if not file_exists:
                writer.writeheader()
            for item in data_to_save:
                writer.writerow(asdict(item))

        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if self.is_duplicate(scraped_data) == False:
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)
        if len(self.storage_queue) > 0:
            self.save_to_csv()


def scrape_search_results(keyword, location, locality, page_number, data_pipeline=None, retries=3):
    formatted_keyword = keyword.replace(" ", "+")
    formatted_locality = locality.replace(" ", "+")
    url = f"https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords={formatted_keyword}&location={formatted_locality}&original_referer=&start={page_number*10}"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            scrapeops_proxy_url = get_scrapeops_url(url, location=location)
            response = requests.get(scrapeops_proxy_url)
            logger.info(f"Received [{response.status_code}] from: {url}")
            if response.status_code != 200:
                raise Exception(f"Failed request, Status Code {response.status_code}")

            soup = BeautifulSoup(response.text, "html.parser")
            div_cards = soup.find_all("div", class_="base-search-card__info")

            for div_card in div_cards:
                company_name = div_card.find("h4", class_="base-search-card__subtitle").text
                job_title = div_card.find("h3", class_="base-search-card__title").text
                link = div_card.parent.find("a")
                job_link = link.get("href")
                location = div_card.find("span", class_="job-search-card__location").text

                search_data = SearchData(
                    name=company_name,
                    job_title=job_title,
                    url=job_link,
                    location=location
                )
                data_pipeline.add_data(search_data)

            logger.info(f"Successfully parsed data from: {url}")
            success = True

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")


def start_scrape(keyword, pages, location, locality, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_search_results,
            [keyword] * pages,
            [location] * pages,
            [locality] * pages,
            range(pages),
            [data_pipeline] * pages,
            [retries] * pages
        )


def process_posting(row, location, retries=3):
    url = row["url"]
    tries = 0
    success = False

    while tries <= retries and not success:
        response = requests.get(url)
        try:
            if response.status_code != 200:
                raise Exception(f"Failed Request, status code: {response.status_code}")
            logger.info(f"Status: {response.status_code}")

            soup = BeautifulSoup(response.text, "html.parser")
            job_criteria = soup.find_all("li", class_="description__job-criteria-item")
            seniority = job_criteria[0].text.replace("Seniority level", "")
            position_type = job_criteria[1].text.replace("Employment type", "")
            job_function = job_criteria[2].text.replace("Job function", "")
            industry = job_criteria[3].text.replace("Industries", "")

            job_data = {
                "name": row["name"],
                "seniority": seniority,
                "position_type": position_type,
                "job_function": job_function,
                "industry": industry
            }
            print(job_data)
            success = True

        except Exception as e:
            logger.error(f"Exception thrown: {e}")
            logger.warning(f"Failed to process page: {row['url']}, retries left: {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")
    else:
        logger.info(f"Successfully parsed: {row['url']}")


def process_results(csv_file, location, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        for row in reader:
            process_posting(row, location, retries=retries)


if __name__ == "__main__":
    MAX_RETRIES = 3
    MAX_THREADS = 5
    PAGES = 3
    LOCATION = "us"
    LOCALITY = "United States"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["software engineer"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")
        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, PAGES, LOCATION, LOCALITY, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")

    for file in aggregate_files:
        process_results(file, LOCATION, retries=MAX_RETRIES)
```
We already have our DataPipeline. Storing our data will be very easy at this point. We just need another dataclass. Take a look below at JobData. Just like our SearchData from earlier, we use it to represent the data we scraped from the page.

```python
@dataclass
class JobData:
    name: str = ""
    seniority: str = ""
    position_type: str = ""
    job_function: str = ""
    industry: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())
```
Inside our parsing function, we now open a new DataPipeline. Then, instead of printing our parsed data, we create a JobData object out of it and then pass our JobData into the pipeline.

```python
import os
import csv
import requests
import json
import logging
import time
from urllib.parse import urlencode
from bs4 import BeautifulSoup
import concurrent.futures
from dataclasses import dataclass, field, fields, asdict

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]


def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location,
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url


## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@dataclass
class SearchData:
    name: str = ""
    job_title: str = ""
    url: str = ""
    location: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())


@dataclass
class JobData:
    name: str = ""
    seniority: str = ""
    position_type: str = ""
    job_function: str = ""
    industry: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())


class DataPipeline:
    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)
            if not file_exists:
                writer.writeheader()
            for item in data_to_save:
                writer.writerow(asdict(item))

        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if self.is_duplicate(scraped_data) == False:
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)
        if len(self.storage_queue) > 0:
            self.save_to_csv()


def scrape_search_results(keyword, location, locality, page_number, data_pipeline=None, retries=3):
    formatted_keyword = keyword.replace(" ", "+")
    formatted_locality = locality.replace(" ", "+")
    url = f"https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords={formatted_keyword}&location={formatted_locality}&original_referer=&start={page_number*10}"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            scrapeops_proxy_url = get_scrapeops_url(url, location=location)
            response = requests.get(scrapeops_proxy_url)
            logger.info(f"Received [{response.status_code}] from: {url}")
            if response.status_code != 200:
                raise Exception(f"Failed request, Status Code {response.status_code}")

            soup = BeautifulSoup(response.text, "html.parser")
            div_cards = soup.find_all("div", class_="base-search-card__info")

            for div_card in div_cards:
                company_name = div_card.find("h4", class_="base-search-card__subtitle").text
                job_title = div_card.find("h3", class_="base-search-card__title").text
                link = div_card.parent.find("a")
                job_link = link.get("href")
                location = div_card.find("span", class_="job-search-card__location").text

                search_data = SearchData(
                    name=company_name,
                    job_title=job_title,
                    url=job_link,
                    location=location
                )
                data_pipeline.add_data(search_data)

            logger.info(f"Successfully parsed data from: {url}")
            success = True

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")


def start_scrape(keyword, pages, location, locality, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_search_results,
            [keyword] * pages,
            [location] * pages,
            [locality] * pages,
            range(pages),
            [data_pipeline] * pages,
            [retries] * pages
        )


def process_posting(row, location, retries=3):
    url = row["url"]
    tries = 0
    success = False

    while tries <= retries and not success:
        response = requests.get(url)
        try:
            if response.status_code != 200:
                raise Exception(f"Failed Request, status code: {response.status_code}")
            logger.info(f"Status: {response.status_code}")

            soup = BeautifulSoup(response.text, "html.parser")
            job_pipeline = DataPipeline(csv_filename=f"{row['name'].replace(' ', '-')}.csv")
            job_criteria = soup.find_all("li", class_="description__job-criteria-item")
            seniority = job_criteria[0].text.replace("Seniority level", "")
            position_type = job_criteria[1].text.replace("Employment type", "")
            job_function = job_criteria[2].text.replace("Job function", "")
            industry = job_criteria[3].text.replace("Industries", "")

            job_data = JobData(
                name=row["name"],
                seniority=seniority,
                position_type=position_type,
                job_function=job_function,
                industry=industry
            )
            job_pipeline.add_data(job_data)
            job_pipeline.close_pipeline()
            success = True

        except Exception as e:
            logger.error(f"Exception thrown: {e}")
            logger.warning(f"Failed to process page: {row['url']}, retries left: {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")
    else:
        logger.info(f"Successfully parsed: {row['url']}")


def process_results(csv_file, location, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        for row in reader:
            process_posting(row, location, retries=retries)


if __name__ == "__main__":
    MAX_RETRIES = 3
    MAX_THREADS = 5
    PAGES = 3
    LOCATION = "us"
    LOCALITY = "United States"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["software engineer"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")
        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, PAGES, LOCATION, LOCALITY, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")

    for file in aggregate_files:
        process_results(file, LOCATION, retries=MAX_RETRIES)
```
- JobData holds the data we pull from the page.
- DataPipeline takes a JobData object and pipes it to a CSV file.

To finish up, we'll once again use ThreadPoolExecutor like we did earlier. Take a look at our refactored version of process_results().

```python
def process_results(csv_file, location, max_threads=5, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
            executor.map(
                process_posting,
                reader,
                [location] * len(reader),
                [retries] * len(reader)
            )
```
Here are the arguments to executor.map():

- process_posting: the function we want to call on multiple threads.
- All of the other arguments to process_posting get passed in as arrays.

We also want these requests to go through the proxy, so the only change inside process_posting is this line:

```python
response = requests.get(get_scrapeops_url(url, location=location))
```
Here is our full code with everything put together:

```python
import os
import csv
import requests
import json
import logging
import time
from urllib.parse import urlencode
from bs4 import BeautifulSoup
import concurrent.futures
from dataclasses import dataclass, field, fields, asdict

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]


def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location,
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url


## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@dataclass
class SearchData:
    name: str = ""
    job_title: str = ""
    url: str = ""
    location: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())


@dataclass
class JobData:
    name: str = ""
    seniority: str = ""
    position_type: str = ""
    job_function: str = ""
    industry: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())


class DataPipeline:
    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)
            if not file_exists:
                writer.writeheader()
            for item in data_to_save:
                writer.writerow(asdict(item))

        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if self.is_duplicate(scraped_data) == False:
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)
        if len(self.storage_queue) > 0:
            self.save_to_csv()


def scrape_search_results(keyword, location, locality, page_number, data_pipeline=None, retries=3):
    formatted_keyword = keyword.replace(" ", "+")
    formatted_locality = locality.replace(" ", "+")
    url = f"https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords={formatted_keyword}&location={formatted_locality}&original_referer=&start={page_number*10}"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            scrapeops_proxy_url = get_scrapeops_url(url, location=location)
            response = requests.get(scrapeops_proxy_url)
            logger.info(f"Received [{response.status_code}] from: {url}")
            if response.status_code != 200:
                raise Exception(f"Failed request, Status Code {response.status_code}")

            soup = BeautifulSoup(response.text, "html.parser")
            div_cards = soup.find_all("div", class_="base-search-card__info")

            for div_card in div_cards:
                company_name = div_card.find("h4", class_="base-search-card__subtitle").text
                job_title = div_card.find("h3", class_="base-search-card__title").text
                link = div_card.parent.find("a")
                job_link = link.get("href")
                location = div_card.find("span", class_="job-search-card__location").text

                search_data = SearchData(
                    name=company_name,
                    job_title=job_title,
                    url=job_link,
                    location=location
                )
                data_pipeline.add_data(search_data)

            logger.info(f"Successfully parsed data from: {url}")
            success = True

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")


def start_scrape(keyword, pages, location, locality, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_search_results,
            [keyword] * pages,
            [location] * pages,
            [locality] * pages,
            range(pages),
            [data_pipeline] * pages,
            [retries] * pages
        )


def process_posting(row, location, retries=3):
    url = row["url"]
    tries = 0
    success = False

    while tries <= retries and not success:
        response = requests.get(get_scrapeops_url(url, location=location))
        try:
            if response.status_code != 200:
                raise Exception(f"Failed Request, status code: {response.status_code}")
            logger.info(f"Status: {response.status_code}")

            soup = BeautifulSoup(response.text, "html.parser")
            job_pipeline = DataPipeline(csv_filename=f"{row['name'].replace(' ', '-')}.csv")
            job_criteria = soup.find_all("li", class_="description__job-criteria-item")
            seniority = job_criteria[0].text.replace("Seniority level", "")
            position_type = job_criteria[1].text.replace("Employment type", "")
            job_function = job_criteria[2].text.replace("Job function", "")
            industry = job_criteria[3].text.replace("Industries", "")

            job_data = JobData(
                name=row["name"],
                seniority=seniority,
                position_type=position_type,
                job_function=job_function,
                industry=industry
            )
            job_pipeline.add_data(job_data)
            job_pipeline.close_pipeline()
            success = True

        except Exception as e:
            logger.error(f"Exception thrown: {e}")
            logger.warning(f"Failed to process page: {row['url']}, retries left: {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")
    else:
        logger.info(f"Successfully parsed: {row['url']}")


def process_results(csv_file, location, max_threads=5, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
            executor.map(
                process_posting,
                reader,
                [location] * len(reader),
                [retries] * len(reader)
            )


if __name__ == "__main__":
    MAX_RETRIES = 3
    MAX_THREADS = 5
    PAGES = 3
    LOCATION = "us"
    LOCALITY = "United States"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["software engineer"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")
        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, PAGES, LOCATION, LOCALITY, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")

    for file in aggregate_files:
        process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
```
If you'd like to tweak your results, change any of the constants inside main; you can see it again below.

```python
if __name__ == "__main__":

    MAX_RETRIES = 3
    MAX_THREADS = 5
    PAGES = 3
    LOCATION = "us"
    LOCALITY = "United States"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["software engineer"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")

        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, PAGES, LOCATION, LOCALITY, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")

    for file in aggregate_files:
        process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
```
Whenever you scrape a site, you are subject to its terms of service and its robots.txt. Their terms are available here and their robots.txt is here. As stated at the top of their robots.txt, crawling LinkedIn is explicitly prohibited. By scraping LinkedIn, you can have your account suspended, banned, or even deleted. Always ensure compliance with LinkedIn's policies and consider using official APIs or getting explicit permission for large-scale data extraction.

Then check out ScrapeOps, the complete toolkit for web scraping.
To run the code below, you'll need a ScrapeOps API key saved in a config.json file:

```json
{"api_key": "your-super-secret-api-key"}
```

Once that file sits next to your script, python name_of_your_script.py
is the command you'll use to run the scraper.import osimport csvimport jsonfrom selenium import webdriverfrom selenium.webdriver.common.by import Byimport loggingfrom urllib.parse import urlencodeimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] options = webdriver.ChromeOptions()options.add_argument("--headless") def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" job_title: str = "" url: str = "" location: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) @dataclassclass JobData: name: str = "" seniority: str = "" position_type: str = "" job_function: str = "" industry: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, locality, page_number, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") formatted_locality = locality.replace(" ", "+") url = f"https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords={formatted_keyword}&location={formatted_locality}&original_referer=&start={page_number*10}" tries = 0 success = False while tries <= retries and not success: driver = webdriver.Chrome(options=options) try: scrapeops_proxy_url = get_scrapeops_url(url, location=location) driver.get(scrapeops_proxy_url) div_cards = driver.find_elements(By.CSS_SELECTOR, "div[class='base-search-card__info']") if not div_cards: driver.save_screenshot("debug.png") raise Exception("Page did not load correctly, please check debug.png") for div_card in div_cards: company_name = div_card.find_element(By.CSS_SELECTOR, "h4[class='base-search-card__subtitle']").text print("company name", company_name) job_title = div_card.find_element(By.CSS_SELECTOR, "h3[class='base-search-card__title']").text parent = div_card.find_element(By.XPATH, "..") link = parent.find_element(By.CSS_SELECTOR, "a") job_link = link.get_attribute("href") location = div_card.find_element(By.CSS_SELECTOR, "span[class='job-search-card__location']").text search_data = SearchData( name=company_name, job_title=job_title, url=job_link, location=location ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, pages, location, locality, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, [keyword] * pages, [location] * pages, [locality] * pages, range(pages), [data_pipeline] * pages, [retries] * pages ) def process_posting(row, location, retries=3): url = row["url"] tries = 0 success = False while tries <= retries and not success: driver = webdriver.Chrome(options=options) try: driver.get(get_scrapeops_url(url, location=location)) job_pipeline = DataPipeline(csv_filename=f"{row['name'].replace(' ', '-')}.csv") job_criteria = driver.find_elements(By.CSS_SELECTOR, "li[class='description__job-criteria-item']") seniority = job_criteria[0].text.replace("Seniority level", "") position_type = job_criteria[1].text.replace("Employment type", "") job_function = job_criteria[2].text.replace("Job function", "") industry = job_criteria[3].text.replace("Industries", "") job_data = JobData( name=row["name"], seniority=seniority, position_type=position_type, job_function=job_function, industry=industry ) job_pipeline.add_data(job_data) job_pipeline.close_pipeline() success = True except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {row['url']}, retries left: {retries-tries}") tries 
+= 1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['url']}") def process_results(csv_file, location, max_threads=5, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( process_posting, reader, [location] * len(reader), [retries] * len(reader) ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 PAGES = 1 LOCATION = "us" LOCALITY = "United States" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["software engineer"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, PAGES, LOCATION, LOCALITY, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") for file in aggregate_files: process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
You can change any of the following constants inside main to fine-tune your results:

- MAX_RETRIES: Defines the maximum number of times the script will attempt to retrieve a webpage if the initial request fails (e.g., due to network issues or rate limiting).
- MAX_THREADS: Sets the maximum number of threads that the script will use concurrently during scraping.
- PAGES: The number of pages of job listings to scrape for each keyword.
- LOCATION: The country code or identifier for the region from which job listings should be scraped (e.g., "us" for the United States).
- LOCALITY: The textual representation of the location where the jobs are being scraped (e.g., "United States").
- keyword_list: A list of keywords representing job titles or roles to search for on LinkedIn (e.g., ["software engineer"]).

The crawl uses ThreadPoolExecutor to add support for multithreading and therefore concurrency, and the posting scraper uses ThreadPoolExecutor to scrape posting data concurrently.
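For instance, a hypothetical UK-focused run might tweak those constants like this; the specific values below are my own illustration, not from the article's run:

```python
MAX_RETRIES = 3
MAX_THREADS = 4              # modest thread pool
PAGES = 5                    # results pages per keyword (10 jobs per page)
LOCATION = "uk"              # proxy country code passed to ScrapeOps
LOCALITY = "United Kingdom"  # location text inserted into the search URL

# Illustrative keyword list; any job titles work here
keyword_list = ["software engineer", "data engineer"]
```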
Here is the URL we use to fetch search results from LinkedIn:

```
https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords={formatted_keyword}&location={formatted_locality}&original_referer=
```

For the keyword "software engineer", it looks like this:

```
https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=software+engineer&location={formatted_locality}&original_referer=
```

If you look at the base of this URL (https://www.linkedin.com/jobs-guest/jobs/api), you might notice something interesting. We're actually making API requests, hence the endpoint, /api.
Something even more interesting: this API endpoint doesn't give us JSON or XML, it sends back straight HTML. In years of web development and scraping, LinkedIn is the only place I've ever seen something like this. The screenshot below gives us a barebones HTML page without any styling whatsoever, but it is in fact a webpage.

On the search results page, our data is nested inside div elements with the class base-search-card__info. On individual job postings, the criteria we want come as li elements with a class of description__job-criteria-item.

Each result card is a div. Its class name is base-search-card__info. To extract this data, we need to find each div that matches this class.

On the posting page, there is also a list of li elements we want to scrape. Each li element is given the classname, description__job-criteria-item. So for these, we want to pull all li elements with this class.

To paginate our results, we append &start={page_number*10} to the URL. Our full URL for page 1 of the Software Engineer search is built with the f-string below. We use page_number*10 because we begin counting at 0 and each request yields 10 results.

```python
f"https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords={formatted_keyword}&location={formatted_locality}&original_referer=&start={page_number*10}"
```
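As a quick illustration of the offset math (not part of the scraper itself), here's how page numbers map onto the start parameter:

```python
# Each page of results holds 10 listings and page numbering starts at 0
for page_number in range(3):
    print(f"page_number={page_number} -> &start={page_number * 10}")

# Output:
# page_number=0 -> &start=0
# page_number=1 -> &start=10
# page_number=2 -> &start=20
```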
When we use the ScrapeOps Proxy API, we can control our geolocation with the country parameter. If we want to appear in the US, we pass "country": "us" into the API. If we want to appear in the UK, we pass "country": "uk".
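As a small sketch of how that plays out when the proxy URL is built (the target URL below is just a placeholder):

```python
from urllib.parse import urlencode

API_KEY = "your-super-secret-api-key"     # placeholder key
target = "https://www.linkedin.com/jobs"  # placeholder URL

us_payload = {"api_key": API_KEY, "url": target, "country": "us"}
uk_payload = {"api_key": API_KEY, "url": target, "country": "uk"}

# Only the country field changes between the two requests
print("https://proxy.scrapeops.io/v1/?" + urlencode(us_payload))
print("https://proxy.scrapeops.io/v1/?" + urlencode(uk_payload))
```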
To follow along, create a new project folder, set up a virtual environment, and install your dependencies:

```
mkdir linkedin-jobs-scraper
cd linkedin-jobs-scraper
python -m venv venv
source venv/bin/activate
pip install selenium
```
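One note on drivers: Selenium 4.6 and newer ships with Selenium Manager, which downloads a matching chromedriver for you; on older versions you'd need to install chromedriver yourself. If you'd like to pin a recent version when installing (purely an example):

```
pip install "selenium>=4.6"
```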
Our first iteration is a simple script built around one parsing function, scrape_search_results()
.import osimport csvimport jsonfrom selenium import webdriverfrom selenium.webdriver.common.by import Byimport loggingfrom urllib.parse import urlencodeimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] options = webdriver.ChromeOptions()options.add_argument("--headless") ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) def scrape_search_results(keyword, location, locality, retries=3): formatted_keyword = keyword.replace(" ", "+") formatted_locality = locality.replace(" ", "+") url = f"https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords={formatted_keyword}&location={formatted_locality}" tries = 0 success = False while tries <= retries and not success: driver = webdriver.Chrome(options=options) try: driver.get(url) div_cards = driver.find_elements(By.CSS_SELECTOR, "div[class='base-search-card__info']") if not div_cards: driver.save_screenshot("debug.png") raise Exception("Page did not load correctly, please check debug.png") for div_card in div_cards: company_name = div_card.find_element(By.CSS_SELECTOR, "h4[class='base-search-card__subtitle']").text print("company name", company_name) job_title = div_card.find_element(By.CSS_SELECTOR, "h3[class='base-search-card__title']").text parent = div_card.find_element(By.XPATH, "..") link = parent.find_element(By.CSS_SELECTOR, "a") job_link = link.get_attribute("href") location = div_card.find_element(By.CSS_SELECTOR, "span[class='job-search-card__location']").text search_data = { "name": company_name, "job_title": job_title, "url": job_link, "location": location } print(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 PAGES = 1 LOCATION = "us" LOCALITY = "United States" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["software engineer"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") scrape_search_results(keyword, LOCATION, LOCALITY, retries=MAX_RETRIES) logger.info(f"Crawl complete.")
Here's how the parsing logic works:

- options = webdriver.ChromeOptions() creates a custom set of Chrome options. Then we use options.add_argument("--headless") to set our browser to headless mode.
- driver = webdriver.Chrome(options=options) launches Selenium with our custom options.
- We use driver.find_elements(By.CSS_SELECTOR, "div[class='base-search-card__info']") to find all of our base result cards.
- company_name = div_card.find_element(By.CSS_SELECTOR, "h4[class='base-search-card__subtitle']").text finds our company_name.
- The job title is held in an h3, so we use div_card.find_element(By.CSS_SELECTOR, "h3[class='base-search-card__title']").text to find it.
- To get the link to the job posting, we first find the parent of the div_card: div_card.find_element(By.XPATH, ".."). We use XPATH and pass in ".." to find the parent.
- We then find the link element with parent.find_element(By.CSS_SELECTOR, "a").
- We pull the href from the link element with link.get_attribute("href").
- div_card.find_element(By.CSS_SELECTOR, "span[class='job-search-card__location']").text gets the job location from the card.

To paginate our results, we add start={page_number*10}
to the end of our URL. We need an additional function to scrape multiple pages. We'll call it start_scrape()
. Our fully paginated URLs are laid out in the snippet you see below.

```python
url = f"https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords={formatted_keyword}&location={formatted_locality}&original_referer=&start={page_number*10}"
```
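As a quick illustration of how the placeholders in that f-string get filled in (the values here are just examples):

```python
keyword = "software engineer"
locality = "United States"
page_number = 0

formatted_keyword = keyword.replace(" ", "+")    # "software+engineer"
formatted_locality = locality.replace(" ", "+")  # "United+States"

url = (
    "https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search"
    f"?keywords={formatted_keyword}&location={formatted_locality}"
    f"&original_referer=&start={page_number * 10}"
)
print(url)
```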
start_scrape() is in our next snippet. At the moment, it's just a simple for loop that parses pages using iteration. Later on, we'll make some improvements to it.

```python
def start_scrape(keyword, pages, location, locality, retries=3):
    # pages is an integer count, so iterate over range(pages)
    for page in range(pages):
        scrape_search_results(keyword, location, locality, page, retries=retries)
```
import osimport csvimport jsonfrom selenium import webdriverfrom selenium.webdriver.common.by import Byimport loggingfrom urllib.parse import urlencodeimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] options = webdriver.ChromeOptions()options.add_argument("--headless") ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) def scrape_search_results(keyword, location, locality, page_number, retries=3): formatted_keyword = keyword.replace(" ", "+") formatted_locality = locality.replace(" ", "+") url = f"https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords={formatted_keyword}&location={formatted_locality}&original_referer=&start={page_number*10}" tries = 0 success = False while tries <= retries and not success: driver = webdriver.Chrome(options=options) try: driver.get(url) div_cards = driver.find_elements(By.CSS_SELECTOR, "div[class='base-search-card__info']") if not div_cards: driver.save_screenshot("debug.png") raise Exception("Page did not load correctly, please check debug.png") for div_card in div_cards: company_name = div_card.find_element(By.CSS_SELECTOR, "h4[class='base-search-card__subtitle']").text print("company name", company_name) job_title = div_card.find_element(By.CSS_SELECTOR, "h3[class='base-search-card__title']").text parent = div_card.find_element(By.XPATH, "..") link = parent.find_element(By.CSS_SELECTOR, "a") job_link = link.get_attribute("href") location = div_card.find_element(By.CSS_SELECTOR, "span[class='job-search-card__location']").text search_data = { "name": company_name, "job_title": job_title, "url": job_link, "location": location } print(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, pages, location, locality, retries=3): for page in pages: scrape_search_results(keyword, location, locality, page, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 PAGES = 1 LOCATION = "us" LOCALITY = "United States" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["software engineer"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") start_scrape(keyword, PAGES, LOCATION, LOCALITY, retries=MAX_RETRIES) logger.info(f"Crawl complete.")
In this snippet, start={page_number*10} controls our pagination. With start_scrape(), we can parse a list of pages.

To store the data we extract, we need two more pieces. The first is a dataclass called SearchData. The second one is our DataPipeline.

SearchData simply needs to represent individual search items. DataPipeline needs to open a pipe to a CSV file and store SearchData objects inside our CSV.

Here is our SearchData. It holds the name, job_title, url and location that we find during the parse.

```python
@dataclass
class SearchData:
    name: str = ""
    job_title: str = ""
    url: str = ""
    location: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())
```
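To see what that __post_init__ cleanup actually does, here's a tiny standalone check (assuming SearchData is defined as above; the values are made up):

```python
example = SearchData(name="  LinkedIn  ", job_title="", url="https://example.com/job/1", location="")
print(example.name)       # "LinkedIn" -- leading/trailing whitespace stripped
print(example.job_title)  # "No job_title" -- empty strings get default text
print(example.location)   # "No location"
```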
Once we've got our SearchData, we pass it into the DataPipeline you see below. DataPipeline first checks to see if our CSV file exists. If the file doesn't exist yet, it writes a header row; otherwise it appends to the existing file. It also filters out duplicate items by their name attribute.

```python
class DataPipeline:

    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)

            if not file_exists:
                writer.writeheader()

            for item in data_to_save:
                writer.writerow(asdict(item))

        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if self.is_duplicate(scraped_data) == False:
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)  # note: this requires `import time` at the top of the script
        if len(self.storage_queue) > 0:
            self.save_to_csv()
```
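Here's a minimal usage sketch of the pipeline on its own (filenames and values are made up for illustration; it assumes SearchData and DataPipeline are defined as above):

```python
pipeline = DataPipeline(csv_filename="example-jobs.csv", storage_queue_limit=50)

pipeline.add_data(SearchData(name="Acme Corp", job_title="Software Engineer",
                             url="https://example.com/job/1", location="United States"))
# A second item with the same name is treated as a duplicate and dropped
pipeline.add_data(SearchData(name="Acme Corp", job_title="Another Listing",
                             url="https://example.com/job/2", location="United States"))

pipeline.close_pipeline()  # flushes whatever is left in the queue out to example-jobs.csv
```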
import osimport csvimport jsonfrom selenium import webdriverfrom selenium.webdriver.common.by import Byimport loggingfrom urllib.parse import urlencodeimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] options = webdriver.ChromeOptions()options.add_argument("--headless") ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" job_title: str = "" url: str = "" location: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, locality, page_number, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") formatted_locality = locality.replace(" ", "+") url = f"https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords={formatted_keyword}&location={formatted_locality}&original_referer=&start={page_number*10}" tries = 0 success = False while tries <= retries and not success: driver = webdriver.Chrome(options=options) try: driver.get(url) div_cards = driver.find_elements(By.CSS_SELECTOR, "div[class='base-search-card__info']") if not div_cards: driver.save_screenshot("debug.png") raise Exception("Page did not load correctly, please check debug.png") for div_card in div_cards: company_name = div_card.find_element(By.CSS_SELECTOR, "h4[class='base-search-card__subtitle']").text print("company name", company_name) job_title = div_card.find_element(By.CSS_SELECTOR, "h3[class='base-search-card__title']").text parent = div_card.find_element(By.XPATH, "..") link = parent.find_element(By.CSS_SELECTOR, "a") job_link = link.get_attribute("href") location = div_card.find_element(By.CSS_SELECTOR, "span[class='job-search-card__location']").text search_data = SearchData( name=company_name, job_title=job_title, url=job_link, location=location ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, pages, location, locality, data_pipeline=None, retries=3): for page in pages: scrape_search_results(keyword, location, locality, page, data_pipeline=data_pipeline, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 PAGES = 1 LOCATION = "us" LOCALITY = "United States" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["software engineer"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, PAGES, LOCATION, LOCALITY, data_pipeline=crawl_pipeline, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.")
We use SearchData to represent individual results from our search results page. DataPipeline is used to store these objects in a safe and efficient way.

Next, we're going to add ThreadPoolExecutor and we're going to remove our for loop from start_scrape(). ThreadPoolExecutor allows us to open a pool with max_threads. If we want to use 4 threads, we pass max_threads=4.

```python
def start_scrape(keyword, pages, location, locality, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_search_results,
            [keyword] * pages,
            [location] * pages,
            [locality] * pages,
            range(pages),
            [data_pipeline] * pages,
            [retries] * pages
        )
```
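If the repeated-list arguments look strange, here's a tiny self-contained illustration of how executor.map() pairs them up; it has nothing to do with LinkedIn, it just shows the mechanics:

```python
import concurrent.futures

def greet(name, greeting):
    return f"{greeting}, {name}!"

names = ["Ada", "Grace", "Linus"]

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    # Each call takes one element from each iterable:
    # greet("Ada", "Hello"), greet("Grace", "Hello"), greet("Linus", "Hello")
    results = list(executor.map(greet, names, ["Hello"] * len(names)))

print(results)
```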
The arguments to executor.map() go as follows:

- scrape_search_results
: the function we want to call on all these available threads.import osimport csvimport jsonfrom selenium import webdriverfrom selenium.webdriver.common.by import Byimport loggingfrom urllib.parse import urlencodeimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] options = webdriver.ChromeOptions()options.add_argument("--headless") ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" job_title: str = "" url: str = "" location: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, locality, page_number, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") formatted_locality = locality.replace(" ", "+") url = f"https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords={formatted_keyword}&location={formatted_locality}&original_referer=&start={page_number*10}" tries = 0 success = False while tries <= retries and not success: driver = webdriver.Chrome(options=options) try: driver.get(url) div_cards = driver.find_elements(By.CSS_SELECTOR, "div[class='base-search-card__info']") if not div_cards: driver.save_screenshot("debug.png") raise Exception("Page did not load correctly, please check debug.png") for div_card in div_cards: company_name = div_card.find_element(By.CSS_SELECTOR, "h4[class='base-search-card__subtitle']").text print("company name", company_name) job_title = div_card.find_element(By.CSS_SELECTOR, "h3[class='base-search-card__title']").text parent = div_card.find_element(By.XPATH, "..") link = parent.find_element(By.CSS_SELECTOR, "a") job_link = link.get_attribute("href") location = div_card.find_element(By.CSS_SELECTOR, "span[class='job-search-card__location']").text search_data = SearchData( name=company_name, job_title=job_title, url=job_link, location=location ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, pages, location, locality, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, [keyword] * pages, [location] * pages, [locality] * pages, range(pages), [data_pipeline] * pages, [retries] * pages ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 PAGES = 1 LOCATION = "us" LOCALITY = "United States" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["software engineer"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, PAGES, LOCATION, LOCALITY, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.")
To route our requests through the ScrapeOps Proxy API, we write a simple function, get_scrapeops_url().

```python
def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location,
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url
```
This function takes any URL and converts it into a proxied one. Everything gets wrapped into a payload:

- "api_key": our ScrapeOps API key.
- "url": the url we want to scrape.
- "country"
: the country we want to appear in.import osimport csvimport jsonfrom selenium import webdriverfrom selenium.webdriver.common.by import Byimport loggingfrom urllib.parse import urlencodeimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] options = webdriver.ChromeOptions()options.add_argument("--headless") def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" job_title: str = "" url: str = "" location: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, locality, page_number, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") formatted_locality = locality.replace(" ", "+") url = f"https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords={formatted_keyword}&location={formatted_locality}&original_referer=&start={page_number*10}" tries = 0 success = False while tries <= retries and not success: driver = webdriver.Chrome(options=options) try: scrapeops_proxy_url = get_scrapeops_url(url, location=location) driver.get(scrapeops_proxy_url) div_cards = driver.find_elements(By.CSS_SELECTOR, "div[class='base-search-card__info']") if not div_cards: driver.save_screenshot("debug.png") raise Exception("Page did not load correctly, please check debug.png") for div_card in div_cards: company_name = div_card.find_element(By.CSS_SELECTOR, "h4[class='base-search-card__subtitle']").text print("company name", company_name) job_title = div_card.find_element(By.CSS_SELECTOR, "h3[class='base-search-card__title']").text parent = div_card.find_element(By.XPATH, "..") link = parent.find_element(By.CSS_SELECTOR, "a") job_link = link.get_attribute("href") location = div_card.find_element(By.CSS_SELECTOR, "span[class='job-search-card__location']").text search_data = SearchData( name=company_name, job_title=job_title, url=job_link, location=location ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, pages, location, locality, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, [keyword] * pages, [location] * pages, [locality] * pages, range(pages), [data_pipeline] * pages, [retries] * pages ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 PAGES = 1 LOCATION = "us" LOCALITY = "United States" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["software engineer"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, PAGES, LOCATION, LOCALITY, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.")
Feel free to change any of the following constants to tweak your results:

- MAX_RETRIES
- MAX_THREADS
- PAGES
- LOCATION
- LOCALITY
- keyword_list
```python
if __name__ == "__main__":

    MAX_RETRIES = 3
    MAX_THREADS = 5
    PAGES = 3
    LOCATION = "us"
    LOCALITY = "United States"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["software engineer"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")

        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, PAGES, LOCATION, LOCALITY, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")

    for file in aggregate_files:
        process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
```
Now we'll write a parsing function for individual job postings, process_posting(). Like before, pay close attention to our parsing logic.

```python
def process_posting(row, location, retries=3):
    url = row["url"]
    tries = 0
    success = False

    while tries <= retries and not success:
        driver = webdriver.Chrome(options=options)
        try:
            # driver.get() only takes a URL; location comes into play later when we add the proxy
            driver.get(url)

            job_criteria = driver.find_elements(By.CSS_SELECTOR, "li[class='description__job-criteria-item']")
            seniority = job_criteria[0].text.replace("Seniority level", "")
            position_type = job_criteria[1].text.replace("Employment type", "")
            job_function = job_criteria[2].text.replace("Job function", "")
            industry = job_criteria[3].text.replace("Industries", "")

            job_data = {
                "name": row["name"],
                "seniority": seniority,
                "position_type": position_type,
                "job_function": job_function,
                "industry": industry
            }
            print(job_data)

            success = True
        except Exception as e:
            logger.error(f"Exception thrown: {e}")
            logger.warning(f"Failed to process page: {row['url']}, retries left: {retries-tries}")
            tries += 1
        finally:
            driver.quit()

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")
    else:
        logger.info(f"Successfully parsed: {row['url']}")
```
driver.find_elements(By.CSS_SELECTOR, "li[class='description__job-criteria-item']") finds all the items from our criteria list:

- job_criteria[0]: seniority level
- job_criteria[1]: position type
- job_criteria[2]: job function
- job_criteria[3]: industry

Next, we need a for loop to scrape details from every posting we found. Here is our first iteration of process_results(). Later on, we'll rewrite it and add multithreading support.

```python
def process_results(csv_file, location, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        for row in reader:
            process_posting(row, location, retries=retries)
```
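One thing to watch in process_posting(): indexing job_criteria[0] through job_criteria[3] assumes every posting exposes all four criteria. A slightly more defensive variant (my own sketch, not the article's code) pads the list first:

```python
def extract_criteria(job_criteria):
    # Collect the text of each criteria item, padding with empty strings
    # so missing entries don't raise an IndexError
    texts = [item.text for item in job_criteria] + [""] * 4
    seniority = texts[0].replace("Seniority level", "")
    position_type = texts[1].replace("Employment type", "")
    job_function = texts[2].replace("Job function", "")
    industry = texts[3].replace("Industries", "")
    return seniority, position_type, job_function, industry
```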
import osimport csvimport jsonfrom selenium import webdriverfrom selenium.webdriver.common.by import Byimport loggingfrom urllib.parse import urlencodeimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] options = webdriver.ChromeOptions()options.add_argument("--headless") def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" job_title: str = "" url: str = "" location: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, locality, page_number, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") formatted_locality = locality.replace(" ", "+") url = f"https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords={formatted_keyword}&location={formatted_locality}&original_referer=&start={page_number*10}" tries = 0 success = False while tries <= retries and not success: driver = webdriver.Chrome(options=options) try: scrapeops_proxy_url = get_scrapeops_url(url, location=location) driver.get(scrapeops_proxy_url) div_cards = driver.find_elements(By.CSS_SELECTOR, "div[class='base-search-card__info']") if not div_cards: driver.save_screenshot("debug.png") raise Exception("Page did not load correctly, please check debug.png") for div_card in div_cards: company_name = div_card.find_element(By.CSS_SELECTOR, "h4[class='base-search-card__subtitle']").text print("company name", company_name) job_title = div_card.find_element(By.CSS_SELECTOR, "h3[class='base-search-card__title']").text parent = div_card.find_element(By.XPATH, "..") link = parent.find_element(By.CSS_SELECTOR, "a") job_link = link.get_attribute("href") location = div_card.find_element(By.CSS_SELECTOR, "span[class='job-search-card__location']").text search_data = SearchData( name=company_name, job_title=job_title, url=job_link, location=location ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, pages, location, locality, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, [keyword] * pages, [location] * pages, [locality] * pages, range(pages), [data_pipeline] * pages, [retries] * pages ) def process_posting(row, location, retries=3): url = row["url"] tries = 0 success = False while tries <= retries and not success: driver = webdriver.Chrome(options=options) try: driver.get(url, location=location) job_criteria = driver.find_elements(By.CSS_SELECTOR, "li[class='description__job-criteria-item']") seniority = job_criteria[0].text.replace("Seniority level", "") position_type = job_criteria[1].text.replace("Employment type", "") job_function = job_criteria[2].text.replace("Job function", "") industry = job_criteria[3].text.replace("Industries", "") job_data = { "name": row["name"], "seniority": seniority, "position_type": position_type, "job_function": job_function, "industry": industry } print(job_data) success = True except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {row['url']}, retries left: {retries-tries}") tries += 1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: 
{row['url']}") def process_results(csv_file, location, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) for row in reader: process_posting(row, location, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 PAGES = 1 LOCATION = "us" LOCALITY = "United States" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["software engineer"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, PAGES, LOCATION, LOCALITY, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") for file in aggregate_files: process_results(file, LOCATION, retries=MAX_RETRIES)
To store this data, we already have a working DataPipeline. We just need another dataclass. Take a look below at JobData. Just like our SearchData from earlier, we use it to represent the data we scraped from the page. We'll pass this into our DataPipeline which will then pipe our data into a CSV file.

```python
@dataclass
class JobData:
    name: str = ""
    seniority: str = ""
    position_type: str = ""
    job_function: str = ""
    industry: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())
```
Inside process_posting(), we now open a new DataPipeline. Then, instead of printing our parsed data, we create a JobData object out of it and then pass our JobData
into the pipeline.import osimport csvimport jsonfrom selenium import webdriverfrom selenium.webdriver.common.by import Byimport loggingfrom urllib.parse import urlencodeimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] options = webdriver.ChromeOptions()options.add_argument("--headless") def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" job_title: str = "" url: str = "" location: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) @dataclassclass JobData: name: str = "" seniority: str = "" position_type: str = "" job_function: str = "" industry: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, locality, page_number, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") formatted_locality = locality.replace(" ", "+") url = f"https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords={formatted_keyword}&location={formatted_locality}&original_referer=&start={page_number*10}" tries = 0 success = False while tries <= retries and not success: driver = webdriver.Chrome(options=options) try: scrapeops_proxy_url = get_scrapeops_url(url, location=location) driver.get(scrapeops_proxy_url) div_cards = driver.find_elements(By.CSS_SELECTOR, "div[class='base-search-card__info']") if not div_cards: driver.save_screenshot("debug.png") raise Exception("Page did not load correctly, please check debug.png") for div_card in div_cards: company_name = div_card.find_element(By.CSS_SELECTOR, "h4[class='base-search-card__subtitle']").text print("company name", company_name) job_title = div_card.find_element(By.CSS_SELECTOR, "h3[class='base-search-card__title']").text parent = div_card.find_element(By.XPATH, "..") link = parent.find_element(By.CSS_SELECTOR, "a") job_link = link.get_attribute("href") location = div_card.find_element(By.CSS_SELECTOR, "span[class='job-search-card__location']").text search_data = SearchData( name=company_name, job_title=job_title, url=job_link, location=location ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, pages, location, locality, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, [keyword] * pages, [location] * pages, [locality] * pages, range(pages), [data_pipeline] * pages, [retries] * pages ) def process_posting(row, location, retries=3): url = row["url"] tries = 0 success = False while tries <= retries and not success: driver = webdriver.Chrome(options=options) try: driver.get(url, location=location) job_pipeline = DataPipeline(csv_filename=f"{row['name'].replace(' ', '-')}.csv") job_criteria = driver.find_elements(By.CSS_SELECTOR, "li[class='description__job-criteria-item']") seniority = job_criteria[0].text.replace("Seniority level", "") position_type = job_criteria[1].text.replace("Employment type", "") job_function = job_criteria[2].text.replace("Job function", "") industry = job_criteria[3].text.replace("Industries", "") job_data = JobData( name=row["name"], seniority=seniority, position_type=position_type, job_function=job_function, industry=industry ) job_pipeline.add_data(job_data) job_pipeline.close_pipeline() success = True except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {row['url']}, retries left: {retries-tries}") tries += 1 finally: 
driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['url']}") def process_results(csv_file, location, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) for row in reader: process_posting(row, location, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 PAGES = 1 LOCATION = "us" LOCALITY = "United States" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["software engineer"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, PAGES, LOCATION, LOCALITY, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") for file in aggregate_files: process_results(file, LOCATION, retries=MAX_RETRIES)
JobData holds the data we pull from the page. DataPipeline takes a JobData object and pipes it to a CSV file.

Next, we'll use ThreadPoolExecutor for concurrency just like we did earlier. Take a look at our refactored version of process_results().

```python
def process_results(csv_file, location, max_threads=5, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
            executor.map(
                process_posting,
                reader,
                [location] * len(reader),
                [retries] * len(reader)
            )
```
Here are the arguments to executor.map():

- process_posting: the function we want to call on multiple threads.
- Our other arguments to process_posting get passed in as arrays, one element per call.

Finally, to route each posting request through the proxy, we change the fetch inside process_posting() to:

driver.get(get_scrapeops_url(url, location=location))
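In other words, adding proxy support to the posting scraper is a one-line change; a rough before/after sketch using the same names as the code above:

```python
# Before adding the proxy: fetch the posting URL directly
driver.get(url)

# After adding the proxy: wrap the URL with get_scrapeops_url() first
driver.get(get_scrapeops_url(url, location=location))
```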
import osimport csvimport jsonfrom selenium import webdriverfrom selenium.webdriver.common.by import Byimport loggingfrom urllib.parse import urlencodeimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] options = webdriver.ChromeOptions()options.add_argument("--headless") def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" job_title: str = "" url: str = "" location: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) @dataclassclass JobData: name: str = "" seniority: str = "" position_type: str = "" job_function: str = "" industry: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, locality, page_number, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") formatted_locality = locality.replace(" ", "+") url = f"https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords={formatted_keyword}&location={formatted_locality}&original_referer=&start={page_number*10}" tries = 0 success = False while tries <= retries and not success: driver = webdriver.Chrome(options=options) try: scrapeops_proxy_url = get_scrapeops_url(url, location=location) driver.get(scrapeops_proxy_url) div_cards = driver.find_elements(By.CSS_SELECTOR, "div[class='base-search-card__info']") if not div_cards: driver.save_screenshot("debug.png") raise Exception("Page did not load correctly, please check debug.png") for div_card in div_cards: company_name = div_card.find_element(By.CSS_SELECTOR, "h4[class='base-search-card__subtitle']").text print("company name", company_name) job_title = div_card.find_element(By.CSS_SELECTOR, "h3[class='base-search-card__title']").text parent = div_card.find_element(By.XPATH, "..") link = parent.find_element(By.CSS_SELECTOR, "a") job_link = link.get_attribute("href") location = div_card.find_element(By.CSS_SELECTOR, "span[class='job-search-card__location']").text search_data = SearchData( name=company_name, job_title=job_title, url=job_link, location=location ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, pages, location, locality, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, [keyword] * pages, [location] * pages, [locality] * pages, range(pages), [data_pipeline] * pages, [retries] * pages ) def process_posting(row, location, retries=3): url = row["url"] tries = 0 success = False while tries <= retries and not success: driver = webdriver.Chrome(options=options) try: driver.get(get_scrapeops_url(url, location=location)) job_pipeline = DataPipeline(csv_filename=f"{row['name'].replace(' ', '-')}.csv") job_criteria = driver.find_elements(By.CSS_SELECTOR, "li[class='description__job-criteria-item']") seniority = job_criteria[0].text.replace("Seniority level", "") position_type = job_criteria[1].text.replace("Employment type", "") job_function = job_criteria[2].text.replace("Job function", "") industry = job_criteria[3].text.replace("Industries", "") job_data = JobData( name=row["name"], seniority=seniority, position_type=position_type, job_function=job_function, industry=industry ) job_pipeline.add_data(job_data) job_pipeline.close_pipeline() success = True except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {row['url']}, retries left: {retries-tries}") tries 
+= 1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['url']}") def process_results(csv_file, location, max_threads=5, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( process_posting, reader, [location] * len(reader), [retries] * len(reader) ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 PAGES = 1 LOCATION = "us" LOCALITY = "United States" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["software engineer"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, PAGES, LOCATION, LOCALITY, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") for file in aggregate_files: process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
PAGES
to 3 and our MAX_THREADS
to 5.If you need a refresher on our main
, you can see it again below.if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 PAGES = 3 LOCATION = "us" LOCALITY = "United States" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["software engineer"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, PAGES, LOCATION, LOCALITY, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") for file in aggregate_files: process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
robots.txt
. Their terms are available here and their robots.txt
is here.As stated at the top of their robots.txt
, crawling LinkedIn is explicitly prohibited. By scraping LinkedIn, you can have your account suspended, banned, or even deleted.Then check out ScrapeOps, the complete toolkit for web scraping.
config.json
file to it.{"api_key": "your-super-secret-api-key"}
.const puppeteer = require("puppeteer");const createCsvWriter = require("csv-writer").createObjectCsvWriter;const csvParse = require("csv-parse");const fs = require("fs"); const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key; async function writeToCsv(data, outputFile) { let success = false; while (!success) { if (!data || data.length === 0) { throw new Error("No data to write!"); } const fileExists = fs.existsSync(outputFile); if (!(data instanceof Array)) { data = [data] } const headers = Object.keys(data[0]).map(key => ({id: key, title: key})) const csvWriter = createCsvWriter({ path: outputFile, header: headers, append: fileExists }); try { await csvWriter.writeRecords(data); success = true; } catch (e) { console.log("Failed data", data); throw new Error("Failed to write to csv"); } }} async function readCsv(inputFile) { const results = []; const parser = fs.createReadStream(inputFile).pipe(csvParse.parse({ columns: true, delimiter: ",", trim: true, skip_empty_lines: true })); for await (const record of parser) { results.push(record); } return results;} function range(start, end) { const array = []; for (let i=start; i<end; i++) { array.push(i); } return array;} function getScrapeOpsUrl(url, location="us") { const params = new URLSearchParams({ api_key: API_KEY, url: url, country: location }); return `https://proxy.scrapeops.io/v1/?${params.toString()}`;} async function scrapeSearchResults(browser, keyword, pageNumber, locality, location="us", retries=3) { let tries = 0; let success = false; while (tries <= retries && !success) { const formattedKeyword = keyword.replace(" ", "+"); const formattedLocality = locality.replace(" ", "+"); const page = await browser.newPage(); try { const url = `https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=${formattedKeyword}&location=${formattedLocality}&original_referer=&start=${pageNumber*10}`; const proxyUrl = getScrapeOpsUrl(url, location); await page.goto(proxyUrl, { timeout: 0 }); console.log(`Successfully fetched: ${url}`); const divCards = await page.$$("div[class='base-search-card__info']"); for (const divCard of divCards) { const nameElement = await divCard.$("h4[class='base-search-card__subtitle']"); const name = await page.evaluate(element => element.textContent, nameElement); const jobTitleElement = await divCard.$("h3[class='base-search-card__title']"); const jobTitle = await page.evaluate(element => element.textContent, jobTitleElement); const parentElement = await page.evaluateHandle(element => element.parentElement, divCard); const aTag = await parentElement.$("a"); const link = await page.evaluate(element => element.getAttribute("href"), aTag); const jobLocationElement = await divCard.$("span[class='job-search-card__location']"); const jobLocation = await page.evaluate(element => element.textContent, jobLocationElement); const searchData = { name: name.trim(), job_title: jobTitle.trim(), url: link.trim(), location: jobLocation.trim() }; await writeToCsv([searchData], `${keyword.replace(" ", "-")}.csv`); } success = true; } catch (err) { console.log(`Error: ${err}, tries left ${retries - tries}`); tries++; } finally { await page.close(); } }} async function startCrawl(keyword, pages, locality, location, concurrencyLimit, retries) { const pageList = range(0, pages); const browser = await puppeteer.launch(); while (pageList.length > 0) { const currentBatch = pageList.splice(0, concurrencyLimit); const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, locality, 
location, retries)); try { await Promise.all(tasks); } catch (err) { console.log(`Failed to process batch: ${err}`); } } await browser.close();} async function processJob(browser, row, location, retries = 3) { const url = row.url; let tries = 0; let success = false; while (tries <= retries && !success) { const page = await browser.newPage(); try { const response = await page.goto(getScrapeOpsUrl(url, location), { timeout: 0 }); if (!response || response.status() !== 200) { throw new Error("Failed to fetch page, status:", response.status()); } const jobCriteria = await page.$$("li[class='description__job-criteria-item']"); if (jobCriteria.length < 4) { throw new Error("Job Criteria Not Found!"); } const seniority = (await page.evaluate(element => element.textContent, jobCriteria[0])).replace("Seniority level", ""); const positionType = (await page.evaluate(element => element.textContent, jobCriteria[1])).replace("Employment type", ""); const jobFunction = (await page.evaluate(element => element.textContent, jobCriteria[2])).replace("Job function", ""); const industry = (await page.evaluate(element => element.textContent, jobCriteria[3])).replace("Industries", ""); const jobData = { name: row.name, seniority: seniority.trim(), position_type: positionType.trim(), job_function: jobFunction.trim(), industry: industry.trim() } await writeToCsv([jobData], `${row.name.replace(" ", "-")}-${row.job_title.replace(" ", "-")}.csv`); success = true; console.log("Successfully parsed", row.url); } catch (err) { tries++; console.log(`Error: ${err}, tries left: ${retries-tries}, url: ${getScrapeOpsUrl(url)}`); } finally { await page.close(); } } } async function processResults(csvFile, location, concurrencyLimit, retries) { const rows = await readCsv(csvFile); const browser = await puppeteer.launch();; while (rows.length > 0) { const currentBatch = rows.splice(0, concurrencyLimit); const tasks = currentBatch.map(row => processJob(browser, row, location, retries)); try { await Promise.all(tasks); } catch (err) { console.log(`Failed to process batch: ${err}`); } } await browser.close(); } async function main() { const keywords = ["software engineer"]; const concurrencyLimit = 5; const pages = 1; const location = "us"; const locality = "United States"; const retries = 3; const aggregateFiles = []; for (const keyword of keywords) { console.log("Crawl starting"); console.time("startCrawl"); await startCrawl(keyword, pages, locality, location, concurrencyLimit, retries); console.timeEnd("startCrawl"); console.log("Crawl complete"); aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`); } console.log("Starting scrape"); for (const file of aggregateFiles) { console.time("processResults"); await processResults(file, location, concurrencyLimit, retries); console.timeEnd("processResults"); } console.log("Scrape complete");} main();
main
to fine-tune your results:keywords
: An array of job titles or terms to be used as search queries on LinkedInconcurrencyLimit
: The maximum number of pages or tasks processed concurrently.pages
: The number of pages of search results to crawl for each keyword.location
: A two-letter country code (e.g., "us") specifying the country for the search results.locality
: The human-readable location name (e.g., "United States") used in the search query.retries
: The number of retry attempts allowed for failed tasks (e.g., failed page loads or data extractions).node name-of-your-script
or node name-of-your-script.js
will run the scraper.Modern NodeJS doesn't require a file extension in the name.Once it's done running, you'll get a CSV named after your search. This one will contain all of your search data. You get an individual report generated for each job listing as well. These individual files contain more detailed information about each job posting.ThreadPoolExecutor
to add support for multithreading and therefore concurrency.ThreadPoolExecutor
to scrape posting data concurrently.https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=${formattedKeyword}&location=${formattedLocality}&original_referer=&start=
https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=software+engineer&location=united+states&original_referer=
https://www.linkedin.com/jobs-guest/jobs/api
/api
inside of it. Our requests are actually going to their API.Surprisingly, this API endpoint doesn't respond with JSON or XML; it gives us straight HTML. In years of web development and scraping, LinkedIn is the only place I've ever seen this.The screenshot below gives us a barebones HTML page without any styling whatsoever, but it is in fact a webpage. When you're viewing data from the main page, the page fetches this HTML and uses it to update your screen.div
elements. Each one we want has a class name of base-search-card__info
.For individual job pages, we look for li
elements with a class of description__job-criteria-item
.In the image below, you can see a div
. Its class name is base-search-card__info
. This is one of our search results. To extract this data, we need to find each div
matching this class.li
element we want to scrape. Each li
element has the class name description__job-criteria-item
. For these, we'll extract all li
elements matching our target class.&start={pageNumber*10}
. For the first page (page 0) of the Software Engineer search, we get this URL:https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=software+engineer&location=United+States&original_referer=&start=0
pageNumber*10
because we begin counting at 0 and each request yields 10 results. Page 0 (0 * 10) yields results 1 through 10, page 1 yields results 11 through 20, and so on.Our fully formatted URL looks like this:`https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=${formattedKeyword}&location=${formattedLocality}&original_referer=&start=${pageNumber*10}`
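To make the offset math concrete, here is a minimal standalone sketch (not part of the scraper itself) that maps 0-indexed page numbers to start offsets and builds the matching search URLs:
// Minimal sketch: LinkedIn paginates in blocks of 10, so page N maps to start=N*10.
// buildSearchUrl() mirrors the template string used in scrapeSearchResults().
function buildSearchUrl(keyword, locality, pageNumber) {
    const formattedKeyword = keyword.replace(" ", "+");
    const formattedLocality = locality.replace(" ", "+");
    return `https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=${formattedKeyword}&location=${formattedLocality}&original_referer=&start=${pageNumber*10}`;
}

// Pages 0, 1 and 2 become start=0, start=10 and start=20.
for (const pageNumber of [0, 1, 2]) {
    console.log(buildSearchUrl("software engineer", "United States", pageNumber));
}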
country
."country": "us"
into the API."country": "uk"
mkdir linkedin-jobs-scraper
cd linkedin-jobs-scraper
npm init --y
npm install puppeteer
npm install csv-writer
npm install csv-parse
fs is built into NodeJS, so there is nothing extra to install for it.
const puppeteer = require("puppeteer");const createCsvWriter = require("csv-writer").createObjectCsvWriter;const csvParse = require("csv-parse");const fs = require("fs"); const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key; async function scrapeSearchResults(browser, keyword, locality, location="us", retries=3) { let tries = 0; let success = false; while (tries <= retries && !success) { const formattedKeyword = keyword.replace(" ", "+"); const formattedLocality = locality.replace(" ", "+"); const page = await browser.newPage(); try { const url = `https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=${formattedKeyword}&location=${formattedLocality}&original_referer=`; await page.goto(url); console.log(`Successfully fetched: ${url}`); const divCards = await page.$$("div[class='base-search-card__info']"); for (const divCard of divCards) { const nameElement = await divCard.$("h4[class='base-search-card__subtitle']"); const name = await page.evaluate(element => element.textContent, nameElement); const jobTitleElement = await divCard.$("h3[class='base-search-card__title']"); const jobTitle = await page.evaluate(element => element.textContent, jobTitleElement); const parentElement = await page.evaluateHandle(element => element.parentElement, divCard); const aTag = await parentElement.$("a"); const link = await page.evaluate(element => element.getAttribute("href"), aTag); const jobLocationElement = await divCard.$("span[class='job-search-card__location']"); const jobLocation = await page.evaluate(element => element.textContent, jobLocationElement); const searchData = { name: name.trim(), job_title: jobTitle.trim(), url: link.trim(), location: jobLocation.trim() }; console.log(searchData); } success = true; } catch (err) { console.log(`Error: ${err}, tries left ${retries - tries}`); tries++; } finally { await page.close(); } }} async function startCrawl(keyword, locality, location, retries) { const browser = await puppeteer.launch(); await scrapeSearchResults(browser, keyword, locality, location, retries); await browser.close();} async function main() { const keywords = ["software engineer"]; const concurrencyLimit = 5; const pages = 1; const location = "us"; const locality = "United States"; const retries = 3; const aggregateFiles = []; for (const keyword of keywords) { console.log("Crawl starting"); console.time("startCrawl"); await startCrawl(keyword, pages, locality, location, retries); console.timeEnd("startCrawl"); console.log("Crawl complete"); }} main();
main()
, we call startCrawl()
. At the moment, this function opens a browser and passes it into our parsing function, scrapeSearchResults()
.
await puppeteer.launch();
launches the browser.scrapeSearchResults(browser, keyword, locality, location, retries)
.await browser.close();
scrapeSearchResults()
.
divCards
with await page.$$("div[class='base-search-card__info']");
.page.evaluate()
: await page.evaluate(element => element.textContent, nameElement)
. This method is used for the name
, jobTitle
, link
, and jobLocation
.searchData
object and remove the whitespace and any newline characters with the trim()
method.searchData
, we print it to the console.To paginate our results, we add start={pageNumber*10}
to the end of our URL.startCrawl()
to scrape multiple pages.for
loop that allows us to do this. This is only temporary; later on, we'll replace it with more powerful code that performs our search concurrently.`https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=${formattedKeyword}&location=${formattedLocality}&original_referer=&start=${pageNumber*10}`
range()
function similar to the one from Python.function range(start, end) { const array = []; for (let i=start; i<end; i++) { array.push(i); } return array;}
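As a quick sanity check, the sketch below (using the range() function above) shows that the end value is exclusive, exactly like Python's range():
// range() is exclusive of its end value, just like Python's range().
const pageList = range(0, 3);
console.log(pageList); // [0, 1, 2] -- the 0-indexed page numbers we turn into start offsets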
startCrawl()
. It uses a simple for
loop to iterate through our pages.async function startCrawl(keyword, pages, locality, location, retries) { const pageList = range(0, pages); const browser = await puppeteer.launch(); for (const page of pageList) { await scrapeSearchResults(browser, keyword, page, locality, location, retries) } await browser.close();}
const puppeteer = require("puppeteer");const createCsvWriter = require("csv-writer").createObjectCsvWriter;const csvParse = require("csv-parse");const fs = require("fs"); const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key; function range(start, end) { const array = []; for (let i=start; i<end; i++) { array.push(i); } return array;} async function scrapeSearchResults(browser, keyword, pageNumber, locality, location="us", retries=3) { let tries = 0; let success = false; while (tries <= retries && !success) { const formattedKeyword = keyword.replace(" ", "+"); const formattedLocality = locality.replace(" ", "+"); const page = await browser.newPage(); try { const url = `https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=${formattedKeyword}&location=${formattedLocality}&original_referer=&start=${pageNumber*10}`; await page.goto(url); console.log(`Successfully fetched: ${url}`); const divCards = await page.$$("div[class='base-search-card__info']"); for (const divCard of divCards) { const nameElement = await divCard.$("h4[class='base-search-card__subtitle']"); const name = await page.evaluate(element => element.textContent, nameElement); const jobTitleElement = await divCard.$("h3[class='base-search-card__title']"); const jobTitle = await page.evaluate(element => element.textContent, jobTitleElement); const parentElement = await page.evaluateHandle(element => element.parentElement, divCard); const aTag = await parentElement.$("a"); const link = await page.evaluate(element => element.getAttribute("href"), aTag); const jobLocationElement = await divCard.$("span[class='job-search-card__location']"); const jobLocation = await page.evaluate(element => element.textContent, jobLocationElement); const searchData = { name: name.trim(), job_title: jobTitle.trim(), url: link.trim(), location: jobLocation.trim() }; console.log(searchData); } success = true; } catch (err) { console.log(`Error: ${err}, tries left ${retries - tries}`); tries++; } finally { await page.close(); } }} async function startCrawl(keyword, pages, locality, location, retries) { const pageList = range(0, pages); const browser = await puppeteer.launch(); for (const page of pageList) { await scrapeSearchResults(browser, keyword, page, locality, location, retries) } await browser.close();} async function main() { const keywords = ["software engineer"]; const concurrencyLimit = 5; const pages = 1; const location = "us"; const locality = "United States"; const retries = 3; const aggregateFiles = []; for (const keyword of keywords) { console.log("Crawl starting"); console.time("startCrawl"); await startCrawl(keyword, pages, locality, location, retries); console.timeEnd("startCrawl"); console.log("Crawl complete"); }} main();
start={pageNumber*10}
allows us to control our pagination. We use pageNumber*10
because we get 10 results per page and our results start at zero.range()
and startCrawl()
, we can now scrape an array of pages.writeToCsv()
function.writeToCsv()
.success
variable and setting it to false
.append
to the fileExists
variable.data
isn't an array, we convert it to one.await csvWriter.writeRecords(data);
to write our data to the CSV file.success
to true
.async function writeToCsv(data, outputFile) { let success = false; while (!success) { if (!data || data.length === 0) { throw new Error("No data to write!"); } const fileExists = fs.existsSync(outputFile); if (!(data instanceof Array)) { data = [data] } const headers = Object.keys(data[0]).map(key => ({id: key, title: key})) const csvWriter = createCsvWriter({ path: outputFile, header: headers, append: fileExists }); try { await csvWriter.writeRecords(data); success = true; } catch (e) { console.log("Failed data", data); throw new Error("Failed to write to csv"); } }}
data
to a CSV file.const puppeteer = require("puppeteer");const createCsvWriter = require("csv-writer").createObjectCsvWriter;const csvParse = require("csv-parse");const fs = require("fs"); const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key; async function writeToCsv(data, outputFile) { let success = false; while (!success) { if (!data || data.length === 0) { throw new Error("No data to write!"); } const fileExists = fs.existsSync(outputFile); if (!(data instanceof Array)) { data = [data] } const headers = Object.keys(data[0]).map(key => ({id: key, title: key})) const csvWriter = createCsvWriter({ path: outputFile, header: headers, append: fileExists }); try { await csvWriter.writeRecords(data); success = true; } catch (e) { console.log("Failed data", data); throw new Error("Failed to write to csv"); } }} function range(start, end) { const array = []; for (let i=start; i<end; i++) { array.push(i); } return array;} async function scrapeSearchResults(browser, keyword, pageNumber, locality, location="us", retries=3) { let tries = 0; let success = false; while (tries <= retries && !success) { const formattedKeyword = keyword.replace(" ", "+"); const formattedLocality = locality.replace(" ", "+"); const page = await browser.newPage(); try { const url = `https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=${formattedKeyword}&location=${formattedLocality}&original_referer=&start=${pageNumber*10}`; await page.goto(url); console.log(`Successfully fetched: ${url}`); const divCards = await page.$$("div[class='base-search-card__info']"); for (const divCard of divCards) { const nameElement = await divCard.$("h4[class='base-search-card__subtitle']"); const name = await page.evaluate(element => element.textContent, nameElement); const jobTitleElement = await divCard.$("h3[class='base-search-card__title']"); const jobTitle = await page.evaluate(element => element.textContent, jobTitleElement); const parentElement = await page.evaluateHandle(element => element.parentElement, divCard); const aTag = await parentElement.$("a"); const link = await page.evaluate(element => element.getAttribute("href"), aTag); const jobLocationElement = await divCard.$("span[class='job-search-card__location']"); const jobLocation = await page.evaluate(element => element.textContent, jobLocationElement); const searchData = { name: name.trim(), job_title: jobTitle.trim(), url: link.trim(), location: jobLocation.trim() }; await writeToCsv([searchData], `${keyword.replace(" ", "-")}.csv`); } success = true; } catch (err) { console.log(`Error: ${err}, tries left ${retries - tries}`); tries++; } finally { await page.close(); } }} async function startCrawl(keyword, pages, locality, location, retries) { const pageList = range(0, pages); const browser = await puppeteer.launch(); for (const page of pageList) { await scrapeSearchResults(browser, keyword, page, locality, location, retries) } await browser.close();} async function main() { const keywords = ["software engineer"]; const concurrencyLimit = 5; const pages = 1; const location = "us"; const locality = "United States"; const retries = 3; const aggregateFiles = []; for (const keyword of keywords) { console.log("Crawl starting"); console.time("startCrawl"); await startCrawl(keyword, pages, locality, location, retries); console.timeEnd("startCrawl"); console.log("Crawl complete"); aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`); }} main();
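Before we move on, here is a quick usage sketch of writeToCsv(). The record below is made-up sample data, but it shows how the first call creates the file with a header row while later calls append to it:
// Hypothetical sample record; its keys become the CSV headers.
const sampleRow = {
    name: "Example Corp",
    job_title: "Software Engineer",
    url: "https://www.linkedin.com/jobs/view/0000000000",
    location: "United States"
};

// The first call creates sample.csv and writes the header row; the second call appends.
writeToCsv([sampleRow], "sample.csv")
    .then(() => writeToCsv([sampleRow], "sample.csv"))
    .then(() => console.log("Wrote two rows to sample.csv"));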
async
support to scrape concurrently. We'll rewrite startCrawl()
to handle this.Here is our final startCrawl()
function.for
loop, we create a list of tasks
by splicing from our pageList
up to our concurrencyLimit
.await
all these tasks
to resolve with Promise.all()
.concurrencyLimit
to 5, we'll scrape up to 5 pages at a time.async function startCrawl(keyword, pages, locality, location, concurrencyLimit, retries) { const pageList = range(0, pages); const browser = await puppeteer.launch(); while (pageList.length > 0) { const currentBatch = pageList.splice(0, concurrencyLimit); const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, locality, location, retries)); try { await Promise.all(tasks); } catch (err) { console.log(`Failed to process batch: ${err}`); } } await browser.close();}
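If the splice and Promise.all pattern is new to you, here is a stripped-down sketch with dummy tasks (no browser involved). The while loop keeps pulling concurrencyLimit items off the front of the array, and each batch is awaited before the next one starts:
// Standalone demo of the batching pattern used in startCrawl().
// fakeTask() is a dummy stand-in for scrapeSearchResults().
function fakeTask(pageNumber) {
    return new Promise(resolve => setTimeout(() => {
        console.log(`finished page ${pageNumber}`);
        resolve();
    }, 100));
}

async function runInBatches(items, concurrencyLimit) {
    while (items.length > 0) {
        // splice() removes up to concurrencyLimit items from the front of the array.
        const currentBatch = items.splice(0, concurrencyLimit);
        // Wait for the whole batch to finish before starting the next one.
        await Promise.all(currentBatch.map(item => fakeTask(item)));
    }
}

// With a limit of 5, twelve pages run in batches of 5, 5 and 2.
runInBatches([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], 5);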
const puppeteer = require("puppeteer");const createCsvWriter = require("csv-writer").createObjectCsvWriter;const csvParse = require("csv-parse");const fs = require("fs"); const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key; async function writeToCsv(data, outputFile) { let success = false; while (!success) { if (!data || data.length === 0) { throw new Error("No data to write!"); } const fileExists = fs.existsSync(outputFile); if (!(data instanceof Array)) { data = [data] } const headers = Object.keys(data[0]).map(key => ({id: key, title: key})) const csvWriter = createCsvWriter({ path: outputFile, header: headers, append: fileExists }); try { await csvWriter.writeRecords(data); success = true; } catch (e) { console.log("Failed data", data); throw new Error("Failed to write to csv"); } }} function range(start, end) { const array = []; for (let i=start; i<end; i++) { array.push(i); } return array;} async function scrapeSearchResults(browser, keyword, pageNumber, locality, location="us", retries=3) { let tries = 0; let success = false; while (tries <= retries && !success) { const formattedKeyword = keyword.replace(" ", "+"); const formattedLocality = locality.replace(" ", "+"); const page = await browser.newPage(); try { const url = `https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=${formattedKeyword}&location=${formattedLocality}&original_referer=&start=${pageNumber*10}`; await page.goto(url); console.log(`Successfully fetched: ${url}`); const divCards = await page.$$("div[class='base-search-card__info']"); for (const divCard of divCards) { const nameElement = await divCard.$("h4[class='base-search-card__subtitle']"); const name = await page.evaluate(element => element.textContent, nameElement); const jobTitleElement = await divCard.$("h3[class='base-search-card__title']"); const jobTitle = await page.evaluate(element => element.textContent, jobTitleElement); const parentElement = await page.evaluateHandle(element => element.parentElement, divCard); const aTag = await parentElement.$("a"); const link = await page.evaluate(element => element.getAttribute("href"), aTag); const jobLocationElement = await divCard.$("span[class='job-search-card__location']"); const jobLocation = await page.evaluate(element => element.textContent, jobLocationElement); const searchData = { name: name.trim(), job_title: jobTitle.trim(), url: link.trim(), location: jobLocation.trim() }; await writeToCsv([searchData], `${keyword.replace(" ", "-")}.csv`); } success = true; } catch (err) { console.log(`Error: ${err}, tries left ${retries - tries}`); tries++; } finally { await page.close(); } }} async function startCrawl(keyword, pages, locality, location, concurrencyLimit, retries) { const pageList = range(0, pages); const browser = await puppeteer.launch(); while (pageList.length > 0) { const currentBatch = pageList.splice(0, concurrencyLimit); const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, locality, location, retries)); try { await Promise.all(tasks); } catch (err) { console.log(`Failed to process batch: ${err}`); } } await browser.close();} async function main() { const keywords = ["software engineer"]; const concurrencyLimit = 5; const pages = 1; const location = "us"; const locality = "United States"; const retries = 3; const aggregateFiles = []; for (const keyword of keywords) { console.log("Crawl starting"); console.time("startCrawl"); await startCrawl(keyword, pages, locality, location, concurrencyLimit, retries); 
console.timeEnd("startCrawl"); console.log("Crawl complete"); aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`); }} main();
api_key
, url
and a country
.Let's explain these a little better.api_key
: This is literally a key to our ScrapeOps account. Your API key is used to authenticate your account when making requests.url
: This is the url of the site we want to scrape. ScrapeOps will fetch this site and send the result back to us.country
: We pass a country code in for this parameter. ScrapeOps reads our country code and routes our request through a server in the country we chose.function getScrapeOpsUrl(url, location="us") { const params = new URLSearchParams({ api_key: API_KEY, url: url, country: location }); return `https://proxy.scrapeops.io/v1/?${params.toString()}`;}
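For example, wrapping one of our search URLs looks roughly like this; the exact output depends on your API key, but the shape of the proxied URL is the same:
// Wrap a target URL so the request is routed through a US-based server.
const targetUrl = "https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=software+engineer&location=United+States&original_referer=&start=0";
console.log(getScrapeOpsUrl(targetUrl, "us"));
// Prints something like this (the target URL gets percent-encoded):
// https://proxy.scrapeops.io/v1/?api_key=your-super-secret-api-key&url=https%3A%2F%2Fwww.linkedin.com%2F...&country=us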
const puppeteer = require("puppeteer");const createCsvWriter = require("csv-writer").createObjectCsvWriter;const csvParse = require("csv-parse");const fs = require("fs"); const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key; async function writeToCsv(data, outputFile) { let success = false; while (!success) { if (!data || data.length === 0) { throw new Error("No data to write!"); } const fileExists = fs.existsSync(outputFile); if (!(data instanceof Array)) { data = [data] } const headers = Object.keys(data[0]).map(key => ({id: key, title: key})) const csvWriter = createCsvWriter({ path: outputFile, header: headers, append: fileExists }); try { await csvWriter.writeRecords(data); success = true; } catch (e) { console.log("Failed data", data); throw new Error("Failed to write to csv"); } }} function range(start, end) { const array = []; for (let i=start; i<end; i++) { array.push(i); } return array;} function getScrapeOpsUrl(url, location="us") { const params = new URLSearchParams({ api_key: API_KEY, url: url, country: location }); return `https://proxy.scrapeops.io/v1/?${params.toString()}`;} async function scrapeSearchResults(browser, keyword, pageNumber, locality, location="us", retries=3) { let tries = 0; let success = false; while (tries <= retries && !success) { const formattedKeyword = keyword.replace(" ", "+"); const formattedLocality = locality.replace(" ", "+"); const page = await browser.newPage(); try { const url = `https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=${formattedKeyword}&location=${formattedLocality}&original_referer=&start=${pageNumber*10}`; const proxyUrl = getScrapeOpsUrl(url, location); await page.goto(proxyUrl, { timeout: 0 }); console.log(`Successfully fetched: ${url}`); const divCards = await page.$$("div[class='base-search-card__info']"); for (const divCard of divCards) { const nameElement = await divCard.$("h4[class='base-search-card__subtitle']"); const name = await page.evaluate(element => element.textContent, nameElement); const jobTitleElement = await divCard.$("h3[class='base-search-card__title']"); const jobTitle = await page.evaluate(element => element.textContent, jobTitleElement); const parentElement = await page.evaluateHandle(element => element.parentElement, divCard); const aTag = await parentElement.$("a"); const link = await page.evaluate(element => element.getAttribute("href"), aTag); const jobLocationElement = await divCard.$("span[class='job-search-card__location']"); const jobLocation = await page.evaluate(element => element.textContent, jobLocationElement); const searchData = { name: name.trim(), job_title: jobTitle.trim(), url: link.trim(), location: jobLocation.trim() }; await writeToCsv([searchData], `${keyword.replace(" ", "-")}.csv`); } success = true; } catch (err) { console.log(`Error: ${err}, tries left ${retries - tries}`); tries++; } finally { await page.close(); } }} async function startCrawl(keyword, pages, locality, location, concurrencyLimit, retries) { const pageList = range(0, pages); const browser = await puppeteer.launch(); while (pageList.length > 0) { const currentBatch = pageList.splice(0, concurrencyLimit); const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, locality, location, retries)); try { await Promise.all(tasks); } catch (err) { console.log(`Failed to process batch: ${err}`); } } await browser.close();} async function main() { const keywords = ["software engineer"]; const concurrencyLimit = 5; const pages = 1; const location = "us"; 
const locality = "United States"; const retries = 3; const aggregateFiles = []; for (const keyword of keywords) { console.log("Crawl starting"); console.time("startCrawl"); await startCrawl(keyword, pages, locality, location, concurrencyLimit, retries); console.timeEnd("startCrawl"); console.log("Crawl complete"); aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`); }} main();
concurrencyLimit
of 5.Feel free to change any of the following from the main()
function.keywords
concurrencyLimit
pages
location
locality
retries
main()
if you'd like to review it.async function main() { const keywords = ["software engineer"]; const concurrencyLimit = 5; const pages = 3; const location = "us"; const locality = "United States"; const retries = 3; const aggregateFiles = []; for (const keyword of keywords) { console.log("Crawl starting"); console.time("startCrawl"); await startCrawl(keyword, pages, locality, location, concurrencyLimit, retries); console.timeEnd("startCrawl"); console.log("Crawl complete"); aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`); }}
processJob()
. We check for bad responses and throw
an Error
if we don't receive the correct response. If we get a good response, we continue on and parse the page.async function processJob(browser, row, location, retries = 3) { const url = row.url; let tries = 0; let success = false; while (tries <= retries && !success) { const page = await browser.newPage(); try { const response = await page.goto(url); if (!response || response.status() !== 200) { throw new Error("Failed to fetch page, status:", response.status()); } const jobCriteria = await page.$$("li[class='description__job-criteria-item']"); if (jobCriteria.length < 4) { throw new Error("Job Criteria Not Found!"); } const seniority = (await page.evaluate(element => element.textContent, jobCriteria[0])).replace("Seniority level", ""); const positionType = (await page.evaluate(element => element.textContent, jobCriteria[1])).replace("Employment type", ""); const jobFunction = (await page.evaluate(element => element.textContent, jobCriteria[2])).replace("Job function", ""); const industry = (await page.evaluate(element => element.textContent, jobCriteria[3])).replace("Industries", ""); const jobData = { name: row.name, seniority: seniority.trim(), position_type: positionType.trim(), job_function: jobFunction.trim(), industry: industry.trim() } console.log(jobData) success = true; console.log("Successfully parsed", row.url); } catch (err) { tries++; console.log(`Error: ${err}, tries left: ${retries-tries}, url: ${getScrapeOpsUrl(url)}`); } finally { await page.close(); } } }
jobCriteria = await page.$$("li[class='description__job-criteria-item']");
finds the items from our criteria list.const seniority = (await page.evaluate(element => element.textContent, jobCriteria[0])).replace("Seniority level", "");
: seniority levelconst positionType = (await page.evaluate(element => element.textContent, jobCriteria[1])).replace("Employment type", "");
: position typeconst jobFunction = (await page.evaluate(element => element.textContent, jobCriteria[2])).replace("Job function", "");
: job functionconst industry = (await page.evaluate(element => element.textContent, jobCriteria[3])).replace("Industries", "");
: industrypage.evaluate()
to pull the text from each element we find.row
, we need to read the rows from our CSV file. We'll read our file into an array and then we'll use a for
loop to scrape details from every posting we found.Here is our first iteration of processResults()
.Later on, we'll rewrite it and add concurrency support. It's pretty similar to our startCrawl()
function from earlier in this tutorial.async function processResults(csvFile, location, retries) { const rows = await readCsv(csvFile); const browser = await puppeteer.launch();; for (const row of rows) { await processJob(browser, row, location, retries) } await browser.close(); }
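Each row that readCsv() returns is a plain object keyed by the CSV headers our crawler wrote, so processJob() can read row.url and row.name directly. Here is a quick sketch, assuming the crawl has already produced software-engineer.csv:
// Read the crawl report back in; every record is keyed by the CSV headers.
readCsv("software-engineer.csv").then(rows => {
    // Each row looks like: { name: "...", job_title: "...", url: "...", location: "..." }
    console.log(`Loaded ${rows.length} rows`);
    if (rows.length > 0) {
        console.log(rows[0].name, rows[0].url);
    }
});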
const puppeteer = require("puppeteer");const createCsvWriter = require("csv-writer").createObjectCsvWriter;const csvParse = require("csv-parse");const fs = require("fs"); const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key; async function writeToCsv(data, outputFile) { let success = false; while (!success) { if (!data || data.length === 0) { throw new Error("No data to write!"); } const fileExists = fs.existsSync(outputFile); if (!(data instanceof Array)) { data = [data] } const headers = Object.keys(data[0]).map(key => ({id: key, title: key})) const csvWriter = createCsvWriter({ path: outputFile, header: headers, append: fileExists }); try { await csvWriter.writeRecords(data); success = true; } catch (e) { console.log("Failed data", data); throw new Error("Failed to write to csv"); } }} async function readCsv(inputFile) { const results = []; const parser = fs.createReadStream(inputFile).pipe(csvParse.parse({ columns: true, delimiter: ",", trim: true, skip_empty_lines: true })); for await (const record of parser) { results.push(record); } return results;} function range(start, end) { const array = []; for (let i=start; i<end; i++) { array.push(i); } return array;} function getScrapeOpsUrl(url, location="us") { const params = new URLSearchParams({ api_key: API_KEY, url: url, country: location }); return `https://proxy.scrapeops.io/v1/?${params.toString()}`;} async function scrapeSearchResults(browser, keyword, pageNumber, locality, location="us", retries=3) { let tries = 0; let success = false; while (tries <= retries && !success) { const formattedKeyword = keyword.replace(" ", "+"); const formattedLocality = locality.replace(" ", "+"); const page = await browser.newPage(); try { const url = `https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=${formattedKeyword}&location=${formattedLocality}&original_referer=&start=${pageNumber*10}`; const proxyUrl = getScrapeOpsUrl(url, location); await page.goto(proxyUrl, { timeout: 0 }); console.log(`Successfully fetched: ${url}`); const divCards = await page.$$("div[class='base-search-card__info']"); for (const divCard of divCards) { const nameElement = await divCard.$("h4[class='base-search-card__subtitle']"); const name = await page.evaluate(element => element.textContent, nameElement); const jobTitleElement = await divCard.$("h3[class='base-search-card__title']"); const jobTitle = await page.evaluate(element => element.textContent, jobTitleElement); const parentElement = await page.evaluateHandle(element => element.parentElement, divCard); const aTag = await parentElement.$("a"); const link = await page.evaluate(element => element.getAttribute("href"), aTag); const jobLocationElement = await divCard.$("span[class='job-search-card__location']"); const jobLocation = await page.evaluate(element => element.textContent, jobLocationElement); const searchData = { name: name.trim(), job_title: jobTitle.trim(), url: link.trim(), location: jobLocation.trim() }; await writeToCsv([searchData], `${keyword.replace(" ", "-")}.csv`); } success = true; } catch (err) { console.log(`Error: ${err}, tries left ${retries - tries}`); tries++; } finally { await page.close(); } }} async function startCrawl(keyword, pages, locality, location, concurrencyLimit, retries) { const pageList = range(0, pages); const browser = await puppeteer.launch(); while (pageList.length > 0) { const currentBatch = pageList.splice(0, concurrencyLimit); const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, locality, 
location, retries)); try { await Promise.all(tasks); } catch (err) { console.log(`Failed to process batch: ${err}`); } } await browser.close();} async function processJob(browser, row, location, retries = 3) { const url = row.url; let tries = 0; let success = false; while (tries <= retries && !success) { const page = await browser.newPage(); try { const response = await page.goto(url); if (!response || response.status() !== 200) { throw new Error("Failed to fetch page, status:", response.status()); } const jobCriteria = await page.$$("li[class='description__job-criteria-item']"); if (jobCriteria.length < 4) { throw new Error("Job Criteria Not Found!"); } const seniority = (await page.evaluate(element => element.textContent, jobCriteria[0])).replace("Seniority level", ""); const positionType = (await page.evaluate(element => element.textContent, jobCriteria[1])).replace("Employment type", ""); const jobFunction = (await page.evaluate(element => element.textContent, jobCriteria[2])).replace("Job function", ""); const industry = (await page.evaluate(element => element.textContent, jobCriteria[3])).replace("Industries", ""); const jobData = { name: row.name, seniority: seniority.trim(), position_type: positionType.trim(), job_function: jobFunction.trim(), industry: industry.trim() } console.log(jobData) success = true; console.log("Successfully parsed", row.url); } catch (err) { tries++; console.log(`Error: ${err}, tries left: ${retries-tries}, url: ${getScrapeOpsUrl(url)}`); } finally { await page.close(); } } } async function processResults(csvFile, location, retries) { const rows = await readCsv(csvFile); const browser = await puppeteer.launch();; for (const row of rows) { await processJob(browser, row, location, retries) } await browser.close(); } async function main() { const keywords = ["software engineer"]; const concurrencyLimit = 5; const pages = 1; const location = "us"; const locality = "United States"; const retries = 3; const aggregateFiles = []; for (const keyword of keywords) { console.log("Crawl starting"); console.time("startCrawl"); await startCrawl(keyword, pages, locality, location, concurrencyLimit, retries); console.timeEnd("startCrawl"); console.log("Crawl complete"); aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`); } console.log("Starting scrape"); for (const file of aggregateFiles) { console.time("processResults"); await processResults(file, location, retries); console.timeEnd("processResults"); } console.log("Scrape complete");} main();
jobData
object. We also already have a writeToCsv()
function. Instead of logging our jobData
to the console, we just need to store it.In the code below, we're going to do exactly that.const puppeteer = require("puppeteer");const createCsvWriter = require("csv-writer").createObjectCsvWriter;const csvParse = require("csv-parse");const fs = require("fs"); const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key; async function writeToCsv(data, outputFile) { let success = false; while (!success) { if (!data || data.length === 0) { throw new Error("No data to write!"); } const fileExists = fs.existsSync(outputFile); if (!(data instanceof Array)) { data = [data] } const headers = Object.keys(data[0]).map(key => ({id: key, title: key})) const csvWriter = createCsvWriter({ path: outputFile, header: headers, append: fileExists }); try { await csvWriter.writeRecords(data); success = true; } catch (e) { console.log("Failed data", data); throw new Error("Failed to write to csv"); } }} async function readCsv(inputFile) { const results = []; const parser = fs.createReadStream(inputFile).pipe(csvParse.parse({ columns: true, delimiter: ",", trim: true, skip_empty_lines: true })); for await (const record of parser) { results.push(record); } return results;} function range(start, end) { const array = []; for (let i=start; i<end; i++) { array.push(i); } return array;} function getScrapeOpsUrl(url, location="us") { const params = new URLSearchParams({ api_key: API_KEY, url: url, country: location }); return `https://proxy.scrapeops.io/v1/?${params.toString()}`;} async function scrapeSearchResults(browser, keyword, pageNumber, locality, location="us", retries=3) { let tries = 0; let success = false; while (tries <= retries && !success) { const formattedKeyword = keyword.replace(" ", "+"); const formattedLocality = locality.replace(" ", "+"); const page = await browser.newPage(); try { const url = `https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=${formattedKeyword}&location=${formattedLocality}&original_referer=&start=${pageNumber*10}`; const proxyUrl = getScrapeOpsUrl(url, location); await page.goto(proxyUrl, { timeout: 0 }); console.log(`Successfully fetched: ${url}`); const divCards = await page.$$("div[class='base-search-card__info']"); for (const divCard of divCards) { const nameElement = await divCard.$("h4[class='base-search-card__subtitle']"); const name = await page.evaluate(element => element.textContent, nameElement); const jobTitleElement = await divCard.$("h3[class='base-search-card__title']"); const jobTitle = await page.evaluate(element => element.textContent, jobTitleElement); const parentElement = await page.evaluateHandle(element => element.parentElement, divCard); const aTag = await parentElement.$("a"); const link = await page.evaluate(element => element.getAttribute("href"), aTag); const jobLocationElement = await divCard.$("span[class='job-search-card__location']"); const jobLocation = await page.evaluate(element => element.textContent, jobLocationElement); const searchData = { name: name.trim(), job_title: jobTitle.trim(), url: link.trim(), location: jobLocation.trim() }; await writeToCsv([searchData], `${keyword.replace(" ", "-")}.csv`); } success = true; } catch (err) { console.log(`Error: ${err}, tries left ${retries - tries}`); tries++; } finally { await page.close(); } }} async function startCrawl(keyword, pages, locality, location, concurrencyLimit, retries) { const pageList = range(0, pages); const browser = await puppeteer.launch(); while (pageList.length > 0) { const currentBatch = pageList.splice(0, concurrencyLimit); const 
tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, locality, location, retries)); try { await Promise.all(tasks); } catch (err) { console.log(`Failed to process batch: ${err}`); } } await browser.close();} async function processJob(browser, row, location, retries = 3) { const url = row.url; let tries = 0; let success = false; while (tries <= retries && !success) { const page = await browser.newPage(); try { const response = await page.goto(url); if (!response || response.status() !== 200) { throw new Error("Failed to fetch page, status:", response.status()); } const jobCriteria = await page.$$("li[class='description__job-criteria-item']"); if (jobCriteria.length < 4) { throw new Error("Job Criteria Not Found!"); } const seniority = (await page.evaluate(element => element.textContent, jobCriteria[0])).replace("Seniority level", ""); const positionType = (await page.evaluate(element => element.textContent, jobCriteria[1])).replace("Employment type", ""); const jobFunction = (await page.evaluate(element => element.textContent, jobCriteria[2])).replace("Job function", ""); const industry = (await page.evaluate(element => element.textContent, jobCriteria[3])).replace("Industries", ""); const jobData = { name: row.name, seniority: seniority.trim(), position_type: positionType.trim(), job_function: jobFunction.trim(), industry: industry.trim() } await writeToCsv([jobData], `${row.name.replace(" ", "-")}-${row.job_title.replace(" ", "-")}.csv`); success = true; console.log("Successfully parsed", row.url); } catch (err) { tries++; console.log(`Error: ${err}, tries left: ${retries-tries}, url: ${getScrapeOpsUrl(url)}`); } finally { await page.close(); } } } async function processResults(csvFile, location, retries) { const rows = await readCsv(csvFile); const browser = await puppeteer.launch();; for (const row of rows) { await processJob(browser, row, location, retries) } await browser.close(); } async function main() { const keywords = ["software engineer"]; const concurrencyLimit = 5; const pages = 1; const location = "us"; const locality = "United States"; const retries = 3; const aggregateFiles = []; for (const keyword of keywords) { console.log("Crawl starting"); console.time("startCrawl"); await startCrawl(keyword, pages, locality, location, concurrencyLimit, retries); console.timeEnd("startCrawl"); console.log("Crawl complete"); aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`); } console.log("Starting scrape"); for (const file of aggregateFiles) { console.time("processResults"); await processResults(file, location, retries); console.timeEnd("processResults"); } console.log("Scrape complete");} main();
jobData
holds the data we pull from the page.jobData
into writeToCsv()
and it then gets saved to a CSV file.tasks
by splicing our rows
by our concurrencyLimit
.await
everything to resolve using Promise.all()
.concurrencyLimit
to 5, we'll be processing the rows
in batches of 5.async function processResults(csvFile, location, concurrencyLimit, retries) { const rows = await readCsv(csvFile); const browser = await puppeteer.launch();; while (rows.length > 0) { const currentBatch = rows.splice(0, concurrencyLimit); const tasks = currentBatch.map(row => processJob(browser, row, location, retries)); try { await Promise.all(tasks); } catch (err) { console.log(`Failed to process batch: ${err}`); } } await browser.close();}
await readCsv(csvFile);
: This returns all the rows from the CSV file in an array.rows.splice(0, concurrencyLimit);
shrinks the rows
array and gives us a chunk to work with.currentBatch.map(row => processJob(browser, row, location, retries))
runs processJob()
on each element in the chunk.await Promise.all(tasks);
waits for each one of our tasks
to resolve.rows
array is completely gone.const response = await page.goto(getScrapeOpsUrl(url, location), { timeout: 0 });
{ timeout: 0 }
to tell Puppeteer not to time out. When dealing with a proxy along with a site as difficult as LinkedIn, pages sometimes take a while to come back to us.location
Because location is getting passed into our proxy function, we're actually going to be routed through a server in the country of our choice.

const puppeteer = require("puppeteer");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
const csvParse = require("csv-parse");
const fs = require("fs");

const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;

async function writeToCsv(data, outputFile) {
    let success = false;
    while (!success) {
        if (!data || data.length === 0) {
            throw new Error("No data to write!");
        }
        const fileExists = fs.existsSync(outputFile);
        if (!(data instanceof Array)) {
            data = [data];
        }
        const headers = Object.keys(data[0]).map(key => ({id: key, title: key}));
        const csvWriter = createCsvWriter({
            path: outputFile,
            header: headers,
            append: fileExists
        });
        try {
            await csvWriter.writeRecords(data);
            success = true;
        } catch (e) {
            console.log("Failed data", data);
            throw new Error("Failed to write to csv");
        }
    }
}

async function readCsv(inputFile) {
    const results = [];
    const parser = fs.createReadStream(inputFile).pipe(csvParse.parse({
        columns: true,
        delimiter: ",",
        trim: true,
        skip_empty_lines: true
    }));
    for await (const record of parser) {
        results.push(record);
    }
    return results;
}

function range(start, end) {
    const array = [];
    for (let i = start; i < end; i++) {
        array.push(i);
    }
    return array;
}

function getScrapeOpsUrl(url, location="us") {
    const params = new URLSearchParams({
        api_key: API_KEY,
        url: url,
        country: location
    });
    return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}

async function scrapeSearchResults(browser, keyword, pageNumber, locality, location="us", retries=3) {
    let tries = 0;
    let success = false;

    while (tries <= retries && !success) {
        const formattedKeyword = keyword.replace(" ", "+");
        const formattedLocality = locality.replace(" ", "+");
        const page = await browser.newPage();
        try {
            const url = `https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=${formattedKeyword}&location=${formattedLocality}&original_referer=&start=${pageNumber*10}`;

            const proxyUrl = getScrapeOpsUrl(url, location);
            await page.goto(proxyUrl, { timeout: 0 });

            console.log(`Successfully fetched: ${url}`);

            const divCards = await page.$$("div[class='base-search-card__info']");

            for (const divCard of divCards) {
                const nameElement = await divCard.$("h4[class='base-search-card__subtitle']");
                const name = await page.evaluate(element => element.textContent, nameElement);

                const jobTitleElement = await divCard.$("h3[class='base-search-card__title']");
                const jobTitle = await page.evaluate(element => element.textContent, jobTitleElement);

                const parentElement = await page.evaluateHandle(element => element.parentElement, divCard);
                const aTag = await parentElement.$("a");
                const link = await page.evaluate(element => element.getAttribute("href"), aTag);

                const jobLocationElement = await divCard.$("span[class='job-search-card__location']");
                const jobLocation = await page.evaluate(element => element.textContent, jobLocationElement);

                const searchData = {
                    name: name.trim(),
                    job_title: jobTitle.trim(),
                    url: link.trim(),
                    location: jobLocation.trim()
                };

                await writeToCsv([searchData], `${keyword.replace(" ", "-")}.csv`);
            }

            success = true;
        } catch (err) {
            console.log(`Error: ${err}, tries left ${retries - tries}`);
            tries++;
        } finally {
            await page.close();
        }
    }
}

async function startCrawl(keyword, pages, locality, location, concurrencyLimit, retries) {
    const pageList = range(0, pages);
    const browser = await puppeteer.launch();

    while (pageList.length > 0) {
        const currentBatch = pageList.splice(0, concurrencyLimit);
        const tasks = currentBatch.map(page => scrapeSearchResults(browser, keyword, page, locality, location, retries));

        try {
            await Promise.all(tasks);
        } catch (err) {
            console.log(`Failed to process batch: ${err}`);
        }
    }

    await browser.close();
}

async function processJob(browser, row, location, retries = 3) {
    const url = row.url;
    let tries = 0;
    let success = false;

    while (tries <= retries && !success) {
        const page = await browser.newPage();

        try {
            const response = await page.goto(getScrapeOpsUrl(url, location), { timeout: 0 });

            if (!response || response.status() !== 200) {
                throw new Error(`Failed to fetch page, status: ${response?.status()}`);
            }

            const jobCriteria = await page.$$("li[class='description__job-criteria-item']");
            if (jobCriteria.length < 4) {
                throw new Error("Job Criteria Not Found!");
            }

            const seniority = (await page.evaluate(element => element.textContent, jobCriteria[0])).replace("Seniority level", "");
            const positionType = (await page.evaluate(element => element.textContent, jobCriteria[1])).replace("Employment type", "");
            const jobFunction = (await page.evaluate(element => element.textContent, jobCriteria[2])).replace("Job function", "");
            const industry = (await page.evaluate(element => element.textContent, jobCriteria[3])).replace("Industries", "");

            const jobData = {
                name: row.name,
                seniority: seniority.trim(),
                position_type: positionType.trim(),
                job_function: jobFunction.trim(),
                industry: industry.trim()
            };

            await writeToCsv([jobData], `${row.name.replace(" ", "-")}-${row.job_title.replace(" ", "-")}.csv`);
            success = true;
            console.log("Successfully parsed", row.url);
        } catch (err) {
            tries++;
            console.log(`Error: ${err}, tries left: ${retries-tries}, url: ${getScrapeOpsUrl(url)}`);
        } finally {
            await page.close();
        }
    }
}

async function processResults(csvFile, location, concurrencyLimit, retries) {
    const rows = await readCsv(csvFile);
    const browser = await puppeteer.launch();

    while (rows.length > 0) {
        const currentBatch = rows.splice(0, concurrencyLimit);
        const tasks = currentBatch.map(row => processJob(browser, row, location, retries));

        try {
            await Promise.all(tasks);
        } catch (err) {
            console.log(`Failed to process batch: ${err}`);
        }
    }

    await browser.close();
}

async function main() {
    const keywords = ["software engineer"];
    const concurrencyLimit = 5;
    const pages = 1;
    const location = "us";
    const locality = "United States";
    const retries = 3;
    const aggregateFiles = [];

    for (const keyword of keywords) {
        console.log("Crawl starting");
        console.time("startCrawl");
        await startCrawl(keyword, pages, locality, location, concurrencyLimit, retries);
        console.timeEnd("startCrawl");
        console.log("Crawl complete");
        aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
    }

    console.log("Starting scrape");
    for (const file of aggregateFiles) {
        console.time("processResults");
        await processResults(file, location, concurrencyLimit, retries);
        console.timeEnd("processResults");
    }
    console.log("Scrape complete");
}

main();
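If you'd like to see those two options in isolation before running the full scraper, here is a minimal sketch of my own (not part of the script above) that routes a single request through the proxy and disables Puppeteer's navigation timeout. It assumes the same config.json with your ScrapeOps API key; the "gb" country code and the LinkedIn jobs URL are purely illustrative.

const puppeteer = require("puppeteer");
const fs = require("fs");

const API_KEY = JSON.parse(fs.readFileSync("config.json")).api_key;

// Same proxy helper as in the full script: the country parameter controls
// which country ScrapeOps routes the request through.
function getScrapeOpsUrl(url, location = "us") {
    const params = new URLSearchParams({
        api_key: API_KEY,
        url: url,
        country: location
    });
    return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    // "gb" is an illustrative value; check ScrapeOps' supported country codes.
    const proxyUrl = getScrapeOpsUrl("https://www.linkedin.com/jobs", "gb");

    // timeout: 0 tells Puppeteer to wait indefinitely instead of aborting a slow proxied load.
    await page.goto(proxyUrl, { timeout: 0 });

    console.log("Loaded:", await page.title());
    await browser.close();
})();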
Now it's time to run the full scraper in production. For this run, pages is set to 3, and you can see the updated main() below. As we mentioned earlier, you can change the following to tweak your results:

- keywords
- concurrencyLimit
- pages
- location
- locality
- retries
async function main() {
    const keywords = ["software engineer"];
    const concurrencyLimit = 5;
    const pages = 3;
    const location = "us";
    const locality = "United States";
    const retries = 3;
    const aggregateFiles = [];

    for (const keyword of keywords) {
        console.log("Crawl starting");
        console.time("startCrawl");
        await startCrawl(keyword, pages, locality, location, concurrencyLimit, retries);
        console.timeEnd("startCrawl");
        console.log("Crawl complete");
        aggregateFiles.push(`${keyword.replace(" ", "-")}.csv`);
    }

    console.log("Starting scrape");
    for (const file of aggregateFiles) {
        console.time("processResults");
        await processResults(file, location, concurrencyLimit, retries);
        console.timeEnd("processResults");
    }
    console.log("Scrape complete");
}
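As an illustration (these values are my own, not from the script above), if you wanted to crawl two job searches routed through the UK instead, only the constants at the top of main() would change:

async function main() {
    // Hypothetical configuration: two searches, routed through the UK.
    const keywords = ["software engineer", "data engineer"];
    const concurrencyLimit = 5;
    const pages = 2;                    // pages of results per keyword
    const location = "gb";              // proxy country code (check ScrapeOps' supported list)
    const locality = "United Kingdom";  // location string passed to LinkedIn's search
    const retries = 3;
    // ...the rest of main() stays exactly the same as shown above.
}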
We're not technically bound by LinkedIn's terms of service or its robots.txt because we haven't agreed to anything, but they take these policies very seriously. Their terms are available here and their robots.txt is here. As stated at the top of their robots.txt, crawling LinkedIn is explicitly prohibited. By scraping LinkedIn, you can have your account suspended, banned, or even deleted.