
Python Scrapy: Build A LinkedIn Company Profile Scraper [2023]

In this guide for our "How To Scrape X With Python Scrapy" series, we're going to look at how to build a Python Scrapy spider that will scrape LinkedIn.com public company profiles.

LinkedIn is the most up-to-date and extensive source of professional profiles and company data on the internet. As a result, it is one of the most popular web scraping targets for recruiting, HR and lead generation companies.

In this article we will focus on building a production LinkedIn spider using Python Scrapy that will scrape LinkedIn Public Company Profiles.


GitHub Code

The full code for this LinkedIn Company Spider is available on GitHub here.

Need help scraping the web?

Then check out ScrapeOps, the complete toolkit for web scraping.


If you prefer to follow along with a video then check out the video tutorial version here:


How To Build a LinkedIn Company Profile Scraper

Scraping LinkedIn company profiles is pretty straightforward once you have the HTML response.

We just need a list of LinkedIn company profile URLs, and to send a request to each one to get the profile data. Let's check out what a company profile page looks like by going to the link below:


'https://www.linkedin.com/company/usebraintrust/'

It should look something like this:

LinkedIn.com Company Profile Page

We just need to create a Scrapy spider that will parse the profile data from the page.

The following is a simple Scrapy spider that will request the company profile page for every URL in the company_pages list, and then parse the company profile from the response.



import scrapy


class LinkedCompanySpider(scrapy.Spider):
    name = "linkedin_company_profile"

    # add your own list of company urls here
    company_pages = [
        'https://www.linkedin.com/company/usebraintrust?trk=public_jobs_jserp-result_job-search-card-subtitle',
        'https://www.linkedin.com/company/centraprise?trk=public_jobs_jserp-result_job-search-card-subtitle'
    ]

    def start_requests(self):
        company_index_tracker = 0
        first_url = self.company_pages[company_index_tracker]

        # pass the index of the current company through the request meta
        yield scrapy.Request(url=first_url, callback=self.parse_response, meta={'company_index_tracker': company_index_tracker})

    def parse_response(self, response):
        company_index_tracker = response.meta['company_index_tracker']
        print('***************')
        print('****** Scraping page ' + str(company_index_tracker + 1) + ' of ' + str(len(self.company_pages)))
        print('***************')

        company_item = {}

        company_item['name'] = response.css('.top-card-layout__entity-info h1::text').get(default='not-found').strip()
        company_item['summary'] = response.css('.top-card-layout__entity-info h4 span::text').get(default='not-found').strip()

        try:
            ## all company details
            company_details = response.css('.core-section-container__content .mb-2')

            # industry line
            company_industry_line = company_details[1].css('.text-md::text').getall()
            company_item['industry'] = company_industry_line[1].strip()

            # company size line
            company_size_line = company_details[2].css('.text-md::text').getall()
            company_item['size'] = company_size_line[1].strip()

            # company founded line
            company_founded_line = company_details[5].css('.text-md::text').getall()
            company_item['founded'] = company_founded_line[1].strip()
        except IndexError:
            print("Error: Skipped Company - Some details missing")

        yield company_item

        # move on to the next company profile in the list, if there is one
        company_index_tracker = company_index_tracker + 1

        if company_index_tracker <= (len(self.company_pages) - 1):
            next_url = self.company_pages[company_index_tracker]

            yield scrapy.Request(url=next_url, callback=self.parse_response, meta={'company_index_tracker': company_index_tracker})



Now when we run our scraper:


scrapy crawl linkedin_company_profile

The output of this code will look like this:

{
    'name': 'Braintrust',
    'summary': "Braintrust is the first decentralized Web3 talent network that connects tech freelancers with the world's leading brands",
    'industry': 'Software Development',
    'size': '11-50 employees',
    'founded': '2018'
}
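
If you want to save the scraped items to a file instead of just viewing them in the logs, you can use Scrapy's built-in feed exports when running the spider:


scrapy crawl linkedin_company_profile -O company_profiles.json

The -O flag overwrites company_profiles.json on each run (use a lowercase -o to append to the file instead).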

This spider scrapes the following data from the LinkedIn profile page:

  • Name
  • Summary
  • Industry
  • Company size
  • Year Founded

You can expand this spider to scrape other details by using response.css selectors in the parse_response method to extract more data from the page, as shown in the sketch below.
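
For example, here is a minimal sketch of grabbing one more field inside the try block in parse_response, following the same pattern as the existing code. The list index is an assumption (the details blocks are matched by position), so inspect the live page to confirm which block holds the field you want:


    # hypothetical extra field - confirm the index against the live page
    company_headquarters_line = company_details[4].css('.text-md::text').getall()
    company_item['headquarters'] = company_headquarters_line[1].strip()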


Bypassing LinkedIn's Anti-Bot Protection

As mentioned above, LinkedIn has one of the most aggressive anti-scraping systems on the internet, making it very hard to scrape.

It uses a combination of IP address, headers, browser & TCP fingerprinting to detect scrapers and block them.
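
For example, requests sent with Scrapy's default settings are trivially identifiable as a bot. At a minimum you would set a realistic user-agent and browser-like headers in your settings.py, though for LinkedIn this alone won't be enough to avoid blocks (the user-agent string below is just an example value):


## settings.py

# use a realistic browser user-agent instead of Scrapy's default
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'

# browser-like headers applied to every request by default
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
}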

As you might have seen already, if you run the above code, LinkedIn will likely block your requests and return its login page like this:

How To Scrape LinkedIn.com - Login Page

Public LinkedIn Company Profiles

This Scrapy spider is only designed to scrape public LinkedIn company profiles that don't require you to login to view. Scraping behind LinkedIn's login is significantly harder and opens you up to much higher legal risks.

To bypass LinkedIn's anti-scraping system you will need to use very high quality rotating residential/mobile proxies, browser profiles and a fortified headless browser.
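
If you do want to build this yourself, the basic building block for the proxy part is Scrapy's built-in HttpProxyMiddleware, which routes a request through whatever proxy you set in the request meta. Here is a minimal sketch, with placeholder credentials and endpoint standing in for your residential/mobile proxy provider:


        # route a request through a rotating residential proxy
        # (USERNAME, PASSWORD and the endpoint are placeholders for your provider)
        yield scrapy.Request(
            url=first_url,
            callback=self.parse_response,
            meta={
                'company_index_tracker': company_index_tracker,
                'proxy': 'http://USERNAME:PASSWORD@residential-proxy.example.com:8000',
            },
        )

On its own this only addresses the IP layer; headers and browser/TCP fingerprints still need to be handled separately.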

We have written guides about how to do this here:

However, if you don't want to implement all this anti-bot bypassing logic yourself, the easier option is to use a smart proxy solution like the ScrapeOps Proxy Aggregator, which integrates with 20+ proxy providers and finds the proxy solution that works best for LinkedIn for you.

The ScrapeOps Proxy Aggregator is a smart proxy that handles everything for you:

  • Proxy rotation & selection
  • Rotating user-agents & browser headers
  • Ban detection & CAPTCHA bypassing
  • Country IP geotargeting
  • Javascript rendering with headless browsers

You can get a ScrapeOps API key with 1,000 free API credits by signing up here.

To use the ScrapeOps Proxy Aggregator with our LinkedIn Scrapy spider, we just need to send the URL we want to scrape to the Proxy API instead of making the request directly ourselves. You can test it out with curl using the command below:


curl 'https://proxy.scrapeops.io/v1/?api_key=YOUR_API_KEY&url=https://www.linkedin.com/in/reidhoffman/'
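
The same request in Python looks like the snippet below; passing the target page as a query parameter lets the requests library handle the URL encoding for you:


import requests

response = requests.get(
    'https://proxy.scrapeops.io/v1/',
    params={
        'api_key': 'YOUR_API_KEY',
        'url': 'https://www.linkedin.com/company/usebraintrust/',  # target page to scrape
    },
)
print(response.text)  # HTML returned via the proxy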

We can integrate the proxy easily into our Scrapy project by installing the ScrapeOps Scrapy Proxy SDK, a Downloader Middleware. We can quickly install it into our project using the following command:


pip install scrapeops-scrapy-proxy-sdk

And then enable it in your project in the settings.py file.

## settings.py

SCRAPEOPS_API_KEY = 'YOUR_API_KEY'
SCRAPEOPS_PROXY_ENABLED = True

DOWNLOADER_MIDDLEWARES = {
    'scrapeops_scrapy_proxy_sdk.scrapeops_scrapy_proxy_sdk.ScrapeOpsScrapyProxySdk': 725,
}

Now when we make requests with our Scrapy spider, they will be routed through the proxy and LinkedIn won't block them.

Full documentation on how to integrate the ScrapeOps Proxy here.


Monitoring Your LinkedIn Company Profile Scraper

When scraping in production it is vital that you can see how your scrapers are doing so you can fix problems early.

You could check if your jobs are running correctly by looking at the output in your file or database, but the easier way to do it is to install the ScrapeOps Monitor.

ScrapeOps gives you a simple-to-use yet powerful way to see how your jobs are doing, run your jobs, schedule recurring jobs, set up alerts and more. All for free!

Live demo here: ScrapeOps Demo


You can create a free ScrapeOps API key here.

We'll just need to run the following to install the ScrapeOps Scrapy Extension:


pip install scrapeops-scrapy

Once that is installed you need to add the following to your Scrapy projects settings.py file if you want to be able to see your logs in ScrapeOps:


# Add Your ScrapeOps API key
SCRAPEOPS_API_KEY = 'YOUR_API_KEY'


# Add In The ScrapeOps Extension
EXTENSIONS = {
    'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500,
}


# Update The Download Middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}

Now, every time we run our LinkedIn Company Profile spider (scrapy crawl linkedin_company_profile), the ScrapeOps SDK will monitor the performance and send the data to the ScrapeOps dashboard.

Full documentation on how to integrate the ScrapeOps Monitoring here.


Scheduling & Running Our Scraper In The Cloud

Lastly, we will want to deploy our LinkedIn Company Profile scraper to a server so that we can schedule it to run every day, week, etc.

To do this you have a couple of options.

However, one of the easiest ways is via the ScrapeOps Job Scheduler. Plus, it is free!

ScrapeOps Job Scheduler Demo

Here is a video guide on how to connect a DigitalOcean server to ScrapeOps and schedule your jobs to run.

You could also connect ScrapeOps to any server provider, like Vultr or Amazon Web Services (AWS).


More Web Scraping Guides

In this edition of our "How To Scrape X" series, we went through how you can scrape LinkedIn.com, including how to bypass its anti-bot protection.

The full code for this LinkedIn Company Profile Spider is available on GitHub here.

If you would like to learn how to scrape other popular websites then check out our other How To Scrape With Scrapy Guides here:

Or if you would like to learn more about web scraping in general, then be sure to check out The Web Scraping Playbook, or check out one of our more in-depth guides: