Python Scrapy: Build An Indeed Scraper [2022]

In this guide for our "How To Scrape X With Python Scrapy" series, we're going to look at how to build a Python Scrapy spider that will crawl Indeed.com for job listings and scrape the individual job pages.

Indeed is one of the most popular job listing websites, so it is a great website to scrape if you want data on job listings, the state of the jobs market, or company hiring patterns.

In our language-agnostic How To Scrape Indeed.com guide, we went into more detail about how Indeed pages work. However, in this article we will focus on building a production Indeed scraper using Python Scrapy.

In this guide we will go through:

GitHub Code

The full code for this Indeed Spider is available on GitHub here.

If you prefer to follow along with a video then check out the video tutorial version here:


How To Architect Our Indeed Scraper

How we design our Indeed scraper is going to heavily depend on:

  • What is the use case for scraping this data?
  • What data do we want to extract from Indeed?
  • How often do we want to extract data?
  • How much data do we want to extract?
  • What is your level of technical sophistication?

How you answer these questions will change what type of scraping architecture we build.

For this Indeed scraper example we will assume the following:

  • Objective: The objective for this scraping system is to monitor job postings for our target keywords and do some analysis of the job descriptions, etc.
  • Required Data: We want to extract the full job posting and store the relevant information.
  • Scale: This will be a relatively small scale scraping process (hundreds of keywords), so no need to design a more sophisticated infrastructure.
  • Data Storage: To keep things simple for the example we will store to a CSV file, but provide examples on how to store to MySQL & Postgres DBs.

To do this we will design a Scrapy spider that combines both a job discovery crawler and a job data scraper.

As the spider runs it will crawl Indeed's job search pages, extract job ids and then send them to the job data scraper via a callback. The scraped data is then saved to a CSV file via Scrapy Feed Exports.

The advantage of this scraping architecture is that it is pretty simple to build and completely self-contained.
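
As a rough illustration of the Feed Exports step, the CSV output could be configured in the project's settings.py with Scrapy's FEEDS setting. The output folder and file name pattern below are assumptions chosen for this sketch, not something this guide prescribes.

```python
# settings.py (sketch): save every scraped item to a CSV file via Feed Exports.
# The "data/" folder and the file name pattern are assumptions for illustration.
# %(name)s is replaced by the spider name and %(time)s by a timestamp.
FEEDS = {
    "data/%(name)s_%(time)s.csv": {
        "format": "csv",
    },
}
```

With a setting like this, each run of the spider would write a new timestamped CSV file, so earlier runs are not overwritten.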


How To Build an Indeed Job Search Crawler

The first part of scraping Indeed is designing a web crawler that will discover jobs for us to scrape.

Step 1: Understand Indeed Search Pages

With Indeed.com the easiest way to do this is to build a Scrapy crawler that uses the Indeed job search page, which returns up to 10 jobs per page.

How To Scrape Indeed.com Search Pages

For example, here is how we would get search results for software engineer jobs in San Francisco.


Unencoded --> 'https://www.indeed.com/jobs?q=software engineer&l=San Francisco&start=0&filter=0'

Encoded --> 'https%3A%2F%2Fwww.indeed.com%2Fjobs%3Fq%3Dsoftware%20engineer%26l%3DSan%20Francisco%26start%3D0%26filter%3D0'

This URL contains a number of parameters that we will explain:

  • q stands for the search query. In our case, q=software engineer. Note: If you want to search for a keyword that contains spaces or special characters then remember you need to URL-encode this value. (Encoded: q=software%20engineer)
  • l stands for the location you want to search for jobs. In our case, we used l=San Francisco.
  • start stands for the starting point for the pagination. We use the start parameter to paginate through results.

Using these parameters we can customise our requests to the Indeed Job Search endpoint to either scrape overall job data or get job URLs that we can scrape individually.
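
To make the encoding concrete, here is a minimal sketch that builds the search URL from the parameters described above using Python's standard library. The helper name build_search_url is hypothetical; the parameter names come from the example URL.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://www.indeed.com/jobs"

def build_search_url(keyword, location, start=0):
    """Build an Indeed search URL, percent-encoding each parameter value."""
    params = {"q": keyword, "l": location, "start": start, "filter": 0}
    # quote_via=quote encodes spaces as %20 (the default would use "+").
    return BASE_URL + "?" + urlencode(params, quote_via=quote)

search_url = build_search_url("software engineer", "San Francisco")
print(search_url)

# Encoding the *whole* URL (including ":", "/", "?", "=" and "&") is only
# needed when the URL is itself passed as a parameter value to another service.
unencoded = "https://www.indeed.com/jobs?q=software engineer&l=San Francisco&start=0&filter=0"
fully_encoded = quote(unencoded, safe="")
print(fully_encoded)
```

Either "+" or "%20" is a valid way to encode a space inside a query string, so the default behaviour of urlencode would work here as well.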

With Indeed.com it is actually very easy to extract the data we need as the data is available as hidden JSON data on the page.

How To Scrape Indeed.com - Hidden JSON Blob

The Indeed job data is contained in the <script id="mosaic-data" type="text/javascript"> tag, under window.mosaic.providerData["mosaic-provider-jobcards"].


<script id="mosaic-data" type="text/javascript">
...

window.mosaic.providerData["mosaic-provider-jobcards"]={"metaData":{"mosaicProviderJobCardsModel":{"adSignature":"3573","appName":"jasx","applyHolisticStyle":true,"bot":false,"brandedAds":[],"csrfToken":"VfkW0LdLxAXrjIkxOeasTnBCW8vSv9TE","encryptedQueryData":"RnZhMybXSk4M3QtTVGXWoe9dbTL46KyFjV9_vwSAcQxuziQ2QCDK8B6B0pUnV6xlgzK1HVOkc0tMGyUpMO9yEdnbun4jJaS6CbMzioz2PqM","experienceLevelFilterRefineBy":"","fccId":-1,"hasResume":false,"indeedApplyOnlyFilterUsed":false,"ipCountry":"IE","isDesktop":true,"isHighContrastIconShown":true,"isIpadApp":false,"isJobCardShelfApplied":true,"isTablet":false,"jobSeenLogParameters":{},"linkTargetAttribute":"_blank","loggedIn":false,"mobtk":"1ge736cml2gra002","mosaicNonJobContent":[],"mustShowSponsoredLabel":false,"myIndeedEnabled":true,"myIndeedRegisterLink":"https://www.indeed.com/account/register?dest=%2Fjobs%3Fjson%3D1%26q%3Dpython%26vjk%3D532734731891698b%26l%3DTexas","noJsUrlOnly":false,"overrideShelf":true,"pageNumber":1,"prforceGroups":"","proctorContext":{"accountId":-1,"app":false,"country":"US","ctkAge":72611863,"ctkDate":"20220929","hasRez":false,"lang":"en","loggedIn":false,"mtkAge":72611863,"platform":"","privileged":false,"smartphone":false,"stealthGroups":[],"tablet":false,"uaData":"{\"android\":false,\"androidApp\":false,\"androidEmployerApp\":false,\"androidJobSearchApp\":false,\"app\":false,\"bot\":false,\"browser\":\"CHROME\",\"browserFamily\":\"CHROME\",\"browserReleaseVersion\":{\"matchPrecision\":\"BUILD\",\"version\":29554872554618880},\"browserVersion\":{\"majorVersion\":\"105\",\"minorVersion\":\"-1\",\"version\":\"105\"},\"chrome\":true,\"chromeForIOS\":false,\"currentJobseekerDeprecatedBrowser\":false,\"deviceType\":\"COMPUTER\",\"droidRezUploadDialog\":false,\"dumbPhone\":false,\"employerApp\":false,\"fileUploadCapable\":true,\"futureJobseekerDeprecatedBrowser\":false,\"geolocationCapable\":false,\"googleWebLight\":false,\"ios\":false,\"iosemployerApp\":false,\"iosjobSearchApp\":false,\"ipad\":false,\"ipadApp
\":false,\"ipadJobSearchApp\":false,\"jobSearchApp\":false,\"mobileDevice\":false,\"operatingSystem\":\"WINDOWS\",\"os\":{\"family\":\"windows\",\"majorVersion\":-1,\"minorVersion\":-1,\"osFamily\":\"windows\",\"osVersion\":{\"matchPrecision\":\"BUILD\",\"version\":0},\"patchVersion\":-1,\"releaseVersion\":{\"matchPrecision\":\"BUILD\",\"version\":0},\"version\":\"\"},\"phone\":false,\"releaseVersion\":{\"matchPrecision\":\"BUILD\",\"version\":29554872554618880},\"safari\":false,\"safariForIOS\":false,\"smartPhone\":false,\"tablet\":false,\"uaVersion\":{\"matchPrecision\":\"BUILD\",\"version\":29554872554618880},\"userAgentDelegate\":{\"android\":false,\"bot\":false,\"browser\":\"CHROME\",\"browserName\":\"Chrome\",\"browserReleaseVersion\":{\"matchPrecision\":\"BUILD\",\"version\":29554872554618880},\"browserVersion\":{\"majorVersion\":\"105\",\"minorVersion\":\"-1\",\"version\":\"105\"},\"browserVersionString\":\"105\",\"chrome\":true,\"delegate\":{\"allFields\":{\"DeviceClass\":{\"confidence\":500,\"defaultValue\":\"Unknown\",\"isDefaultValue\":false,\"value\":\"Desktop\"},\"DeviceName\":{\"confidence\":400001,\"defaultValue\":\"Unknown\",\"isDefaultValue\":false,\"value\":\"Desktop\"},\"DeviceBrand\":{\"confidence\":0,\"defaultValue\":\"Unknown\",\"isDefaultValue\":true,\"value\":\"Unknown\"},\"OperatingSystemClass\":{\"confidence\":400001,\"defaultValue\":\"Unknown\",\"isDefaultValue\":false,\"value\":\"Desktop\"},\"OperatingSystemName\":{\"confidence\":400001,\"defaultValue\":\"Unknown\",\"isDefaultValue\":false,\"value\":\"Windows 
NT\"},\"OperatingSystemVersion\":{\"confidence\":400001,\"defaultValue\":\"??\",\"isDefaultValue\":true,\"value\":\"??\"},\"OperatingSystemVersionMajor\":{\"confidence\":400001,\"defaultValue\":\"??\",\"isDefaultValue\":true,\"value\":\"??\"},\"AgentClass\":{\"confidence\":2014,\"defaultValue\":\"Unknown\",\"isDefaultValue\":false,\"value\":\"Browser\"},\"AgentName\":{\"confidence\":2014,\"defaultValue\":\"Unknown\",\"isDefaultValue\":false,\"value\":\"Chrome\"},\"AgentVersion\":{\"confidence\":3000,\"defaultValue\":\"??\",\"isDefaultValue\":false,\"value\":\"105\"},\"AgentInformationEmail\":{\"confidence\":-1,\"defaultValue\":\"Unknown\",\"isDefaultValue\":true,\"value\":\"Unknown\"},\"AgentInformationUrl\":{\"confidence\":-1,\"defaultValue\":\"Unknown\",\"isDefaultValue\":true,\"value\":\"Unknown\"},\"WebviewAppName\":{\"confidence\":-1,\"defaultValue\":\"Unknown\",\"isDefaultValue\":true,\"value\":\"Unknown\"},\"WebviewAppVersion\":{\"confidence\":-1,\"defaultValue\":\"??\",\"isDefaultValue\":true,\"value\":\"??\"},\"__SyntaxError__\":{\"confidence\":-1,\"defaultValue\":\"false\",\"isDefaultValue\":true,\"value\":\"false\"}},\"ambiguityCount\":0,\"availableFieldNamesSorted\":[\"DeviceClass\",\"DeviceName\",\"DeviceBrand\",\"OperatingSystemClass\",\"OperatingSystemName\",\"OperatingSystemVersion\",\"OperatingSystemVersionMajor\",\"AgentClass\",\"AgentName\",\"AgentVersion\",\"AgentInformationEmail\",\"AgentInformationUrl\",\"WebviewAppName\",\"WebviewAppVersion\",\"__SyntaxError__\"],\"cleanedAvailableFieldNamesSorted\":[\"DeviceClass\",\"DeviceName\",\"DeviceBrand\",\"OperatingSystemClass\",\"OperatingSystemName\",\"OperatingSystemVersion\",\"OperatingSystemVersionMajor\",\"AgentClass\",\"AgentName\",\"AgentVersion\"],\"hasAmbiguity\":false,\"hasSyntaxError\":false,\"headers\":{\"User-Agent\":\"Mozilla\\u002F5.0 (Windows NT 10.0; Win64; x64) AppleWebKit\\u002F537.36 (KHTML, like Gecko) Chrome\\u002F105.0.0.0 
Safari\\u002F537.36\"},\"userAgentString\":\"Mozilla\\u002F5.0 (Windows NT 10.0; Win64; x64) AppleWebKit\\u002F537.36 (KHTML, like Gecko) Chrome\\u002F105.0.0.0 Safari\\u002F537.36\",\"userAgentStringField\":{\"confidence\":0,\"defaultValue\":\"Mozilla\\u002F5.0 (Windows NT 10.0; Win64; x64) AppleWebKit\\u002F537.36 (KHTML, like Gecko) Chrome\\u002F105.0.0.0 Safari\\u002F537.36\",\"isDefaultValue\":false,\"value\":\"Mozilla\\u002F5.0 (Windows NT 10.0; Win64; x64) AppleWebKit\\u002F537.36 (KHTML, like Gecko) Chrome\\u002F105.0.0.0 Safari\\u002F537.36\"}},\"deviceName\":\"Desktop\",\"deviceType\":\"COMPUTER\",\"deviceTypeString\":\"Desktop\",\"dumbPhone\":false,\"ios\":false,\"ipad\":false,\"mobileDevice\":false,\"operatingSystem\":\"WINDOWS\",\"operatingSystemFamily\":\"Windows NT\",\"operatingSystemVersion\":\"??\",\"os\":{\"family\":\"windows\",\"majorVersion\":-1,\"minorVersion\":-1,\"osFamily\":\"windows\",\"osVersion\":{\"matchPrecision\":\"BUILD\",\"version\":0},\"patchVersion\":-1,\"releaseVersion\":{\"matchPrecision\":\"BUILD\",\"version\":0},\"version\":\"\"},\"phone\":false,\"safari\":false,\"smartPhone\":false,\"tablet\":false,\"userAgentString\":\"Mozilla\\u002F5.0 (Windows NT 10.0; Win64; x64) AppleWebKit\\u002F537.36 (KHTML, like Gecko) Chrome\\u002F105.0.0.0 Safari\\u002F537.36\",\"webviewName\":\"Unknown\",\"webviewVersion\":{\"matchPrecision\":\"BUILD\",\"version\":0},\"windowsPhone\":false},\"userAgentString\":\"Mozilla\\u002F5.0 (Windows NT 10.0; Win64; x64) AppleWebKit\\u002F537.36 (KHTML, like Gecko) Chrome\\u002F105.0.0.0 Safari\\u002F537.36\",\"version\":{\"major\":105,\"minor\":-1,\"version\":\"105\"},\"windowsPhone\":false}"},"proctorIdentifiers":{"ACCOUNT":"-1","USER":"1ge4tueuklhdh800"},"queryModifierResult":{"originalQuery":"python","queryModifiers":[{"clickUrl":"http://www.indeed.com/jobs?q=python%2Bintern&l=Texas&from=querymodifiers&qm=1&oq=python","newQuery":"python 
intern"},{"clickUrl":"http://www.indeed.com/jobs?q=panda&l=Texas&from=querymodifiers&qm=1&oq=python","newQuery":"panda"},{"clickUrl":"http://www.indeed.com/jobs?q=bobcat&l=Texas&from=querymodifiers&qm=1&oq=python","newQuery":"bobcat"},{"clickUrl":"http://www.indeed.com/jobs?q=rhino&l=Texas&from=querymodifiers&qm=1&oq=python","newQuery":"rhino"},{"clickUrl":"http://www.indeed.com/jobs?q=reptile&l=Texas&from=querymodifiers&qm=1&oq=python","newQuery":"reptile"},{"clickUrl":"http://www.indeed.com/jobs?q=boba&l=Texas&from=querymodifiers&qm=1&oq=python","newQuery":"boba"},{"clickUrl":"http://www.indeed.com/jobs?q=drupal&l=Texas&from=querymodifiers&qm=1&oq=python","newQuery":"drupal"},{"clickUrl":"http://www.indeed.com/jobs?q=caterpillar&l=Texas&from=querymodifiers&qm=1&oq=python","newQuery":"caterpillar"},{"clickUrl":"http://www.indeed.com/jobs?q=abacus&l=Texas&from=querymodifiers&qm=1&oq=python","newQuery":"abacus"},{"clickUrl":"http://www.indeed.com/jobs?q=food%2Blion&l=Texas&from=querymodifiers&qm=1&oq=python","newQuery":"food lion"}]},"radius":25,"refineByTypes":[],"results":[{"appliedOrGreater":false,"company":"John Deere","companyBrandingAttributes":{"headerImageUrl":"https://d2q79iu7y748jz.cloudfront.net/s/_headerimage/1960x400/5e8d35d0dcbc8a32f12d61e4541c55ae","logoUrl":"https://d2q79iu7y748jz.cloudfront.net/s/_squarelogo/256x256/b46cb1797d2ea21811908aaa0ab2bdad"},"companyIdEncrypted":"38eb72d608d80c79","companyOverviewLink":"/cmp/John-Deere","companyOverviewLinkCampaignId":"serp-linkcompanyname","companyRating":4,"companyReviewCount":3767,"companyReviewLink":"/cmp/John-Deere/reviews","companyReviewLinkCampaignId":"cmplinktst2","d2iEnabled":false,"displayTitle":"Part-Time Student-MLOps Software 
Engineer-Remote","dradisJob":false,"employerAssistEnabled":false,"employerResponsive":false,"encryptedFccompanyId":"eade00c6021a5947","encryptedResultData":"VwIPTVJ1cTn5AN7Q-tSqGRXGNe2wB2UYx73qSczFnGU","expired":false,"extractTrackingUrls":"","extractedEntities":[],"fccompanyId":-1,"featuredCompanyAttributes":{},"featuredEmployer":false,"featuredEmployerCandidate":false,"feedId":2701,"formattedLocation":"Austin, TX 78704","formattedRelativeTime":"Today","hideMetaData":false,"hideSave":false,"highVolumeHiringModel":{"highVolumeHiring":false},"highlyRatedEmployer":false,"hiringEventJob":false,"indeedApplyEnabled":false,"indeedApplyable":false,"isJobSpotterJob":false,"isJobVisited":false,"isMobileThirdPartyApplyable":false,"isNoResumeJob":false,"isSubsidiaryJob":false,"jobCardRequirementsModel":{"additionalRequirementsCount":0,"requirementsHeaderShown":false},"jobLocationCity":"Austin","jobLocationExtras":"South Lamar-South Congress","jobLocationPostal":"78704","jobLocationState":"TX","jobTypes":["Full-time","Part-time"],"jobkey":"a22fa26470cfb9ad","jsiEnabled":false,"link":"/rc/clk?jk=a22fa26470cfb9ad&fccid=38eb72d608d80c79&vjs=3","locationCount":1,"loceJobTagModel":{},"mobtk":"1ge736cml2gra002","moreLinks":{"companyName":"John Deere","companyText":"John Deere jobs in Austin, TX","locationName":"Austin","qnaUrl":"/cmp/John-Deere/faq","qnaUrlParams":"?from=serp-more&campaignid=serp-more&fromjk=a22fa26470cfb9ad&jcid=38eb72d608d80c79","resultNumber":0,"salaryLocationName":"Austin, TX","salaryNoFollowLink":false,"salaryUrl":"/career/software-engineer/salaries/78704--TX","salaryUrlParams":"?campaignid=serp-more&fromjk=a22fa26470cfb9ad&from=serp-more","shortLocationName":"Austin, TX","showAcmeLink":true,"showAcmeQnaLink":true,"showViewAllCompanyAndLocationLinks":true,"showViewAllCompanyLink":true,"showViewAllLocationLink":true,"showViewAllNormalizedTitleLink":false,"viewAllCompanyLinkText":"John Deere jobs in Austin, 
TX","viewAllCompanyUrl":"/q-John-Deere-l-Austin,-TX-jobs.html","viewAllLocationUrl":"/l-Austin,-TX-jobs.html","visible":false},"moreLocUrl":"/jobs?q=python&l=Texas&jtid=b3a825820658bf92&jcid=38eb72d608d80c79&grp=tcl","mouseDownHandlerOption":{"adId":"","advn":"","extractTrackingUrls":[],"from":"vjs","jobKey":"a22fa26470cfb9ad","link":"/rc/clk?jk=a22fa26470cfb9ad&fccid=38eb72d608d80c79&vjs=3","tk":"1ge736cml2gra002"},"newJob":true,"normTitle":"Part Time Student Mlop Software Engineer Remote","openInterviewsInterviewsOnTheSpot":false,"openInterviewsJob":false,"openInterviewsOffersOnTheSpot":false,"openInterviewsPhoneJob":false,"overrideIndeedApplyText":true,"preciseLocationModel":{"obfuscateLocation":false,"overrideJCMPreciseLocationModel":true},"pubDate":1664427600000,"redirectToThirdPartySite":false,"remoteLocation":false,"remoteWorkModel":{"inlineText":true,"type":"REMOTE_ALWAYS"},"resumeMatch":false,"salarySnippet":{"salaryTextFormatted":false},"saved":false,"savedApplication":false,"screenerQuestionsURL":"","showCommutePromo":false,"showEarlyApply":false,"showJobType":false,"showRelativeDate":true,"showSponsoredLabel":false,"showStrongerAppliedLabel":false,"smartFillEnabled":false,"smbD2iEnabled":false,"snippet":"\u003Cul style=\"list-style-type:circle;margin-top: 0px;margin-bottom: 0px;padding-left:20px;\"\u003E \n \u003Cli style=\"margin-bottom:0px;\"\u003ETitle: Part-Time Student-MLOps Software Engineer-Remote - 91235.\u003C/li\u003E\n \u003Cli\u003EThe Part-Time Student Program is primarily designed to augment the Company’s regular full-time…\u003C/li\u003E\n\u003C/ul\u003E","sourceId":2775,"sponsored":false,"taxoAttributes":[],"taxoAttributesDisplayLimit":5,"taxoLogAttributes":[],"taxonomyAttributes":[{"attributes":[{"label":"Part-time","suid":"75GKK"}};"

...
</script>

To get at this data we can use a regular expression to find the window.mosaic.providerData["mosaic-provider-jobcards"] JSON on the page and parse its contents.

We can do this with a regex command like:


'window.mosaic.providerData\["mosaic-provider-jobcards"\]=(\{.+?\});'

In Python the full regex command would look like:


script_tag = re.findall(r'window.mosaic.providerData\["mosaic-provider-jobcards"\]=(\{.+?\});', response.text)

This JSON blob is pretty big and contains a lot of unnecessary data but the data we are looking for is in:


json_blob = json.loads(script_tag[0])
jobs_list = json_blob['metaData']['mosaicProviderJobCardsModel']['results']
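
The two steps above can be combined into one small helper and tried out without sending any requests. This is a minimal sketch: the function name extract_jobs and the shortened sample_html string are assumptions for illustration, and a real page contains a far larger blob.

```python
import json
import re

# Same pattern as above, with the dots escaped so they only match literal dots.
JOBCARDS_PATTERN = re.compile(
    r'window\.mosaic\.providerData\["mosaic-provider-jobcards"\]=(\{.+?\});'
)

def extract_jobs(html):
    """Return the list of job dicts embedded in a search page, or [] if absent."""
    match = JOBCARDS_PATTERN.search(html)
    if match is None:
        return []
    json_blob = json.loads(match.group(1))
    return json_blob["metaData"]["mosaicProviderJobCardsModel"]["results"]

# Tiny stand-in for a real response body, for illustration only.
sample_html = (
    '<script id="mosaic-data" type="text/javascript">'
    'window.mosaic.providerData["mosaic-provider-jobcards"]='
    '{"metaData":{"mosaicProviderJobCardsModel":{"results":'
    '[{"jobkey":"a22fa26470cfb9ad","company":"John Deere"}]}}};'
    '</script>'
)

jobs = extract_jobs(sample_html)
print(jobs[0]["jobkey"], jobs[0]["company"])
```

Note that the non-greedy pattern stops at the first "};" it finds, so it would cut the JSON short if that sequence ever appeared inside a string value. Returning an empty list when nothing matches keeps the caller from failing on pages that do not contain the blob.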


Step 2: Build Indeed Search Crawler

So the first thing we need to do is build a Scrapy spider that will send a request to the Indeed search page and, if that keyword has more than 10 jobs, calculate the total number of pages and send a request for each page so that we can discover every job.

First we will use Python Scrapy to request a single Jobs search page:


import re
import json
import scrapy
from urllib.parse import urlencode

class IndeedJobSpider(scrapy.Spider):
    name = "indeed_jobs"

    def get_indeed_search_url(self, keyword, location, offset=0):
        parameters = {"q": keyword, "l": location, "filter": 0, "start": offset}
        return "https://www.indeed.com/jobs?" + urlencode(parameters)

    def start_requests(self):
        keyword_list = ['python']
        location_list = ['texas']
        for keyword in keyword_list:
            for location in location_list:
                indeed_jobs_url = self.get_indeed_search_url(keyword, location)
                yield scrapy.Request(url=indeed_jobs_url, callback=self.parse_search_results, meta={'keyword': keyword, 'location': location, 'offset': 0})

    def parse_search_results(self, response):
        pass


Next, we can expand this to paginate through all available pages of results for our query by accessing the total job count from the metaData:


import re
import json
import scrapy
from urllib.parse import urlencode

class IndeedJobSpider(scrapy.Spider):
    name = "indeed_jobs"

    def get_indeed_search_url(self, keyword, location, offset=0):
        parameters = {"q": keyword, "l": location, "filter": 0, "start": offset}
        return "https://www.indeed.com/jobs?" + urlencode(parameters)

    def start_requests(self):
        keyword_list = ['python']
        location_list = ['texas']
        for keyword in keyword_list:
            for location in location_list:
                indeed_jobs_url = self.get_indeed_search_url(keyword, location)
                yield scrapy.Request(url=indeed_jobs_url, callback=self.parse_search_results, meta={'keyword': keyword, 'location': location, 'offset': 0})

    def parse_search_results(self, response):
        location = response.meta['location']
        keyword = response.meta['keyword']
        offset = response.meta['offset']
        script_tag = re.findall(r'window.mosaic.providerData\["mosaic-provider-jobcards"\]=(\{.+?\});', response.text)
        # re.findall() returns a list (never None), so check that it is non-empty
        if script_tag:
            json_blob = json.loads(script_tag[0])

            # Paginate Through Jobs Pages (only scheduled from the first page)
            if offset == 0:
                meta_data = json_blob["metaData"]["mosaicProviderJobCardsModel"]["tierSummaries"]
                num_results = sum(category["jobCount"] for category in meta_data)
                # Limit how far we paginate when a search has a very large result set
                if num_results > 1000:
                    num_results = 50

                # Use a separate loop variable so the current page's offset is not overwritten
                for next_offset in range(10, num_results + 10, 10):
                    url = self.get_indeed_search_url(keyword, location, next_offset)
                    yield scrapy.Request(url=url, callback=self.parse_search_results, meta={'keyword': keyword, 'location': location, 'offset': next_offset})
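
To make the pagination arithmetic concrete, here is a small standalone sketch. The helper name remaining_offsets is hypothetical and it assumes 10 results per page. Unlike the spider's range(10, num_results + 10, 10), it stops before any offset at or past the total job count, which avoids one request per search.

```python
def remaining_offsets(num_results, page_size=10):
    """Offsets of the result pages after the first one (the first is offset 0)."""
    return list(range(page_size, num_results, page_size))

print(remaining_offsets(25))  # [10, 20]
print(remaining_offsets(10))  # []
```

With 25 results the first page covers jobs 0-9, so only offsets 10 and 20 are left to request.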


Now when we run this spider using the following command, it will crawl through every available search page for your target keywords and locations.


scrapy crawl indeed_jobs

We now have a job discovery crawler that crawls every job search page; however, it won't output any data yet.


How To Build an Indeed Job Scraper

To scrape actual job data we will add a callback to our job discovery crawler that requests each job page, and then a job scraper that extracts all the job information we want.

Step 1: Add Job Scraper Callback

First we need to update our parse_search_results() method to extract all the job keys from the jobs_list and then send a request to each job page.


import re
import json
import scrapy
from urllib.parse import urlencode

class IndeedJobSpider(scrapy.Spider):
    name = "indeed_jobs"

    def get_indeed_search_url(self, keyword, location, offset=0):
        parameters = {"q": keyword, "l": location, "filter": 0, "start": offset}
        return "https://www.indeed.com/jobs?" + urlencode(parameters)

    def start_requests(self):
        keyword_list = ['python']
        location_list = ['texas']
        for keyword in keyword_list:
            for location in location_list:
                indeed_jobs_url = self.get_indeed_search_url(keyword, location)
                yield scrapy.Request(url=indeed_jobs_url, callback=self.parse_search_results, meta={'keyword': keyword, 'location': location, 'offset': 0})

    def parse_search_results(self, response):
        location = response.meta['location']
        keyword = response.meta['keyword']
        offset = response.meta['offset']
        script_tag = re.findall(r'window.mosaic.providerData\["mosaic-provider-jobcards"\]=(\{.+?\});', response.text)
        # re.findall() returns a list (never None), so check that it is non-empty
        if script_tag:
            json_blob = json.loads(script_tag[0])

            # Paginate Through Jobs Pages (only scheduled from the first page)
            if offset == 0:
                meta_data = json_blob["metaData"]["mosaicProviderJobCardsModel"]["tierSummaries"]
                num_results = sum(category["jobCount"] for category in meta_data)
                # Limit how far we paginate when a search has a very large result set
                if num_results > 1000:
                    num_results = 50

                # Use a separate loop variable so `offset` still refers to the
                # current page when we compute the page number below
                for next_offset in range(10, num_results + 10, 10):
                    url = self.get_indeed_search_url(keyword, location, next_offset)
                    yield scrapy.Request(url=url, callback=self.parse_search_results, meta={'keyword': keyword, 'location': location, 'offset': next_offset})

            ## Extract Jobs From Search Page
            jobs_list = json_blob['metaData']['mosaicProviderJobCardsModel']['results']
            for index, job in enumerate(jobs_list):
                if job.get('jobkey') is not None:
                    job_url = 'https://www.indeed.com/m/basecamp/viewjob?viewtype=embedded&jk=' + job.get('jobkey')
                    yield scrapy.Request(url=job_url,
                            callback=self.parse_job,
                            meta={
                                'keyword': keyword,
                                'location': location,
                                'page': round(offset / 10) + 1 if offset > 0 else 1,
                                'position': index,
                                'jobKey': job.get('jobkey'),
                            })

    def parse_job(self, response):
        pass


This will extract all the job ids from the Indeed search page, build and request the job_url for each one, and then trigger the parse_job callback when it receives a response.
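
The URL construction and page-number calculation used in the callback can be checked in isolation. A minimal sketch follows; the helper names build_job_url and page_number are hypothetical, and the page size of 10 is the one assumed throughout this guide.

```python
from urllib.parse import urlencode

JOB_PAGE_URL = "https://www.indeed.com/m/basecamp/viewjob"

def build_job_url(jobkey):
    """Build the embedded job page URL for a given Indeed job key."""
    return JOB_PAGE_URL + "?" + urlencode({"viewtype": "embedded", "jk": jobkey})

def page_number(offset, page_size=10):
    """1-based search page number for a given start offset."""
    return offset // page_size + 1

print(build_job_url("f6288f8af00406b1"))
print(page_number(0), page_number(20))
```

Building the URL with urlencode rather than string concatenation keeps it valid even if a job key ever contained a character that needs escaping.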


Step 2: Understand Indeed Job Page

Here is an example Indeed job page URL:


'https://www.indeed.com/m/basecamp/viewjob?viewtype=embedded&jk=f6288f8af00406b1'

Which looks like this in our browser:

How To Scrape Indeed.com Job Page

Again, Indeed returns the data as JSON assigned to window._initialData inside a <script> tag in the HTML response, so it is pretty easy to extract.


<script>
...
window._initialData={"accountKey":null,"apiPaths":{},"appCommonData":null,"averageRatingsModel":null,"base64EncodedJson":"eyJjIjpmYWxzZSwiZSI6ZmFsc2UsImciOiJodHRwOi8vd3d3LmluZGVlZC5jb20vbS9iYXNlY2FtcC92aWV3am9iP3ZpZXd0eXBlPWVtYmVkZGVkJmprPWY2Mjg4ZjhhZjAwNDA2YjEifQ","baseInboxUrl":"https:\u002F\u002Finbox.indeed.com","baseUrl":"https:\u002F\u002Fwww.indeed.com","benefitsModel":{"benefits":[{"key":"FVKX2","label":"401(k)"},{"key":"SENX8","label":"401(k) matching"},{"key":"6XHWW","label":"Commuter assistance"},{"key":"YJ8XR","label":"Food provided"},{"key":"3K96F","label":"Free parking"},{"key":"TZV2T","label":"Gym membership"},{"key":"EY33Q","label":"Health insurance"},{"key":"Y2WS5","label":"Life insurance"},{"key":"HW4J4","label":"Paid time off"},{"key":"NPHPU","label":"Parental leave"},{"key":"6XT6J","label":"Stock options"}]},"callToInterviewButtonModel":null,"categorizedAttributesModel":null,"chatbotApplyButtonLinkModel":null,"clientsideProctorGroups":{"callToApplyStickySideBySide":true,"mobmapandcommutetimetst9":false,"showSalaryGuide":true,"desktopvj_stickyheader_tst":false,"callToApplyStickyBelowApplyNow":false,"mob_desktop_serp_tst":true,"callButtonPrimaryApplySecondary":false,"showInterviewCardBelowJobDesc":false},"cmiJobCategoryModel":null,"commuteInfoModel":null,"companyAvatarModel":null,"companyFollowFormModel":null,"companyTabModel":null,"contactPersonModel":null,"country":"US","cssResetProviders":{"mosaic-provider-reportcontent":false,"mosaic-provider-salary-feedback":true,"mosaic-provider-company-info-salary":true,"mosaic-provider-rich-media":false,"js-match-insights-provider":false,"MosaicProviderCallToApplyFeedback":true,"mosaic-provider-dislike-feedback":false},"ctk":"1ge4tueuklhdh800","dcmModel":{"category":"jobse0","source":"6927552","type":"organic"},"desktop":true,"desktopSponsoredJobSeenData":"tk=1ge7cg6jhkke6800","dgToken":"6987EEB4A2C6A193E5C44936188510EB","dislikeFrom2paneEnabled":false,"downloadAppButtonModel":null,"employerResponsiveCardMo
del":null,"from":null,"globalnavFooterHTML":"","globalnavHeaderHTML":"","highQualityMarketplace":null,"hiringInsightsModel":{"age":"30+ days ago","employerLastReviewed":null,"employerResponsiveCardModel":null,"numOfCandidates":null,"postedToday":false,"recurringHireText":null,"urgentlyHiringModel":null},"indeedApplyButtonContainer":{"brandingText":null,"buttonClickUrl":null,"disabled":false,"employerResponsiveCard":null,"enableStickyInquiry":false,"enableStickyInquiryTooltip":false,"hasMessage":false,"indeedApplyAttributes":{"content":"data-indeed-apply-apiToken='f09f8f8add995328354a7a9e7a7fefdedfe230dee1f12eeb30c6e4b184f2dd9e' data-indeed-apply-jobTitle='Senior Full Stack Engineer (python)' data-indeed-apply-jobId='4931506003' data-indeed-apply-jobLocation='Austin, TX' data-indeed-apply-jobCompanyName='Bluevine' data-indeed-apply-jobUrl='https:\u002F\u002Fwww.indeed.com\u002Fviewjob?jk=f6288f8af00406b1' data-indeed-apply-questions='https:\u002F\u002Fapi.greenhouse.io\u002Fv1\u002Fboards\u002Fbluevine\u002Fjobs\u002F4931506003\u002Findeed' data-indeed-apply-postUrl='https:\u002F\u002Fapi.greenhouse.io\u002Fv1\u002Fboards\u002Fbluevine\u002Fjobs\u002F4931506003\u002Findeed' data-indeed-apply-name='firstlastname' data-indeed-apply-coverletter='optional' data-indeed-apply-phone='required' data-indeed-apply-resume='required' data-indeed-apply-noButtonUI='true' data-indeed-apply-pingbackUrl='https:\u002F\u002Fgdc.indeed.com\u002Fconv\u002ForgIndApp?co=US&amp;vjtk=1ge7cg6jhkke6800&amp;jk=f6288f8af00406b1&amp;mvj=0&amp;astse=b78e1ac815f57228&amp;assa=2193' data-indeed-apply-onappliedstatus='_updateIndeedApplyStatus' data-indeed-apply-onready='_onButtonReady' data-indeed-apply-jk='f6288f8af00406b1' data-indeed-apply-onclose=\"indeedApplyHandleModalClose\" data-indeed-apply-onapplied=\"indeedApplyHandleApply\" data-indeed-apply-oncontinueclick=\"indeedApplyHandleModalClose\" data-indeed-apply-onClick=\"indeedApplyHandleButtonClick\" data-indeed-apply-returnToJobSearchUrl='' 
data-acc-payload=\"1,2,22,1,144,1,552,1,3648,1,4392,1\" data-indeed-apply-recentsearchquery='{\"what\":\"software engineer\",\"where\":\"California\"}'","contentKind":"ATTRIBUTES"},"indeedApplyBaseUrl":"https:\u002F\u002Fapply.indeed.com","indeedApplyBootStrapAttributes":{"hl":"en","source":"idd","co":"US","vjtk":"1ge7cg6jhkke6800"},"indeedApplyButtonAttributes":{"postUrl":"https:\u002F\u002Fapi.greenhouse.io\u002Fv1\u002Fboards\u002Fbluevine\u002Fjobs\u002F4931506003\u002Findeed","jk":"f6288f8af00406b1","onClick":"indeedApplyHandleButtonClick","jobTitle":"Senior Full Stack Engineer (python)","questions":"https:\u002F\u002Fapi.greenhouse.io\u002Fv1\u002Fboards\u002Fbluevine\u002Fjobs\u002F4931506003\u002Findeed","onappliedstatus":"_updateIndeedApplyStatus","jobCompanyName":"Bluevine","recentsearchquery":"{\"what\":\"software engineer\",\"where\":\"California\"}","onclose":"indeedApplyHandleModalClose","jobUrl":"https:\u002F\u002Fwww.indeed.com\u002Fviewjob?jk=f6288f8af00406b1","onready":"_onButtonReady","onapplied":"indeedApplyHandleApply","coverletter":"optional","resume":"required","pingbackUrl":"https:\u002F\u002Fgdc.indeed.com\u002Fconv\u002ForgIndApp?co=US&vjtk=1ge7cg6jhkke6800&jk=f6288f8af00406b1&mvj=0&astse=b78e1ac815f57228&assa=2193","noButtonUI":"true","jobId":"4931506003","apiToken":"f09f8f8add995328354a7a9e7a7fefdedfe230dee1f12eeb30c6e4b184f2dd9e","jobLocation":"Austin, TX","phone":"required","name":"firstlastname","oncontinueclick":"indeedApplyHandleModalClose","returnToJobSearchUrl":""},"indeedApplyButtonModel":{"applyBtnNewStyle":true,"buttonSize":"block","buttonType":"branded","contentHtml":"Apply 
now","dataHref":null,"disclaimer":null,"href":"\u002F","icon":null,"isBlock":false,"largeScreenSizeText":null,"openInNewTab":false,"referrerpolicy":null,"rel":null,"sanitizedHref":null,"sanitizedHtml":null,"sticky":false,"target":null,"title":null,"viewJobDisplay":null},"indeedApplyLoginModalModel":null,"indeedApplyScriptAttributes":{"data-indeed-apply-qs":"vjtk=1ge7cg6jhkke6800"},"indeedApplyScriptLocation":"https:\u002F\u002Fapply.indeed.com\u002Findeedapply\u002Fstatic\u002Fscripts\u002Fapp\u002Fbootstrap.js?hl=en&co=US&source=idd","shouldUseButtonPlaceholder":true,"stagingLevel":"prod","viewFormUrl":null,"viewFormUrlAttribute":{"content":"","contentKind":"ATTRIBUTES"}},"indeedLogoModel":null,"inlineJsErrEnabled":false,"isApp":false,"isApplyTextColorChanges":true,"isApplyTextSizeChanges":true,"isCriOS":false,"isDislikeFormV2Enabled":false,"isSafariForIOS":false,"isSalaryNewDesign":false,"isSyncJobs":false,"jasJobViewPingModel":null,"jasxInputWhatWhereActive":true,"jobAlertSignInModalModel":null,"jobAlertSignUp":null,"jobCardStyleModel":{"elementSpacingIncreased":false,"fontSizeEnlarged":false,"highContrastIconShown":false,"jobCardShelfApplied":false,"salaryBlack":false,"shouldMarkClickedJobAsVisited":false},"jobInfoWrapperModel":{"jobInfoModel":{"appliedStateBannerModel":null,"commuteInfoModel":null,"expiredJobMetadataModel":null,"hideCmpHeader":false,"isD2iEnabled":false,"isJsiEnabled":false,"jobAttributesTestValue":-1,"jobDebugInfoModel":null,"jobDescriptionSectionModel":null,"jobInfoHeaderModel":{"a11yNewtabIconActive":false,"averageRatingsModel":null,"companyImagesModel":{"ejiBannerAsBackground":false,"enhancedJobDescription":false,"featuredEmployer":false,"headerImageUrl":"https:\u002F\u002Fd2q79iu7y748jz.cloudfront.net\u002Fs\u002F_headerimage\u002F1960x400\u002Fae55ead6c2c0702692b9e43ac06f3277","logoAltText":"Bluevine 
logo","logoImageOverlayLower":false,"logoUrl":"https:\u002F\u002Fd2q79iu7y748jz.cloudfront.net\u002Fs\u002F_squarelogo\u002F256x256\u002Fc338b6786b5eadab2a1f404e10259004","showBannerTop":false,"showEnhancedJobImp":false,"showIconInTitle":false},"companyName":"Bluevine","companyOverviewLink":"https:\u002F\u002Fwww.indeed.com\u002Fcmp\u002FBluevine?campaignid=mobvjcmp&from=mobviewjob&tk=1ge7cg6jhkke6800&fromjk=f6288f8af00406b1","companyReviewLink":"https:\u002F\u002Fwww.indeed.com\u002Fcmp\u002FBluevine\u002Freviews?campaignid=mobvjcmp&cmpratingc=mobviewjob&from=mobviewjob&tk=1ge7cg6jhkke6800&fromjk=f6288f8af00406b1&jt=Senior+Full+Stack+Engineer+%28python%29","companyReviewModel":null,"disableAcmeLink":false,"employerActivity":null,"employerResponsiveCardModel":null,"encryptedFccCompanyId":null,"formattedLocation":"Austin, TX","hideRating":false,"isDesktopApplyButtonSticky":false,"isSimplifiedHeader":false,"jobNormTitle":null,"jobTitle":"Senior Full Stack Engineer (python)","jobTypes":null,"location":null,"openCompanyLinksInNewTab":false,"parentCompanyName":null,"preciseLocationModel":null,"ratingsModel":null,"recentSearch":null,"remoteWorkModel":{"inlineText":true,"text":"Hybrid remote","type":"REMOTE_HYBRID"},"salaryMax":null,"salaryMin":null,"salaryType":null,"subtitle":"Bluevine - Austin, TX","tagModels":null,"taxonomyAttributes":null,"viewJobDisplay":"DESKTOP_EMBEDDED"},"jobMetadataHeaderModel":{"jobType":null},"jobTagModel":null,"resumeEvaluationResult":null,"sanitizedJobDescription":{"content":"<div>\n <div>\n <p><b>About Bluevine<\u002Fb><\u002Fp> \n <p> Bluevine is on a mission to enable a better financial future for small business owners through innovative banking solutions designed just for them. 
By combining best-in-class technology with advanced security and a deep understanding of the small business community, we deliver end-to-end banking and lending products that empower always-on entrepreneurs to grow their businesses with confidence.<\u002Fp> \n <p> As a dynamic company with massive potential, we're backed by leading investors such as Lightspeed Venture Partners, Menlo Ventures, 83North, Citi Ventures, and nearly 9 years of proven success. Since launching in 2013, we have grown exponentially, amassing over 400,000 customers across all 50 states and a global team of more than 500 people. Our passion is driven by purpose: to give small businesses the tools they need to succeed and we're just getting started.<\u002Fp> \n <p> All of this begins with our team who are driven by collaboration, problem-solving, and learning and growing together. With a commitment to innovation and community impact, our mission is to help every small business—and every team member—thrive. Join us!<\u002Fp>\n <\u002Fdiv>\n <p><b><i> This is a hybrid role<\u002Fi><\u002Fb><i>. <\u002Fi>At Bluevine, we pride ourselves on our collaborative culture, which we believe is best maintained through in-person interactions and a vibrant office environment. All of our offices have reopened in accordance with local guidelines, and are following a hybrid model. In-office days will be determined by location and discipline.<\u002Fp> \n <p><b> ABOUT THE ROLE:<\u002Fb><\u002Fp> \n <p> We're looking for a Senior Full Stack Engineer flexible enough to develop features from the front (beautiful UX) to the back (scalable and robust components and integrations). 
If you're drawn to engineering challenges and have a strong desire to make a big impact as part of a small, agile team, in an exciting space, we'd love to talk to you.<\u002Fp> \n <p> The team serves a variety of stakeholders across all the business and the platform.<\u002Fp> \n <p><b> WHAT YOU'LL DO:<\u002Fb><\u002Fp> \n <ul> \n <li>Independently drive the engineering development of complex features<\u002Fli> \n <li>Design and build state-of-the-art responsive banking applications<\u002Fli> \n <li>Work closely with, and incorporate feedback from product managers and other stakeholders in the company<\u002Fli> \n <li>Be part of a fast-paced and highly-flexible team with the comfort of making decisions using your best judgement<\u002Fli> \n <li>Develop projects through their entire life cycle<\u002Fli> \n <\u002Ful> \n <p><b>WHAT WE LOOK FOR:<\u002Fb><\u002Fp> \n <ul> \n <li>5+ years of combined full stack experience experience building fast, reliable, web and\u002For mobile applications on applications with Python backends<\u002Fli> \n <li>Experience with Web frameworks (e.g., Angular, React, or Vue)<\u002Fli> \n <li>Experience with source control management systems, preferably Git<\u002Fli> \n <li>B.S. 
in Computer Science or a related field preferred<\u002Fli> \n <\u002Ful> \n <p><b>Nice to Haves<\u002Fb><\u002Fp> \n <ul> \n <li>Experience with AWS<\u002Fli> \n <li>Experience with mobile development (e.g., Native, Native Script, or React)<\u002Fli> \n <\u002Ful>\n <div>\n <div> \n <p><b>BENEFITS AND PERKS - for employees located in the US<\u002Fb><\u002Fp> \n <ul> \n <li>Excellent health coverage and life insurance benefits<\u002Fli> \n <li>401K with an immediate 3% company match<\u002Fli> \n <li>PTO, Company Holidays, and Flexible Holidays<\u002Fli> \n <li>Company-sponsored Mental Health Benefits, including 1:1 therapy<\u002Fli> \n <li>Over $1,000 annually for a wellness benefit of your choice<\u002Fli> \n <li>Monthly WFH stipend<\u002Fli> \n <li>Generous, paid parental leave covering up to 16 weeks<\u002Fli> \n <li>Access to financial coaches and education sessions<\u002Fli> \n <li>Free commuter benefits - Caltrain passes for San Francisco employees and a monthly parking allowance<\u002Fli> \n <li>Monthly DoorDash credit<\u002Fli> \n <li>Weekly catered lunches and fully stocked kitchen pantries<\u002Fli> \n <li>Dog-friendly Redwood City, CA office<\u002Fli> \n <li>Community-based volunteering opportunities<\u002Fli> \n <\u002Ful> \n <p><b>BENEFITS &amp; PERKS - for employees located in Israel<\u002Fb><\u002Fp> \n <ul> \n <li>Excellent group health coverage and life insurance benefits<\u002Fli> \n <li>Stock options<\u002Fli> \n <li>Flexible hybrid work model<\u002Fli> \n <li>Large Study Fund contribution<\u002Fli> \n <li>Salary Benchmarks and Checkpoints<\u002Fli> \n <li>Monthly meal card of TenBis or CiBus (your choice) with generous balance<\u002Fli> \n <li>Free parking for cars, scooters, and bikes<\u002Fli> \n <li>Free gym membership<\u002Fli> \n <li>Company-sponsored Mental Health Benefits<\u002Fli> \n <li>PTO, Company Holidays, and Flexible Holidays<\u002Fli> \n <li>Community-based volunteering opportunities<\u002Fli>\n <\u002Ful>\n <\u002Fdiv>\n 
<\u002Fdiv>\n<\u002Fdiv>\n<div><\u002Fdiv>","contentKind":"HTML"},"screenerRequirementsModel":null,"showExpiredHeader":false,"tagModels":null,"viewJobDisplay":"DESKTOP_EMBEDDED"},"sectionedJobInfoModel":null},"jobKey":"f6288f8af00406b1","jobLocation":"Austin, TX","jobMetadataFooterModel":{"age":"30+ days ago","indeedApplyAdaNotice":"If you require alternative methods of application or screening, you must approach the employer directly to request this as Indeed is not responsible for the employer's application process.","originalJobLink":null,"phoneNumber":null,"saveJobLink":null,"showReportJobAsButton":true,"source":"BlueVine"},"jobSeenData":"tk=1ge7cg6jhkke6800&context=viewjobrecs","jobTitle":"Senior Full Stack Engineer (python)","language":"en","lastVisitTime":1664538063,"lazyProviders":{"mosaic-provider-reportcontent":"<div class=\"mosaic-reportcontent-wrapper button\"><style data-emotion=\"css 1686x4\">.css-1686x4{box-sizing:border-box;background:none;-webkit-appearance:none;-moz-appearance:none;-ms-appearance:none;appearance:none;text-align:left;-webkit-text-decoration:none;text-decoration:none;border:none;cursor:pointer;-webkit-user-select:none;-moz-user-select:none;-ms-user-select:none;user-select:none;-webkit-align-items:center;-webkit-box-align:center;-ms-flex-align:center;align-items:center;-webkit-box-pack:center;-ms-flex-pack:center;-webkit-justify-content:center;justify-content:center;position:relative;margin:0;padding-left:1rem;padding-right:1rem;line-height:1.5;font-family:\"Noto Sans\",\"Helvetica Neue\",\"Helvetica\",\"Arial\",\"Liberation Sans\",\"Roboto\",\"Noto\",sans-serif;font-size:1rem;font-weight:700;border-radius:0.5rem;border-width:1px;border-style:solid;-webkit-transition:border-color 200ms cubic-bezier(0.645, 0.045, 0.355, 1),background-color 200ms cubic-bezier(0.645, 0.045, 0.355, 1),opacity 200ms cubic-bezier(0.645, 0.045, 0.355, 1),box-shadow 200ms cubic-bezier(0.645, 0.045, 0.355, 1),color 200ms cubic-bezier(0.645, 0.045, 0.355, 
1);transition:border-color 200ms cubic-bezier(0.645, 0.045, 0.355, 1),background-color 200ms cubic-bezier(0.645, 0.045, 0.355, 1),opacity 200ms cubic-bezier(0.645, 0.045, 0.355, 1),box-shadow 200ms cubic-bezier(0.645, 0.045, 0.355, 1),color 200ms cubic-bezier(0.645, 0.045, 0.355, 1);display:-webkit-inline-box;display:-webkit-inline-flex;display:-ms-inline-flexbox;display:inline-flex;width:auto;padding-top:0.5625rem;padding-bottom:0.5625rem;color:#2d2d2d;border-color:#e4e2e0;background-color:#e4e2e0;}.css-1686x4::-moz-focus-inner{border:0;}@media (prefers-reduced-motion: reduce){.css-1686x4{-webkit-transition:none;transition:none;}}.css-1686x4:disabled{opacity:0.4;pointer-events:none;}.css-1686x4:focus{outline:none;box-shadow:0 0 0 0.125rem #ffffff,0 0 0 0.1875rem #2557a7;}.css-1686x4:focus:not([data-focus-visible-added]){box-shadow:none;}.css-1686x4:visited{color:#2d2d2d;}.css-1686x4:hover{border-color:#d4d2d0;background-color:#d4d2d0;}.css-1686x4:active{box-shadow:inset 0 0.125rem 0.25rem rgba(45, 45, 45, 0.2),inset 0 0.0625rem 0.1875rem rgba(45, 45, 45, 0.12),inset 0 0 0.125rem rgba(45, 45, 45, 0.2);border-color:#b4b2b1;background-color:#b4b2b1;}<\u002Fstyle><button class=\"mosaic-reportcontent-button desktop css-1686x4 e8ju0x51\"><span class=\"mosaic-reportcontent-button-icon\"><\u002Fspan>Report job<\u002Fbutton><div class=\"mosaic-reportcontent-content\"><\u002Fdiv><\u002Fdiv>","mosaic-provider-salary-feedback":"","mosaic-provider-company-info-salary":"","mosaic-provider-rich-media":"","js-match-insights-provider":"","MosaicProviderCallToApplyFeedback":"","mosaic-provider-dislike-feedback":"<div class=\"animatedToast i-unmask\"><div class=\"\"><\u002Fdiv><\u002Fdiv>"},"locale":"en_US","localeData":{"":[null,"Project-Id-Version: \nReport-Msgid-Bugs-To: \nPOT-Creation-Date: 2022-09-27 23:51-0500\nPO-Revision-Date: 2021-08-06 19:00+0000\nLast-Translator: Auto Generated <noreply@indeed.com>\nLanguage-Team: English (United States) 
<https:\u002F\u002Fweblate.corp.indeed.com\u002Fprojects\u002Findeed\u002Findeedmobile-i18n-content\u002Fen_US\u002F>\nLanguage: en_US\nMIME-Version: 1.0\nContent-Type: text\u002Fplain; charset=UTF-8\nContent-Transfer-Encoding: 8bit\nPlural-Forms: nplurals=2; plural=n != 1;\nX-Generator: Weblate 3.9.1\n"],"\"Interview times available\" card content\u0004If your application meets the employer's criteria, you may be able to book a call or request an interview that suits your schedule.":[null,"If your application meets the employer's criteria, you can directly provide your availability for a video interview."]},"loggedIn":false,"mobResourceTimingEnabled":false,"mobileGlobalHeader":null,"mobtk":"1ge7cg6jhkke6800","mosaicData":null,"originalJobLinkModel":null,"pageId":"viewjob","parenttk":null,"phoneLinkType":null,"phoneNumberButtonLinkModel":null,"preciseLocationModel":null,"profileBaseUrl":"https:\u002F\u002Fprofile.indeed.com","queryString":null,"recentQueryString":"q1=software engineer&l1=california&r1=-1&q2=python&l2=texas&r2=-1","relatedLinks":null,"resumeFooterModel":{"buttonLink":{"applyBtnNewStyle":false,"buttonSize":"md","buttonType":"primary","contentHtml":"Upload Your Resume","dataHref":"\u002Fpromo\u002Fresume?from=bottomResumeCTAviewjob&trk.origin=viewjob","disclaimer":null,"href":"\u002Fpromo\u002Fresume","icon":null,"isBlock":false,"largeScreenSizeText":null,"openInNewTab":false,"referrerpolicy":null,"rel":null,"sanitizedHref":null,"sanitizedHtml":null,"sticky":false,"target":null,"title":null,"viewJobDisplay":null},"isJanusActive":true,"letEmployersFindText":"Let Employers Find You"},"resumePromoCardModel":null,"rtl":false,"salaryGuideModel":{"acmeMicrocontentEndpoint":"https:\u002F\u002Fcocos-api.indeed.com","country":"US","estimatedSalaryModel":{"formattedRange":"$120K - $152K a year","max":151976.5,"min":120023.5,"type":"YEARLY"},"formattedLocation":"Austin, 
TX","jobKey":"f6288f8af00406b1","language":"en"},"salaryInfoModel":null,"saveJobButtonContainerModel":{"alreadySavedButtonModel":{"actions":["Saved","Applied","Interviewing","Offered","Hired"],"buttonSize":"block","buttonType":"secondary","contentHtml":"Saved","href":"\u002F","iconSize":null},"applyFromComputerLogUrl":"\u002Fm\u002Frpc\u002Flog\u002Femailmyself?jk=f6288f8af00406b1&mobvjtk=1ge7cg6jhkke6800&sbt=121f10e71cf3df2d415dae11933eb9ce&ctk=1ge4tueuklhdh800&acctKey=","currentJobState":"VISITED","didYouApplyPromptModel":{"calloutModel":{"actionsList":null,"actionsMap":{"NO":{"children":"Not interested","className":null,"href":null,"target":null},"LATER":{"children":"Maybe later","className":null,"href":null,"target":null},"YES":{"children":"Yes","className":null,"href":null,"target":null}},"caretPosition":null,"children":null,"dismissAriaLabel":"Close","dismissAttributes":null,"dismissHref":null,"heading":"Did you apply?"},"jobKey":"f6288f8af00406b1","possibleResponses":{"NO":"NO","LATER":"LATER","YES":"YES"},"userCanView":false},"didYouApplyResponseUrl":"\u002Fm\u002Frpc\u002Fdidyouapply?tk=1ge7cg6jhkke6800&jobKey=f6288f8af00406b1&originPage=viewjob&from=viewjob","hashedCSRFToken":"121f10e71cf3df2d415dae11933eb9ce","isAlreadySavedButtonVisible":false,"isDisableJobStatusChange":false,"isLoggedIn":false,"isSaveWithoutLoginEnabled":false,"isSticky":false,"isSyncJobs":false,"mobtk":"1ge7cg6jhkke6800","myIndeedLoginLink":"https:\u002F\u002Fwww.indeed.com\u002Faccount\u002Flogin?dest=%2Fm%2Fbasecamp%2Fviewjob%3Fviewtype%3Dembedded%26jk%3Df6288f8af00406b1&from=jsfe-desktopembedded-save-indeedmobile","myJobsAPIHref":"\u002Frpc\u002Flog\u002Fmyjobs\u002Ftransition_job_state?client=mobile&cause=statepicker&preserveTimestamp=false&tk=1ge7cg6jhkke6800&jobKey=f6288f8af00406b1&originPage=viewjob","myJobsURL":"https:\u002F\u002Fmyjobs.indeed.com?co=US&hl=en_US&from=viewjob","pageId":"viewjob","possibleJobActions":{"SAVED":"save","APPLIED":"apply","INTERVIEWING":"interview","O
FFERED":"offer","HIRED":"hire","VISITED":"visit","ARCHIVED":"archive"},"possibleJobStates":{"SAVED":"Saved","APPLIED":"Applied","INTERVIEWING":"Interviewing","OFFERED":"Offered","HIRED":"Hired","VISITED":"Visited","ARCHIVED":"Archived"},"saveButtonModel":{"applyBtnNewStyle":false,"buttonSize":"block","buttonType":"secondary","contentHtml":"","dataHref":null,"disclaimer":null,"href":"\u002F","icon":{"iconTitle":"save-icon","iconType":"favorite-border"},"isBlock":false,"largeScreenSizeText":null,"openInNewTab":false,"referrerpolicy":null,"rel":null,"sanitizedHref":null,"sanitizedHtml":null,"sticky":false,"target":null,"title":null,"viewJobDisplay":"DESKTOP_EMBEDDED"},"showSaveJobInlineCallout":true,"uistates":{"INTERVIEWING":"INTERVIEWING","OFFERED":"OFFERED","SAVED":"SAVED","VISITED":"VISITED","HIRED":"HIRED","ARCHIVED":"ARCHIVED","APPLIED":"APPLIED"},"viewJobDisplay":"DESKTOP_EMBEDDED"},"saveJobCalloutModel":{"actionsList":null,"actionsMap":{"createaccount":{"children":"Create account (it's free)","className":null,"href":"https:\u002F\u002Fwww.indeed.com\u002Faccount\u002Fregister?dest=%2Fm%2Fbasecamp%2Fviewjob%3Fviewtype%3Dembedded%26jk%3Df6288f8af00406b1","target":"_PARENT"},"signin":{"children":"Sign in","className":null,"href":"https:\u002F\u002Fwww.indeed.com\u002Faccount\u002Flogin?dest=%2Fm%2Fbasecamp%2Fviewjob%3Fviewtype%3Dembedded%26jk%3Df6288f8af00406b1","target":"_PARENT"}},"caretPosition":null,"children":"You must sign in to save jobs:","dismissAriaLabel":"Close","dismissAttributes":null,"dismissHref":null,"heading":"Save jobs and view them from any computer."},"saveJobFailureModalModel":{"closeAriaLabel":"Close","closeButtonText":"Close","message":"Please retry","signInButtonText":null,"signInHref":null,"title":"Failed to Save Job"},"saveJobLimitExceededModalModel":{"closeAriaLabel":"Close","closeButtonText":null,"message":"You reached the limit. 
Please log in to save additional jobs.","signInButtonText":"Sign in","signInHref":"https:\u002F\u002Fwww.indeed.com\u002Faccount\u002Flogin?dest=%2Fm%2Fbasecamp%2Fviewjob%3Fviewtype%3Dembedded%26jk%3Df6288f8af00406b1&from=viewjob_savejoblimitmodal","title":"You've already saved 20 jobs"},"segmentId":"software_dev_seo","segmentPhoneNumberButtonLinkModel":null,"shareJobButtonContainerModel":{"buttonIconModel":{"color":"blue","position":null,"size":"md","title":"Share this job","type":"\u002Fm\u002Fimages\u002Fnativeshare.svg"},"buttonModel":{"buttonSize":null,"buttonType":"secondary","children":"Share this job","disabled":false,"href":null,"isActive":false,"isBlock":false,"isResponsive":false,"size":"md"},"fallbackButtonIconModel":{"color":"blue","position":null,"size":"md","title":"Copy link","type":"\u002Fm\u002Fimages\u002Ficon-copy.svg"},"shareText":"Check out this job on Indeed:\nBluevine\nSenior Full Stack Engineer (python)\nAustin, TX\nhttps:\u002F\u002Fwww.indeed.com\u002Fm\u002Fviewjob?jk=f6288f8af00406b1&from=native","shareType":"native","shareUrl":"https:\u002F\u002Fwww.indeed.com\u002Fm\u002Fviewjob?jk=f6288f8af00406b1&from=native","showUnderSaveButton":true},"shouldLogResolution":true,"showEmployerResponsiveCard":false,"showGlobalNavContent":false,"showReportInJobButtons":false,"sponsored":false,"sponsoredAdsContainerModel":null,"sponsoredJobs":null,"staticPrefix":"\u002F\u002Fd3fw5vlhllyvee.cloudfront.net\u002Fm\u002Fs\u002F","stickyType":"ALWAYS","successfullySignedInModel":null,"viewJobButtonLinkContainerModel":null,"viewJobDisplay":"DESKTOP_EMBEDDED","viewJobDisplayParam":"dtembd","viewjobDislikes":false,"whatWhereFormModel":null,"zoneProviders":{"aboveViewjobButtons":[],"viewjobModals":["MosaicProviderCallToApplyFeedback"],"aboveExtractedJobDescription":[],"aboveFullJobDescription":["mosaic-provider-company-info-salary"],"rightRail":[],"legacyProvidersViewJob":["mosaic-provider-reportcontent"],"betweenJobDescriptionAndButtons":[],"ssrVJModals":[],"be
lowJobDescription":[],"belowFullJobDescription":[],"belowViewjobButtons":["mosaic-provider-dislike-feedback","mosaic-provider-salary-feedback"],"belowViewjobNav":[]}};
</script>

We don't need to build CSS/XPath selectors for each field; we just need to parse the data we want from the JSON response. An added bonus is that the data is very clean, so little to no data cleaning is required.

We can pull the JSON blob out of the page's HTML with the following regex:


'window._initialData=(\{.+?\});'

In Python the full regex command would look like:


script_tag = re.findall(r"_initialData=(\{.+?\});", html)

The job data can be found here:


json_blob = json.loads(script_tag[0])
job = json_blob["jobInfoWrapperModel"]["jobInfoModel"]
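As a quick sanity check, the two snippets above can be combined into one small function. The sketch below runs against a tiny hand-written HTML string (a stand-in for a real Indeed response, not actual Indeed output), and the function name `extract_job_info` is only an illustrative choice.

```python
import json
import re

# Hand-written stand-in for an Indeed job page; a real page is far larger.
SAMPLE_HTML = """
<script>
window._initialData={"jobInfoWrapperModel": {"jobInfoModel": {"jobInfoHeaderModel": {"companyName": "Bluevine", "jobTitle": "Senior Full Stack Engineer (python)"}, "sanitizedJobDescription": {"content": "<p>About Bluevine</p>"}}}, "jobKey": "f6288f8af00406b1"};
</script>
"""

INITIAL_DATA_RE = re.compile(r"_initialData=(\{.+?\});")


def extract_job_info(html):
    """Return the jobInfoModel dict from a job page, or None if it is missing."""
    matches = INITIAL_DATA_RE.findall(html)
    if not matches:  # findall returns an empty list, never None
        return None
    json_blob = json.loads(matches[0])
    return json_blob.get("jobInfoWrapperModel", {}).get("jobInfoModel")


job = extract_job_info(SAMPLE_HTML)
print(job["jobInfoHeaderModel"]["jobTitle"])
print(extract_job_info("<html>blocked</html>"))
```

Returning `None` when the pattern is not found makes it easy to spot pages where Indeed served something other than a job page, such as a block page.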

The JSON blob with the job data is pretty big so we will configure our scraper to only parse the data we want.


Step 3: Build Our Indeed Job Page Scraper

To scrape the resulting Indeed job page, we need to create a new callback, parse_job(), which will parse the data from the job page after Scrapy has received a response:


import re
import json
import scrapy
from urllib.parse import urlencode


class IndeedJobSpider(scrapy.Spider):
    name = "indeed_jobs"

    def get_indeed_search_url(self, keyword, location, offset=0):
        parameters = {"q": keyword, "l": location, "filter": 0, "start": offset}
        return "https://www.indeed.com/jobs?" + urlencode(parameters)

    def start_requests(self):
        keyword_list = ['python']
        location_list = ['texas']
        for keyword in keyword_list:
            for location in location_list:
                indeed_jobs_url = self.get_indeed_search_url(keyword, location)
                yield scrapy.Request(url=indeed_jobs_url, callback=self.parse_search_results, meta={'keyword': keyword, 'location': location, 'offset': 0})

    def parse_search_results(self, response):
        location = response.meta['location']
        keyword = response.meta['keyword']
        offset = response.meta['offset']
        script_tag = re.findall(r'window.mosaic.providerData\["mosaic-provider-jobcards"\]=(\{.+?\});', response.text)
        # re.findall returns a list (never None), so check that it is non-empty
        if script_tag:
            json_blob = json.loads(script_tag[0])

            # Paginate Through Jobs Pages (only scheduled from the first results page)
            if offset == 0:
                meta_data = json_blob["metaData"]["mosaicProviderJobCardsModel"]["tierSummaries"]
                num_results = sum(category["jobCount"] for category in meta_data)
                # If Indeed reports more than 1,000 results, only request the first 50
                if num_results > 1000:
                    num_results = 50

                # Separate loop variable so `offset` still refers to the current page below
                for next_offset in range(10, num_results + 10, 10):
                    url = self.get_indeed_search_url(keyword, location, next_offset)
                    yield scrapy.Request(url=url, callback=self.parse_search_results, meta={'keyword': keyword, 'location': location, 'offset': next_offset})

            ## Extract Jobs From Search Page
            jobs_list = json_blob['metaData']['mosaicProviderJobCardsModel']['results']
            for index, job in enumerate(jobs_list):
                if job.get('jobkey') is not None:
                    job_url = 'https://www.indeed.com/m/basecamp/viewjob?viewtype=embedded&jk=' + job.get('jobkey')
                    yield scrapy.Request(url=job_url,
                            callback=self.parse_job,
                            meta={
                                'keyword': keyword,
                                'location': location,
                                'page': offset // 10 + 1,
                                'position': index,
                                'jobKey': job.get('jobkey'),
                            })

    def parse_job(self, response):
        location = response.meta['location']
        keyword = response.meta['keyword']
        page = response.meta['page']
        position = response.meta['position']
        script_tag = re.findall(r"_initialData=(\{.+?\});", response.text)
        # Skip pages where the JSON blob is missing (e.g. a block page)
        if script_tag:
            json_blob = json.loads(script_tag[0])
            job = json_blob["jobInfoWrapperModel"]["jobInfoModel"]
            # In the sample response above, companyName and jobTitle sit under
            # jobInfoHeaderModel, so look there first and fall back to jobInfoModel
            header = job.get('jobInfoHeaderModel') or {}
            yield {
                'keyword': keyword,
                'location': location,
                'page': page,
                'position': position,
                'company': header.get('companyName') or job.get('companyName'),
                'jobkey': response.meta['jobKey'],
                'jobTitle': header.get('jobTitle') or job.get('jobTitle'),
                'jobDescription': job.get('sanitizedJobDescription').get('content') if job.get('sanitizedJobDescription') is not None else '',
            }

Extra Data

The JSON blob with the job data is pretty big so we have configured our spider to only extract the data we want.
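If you do want more of the blob, the same approach extends naturally. The helper below is a minimal sketch, assuming the key names visible in the sample response shown earlier (`formattedLocation`, `remoteWorkModel`, `salaryGuideModel`, `jobMetadataFooterModel`); Indeed may rename or drop these at any time, and many of them are `null` for some jobs, so every lookup is guarded. The function name and output keys are illustrative, not part of the original spider.

```python
def extract_extra_fields(json_blob):
    """Pick a few optional fields out of the parsed _initialData blob."""
    job = (json_blob.get("jobInfoWrapperModel") or {}).get("jobInfoModel") or {}
    header = job.get("jobInfoHeaderModel") or {}
    remote = header.get("remoteWorkModel") or {}
    salary = (json_blob.get("salaryGuideModel") or {}).get("estimatedSalaryModel") or {}
    footer = json_blob.get("jobMetadataFooterModel") or {}
    return {
        "jobLocation": header.get("formattedLocation"),
        "remoteWork": remote.get("text"),
        "estimatedSalary": salary.get("formattedRange"),
        "postedAge": footer.get("age"),
    }


# Trimmed-down version of the sample blob shown earlier.
sample_blob = {
    "jobInfoWrapperModel": {"jobInfoModel": {"jobInfoHeaderModel": {
        "formattedLocation": "Austin, TX",
        "remoteWorkModel": {"text": "Hybrid remote", "type": "REMOTE_HYBRID"},
    }}},
    "jobMetadataFooterModel": {"age": "30+ days ago"},
    "salaryGuideModel": {"estimatedSalaryModel": {"formattedRange": "$120K - $152K a year"}},
}
print(extract_extra_fields(sample_blob))
print(extract_extra_fields({"salaryGuideModel": None}))
```

The returned dictionary could be merged into the item yielded by parse_job() if these extra columns are useful for your analysis.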

Now we can run our scraper and set it to save the data to a CSV file.

Command:


scrapy crawl indeed_jobs -o indeed_jobs_data.csv
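Note that the jobDescription column holds the raw HTML from Indeed's response. If you plan to analyse the descriptions as text, a small post-processing step helps. The sketch below uses only Python's standard library; the helper name `html_to_text` is an illustrative choice, and a dedicated HTML library may be a better fit for heavier cleaning.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def html_to_text(html):
    """Convert an HTML job description into plain text, one chunk per line."""
    parser = _TextExtractor()
    parser.feed(html or "")
    parser.close()
    return "\n".join(parser.chunks)


description = "<div><p><b>WHAT YOU'LL DO:</b></p><ul><li>Design and build apps</li><li>BENEFITS &amp; PERKS</li></ul></div>"
print(html_to_text(description))
```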


Storing Data To Database Or S3 Bucket

With Scrapy, it is very easy to save our scraped data to CSV files, databases or file storage systems (like AWS S3) using Scrapy's Feed Export functionality.

To configure Scrapy to save all our data to a new CSV file every time we run the scraper, we simply need to create a Scrapy Feed and configure a dynamic file path.

If we add the following code to our settings.py file, Scrapy will create a new CSV file in our data folder using the spider name and time the spider was run.

# settings.py

FEEDS = {
    'data/%(name)s_%(time)s.csv': {
        'format': 'csv',
    }
}
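Feeds accept a few more options that are often handy for CSV output. The variation below is a sketch rather than a required configuration: it sets the file encoding and pins the column order with `fields`, using the item keys yielded by parse_job() in the spider above.

```python
# settings.py (sketch)

FEEDS = {
    'data/%(name)s_%(time)s.csv': {
        'format': 'csv',
        'encoding': 'utf8',
        # Fix the column order; names match the keys yielded by parse_job().
        'fields': ['keyword', 'location', 'page', 'position',
                   'company', 'jobkey', 'jobTitle', 'jobDescription'],
    }
}
```

Pinning `fields` keeps the column layout stable between runs, which makes it easier to concatenate or compare the CSV files later.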

If you would like to save your CSV files to an AWS S3 bucket, then check out our Saving CSV/JSON Files to Amazon AWS S3 Bucket guide here.

Or if you would like to save your data to another type of database then be sure to check out these guides:
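To give a feel for what database storage involves, here is a minimal item pipeline sketch. It deliberately swaps in SQLite (from Python's standard library) as a stand-in for MySQL or Postgres so the example runs without a database server; the class name, table layout, and default file name are all assumptions made for illustration.

```python
import sqlite3


class SqliteJobsPipeline:
    """Scrapy item pipeline that upserts each job into a local SQLite file."""

    def __init__(self, db_path="indeed_jobs.db"):
        self.db_path = db_path  # hypothetical default file name

    def open_spider(self, spider):
        # Called once when the spider starts: open the DB and ensure the table exists.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            """CREATE TABLE IF NOT EXISTS jobs (
                   jobkey TEXT PRIMARY KEY,
                   keyword TEXT, location TEXT, company TEXT,
                   jobTitle TEXT, jobDescription TEXT)"""
        )

    def process_item(self, item, spider):
        # INSERT OR REPLACE keeps one row per jobkey across repeated crawls.
        self.conn.execute(
            "INSERT OR REPLACE INTO jobs VALUES (?, ?, ?, ?, ?, ?)",
            (item.get("jobkey"), item.get("keyword"), item.get("location"),
             item.get("company"), item.get("jobTitle"), item.get("jobDescription")),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.conn.close()
```

A pipeline like this is switched on through the `ITEM_PIPELINES` setting in settings.py, pointing at wherever the class lives in your project. Using jobkey as the primary key means a job seen under several keywords or on several days is stored only once.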


Bypassing Indeed's Anti-Bot Protection

As you might have seen already, if you run this code a couple of times Indeed may start redirecting you to its blocked page.

This is because Indeed uses anti-bot protection to try to prevent developers from scraping its site (or at least make it harder).

You will need to use rotating proxies and browser profiles, and possibly fortify your headless browser, if you want to scrape Indeed reliably at scale.

We have written guides about how to do this here:

However, if you don't want to implement all this anti-bot bypassing logic yourself, the easier option is to use a smart proxy solution like ScrapeOps Proxy Aggregator.

The ScrapeOps Proxy Aggregator is a smart proxy that handles everything for you:

  • Proxy rotation & selection
  • Rotating user-agents & browser headers
  • Ban detection & CAPTCHA bypassing
  • Country IP geotargeting
  • Javascript rendering with headless browsers

You can get a ScrapeOps API key with 1,000 free API credits by signing up here.

To use the ScrapeOps Proxy Aggregator with our Indeed Scrapy Spider, we just need to send the URL we want to scrape to the Proxy API instead of making the request directly ourselves. You can test it out with Curl using the command below:


curl 'https://proxy.scrapeops.io/v1/?api_key=YOUR_API_KEY&url=https://indeed.com'
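The same request can be built in Python. The sketch below only constructs the proxy URL from the endpoint and parameters shown in the curl command; the helper name `get_proxy_url` is an illustrative choice. Using `urlencode` matters here because the target URL has its own query string, which must be escaped so it is not confused with the proxy's parameters.

```python
from urllib.parse import urlencode

SCRAPEOPS_ENDPOINT = "https://proxy.scrapeops.io/v1/"


def get_proxy_url(url, api_key="YOUR_API_KEY"):
    """Wrap a target URL so the request goes through the proxy endpoint."""
    payload = {"api_key": api_key, "url": url}
    return SCRAPEOPS_ENDPOINT + "?" + urlencode(payload)


print(get_proxy_url("https://www.indeed.com/jobs?q=python&l=texas"))
```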

We can integrate the proxy easily into our Scrapy project by installing the ScrapeOps Scrapy Proxy SDK, a Downloader Middleware. We can quickly install it into our project using the following command:


pip install scrapeops-scrapy-proxy-sdk

And then enable it in your project in the settings.py file.

## settings.py

SCRAPEOPS_API_KEY = 'YOUR_API_KEY'
SCRAPEOPS_PROXY_ENABLED = True

DOWNLOADER_MIDDLEWARES = {
    'scrapeops_scrapy_proxy_sdk.scrapeops_scrapy_proxy_sdk.ScrapeOpsScrapyProxySdk': 725,
}

Now when we make requests with our Scrapy spider, they will be routed through the proxy, which should prevent Indeed from blocking them.

Full documentation on how to integrate the ScrapeOps Proxy here.


Monitoring Your Indeed Scraper

When scraping in production it is vital that you can see how your scrapers are doing so you can fix problems early.

You could see if your jobs are running correctly by checking the output in your file or database but the easier way to do it would be to install the ScrapeOps Monitor.

ScrapeOps gives you a simple-to-use yet powerful way to see how your jobs are doing, run your jobs, schedule recurring jobs, set up alerts and more. All for free!

Live demo here: ScrapeOps Demo


You can create a free ScrapeOps API key here.

We'll just need to run the following to install the ScrapeOps Scrapy Extension:


pip install scrapeops-scrapy

Once that is installed, you need to add the following to your Scrapy project's settings.py file if you want to be able to see your logs in ScrapeOps:


# Add Your ScrapeOps API key
SCRAPEOPS_API_KEY = 'YOUR_API_KEY'


# Add In The ScrapeOps Extension
EXTENSIONS = {
    'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500,
}


# Update The Download Middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}
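If you are using both the proxy SDK and the monitoring extension, keep in mind that settings.py is ordinary Python: assigning DOWNLOADER_MIDDLEWARES twice means the second dictionary replaces the first. A combined version might look like the sketch below, which simply merges the entries already shown in this guide.

```python
# settings.py (sketch): proxy SDK and monitoring enabled together

SCRAPEOPS_API_KEY = 'YOUR_API_KEY'
SCRAPEOPS_PROXY_ENABLED = True

EXTENSIONS = {
    'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500,
}

# One dictionary only: a second DOWNLOADER_MIDDLEWARES assignment further
# down the file would silently replace this one.
DOWNLOADER_MIDDLEWARES = {
    'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'scrapeops_scrapy_proxy_sdk.scrapeops_scrapy_proxy_sdk.ScrapeOpsScrapyProxySdk': 725,
}
```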

Now, every time we run our Indeed spider (scrapy crawl indeed_jobs), the ScrapeOps SDK will monitor the performance and send the data to the ScrapeOps dashboard.

Full documentation on how to integrate the ScrapeOps Monitoring here.


Scheduling & Running Our Scraper In The Cloud

Lastly, we will want to deploy our Indeed scraper to a server so that we can schedule it to run every day, week, etc.

To do this you have a couple of options.

However, one of the easiest ways is via ScrapeOps Job Scheduler. Plus it is free!


Here is a video guide on how to connect a Digital Ocean server to ScrapeOps and schedule your jobs to run.

You could also connect ScrapeOps to any other server, such as one from Vultr or Amazon Web Services (AWS).


More Web Scraping Guides

In this edition of our "How To Scrape X" series, we went through how you can scrape Indeed.com including how to bypass its anti-bot protection.

The full code for this Indeed Spider is available on Github here.

If you would like to learn how to scrape other popular websites then check out our other How To Scrape With Scrapy Guides:

Or if you would like to learn more about web scraping in general, then be sure to check out The Web Scraping Playbook, or check out one of our more in-depth guides: