
Scrapy For Beginners Series: How To Build Your First Production Scraper [Part 5]

So far in this series we learned how to:

    1. build a basic Scrapy spider and get it to scrape some data from a website
    2. clean up the data as it was being scraped
    3. save the data to a file & a database
    4. bypass any site anti-bots or rate limiting

In this chapter (Part 5) we will explore how to deploy your spider to a separate server so it isn't tied to your personal laptop/computer, how to set up monitoring and alerting so you know when your jobs are running and whether there are any issues with them, and finally how to schedule jobs so that they run at set recurring times.

Python Scrapy 5-Part Beginner Series

  • Part 1: Basic Scrapy Spider - We will go over the basics of Scrapy, and build our first Scrapy spider.

  • Part 2: Cleaning Dirty Data & Dealing With Edge Cases - Web data can be messy, unstructured, and have lots of edge cases. In this tutorial we will make our spider robust to these edge cases, using Items, Item Loaders and Item Pipelines.

  • Part 3: Storing Our Data - There are many different ways we can store the data that we scrape, from CSV files and JSON to databases and S3 buckets. We will explore several of these options and talk about their pros, cons and the situations in which you would use them.

  • Part 4: Bypassing anti-bots with User Agents & Proxies - We'll show you how to use User Agents and Proxies to scale your scraping and bypass any anti-bots or scraper restrictions.

  • Part 5: Deploying, Monitoring, Alerting & Scheduling your Scrapy Jobs - Making our spider production ready by deploying it onto a Heroku server, and scheduling, monitoring and alerting via ScrapeOps. (This Tutorial)

For this beginner series, we're going to be using the simplest scraping architecture: a single scraper, scraping a single website.


In this tutorial, Part 5: Beginners Guide to Deployment, Monitoring, Alerting and Scheduling with Scrapy, we're going to cover:

  • Deploying your locally made spider to a separate server (Heroku)
  • Running our spider remotely with Scrapyd
  • Monitoring your running spiders with ScrapeOps
  • Running & scheduling recurring jobs
  • Setting up alerting for your Scrapy jobs
  • Saving the data to Heroku Postgres

With the intro out of the way let's get down to business.

Deploying your locally made spider to a separate server

For this project we are going to deploy the spider we built in the previous parts of this series to a Heroku server.

Getting our project ready for Heroku

We need to tell Heroku what to install for us (apart from our Scrapy project) so that Scrapy runs correctly, and we also need a way of starting jobs remotely since we won't have access to the command line on the Heroku server.

requirements.txt

To do this we need to create a requirements.txt file in the top level of our project. In this file we will put the packages that we want Heroku to install for us. These are:

scrapy
scrapyd
herokuify_scrapyd
scrapeops-scrapy
  1. scrapy - this is self explanatory, we need this just to run our spider
  2. scrapyd - this will create a small service on our server which will be listening for any commands we send to start/stop/schedule our spiders/jobs
  3. herokuify_scrapyd - this is needed to enable Scrapyd to work correctly on Heroku
  4. scrapeops-scrapy - this installs the ScrapeOps Scrapy extension which will allow us to easily run/schedule/monitor our spiders
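
If you already have these packages installed in your local Python environment, one optional shortcut for generating the file is to install them with pip and freeze the result (just check afterwards that the file only lists the packages your project actually needs):

pip install scrapy scrapyd herokuify_scrapyd scrapeops-scrapy
pip freeze > requirements.txt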

Configure Scrapyd in your project's scrapy.cfg

Next, update the scrapy.cfg file in your project's root with the following Scrapyd and deploy settings, replacing the placeholders with your own Heroku app name, Scrapy project name and a username/password of your choosing:

[scrapyd]
application = scrapy_heroku.app.application


[deploy]
url = http://<YOUR_HEROKU_APP_NAME>.herokuapp.com:80/
project = <YOUR_SCRAPY_PROJECT_NAME>
username = <A_USER_NAME>
password = <A_PASSWORD>

Create a Procfile

When our app is started, Heroku will look for a file named Procfile in your project's root directory and will run any commands it finds there. We will use this to start Scrapyd, which will then be running and awaiting our instructions to start/stop/schedule our spiders/jobs!

The Procfile is a simple text file named Procfile without a file extension. For example, Procfile.txt is not valid!

The Procfile must live in your app's root directory. It does not work if placed anywhere else.

web: scrapyd

Getting setup on Heroku

Now that our code is ready to deploy to Heroku, we can log in to our Heroku account. Register for a Heroku account if you don't have one already - there is no cost to sign up and use their base plan.

Go to the main Heroku dashboard: https://dashboard.heroku.com/. Click the "New" button and select "Create new app".

Heroku has several ways to deploy the code to its servers but for this tutorial we are going to use the "GitHub" option.

If you don't have a GitHub account set up, you should set one up now.

Once Heroku is connected to your GitHub account, any code you have in your GitHub repositories can be deployed to Heroku.

The next step is to clone the ScrapeOps Scrapy Playbook GitHub repository so that you can then deploy it to your Heroku instance.

Once this is done, in Heroku you can connect the cloned repository containing this tutorial's code examples straight to your Heroku server!

Now that your repository is connected you can either manually deploy the code by clicking the deploy button (see below), or you can have Heroku automatically deploy the code once a change is pushed to GitHub.

Deploying the code to Heroku with git
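
Once the build finishes, a quick sanity check (swapping in your own app name below) is to hit Scrapyd's daemonstatus.json endpoint and confirm it responds:

curl https://<YOUR-APP-NAME-HERE>.herokuapp.com/daemonstatus.json

If Scrapyd is up you should get back a small JSON response reporting its status along with the number of pending, running and finished jobs.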

Running our Spider with Scrapyd

Now that we have Scrapy & Scrapyd running on Heroku and our project is deployed, we can run several curl commands from our command line to check the status of our Scrapy project. For example:

List the projects on our Scrapyd server

curl https://<YOUR-APP-NAME-HERE>.herokuapp.com/listprojects.json

List the spiders in our project

curl https://<YOUR-APP-NAME-HERE>.herokuapp.com/listspiders.json?project=default

Run our spider

We can also run our spider with a curl command:

curl https://<YOUR-APP-NAME-HERE>.herokuapp.com/schedule.json -d project=default -d spider=chocolatespider
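
Scrapyd also exposes a couple of other endpoints that are handy once jobs are running. For example, you can list the pending, running and finished jobs for your project, or cancel a running job using the job ID that schedule.json returned (the <JOB_ID> below is a placeholder):

curl https://<YOUR-APP-NAME-HERE>.herokuapp.com/listjobs.json?project=default

curl https://<YOUR-APP-NAME-HERE>.herokuapp.com/cancel.json -d project=default -d job=<JOB_ID>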

How to monitor your running spiders

You could keep an eye on your jobs via the Heroku logs dashboard or via the Heroku command line interface, but an easier way to do it is to install ScrapeOps.

ScrapeOps gives you a simple to use, beautiful way to see how your jobs are doing, run your jobs, schedule recurring jobs, set up alerts and more. Oh, and it's free!

Register for an account with ScrapeOps at https://scrapeops.io. As part of the onboarding you are asked to add the following to your Scrapy project settings if you want to be able to see your jobs in ScrapeOps:

# Add Your ScrapeOps API key
SCRAPEOPS_API_KEY = 'YOUR-API-KEY-HERE'


# Add In The ScrapeOps Extension
EXTENSIONS = {
    'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500,
}


# Update The Download Middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}
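
Before redeploying, you can verify the integration locally by running the spider from your project directory - if the API key and extension are set up correctly, the job should appear in your ScrapeOps dashboard shortly after it starts:

scrapy crawl chocolatespider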

Then simply add your Heroku URL into the Server Domain Name field, with 0.0.0.0 as the IP address.

Make sure to tick the HTTPS box too.

Once you submit that form you should see that ScrapeOps has read & write access to your Scrapyd server.

Any job that we run should now send data back to ScrapeOps on the status of the job and the statistics related to the job.

Dashboard Manager

As you can see from the screenshot, this is what the main ScrapeOps dashboard looks like when this spider has run. We can see lots of helpful things such as:

  • The number of pages scraped
  • The number of items scraped
  • The specific fields that were scraped
  • The run time

We can also see if any specific job has varied by more than 10% from the number of items and pages that usually come back for this spider!

Running & Scheduling a Job

A single run of a spider is usually called a job. When we schedule our spider to run a "job", we can either do a one-off run or have the spider run at a specific time on a specific day of the week or month.

In your ScrapeOps account go to the Jobs Manager page, then click on "Schedule Job".

Here you have two options: run the crawl straight away or schedule a recurring job. If you have multiple spiders on your Heroku server, ScrapeOps should pick them up and allow you to select the one you want.


You can easily set the spider to run a crawl once a week by selecting the time/day of week using the dropdowns.


The scheduled jobs then appear on the scheduled jobs page where you can update/delete them as needed!

Setting up alerting for your scrapy jobs

Now that we have a spider set up to run once a week, let's set up an alert to notify us via Slack if there is an issue.

First we need to select our spider and add our Slack channel. Then we can choose the parameters that will trigger the alert. Let's set up the alert to check if fewer than 10 chocolate items are scraped, as that more than likely means there is some issue - either the website changed how the items are displayed, or it has added some anti-bot protection that you'll need to update your spider to work around.


Once the alert is created we can see it on the alerts page and edit/delete it as needed.

Saving the data to Heroku Postgres

Now, as you might have noticed, when we run a job at the moment the data isn't exactly easily accessible.
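
One way to make it more accessible is to reuse the database pipeline approach from Part 3, but pointed at a Heroku Postgres add-on. Below is a minimal sketch, assuming you have provisioned the Heroku Postgres add-on (which sets a DATABASE_URL config var on your app) and added psycopg2 (or psycopg2-binary) to your requirements.txt - the table name and item fields here are just examples, so adjust them to match your own items:

# pipelines.py - minimal sketch of a Heroku Postgres pipeline (example only)
import os

import psycopg2


class HerokuPostgresPipeline:

    def open_spider(self, spider):
        # DATABASE_URL is set automatically by the Heroku Postgres add-on;
        # Heroku Postgres connections generally require SSL
        self.connection = psycopg2.connect(os.environ["DATABASE_URL"], sslmode="require")
        self.cursor = self.connection.cursor()
        self.cursor.execute("""
            CREATE TABLE IF NOT EXISTS chocolate_products (
                name TEXT,
                price TEXT,
                url TEXT
            )
        """)
        self.connection.commit()

    def process_item(self, item, spider):
        # example fields - use whatever fields your Item actually defines
        self.cursor.execute(
            "INSERT INTO chocolate_products (name, price, url) VALUES (%s, %s, %s)",
            (item.get("name"), item.get("price"), item.get("url")),
        )
        self.connection.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.connection.close()

You would then enable this pipeline in ITEM_PIPELINES in settings.py, just like the pipelines from Part 3, and redeploy to Heroku so that every job writes its items to a database you can query.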

Next Steps

We hope you now have a good understanding of how to deploy your spider to a server and how to monitor, schedule and set up alerts for your Scrapy jobs! If you have any questions leave them in the comments below and we'll do our best to help out!

If you would like the code from this example please check out our Github!

If you would like to see more in-depth articles, we will soon be publishing an intermediate series on Scrapy topics such as:

  • Using headless browsers
  • The importance of exception handling
  • Chunking your scrape
  • How to best scrape Google search results using Scrapy
  • How to best scrape Amazon listings using Scrapy
  • How to best scrape LinkedIn using Scrapy
  • How to scrape over 5 million requests per month
  • Different Scrapy architectures for large scale scraping
  • And much much more!

Need a Proxy? Then check out our Proxy Comparison Tool that allows you to compare the pricing, features and limits of every proxy provider on the market so you can find the one that best suits your needs. Including the best free plans.