
Scrapy Beginners Series Part 5: Deploying & Scheduling Spiders

So far in this series we learned how to:

    1. build a basic Scrapy spider and get it to scrape some data from a website
    2. clean up the data as it was being scraped
    3. save the data to a file & a database
    4. bypass any site anti-bots or rate limiting

In this chapter (Part 5) we will explore how to deploy your spider to a separate server so you don't need to keep your personal laptop/computer running, how to set up monitoring and alerting so you know when your jobs are running and whether there are any issues with them, and finally how to schedule jobs so that they run at set recurring times.

Python Scrapy 5-Part Beginner Series

  • Part 1: Basic Scrapy Spider - We will go over the basics of Scrapy, and build our first Scrapy spider. (Part 1)

  • Part 2: Cleaning Dirty Data & Dealing With Edge Cases - Web data can be messy, unstructured, and have lots of edge cases. In this tutorial we will make our spider robust to these edge cases, using Items, Itemloaders and Item Pipelines. (Part 2)

  • Part 3: Storing Our Data - There are many different ways we can store the data that we scrape from databases, CSV files to JSON format, and to S3 buckets. We will explore several different ways we can store the data and talk about their Pro's, Con's and in which situations you would use them. (Part 3)

  • Part 4: User Agents & Proxies - Make our spider production ready by managing our user agents & IPs so we don't get blocked. (Part 4)

  • Part 5: Deployment, Scheduling & Running Jobs - Making our spider "production ready" by deploying our spider onto a Digital Ocean server, and then setting up scheduling and monitoring and via ScrapeOps. (This Tutorial)

For this beginner series, we're going to be using the simplest scraping architecture: a single scraper.


In this tutorial, Part 5: Beginners Guide to Deployment, Monitoring, Alerting and Scheduling with Scrapy, we're going to cover:

  • How to monitor your running spiders
  • Getting our project ready for deployment onto a server
  • Creating a server on Digital Ocean
  • Deploying your Scrapy spider to your server
  • Running & scheduling jobs

With the intro out of the way, let's get down to business.


How to monitor your running spiders

If you're just joining us for this tutorial, let's fork the ScrapeOps Part 5 code and then you can add your own API keys to your own repo, as we'll be cloning from your own repo to your Digital Ocean server. If you already have your own repo set up on GitHub then there is no need to clone the ScrapeOps one! :)

You can find the code for this (part 5) tutorial here: https://github.com/python-scrapy-playbook/Beginner-Series-Part-5-Deployment-Scheduling-Monitoring

OK - why do we need monitoring? You could check that your jobs are running correctly by inspecting the output in your file or database, but an easier way to do it is to install the ScrapeOps monitoring extension.

ScrapeOps gives you a simple-to-use, beautiful way to see how your jobs are doing, run your jobs, schedule recurring jobs, set up alerts and more. Oh, and it's free :)

You can create a free ScrapeOps account here: https://scrapeops.io/app/register

We just need to run the following to install the ScrapeOps extension:

pip install scrapeops-scrapy

Once that is installed correctly, you need to add the following to your Scrapy project settings if you want to be able to see your logs in ScrapeOps:

# Add Your ScrapeOps API key
SCRAPEOPS_API_KEY = 'YOUR-API-KEY-HERE'


# Add In The ScrapeOps Extension
EXTENSIONS = {
    'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500,
}


# Update The Download Middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}
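If you'd prefer not to commit your API key directly to the repo we'll be pushing to GitHub and cloning onto the server, one option is to read it from an environment variable instead. Here's a minimal sketch (the SCRAPEOPS_API_KEY environment variable name is just our own choice, not something ScrapeOps requires):

# Read the API key from an environment variable, with a placeholder fallback
import os

SCRAPEOPS_API_KEY = os.environ.get('SCRAPEOPS_API_KEY', 'YOUR-API-KEY-HERE')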

Any job that we run should now send data back to ScrapeOps with the status of the job and its related statistics. Test it out by running scrapy crawl chocolatespider. A few seconds later the results should show up in the ScrapeOps dashboard!

Dashboard Manager

As you can see from the screenshot, this is what the main ScrapeOps dashboard looks like once this spider has run. We can see lots of helpful things such as:

  • The number of pages scraped
  • The number of items scraped
  • The specific fields that were scraped
  • The run time

We can also see if any specific job has varied by more than 10% outside the usual number of items and pages that usually come back for this spider!

Getting our project ready for deployment onto a server

Finding our scrapy project requirements

Now that we have our monitoring working we just need to make sure our scrapy project is ready to be put onto the server.

As you know, several different Python modules/packages need to be installed for our Scrapy project to run correctly. When we set up a server and clone the repository onto it, we won't have these packages/modules installed. The good news is that we don't have to remember everything we've already installed: we can use the pip freeze command to generate the list of installed modules/packages, and that list can then be used to install everything on the server.

Creating a requirements.txt file

To get pip to create this file for us, we just need to run the following command:

pip freeze > requirements.txt

You should now have a requirements.txt file with the following lines (or something very similar) in it:

attrs==21.4.0
Automat==20.2.0
beautifulsoup4==4.11.1
bleach==4.1.0
botocore==1.27.57
certifi==2021.10.8
cffi==1.15.0
charset-normalizer==2.0.12
colorama==0.4.4
constantly==15.1.0
cryptography==36.0.2
cssselect==1.1.0
docutils==0.18.1
filelock==3.6.0
hyperlink==21.0.0
idna==3.3
importlib-metadata==4.11.3
incremental==21.3.0
itemadapter==0.5.0
itemloaders==1.0.4
jmespath==1.0.0
json5==0.9.6
keyring==23.5.0
lxml==4.8.0
mysql-client==0.0.1
mysql-connector-python==8.0.30
packaging==21.3
parsel==1.6.0
pkginfo==1.8.2
Protego==0.2.1
protobuf==3.20.1
proxyscrape==0.3.0
psycopg2==2.9.3
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser==2.21
PyDispatcher==2.0.5
Pygments==2.11.2
pyOpenSSL==22.0.0
pyparsing==3.0.7
python-dateutil==2.8.2
queuelib==1.6.2
readme-renderer==34.0
requests==2.27.1
requests-file==1.5.1
requests-toolbelt==0.9.1
rfc3986==2.0.0
scrapeops-python-requests==0.4.0
scrapeops-scrapy==0.4.6
Scrapy==2.6.1
scrapy-proxy-pool==0.1.9
scrapy-user-agents==0.1.1
service-identity==21.1.0
six==1.16.0
soupsieve==2.3.2.post1
tld==1.0.1
tldextract==3.2.0
tqdm==4.63.1
twine==3.8.0
Twisted==22.2.0
typing_extensions==4.1.1
ua-parser==0.16.1
urllib3==1.26.9
user-agents==2.2.0
w3lib==1.22.0
webencodings==0.5.1
zipp==3.7.0
zope.interface==5.4.0

This has already been run in the project that is up on the ScrapeOps repo, but if you add any new Python modules to your project you will have to re-run this command to keep it up to date.
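Later, on the server (or in any fresh virtual environment), this file lets us install every dependency in one go:

pip install -r requirements.txt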

Creating a Server on Digital Ocean

Why do we need a server?

Unless you want to leave your home laptop/computer running day and night so the spider can run automatically whenever you want, we need a remote server to do the job for us. VPSs (Virtual Private Servers) are the most commonly used option as they are cheap and easy to set up and access.

There are a number of different providers, many of which you will probably have heard of before, such as Digital Ocean, Amazon Web Services (AWS), Google Cloud, Vultr, etc.

For this tutorial we're going to use Digital Ocean's cheapest offering, which is only $4 a month - they also offer $100 in free credit which you can use while you do this tutorial. AWS also offers free plans for up to 12 months on some of their VPSs.

A quick note on Heroku - we are aware that Heroku has a free tier. The downside is that it only runs for a maximum of 18 hours a day, which makes scheduling recurring jobs with it more complicated. It also automatically goes into sleep mode after a period of time, which can disrupt any long-running jobs you have. That's why we recommend going with one of the other providers listed above!

Setting up the server

First, sign up for a Digital Ocean account here: https://m.do.co/c/2656441c8345

Click the "Create" dropdown at the top of the page and then the "Droplets" button. (See below) New Droplet

The operating system to select is Ubuntu 22.04. Select the cheapest $4 server, which is under the Shared CPU, Regular SSD plan options.

New Droplet

The location (datacenter region) of the server can be anywhere, but it makes sense to choose the location closest to where you are based.

For the SSH key/password option you can select either. For this tutorial we are going to select the password option, as we are not going to cover how to create an SSH key here.

New Droplet

Once those options are selected, simply click the "Create Droplet" button at the bottom of the page and wait a minute or two for the droplet to be created.

Connect the server to ScrapeOps

Ok - now our Scrapy project is ready to be deployed and we have our server set up. We could manually run git clone to copy our project onto the server, but there are easier ways of doing this. We are going to use ScrapeOps - ScrapeOps makes it easy for us to deploy and manage our servers, repos and spiders all from the browser. No command line needed once it's hooked up correctly!
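(For reference, the manual route would just be running something like the following on the server, using the repo URL from earlier - but we'll let ScrapeOps handle this for us.)

git clone https://github.com/python-scrapy-playbook/Beginner-Series-Part-5-Deployment-Scheduling-Monitoring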

To add the server just go to the Servers & Deployments page and click the "Add Server" button.

New Server

In this section we just need to add the server name - this can be anything we want - and then the server IP address, which we can get from our Digital Ocean server page. The server domain name can be left blank.

New Server

Once that's done, the following screen will be displayed (see below).

New Server

We now just need to open a console page from our Digital Ocean server page and wait for the server terminal to finish connecting.

New Server

Then copy and paste in the code block from the ScrapeOps server page.

The server will then go through the steps of installing the required libraries, creating a separate user account for security purposes, and installing an SSH key so that you can interact with your server using ScrapeOps. When the steps are completed you will be brought to the server page automatically and you should see the following screen:

Deploy your scrapy spider to your server

Now we're ready to deploy our spider to our server. To do this we just click the "Clone Repository" button. This will open a modal where we can then enter our repo details.

We first need to copy and paste the URL from our GitHub repository. If the repository is private, ScrapeOps will ask us to provide access to it when we click the Clone Repo button.

You also need to add the branch name, which is the name of the main branch of your repository - usually master or main.

The language and framework are preselected as Python & Scrapy, as those are the most popular language/framework combination used with ScrapeOps at the moment. For the purposes of this tutorial you can leave those as they are, since they are already correct!

The install script will do the following after cloning the repo:

  1. Create a Python virtual environment
  2. Activate the Python virtual environment
  3. Install any Python modules specified in the requirements.txt file

We can leave this script as is for this tutorial.
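For reference, an install script along these lines usually boils down to a few shell commands like the following (a rough sketch of what such a script might do - the exact script ScrapeOps runs may differ):

# create and activate a Python virtual environment, then install the project's requirements
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt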

Now click Clone Repo.

ScrapeOps will now:

  • Clone the repository from your GitHub account
  • Run the install script
  • Run "scrapy list" to find your Scrapy spiders

If any errors occur during this process they will be displayed to you now. This could be anything from a missing Python module which needs to be installed for Scrapy to work, to the branch name being incorrect. If errors do occur you will need to fix them and try to clone the repo again.
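You can catch most of these issues before deploying by running the same check ScrapeOps does from your project folder on your own machine. Assuming your spider is still named chocolatespider from earlier in this series, you should see:

scrapy list
# expected output: chocolatespider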

If the install process works correctly then you will see the repo in the "Cloned Repo" table and the Spider in the "Spiders" table. (Like in the below screenshot)

Running & Scheduling a Job

So we now have jobs that can run on our server and we can see the stats come into the ScrapeOps dashboard. The next step is to run and schedule our jobs via ScrapeOps so that we don't have to manually run scrapy crawl from the Digital Ocean terminal page or SSH into our server and run the command there. We can just run/schedule a job in the browser using ScrapeOps - this could even be done on our phone or iPad!

When a spider runs/crawls, it is usually called a job. When we schedule our spider to run a job, we can either do a once-off run or run the spider at a specific time on a specific day of the week or month.

In your ScrapeOps account, go to the Scheduler page, then click the "Schedule" button.

Here you can select the server, repo and spider, and then you have two options: either run the crawl straight away or schedule a recurring job.

You can easily set the spider to run a crawl once a week by selecting the time/day of the week using the dropdowns. Just be sure to note that the job is scheduled in UTC, as it is saved to your server's cron file which uses UTC.
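For context, a weekly schedule like this ends up as a cron entry roughly like the one below. This is purely illustrative - the exact command, virtual environment activation and paths that ScrapeOps writes to the cron file will differ, and /home/scrapeops/chocolatescraper is a made-up path:

# run the chocolatespider crawl every Monday at 08:00 UTC (illustrative only)
0 8 * * 1 cd /home/scrapeops/chocolatescraper && scrapy crawl chocolatespider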

The scheduled jobs then appear on the scheduled jobs page where you can update/delete them as needed! Simple!

Next Steps

We hope you now have a good understanding of how to set up a Scrapy project, scrape the data you need and schedule your spider jobs so that your data is always up to date! If you have any questions leave them in the comments below and we'll do our best to help out!

If you would like the code from this example please check out our Github!

If you would like to see more in-depth articles such as the ones in the list below, check out the rest of the Python Scrapy Playbook:

  • Using headless browsers
  • The importance of exception handling
  • Chunking your scrape
  • How to best scrape Google search results using Scrapy
  • How to best scrape Amazon listings using Scrapy
  • How to best scrape LinkedIn using Scrapy
  • How to scrape over 5 million requests per month
  • Different Scrapy architectures for large scale scraping
  • And much much more!

Need a proxy to stop your spiders from being blocked? Then check out our Proxy Comparison Tool, which allows you to compare the pricing, features and limits of every proxy provider on the market so you can find the one that best suits your needs. Including the best free plans.