
freeCodeCamp Scrapy Beginners Course Part 12: Deploying & Scheduling Spiders With Scrapy Cloud


In Part 12 of the Scrapy Beginner Course, we go through how you can deploy, schedule and run your spiders with Scrapy Cloud.

There are several ways to deploy and run your scrapers in the cloud, which we cover in this course. In this part, we will show you how to deploy, schedule and run your spiders with Scrapy Cloud.

The code for this part of the course is available on GitHub here!

If you prefer video tutorials, then check out the video version of this course on the freeCodeCamp channel here.

freeCodeCamp Scrapy Course



What Is Scrapy Cloud?

Scrapy Cloud is a great spider hosting solution if you are building your scrapers with Scrapy and don't want to deal with setting up your own servers and job scheduling system.

With Scrapy Cloud you simply need to deploy your spiders to the Scrapy Cloud platform and configure when you want them to run. From here, Scrapy Cloud takes care of running the jobs and storing the data your spiders scrape.

Scrapy Cloud boasts some pretty powerful features:

  • On-demand scaling
  • Easy integration with other Scrapy & Zyte products (Splash, Spidermon, Zyte Smart Proxy Manager)
  • Full suite of logging & data QA tools

Get Started With Scrapy Cloud

Getting started with Scrapy Cloud is very simple.

First, create a free Scrapy Cloud account here. Then, once logged in, click "Start a new project".

Python Scrapy Playbook - Start Scrapy Cloud Project

And give your project a name.

Python Scrapy Playbook - Create Scrapy Cloud Project

Once your project has been created, you have two ways to deploy your Scrapy spiders to Scrapy Cloud:

  • Via the Command Line
  • Via Github Integration

Python Scrapy Playbook - Scrapy Cloud Deployment Options


Deploy Your Spiders To Scrapy Cloud From Command Line

Using the shub command line tool we can deploy our spiders directly to Scrapy Cloud from the command line.

First install shub on your system:


pip install shub

Then link the shub client to your Scrapy Cloud project by running shub login in your command line, and when prompted enter your Scrapy Cloud API key.


shub login
API key: YOUR_API_KEY

You can find your API key on the Code & Deploys page.

Then to deploy your Scrapy project to Scrapy Cloud, run the shub deploy command followed by your project's id:


shub deploy PROJECT_ID

You can find your project's ID on the Code & Deploys page or in the project URL:


https://app.zyte.com/p/PROJECT_ID/jobs

If successful, you will see the spiders you have available in the spiders tab.

Python Scrapy Playbook - Scrapy Cloud Spiders Dashboard

You can then run your scraping job on Scrapy Cloud directly from your command line:


$ shub schedule bookspider
Spider bookspider scheduled, watch it running here:
https://app.zyte.com/p/26731/job/1/8
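The job URL printed above packs three numbers into its path. As a minimal sketch, assuming the URL always follows the `/p/<project>/job/<spider>/<job>` pattern seen in this example, the helper below splits it into a `project/spider/job` key that you could log or store alongside your scraped data. The names `JobKey` and `job_key_from_url` are hypothetical and not part of shub.

```python
import re
from typing import NamedTuple


class JobKey(NamedTuple):
    """The three numeric parts of a job URL (names are illustrative)."""
    project_id: int
    spider_id: int
    job_number: int

    def __str__(self) -> str:
        return f"{self.project_id}/{self.spider_id}/{self.job_number}"


# Pattern assumed from the example URL: /p/<project>/job/<spider>/<job>
_JOB_URL = re.compile(r"/p/(\d+)/job/(\d+)/(\d+)/?$")


def job_key_from_url(url: str) -> JobKey:
    """Extract the project, spider and job numbers from a job URL."""
    match = _JOB_URL.search(url)
    if match is None:
        raise ValueError(f"not a job URL: {url!r}")
    return JobKey(*(int(part) for part in match.groups()))
```

For the URL in the example output, `job_key_from_url("https://app.zyte.com/p/26731/job/1/8")` gives a key whose string form is `26731/1/8`.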


Deploy Your Spiders To Scrapy Cloud via GitHub

The other option is to connect Scrapy Cloud to your GitHub account and deploy your spiders directly from GitHub.

On the Code & Deploys page, select the option to Connect to GitHub and follow the instructions.

If you haven't connected Zyte to your GitHub account previously, then you might be asked to authorize Zyte to access your repositories.

Python Scrapy Playbook - Scrapy Cloud GitHub Authorization

Next, you will be prompted to pick which repository you want Scrapy Cloud to connect to.

Python Scrapy Playbook - Scrapy Cloud Select GitHub Repository

tip

The repository you select must contain a Scrapy project at its root (i.e. the scrapy.cfg file is located in the repository root). Otherwise, the build process will fail.
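Before connecting a repository, it can be worth checking this requirement locally. The following is a minimal sketch, assuming you have the repository checked out on disk. The helper names `has_scrapy_project_at_root` and `find_scrapy_cfg` are hypothetical and not part of Scrapy or Scrapy Cloud.

```python
from pathlib import Path


def has_scrapy_project_at_root(repo_root: str) -> bool:
    """Return True if scrapy.cfg sits directly in the repository root."""
    return (Path(repo_root) / "scrapy.cfg").is_file()


def find_scrapy_cfg(repo_root: str) -> list[Path]:
    """List every scrapy.cfg in the repo, to spot a project nested in a subfolder."""
    return sorted(Path(repo_root).rglob("scrapy.cfg"))
```

If `has_scrapy_project_at_root(".")` returns `False` but `find_scrapy_cfg(".")` finds a file, the project lives in a subfolder and would need to be moved to the repository root first.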

By default, when you connect Scrapy Cloud to a GitHub repository it is configured to auto-deploy any changes you push to the repository. However, if you prefer you can switch it to Manual Deploy mode, and deploy changes to your spiders manually.

If you leave it in Automatic Deploy mode, click the Deploy Branch button to start the first deployment.

Scrapy Cloud - First Automatic Deployment

If successful, you will see the spiders you have available in the spiders tab.

Python Scrapy Playbook - Scrapy Cloud Spiders Dashboard


Run Spiders On Scrapy Cloud

Once we've deployed our spider, running scraping jobs on Scrapy Cloud is very straightforward.

Simply go to the Spiders Dashboard, select the spider you want to run, and then click Run.

Python Scrapy Playbook - Scrapy Cloud Run Job

You will then be given the option to add any arguments, tags or extra Scrapy Units to the job before it is run.

Python Scrapy Playbook - Scrapy Cloud Run Job Settings

Once you are happy with the settings, click Run, and Scrapy Cloud will queue the job.

tip

You can have certain jobs skip the queue or go to the back of the queue by giving your jobs a priority value from Lowest to Highest.
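The same run options (arguments, tags, units and priority) can also be expressed in code. The sketch below only builds an HTTP request and does not send it. It assumes Scrapy Cloud exposes a `run.json` endpoint that accepts `project`, `spider`, `units`, `priority` and `add_tag` form fields, and that priorities map to the integers 0 to 4. Treat the endpoint URL, the field names and the priority mapping as assumptions to verify against the current Scrapy Cloud API documentation. `build_run_request` is a hypothetical helper name.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint; confirm against the current Scrapy Cloud API docs.
RUN_ENDPOINT = "https://app.zyte.com/api/run.json"

# Assumed mapping of the dashboard's priority labels to API values.
PRIORITIES = {"lowest": 0, "low": 1, "normal": 2, "high": 3, "highest": 4}


def build_run_request(api_key, project_id, spider, *, priority="normal",
                      units=1, tags=(), spider_args=None):
    """Build (but do not send) a POST request that would schedule a job."""
    fields = [
        ("project", str(project_id)),
        ("spider", spider),
        ("priority", str(PRIORITIES[priority])),
        ("units", str(units)),
    ]
    # One add_tag field per tag; any remaining fields are spider arguments.
    fields += [("add_tag", tag) for tag in tags]
    fields += list((spider_args or {}).items())
    # HTTP Basic auth with the API key as username and an empty password.
    token = base64.b64encode(f"{api_key}:".encode()).decode()
    return Request(
        RUN_ENDPOINT,
        data=urlencode(fields).encode(),
        headers={"Authorization": f"Basic {token}"},
        method="POST",
    )
```

Keeping the request construction separate from sending it makes the options easy to inspect first. To actually submit the job, you would pass the returned object to `urllib.request.urlopen`.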

Now when you go to the Jobs Dashboard, you will see whether the job is queued, running or completed, along with overview stats such as Runtime, Items Scraped and Errors.

Python Scrapy Playbook - Scrapy Cloud Completed Job


Schedule Jobs on Scrapy Cloud

The most useful feature of Scrapy Cloud is the Periodic Jobs functionality, which allows you to schedule your spiders to run periodically in the future.

Scrapy Cloud uses a scheduler similar to crontab, so you can schedule your spiders to run every minute, hour, day, week or month.
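To make the cron-style idea concrete, here is a minimal sketch of how a five-field cron expression (minute, hour, day of month, month, day of week) can be checked against a point in time. It is a simplified illustration, not Scrapy Cloud's scheduler. It only supports `*`, `*/n`, single numbers and comma-separated lists, and it requires all five fields to match. `cron_matches` is a hypothetical helper name.

```python
from datetime import datetime


def _field_matches(spec: str, value: int) -> bool:
    """Match one cron field written as '*', '*/n', 'a' or 'a,b,c'."""
    for part in spec.split(","):
        if part == "*":
            return True
        if part.startswith("*/"):
            if value % int(part[2:]) == 0:
                return True
        elif int(part) == value:
            return True
    return False


def cron_matches(expr: str, when: datetime) -> bool:
    """Return True if `when` satisfies the five-field expression `expr`."""
    minute, hour, day, month, weekday = expr.split()
    # cron counts Sunday as 0, while datetime.weekday() counts Monday as 0.
    cron_weekday = (when.weekday() + 1) % 7
    return (
        _field_matches(minute, when.minute)
        and _field_matches(hour, when.hour)
        and _field_matches(day, when.day)
        and _field_matches(month, when.month)
        and _field_matches(weekday, cron_weekday)
    )
```

For example, `"0 * * * *"` matches at the top of every hour, and `"30 6 * * 1"` matches at 06:30 on Mondays.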

To use the scheduling functionality, go to the Periodic Jobs Dashboard, and click Add Periodic Job.

Python Scrapy Playbook - Scrapy Cloud Add Periodic Job

Here you are prompted to select which Spider you want to schedule, when it should run and any extra settings like Priority, Tags and Arguments.

Once saved, the spider will automatically run at your selected interval.

Python Scrapy Playbook - Scrapy Cloud Periodic Jobs Dashboard

Requires Paid Plan

To use Scrapy Cloud's Periodic Jobs functionality, you need to subscribe to a paid Scrapy Cloud plan (starts at $9/month). However, you can check out some free Scrapy Cloud alternatives here that don't require you to pay to schedule your jobs.


Next Steps

In this part, we looked at how we can use Scrapy Cloud to deploy, run and schedule our spiders in the cloud.

So in Part 13, we will summarize what we have learned and introduce you to some of the more advanced topics not covered in this course.

All parts of the 12 Part freeCodeCamp Scrapy Beginner Course are as follows: