Scrapy SDK Integration

The ScrapeOps Scrapy SDK is an extension for your Scrapy spiders that gives you all the scraping monitoring, statistics, alerting, and data validation you will need straight out of the box.

Just enable it in your settings.py file and the SDK will automatically monitor your scrapers and send your logs to your scraping dashboard.

🚀 Getting Set Up

You can get the ScrapeOps monitoring suite up and running in 4 easy steps, plus one optional step.

#1 - Install the ScrapeOps SDK:

pip install scrapeops-scrapy

#2 - Get Your ScrapeOps API Key:

Create a free ScrapeOps account here and get your API key from the dashboard.

When you have your API key, open your Scrapy project's settings.py file and insert the key into it.

SCRAPEOPS_API_KEY = 'YOUR_API_KEY'
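If you would rather not hard-code the key in settings.py, one option is to read it from an environment variable. The sketch below assumes a variable named SCRAPEOPS_API_KEY has been exported in the shell that runs your spider. That variable name and the placeholder fallback are illustrative choices for this example, not something the SDK requires.

```python
import os

# Assumption: the key was exported beforehand, e.g. `export SCRAPEOPS_API_KEY=...`.
# The environment variable name is a hypothetical choice for this sketch.
# If the variable is not set, the placeholder string is used instead.
SCRAPEOPS_API_KEY = os.environ.get('SCRAPEOPS_API_KEY', 'YOUR_API_KEY')
```

Because settings.py is an ordinary Python module, Scrapy picks up the resulting value like any other setting.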

#3 - Add in the ScrapeOps Extension:

In the settings.py file, enable the ScrapeOps extension by adding it to the EXTENSIONS dictionary.

EXTENSIONS = {
    'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500,
}

#4 - Enable the ScrapeOps Retry Middleware:

To get the most accurate stats, add the ScrapeOps retry middleware to the DOWNLOADER_MIDDLEWARES dictionary and disable the default Scrapy retry middleware in your Scrapy project's settings.py file.

You can do this by setting the default Scrapy RetryMiddleware to None and enabling the ScrapeOps retry middleware in its place.

DOWNLOADER_MIDDLEWARES = {
    'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}

Retries will operate exactly as before. The difference is that the ScrapeOps retry middleware also logs every request, response, and exception your spiders generate.
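To double-check the four steps above, a small helper like the following can be run against a plain dictionary of your settings. This is a minimal sketch and not part of the ScrapeOps SDK: the function name find_setup_problems and the wording of its messages are hypothetical, and it only inspects the three settings named in this guide.

```python
SCRAPEOPS_EXTENSION = 'scrapeops_scrapy.extension.ScrapeOpsMonitor'
SCRAPEOPS_RETRY = 'scrapeops_scrapy.middleware.retry.RetryMiddleware'
SCRAPY_RETRY = 'scrapy.downloadermiddlewares.retry.RetryMiddleware'


def find_setup_problems(settings):
    """Return a list of problems found in a plain dict of settings.

    Hypothetical helper for illustration; it is not provided by the SDK.
    """
    problems = []
    if not settings.get('SCRAPEOPS_API_KEY'):
        problems.append('SCRAPEOPS_API_KEY is missing or empty')
    if SCRAPEOPS_EXTENSION not in settings.get('EXTENSIONS', {}):
        problems.append('ScrapeOpsMonitor is not in EXTENSIONS')
    middlewares = settings.get('DOWNLOADER_MIDDLEWARES', {})
    # The ScrapeOps middleware must be present with a priority, not None.
    if middlewares.get(SCRAPEOPS_RETRY) is None:
        problems.append('ScrapeOps retry middleware is not enabled')
    # The default middleware must be explicitly mapped to None.
    if middlewares.get(SCRAPY_RETRY, 0) is not None:
        problems.append('default Scrapy RetryMiddleware is not set to None')
    return problems


example = {
    'SCRAPEOPS_API_KEY': 'YOUR_API_KEY',
    'EXTENSIONS': {SCRAPEOPS_EXTENSION: 500},
    'DOWNLOADER_MIDDLEWARES': {SCRAPEOPS_RETRY: 550, SCRAPY_RETRY: None},
}
print(find_setup_problems(example))  # prints []
```

An empty list means the dictionary matches the configuration described in steps 2 to 4. It does not confirm that the package from step 1 is installed.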


#5 - (Optional) Exclude Settings From Being Logged By ScrapeOps SDK:

By default, the ScrapeOps SDK logs the settings used for each scrape so you can keep track of them. To avoid recording sensitive information like API keys, it won't log any settings that contain the following substrings:

  • API_KEY
  • APIKEY
  • SECRET_KEY
  • SECRETKEY

Settings that don't match these patterns are still logged. To stop a specific setting from being logged, add its name to SCRAPEOPS_SETTINGS_EXCLUSION_LIST.

SCRAPEOPS_SETTINGS_EXCLUSION_LIST = [
    'NAME_OF_SETTING_NOT_TO_LOG',
]
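The filtering described above can be pictured roughly as follows. This is an illustrative sketch only, not the SDK's actual implementation: the function name loggable_settings is hypothetical, and it assumes that substring matching is case-sensitive, is applied to setting names, and that the exclusion list is matched on exact names. The example setting DB_PASSWORD is also made up.

```python
# Substrings listed in this guide as never being logged.
DEFAULT_SENSITIVE_SUBSTRINGS = ('API_KEY', 'APIKEY', 'SECRET_KEY', 'SECRETKEY')


def loggable_settings(settings, exclusion_list=()):
    """Return the settings this sketch would treat as safe to log."""
    excluded = set(exclusion_list)
    kept = {}
    for name, value in settings.items():
        # Skip names explicitly excluded by the user (exact match assumed).
        if name in excluded:
            continue
        # Skip names containing any of the default sensitive substrings.
        if any(sub in name for sub in DEFAULT_SENSITIVE_SUBSTRINGS):
            continue
        kept[name] = value
    return kept


settings = {
    'SCRAPEOPS_API_KEY': 'YOUR_API_KEY',  # dropped: name contains API_KEY
    'DB_PASSWORD': 'example-password',    # dropped only if excluded explicitly
    'CONCURRENT_REQUESTS': 16,            # kept
}
print(loggable_settings(settings, ['DB_PASSWORD']))
# prints {'CONCURRENT_REQUESTS': 16}
```

Note that in this sketch DB_PASSWORD would pass through if it were not listed explicitly, which is the situation SCRAPEOPS_SETTINGS_EXCLUSION_LIST is meant to cover.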

Done!

That's all. From here, the ScrapeOps SDK will automatically monitor and collect statistics from your scraping jobs and display them in your ScrapeOps dashboard.