
How To Set Scrapy Delays/Sleeps Between Requests

Web scraping is a controversial topic for a lot of reasons. One of the most cited is that web scrapers can be selfish and hit websites too hard.

A scraper can overload a website with so many requests that it slows the website's servers and harms the experience for real users. In the worst cases, it effectively launches a denial-of-service attack on the website.

This is a massive headache for website administrators, and can be costly for them to mitigate.

That's why it is important for all web scrapers to act in an ethical way and scrape as politely as possible.

One of the ways we can scrape more politely is by adding delays between our requests.

Not only will this reduce the load on a website, it can also make our spiders harder for websites to detect and block. So using delays between your requests is a win-win for everyone.

In this guide we will show you the various ways you can add delays or sleeps between your requests using Scrapy.

Let's begin...


Don't Use Sleeps Between Requests

If this were a scraper built with the Python requests library, a lot of developers would simply call time.sleep to add a delay between requests.

However, when scraping with Scrapy you shouldn't use time.sleep as it will block the Twisted reactor (the underlying framework powering Scrapy), which will completely block your Scrapy spider and stop all of Scrapy's concurrency functionality.
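
The effect of a blocking sleep on an event loop can be illustrated with a minimal sketch. It uses Python's asyncio as a stand-in for the Twisted reactor, so it is an analogy rather than Scrapy's actual internals. The task and helper names are hypothetical.

```python
import asyncio
import time


async def blocking_task(delay):
    # time.sleep freezes the whole event loop, so no other task can run.
    time.sleep(delay)


async def cooperative_task(delay):
    # asyncio.sleep hands control back to the loop while waiting.
    await asyncio.sleep(delay)


async def run_all(task, count, delay):
    # Start `count` tasks together and measure the total wall-clock time.
    start = time.perf_counter()
    await asyncio.gather(*(task(delay) for _ in range(count)))
    return time.perf_counter() - start


blocking_elapsed = asyncio.run(run_all(blocking_task, 3, 0.1))
cooperative_elapsed = asyncio.run(run_all(cooperative_task, 3, 0.1))

print(f"blocking sleeps:    {blocking_elapsed:.2f}s")
print(f"cooperative sleeps: {cooperative_elapsed:.2f}s")
```

The three blocking tasks run one after another, while the three cooperative tasks wait at the same time. A time.sleep inside a Scrapy callback serialises work in a similar way.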

You should use one of the following methods...


Set Download Delays

The easiest way to set Scrapy to delay or sleep between requests is to use its DOWNLOAD_DELAY functionality.

By default, your Scrapy project's DOWNLOAD_DELAY setting is 0, which means Scrapy sends consecutive requests to the same website without any delay between them.

However, you can introduce delays between your requests by setting DOWNLOAD_DELAY to a non-zero number of seconds.

You can do this in your settings.py file like this:

## settings.py

DOWNLOAD_DELAY = 2 # 2 seconds of delay

Or in a specific spider using a custom_settings attribute. This is convenient when running your spiders as a script with CrawlerProcess, where the project's settings.py is not loaded unless you pass the project settings in explicitly.

# bookspider.py

import scrapy
from demo.items import BookItem


class BookSpider(scrapy.Spider):
    name = 'bookspider'
    start_urls = ["http://books.toscrape.com"]

    custom_settings = {
        'DOWNLOAD_DELAY': 2  # 2 seconds of delay
    }

    def parse(self, response):
        # Yield one item per book listed on the page
        for article in response.css('article.product_pod'):
            book_item = BookItem(
                url=article.css("h3 > a::attr(href)").get(),
                title=article.css("h3 > a::attr(title)").get(),
                price=article.css(".price_color::text").get(),
            )
            yield book_item


Using this DOWNLOAD_DELAY setting, Scrapy will add a delay between each request when making requests to the same domain.

Best Practice: If your scraping job isn't big and you don't have massive time pressure to complete a scrape, then it is recommended to set a high DOWNLOAD_DELAY as this will minimize the load on the website and reduce your chances of getting blocked.
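
To weigh a higher delay against time pressure, a rough lower bound on crawl time can be worked out with simple arithmetic. The sketch below uses a hypothetical helper, not part of Scrapy. It assumes every request goes to one domain and ignores response time, retries and randomisation.

```python
def estimate_min_crawl_seconds(num_requests, download_delay):
    """Rough lower bound: one delay between each pair of consecutive requests.

    Hypothetical helper for illustration; real crawls take longer because
    response time, retries and parsing are ignored here.
    """
    if num_requests <= 0:
        return 0
    return (num_requests - 1) * download_delay


for delay in (0.5, 2, 5):
    seconds = estimate_min_crawl_seconds(1000, delay)
    print(f"DOWNLOAD_DELAY={delay}: at least {seconds / 60:.1f} minutes for 1,000 requests")
```

If the estimate for a high delay still fits comfortably in your schedule, there is little reason to pick a lower one.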


Random Delays Between Requests

By default, when you set DOWNLOAD_DELAY = 2 for example, Scrapy will introduce a random delay between the following limits:

  • Upper Limit: 1.5 * DOWNLOAD_DELAY
  • Lower Limit: 0.5 * DOWNLOAD_DELAY

So for our example of DOWNLOAD_DELAY = 2, after a request is made Scrapy will wait between 1 and 3 seconds before making the next request.

This is because, by default, RANDOMIZE_DOWNLOAD_DELAY is set to True in your Scrapy project.
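
The range above can be reproduced in plain Python. This is a minimal sketch of the arithmetic, using a hypothetical helper name; it is not a call into Scrapy.

```python
import random


def randomized_delay(download_delay):
    """Pick a wait between 0.5x and 1.5x of download_delay (hypothetical helper)."""
    return random.uniform(0.5 * download_delay, 1.5 * download_delay)


DOWNLOAD_DELAY = 2
samples = [randomized_delay(DOWNLOAD_DELAY) for _ in range(5)]

# Every sample falls between 1 and 3 seconds for DOWNLOAD_DELAY = 2.
print([round(sample, 2) for sample in samples])
```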


Fixed Delays Between Requests

To introduce fixed delays, you simply need to set RANDOMIZE_DOWNLOAD_DELAY to False in your settings.py file or spider like this:

In settings.py file:

## settings.py

DOWNLOAD_DELAY = 2 # 2 seconds of delay
RANDOMIZE_DOWNLOAD_DELAY = False # always wait exactly DOWNLOAD_DELAY seconds

In spider:

# bookspider.py

import scrapy
from demo.items import BookItem


class BookSpider(scrapy.Spider):
    name = 'bookspider'
    start_urls = ["http://books.toscrape.com"]

    custom_settings = {
        'DOWNLOAD_DELAY': 2,  # 2 seconds of delay
        'RANDOMIZE_DOWNLOAD_DELAY': False,
    }

    def parse(self, response):
        pass


Using AutoThrottle Extension

Another way to add delays between your requests when scraping a website is using Scrapy's AutoThrottle extension.

AutoThrottle is a built-in Scrapy extension that continuously calculates the optimal delay between your requests to minimise the load on the website you are crawling. It does this by adjusting the delay based on the latency of each response and if the response is valid or not.

This approach has a couple of advantages:

  • Adjusts To Website: Every website is different in terms of the amount of traffic their servers normally handle and how aggressively they ban/throttle requests when a single IP is making requests too fast. With the AutoThrottle extension, you just set the initial parameters and it will calculate the optimal delay to use.
  • Backoff When Errors: A key feature of the AutoThrottle extension is that it will not speed up while the server is returning errors (non-200 status codes). Servers typically return error responses faster than valid responses, so with a normal download delay and a hard concurrency limit your scraper will start sending requests faster when the server starts to return errors. This is the opposite of what a good scraper should do, and the AutoThrottle extension fixes this problem.

AutoThrottle Throttling Algorithm

The AutoThrottle algorithm throttles the download delays using the following rules:

  1. Spiders start with a download delay of AUTOTHROTTLE_START_DELAY.
  2. When a response is received, the target download delay is calculated as latency / N where latency is the latency of the response, and N is AUTOTHROTTLE_TARGET_CONCURRENCY.
  3. The download delay for next requests is set to the average of previous download delay and the target download delay.
  4. Latencies of non-200 responses are not allowed to decrease the download delay.
  5. The download delay can’t become less than DOWNLOAD_DELAY or greater than AUTOTHROTTLE_MAX_DELAY.
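
The five rules above can be sketched as a small function. This is a simplified illustration of the rules as listed, not Scrapy's actual source code. The function name and parameters are hypothetical and stand in for the corresponding settings.

```python
def next_download_delay(previous_delay, latency, status,
                        target_concurrency=1.0, min_delay=0.0, max_delay=60.0):
    """Apply rules 2-5 to one response (hypothetical helper, not a Scrapy API).

    min_delay stands in for DOWNLOAD_DELAY, max_delay for AUTOTHROTTLE_MAX_DELAY
    and target_concurrency for AUTOTHROTTLE_TARGET_CONCURRENCY.
    """
    # Rule 2: target delay is latency / N.
    target_delay = latency / target_concurrency
    # Rule 3: average the previous delay and the target delay.
    new_delay = (previous_delay + target_delay) / 2.0
    # Rule 5: keep the delay between the minimum and maximum.
    new_delay = min(max(min_delay, new_delay), max_delay)
    # Rule 4: a non-200 response is not allowed to decrease the delay.
    if status != 200 and new_delay < previous_delay:
        return previous_delay
    return new_delay


delay = 5.0  # Rule 1: start at AUTOTHROTTLE_START_DELAY (default value)
for latency, status in [(1.0, 200), (1.0, 200), (0.2, 503), (4.0, 200)]:
    delay = next_download_delay(delay, latency, status)
    print(f"latency={latency}s status={status} -> delay={delay}s")
```

In this run the fast 503 response leaves the delay unchanged, while the slow 200 response at the end pushes it back up.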

Setting Up AutoThrottle

To use the AutoThrottle extension, you first need to enable it in your settings.py file or in the spider itself:

In settings.py file:

## settings.py

DOWNLOAD_DELAY = 2 # minimum download delay
AUTOTHROTTLE_ENABLED = True

In spider:

# bookspider.py

import scrapy
from demo.items import BookItem


class BookSpider(scrapy.Spider):
    name = 'bookspider'
    start_urls = ["http://books.toscrape.com"]

    custom_settings = {
        'DOWNLOAD_DELAY': 2,  # minimum download delay
        'AUTOTHROTTLE_ENABLED': True,
    }

    def parse(self, response):
        pass

Then if you would like to customise the AutoThrottle extension you can use the following settings to configure it:

AUTOTHROTTLE_START_DELAY

The initial download delay in seconds. Default: 5.0 seconds.

AUTOTHROTTLE_MAX_DELAY

The maximum download delay in seconds the spider will use. It won't increase the download delay above this value even when experiencing high latencies. Default: 60.0 seconds.

AUTOTHROTTLE_TARGET_CONCURRENCY

The average number of requests the spider should be sending in parallel to the website at any point in time. Default: 1.0, meaning one request at a time on average.

The lower the AUTOTHROTTLE_TARGET_CONCURRENCY, the more polite your scraper.

AUTOTHROTTLE_DEBUG

When AUTOTHROTTLE_DEBUG is enabled, Scrapy will display stats about every response so you can monitor the download delays in real-time. Default: False.
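
Putting these settings together, a settings.py might look like the sketch below. The values are illustrative assumptions, mostly mirroring the defaults listed above, and should be tuned for the website you are crawling.

```python
# settings.py (illustrative values, not recommendations)

DOWNLOAD_DELAY = 2                     # minimum download delay in seconds
AUTOTHROTTLE_ENABLED = True            # turn the extension on
AUTOTHROTTLE_START_DELAY = 5.0         # initial download delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60.0          # upper bound on the download delay
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per website
AUTOTHROTTLE_DEBUG = True              # log throttling stats for every response
```

Leaving AUTOTHROTTLE_DEBUG on during a first test run makes it easier to see how the delay settles before you switch it off for larger crawls.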

For more information about how to configure the AutoThrottle extension, check out the official Scrapy docs.


More Scrapy Tutorials

So that's how you can add delays between requests in your Scrapy spiders.

If you would like to learn more about Scrapy, then be sure to check out The Scrapy Playbook.