

How To Solve Scrapy 503 Service Unavailable Errors

When scraping or crawling, getting a Scrapy 503 Service Unavailable Error is common and confusing, as it often isn't 100% clear what is causing it.

A Scrapy 503 Service Unavailable Error is logged when the backend server your spider is trying to connect to returns a 503 HTTP status code.

This means the server is currently unable to handle incoming requests, either because it is down for maintenance or because it is too overloaded with incoming requests to handle any more.

However, oftentimes when your spider gets this error, you can connect to the target website normally with your browser. This means that the server is likely returning the 503 HTTP status code to your scraper on purpose.

Most likely because the server believes you are a scraper and is blocking you.

In this guide we will walk you through how to troubleshoot Scrapy 503 Service Unavailable Errors and provide solutions that you can implement.

Let's begin...

Need help scraping the web?

Then check out ScrapeOps, the complete toolkit for web scraping.


Determine If Server Is Really Down

The first step you should take when trying to debug a Scrapy 503 Service Unavailable Error is to check whether it is a real 503 error or a fake 503 returned because the website thinks you are a scraper.

Checking this is as simple as requesting the same URLs with a real web browser, or with a headless browser (Selenium, Puppeteer, Playwright).
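
For example, here is a rough sketch (using the requests library rather than Scrapy) that compares the status code you get from a bare request with the one you get when sending browser-like headers. The URL is just a placeholder for whatever page your spider is requesting:


## check_status.py
import requests

url = 'http://quotes.toscrape.com/page/1/'  ## placeholder - use your target URL here

## A bare request, similar to what a default HTTP client sends
plain_response = requests.get(url)
print('Bare request:', plain_response.status_code)

## The same request with browser-like headers
browser_headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
}
browser_response = requests.get(url, headers=browser_headers)
print('Browser-like request:', browser_response.status_code)

## 503 on both requests = the server really is down or overloaded.
## 503 only on the bare request = you are most likely being blocked.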

If you CAN'T access the website, then the server is temporarily down for maintenance or is too busy. In cases like these, there isn't anything you can do.

The website's server is simply down, so you will just have to wait until it is live and fully operational again.

If you CAN access the website using a browser with no issues, then it is highly likely that the server isn't really down for maintenance or too busy.

Instead, the server has likely flagged your spider as a scraper and is blocking requests from your spider.

To solve this we need to figure out how the website is detecting us, and make our spider more stealthy.


Easy Way To Solve Scrapy 503 Errors

If the server is live, but you are getting Scrapy 503 Service Unavailable Errors then it is likely that the website is flagging your spider as a scraper and blocking your requests.

To avoid getting detected we need to optimise our spiders to bypass anti-bot countermeasures by:

  • Using Fake User Agents
  • Optimizing Request Headers
  • Using Proxies

We will discuss these below, however, the easiest way to fix this problem is to use a smart proxy solution like the ScrapeOps Proxy Aggregator.

ScrapeOps Proxy Aggregator

With the ScrapeOps Proxy Aggregator you simply need to send your requests to the ScrapeOps proxy endpoint and our Proxy Aggregator will optimise your request with the best user-agent, header and proxy configuration to ensure you don't get 503 errors from your target website.

Simply get your free API key by signing up for a free account here and edit your Scrapy spider as follows:


import scrapy
from urllib.parse import urlencode

API_KEY = 'YOUR_API_KEY'

def get_scrapeops_url(url):
    payload = {'api_key': API_KEY, 'url': url}
    proxy_url = 'https://proxy.scrapeops.io/v1/?' + urlencode(payload)
    return proxy_url

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=get_scrapeops_url(url), callback=self.parse)

    def parse(self, response):
        ## Parse the page as normal - the response comes back through the proxy
        pass


You can check out the full documentation here.

Or if you would prefer to try to optimize your user-agent, headers and proxy configuration yourself then read on and we will explain how to do it.


Use Fake User Agents

The most common reason for a website to block a Scrapy spider and return a 503 error is that your spider is telling the website it is an automated scraper.

This is largely because by default Scrapy tells the website that it is a scraper in the user-agent it sends with your request.

Unless you override the default Scrapy settings, your spider will send the following user-agent with every request:


user-agent: Scrapy/VERSION (+https://scrapy.org)

This tells the website that your requests are coming from a Scrapy spider, so it is very easy for them to block your requests and return a 503 status code.

Solution

The solution to this problem is to configure your spider to send a fake user-agent with every request. This way it is harder for the website to tell if your requests are coming from a scraper or a real user.

We wrote a full guide on how to set fake user-agents for your scrapers here, however, this is a quick summary of the solution:


Method 1: Set Fake User-Agent In Settings.py File

The easiest way to change the default Scrapy user-agent is to set a default user-agent in your settings.py file.

Simply uncomment the USER_AGENT value in the settings.py file and add a new user agent:

## settings.py

USER_AGENT = 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'

You can find a huge list of user-agents here.

This will only work for relatively small scrapes, as if you use the same user-agent on every single request, a website with a more sophisticated anti-bot solution could still easily detect your scraper.

To get around this we need to rotate through a large pool of fake user-agents so that every request looks unique.
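
For illustration, here is a minimal sketch of what rotating user-agents yourself could look like with a small custom downloader middleware. The short user-agent list and the demo.middlewares module path are just placeholder assumptions for an example project:


## middlewares.py

import random

## A short sample list - in practice you would load a much larger pool
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36',
    'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148',
]

class RandomUserAgentMiddleware:
    ## Attach a randomly chosen user-agent to every outgoing request
    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENT_LIST)


## settings.py

DOWNLOADER_MIDDLEWARES = {
    ## Disable Scrapy's default user-agent middleware
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    ## Enable the custom rotation middleware (adjust the module path to your project)
    'demo.middlewares.RandomUserAgentMiddleware': 400,
}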


Method 2: Use Scrapy-Fake-Useragent

You could gather a large list of fake user-agents and configure your spider to rotate through them yourself like this example, or you could use a Scrapy middleware like scrapy-fake-useragent.

scrapy-fake-useragent generates fake user-agents for your requests based on usage statistics from a real-world database, and attaches them to every request.

Getting scrapy-fake-useragent set up is simple. Simply install the Python package:


pip install scrapy-fake-useragent

Then in your settings.py file, you need to turn off the built-in UserAgentMiddleware and RetryMiddleware, and enable scrapy-fake-useragent's RandomUserAgentMiddleware and RetryUserAgentMiddleware.

## settings.py

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
    'scrapy_fake_useragent.middleware.RetryUserAgentMiddleware': 401,
}

And then enable the Fake User-Agent Providers by adding them to your settings.py file.

## settings.py

FAKEUSERAGENT_PROVIDERS = [
    'scrapy_fake_useragent.providers.FakeUserAgentProvider',  # This is the first provider we'll try
    'scrapy_fake_useragent.providers.FakerProvider',  # If FakeUserAgentProvider fails, use Faker to generate a user-agent string
    'scrapy_fake_useragent.providers.FixedUserAgentProvider',  # Fall back to the USER_AGENT value
]

## Set Fallback User-Agent
USER_AGENT = '<your user agent string which you will fall back to if all other providers fail>'


When activated, scrapy-fake-useragent will download a list of the most common user-agents from useragentstring.com and use a random one with every request, so you don't need to create your own list.

You can also add your own user-agent string providers, or configure it to generate new user-agent strings as a backup using Faker.

To see all the configuration options, check out the docs here.


Optimize Request Headers

In a lot of cases, just adding fake user-agents to your requests will solve the Scrapy 503 Service Unavailable Error. However, if the website has a more sophisticated anti-bot detection system in place, you will also need to optimize the request headers.

By default, Scrapy will only send basic request headers along with your requests such as Accept, Accept-Language, and User-Agent.


Accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
Accept-Language: 'en'
User-Agent: 'Scrapy/VERSION (+https://scrapy.org)'

In contrast, here are the request headers a Chrome browser running on a MacOS machine would send:


Connection: 'keep-alive'
Cache-Control: 'max-age=0'
sec-ch-ua: '" Not A;Brand";v="99", "Chromium";v="99", "Google Chrome";v="99"'
sec-ch-ua-mobile: '?0'
sec-ch-ua-platform: "macOS"
Upgrade-Insecure-Requests: 1
User-Agent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36'
Accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'
Sec-Fetch-Site: 'none'
Sec-Fetch-Mode: 'navigate'
Sec-Fetch-User: '?1'
Sec-Fetch-Dest: 'document'
Accept-Encoding: 'gzip, deflate, br'
Accept-Language: 'en-GB,en-US;q=0.9,en;q=0.8'

If the website is really trying to prevent web scrapers from accessing its content, then it will be analysing the request headers to make sure that the other headers match the user-agent you set, and that the request includes the other common headers a real browser would send.

Solution

To solve this, we need to make sure we optimize the request headers, including making sure the fake user-agent is consistent with the other headers.

This is a big topic, so if you would like to learn more about header optimization then check out our guide to header optimization.

However, here is a quick example of adding optimized headers to our requests:

# bookspider.py

import scrapy
from demo.items import BookItem


class BookSpider(scrapy.Spider):
    name = 'bookspider'
    url_list = ["http://books.toscrape.com"]

    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Cache-Control": "max-age=0",
    }

    def start_requests(self):
        for url in self.url_list:
            yield scrapy.Request(url=url, callback=self.parse, headers=self.HEADERS)

    def parse(self, response):
        for article in response.css('article.product_pod'):
            book_item = BookItem(
                url = article.css("h3 > a::attr(href)").get(),
                title = article.css("h3 > a::attr(title)").extract_first(),
                price = article.css(".price_color::text").extract_first(),
            )
            yield book_item

Here we are adding the same optimized header with a fake user-agent to every request.
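
If you would rather not hardcode the headers in each spider, the same headers can also be set globally. Here is a rough sketch using Scrapy's DEFAULT_REQUEST_HEADERS setting, with the user-agent left to the USER_AGENT setting since that is what Scrapy's user-agent middleware reads:


## settings.py

## Fake user-agent applied by Scrapy's user-agent middleware
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0'

## Browser-like headers attached to every request by default
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    'Cache-Control': 'max-age=0',
}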


Use Rotating Proxies

If the above solutions don't work then it is highly likely that the server has flagged your IP address as being used by a scraper and is either throttling your requests or completely blocking them.

This is especially likely if you are scraping at larger volumes, as it is easy for websites to detect scrapers if they are getting an unnaturally large amount of requests from the same IP address.
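
Before adding proxies, it can also help to reduce how aggressive your spider looks from a single IP. Here is a rough sketch using Scrapy's built-in delay and AutoThrottle settings (the exact values are just illustrative):


## settings.py

DOWNLOAD_DELAY = 2                   ## wait ~2 seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True      ## vary the delay between 0.5x and 1.5x
CONCURRENT_REQUESTS_PER_DOMAIN = 1   ## one request to the domain at a time

AUTOTHROTTLE_ENABLED = True          ## adjust the delay based on server load
AUTOTHROTTLE_START_DELAY = 2
AUTOTHROTTLE_MAX_DELAY = 30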

Solution

You will need to send your requests through a rotating proxy pool. We created a full guide on the various options you have when integrating & rotating proxies in your Scrapy spiders here.

However, here is one possible solution using the scrapy-rotating-proxies middleware.

To get started simply install the middleware:


pip install scrapy-rotating-proxies

Then we just need to update our settings.py to load in our proxies and enable the scrapy-rotating-proxies middleware:

## settings.py

## Insert Your List of Proxies Here
ROTATING_PROXY_LIST = [
    'proxy1.com:8000',
    'proxy2.com:8031',
    'proxy3.com:8032',
]

## Enable The Proxy Middleware In Your Downloader Middlewares
DOWNLOADER_MIDDLEWARES = {
    # ...
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
    # ...
}

And that's it. After this, every request your spider makes will be proxied using one of the proxies from the ROTATING_PROXY_LIST.
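
If you would rather keep your proxy list in a file instead of hardcoding it, scrapy-rotating-proxies can also load it from disk via ROTATING_PROXY_LIST_PATH (the path below is just an example, with one proxy per line in the file):


## settings.py

## Load proxies from a file instead of ROTATING_PROXY_LIST
ROTATING_PROXY_LIST_PATH = '/path/to/proxies.txt'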

If you need help finding the best & cheapest proxies for your particular use case then check out our proxy comparison tool here.

Alternatively, you could just use the ScrapeOps Proxy Aggregator as we discussed previously.


More Scrapy Tutorials

So that's how you can solve Scrapy 503 Service Unavailable Errors when you get them.

If you would like to learn more about Scrapy, then be sure to check out The Scrapy Playbook.