How To Solve Scrapy 403 Unhandled or Forbidden Errors
When scraping or crawling, a Scrapy 403 error is a common and confusing response, as it often isn't clear what is causing it.
It will typically look something like this in your logs:
2022-07-13 00:13:02 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://www.blablacar.in/ride-sharing/new-delhi/chandigarh> (referer: None)
2022-07-13 00:13:03 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://www.blablacar.in/ride-sharing/new-delhi/chandigarh>: HTTP status code is not handled or not allowed
Here, because Scrapy's built-in response handling doesn't handle the 403 status code, the logs don't give you any more context on what caused the error.
However, there are usually only two possible causes:
- The URL you are trying to scrape is forbidden, and you need to be authorised to access it.
- The website detects that you are a scraper and returns a 403 Forbidden HTTP status code as a ban page.
Most of the time it is the second cause, i.e. the website is blocking your requests because it thinks you are a scraper.
Scrapy 403 responses are common when you are trying to scrape websites protected by Cloudflare, as Cloudflare returns a 403 status code when it blocks a request.
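As a rough aid for telling the two causes apart, the sketch below guesses where a 403 came from by looking at the response headers. It assumes you have let the 403 reach your callback (for example by setting `handle_httpstatus_list = [403]` on the spider) and that you pass in decoded header names and values. The function name `classify_403` and its return strings are hypothetical, and the check is only a heuristic.

```python
def classify_403(status, headers):
    """Give a rough guess at why a response was refused.

    `headers` is a plain dict of decoded header names and values.
    The Cloudflare markers checked here (the `Server: cloudflare` and
    `CF-RAY` response headers) only show that the site sits behind
    Cloudflare; they do not prove that Cloudflare issued the block.
    """
    if status != 403:
        return "not a 403"

    # Header names are case-insensitive, so normalise before looking them up.
    lowered = {name.lower(): value for name, value in headers.items()}

    behind_cloudflare = (
        lowered.get("server", "").lower() == "cloudflare" or "cf-ray" in lowered
    )
    if behind_cloudflare:
        return "403 from a Cloudflare-protected site (likely an anti-bot block)"
    return "403 from the origin (check authorisation, then anti-bot measures)"
```

If the same URL opens normally in a browser, the anti-bot explanation becomes more likely whichever branch is returned.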
In this guide we will walk you through how to debug Scrapy 403 Forbidden Errors and provide solutions that you can implement.
- Easy Way To Solve Scrapy 403 Errors
- Randomising Your Request Delays
- Use Fake User Agents
- Optimize Request Headers
- Use Rotating Proxies
Let's begin...
Easy Way To Solve Scrapy 403 Errors
If the URL you are trying to scrape is normally accessible but you are getting Scrapy 403 Forbidden Errors, then it is likely that the website is flagging your spider as a scraper and blocking your requests.
To avoid getting detected we need to optimise our spiders to bypass anti-bot countermeasures by:
- Randomising Your Requests
- Using Fake User Agents
- Optimizing Request Headers
- Using Proxies
We will discuss these below. However, the easiest way to fix this problem is to use a smart proxy solution like the ScrapeOps Proxy Aggregator.
With the ScrapeOps Proxy Aggregator you simply need to send your requests to the ScrapeOps proxy endpoint and our Proxy Aggregator will optimise your request with the best user-agent, header and proxy configuration to ensure you don't get 403 errors from your target website.
Simply get your free API key by signing up for a free account here and edit your Scrapy spider as follows:
import scrapy
from urllib.parse import urlencode

API_KEY = 'YOUR_API_KEY'

def get_scrapeops_url(url):
    # Wrap the target URL in a request to the ScrapeOps proxy endpoint
    payload = {'api_key': API_KEY, 'url': url}
    proxy_url = 'https://proxy.scrapeops.io/v1/?' + urlencode(payload)
    return proxy_url

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=get_scrapeops_url(url), callback=self.parse)

    def parse(self, response):
        # Extract your data from the proxied response here
        pass
If you are getting blocked by Cloudflare, then you can simply activate ScrapeOps' Cloudflare Bypass by adding bypass=cloudflare_level_1 to the request:
import scrapy
from urllib.parse import urlencode

API_KEY = 'YOUR_API_KEY'

def get_scrapeops_url(url):
    # The extra 'bypass' parameter turns on the Cloudflare bypass
    payload = {'api_key': API_KEY, 'url': url, 'bypass': 'cloudflare_level_1'}
    proxy_url = 'https://proxy.scrapeops.io/v1/?' + urlencode(payload)
    return proxy_url

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=get_scrapeops_url(url), callback=self.parse)

    def parse(self, response):
        # Extract your data from the proxied response here
        pass
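The two versions of get_scrapeops_url above differ only in the bypass parameter, so one possible way to avoid keeping two copies is to make it optional. The sketch below is a minimal variant under that assumption. The `bypass=None` argument is our own addition for illustration, and the prints show what the generated proxy URLs look like.

```python
from urllib.parse import urlencode

API_KEY = 'YOUR_API_KEY'  # placeholder, as in the spiders above


def get_scrapeops_url(url, bypass=None):
    """Build the proxy URL, adding `bypass` only when one is requested."""
    payload = {'api_key': API_KEY, 'url': url}
    if bypass is not None:
        payload['bypass'] = bypass
    # urlencode percent-encodes the target URL so it survives as a query value
    return 'https://proxy.scrapeops.io/v1/?' + urlencode(payload)


plain = get_scrapeops_url('http://quotes.toscrape.com/page/1/')
bypassed = get_scrapeops_url('http://quotes.toscrape.com/page/1/', bypass='cloudflare_level_1')
print(plain)
print(bypassed)
```

Because the target URL is percent-encoded, its own query string (if it has one) is not confused with the proxy's parameters.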
Cloudflare is the most common anti-bot system being used by websites today, and bypassing it depends on which security settings the website has enabled.
To combat this, we offer 3 different Cloudflare bypasses designed to solve the Cloudflare challenges at each security level.
Security Level | Bypass | API Credits | Description |
---|---|---|---|
Low | cloudflare_level_1 | 10 | Use to bypass Cloudflare protected sites with low security settings enabled. |
Medium | cloudflare_level_2 | 35 | Use to bypass Cloudflare protected sites with medium security settings enabled. On large plans the credit multiple will be increased to maintain a flat rate of $3.50 per thousand requests. |
High | cloudflare_level_3 | 50 | Use to bypass Cloudflare protected sites with high security settings enabled. On large plans the credit multiple will be increased to maintain a flat rate of $4 per thousand requests. |
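To get a feel for what each level costs, the credit figures in the table can be turned into a quick estimate. This is only a sketch based on the per-request credits listed above; the helper name `credits_needed` is hypothetical, and it does not model the adjusted credit multiple that the table mentions for large plans.

```python
# Credits charged per request at each bypass level, copied from the table above.
CREDITS_PER_REQUEST = {
    'cloudflare_level_1': 10,
    'cloudflare_level_2': 35,
    'cloudflare_level_3': 50,
}


def credits_needed(num_requests, bypass):
    """Return the API credits consumed by `num_requests` at the given bypass level."""
    return num_requests * CREDITS_PER_REQUEST[bypass]


for level in CREDITS_PER_REQUEST:
    print(level, credits_needed(1000, level))
```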
You can check out the full documentation here.
Or if you would prefer to try to optimize your user-agent, headers and proxy configuration yourself then read on and we will explain how to do it.
Randomising Your Request Delays
If you send a request to a website from the same IP address every second, the website can easily detect you and flag you as a scraper.
Instead, you should space out your requests over a longer period of time and randomise when they are sent.
Doing this in Scrapy is very simple using the DOWNLOAD_DELAY setting.
By default, your Scrapy project's DOWNLOAD_DELAY setting is set to 0, which means that it sends each request consecutively to the same website without any delay between requests.
However, you can space out and randomize your requests by giving DOWNLOAD_DELAY a non-zero value (in seconds) in your settings.py file:
## settings.py
DOWNLOAD_DELAY = 2 # 2 seconds of delay
When DOWNLOAD_DELAY is non-zero, Scrapy will wait a random interval of between 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY between each request.
This is because, by default, RANDOMIZE_DOWNLOAD_DELAY is set to True.
If your scraping job isn't big and you aren't under heavy time pressure, then it is recommended to set a high DOWNLOAD_DELAY, as this will minimize the load on the website and reduce your chances of getting blocked.
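To make the trade-off concrete, here is a small standalone sketch of the delay range described above and of how much waiting a given DOWNLOAD_DELAY adds to a crawl. It is not Scrapy's own code. The helper names are hypothetical, and the estimate assumes every request goes to the same site and ignores response times.

```python
import random


def sample_delay(download_delay):
    """Draw one wait time in the range described above: 0.5x to 1.5x the setting."""
    return random.uniform(0.5 * download_delay, 1.5 * download_delay)


def estimate_wait_seconds(num_requests, download_delay):
    """Rough estimate of total waiting time: the average wait equals download_delay."""
    return num_requests * download_delay


delays = [sample_delay(2) for _ in range(5)]
print([round(d, 2) for d in delays])             # five values between 1.0 and 3.0
print(estimate_wait_seconds(10_000, 2) / 3600)   # hours spent waiting
```

With a 2 second delay, 10,000 requests already spend several hours waiting, which is why a high delay suits smaller jobs best.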
Use Fake User Agents
The most common reason for a website to block a Scrapy spider and return a 403 error is that your spider is telling the website it is an automated scraper.
This is largely because by default Scrapy tells the website that it is a scraper in the user-agent it sends with your request.
Unless you override the default Scrapy settings, your spider will send the following user-agent with every request:
user-agent: Scrapy/VERSION (+https://scrapy.org)
This tells the website that your requests are coming from a Scrapy spider, so it is very easy for them to block your requests and return a 403 status code.
Solution
The solution to this problem is to configure your spider to send a fake user-agent with every request. This way it is harder for the website to tell if your requests are coming from a scraper or a real user.
We wrote a full guide on how to set fake user-agents for your scrapers here. However, this is a quick summary of the solution:
Method 1: Set Fake User-Agent In Settings.py File
The easiest way to change the default Scrapy user-agent is to set a default user-agent in your settings.py file.
Simply uncomment the USER_AGENT value in the settings.py file and add a new user agent:
## settings.py
USER_AGENT = 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'
You can find a huge list of user-agents here.
This will only work on relatively small scrapes: if you use the same user-agent on every single request, a website with a more sophisticated anti-bot solution could still easily detect your scraper.
To get around this we need to rotate through a large pool of fake user-agents so that every request looks unique.
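As a minimal sketch of what rotating through a pool looks like, the class below follows the shape of a Scrapy downloader middleware (a `process_request(request, spider)` method) and picks a random user-agent for each request. The class name `RotateUserAgentMiddleware` and the three-entry pool are illustrative assumptions; a real pool would be much larger.

```python
import random

# A tiny pool for illustration; these strings appear elsewhere in this guide.
USER_AGENTS = [
    'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0',
]


class RotateUserAgentMiddleware:
    """Hypothetical downloader middleware that picks a user-agent per request."""

    def process_request(self, request, spider):
        # Overwrite whatever User-Agent was set before this middleware ran.
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
        return None  # returning None lets Scrapy continue processing the request
```

To try it, you would register the class in DOWNLOADER_MIDDLEWARES under your own project's module path.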
Method 2: Use Scrapy-Fake-Useragent
You could gather a large list of fake user-agents and configure your spider to rotate through them yourself like this example, or you could use a Scrapy middleware like scrapy-fake-useragent.
scrapy-fake-useragent generates fake user-agents for your requests based on usage statistics from a real-world database, and attaches them to every request.
Getting scrapy-fake-useragent set up is simple. Just install the Python package:
pip install scrapy-fake-useragent
Then in your settings.py file, you need to turn off the built-in UserAgentMiddleware and RetryMiddleware, and enable scrapy-fake-useragent's RandomUserAgentMiddleware and RetryUserAgentMiddleware.
## settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
    'scrapy_fake_useragent.middleware.RetryUserAgentMiddleware': 401,
}
And then enable the Fake User-Agent Providers by adding them to your settings.py file.
## settings.py
FAKEUSERAGENT_PROVIDERS = [
    'scrapy_fake_useragent.providers.FakeUserAgentProvider',  # This is the first provider we'll try
    'scrapy_fake_useragent.providers.FakerProvider',  # If FakeUserAgentProvider fails, we'll use faker to generate a user-agent string for us
    'scrapy_fake_useragent.providers.FixedUserAgentProvider',  # Fall back to USER_AGENT value
]
## Set Fallback User-Agent
USER_AGENT = '<your user agent string which you will fall back to if all other providers fail>'
When activated, scrapy-fake-useragent will download a list of the most common user-agents from useragentstring.com and use a random one with every request, so you don't need to create your own list.
You can also add your own user-agent string providers, or configure it to generate new user-agent strings as a backup using Faker.
To see all the configuration options, check out the docs here.
Optimize Request Headers
In a lot of cases, just adding fake user-agents to your requests will solve the Scrapy 403 Forbidden Error. However, if the website has a more sophisticated anti-bot detection system in place, you will also need to optimize the request headers.
By default, Scrapy will only send basic request headers along with your requests, such as Accept, Accept-Language, and User-Agent.
Accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
Accept-Language: 'en'
User-Agent: 'Scrapy/VERSION (+https://scrapy.org)'
In contrast, here are the request headers a Chrome browser running on a macOS machine would send:
Connection: 'keep-alive'
Cache-Control: 'max-age=0'
sec-ch-ua: '" Not A;Brand";v="99", "Chromium";v="99", "Google Chrome";v="99"'
sec-ch-ua-mobile: '?0'
sec-ch-ua-platform: "macOS"
Upgrade-Insecure-Requests: 1
User-Agent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36'
Accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'
Sec-Fetch-Site: 'none'
Sec-Fetch-Mode: 'navigate'
Sec-Fetch-User: '?1'
Sec-Fetch-Dest: 'document'
Accept-Encoding: 'gzip, deflate, br'
Accept-Language: 'en-GB,en-US;q=0.9,en;q=0.8'
If the website is really trying to prevent web scrapers from accessing their content, then they will be analysing the request headers to make sure that the other headers match the user-agent you set, and that the request includes other common headers a real browser would send.
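To illustrate the kind of mismatch such a check can catch, here is a simplified sketch that compares the User-Agent with the sec-ch-ua client-hint headers. It is not how any particular anti-bot product works. The function name and the platform mapping are our own assumptions, and it relies on Firefox not sending sec-ch-ua headers, which holds at the time of writing.

```python
# Platform names a Chromium browser reports in sec-ch-ua-platform, mapped to
# the token expected in the matching User-Agent string.
PLATFORM_TOKENS = {
    'macOS': 'Macintosh',
    'Windows': 'Windows NT',
    'Linux': 'Linux',
}


def headers_look_consistent(headers):
    """Return False when the client-hint headers contradict the User-Agent."""
    user_agent = headers.get('User-Agent', '')
    platform = headers.get('sec-ch-ua-platform', '').strip('"')

    # Firefox does not send sec-ch-ua client hints, so their presence
    # alongside a Firefox user-agent is a mismatch.
    has_client_hints = any(name.lower().startswith('sec-ch-ua') for name in headers)
    if 'Firefox/' in user_agent and has_client_hints:
        return False

    # If a known platform hint is present, the User-Agent should name the same OS.
    if platform in PLATFORM_TOKENS and PLATFORM_TOKENS[platform] not in user_agent:
        return False
    return True
```

Running your own header set through a check like this before a crawl is a cheap way to spot a user-agent that was swapped without updating the headers around it.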
Solution
To solve this, we need to make sure we optimize the request headers, including making sure the fake user-agent is consistent with the other headers.
This is a big topic, so if you would like to learn more about header optimization then check out our guide to header optimization.
However, here is a quick example of adding optimized headers to our requests:
# bookspider.py
import scrapy
from demo.items import BookItem

class BookSpider(scrapy.Spider):
    name = 'bookspider'
    url_list = ["http://books.toscrape.com"]

    # Browser-like headers that match the Firefox user-agent below
    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Cache-Control": "max-age=0",
    }

    def start_requests(self):
        for url in self.url_list:
            # Yield one request per URL, each carrying the headers above
            yield scrapy.Request(url=url, callback=self.parse, headers=self.HEADERS)

    def parse(self, response):
        for article in response.css('article.product_pod'):
            book_item = BookItem(
                url = article.css("h3 > a::attr(href)").get(),
                title = article.css("h3 > a::attr(title)").get(),
                price = article.css(".price_color::text").get(),
            )
            yield book_item
Here we are adding the same optimized headers with a fake user-agent to every request.
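If you would rather not pass headers on every request, one alternative is to set them project-wide with Scrapy's USER_AGENT and DEFAULT_REQUEST_HEADERS settings. The fragment below is a sketch that reuses the Firefox header values from the spider example; treat the exact values as placeholders to adapt.

```python
## settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0'

# Sent with every request unless a request sets the same header itself
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    'Cache-Control': 'max-age=0',
}
```

Headers passed directly to a Request still take precedence, so individual spiders can override these defaults where needed.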
Use Rotating Proxies
If the above solutions don't work then it is highly likely that the server has flagged your IP address as being used by a scraper and is either throttling your requests or completely blocking them.
This is especially likely if you are scraping at larger volumes, as it is easy for websites to detect scrapers if they are getting an unnaturally large number of requests from the same IP address.
Solution
You will need to send your requests through a rotating proxy pool. We created a full guide on the various options you have when integrating & rotating proxies in your Scrapy spiders here.
However, here is one possible solution using the scrapy-rotating-proxies middleware.
To get started simply install the middleware:
pip install scrapy-rotating-proxies
Then we just need to update our settings.py to load in our proxies and enable the scrapy-rotating-proxies middleware:
## settings.py
## Insert Your List of Proxies Here
ROTATING_PROXY_LIST = [
    'proxy1.com:8000',
    'proxy2.com:8031',
    'proxy3.com:8032',
]
## Enable The Proxy Middleware In Your Downloader Middlewares
DOWNLOADER_MIDDLEWARES = {
    # ...
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
    # ...
}
And that's it. After this, all requests your spider makes will be proxied using one of the proxies from the ROTATING_PROXY_LIST.
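For intuition about what rotation amounts to, the sketch below assigns proxies in round-robin order by writing to `request.meta['proxy']`, which is the key Scrapy reads to route a request through a proxy. It is a simplified stand-in rather than how scrapy-rotating-proxies is implemented. The class name is hypothetical, the proxy URLs are placeholders, and it does no ban detection.

```python
from itertools import cycle

# Placeholder proxies, matching the ROTATING_PROXY_LIST example.
PROXIES = [
    'http://proxy1.com:8000',
    'http://proxy2.com:8031',
    'http://proxy3.com:8032',
]


class RoundRobinProxyMiddleware:
    """Hypothetical downloader middleware that assigns proxies in turn."""

    def __init__(self):
        self._proxies = cycle(PROXIES)

    def process_request(self, request, spider):
        # Scrapy routes the request through the URL stored under
        # the 'proxy' key of request.meta.
        request.meta['proxy'] = next(self._proxies)
        return None  # let Scrapy continue processing the request
```

A dedicated middleware is still the better choice for real crawls, since it can also retire proxies that have been banned.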
If you need help finding the best & cheapest proxies for your particular use case then check out our proxy comparison tool here.
Alternatively, you could just use the ScrapeOps Proxy Aggregator as we discussed previously.
More Scrapy Tutorials
So that's how you can solve Scrapy 403 Unhandled & Forbidden Errors when you get them.
If you would like to know more about bypassing the most common anti-bots then check out our bypass guides here:
If you would like to learn more about Scrapy, then be sure to check out The Scrapy Playbook.