Python HTTPX: Retry Failed Requests
In this guide for The Python Web Scraping Playbook, we will look at how to configure Python HTTPX to retry failed requests so you can build a more reliable system.
There are a couple of ways to approach this, so in this guide we will walk you through the 2 most common ways to retry failed requests and show you how to use them with Python HTTPX:

- Retry Failed Requests Using a Retry Strategy
- Custom Retry Logic
Let's begin...
Need help scraping the web?
Then check out ScrapeOps, the complete toolkit for web scraping.
Retry Failed Requests Using a Retry Strategy
The easiest way to configure your Python HTTPX scraper to retry failed requests is to use HTTPX's built-in transport retries, by passing a `retries` argument to `httpx.HTTPTransport`.

Note that, unlike Python Requests, HTTPX is built on top of the httpcore library rather than urllib3, so it does not expose urllib3's `Retry` class. Transport-level retries only cover connection failures (for example, DNS errors or refused connections); they do not retry on bad HTTP status codes.
Here is an example:

```python
import httpx

# Retry failed connection attempts up to 3 times.
# Transport retries do NOT retry on bad HTTP status codes.
transport = httpx.HTTPTransport(retries=3)

def make_request():
    url = "http://quotes.toscrape.com/"
    with httpx.Client(transport=transport) as client:
        response = client.get(url)
        if response.status_code == 200:
            return response.text
        else:
            return None

# Make request
data = make_request()
if data is not None:
    print(data)
else:
    print("Request failed after retries.")
```
Here are the available settings:

- `total`: The maximum number of retries to attempt after the initial request. In the sketch above, the default value is `3`.
- `backoff_factor`: The factor by which the delay increases after each retry. For example, if set to `0.5`, the delays run 0.5s, 1s, 2s, 4s and so on. Setting it to `0` retries immediately with no delay.
- `backoff_max`: The maximum delay (in seconds) between retries. If the delay calculated from the `backoff_factor` exceeds this value, it is capped at `backoff_max`.
- `status_forcelist`: A list of HTTP status codes that should trigger a retry. For example, `[500, 502, 503, 504]` would retry requests whenever any of these status codes is received.
- `allowed_methods`: A list of HTTP methods for which retries should be attempted. For example, `["GET", "POST"]` would retry only `GET` and `POST` requests, while `None` retries all methods. (If you know urllib3's `Retry` class, this is the option it called `method_whitelist` in older versions.)

If you would rather raise an exception than return a failed response once retries are exhausted, you can call `response.raise_for_status()` on the result, which raises `httpx.HTTPStatusError` for 4xx and 5xx responses.
Using the `backoff_factor` we can configure our script to exponentially increase the delay between each retry.

Here is the backoff algorithm:

```
{backoff_factor} * (2 ** ({number_of_retries} - 1))
```
Here are some example sleep sequences different backoff factors will produce:

```
## backoff_factor = 0.5
0.5, 1, 2, 4, 8, 16, 32, 64, 128, 256

## backoff_factor = 1
1, 2, 4, 8, 16, 32, 64, 128, 256, 512

## backoff_factor = 2
2, 4, 8, 16, 32, 64, 128, 256, 512, 1024
```
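If you want to sanity-check a backoff schedule before using it, you can print the delays directly with a throwaway snippet like this:

```python
backoff_factor = 0.5

# Delay (in seconds) before each of the first 10 retries
for number_of_retries in range(1, 11):
    print(backoff_factor * (2 ** (number_of_retries - 1)))
```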
Custom Retry Logic
In certain applications, retrying based solely on the returned status code is enough. However, when it comes to web scraping, a website will often return a ban page as a successful 200 response. So you might need custom retry logic that checks the response content itself.

In the following example, we check the response for text that indicates it is a ban page. If that text is present, we retry the request.
```python
import time

import httpx

MAX_RETRIES = 3
BACKOFF_FACTOR = 0.5

def make_request():
    url = "http://quotes.toscrape.com/"
    with httpx.Client() as client:
        for attempt in range(MAX_RETRIES + 1):
            response = client.get(url)
            if response.status_code == 200:
                # A ban page can come back as a "successful" 200 response,
                # so check the body before accepting it
                if '<title>Robot or human?</title>' in response.text:
                    print("Retrying due to ban page in response")
                else:
                    return response
            elif response.status_code == 404:
                # The page doesn't exist, so retrying won't help
                return response
            else:
                print("Retrying due to non-200 status code")
            if attempt < MAX_RETRIES:
                # Exponential backoff before the next attempt
                time.sleep(BACKOFF_FACTOR * (2 ** attempt))
    return None

# Example usage
response = make_request()
if response is not None:
    print(response.text)
else:
    print("Request failed after retries.")
```
The advantage of this approach is that you have a lot of control over what counts as a failed response.
In this example, we define a `make_request()` function that performs the `GET` request. Inside the function, we attempt the request up to `MAX_RETRIES + 1` times (the initial request plus the retries).

After each request attempt, we examine the response. If the status code is `200`, we assume a successful response and proceed to check the content for specific conditions. In this case, we check if the HTML response contains the ban page's title text.

If the ban condition is met, we print a message and continue with the next retry iteration. Otherwise, we return the response.

If the status code is `404`, we assume the page doesn't exist and return the `404` response, since retrying won't change the outcome.

If the status code is neither `200` nor `404`, we print a message, sleep for an exponentially increasing delay, and proceed to the next retry iteration.

If the maximum number of retries is reached without a successful response, we return `None` to indicate that the request failed after retries.

You can modify the content-based conditions within the `make_request()` function based on your specific requirements to trigger retries.
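If you would rather not maintain the retry loop yourself, a third-party library like Tenacity can handle the retry and backoff mechanics while you keep the content check. Here is a rough sketch of the same ban-page logic; it assumes you have installed Tenacity separately (`pip install tenacity`):

```python
import httpx
from tenacity import retry, retry_if_result, stop_after_attempt, wait_exponential

def is_bad_response(response):
    # Treat ban pages and unexpected status codes as failures worth retrying
    if response.status_code == 200:
        return '<title>Robot or human?</title>' in response.text
    return response.status_code != 404

@retry(
    retry=retry_if_result(is_bad_response),
    stop=stop_after_attempt(4),                     # 1 initial attempt + 3 retries
    wait=wait_exponential(multiplier=0.5, max=60),  # exponential backoff, capped at 60s
    retry_error_callback=lambda retry_state: None,  # return None once retries run out
)
def make_request():
    with httpx.Client() as client:
        return client.get("http://quotes.toscrape.com/")

response = make_request()
if response is not None:
    print(response.text)
else:
    print("Request failed after retries.")
```

The decorator-based approach keeps the request function itself clean, with all the retry policy declared in one place.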
More Web Scraping Tutorials
So that's how you can configure Python HTTPX to automatically retry failed requests.
If you would like to learn more about Web Scraping, then be sure to check out The Web Scraping Playbook.
Or check out one of our more in-depth guides: