Python HTTPX: Retry Failed Requests
In this guide for The Python Web Scraping Playbook, we will look at how to configure Python HTTPX to retry failed requests so you can build a more reliable system.
There are a couple of ways to approach this, so in this guide we will walk you through the 2 most common ways to retry failed requests and show you how to use them with Python HTTPX:

- Retry Failed Requests Using a Retry Strategy
- Custom Retry Logic
Let's begin...
Need help scraping the web?
Then check out ScrapeOps, the complete toolkit for web scraping.
Retry Failed Requests Using a Retry Strategy
The easiest way to configure your Python HTTPX scraper to retry failed requests is to use HTTPX's built-in transport retries, by passing a `retries` argument to `httpx.HTTPTransport`.

Note that, unlike Python Requests, HTTPX is built on top of the httpcore library rather than urllib3, so it does not expose urllib3's `Retry` class. Transport-level retries only cover connection failures (for example, DNS errors or refused connections); they do not retry on bad HTTP status codes.
Here is an example:

```python
import httpx

# Retry failed connection attempts up to 3 times.
# Transport retries do NOT retry on bad HTTP status codes.
transport = httpx.HTTPTransport(retries=3)

def make_request():
    url = "http://quotes.toscrape.com/"
    with httpx.Client(transport=transport) as client:
        response = client.get(url)
        if response.status_code == 200:
            return response.text
        else:
            return None

# Make request
data = make_request()
if data is not None:
    print(data)
else:
    print("Request failed after retries.")
```
Here are the available settings:

- `total`: The maximum number of retries to attempt after the initial request. In the sketch above, the default value is `3`.
- `backoff_factor`: The factor by which the delay increases after each retry. For example, if set to `0.5`, the delays run 0.5s, 1s, 2s, 4s and so on. Setting it to `0` retries immediately with no delay.
- `backoff_max`: The maximum delay (in seconds) between retries. If the delay calculated from the `backoff_factor` exceeds this value, it is capped at `backoff_max`.
- `status_forcelist`: A list of HTTP status codes that should trigger a retry. For example, `[500, 502, 503, 504]` would retry requests whenever any of these status codes is received.
- `allowed_methods`: A list of HTTP methods for which retries should be attempted. For example, `["GET", "POST"]` would retry only `GET` and `POST` requests, while `None` retries all methods. (If you know urllib3's `Retry` class, this is the option it called `method_whitelist` in older versions.)

If you would rather raise an exception than return a failed response once retries are exhausted, you can call `response.raise_for_status()` on the result, which raises `httpx.HTTPStatusError` for 4xx and 5xx responses.
Using the `backoff_factor` we can configure our script to exponentially increase the delay between each retry.

Here is the backoff algorithm:

```
{backoff_factor} * (2 ** ({number_of_retries} - 1))
```
Here are some example sleep sequences different backoff factors will produce:

```
## backoff_factor = 0.5
0.5, 1, 2, 4, 8, 16, 32, 64, 128, 256

## backoff_factor = 1
1, 2, 4, 8, 16, 32, 64, 128, 256, 512

## backoff_factor = 2
2, 4, 8, 16, 32, 64, 128, 256, 512, 1024
```
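If you want to sanity-check a backoff schedule before using it, you can print the delays directly with a throwaway snippet like this:

```python
backoff_factor = 0.5

# Delay (in seconds) before each of the first 10 retries
for number_of_retries in range(1, 11):
    print(backoff_factor * (2 ** (number_of_retries - 1)))
```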
Custom Retry Logic
In certain applications, retrying based solely on the returned status code is enough. However, when it comes to web scraping, a website will often return a ban page as a successful 200 response. So you might need custom retry logic that checks the response content itself.

In the following example, we check the response for text that indicates it is a ban page. If that text is present, we retry the request.
```python
import time

import httpx

MAX_RETRIES = 3
BACKOFF_FACTOR = 0.5

def make_request():
    url = "http://quotes.toscrape.com/"
    with httpx.Client() as client:
        for attempt in range(MAX_RETRIES + 1):
            response = client.get(url)
            if response.status_code == 200:
                # A ban page can come back as a "successful" 200 response,
                # so check the body before accepting it
                if '<title>Robot or human?</title>' in response.text:
                    print("Retrying due to ban page in response")
                else:
                    return response
            elif response.status_code == 404:
                # The page doesn't exist, so retrying won't help
                return response
            else:
                print("Retrying due to non-200 status code")
            if attempt < MAX_RETRIES:
                # Exponential backoff before the next attempt
                time.sleep(BACKOFF_FACTOR * (2 ** attempt))
    return None

# Example usage
response = make_request()
if response is not None:
    print(response.text)
else:
    print("Request failed after retries.")
```
The advantage of this approach is that you have a lot of control over what counts as a failed response.
In this example, we define a `make_request()` function that performs the `GET` request. Inside the function, we attempt the request up to `MAX_RETRIES + 1` times (the initial request plus the retries).

After each request attempt, we examine the response. If the status code is `200`, we assume a successful response and proceed to check the content for specific conditions. In this case, we check if the HTML response contains the ban page's title text.

If the ban condition is met, we print a message and continue with the next retry iteration. Otherwise, we return the response.

If the status code is `404`, we assume the page doesn't exist and return the `404` response, since retrying won't change the outcome.

If the status code is neither `200` nor `404`, we print a message, sleep for an exponentially increasing delay, and proceed to the next retry iteration.

If the maximum number of retries is reached without a successful response, we return `None` to indicate that the request failed after retries.

You can modify the content-based conditions within the `make_request()` function based on your specific requirements to trigger retries.
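If you would rather not maintain the retry loop yourself, a third-party library like Tenacity can handle the retry and backoff mechanics while you keep the content check. Here is a rough sketch of the same ban-page logic; it assumes you have installed Tenacity separately (`pip install tenacity`):

```python
import httpx
from tenacity import retry, retry_if_result, stop_after_attempt, wait_exponential

def is_bad_response(response):
    # Treat ban pages and unexpected status codes as failures worth retrying
    if response.status_code == 200:
        return '<title>Robot or human?</title>' in response.text
    return response.status_code != 404

@retry(
    retry=retry_if_result(is_bad_response),
    stop=stop_after_attempt(4),                     # 1 initial attempt + 3 retries
    wait=wait_exponential(multiplier=0.5, max=60),  # exponential backoff, capped at 60s
    retry_error_callback=lambda retry_state: None,  # return None once retries run out
)
def make_request():
    with httpx.Client() as client:
        return client.get("http://quotes.toscrape.com/")

response = make_request()
if response is not None:
    print(response.text)
else:
    print("Request failed after retries.")
```

The decorator-based approach keeps the request function itself clean, with all the retry policy declared in one place.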
More Web Scraping Tutorials
So that's how you can configure Python HTTPX to automatically retry failed requests.
If you would like to learn more about Web Scraping, then be sure to check out The Web Scraping Playbook.
Or check out one of our more in-depth guides: