Python HTTPX: How to Use & Rotate Proxies
To use proxies with Python HTTPX, create a `proxies` dictionary and pass it into the `proxies` parameter of your request.
```python
import httpx

proxies = {
    'http://': 'http://proxy.example.com:8080',
    'https://': 'https://proxy.example.com:8081',
}

response = httpx.get('http://example.com', proxies=proxies)
```
In this guide for The Python Web Scraping Playbook, we will look at how to integrate the 3 most common types of proxies into our Python HTTPX based web scraper.
Although HTTPX is similar to the Python Requests library, the way you integrate proxies into Python HTTPX based scrapers is slightly different. Using proxies allows you to spread your requests over multiple IP addresses, making it harder for websites to detect & block your web scrapers.
In this guide we will walk you through the 3 most common proxy integration methods and show you how to use them with Python HTTPX:
- Using Proxy IPs With Python HTTPX
- Proxy Authentication With Python HTTPX
- Proxy Routing With Python HTTPX
- Using Proxies With Client Instances
- The 3 Most Common Proxy Formats
- Proxy Integration #1: Rotating Through Proxy IP List
- Proxy Integration #2: Using Proxy Gateway
- Proxy Integration #3: Using Proxy API Endpoint
If you would like to know how to integrate proxies into your Python Requests or Python Scrapy scrapers then check out our Python Requests proxies guide here and Scrapy proxies guide here.
Let's begin...
Using Proxy IPs With Python HTTPX
Using a proxy with Python HTTPX is very straightforward. We simply need to create a `proxies` dictionary and pass it into the `proxies` parameter of our Python HTTPX request.
```python
import httpx

proxies = {
    'http://': 'http://proxy.example.com:8080',
    'https://': 'https://proxy.example.com:8081',
}

response = httpx.get('http://example.com', proxies=proxies)
```
This method will work for all the request methods Python HTTPX supports: `GET`, `POST`, `PUT`, `DELETE`, `PATCH`, and `HEAD`.
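For example, a `POST` request can be routed through the same `proxies` dictionary. Here is a minimal sketch, using the placeholder proxy URLs from above and httpbin.org as a test target:

```python
import httpx

# Placeholder proxy URLs - swap in your provider's details.
proxies = {
    'http://': 'http://proxy.example.com:8080',
    'https://': 'https://proxy.example.com:8081',
}

# A POST request is routed through the proxy exactly like a GET request.
response = httpx.post('https://httpbin.org/post', data={'key': 'value'}, proxies=proxies)
print(response.status_code)
```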
Proxy Authentication With Python HTTPX
Some proxy IPs require authentication in the form of a `username` and `password` to use the proxy.
To authenticate our proxy we simply need to add the `username` and `password` to the proxy strings.
```python
import httpx

proxies = {
    'http://': 'http://USERNAME:PASSWORD@proxy.example.com:8080',
    'https://': 'https://USERNAME:PASSWORD@proxy.example.com:8081',
}

response = httpx.get('http://example.com', proxies=proxies)
```
Proxy Routing With Python HTTPX
A useful feature of HTTPX is that it gives you the ability to control which requests use a proxy and which don't.
The `proxies` dictionary maps URL patterns to proxy URLs, matching from the most specific pattern (e.g. `https://<domain>:<port>`) to the least specific one (e.g. `https://`).
HTTPX supports routing proxies based on scheme, domain, port, or a combination of these.
For more details on this functionality check out the official docs. However, here are some useful examples:
Wildcard Routing
Route everything through a proxy.
```python
proxies = {
    "all://": "http://85.237.57.198:20000",
}
```
Scheme Routing
Route `HTTP` requests through one proxy, and `HTTPS` requests through another.
```python
proxies = {
    "http://": "http://85.237.57.198:20000",
    "https://": "http://85.237.57.198:20001",
}
```
Domain Routing
Domain routing allows you to control which proxy should be used for specific domains.
For example, you could tell HTTPX to route all requests to Walmart.com through BrightData's proxy, all other requests through ScraperAPI, but any request to Target.com shouldn't use a proxy at all.
```python
proxies = {
    "all://*target.com": None,  ## No Proxy
    "all://*walmart.com": "http://USERNAME:PASSWORD@zproxy.lum-superproxy.io:22225",  ## BrightData
    "all://": "http://scraperapi:APIKEY@proxy-server.scraperapi.com:8001",  ## ScraperAPI
}
```
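As a quick usage sketch, you then pass the routing dictionary into your requests as usual and HTTPX picks the proxy per request based on the target domain (credentials here are the placeholders from the example above):

```python
import httpx

# The routing dictionary from above (placeholder credentials).
proxies = {
    "all://*target.com": None,
    "all://*walmart.com": "http://USERNAME:PASSWORD@zproxy.lum-superproxy.io:22225",
    "all://": "http://scraperapi:APIKEY@proxy-server.scraperapi.com:8001",
}

response = httpx.get("https://www.walmart.com/", proxies=proxies)  # routed via BrightData
response = httpx.get("https://www.target.com/", proxies=proxies)   # no proxy used
```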
Using Proxies With Client Instances
If you are using Python HTTPX Client functionality then you will need to add your proxy in a slightly different way.
If you are coming from Python Requests, `httpx.Client()` is the HTTPX version of `requests.Session()`.
Instead of defining the proxies in our `httpx.get()` request as we've seen previously, we initialize our HTTPX `Client` object and set it to use our proxies.
```python
import httpx

proxies = {
    'http://': 'http://USERNAME:PASSWORD@proxy.example.com:8080',
    'https://': 'https://USERNAME:PASSWORD@proxy.example.com:8081',
}

with httpx.Client(proxies=proxies) as client:
    response = client.get('https://httpbin.org/ip')
    print(response.text)
```
When using Python HTTPX Client functionality with a proxy, the proxy IP address you set remains constant, unless you are using a proxy gateway that manages the proxy rotation on their end.
If you are using a single static proxy IP with the `Client` functionality then the proxy might get blocked, as every request will be using it.
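If you do want to rotate IPs while using the `Client` functionality, one simple approach is to open a short-lived client with a different proxy for each request or batch of requests. Here is a minimal sketch, assuming a placeholder list of authenticated proxy URLs:

```python
import httpx
import random

# Placeholder proxy URLs - swap in your provider's credentials.
proxy_list = [
    'http://USERNAME:PASSWORD@proxy.example.com:8080',
    'http://USERNAME:PASSWORD@proxy.example.com:8081',
]

urls = ['https://httpbin.org/ip', 'https://httpbin.org/ip']

for url in urls:
    # Pick a random proxy and open a short-lived client that uses it.
    proxy = random.choice(proxy_list)
    with httpx.Client(proxies={'all://': proxy}) as client:
        response = client.get(url)
        print(response.text)
```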
The 3 Most Common Proxy Formats
That covers the basics of integrating a proxy into Python HTTPX. In the next sections we will show you how to integrate Python HTTPX with the 3 most common proxy formats:
- Rotating Through List of Proxy IPs
- Using Proxy Gateways
- Using Proxy APIs
A couple of years ago, proxy providers would sell you a list of proxy IP addresses and you would configure your scraper to rotate through these IP addresses and use a new one with each request.
However, today more and more proxy providers don't sell raw lists of proxy IP addresses anymore. Instead, they provide access to their proxy pools via proxy gateways or proxy API endpoints.
We will look at how to integrate with all 3 proxy formats.
If you are looking to find a good proxy provider then check out our web scraping proxy comparison tool where you can compare the plans of all the major proxy providers.
Proxy Integration #1: Rotating Through Proxy IP List
Here a proxy provider will normally provide you with a list of proxy IP addresses that you will need to configure your scraper to rotate through and select a new IP address for every request.
The proxy list you receive will look something like this:
```python
'http://Username:Password@85.237.57.198:20000',
'http://Username:Password@85.237.57.198:21000',
'http://Username:Password@85.237.57.198:22000',
'http://Username:Password@85.237.57.198:23000',
```
To integrate them into our scrapers we need to configure our code to pick a random proxy from this list every time we make a request.
In our Python HTTPX scraper we could do it like this:
```python
import httpx
from random import randint

proxy_list = [
    'http://Username:Password@85.237.57.198:20000',
    'http://Username:Password@85.237.57.198:21000',
    'http://Username:Password@85.237.57.198:22000',
    'http://Username:Password@85.237.57.198:23000',
]

proxy_index = randint(0, len(proxy_list) - 1)

proxies = {
    "http://": proxy_list[proxy_index],
    "https://": proxy_list[proxy_index],
}

r = httpx.get(url='https://example.com/', proxies=proxies)
print(r.text)
```
This is a simplistic example, as when scraping at scale we would also need to build a mechanism to monitor the performance of each individual IP address and remove it from the proxy rotation if it got banned or blocked.
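To illustrate, here is a minimal sketch of one way such a mechanism could look: track failures per proxy and drop any IP that exceeds a threshold. The `MAX_FAILURES` value and the status codes treated as blocks are assumptions for the example, not a production-ready design:

```python
import httpx
import random

MAX_FAILURES = 3  # assumed threshold - tune for your use case

proxy_list = [
    'http://Username:Password@85.237.57.198:20000',
    'http://Username:Password@85.237.57.198:21000',
]

failures = {proxy: 0 for proxy in proxy_list}

def get_with_rotation(url):
    # Only rotate through proxies below the failure threshold
    # (assumes at least one healthy proxy remains).
    active = [p for p in proxy_list if failures[p] < MAX_FAILURES]
    proxy = random.choice(active)
    try:
        r = httpx.get(url, proxies={'all://': proxy})
        if r.status_code in (403, 429):  # treat blocks/rate limits as failures
            failures[proxy] += 1
        return r
    except httpx.RequestError:
        failures[proxy] += 1
        raise
```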
Proxy Integration #2: Using Proxy Gateway
Increasingly, a lot of proxy providers aren't selling lists of proxy IP addresses anymore. Instead, they give you access to their proxy pools via a proxy gateway.
Here, you only have to integrate a single proxy into your Python HTTPX scraper and the proxy provider will manage the proxy rotation, selection, cleaning, etc. on their end for you.
This is the most common way to use residential and mobile proxies, and it is becoming increasingly common when using datacenter proxies too.
Here is an example of how to integrate BrightData's residential proxy gateway into our Python HTTPX scraper:
```python
import httpx

proxies = {
    'http://': 'http://USERNAME:PASSWORD@zproxy.lum-superproxy.io:22225',
    'https://': 'http://USERNAME:PASSWORD@zproxy.lum-superproxy.io:22225',
}

url = 'http://example.com/'

response = httpx.get(url, proxies=proxies)
```
As you can see, it is much easier to integrate than using a proxy list as you don't have to worry about implementing all the proxy rotation logic.
Proxy Integration #3: Using Proxy API Endpoint
Recently, a lot of proxy providers have started offering smart proxy APIs that take care of managing your proxy infrastructure by rotating proxies and headers for you, so you can focus on extracting the data you need.
Here you typically send the URL you want to scrape to their API endpoint, and they return the HTML response.
Although every proxy API provider has a slightly different API integration, they are all very similar and are very easy to integrate with.
Here is an example of how to integrate with the ScrapeOps Proxy Manager:
```python
import httpx
from urllib.parse import urlencode

payload = {'api_key': 'APIKEY', 'url': 'https://httpbin.org/ip'}

r = httpx.get('https://proxy.scrapeops.io/v1/', params=urlencode(payload))
print(r.text)
```
Here you simply send the URL you want to scrape to the ScrapeOps API endpoint in the `url` query parameter, along with your API key in the `api_key` query parameter, and ScrapeOps will deal with finding the best proxy for that domain and return the HTML response to you.
You can get your own free API key with 1,000 free requests by signing up here.
When using proxy API endpoints it is very important to encode the URL you want to scrape before sending it to the Proxy API endpoint. If the URL contains query parameters, the Proxy API might otherwise think that those query parameters are for the Proxy API and not the target website.
To encode your URL you just need to use the `urlencode(payload)` function as we've done in the example above.
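A common pattern is to wrap this encoding in a small helper function so every request is converted into a Proxy API request. The `get_scrapeops_url` helper below is an illustrative sketch, not an official client:

```python
import httpx
from urllib.parse import urlencode

API_KEY = 'APIKEY'  # placeholder - use your own key

def get_scrapeops_url(url):
    # Illustrative helper: URL-encode the target URL and API key
    # into the ScrapeOps proxy endpoint's query string.
    payload = {'api_key': API_KEY, 'url': url}
    return 'https://proxy.scrapeops.io/v1/?' + urlencode(payload)

# Query parameters in the target URL are safely encoded by the helper.
r = httpx.get(get_scrapeops_url('https://quotes.toscrape.com/page/1/?q=test'))
print(r.text)
```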
More Web Scraping Tutorials
So that's how you can integrate proxies into your Python HTTPX scrapers.
If you would like to learn more about Web Scraping, then be sure to check out The Web Scraping Playbook.
Or check out one of our more in-depth guides: