Python Requests: Web Scraping Guide
In this guide for The Python Web Scraping Playbook, we will look at how to set up your Python Requests scrapers to avoid getting blocked, retry failed requests, and scale up with concurrency.
Python Requests is the most popular HTTP client library used by Python developers, so in this article we will run through all the best practices you need to know, including:
- Making GET Requests
- Making POST Requests
- Using Fake User Agents With Python Requests
- Using Proxies With Python Requests
- Retrying Failed Requests
- Scaling Your Scrapers Using Concurrent Threads
- Rendering JS On Client-Side Rendered Pages
For this guide, we're going to focus on how to set up the HTTP client element of your Python Requests-based scrapers, not on how to parse the data from the HTML responses.
To keep things simple, we will be using BeautifulSoup to parse data from QuotesToScrape.
If you want to learn more about how to use BeautifulSoup or web scraping with Python in general then check out our BeautifulSoup Guide or our Python Beginners Web Scraping Guide.
Let's begin with the basics and work our way up to the more complex topics...
Making GET Requests
Making GET requests with Python Requests is very simple.
We just need to request the URL using `requests.get(url)`:
import requests
response = requests.get('http://quotes.toscrape.com/')
print(response.text)
The following are the most commonly used attributes of the `Response` class:

- `status_code`: The HTTP status code of the response.
- `text`: The response content as a Unicode string.
- `content`: The response content in bytes.
- `headers`: A dictionary-like object containing the response headers.
- `url`: The URL of the response.
- `encoding`: The encoding of the response content.
- `cookies`: A `RequestsCookieJar` object containing the cookies sent by the server.
- `history`: A list of previous responses if there were redirects.
- `ok`: A boolean indicating whether the response was successful (status code less than 400).
- `reason`: The reason phrase returned by the server (e.g., "OK", "Not Found").
- `elapsed`: The time elapsed between sending the request and receiving the response.
- `request`: The `PreparedRequest` object that was sent to the server.
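For example, a quick look at a few of these attributes for the request above (a minimal sketch):

import requests

response = requests.get('http://quotes.toscrape.com/')

# Inspect some commonly used attributes of the Response object
print(response.status_code)              # e.g. 200
print(response.ok)                       # True for status codes below 400
print(response.headers['Content-Type'])  # e.g. text/html; charset=utf-8
print(response.encoding)                 # e.g. utf-8
print(response.elapsed)                  # round-trip time as a datetime.timedelta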
Making POST Requests
Making POST requests with Python Requests is also very simple.
To send JSON data in a POST request, we just need to call `requests.post()` with the URL and pass the data using the `json` parameter:
import requests
url = 'http://quotes.toscrape.com/'
data = {'key': 'value'}
# Send POST request with JSON data using the json parameter
response = requests.post(url, json=data)
# Print the response body
print(response.text)
To send form data in a POST request, we just need to call `requests.post()` with the URL and pass the data using the `data` parameter:
import requests
url = 'http://quotes.toscrape.com/'
data = {'key': 'value'}
# Send POST request with form data using the data parameter
response = requests.post(url, data=data)
# Print the response body
print(response.text)
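The practical difference between the two is the Content-Type header Requests sets for you. A quick way to see this is to inspect the prepared request that was actually sent (a minimal check):

import requests

url = 'http://quotes.toscrape.com/'

# json= serialises the dict to JSON and sets Content-Type: application/json
json_response = requests.post(url, json={'key': 'value'})
print(json_response.request.headers['Content-Type'])  # application/json

# data= form-encodes the dict and sets Content-Type: application/x-www-form-urlencoded
form_response = requests.post(url, data={'key': 'value'})
print(form_response.request.headers['Content-Type'])  # application/x-www-form-urlencoded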
For more details on how to send POST requests with Python Requests, check out our Python Requests Guide: How to Send POST Requests.
Using Fake User Agents With Python Requests
User Agents are strings that let the website you are scraping identify the application, operating system (OSX/Windows/Linux), browser (Chrome/Firefox/Internet Explorer), etc. of the user sending a request to their website. They are sent to the server as part of the request headers.
Here is an example User agent sent when you visit a website with a Chrome browser:
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.82 Safari/537.36'
When scraping a website, you also need to set user-agents on every request as otherwise the website may block your requests because it knows you aren't a real user.
With most Python HTTP clients, including Python Requests, the default settings clearly identify in the user-agent string that the request is being made by that library. For example, Python Requests sends:
'User-Agent': 'python-requests/2.26.0',
This user-agent will clearly identify your requests are being made by the Python Requests library, so the website can easily block you from scraping the site.
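You can verify this yourself by inspecting the headers attached to the outgoing request (a quick check; the exact version depends on your installed release):

import requests

# No custom headers set, so Requests falls back to its default User-Agent
response = requests.get('http://quotes.toscrape.com/')
print(response.request.headers['User-Agent'])  # e.g. python-requests/2.26.0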
That is why we need to manage the user-agents we use with Python Requests when we send requests.
How To Set A Fake User-Agent In Python Requests
Setting Python Requests to use a fake user-agent is very easy. We just need to define it in a `headers` dictionary and add it to the request using the `headers` parameter.
import requests
headers={"User-Agent": "Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148"}
r = requests.get('http://quotes.toscrape.com/', headers=headers)
print(r.text)
Link to the official documentation.
How To Rotate User-Agents
In the previous example, we only set a single user-agent. However, when scraping at scale you need to rotate your user-agents to make your requests harder to detect for the website you are scraping.
Luckily, rotating through user-agents is also pretty straightforward with Python Requests. We just need a list of user-agents in our scraper and use a random one with every request.
import requests
import random
user_agent_list = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36',
    'Mozilla/5.0 (iPhone; CPU iPhone OS 14_4_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Mobile/15E148 Safari/604.1',
    'Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1)',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/87.0.664.75',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18363',
]
headers={"User-Agent": user_agent_list[random.randint(0, len(user_agent_list)-1)]}
r = requests.get('http://quotes.toscrape.com/', headers=headers)
print(r.json())
This works, but it has drawbacks: we would need to build and maintain an up-to-date list of user-agents ourselves.
Another approach would be to use a user-agent database like ScrapeOps Free Fake User-Agent API that returns a list of up-to-date user-agents you can use in your scrapers.
Here is an example Python Requests scraper integration:
import requests
from random import randint
SCRAPEOPS_API_KEY = 'YOUR_API_KEY'
def get_user_agent_list():
    response = requests.get('http://headers.scrapeops.io/v1/user-agents?api_key=' + SCRAPEOPS_API_KEY)
    json_response = response.json()
    return json_response.get('result', [])

def get_random_user_agent(user_agent_list):
    random_index = randint(0, len(user_agent_list) - 1)
    return user_agent_list[random_index]
## Retrieve User-Agent List From ScrapeOps
user_agent_list = get_user_agent_list()
url_list = [
    'http://quotes.toscrape.com/',
    'http://quotes.toscrape.com/',
    'http://quotes.toscrape.com/',
]
for url in url_list:
    ## Add Random User-Agent To Headers
    headers = {'User-Agent': get_random_user_agent(user_agent_list)}

    ## Make Requests
    r = requests.get(url=url, headers=headers)
    print(r.text)
For a more detailed guide on how to use fake user-agents with Python Requests, check out our Guide to Setting Fake User-Agents With Python Requests.
Using Proxies With Python Requests
Using proxies with the Python Requests library allows you to spread your requests over multiple IP addresses making it harder for websites to detect & block your web scrapers.
Using a proxy with Python Requests is very straightforward. We simply need to create a `proxies` dictionary and pass it into the `proxies` parameter of our request.
import requests
proxies = {
    'http': 'http://proxy.example.com:8080',
    'https': 'http://proxy.example.com:8081',
}
response = requests.get('http://quotes.toscrape.com/', proxies=proxies)
This method will work for all request methods Python Requests supports: `GET`, `POST`, `PUT`, `DELETE`, `PATCH`, `HEAD`.
However, the above example only uses a single proxy to make the requests. To avoid having your scrapers blocked you need to use large pools of proxies and rotate your requests through different proxies.
There are 3 common ways to integrate and rotate proxies in your scrapers:
Proxy Integration #1: Rotating Through Proxy IP List
Here, a proxy provider will normally give you a list of proxy IP addresses, and you will need to configure your scraper to rotate through them and select a new IP address for every request.
The proxy list you receive will look something like this:
'http://Username:Password@85.237.57.198:20000',
'http://Username:Password@85.237.57.198:21000',
'http://Username:Password@85.237.57.198:22000',
'http://Username:Password@85.237.57.198:23000',
To integrate them into our scrapers, we need to configure our code to pick a random proxy from this list every time we make a request.
In our Python Requests scraper we could do it like this:
import requests
from random import randint
proxy_list = [
    'http://Username:Password@85.237.57.198:20000',
    'http://Username:Password@85.237.57.198:21000',
    'http://Username:Password@85.237.57.198:22000',
    'http://Username:Password@85.237.57.198:23000',
]
proxy_index = randint(0, len(proxy_list) - 1)
proxies = {
    'http': proxy_list[proxy_index],
    'https': proxy_list[proxy_index],
}
r = requests.get(url='http://quotes.toscrape.com/', proxies=proxies)
print(r.text)
This is a simplistic example, as when scraping at scale we would also need to build a mechanism to monitor the performance of each individual IP address and remove it from the proxy rotation if it got banned or blocked.
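As a rough illustration of that idea, here is a minimal sketch (not production code) that drops a proxy from the rotation once it has returned too many blocked responses; the `MAX_FAILURES` threshold and the failure conditions are just illustrative choices:

import requests
import random

MAX_FAILURES = 3  # illustrative threshold before we drop a proxy

proxy_list = [
    'http://Username:Password@85.237.57.198:20000',
    'http://Username:Password@85.237.57.198:21000',
]
failure_counts = {proxy: 0 for proxy in proxy_list}

def get_with_rotation(url):
    # Only rotate through proxies that haven't exceeded the failure threshold
    active_proxies = [p for p in proxy_list if failure_counts[p] < MAX_FAILURES]
    if not active_proxies:
        raise RuntimeError('No healthy proxies left in the pool')
    proxy = random.choice(active_proxies)
    try:
        response = requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=10)
        if response.status_code in [403, 429]:
            # Treat ban / rate-limit responses as a failure for this proxy
            failure_counts[proxy] += 1
        return response
    except requests.exceptions.RequestException:
        failure_counts[proxy] += 1
        return None

response = get_with_rotation('http://quotes.toscrape.com/')
if response is not None and response.status_code == 200:
    print(response.text)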
Proxy Integration #2: Using Proxy Gateway
Increasingly, a lot of proxy providers aren't selling lists of proxy IP addresses anymore. Instead, they give you access to their proxy pools via a proxy gateway.
Here, you only have to integrate a single proxy into your Python Requests scraper and the proxy provider will manage the proxy rotation, selection, cleaning, etc. on their end for you.
This is the most common way to use residential and mobile proxies, and it is becoming increasingly common when using datacenter proxies too.
Here is an example of how to integrate BrightData's residential proxy gateway into our Python Requests scraper:
import requests
proxies = {
    'http': 'http://zproxy.lum-superproxy.io:22225',
    'https': 'http://zproxy.lum-superproxy.io:22225',
}
url = 'http://quotes.toscrape.com/'
response = requests.get(url, proxies=proxies, auth=('USERNAME', 'PASSWORD'))
As you can see, it is much easier to integrate than using a proxy list as you don't have to worry about implementing all the proxy rotation logic.
Proxy Integration #3: Using Proxy API Endpoint
Recently, a lot of proxy providers have started offering smart proxy APIs that manage your proxy infrastructure for you, rotating proxies and headers automatically, so you can focus on extracting the data you need.
Here, you typically send the URL you want to scrape to their API endpoint, and they return the HTML response.
Although every proxy API provider has a slightly different API integration, they are all very similar and are very easy to integrate with.
Here is an example of how to integrate with the ScrapeOps Proxy Manager:
import requests
from urllib.parse import urlencode
payload = {'api_key': 'APIKEY', 'url': 'http://quotes.toscrape.com/'}
r = requests.get('https://proxy.scrapeops.io/v1/', params=urlencode(payload))
print(r.text)
Here you simply send the URL you want to scrape to the ScrapeOps API endpoint in the `url` query parameter, along with your API key in the `api_key` query parameter, and ScrapeOps will deal with finding the best proxy for that domain and return the HTML response to you.
You can get your own free API key with 1,000 free requests by signing up here.
When using proxy API endpoints, it is very important to encode the URL you want to scrape before sending it to the Proxy API endpoint. If the URL contains query parameters, the Proxy API might otherwise think those query parameters are for the Proxy API and not the target website.
To encode your URL, you just need to use the `urlencode(payload)` function as we've done in the example above.
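For example, here is a sketch of what that looks like with a target URL that itself contains query parameters (the target URL below is just an illustrative example):

import requests
from urllib.parse import urlencode

# Illustrative target URL that contains its own query parameters
target_url = 'http://quotes.toscrape.com/tag/inspirational/?page=2'

payload = {'api_key': 'APIKEY', 'url': target_url}

# urlencode escapes the ?, / and & characters in the target URL so the
# proxy API doesn't confuse the target's parameters with its own
r = requests.get('https://proxy.scrapeops.io/v1/', params=urlencode(payload))
print(r.text)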
For a more detailed guide on how to use proxies with Python Requests, check out our Guide to Using Proxies With Python Requests.
Retrying Failed Requests With Python Requests
When web scraping, some requests will inevitably fail either from connection issues or because the website blocks the requests.
To combat this, we need to configure our Python Requests scrapers to retry failed requests so they are more reliable and extract all the target data.
For a more detailed guide on how to retry failed requests with Python Requests, check out our Guide to Retrying Requests With Python Requests.
One of the best methods of retrying failed requests with Python Requests is to build your own retry logic around your request functions.
import requests

NUM_RETRIES = 3

response = None
for _ in range(NUM_RETRIES):
    try:
        response = requests.get('http://quotes.toscrape.com/')
        if response.status_code in [200, 404]:
            ## Escape for loop if returns a successful response
            break
    except requests.exceptions.ConnectionError:
        pass

## Do something with successful response
if response is not None and response.status_code == 200:
    pass
The advantage of this approach is that you have a lot of control over what is a failed response.
Above we only look at the response code to decide whether to retry the request; however, we could adapt this so that we also check the response body to make sure the HTML is valid.
Below we will add an additional check to make sure the HTML response doesn't contain a ban page.
import requests

NUM_RETRIES = 3

response = None
for _ in range(NUM_RETRIES):
    try:
        response = requests.get('http://quotes.toscrape.com/')
        if response.status_code in [200, 404]:
            if response.status_code == 200 and '<title>Robot or human?</title>' not in response.text:
                break
    except requests.exceptions.ConnectionError:
        pass

## Do something with successful response
if response is not None and response.status_code == 200:
    pass
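If you'd rather not hand-roll the retry loop, another common approach is to let Requests retry for you by mounting urllib3's Retry class on a Session via an HTTPAdapter. Here is a minimal sketch (the retry counts and status codes are just example values):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry up to 3 times on connection errors and on these status codes,
# backing off exponentially between attempts
retry_strategy = Retry(
    total=3,
    backoff_factor=1,
    status_forcelist=[429, 500, 502, 503, 504],
)

session = requests.Session()
adapter = HTTPAdapter(max_retries=retry_strategy)
session.mount('http://', adapter)
session.mount('https://', adapter)

response = session.get('http://quotes.toscrape.com/')
print(response.status_code)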
Scaling Your Python Request Scrapers With Concurrent Threads
Another common bottleneck you will encounter when building web scrapers with Python Requests is that by default you can only send requests serially. So your scraper can be quite slow if the scraping job is large.
However, you can increase the speed of your scrapers by making concurrent requests.
The more concurrent threads you have, the more requests you can have active in parallel, and the faster you can scrape.
For a more detailed guide on how to make concurrent requests with Python Requests, check out our Guide to Scaling Your Scrapers Using Concurrency With Python Requests.
One of the best approaches to making concurrent requests with Python Requests is to use the `ThreadPoolExecutor` from Python's `concurrent.futures` package.
Here is an example:
import requests
from bs4 import BeautifulSoup
import concurrent.futures
NUM_THREADS = 5
## Example list of urls to scrape
list_of_urls = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
    'http://quotes.toscrape.com/page/3/',
    'http://quotes.toscrape.com/page/4/',
    'http://quotes.toscrape.com/page/5/',
]
output_data_list = []
def scrape_page(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, "html.parser")
            title = soup.find('h1').text

            ## add scraped data to "output_data_list" list
            output_data_list.append({
                'title': title,
            })
    except Exception as e:
        print('Error', e)

with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    executor.map(scrape_page, list_of_urls)
print(output_data_list)
Here we:

- Define a `list_of_urls` we want to scrape.
- Create a function `scrape_page(url)` that takes a URL as input and appends the scraped title to `output_data_list`.
- Use `ThreadPoolExecutor` to create a pool of workers that pull URLs from `list_of_urls` and pass them into `scrape_page(url)`.
- When we run this script, it creates 5 workers (`max_workers=NUM_THREADS`) that concurrently pull URLs from `list_of_urls` and pass them into `scrape_page(url)`.
Using this approach we can significantly increase the speed at which we can make requests with Python Requests.
Rendering JS On Client-Side Rendered Pages
As Python Requests is an HTTP client, it only retrieves the HTML/JSON response the website's server initially returns. It can't render any Javascript on client-side rendered pages.
This can prevent your scraper from being able to see and extract all the data you need from the web page.
As a consequence, using a headless browser is often required if you want to scrape a Single Page Application built with frameworks such as React.js, Angular.js, jQuery or Vue.
If you need to scrape a JS-rendered page, you can use a headless browser library for Python like Selenium or Pyppeteer instead of Python Requests.
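As a rough illustration, a minimal headless-Chrome sketch with Selenium might look like this (assuming Selenium 4+ and a local Chrome install; quotes.toscrape.com/js/ is the JavaScript-rendered version of the demo site):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Run Chrome in headless mode so no browser window is opened
options = Options()
options.add_argument('--headless')

driver = webdriver.Chrome(options=options)
try:
    driver.get('http://quotes.toscrape.com/js/')
    html = driver.page_source  # HTML after the JavaScript has executed
    print(html)
finally:
    driver.quit()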
Check out our guides to scraping JS rendered pages with Pyppeteer here.
Another option is to use a proxy service that manages the headless browser for you, so you can scrape JS-rendered pages using standard Python Requests HTTP requests.
The ScrapeOps Proxy Aggregator enables you to use a headless browser by adding `render_js=true` to your requests.
import requests
from urllib.parse import urlencode
payload = {'api_key': 'APIKEY', 'url': 'http://quotes.toscrape.com/', 'render_js': 'true'}
r = requests.get('https://proxy.scrapeops.io/v1/', params=urlencode(payload))
print(r.text)
You can get your own free API key with 1,000 free requests by signing up here.
For more information about ScrapeOps JS rendering functionality check out our headless browser docs here.
More Web Scraping Tutorials
So that's how you can set up your Python Requests scrapers to avoid getting blocked, retry failed requests, and scale up with concurrency.
If you would like to learn more about Web Scraping, then be sure to check out The Web Scraping Playbook.
Or check out one of our more in-depth guides: