Python Hrequests: Make Concurrent Requests
In this guide for The Python Web Scraping Playbook, we will look at how to configure the Python Hrequests library to make concurrent requests so that you can increase the speed of your scrapers.
The more concurrent threads you have, the more requests you can have active in parallel, and the faster you can scrape.
So in this guide we will walk you through the best way to send concurrent requests with Python Hrequests using the ThreadPoolExecutor.
Let's begin...
Need help scraping the web?
Then check out ScrapeOps, the complete toolkit for web scraping.
Make Concurrent Requests Using ThreadPoolExecutor
The first approach to making concurrent requests with Python Hrequests is to use the `ThreadPoolExecutor` from Python's `concurrent.futures` package.
Here is an example:
```python
import hrequests
from bs4 import BeautifulSoup
import concurrent.futures

NUM_THREADS = 5

## Example list of urls to scrape
list_of_urls = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
    'http://quotes.toscrape.com/page/3/',
    'http://quotes.toscrape.com/page/4/',
    'http://quotes.toscrape.com/page/5/',
]

output_data_list = []

def scrape_page(url):
    try:
        response = hrequests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, "html.parser")
            title = soup.find('h1').text

            ## add scraped data to "output_data_list" list
            output_data_list.append({
                'title': title,
            })
    except Exception as e:
        print('Error', e)

with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    executor.map(scrape_page, list_of_urls)

print(output_data_list)
```
Here we:

- We define a `list_of_urls` we want to scrape.
- We create a function `scrape_page(url)` that takes a URL as an input and appends the scraped title to the `output_data_list`.
- Then, using `ThreadPoolExecutor`, we create a pool of workers that will pull URLs from the `list_of_urls` and pass them into `scrape_page(url)`.
- Now when we run this script, it will create 5 workers (`max_workers=NUM_THREADS`) that will concurrently pull URLs from the `list_of_urls` and pass them into `scrape_page(url)`.
Using this approach, we can significantly increase the speed at which we make requests with Python Hrequests.
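If you would rather collect the scraped data from return values instead of appending to a shared list, you can also use `executor.submit()` together with `concurrent.futures.as_completed()`. The following is a minimal sketch of that variation (same example URLs and parsing logic as above, not part of the original script):

```python
import hrequests
from bs4 import BeautifulSoup
import concurrent.futures

NUM_THREADS = 5

list_of_urls = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
    'http://quotes.toscrape.com/page/3/',
]

def scrape_page(url):
    ## return the scraped data instead of appending to a shared list
    try:
        response = hrequests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, "html.parser")
            return {'title': soup.find('h1').text}
    except Exception as e:
        print('Error', e)
    return None

output_data_list = []
with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    ## submit one task per URL and gather results as each request finishes
    futures = [executor.submit(scrape_page, url) for url in list_of_urls]
    for future in concurrent.futures.as_completed(futures):
        result = future.result()
        if result is not None:
            output_data_list.append(result)

print(output_data_list)
```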
Adding Concurrency To ScrapeOps Scrapers
The following example sends requests through the ScrapeOps Proxy API Aggregator, which enables you to use all of the concurrent threads your proxy plan allows.
Just change the `NUM_THREADS` value to the number of concurrent threads your proxy plan allows.
```python
import hrequests
from bs4 import BeautifulSoup
import concurrent.futures
from urllib.parse import urlencode

SCRAPEOPS_API_KEY = 'YOUR_API_KEY'
NUM_THREADS = 5

def get_scrapeops_url(url):
    payload = {'api_key': SCRAPEOPS_API_KEY, 'url': url}
    proxy_url = 'https://proxy.scrapeops.io/v1/?' + urlencode(payload)
    return proxy_url

## Example list of urls to scrape
list_of_urls = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
    'http://quotes.toscrape.com/page/3/',
    'http://quotes.toscrape.com/page/4/',
    'http://quotes.toscrape.com/page/5/',
]

output_data_list = []

def scrape_page(url):
    try:
        response = hrequests.get(get_scrapeops_url(url))
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, "html.parser")
            title = soup.find('h1').text

            ## add scraped data to "output_data_list" list
            output_data_list.append({
                'title': title,
            })
    except Exception as e:
        print('Error', e)

with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    executor.map(scrape_page, list_of_urls)

print(output_data_list)
```
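For reference, `get_scrapeops_url()` simply URL-encodes your API key and the target URL onto the ScrapeOps proxy endpoint, so each worker sends its request through the proxy. A quick illustration of what the helper returns, assuming the script above and the placeholder `YOUR_API_KEY`:

```python
print(get_scrapeops_url('http://quotes.toscrape.com/page/1/'))
## https://proxy.scrapeops.io/v1/?api_key=YOUR_API_KEY&url=http%3A%2F%2Fquotes.toscrape.com%2Fpage%2F1%2F
```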
You can get your own free API key with 1,000 free requests by signing up here.
More Web Scraping Tutorials
So that's how you can configure Python Hrequests to send requests concurrently.
If you would like to learn more about Web Scraping, then be sure to check out The Web Scraping Playbook.
Or check out one of our more in-depth guides: