Python Hrequests: Make Concurrent Requests
In this guide for The Python Web Scraping Playbook, we will look at how to configure the Python Hrequests library to make concurrent requests so that you can increase the speed of your scrapers.
The more concurrent threads you have, the more requests you can have active in parallel, and the faster you can scrape.
So in this guide we will walk you through the best way to send concurrent requests with Python Hrequests using the ThreadPoolExecutor.
Let's begin...
Need help scraping the web?
Then check out ScrapeOps, the complete toolkit for web scraping.
Make Concurrent Requests Using ThreadPoolExecutor
The first approach to making concurrent requests with Python Hrequests is to use the `ThreadPoolExecutor` from Python's `concurrent.futures` package.
Here is an example:
```python
import hrequests
from bs4 import BeautifulSoup
import concurrent.futures

NUM_THREADS = 5

## Example list of urls to scrape
list_of_urls = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
    'http://quotes.toscrape.com/page/3/',
    'http://quotes.toscrape.com/page/4/',
    'http://quotes.toscrape.com/page/5/',
]

output_data_list = []

def scrape_page(url):
    try:
        response = hrequests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, "html.parser")
            title = soup.find('h1').text

            ## add scraped data to "output_data_list" list
            output_data_list.append({
                'title': title,
            })
    except Exception as e:
        print('Error', e)

with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    executor.map(scrape_page, list_of_urls)

print(output_data_list)
```
Here we:

- We define a `list_of_urls` we want to scrape.
- We create a function `scrape_page(url)` that takes a URL as an input and appends the scraped title to the `output_data_list`.
- Then, using `ThreadPoolExecutor`, we create a pool of workers that will pull URLs from the `list_of_urls` and pass them into `scrape_page(url)`.
- Now when we run this script, it will create 5 workers (`max_workers=NUM_THREADS`) that will concurrently pull URLs from the `list_of_urls` and pass them into `scrape_page(url)`.
Using this approach, we can significantly increase the speed at which we make requests with Python Hrequests.
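If you would rather collect the scraped data from return values instead of appending to a shared list, you can also use `executor.submit()` together with `concurrent.futures.as_completed()`. The following is a minimal sketch of that variation (same example URLs and parsing logic as above, not part of the original script):

```python
import hrequests
from bs4 import BeautifulSoup
import concurrent.futures

NUM_THREADS = 5

list_of_urls = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
    'http://quotes.toscrape.com/page/3/',
]

def scrape_page(url):
    ## return the scraped data instead of appending to a shared list
    try:
        response = hrequests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, "html.parser")
            return {'title': soup.find('h1').text}
    except Exception as e:
        print('Error', e)
    return None

output_data_list = []
with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    ## submit one task per URL and gather results as each request finishes
    futures = [executor.submit(scrape_page, url) for url in list_of_urls]
    for future in concurrent.futures.as_completed(futures):
        result = future.result()
        if result is not None:
            output_data_list.append(result)

print(output_data_list)
```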
Adding Concurrency To ScrapeOps Scrapers
The following example sends requests through the ScrapeOps Proxy API Aggregator, which enables you to use all of the concurrent threads your proxy plan allows.
Just change the `NUM_THREADS` value to the number of concurrent threads your proxy plan allows.
```python
import hrequests
from bs4 import BeautifulSoup
import concurrent.futures
from urllib.parse import urlencode

SCRAPEOPS_API_KEY = 'YOUR_API_KEY'
NUM_THREADS = 5

def get_scrapeops_url(url):
    payload = {'api_key': SCRAPEOPS_API_KEY, 'url': url}
    proxy_url = 'https://proxy.scrapeops.io/v1/?' + urlencode(payload)
    return proxy_url

## Example list of urls to scrape
list_of_urls = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
    'http://quotes.toscrape.com/page/3/',
    'http://quotes.toscrape.com/page/4/',
    'http://quotes.toscrape.com/page/5/',
]

output_data_list = []

def scrape_page(url):
    try:
        response = hrequests.get(get_scrapeops_url(url))
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, "html.parser")
            title = soup.find('h1').text

            ## add scraped data to "output_data_list" list
            output_data_list.append({
                'title': title,
            })
    except Exception as e:
        print('Error', e)

with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    executor.map(scrape_page, list_of_urls)

print(output_data_list)
```
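For reference, `get_scrapeops_url()` simply URL-encodes your API key and the target URL onto the ScrapeOps proxy endpoint, so each worker sends its request through the proxy. A quick illustration of what the helper returns, assuming the script above and the placeholder `YOUR_API_KEY`:

```python
print(get_scrapeops_url('http://quotes.toscrape.com/page/1/'))
## https://proxy.scrapeops.io/v1/?api_key=YOUR_API_KEY&url=http%3A%2F%2Fquotes.toscrape.com%2Fpage%2F1%2F
```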
You can get your own free API key with 1,000 free requests by signing up here.
More Web Scraping Tutorials
So that's how you can configure Python Hrequests to send requests concurrently.
If you would like to learn more about Web Scraping, then be sure to check out The Web Scraping Playbook.
Or check out one of our more in-depth guides: