
How To Build A Python Web Scraping Framework

In this guide, we will look at how you can build a simple web scraping client/framework that you can use with all your Python scrapers to make them production ready.

This universal web scraping client will make dealing with retries, integrating proxies, monitoring your scrapers, and sending concurrent requests much easier when moving a scraper to production.

You can use this client yourself, or use it as a reference to build your own web scraping client.

So in this guide we will walk through:

  • Structuring Our Universal Web Scraping Client
  • Initializing Our Scraping Client
  • Start ScrapeOps Monitor & Get HTTP Client
  • Integrate Proxy Provider
  • Sending Requests & Handling Retries
  • Send Concurrent Requests
  • Complete Web Scraping Client

Structuring Our Universal Web Scraping Client

To keep our ScrapingClient clean and simple, we will create it as a class, and then integrate it into our scrapers when we want to move them to production.

Here is an outline of our ScrapingClient class:


import time
import requests
import concurrent.futures
from urllib.parse import urlencode
from scrapeops_python_requests.scrapeops_requests import ScrapeOpsRequests


class ScrapingClient:

    def __init__(self):
        pass

    def start_scrapeops_monitor(self):
        """
        Starts the ScrapeOps monitor, which ships logs to dashboard.
        """
        pass

    def scrapeops_proxy_url(self):
        """
        Converts URL into ScrapeOps Proxy API URL
        """
        pass

    def send_request(self):
        """
        Sends HTTP request and retries failed responses.
        """
        pass

    def concurrent_requests(self):
        """
        Enables requests to be sent in parallel
        """
        pass


This class will use the ScrapeOps Proxy API as the proxy provider, and the free ScrapeOps Monitoring SDK to monitor the scraper in production.

You can get your free ScrapeOps API key here.

For demonstration purposes, we will integrate this ScrapingClient into a simple QuotesToScrape scraper.


import requests
from bs4 import BeautifulSoup

list_of_urls = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
    'http://quotes.toscrape.com/page/3/',
]

scraped_quotes = []

for url in list_of_urls:
    response = requests.get(url)
    if response is not None and response.status_code == 200:
        ## Parse data with BeautifulSoup
        html_response = response.text
        soup = BeautifulSoup(html_response, "html.parser")
        quotes_sections = soup.find_all('div', class_="quote")

        ## Loop through each quotes section and extract the quote and author
        for quote_block in quotes_sections:
            quote = quote_block.find('span', class_='text').text
            author = quote_block.find('small', class_='author').text

            ## Add scraped data to "scraped_quotes" list
            scraped_quotes.append({
                'quote': quote,
                'author': author
            })

print(scraped_quotes)


Initializing Our Scraping Client

The first step is to set up the ScrapingClient initialization and integrate it into our QuotesToScrape scraper.


import time
import requests
import concurrent.futures
from urllib.parse import urlencode
from scrapeops_python_requests.scrapeops_requests import ScrapeOpsRequests


class ScrapingClient:

    def __init__(self,
                 scrapeops_api_key=None,
                 scrapeops_proxy_enabled=True,
                 scrapeops_monitoring_enabled=True,
                 scrapeops_proxy_settings={},
                 spider_name=None,
                 job_name=None,
                 num_concurrent_threads=1,
                 num_retries=5,
                 http_allow_list=[200, 404],
                 ):
        self.scrapeops_api_key = scrapeops_api_key
        self.scrapeops_proxy_settings = scrapeops_proxy_settings
        self.scrapeops_proxy_enabled = scrapeops_proxy_enabled
        self.scrapeops_monitoring_enabled = scrapeops_monitoring_enabled
        self.num_concurrent_threads = num_concurrent_threads
        self.num_retries = num_retries
        self.http_allow_list = http_allow_list
        self.spider_name = spider_name
        self.job_name = job_name
        self.sops_request_wrapper = None
        self.start_scrapeops_monitor()

    def start_scrapeops_monitor(self):
        """
        Starts the ScrapeOps monitor, which ships logs to dashboard.
        """
        pass

    def scrapeops_proxy_url(self):
        """
        Converts URL into ScrapeOps Proxy API URL
        """
        pass

    def send_request(self):
        """
        Sends HTTP request and retries failed responses.
        """
        pass

    def concurrent_requests(self):
        """
        Enables requests to be sent in parallel
        """
        pass


Here we are creating the input parameters that we can use to configure how the ScrapingClient operates.

You can add or change these variables, but here are the input parameters we will define and why:

  • scrapeops_api_key is your ScrapeOps API key, which you can get here.
  • scrapeops_proxy_enabled tells the client whether or not to route your requests through the ScrapeOps proxy.
  • scrapeops_proxy_settings lets you set any advanced functionality you would like to use from the ScrapeOps Proxy API (see the sketch after this list).
  • scrapeops_monitoring_enabled tells the client to monitor your scraper using the ScrapeOps Monitoring SDK.
  • spider_name sets the spider name to be used by the ScrapeOps Monitoring SDK.
  • job_name sets the job name to be used by the ScrapeOps Monitoring SDK.
  • num_concurrent_threads sets the number of requests your scraper can make in parallel.
  • num_retries sets the number of times the client will retry failed requests.
  • http_allow_list sets which HTTP status codes are considered successful responses.
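
For example, if you want the same proxy features applied to every request, you could pass them in via scrapeops_proxy_settings when creating the client. Here is a minimal sketch; the country and render_js keys are placeholder examples, so check the ScrapeOps Proxy API docs for the exact parameter names.


## A minimal sketch of passing global proxy settings at initialization.
## The setting names below are illustrative placeholders, not confirmed parameters.
scraping_client = ScrapingClient(
    scrapeops_api_key='YOUR_API_KEY',
    scrapeops_proxy_enabled=True,
    scrapeops_proxy_settings={
        'country': 'us',     ## e.g. a geotargeting option
        'render_js': True,   ## e.g. JavaScript rendering
    },
    spider_name='Quotes Scraper',
    job_name='quotes_main',
)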

Then, to start using this in our QuotesToScrape scraper, we just need to initialize it in the scraper.


import requests
from bs4 import BeautifulSoup

scraping_client = ScrapingClient(
    scrapeops_api_key='YOUR_API_KEY',
    scrapeops_proxy_enabled=True,
    scrapeops_monitoring_enabled=True,
    spider_name='Quotes Scraper',
    job_name='quotes_main',
    num_concurrent_threads=5,
    num_retries=3,
    http_allow_list=[200, 404]
)

list_of_urls = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
    'http://quotes.toscrape.com/page/3/',
]

scraped_quotes = []

for url in list_of_urls:
    response = requests.get(url)
    if response.status_code == 200:
        ## Parse data with BeautifulSoup
        html_response = response.text
        soup = BeautifulSoup(html_response, "html.parser")
        quotes_sections = soup.find_all('div', class_="quote")

        ## Loop through each quotes section and extract the quote and author
        for quote_block in quotes_sections:
            quote = quote_block.find('span', class_='text').text
            author = quote_block.find('small', class_='author').text

            ## Add scraped data to "scraped_quotes" list
            scraped_quotes.append({
                'quote': quote,
                'author': author
            })

print(scraped_quotes)


Start ScrapeOps Monitor & Get HTTP Client

Next, we need to create the functionality that will start the ScrapeOps Monitoring SDK and return the HTTP client our scrapers will use to make requests.

To do this, we add the functionality to the start_scrapeops_monitor method.


import time
import requests
import concurrent.futures
from urllib.parse import urlencode
from scrapeops_python_requests.scrapeops_requests import ScrapeOpsRequests

class ScrapingClient:

    ...

    def start_scrapeops_monitor(self):
        """
        Starts the ScrapeOps monitor, which ships logs to dashboard.
        """
        if self.scrapeops_monitoring_enabled and self.scrapeops_api_key is not None:
            try:
                self.scrapeops_logger = ScrapeOpsRequests(
                    scrapeops_api_key=self.scrapeops_api_key,
                    spider_name=self.spider_name,
                    job_name=self.job_name,
                )
                self.sops_request_wrapper = self.scrapeops_logger.RequestsWrapper()
            except Exception as e:
                print('Monitoring error:', e)
        else:
            self.sops_request_wrapper = requests

    ...


Now when we initialize an instance of the ScrapingClient, it will start the ScrapeOps Monitoring SDK if monitoring has been enabled and an API key has been provided.

It will also initialize the HTTP client that all our scrapers will use. If the ScrapeOps Monitoring SDK has been enabled, it will use the ScrapeOps logger's RequestsWrapper; if not, it will fall back to the standard Python Requests library.
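
Because both options expose the same requests-style interface, any code built on top of the client can call sops_request_wrapper without caring which one is active. Here is a minimal sketch, assuming the ScrapingClient defined so far:


## Minimal sketch: the wrapper behaves like the requests library either way.
client = ScrapingClient(
    scrapeops_api_key='YOUR_API_KEY',
    scrapeops_monitoring_enabled=False,  ## monitoring off, so plain requests is used
)

## client.sops_request_wrapper is now the requests module itself, but this call
## would look identical with the ScrapeOps RequestsWrapper active.
response = client.sops_request_wrapper.get('http://quotes.toscrape.com/page/1/')
print(response.status_code)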


Integrate Proxy Provider

The next thing we are going to do is integrate a proxy solution into our ScrapingClient.

For this we're going to use the ScrapeOps Proxy API, which is integrated by sending the URLs you want to scrape to an API endpoint.


import time
import requests
import concurrent.futures
from urllib.parse import urlencode
from scrapeops_python_requests.scrapeops_requests import ScrapeOpsRequests

class ScrapingClient:

    ...

    def scrapeops_proxy_url(self, url, scrapeops_proxy_settings=None):
        """
        Converts URL into ScrapeOps Proxy API URL
        """
        payload = {'api_key': self.scrapeops_api_key, 'url': url}

        ## Global Proxy Settings
        if self.scrapeops_proxy_settings is not None and type(self.scrapeops_proxy_settings) is dict:
            for key, value in self.scrapeops_proxy_settings.items():
                payload[key] = value

        ## Per Request Proxy Settings (override/add to the global settings)
        if scrapeops_proxy_settings is not None and type(scrapeops_proxy_settings) is dict:
            for key, value in scrapeops_proxy_settings.items():
                payload[key] = value

        proxy_url = 'https://proxy.scrapeops.io/v1/?' + urlencode(payload)
        return proxy_url

    ...

When used, the scrapeops_proxy_url method will convert the URL you want to scrape into an API request to the ScrapeOps Proxy API and add any additional functionality you have enabled using the scrapeops_proxy_settings attribute.

This method returns the API request URL, which can then be used later when making the request.
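
As a quick sanity check, you can print the URL the client will actually request. Here is a minimal sketch (the query string is simply whatever urlencode produces from the payload):


## Minimal sketch: inspecting the generated ScrapeOps Proxy API URL.
client = ScrapingClient(
    scrapeops_api_key='YOUR_API_KEY',
    scrapeops_monitoring_enabled=False,  ## keep the sketch lightweight
)

print(client.scrapeops_proxy_url('http://quotes.toscrape.com/page/1/'))
## Expected to look something like:
## https://proxy.scrapeops.io/v1/?api_key=YOUR_API_KEY&url=http%3A%2F%2Fquotes.toscrape.com%2Fpage%2F1%2F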


Sending Requests & Handling Retries

Next, we need to build the functionality into our ScrapingClient that will enable it to send HTTP requests, use the ScrapeOps proxy if it is enabled, and automatically retry any failed requests.

To do this, we will update the send_request method of our ScrapingClient to the following:


import time
import requests
import concurrent.futures
from urllib.parse import urlencode
from scrapeops_python_requests.scrapeops_requests import ScrapeOpsRequests

class ScrapingClient:

    ...

    def send_request(self, url, method='GET', scrapeops_proxy_settings=None, **kwargs):
        """
        Sends HTTP request and retries failed responses.
        """
        final_url = url
        try:
            if self.scrapeops_proxy_enabled and self.scrapeops_api_key is not None:
                final_url = self.scrapeops_proxy_url(url, scrapeops_proxy_settings)
            for _ in range(self.num_retries):
                try:
                    response = self.sops_request_wrapper.get(final_url, **kwargs)
                    if response.status_code in self.http_allow_list:
                        return response
                except Exception as e:
                    print('Request error:', e)
            return None
        except Exception as e:
            print('Overall error:', e)

    ...

With this method, the requested URL will be converted into an API call to the ScrapeOps Proxy API endpoint if the ScrapeOps proxy has been enabled during initialization. If not, it will send a direct request to the URL without using a proxy.

It will then retry the request up to the num_retries you defined when initializing the ScrapingClient (the class default is 5; in our example we set it to 3).

It will return the response if its status code is in the http_allow_list, which defines which status codes are considered successful responses (by default, 200 and 404). If every retry fails, it returns None.
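
As a standalone example, a single page could be fetched through the client like this, assuming the scraping_client instance created earlier in the guide. The render_js key is a placeholder for whatever per-request proxy setting you want to pass; check the ScrapeOps Proxy API docs for the real parameter names.


## Minimal sketch: fetching one page through the client, with an optional
## per-request proxy setting (placeholder key for illustration only).
response = scraping_client.send_request(
    'http://quotes.toscrape.com/page/1/',
    scrapeops_proxy_settings={'render_js': True},
)

if response is not None and response.status_code == 200:
    print(len(response.text), 'characters of HTML returned')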

Next, we will need to update our QuotesToScrape scraper to use this send_request method instead of the standard Python Requests library.


from bs4 import BeautifulSoup

scraping_client = ScrapingClient(
    scrapeops_api_key='YOUR_API_KEY',
    scrapeops_proxy_enabled=True,
    scrapeops_monitoring_enabled=True,
    spider_name='Quotes Scraper',
    job_name='quotes_main',
    num_concurrent_threads=5,
    num_retries=3,
    http_allow_list=[200, 404]
)

list_of_urls = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
    'http://quotes.toscrape.com/page/3/',
]

scraped_quotes = []

for url in list_of_urls:
    response = scraping_client.send_request(url)  ## Now uses scraping_client.send_request()
    if response is not None and response.status_code == 200:
        ## Parse data with BeautifulSoup
        html_response = response.text
        soup = BeautifulSoup(html_response, "html.parser")
        quotes_sections = soup.find_all('div', class_="quote")

        ## Loop through each quotes section and extract the quote and author
        for quote_block in quotes_sections:
            quote = quote_block.find('span', class_='text').text
            author = quote_block.find('small', class_='author').text

            ## Add scraped data to "scraped_quotes" list
            scraped_quotes.append({
                'quote': quote,
                'author': author
            })

print(scraped_quotes)


Send Concurrent Requests

Our QuotesToScrape scraper currently loops through the list_of_urls and scrapes each page one after another.

This is fine when you are only scraping a few URLs; however, if you are scraping at scale it will make your scraper very slow. The solution is to scrape multiple pages concurrently.

We will implement this functionality in our ScrapingClient using the ThreadPoolExecutor from the concurrent.futures module in the standard library.


import time
import requests
import concurrent.futures
from urllib.parse import urlencode
from scrapeops_python_requests.scrapeops_requests import ScrapeOpsRequests

class ScrapingClient:

    ...

    def concurrent_requests(self, function, input_list):
        """
        Enables requests to be sent in parallel
        """
        with concurrent.futures.ThreadPoolExecutor(max_workers=self.num_concurrent_threads) as executor:
            executor.map(function, input_list)

This method allows you to define a function to execute and a list of URLs to scrape.

We control how many requests can be made concurrently using the max_workers argument of the ThreadPoolExecutor, which we set from num_concurrent_threads.

To integrate this concurrency functionality into our QuotesToScrape scraper we need to restructure it slightly.


from bs4 import BeautifulSoup

scraping_client = ScrapingClient(
    scrapeops_api_key='YOUR_API_KEY',
    scrapeops_proxy_enabled=True,
    scrapeops_monitoring_enabled=True,
    spider_name='Quotes Scraper',
    job_name='quotes_main',
    num_concurrent_threads=5,
    num_retries=3,
    http_allow_list=[200, 404]
)

list_of_urls = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
    'http://quotes.toscrape.com/page/3/',
    'http://quotes.toscrape.com/page/4/',
    'http://quotes.toscrape.com/page/5/',
    'http://quotes.toscrape.com/page/6/',
    'http://quotes.toscrape.com/page/7/',
]

scraped_quotes = []


def scrape_url(url):
    response = scraping_client.send_request(url)
    if response is not None and response.status_code == 200:
        ## Parse data with BeautifulSoup
        html_response = response.text
        soup = BeautifulSoup(html_response, "html.parser")
        quotes_sections = soup.find_all('div', class_="quote")

        ## Loop through each quotes section and extract the quote and author
        for quote_block in quotes_sections:
            quote = quote_block.find('span', class_='text').text
            author = quote_block.find('small', class_='author').text

            ## Add scraped data to "scraped_quotes" list
            scraped_quotes.append({
                'quote': quote,
                'author': author
            })


scraping_client.concurrent_requests(scrape_url, list_of_urls)
print(scraped_quotes)

Here we put our scraping functionality into a function called scrape_url, which accepts a single URL to scrape, and then pass this function, along with the list of URLs, into our concurrent_requests method:


scraping_client.concurrent_requests(scrape_url, list_of_urls)
print(scraped_quotes)

Now when we run this script, the scraper will use 5 concurrent threads (set when we initialized the ScrapingClient).
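
One design note: the example above appends to a shared scraped_quotes list from multiple threads, which is generally fine in CPython because list.append is thread-safe under the GIL. If you would rather avoid shared state, a variation of concurrent_requests could collect and return each function's results instead. This is just a sketch of that alternative, not part of the client built above.


import concurrent.futures


## Sketch: a variant that gathers return values instead of relying on a shared list.
def concurrent_requests_with_results(function, input_list, max_workers=5):
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        ## executor.map yields each call's return value in input order
        for result in executor.map(function, input_list):
            if result:
                results.extend(result)
    return results


With this approach, scrape_url would return its list of quotes rather than appending to a global list.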


Complete Web Scraping Client

The following is the completed ScrapingClient, along with how you would integrate it into the QuotesToScrape scraper.


import time
import requests
import concurrent.futures
from bs4 import BeautifulSoup
from urllib.parse import urlencode
from scrapeops_python_requests.scrapeops_requests import ScrapeOpsRequests

class ScrapingClient:

    def __init__(self,
                 scrapeops_api_key=None,
                 scrapeops_proxy_enabled=True,
                 scrapeops_monitoring_enabled=True,
                 scrapeops_proxy_settings={},
                 spider_name=None,
                 job_name=None,
                 num_concurrent_threads=1,
                 num_retries=5,
                 http_allow_list=[200, 404],
                 ):
        self.scrapeops_api_key = scrapeops_api_key
        self.scrapeops_proxy_settings = scrapeops_proxy_settings
        self.scrapeops_proxy_enabled = scrapeops_proxy_enabled
        self.scrapeops_monitoring_enabled = scrapeops_monitoring_enabled
        self.num_concurrent_threads = num_concurrent_threads
        self.num_retries = num_retries
        self.http_allow_list = http_allow_list
        self.spider_name = spider_name
        self.job_name = job_name
        self.sops_request_wrapper = None
        self.start_scrapeops_monitor()

    def start_scrapeops_monitor(self):
        """
        Starts the ScrapeOps monitor, which ships logs to dashboard.
        """
        if self.scrapeops_monitoring_enabled and self.scrapeops_api_key is not None:
            try:
                self.scrapeops_logger = ScrapeOpsRequests(
                    scrapeops_api_key=self.scrapeops_api_key,
                    spider_name=self.spider_name,
                    job_name=self.job_name,
                )
                self.sops_request_wrapper = self.scrapeops_logger.RequestsWrapper()
            except Exception as e:
                print('Monitoring error:', e)
        else:
            self.sops_request_wrapper = requests

    def scrapeops_proxy_url(self, url, scrapeops_proxy_settings=None):
        """
        Converts URL into ScrapeOps Proxy API URL
        """
        payload = {'api_key': self.scrapeops_api_key, 'url': url}

        ## Global Proxy Settings
        if self.scrapeops_proxy_settings is not None and type(self.scrapeops_proxy_settings) is dict:
            for key, value in self.scrapeops_proxy_settings.items():
                payload[key] = value

        ## Per Request Proxy Settings (override/add to the global settings)
        if scrapeops_proxy_settings is not None and type(scrapeops_proxy_settings) is dict:
            for key, value in scrapeops_proxy_settings.items():
                payload[key] = value

        proxy_url = 'https://proxy.scrapeops.io/v1/?' + urlencode(payload)
        return proxy_url

    def send_request(self, url, method='GET', scrapeops_proxy_settings=None, **kwargs):
        """
        Sends HTTP request and retries failed responses.
        """
        final_url = url
        try:
            if self.scrapeops_proxy_enabled and self.scrapeops_api_key is not None:
                final_url = self.scrapeops_proxy_url(url, scrapeops_proxy_settings)
            for _ in range(self.num_retries):
                try:
                    ## Could also use: self.sops_request_wrapper.request(method, final_url, **kwargs)
                    response = self.sops_request_wrapper.get(final_url, **kwargs)
                    if response.status_code in self.http_allow_list:
                        return response
                except Exception as e:
                    print('Request error:', e)
            return None
        except Exception as e:
            print('Overall error:', e)

    def concurrent_requests(self, function, input_list):
        """
        Enables requests to be sent in parallel
        """
        with concurrent.futures.ThreadPoolExecutor(max_workers=self.num_concurrent_threads) as executor:
            executor.map(function, input_list)


"""
Use ScrapingClient in scraper.
"""

scraping_client = ScrapingClient(
    scrapeops_api_key='YOUR_API_KEY',
    scrapeops_proxy_enabled=True,
    scrapeops_monitoring_enabled=True,
    spider_name='Quotes Scraper',
    job_name='quotes_main',
    num_concurrent_threads=5,
    num_retries=3,
    http_allow_list=[200, 404]
)

list_of_urls = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
    'http://quotes.toscrape.com/page/3/',
    'http://quotes.toscrape.com/page/4/',
    'http://quotes.toscrape.com/page/5/',
    'http://quotes.toscrape.com/page/6/',
    'http://quotes.toscrape.com/page/7/',
]

scraped_quotes = []


def scrape_url(url):
    response = scraping_client.send_request(url)
    if response is not None and response.status_code == 200:
        ## Parse data with BeautifulSoup
        html_response = response.text
        soup = BeautifulSoup(html_response, "html.parser")
        quotes_sections = soup.find_all('div', class_="quote")

        ## Loop through each quotes section and extract the quote and author
        for quote_block in quotes_sections:
            quote = quote_block.find('span', class_='text').text
            author = quote_block.find('small', class_='author').text

            ## Add scraped data to "scraped_quotes" list
            scraped_quotes.append({
                'quote': quote,
                'author': author
            })


scraping_client.concurrent_requests(scrape_url, list_of_urls)
print(scraped_quotes)


More Web Scraping Tutorials

So that's how to build a Python Web Scraping Client/Framework that can be easily integrated into any Python scraper.

If you would like to learn more about web scraping, then be sure to check out The Python Web Scraping Playbook.

Or check out one of our other in-depth guides.