
Python Requests/BS4 Beginners Series Part 6: Proxies

So far in this Python Requests/BeautifulSoup 6-Part Beginner Series, we have learned how to build a basic web scraper in Part 1, scrape data from a website in Part 2, clean it up and save it to a file or database in Part 3, and make our scraper more robust and scalable by handling failed requests and using concurrency in Part 4. We also learned how to use fake user-agents and browser headers to bypass restrictions on sites trying to prevent scraping in Part 5.

In Part 6, we'll explore how to use proxies to bypass various website restrictions by hiding your real IP address and location without needing to worry about user agents and headers.


Python Requests/BeautifulSoup 6-Part Beginner Series

  • Part 1: Basic Python Requests/BeautifulSoup Scraper - We'll go over the basics of scraping with Python, and build our first Python scraper. (Part 1)

  • Part 2: Cleaning Dirty Data & Dealing With Edge Cases - Web data can be messy, unstructured, and have lots of edge cases. In this tutorial we'll make our scraper robust to these edge cases, using data classes and data cleaning pipelines. (Part 2)

  • Part 3: Storing Data in AWS S3, MySQL & Postgres DBs - There are many different ways we can store the data that we scrape from databases, CSV files to JSON format, and S3 buckets. We'll explore several different ways we can store the data and talk about their pros, and cons and in which situations you would use them. (Part 3)

  • Part 4: Managing Retries & Concurrency - Make our scraper more robust and scalable by handling failed requests and using concurrency. (Part 4)

  • Part 5: Faking User-Agents & Browser Headers - Make our scraper production ready by using fake user agents & browser headers to make our scrapers look more like real users. (Part 5)

  • Part 6: Using Proxies To Avoid Getting Blocked - Explore how to use proxies to bypass anti-bot systems by hiding your real IP address and location. (Part 6)

The code for this project is available on GitHub.


Why Use Proxies?

Scraping data from websites can be tricky sometimes. Websites might restrict you based on your location or block your IP address. This is where proxies come in handy.

Proxies help you bypass these restrictions by hiding your real IP address and location. When you use a proxy, your request gets routed through a proxy server first, acting as an intermediary. This way, the website only sees the proxy's IP address, not yours.

Websites often serve different information depending on the visitor's location. Without a proxy, you might not be able to access location-specific content if you aren't physically in that region.

Furthermore, proxies can offer an extra layer of security by encrypting your data as it travels between your device and the server, protecting it from being intercepted by third parties.

Additionally, you can use multiple proxies at the same time to distribute your scraping requests across different IP addresses, avoiding website rate limits.
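With Python Requests, routing traffic through a proxy just means passing a `proxies` dictionary that maps each scheme to the proxy server. A minimal sketch (the proxy address below is a hypothetical placeholder, not a working proxy):

```python
# Hypothetical proxy address for illustration; substitute a real, working proxy.
PROXY = "203.0.113.10:8080"

def build_proxy_config(proxy: str) -> dict:
    """Route both HTTP and HTTPS traffic through the same proxy server."""
    return {"http": f"http://{proxy}", "https": f"http://{proxy}"}

proxies = build_proxy_config(PROXY)
# With a real proxy, the request below would show the proxy's IP, not yours:
# response = requests.get("https://icanhazip.com/", proxies=proxies)
```

The target website only ever sees the proxy's IP address, which is the basis for all three integration methods covered next.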


The 3 Most Common Proxy Integrations

Let's dive into integrating Python Requests with the 3 most common proxy formats:

  1. Rotating Through a List of Proxy IPs
  2. Using Proxy Gateways
  3. Using Proxy API Endpoints

Previously, proxy providers offered lists of IP addresses, and you'd configure your scraper to cycle through them, using a new IP for each request. However, this method is less common now.

Many providers now offer access through proxy gateways or proxy API endpoints instead of raw lists. These gateways act as intermediaries, routing your requests through their pool of IPs.

Proxy Integration #1: Rotating Through Proxy IP List

Using rotating proxies is crucial because websites can restrict access to scrapers that send many requests from the same IP address. This technique makes it harder for websites to track and block your scraping activity by constantly changing the IP address used.

The code snippet fetches a list of free proxies from the Free Proxy List website. It extracts proxy information (IP address and port). Next, it filters out proxies that do not support HTTPS and returns a set of unique proxy entries.

import requests
from bs4 import BeautifulSoup
from itertools import cycle

def get_proxies():
    # Fetching proxies from a website
    url = "https://free-proxy-list.net/"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")

    # Extracting proxy information from the HTML
    proxies = set()
    rows = soup.select("tbody tr")

    for row in rows:
        td7 = row.select_one("td:nth-child(7)")

        # Checking if the proxy supports HTTPS
        if td7 and td7.text.strip().lower() == "yes":
            td1 = row.select_one("td:nth-child(1)").text.strip()
            td2 = row.select_one("td:nth-child(2)").text.strip()

            # Combining IP and Port to form a proxy entry
            combined_result = f"{td1}:{td2}"
            proxies.add(combined_result)
    return proxies

# Obtain the set of proxies and create a cycle
proxies = cycle(get_proxies())
url = "https://icanhazip.com/"

for i in range(1, 6):
    # Selecting the next proxy from the cycle for each request
    proxy = next(proxies)

    try:
        # Making a request using the selected proxy
        response = requests.get(url, proxies={"http": proxy, "https": proxy})

        # Checking for HTTP errors in the response
        response.raise_for_status()

        print(f"Request #{i} successful. IP Address: {response.text.strip()}")
    except Exception as e:
        # Skip free proxies with connection errors; retry with the next proxy in the cycle
        print(f"Request #{i} failed! Exception Name: {type(e).__name__}")

The cycle function takes an iterable and creates an iterator that endlessly cycles through its elements. You select the next proxy from this iterator using the next() function.
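A quick standalone illustration of this behavior, using made-up proxy addresses:

```python
from itertools import cycle

proxy_pool = cycle(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])

# next() returns the proxies in order and wraps around after the last one
first_four = [next(proxy_pool) for _ in range(4)]
# first_four == ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080", "10.0.0.1:8080"]
```

Because the iterator never ends, the same small pool can serve any number of requests.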

Keep in mind that free proxies come with limitations. Not all of them will work reliably, which is why the loop above attempts the request five times.

The output shows that only 2 of the 5 attempts were successful. The fetched IP address belongs to the proxy, not your own machine, confirming that the HTTPS request was indeed routed through a proxy.

[Image: rotate_proxy.png — output of rotating through the proxy IP list]

This is a simplistic example. For larger-scale scraping, you would need to monitor the performance of individual IPs and remove banned or blocked ones from the proxy pool.
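One way to approach that is to track failures per proxy and retire a proxy once it fails too often. A minimal sketch (the class name and failure threshold are illustrative, not part of any library):

```python
import random

class ProxyPool:
    """Drop a proxy from the pool after repeated failures (threshold is illustrative)."""

    def __init__(self, proxies, max_failures=3):
        self.proxies = list(proxies)
        self.failures = {proxy: 0 for proxy in self.proxies}
        self.max_failures = max_failures

    def get(self):
        # Pick any live proxy; raises IndexError if the pool is exhausted
        return random.choice(self.proxies)

    def report_failure(self, proxy):
        # Record a failed request and retire the proxy once it hits the limit
        self.failures[proxy] += 1
        if self.failures[proxy] >= self.max_failures and proxy in self.proxies:
            self.proxies.remove(proxy)
```

Your request loop would call get() before each request and report_failure() whenever a request through that proxy raises an exception.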

Proxy Integration #2: Using Proxy Gateway

Many proxy providers are moving away from selling static IP lists and instead offer access to their proxy pools through a gateway. This eliminates the need to manage and rotate individual IP addresses, as the provider handles that on your behalf. This has become the preferred method for using residential and mobile proxies, and increasingly for datacenter proxies as well.

Here is an example of how to integrate BrightData's residential proxy gateway into your Python Requests scraper:

import requests

proxies = {
    'http': 'http://brd.superproxy.io:22225',
    'https': 'http://brd.superproxy.io:22225',
}

url = 'https://icanhazip.com/'

response = requests.get(url, proxies=proxies, auth=('Username', 'Password'))
print(response.status_code)

Integrating via a gateway is significantly easier compared to a proxy list as you don't have to worry about implementing all the proxy rotation logic.

Proxy Integration #3: Using Proxy API Endpoint

Recently, many proxy providers have begun offering smart proxy APIs. These APIs manage your proxy infrastructure by automatically rotating proxies and headers, allowing you to focus on extracting the data you need.

Typically, you send the URL you want to scrape to an API endpoint, and the API returns the HTML response. While each provider's API integration differs slightly, most are very similar and easy to integrate with.

Here's an example of integrating with the ScrapeOps Proxy Manager:

import requests
from urllib.parse import urlencode

payload = {'api_key': 'APIKEY', 'url': 'https://httpbin.org/ip'}
r = requests.get('https://proxy.scrapeops.io/v1/', params=urlencode(payload))
print(r.text)

Here you simply send the URL you want to scrape to the ScrapeOps API endpoint in the URL query parameter, along with your API key in the api_key query parameter. ScrapeOps will then locate the optimal proxy for the target domain and deliver the HTML response directly to you.

You can get your free API key with 1,000 free requests by signing up here.

Note that, when using proxy API endpoints it is very important to encode the URL you want to scrape before sending it to the Proxy API endpoint. If the URL contains query parameters then the Proxy API might think that those query parameters are for the Proxy API and not the target website.

To encode your URL you just need to use the urlencode(payload) function as we've done above in the example.
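To see why this matters, here is what urlencode does to a target URL that carries its own query string (the page/sort parameters are made up for illustration):

```python
from urllib.parse import urlencode

# A target URL that has its own query parameters
target = "https://httpbin.org/get?page=2&sort=price"
payload = {"api_key": "APIKEY", "url": target}

proxy_request_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
# urlencode percent-encodes "?", "=", and "&" inside the target URL
# (as %3F, %3D, %26), so the proxy API cannot mistake page/sort for its own parameters.
```

Without this encoding, `page=2` and `sort=price` would appear as top-level parameters of the proxy API request itself.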


Integrate Proxy Aggregator into the Existing Scraper

After integrating the ScrapeOps Proxy Aggregator, you won't need to worry about user agents and headers. ScrapeOps Proxy Aggregator acts as an intermediary between your scraper and the target website. It routes your requests through a pool of high-performing proxies from various providers.

These proxies already have different user-agent strings and other headers pre-configured that help you avoid detection and blocks, even without additional middleware.

In our scraper, we only need to modify the retry logic; everything else remains the same. Specifically, we'll add a new method called make_scrapeops_request() to the RetryLogic class, and scrape_page() will now call this new method instead of calling make_request() directly.

make_scrapeops_request() builds the final proxy URL and then passes it to the existing make_request() method.

import requests
from urllib.parse import urlencode

class RetryLogic:
    def __init__(
        self,
        retry_limit=5,
        anti_bot_check=False,
        use_fake_browser_headers=False,
        scrapeops_api_key="",
    ):
        self.retry_limit = retry_limit
        self.anti_bot_check = anti_bot_check
        self.use_fake_browser_headers = use_fake_browser_headers
        self.scrapeops_api_key = scrapeops_api_key

    def make_scrapeops_request(self, url, method="GET", **kwargs):
        payload = {"api_key": self.scrapeops_api_key, "url": url}
        clean_scrapeops_params = {}

        # Extract ScrapeOps parameters and clean the keys
        # (iterate over a copy so we can safely pop from kwargs)
        for key, value in list(kwargs.items()):
            if key.startswith("sops_"):
                kwargs.pop(key, None)
                clean_key = key.replace("sops_", "")
                clean_scrapeops_params[clean_key] = value

        # Update the payload with any additional ScrapeOps params
        payload.update(clean_scrapeops_params)
        proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
        return self.make_request(proxy_url, method=method, **kwargs)

    def make_request(self, url, method="GET", **kwargs):
        kwargs.setdefault("allow_redirects", True)

        # Retry Logic
        for _ in range(self.retry_limit):
            try:
                response = requests.request(method, url, **kwargs)

                # Check if response status code is 200 or 404
                if response.status_code in [200, 404]:
                    if self.anti_bot_check and response.status_code == 200:
                        if not self.passed_anti_bot_check(response):
                            return False, response
                    return True, response
            except Exception as e:
                print("Error:", e)
        return False, None

    def passed_anti_bot_check(self, response):
        # Example Anti-Bot Check
        if "<title>Robot or human?</title>" in response.text:
            return False
        # Passed All Tests
        return True

The make_scrapeops_request() method starts by creating a dictionary called payload with two key-value pairs:

  • api_key: This holds the value stored in the class attribute self.scrapeops_api_key.
  • url: This uses the provided url argument.

Next, the method extracts any keyword arguments prefixed with sops_, strips the prefix, and adds them to the payload. It then constructs the proxy URL by appending the encoded payload to the ScrapeOps proxy base URL (https://proxy.scrapeops.io/v1/?).

Finally, the method calls the make_request method with the constructed proxy_url.
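To see how the sops_ prefix maps onto ScrapeOps query parameters, here is the URL construction isolated into a standalone helper (build_scrapeops_url is a hypothetical name, and sops_country is just an example parameter):

```python
from urllib.parse import urlencode

def build_scrapeops_url(api_key, url, **kwargs):
    # Mirrors make_scrapeops_request's URL construction in isolation
    payload = {"api_key": api_key, "url": url}
    # Keys prefixed with "sops_" become ScrapeOps query parameters
    payload.update({key.replace("sops_", ""): value
                    for key, value in kwargs.items() if key.startswith("sops_")})
    return "https://proxy.scrapeops.io/v1/?" + urlencode(payload)

proxy_url = build_scrapeops_url("APIKEY", "https://example.com/", sops_country="us")
# proxy_url contains api_key=APIKEY, the percent-encoded target URL, and country=us
```

The prefix convention keeps ScrapeOps options cleanly separated from the keyword arguments that are meant for requests itself.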

The integration is incredibly simple. You no longer need to worry about the user agents and browser headers we used before. Just send the URL you want to scrape to the ScrapeOps API endpoint, and it will return the HTML response.


Complete Code

We did it! We have a fully functional scraper that creates a final CSV file containing all the desired data.

import os
import time
import csv
import requests
import concurrent.futures
from urllib.parse import urlencode
from bs4 import BeautifulSoup
from dataclasses import dataclass, field, fields, InitVar, asdict

@dataclass
class Product:
    name: str = ''
    price_string: InitVar[str] = ''
    price_gb: float = field(init=False)
    price_usd: float = field(init=False)
    url: str = ''

    def __post_init__(self, price_string):
        self.name = self.clean_name()
        self.price_gb = self.clean_price(price_string)
        self.price_usd = self.convert_price_to_usd()
        self.url = self.create_absolute_url()

    def clean_name(self):
        if not self.name:
            return 'missing'
        return self.name.strip()

    def clean_price(self, price_string):
        price_string = price_string.strip()
        price_string = price_string.replace('Sale price£', '')
        price_string = price_string.replace('Sale priceFrom £', '')
        return float(price_string) if price_string else 0.0

    def convert_price_to_usd(self):
        return self.price_gb * 1.21

    def create_absolute_url(self):
        if not self.url:
            return 'missing'
        return 'https://www.chocolate.co.uk' + self.url

class ProductDataPipeline:
    def __init__(self, csv_filename='', storage_queue_limit=5):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        products_to_save = self.storage_queue.copy()
        self.storage_queue.clear()
        if not products_to_save:
            return

        keys = [field.name for field in fields(products_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0

        with open(self.csv_filename, mode='a', newline='', encoding='utf-8') as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)

            if not file_exists:
                writer.writeheader()

            for product in products_to_save:
                writer.writerow(asdict(product))

        self.csv_file_open = False

    def clean_raw_product(self, scraped_data):
        return Product(
            name=scraped_data.get('name', ''),
            price_string=scraped_data.get('price', ''),
            url=scraped_data.get('url', '')
        )

    def is_duplicate(self, product_data):
        if product_data.name in self.names_seen:
            print(f"Duplicate item found: {product_data.name}. Item dropped.")
            return True
        self.names_seen.append(product_data.name)
        return False

    def add_product(self, scraped_data):
        product = self.clean_raw_product(scraped_data)
        if not self.is_duplicate(product):
            self.storage_queue.append(product)
            if len(self.storage_queue) >= self.storage_queue_limit and not self.csv_file_open:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)
        if len(self.storage_queue) > 0:
            self.save_to_csv()

class RetryLogic:
    def __init__(self, retry_limit=5, anti_bot_check=False, use_fake_browser_headers=False, scrapeops_api_key=''):
        self.retry_limit = retry_limit
        self.anti_bot_check = anti_bot_check
        self.use_fake_browser_headers = use_fake_browser_headers
        self.scrapeops_api_key = scrapeops_api_key

    def make_scrapeops_request(self, url, method='GET', **kwargs):
        payload = {'api_key': self.scrapeops_api_key, 'url': url}
        clean_scrapeops_params = {
            key.replace('sops_', ''): value
            for key, value in kwargs.items() if key.startswith('sops_')
        }
        # Remove ScrapeOps params from kwargs so they aren't passed on to requests
        kwargs = {key: value for key, value in kwargs.items() if not key.startswith('sops_')}
        payload.update(clean_scrapeops_params)
        proxy_url = 'https://proxy.scrapeops.io/v1/?' + urlencode(payload)
        return self.make_request(proxy_url, method=method, **kwargs)

    def make_request(self, url, method='GET', **kwargs):
        kwargs.setdefault('allow_redirects', True)

        for _ in range(self.retry_limit):
            try:
                response = requests.request(method, url, **kwargs)
                if response.status_code in [200, 404]:
                    if self.anti_bot_check and response.status_code == 200 and not self.passed_anti_bot_check(response):
                        return False, response
                    return True, response

            except Exception as e:
                print('Error', e)
        return False, None

    def passed_anti_bot_check(self, response):
        return '<title>Robot or human?</title>' not in response.text

def scrape_page(url):
    list_of_urls.remove(url)
    valid, response = retry_request.make_scrapeops_request(url)
    if valid and response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        products = soup.select('product-item')
        for product in products:
            name = product.select('a.product-item-meta__title')[0].get_text()
            price = product.select('span.price')[0].get_text().replace('\nSale price£', '')
            url = product.select('div.product-item-meta a')[0]['href']

            data_pipeline.add_product({'name': name, 'price': price, 'url': url})

        next_page = soup.select('a[rel="next"]')
        if next_page:
            list_of_urls.append('https://www.chocolate.co.uk' + next_page[0]['href'])

def start_concurrent_scrape(num_threads=5):
    while list_of_urls:
        with concurrent.futures.ThreadPoolExecutor(max_workers=num_threads) as executor:
            executor.map(scrape_page, list_of_urls)

list_of_urls = ['https://www.chocolate.co.uk/collections/all']

if __name__ == "__main__":
    data_pipeline = ProductDataPipeline(csv_filename='product_data.csv')
    retry_request = RetryLogic(retry_limit=3, anti_bot_check=False,
                               use_fake_browser_headers=False, scrapeops_api_key='YOUR_API_KEY')
    start_concurrent_scrape(num_threads=10)
    data_pipeline.close_pipeline()

The CSV file:

[Image: csv-data.png — the scraped product data in CSV format]


Conclusion

The guide explored using proxies to bypass website restrictions by masking your real IP address and location. We discussed the three most common proxy integration methods in detail. Finally, we successfully integrated the ScrapeOps Proxy Aggregator into our existing scraper code.

You can revisit any of our previous articles in the Python Requests/BeautifulSoup 6-Part Beginner Series:

  • Part 1: Basic Python Requests/BeautifulSoup Scraper - We'll go over the basics of scraping with Python, and build our first Python scraper. (Part 1)

  • Part 2: Cleaning Dirty Data & Dealing With Edge Cases - Web data can be messy, unstructured, and have lots of edge cases. In this tutorial we'll make our scraper robust to these edge cases, using data classes and data cleaning pipelines. (Part 2)

  • Part 3: Storing Our Data - There are many different ways we can store the data that we scrape from databases, CSV files to JSON format, and S3 buckets. We'll explore several different ways we can store the data and talk about their pros, and cons and in which situations you would use them. (Part 3)

  • Part 4: Retries & Concurrency - Make our scraper more robust and scalable by handling failed requests and using concurrency. (Part 4)

  • Part 5: Faking User-Agents & Browser Headers - Make our scraper production ready by using fake user agents & browser headers to make our scrapers look more like real users. (Part 5)

  • Part 6: Using Proxies To Avoid Getting Blocked - Explore how to use proxies to bypass anti-bot systems by hiding your real IP address and location. (This Tutorial)