Python Scrapy Code Examples

The following are code examples on how to integrate the ScrapeOps Proxy Aggregator with your Python Scrapy Spiders.

Authorisation - API Key

To use the ScrapeOps proxy, you first need an API key which you can get by signing up for a free account here.

Your API key must be included with every request using the api_key query parameter, otherwise the API will return a 403 Forbidden status code.
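
As a quick illustration (a minimal sketch using the plain requests library rather than Scrapy, with a placeholder key), a proxied request just adds api_key and url as query parameters to the proxy endpoint:

import requests

response = requests.get(
    'https://proxy.scrapeops.io/v1/',
    params={
        'api_key': 'YOUR_API_KEY',               # your ScrapeOps API key
        'url': 'https://quotes.toscrape.com/',   # the page you want to scrape
    },
)
print(response.status_code)  # 403 if the api_key is missing or invalid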


ScrapeOps Scrapy Proxy SDK Installation

If you want to easily integrate the ScrapeOps proxy into your Scrapy project, you can install our pre-made SDK with pip. All the requests you make will then be automatically routed through our proxy and bypass any anti-bots!

We can quickly install it into our project using the following command:


pip install scrapeops-scrapy-proxy-sdk

And then enable it in your project in the settings.py file.


SCRAPEOPS_API_KEY = 'YOUR_API_KEY'
SCRAPEOPS_PROXY_ENABLED = True

DOWNLOADER_MIDDLEWARES = {
    'scrapeops_scrapy_proxy_sdk.scrapeops_scrapy_proxy_sdk.ScrapeOpsScrapyProxySdk': 725,
}

Now when you run your spiders, the requests will be automatically sent through the ScrapeOps Proxy API.
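
No changes are needed in the spider code itself. As a minimal sketch (the spider name and target URL here are just placeholders), a plain spider like this will now have all of its requests routed through the proxy:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # No proxy-specific code needed here - the SDK middleware
    # rewrites each request to go through the ScrapeOps proxy.
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Parse the returned HTML as normal
        for text in response.css('div.quote span.text::text').getall():
            yield {'text': text}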

Enabling Advanced Functionality

The ScrapeOps Proxy API supports a range of more advanced features that you can enable by adding extra query parameters to your request.

To enable them when using the ScrapeOps Scrapy Proxy Middleware, you can use one of three methods:

Method #1: Global Project Settings

You can apply the proxy settings to every spider that runs in your project by adding a SCRAPEOPS_PROXY_SETTINGS dictionary to your settings.py file with the extra features you want to enable.


SCRAPEOPS_API_KEY = 'YOUR_API_KEY'
SCRAPEOPS_PROXY_ENABLED = True
SCRAPEOPS_PROXY_SETTINGS = {'country': 'us'}

DOWNLOADER_MIDDLEWARES = {
    'scrapeops_scrapy_proxy_sdk.scrapeops_scrapy_proxy_sdk.ScrapeOpsScrapyProxySdk': 725,
}

Method #2: Spider Settings

You can apply the proxy settings to every request a spider makes by adding a SCRAPEOPS_PROXY_SETTINGS dictionary to the custom_settings attribute in your spider with the extra features you want to enable.


import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    custom_settings = {
        'SCRAPEOPS_PROXY_SETTINGS': {'country': 'us'}
    }

    def start_requests(self):
        urls = [
            'https://quotes.toscrape.com/page/1/',
            'https://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        pass

Method #3: Request Settings

You can apply the proxy settings to each individual request a spider makes by adding the extra features you want to enable to the meta parameter of each request.

When using this method you need to add 'sops_' to the start of the feature key you want to enable. So to enable 'country': 'uk', you would use 'sops_country': 'uk'.


import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'https://quotes.toscrape.com/page/1/',
            'https://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, meta={'sops_country': 'uk'}, callback=self.parse)

    def parse(self, response):
        pass

A full list of advanced features can be found here.


Basic Request Wrapper

If you want to integrate the ScrapeOps proxy on a request-by-request basis, you can use a small helper function to modify the URLs that Scrapy requests.

The following is some example Python code to send a URL to the ScrapeOps Proxy endpoint https://proxy.scrapeops.io/v1/:


import scrapy
from urllib.parse import urlencode

API_KEY = 'YOUR_API_KEY'

def get_scrapeops_url(url):
    payload = {'api_key': API_KEY, 'url': url}
    proxy_url = 'https://proxy.scrapeops.io/v1/?' + urlencode(payload)
    return proxy_url

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=get_scrapeops_url(url), callback=self.parse)

    def parse(self, response):
        pass


Here you just need to call get_scrapeops_url on every URL you request using scrapy.Request. From there, ScrapeOps will take care of proxy selection and rotation for you, so you just need to send us the URL you want to scrape.


Proxy Middleware

The other approach is to create a Downloader Middleware and activate it for the entire project, for each spider individually, or on individual requests. Here is an example middleware you can use:

## middlewares.py

from urllib.parse import urlencode
from scrapy import Request

class ScrapeOpsProxyMiddleware:

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def __init__(self, settings):
        self.scrapeops_api_key = settings.get('SCRAPEOPS_API_KEY')
        self.scrapeops_endpoint = 'https://proxy.scrapeops.io/v1/?'
        self.scrapeops_proxy_active = settings.get('SCRAPEOPS_PROXY_ENABLED', False)

    @staticmethod
    def _param_is_true(request, key):
        # Accept both boolean True and the string 'true' in the request meta
        value = request.meta.get(key, False)
        if isinstance(value, str):
            return value.lower() == 'true'
        return bool(value)

    @staticmethod
    def _replace_response_url(response):
        # The proxy returns the final target URL in the Sops-Final-Url header,
        # so swap it back in so the response URL matches the real site.
        real_url = response.headers.get('Sops-Final-Url')
        if real_url is None:
            return response
        return response.replace(url=real_url.decode(response.headers.encoding))

    def _get_scrapeops_url(self, request):
        payload = {'api_key': self.scrapeops_api_key, 'url': request.url}
        if self._param_is_true(request, 'sops_render_js'):
            payload['render_js'] = True
        if self._param_is_true(request, 'sops_residential'):
            payload['residential'] = True
        if self._param_is_true(request, 'sops_keep_headers'):
            payload['keep_headers'] = True
        if request.meta.get('sops_country') is not None:
            payload['country'] = request.meta.get('sops_country')
        proxy_url = self.scrapeops_endpoint + urlencode(payload)
        return proxy_url

    def _scrapeops_proxy_enabled(self):
        if self.scrapeops_api_key is None or self.scrapeops_api_key == '' or not self.scrapeops_proxy_active:
            return False
        return True

    def process_request(self, request, spider):
        # Leave the request untouched if the proxy is disabled or the
        # request is already pointed at the proxy endpoint.
        if not self._scrapeops_proxy_enabled() or self.scrapeops_endpoint in request.url:
            return None

        scrapeops_url = self._get_scrapeops_url(request)
        new_request = request.replace(
            cls=Request, url=scrapeops_url, meta=request.meta)
        return new_request

    def process_response(self, request, response, spider):
        # Swap the proxy URL back to the real target URL on the response
        return self._replace_response_url(response)


And then enable it in your project in the settings.py file.

## settings.py

SCRAPEOPS_API_KEY = 'YOUR_API_KEY'
SCRAPEOPS_PROXY_ENABLED = True

DOWNLOADER_MIDDLEWARES = {
    'YOUR_PROJECT_NAME.middlewares.ScrapeOpsProxyMiddleware': 725,
}

Replace YOUR_PROJECT_NAME

Remember to swap YOUR_PROJECT_NAME for the name of your project (the BOT_NAME in your settings.py file).
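
For example, assuming a hypothetical project named demo (so BOT_NAME = 'demo'), the middleware entry would look like this:

## settings.py

BOT_NAME = 'demo'

SCRAPEOPS_API_KEY = 'YOUR_API_KEY'
SCRAPEOPS_PROXY_ENABLED = True

DOWNLOADER_MIDDLEWARES = {
    'demo.middlewares.ScrapeOpsProxyMiddleware': 725,
}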

Or in the spider itself using the custom_settings attribute.

## your_spider.py

import scrapy
from demo.items import QuoteItem


class QuotesSpider(scrapy.Spider):
    name = "demo"
    start_urls = ["http://quotes.toscrape.com/"]

    ## Enable ScrapeOps Proxy Here
    custom_settings = {
        'SCRAPEOPS_API_KEY': 'YOUR_API_KEY',
        'SCRAPEOPS_PROXY_ENABLED': True,
        'DOWNLOADER_MIDDLEWARES': {
            'YOUR_PROJECT_NAME.middlewares.ScrapeOpsProxyMiddleware': 725,
        }
    }

    def parse(self, response):
        for quote in response.css('div.quote'):
            quote_item = QuoteItem()
            quote_item['text'] = quote.css('span.text::text').get()
            quote_item['author'] = quote.css('small.author::text').get()
            quote_item['tags'] = quote.css('div.tags a.tag::text').getall()
            yield quote_item

        # go to next page
        next_page = response.css("li.next a::attr(href)").extract_first()
        if next_page:
            yield response.follow(next_page, callback=self.parse)


Response Format

After receiving a response from one of our proxy providers, the ScrapeOps Proxy API will respond with the raw HTML content of the target URL along with a response code:


<html>
<head>
...
</head>
<body>
...
</body>
</html>

The ScrapeOps Proxy API will return a 200 status code when it successfully gets a response from the target website that also passes response validation, or a 404 status code if the website responds with a 404 status code. Both of these status codes are considered successful requests.

Here is the full list of status codes the Proxy API returns.
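
If you also want your spider callbacks to receive those 404 responses (Scrapy's HttpError middleware filters out non-2xx responses by default), a minimal sketch looks like this, using placeholder spider and URL names:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # Let 404 responses through to the callback instead of being filtered
    handle_httpstatus_list = [404]
    start_urls = ["https://quotes.toscrape.com/page/1/"]

    def parse(self, response):
        if response.status == 404:
            self.logger.info("Page not found: %s", response.url)
            return
        # normal parsing logic here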


Concurrency Management

When using Scrapy with the ScrapeOps Proxy, you need to make sure you don't exceed the concurrency limit of the plan you are using.

For example, if you were using the Free Plan which has a concurrency limit of 1 thread, then you would set CONCURRENT_REQUESTS=1 in your settings.py file.

For maximum performance you would also ensure that DOWNLOAD_DELAY is set to zero in your settings.py file (this is the default setting).

## settings.py

CONCURRENT_REQUESTS = 1
DOWNLOAD_DELAY = 0


Advanced Functionality

To enable other API functionality when using the Proxy API endpoint you need to add the appropriate query parameters to the ScrapeOps Proxy URL.

For example, if you want to enable Javascript rendering via a headless browser for a request, you add render_js=true to that request. Using the basic request wrapper from above, you would modify the get_scrapeops_url function like this:


from urllib.parse import urlencode

API_KEY = 'YOUR_API_KEY'

def get_scrapeops_url(url):
    payload = {'api_key': API_KEY, 'url': url, 'render_js': True}
    proxy_url = 'https://proxy.scrapeops.io/v1/?' + urlencode(payload)
    return proxy_url

Or, if you are using the proxy via the example middleware above, you can enable specific functionality by adding it to the request meta parameter and prepending sops_ to the parameter key.

## your_spider.py

import scrapy
from demo.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'https://quotes.toscrape.com/page/1/',
            'https://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse, meta={'sops_render_js': True, 'sops_country': 'us'})

    def parse(self, response):
        pass

Check out this guide to see the full list of advanced functionality available.


Timeout

The ScrapeOps proxy keeps retrying a request for up to 2 minutes before returning a failed response to you.

To use the Proxy correctly, you should set the timeout on your requests to at least 2 minutes. This avoids being charged for a successful request that you timed out on your end before the Proxy API responded.
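
In Scrapy, the relevant setting is DOWNLOAD_TIMEOUT, which defaults to 180 seconds and so already satisfies this. If you have lowered it elsewhere, a minimal sketch to keep it at 2 minutes or more would be:

## settings.py

DOWNLOAD_TIMEOUT = 120  # give the Proxy API up to 2 minutes to respond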