Python Scrapy Code Examples

The following are code examples showing how to integrate the ScrapeOps Proxy Aggregator with your Python Scrapy spiders.

Authorisation - API Key

To use the ScrapeOps proxy, you first need an API key which you can get by signing up for a free account here.

Your API key must be included with every request using the api_key query parameter, otherwise the API will return a 403 Forbidden status code.
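If you want to quickly verify your API key before wiring it into Scrapy, a minimal sketch using the requests library looks like this (the target URL is just an example):

import requests

API_KEY = 'YOUR_API_KEY'

response = requests.get(
    'https://proxy.scrapeops.io/v1/',
    params={'api_key': API_KEY, 'url': 'http://quotes.toscrape.com/page/1/'},
)
print(response.status_code)  # 200 on success, 403 if the api_key is missing or invalid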


Basic Request Wrapper

If you want to integrate the ScrapeOps proxy on a request-by-request basis, you can use a simple function to modify the URLs that Scrapy requests.

The following is some example Python code to send a URL to the ScrapeOps Proxy endpoint https://proxy.scrapeops.io/v1/:


import scrapy
from urllib.parse import urlencode

API_KEY = 'YOUR_API_KEY'

def get_scrapeops_url(url):
    payload = {'api_key': API_KEY, 'url': url}
    proxy_url = 'https://proxy.scrapeops.io/v1/?' + urlencode(payload)
    return proxy_url

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=get_scrapeops_url(url), callback=self.parse)


Here you just need to apply the get_scrapeops_url function to every URL you request with scrapy.Request. From there, ScrapeOps will take care of proxy selection and rotation for you, so you just need to send us the URL you want to scrape.
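Note that with this approach response.url will be the ScrapeOps proxy URL rather than the target URL, so follow-up links extracted from a page should be resolved against the original domain and wrapped with get_scrapeops_url again. Here is a sketch of a parse method for the spider above (the selectors and hard-coded domain assume the quotes.toscrape.com markup):

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }

        # response.url points at the proxy endpoint, so resolve the
        # relative next-page link against the original domain before
        # wrapping it in the proxy URL again.
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            next_url = 'http://quotes.toscrape.com' + next_page
            yield scrapy.Request(url=get_scrapeops_url(next_url), callback=self.parse)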


Proxy Middleware

The other approach is to create a Downloader Middleware and activate it for the entire project, for each spider individually, or on each request. Here is an example middleware you can use:

## middleware.py

from urllib.parse import urlencode
from scrapy import Request

class ScrapeOpsProxyMiddleware:

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def __init__(self, settings):
        self.scrapeops_api_key = settings.get('SCRAPEOPS_API_KEY')
        self.scrapeops_endpoint = 'https://proxy.scrapeops.io/v1/?'
        self.scrapeops_proxy_active = settings.getbool('SCRAPEOPS_PROXY_ENABLED', False)

    @staticmethod
    def _param_is_true(request, key):
        # Accept both boolean True and the string 'true' in request.meta.
        value = request.meta.get(key, False)
        if isinstance(value, str):
            return value.lower() == 'true'
        return bool(value)

    @staticmethod
    def _replace_response_url(response):
        # The proxy reports the final target URL in the Sops-Final-Url
        # header; restore it so callbacks see the real page URL.
        real_url = response.headers.get('Sops-Final-Url')
        if real_url is None:
            return response
        return response.replace(
            url=real_url.decode(response.headers.encoding))

    def _get_scrapeops_url(self, request):
        payload = {'api_key': self.scrapeops_api_key, 'url': request.url}
        if self._param_is_true(request, 'sops_render_js'):
            payload['render_js'] = True
        if self._param_is_true(request, 'sops_residential'):
            payload['residential'] = True
        if self._param_is_true(request, 'sops_keep_headers'):
            payload['keep_headers'] = True
        if request.meta.get('sops_country') is not None:
            payload['country'] = request.meta.get('sops_country')
        return self.scrapeops_endpoint + urlencode(payload)

    def _scrapeops_proxy_enabled(self):
        if self.scrapeops_api_key is None or self.scrapeops_api_key == '' or not self.scrapeops_proxy_active:
            return False
        return True

    def process_request(self, request, spider):
        # Leave the request untouched if the proxy is disabled or the
        # URL has already been rewritten to point at the proxy.
        if not self._scrapeops_proxy_enabled() or self.scrapeops_endpoint in request.url:
            return None

        scrapeops_url = self._get_scrapeops_url(request)
        return request.replace(
            cls=Request, url=scrapeops_url, meta=request.meta)

    def process_response(self, request, response, spider):
        return self._replace_response_url(response)
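The middleware above is switched on globally via the SCRAPEOPS_PROXY_ENABLED setting. If you also want to skip the proxy on a per-request basis, one possible approach (a sketch, not part of the middleware above; the sops_proxy_disabled meta key is hypothetical) is to check a meta flag at the top of process_request:

    def process_request(self, request, spider):
        # Hypothetical per-request opt-out via request.meta.
        if request.meta.get('sops_proxy_disabled', False):
            return None
        if not self._scrapeops_proxy_enabled() or self.scrapeops_endpoint in request.url:
            return None
        # ... rewrite the request as above ...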


And then enable it in your project in the settings.py file, remembering to swap YOUR_PROJECT_NAME for the name of your project (the BOT_NAME in your settings.py file). The priority of 725 slots the middleware in just before Scrapy's built-in HttpProxyMiddleware, which sits at 750 in the default downloader middleware ordering:

## settings.py

SCRAPEOPS_API_KEY = 'YOUR_API_KEY'
SCRAPEOPS_PROXY_ENABLED = True

DOWNLOADER_MIDDLEWARES = {
    'YOUR_PROJECT_NAME.middlewares.ScrapeOpsProxyMiddleware': 725,
}

Or in the spider itself using the custom_settings attribute.

## your_spider.py

import scrapy
from demo.items import QuoteItem


class QuotesSpider(scrapy.Spider):
    name = "demo"
    start_urls = ["http://quotes.toscrape.com/"]

    ## Enable ScrapeOps Proxy Here
    custom_settings = {
        'SCRAPEOPS_API_KEY': 'YOUR_API_KEY',
        'SCRAPEOPS_PROXY_ENABLED': True,
        'DOWNLOADER_MIDDLEWARES': {
            'YOUR_PROJECT_NAME.middlewares.ScrapeOpsProxyMiddleware': 725,
        }
    }

    def parse(self, response):
        for quote in response.css('div.quote'):
            # Create a fresh item per quote so each yield carries its own data.
            quote_item = QuoteItem()
            quote_item['text'] = quote.css('span.text::text').get()
            quote_item['author'] = quote.css('small.author::text').get()
            quote_item['tags'] = quote.css('div.tags a.tag::text').getall()
            yield quote_item

        # go to next page
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)


Response Format

After receiving a response from one of our proxy providers, the ScrapeOps Proxy API will respond with the raw HTML content of the target URL along with a response code:


<html>
<head>
...
</head>
<body>
...
</body>
</html>

The ScrapeOps Proxy API will return a 200 status code when it successfully gets a response from the website that also passes response validation, or a 404 status code if the website itself responds with a 404. Both of these status codes are considered successful requests.

Here is the full list of status codes the Proxy API returns.
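By default, Scrapy's HttpErrorMiddleware drops non-2xx responses before they reach your callback, so if you want to handle the 404 case yourself you need to let that status through. A minimal sketch:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    handle_httpstatus_list = [404]  # allow 404 responses to reach parse()

    def parse(self, response):
        if response.status == 404:
            self.logger.info('Page not found: %s', response.url)
            return
        # ... normal parsing here ...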


Advanced Functionality

To enable other API functionality when using the Proxy API endpoint you need to add the appropriate query parameters to the ScrapeOps Proxy URL.

For example, if you want to enable Javascript rendering via a headless browser, add render_js=true to the request by modifying the get_scrapeops_url function like this:


from urllib.parse import urlencode

API_KEY = 'YOUR_API_KEY'

def get_scrapeops_url(url):
    payload = {'api_key': API_KEY, 'url': url, 'render_js': True}
    proxy_url = 'https://proxy.scrapeops.io/v1/?' + urlencode(payload)
    return proxy_url

Or, if you are using the proxy via the example middleware above, you can enable specific functionality by adding the parameters to the request's meta attribute, prefixing each parameter key with sops_.

## your_spider.py

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'https://quotes.toscrape.com/page/1/',
            'https://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse, meta={'sops_render_js': True, 'sops_country': 'us'})

    def parse(self, response):
        pass

Check out this guide to see the full list of advanced functionality available.


Timeout

The ScrapeOps proxy keeps retrying a request for up to 2 minutes before returning a failed response to you.

To use the Proxy correctly, you should set the timeout on your requests to at least 2 minutes, so that you aren't charged for a successful request that timed out on your end before the Proxy API responded.
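In Scrapy, this timeout is controlled by the DOWNLOAD_TIMEOUT setting (180 seconds by default, which already covers the 2-minute window). If you have lowered it elsewhere, make sure it stays at 120 seconds or more:

## settings.py

DOWNLOAD_TIMEOUT = 120  # seconds; must cover the proxy's 2 minute retry window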