Python Scrapy Code Examples
The following are code examples showing how to integrate the ScrapeOps Proxy Aggregator with your Python Scrapy spiders.
Authorisation - API Key
To use the ScrapeOps proxy, you first need an API key which you can get by signing up for a free account here.
Your API key must be included with every request using the api_key
query parameter, otherwise the API will return a 403 Forbidden Access
status code.
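For illustration only (this snippet is not part of the Scrapy integration and assumes the requests library is installed), the API key is simply passed as a query parameter on the proxy endpoint:

import requests

response = requests.get(
    'https://proxy.scrapeops.io/v1/',
    params={
        'api_key': 'YOUR_API_KEY',             # required on every request
        'url': 'http://quotes.toscrape.com/',  # the page you want to scrape
    },
)
print(response.status_code)  # 403 Forbidden if the api_key is missing or invalid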
Basic Request Wrapper
If you want to integrate the ScrapeOps proxy on a request-by-request basis, you can use a small helper function to modify the URLs your Scrapy requests use.
The following is some example Python code to send a URL to the ScrapeOps Proxy endpoint https://proxy.scrapeops.io/v1/:
import scrapy
from urllib.parse import urlencode

API_KEY = 'YOUR_API_KEY'

def get_scrapeops_url(url):
    payload = {'api_key': API_KEY, 'url': url}
    proxy_url = 'https://proxy.scrapeops.io/v1/?' + urlencode(payload)
    return proxy_url

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=get_scrapeops_url(url), callback=self.parse)
Here you just need to wrap every URL you request with scrapy.Request
in the get_scrapeops_url
function. From there, ScrapeOps will take care of proxy selection and rotation for you, so you only need to send us the URL you want to scrape.
Proxy Middleware
The other approach is to create a Downloader Middleware and activate it for the entire project, for each spider individually, or on a per-request basis. Here is an example middleware you can use:
## middleware.py

from urllib.parse import urlencode
from scrapy import Request

class ScrapeOpsProxyMiddleware:

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def __init__(self, settings):
        self.scrapeops_api_key = settings.get('SCRAPEOPS_API_KEY')
        self.scrapeops_endpoint = 'https://proxy.scrapeops.io/v1/?'
        self.scrapeops_proxy_active = settings.get('SCRAPEOPS_PROXY_ENABLED', False)

    @staticmethod
    def _param_is_true(request, key):
        value = request.meta.get(key, False)
        return value is True or str(value).lower() == 'true'

    @staticmethod
    def _replace_response_url(response):
        real_url = response.headers.get('Sops-Final-Url')
        if real_url is None:
            return response
        return response.replace(
            url=real_url.decode(response.headers.encoding))

    def _get_scrapeops_url(self, request):
        payload = {'api_key': self.scrapeops_api_key, 'url': request.url}
        if self._param_is_true(request, 'sops_render_js'):
            payload['render_js'] = True
        if self._param_is_true(request, 'sops_residential'):
            payload['residential'] = True
        if self._param_is_true(request, 'sops_keep_headers'):
            payload['keep_headers'] = True
        if request.meta.get('sops_country') is not None:
            payload['country'] = request.meta.get('sops_country')
        proxy_url = self.scrapeops_endpoint + urlencode(payload)
        return proxy_url

    def _scrapeops_proxy_enabled(self):
        if self.scrapeops_api_key is None or self.scrapeops_api_key == '' or not self.scrapeops_proxy_active:
            return False
        return True

    def process_request(self, request, spider):
        if not self._scrapeops_proxy_enabled() or self.scrapeops_endpoint in request.url:
            return None
        scrapeops_url = self._get_scrapeops_url(request)
        new_request = request.replace(
            cls=Request, url=scrapeops_url, meta=request.meta)
        return new_request

    def process_response(self, request, response, spider):
        return self._replace_response_url(response)
And then enable it in your project in the settings.py
file, remembering to swap YOUR_PROJECT_NAME
for the name of your project (the BOT_NAME
in your settings.py
file):
## settings.py

SCRAPEOPS_API_KEY = 'YOUR_API_KEY'
SCRAPEOPS_PROXY_ENABLED = True

DOWNLOADER_MIDDLEWARES = {
    'YOUR_PROJECT_NAME.middlewares.ScrapeOpsProxyMiddleware': 725,
}
Or in the spider itself using the custom_settings
attribute.
## your_spider.py

import scrapy
from demo.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = "demo"
    start_urls = ["http://quotes.toscrape.com/"]

    ## Enable ScrapeOps Proxy Here
    custom_settings = {
        'SCRAPEOPS_API_KEY': 'YOUR_API_KEY',
        'SCRAPEOPS_PROXY_ENABLED': True,
        'DOWNLOADER_MIDDLEWARES': {
            'YOUR_PROJECT_NAME.middlewares.ScrapeOpsProxyMiddleware': 725,
        }
    }

    def parse(self, response):
        for quote in response.css('div.quote'):
            quote_item = QuoteItem()
            quote_item['text'] = quote.css('span.text::text').get()
            quote_item['author'] = quote.css('small.author::text').get()
            quote_item['tags'] = quote.css('div.tags a.tag::text').getall()
            yield quote_item

        # go to next page
        next_page = response.css("li.next a::attr(href)").extract_first()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
Response Format
After receiving a response from one of our proxy providers, the ScrapeOps Proxy API will respond with the raw HTML content of the target URL along with a response code:
<html>
<head>
...
</head>
<body>
...
</body>
</html>
The ScrapeOps Proxy API will return a 200
status code when it successfully gets a response from the target website that also passes response validation, or a 404
status code if the website itself responds with a 404
status code. Both of these status codes are considered successful requests.
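As an illustration only (not part of the official integration), here is one way you might handle these status codes in a spider callback. The handle_httpstatus_list attribute is standard Scrapy and is needed if you want 404 responses passed to your callback:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/']
    # Scrapy only passes non-2xx responses to callbacks if you allow them explicitly
    handle_httpstatus_list = [404]

    def parse(self, response):
        # 200 and 404 both count as successful requests from the proxy's point of view
        if response.status == 404:
            self.logger.info('Page not found: %s', response.url)
            return
        yield {'url': response.url, 'title': response.css('title::text').get()}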
Here is the full list of status codes the Proxy API returns.
Advanced Functionality
To enable other API functionality when using the Proxy API endpoint, you need to add the appropriate query parameters to the ScrapeOps Proxy URL.
For example, if you want to enable Javascript rendering via a headless browser, then add render_js=true
to the request. Using the basic request wrapper approach, you would modify the get_scrapeops_url
function like this:
from urllib.parse import urlencode

API_KEY = 'YOUR_API_KEY'

def get_scrapeops_url(url):
    payload = {'api_key': API_KEY, 'url': url, 'render_js': True}
    proxy_url = 'https://proxy.scrapeops.io/v1/?' + urlencode(payload)
    return proxy_url
Or, if you were using the proxy via the example middleware above, you could enable specific functionality by adding the parameters to the request's meta
dictionary and prefixing each parameter key with sops_
.
## your_spider.py

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'https://quotes.toscrape.com/page/1/',
            'https://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse, meta={'sops_render_js': True, 'sops_country': 'us'})

    def parse(self, response):
        pass
Check out this guide to see the full list of advanced functionality available.
Timeout
The ScrapeOps proxy keeps retrying a request for up to 2 minutes before returning a failed response to you.
To use the Proxy correctly, you should set the timeout on your requests to at least 2 minutes, so you are not charged for successful requests that timed out on your end before the Proxy API responded.
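As a sketch, in Scrapy this is controlled by the standard DOWNLOAD_TIMEOUT setting. Its default of 180 seconds already satisfies the 2-minute requirement, so you only need to make sure it hasn't been lowered:

## settings.py
# Give the ScrapeOps proxy up to 2 minutes (or more) to respond before Scrapy gives up.
# DOWNLOAD_TIMEOUT is a standard Scrapy setting; its default is 180 seconds.
DOWNLOAD_TIMEOUT = 120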