Getting Started

ScrapeOps Proxy Aggregator is an easy-to-use proxy that gives you access to the best-performing proxies via a single endpoint. We take care of finding the best proxies, so you can focus on the data.

Authorisation - API Key

To use the ScrapeOps proxy, you first need an API key which you can get by signing up for a free account here.

Your API key must be included with every request using the api_key query parameter, otherwise the API will return a 403 Forbidden status code.


Integration Method 1 - API Endpoint

To make requests, you need to send the URL you want to scrape to the ScrapeOps Proxy endpoint https://proxy.scrapeops.io/v1/, adding your API key and URL to the request using the api_key and url query parameters:


curl -k "https://proxy.scrapeops.io/v1/?api_key=YOUR_API_KEY&url=http://httpbin.org/anything"

The ScrapeOps Proxy supports GET and POST requests. For information on how to use POST requests, check out the documentation here.
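
As a rough sketch of a POST request in Python (the exact forwarding behaviour is covered in the POST documentation; the payload below is illustrative only, and we assume the POST body is passed on to the target URL):


import requests
from urllib.parse import urlencode

# Sketch only: assumes the POST body is forwarded to the target URL.
proxy_params = {
    'api_key': 'YOUR_API_KEY',
    'url': 'http://httpbin.org/anything',
}

response = requests.post(
    url='https://proxy.scrapeops.io/v1/?' + urlencode(proxy_params),
    json={'example': 'payload'},  # illustrative body, sent on to the target
    timeout=120,
)

print(response.status_code)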

The following is some example Python code to use with the Proxy API:


import requests
from urllib.parse import urlencode

proxy_params = {
    'api_key': 'YOUR_API_KEY',
    'url': 'http://httpbin.org/ip',
    'render_js': True,
}

response = requests.get(
    url='https://proxy.scrapeops.io/v1/',
    params=urlencode(proxy_params),  # pre-encode so the target URL is safely escaped
    timeout=120,
)

print('Body: ', response.content)

ScrapeOps will take care of proxy selection and rotation for you, so you just need to send us the URL you want to scrape.

URL Encoding

When using the ScrapeOps Proxy Aggregator API integration method, you should always encode your target URL.

This is because if you send an unencoded URL that contains query parameters, the API may mistake those query parameters for parameters meant for the API itself, rather than part of your URL.

Here is documentation on how to encode URLs in various programming languages.
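
For example, in Python you can encode the target URL with the standard library's urllib.parse module:


from urllib.parse import urlencode, quote

target_url = 'http://httpbin.org/get?foo=bar&baz=qux'

# Option 1: encode just the target URL with quote()
encoded_url = quote(target_url, safe='')

# Option 2: let urlencode() encode all the proxy parameters in one go
query_string = urlencode({
    'api_key': 'YOUR_API_KEY',
    'url': target_url,
})

print('https://proxy.scrapeops.io/v1/?api_key=YOUR_API_KEY&url=' + encoded_url)
print('https://proxy.scrapeops.io/v1/?' + query_string)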


Integration Method 2 - Proxy Port

For those of you with existing proxy pools, we offer an easy-to-use proxy port solution that takes your requests and passes them through to the API endpoint, which then looks after proxy rotation, captchas, and retries.

The proxy port is a light front-end for the API and has all the same functionality and performance as sending requests to the API endpoint.

The username for the proxy is scrapeops and the password is your API key.


curl -x "http://scrapeops:YOUR_API_KEY@proxy.scrapeops.io:5353" "http://httpbin.org/ip"


Here are the individual connection details:

  • Proxy: proxy.scrapeops.io
  • Port: 5353
  • Username: scrapeops
  • Password: YOUR_API_KEY

SSL Certificate Verification

Note: So that we can properly direct your requests through the API, your code must be configured to not verify SSL certificates.

To enable extra/advanced functionality, you can pass parameters by adding them to the username, separated by periods.

For example, if you want to enable Javascript rendering with a request, the username would be scrapeops.render=true.

Also, multiple parameters can be included by separating them with periods, for example:


curl -x "http://scrapeops.country=us:YOUR_API_KEY@proxy.scrapeops.io:5353" "http://httpbin.org/ip"

Below we have an example of how you would use our proxy port with Python Requests.


import requests

# Route both HTTP and HTTPS traffic through the proxy port.
proxies = {
    "http": "http://scrapeops:YOUR_API_KEY@proxy.scrapeops.io:5353",
    "https": "http://scrapeops:YOUR_API_KEY@proxy.scrapeops.io:5353",
}

# verify=False stops requests verifying SSL certificates (see the note above).
response = requests.get('http://httpbin.org/ip', proxies=proxies, verify=False)
print(response.text)

Scrapy users can likewise simply pass the proxy details via the meta object.


# ...other scrapy setup code
start_urls = ['http://httpbin.org/ip']
meta = {
    "proxy": "http://scrapeops:YOUR_API_KEY@proxy.scrapeops.io:5353"
}

def start_requests(self):
    # Attach the proxy details to every outgoing request via meta
    for url in self.start_urls:
        yield scrapy.Request(url, callback=self.parse, meta=self.meta)

def parse(self, response):
    # ...your parsing logic here
    pass

Scrapy & SSL Certificate Verification

Note: Scrapy skips SSL verification by default so you don't need to worry about switching it off.


Response Formats

The ScrapeOps Proxy Aggregator offers two possible formats:

  1. Target Server Response (Default)
  2. JSON Response

Target Server Response (Default)

The default response from our Proxy API endpoint and Proxy Port is the response returned by the target URL you request.

This response could be in HTML, JSON, XML, etc. format, depending on the response returned by the website's server.

Example response:


<html>
<head>
...
</head>
<body>
...
</body>
</html>

The response will contain the body (HTML, etc.) and any response headers. (Note: cookies aren't returned.)

JSON Response

If you add the parameter json_response=true to your request, then the proxy will return an extended JSON response with additional information about the request and response.

You can use this functionality when you would like to access additional response information such as cookies and XHR requests/responses.

The following parameters are returned:

  • successful: Boolean value indicating if the request was successful or not. true if successful.
  • body: The HTML, JSON, XML, etc. response from the target website.
  • url: The requested URL.
  • status_code: The ScrapeOps status code.
  • sops_api_credits: The number of ScrapeOps API credits consumed for the request.
  • content_type: The content type of the website's response.
  • headers: Any headers returned with the server response.
  • cookies: Any cookies returned with the server response.
  • xhr: An array of the XHR requests/responses made by the headless browser when making the request. Only populated when render_js=true is enabled.

The following is an example response:


{
    "successful": true,
    "url": "https://www.example.com/",
    "content_type": "text/html;charset=UTF-8",
    "sops_api_credits": 1,
    "status_code": 200,
    "headers": {
        "Accept-Ch": "ect,rtt,downlink,device-memory,sec-ch-device-memory,viewport-width,sec-ch-viewport-width,dpr,sec-ch-dpr,sec-ch-ua-platform,sec-ch-ua-platform-version",
        "Content-Language": "en-GB",
        "Content-Security-Policy": "upgrade-insecure-requests;report-uri https://metrics.media-amazon.com/",
        "Content-Type": "text/html;charset=UTF-8",
        "X-Amz-Rid": "C347MNWAT7SE9XS4MJCN",
        "X-Cache": "Miss from cloudfront",
        "X-Content-Type-Options": "nosniff",
        "X-Frame-Options": "SAMEORIGIN",
        "X-Ua-Compatible": "IE=edge",
        "X-Xss-Protection": "1;"
    },
    "xhr": null,
    "cookies": [
        {
            "session-id": "258-3775233-4286404;"
        }
    ],
    "body": "<html><head>...</head><body>...</body></html>"
}
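
For instance, building on the earlier Python example, you could enable json_response=true and read the extra fields like this (a minimal sketch using the documented response keys):


import requests
from urllib.parse import urlencode

proxy_params = {
    'api_key': 'YOUR_API_KEY',
    'url': 'http://httpbin.org/ip',
    'json_response': 'true',
}

response = requests.get(
    url='https://proxy.scrapeops.io/v1/',
    params=urlencode(proxy_params),
    timeout=120,
)

data = response.json()
print('Successful: ', data['successful'])
print('Credits used: ', data['sops_api_credits'])
print('Cookies: ', data['cookies'])
html = data['body']  # the target site's response body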


Status Codes

The ScrapeOps Proxy API will return a 200 status code when it successfully gets a response from the website that also passes response validation, or a 404 status code if the website responds with a 404 status code. Both of these status codes are considered successful requests.

Here is the full list of status codes the Proxy API returns.
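
In practice, that means treating both 200 and 404 as final results and only retrying on other status codes. Here is a minimal sketch (the retry count is illustrative):


import requests
from urllib.parse import urlencode

def fetch(target_url, retries=3):
    # 200 and 404 both count as successful responses from the Proxy API,
    # so only retry when any other status code comes back.
    params = urlencode({'api_key': 'YOUR_API_KEY', 'url': target_url})
    response = None
    for _ in range(retries):
        response = requests.get(
            'https://proxy.scrapeops.io/v1/', params=params, timeout=120)
        if response.status_code in (200, 404):
            break
    return response

print(fetch('http://httpbin.org/ip').status_code)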


Request Optimization

Certain domains are very hard to scrape and require you to use more advanced/expensive functionality to scrape them reliably at scale.

The ScrapeOps Proxy API provides automatic Request Optimization functionality that, when enabled, tells the API to find the optimal request settings for your target domain.

Instead of you having to decide which features and proxies to use, the ScrapeOps Proxy API will enable/disable the relevant advanced features for you to give you the best performance at the lowest cost.

To enable Request Optimization, simply add optimize_request=true to your request and the Proxy API will take care of the rest.


curl -k "https://proxy.scrapeops.io/v1/?api_key=YOUR_API_KEY&url=http://httpbin.org/anything&optimize_request=true"

For more details on how Request Optimization works then check out the documentation here.
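
If you also want to cap how much an optimized request can cost, the max_request_cost parameter (documented below) can be combined with optimize_request. A sketch in Python:


import requests
from urllib.parse import urlencode

proxy_params = {
    'api_key': 'YOUR_API_KEY',
    'url': 'http://httpbin.org/anything',
    'optimize_request': 'true',
    'max_request_cost': 30,  # cap this request at 30 API credits
}

response = requests.get(
    url='https://proxy.scrapeops.io/v1/',
    params=urlencode(proxy_params),
    timeout=120,
)

print(response.status_code)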


Advanced Functionality

To manually enable other API functionality when using the Proxy API endpoint you need to add the appropriate query parameters to the ScrapeOps Proxy URL.

For example, if you want to enable Javascript rendering with a request, then add render_js=true to the request:


curl -k "https://proxy.scrapeops.io/v1/?api_key=YOUR_API_KEY&url=http://httpbin.org/anything&render_js=true"

The API will accept the following parameters:

  • json_response: Return an extended JSON response with additional information about the request and response, such as cookies and XHR requests/responses. Example: json_response=true. More info in the Response Formats section above.
  • optimize_request: Request with request optimization enabled. Example: optimize_request=true
  • max_request_cost: Used in conjunction with optimize_request to set the maximum number of API credits a request can use. Example: max_request_cost=30
  • bypass: Request with anti-bot bypass enabled. List of bypasses. Example: bypass=cloudflare_level_1
  • auto_extract: Use maintained parsers to automatically extract data from HTML and return data in JSON format. List of parsers. Example: auto_extract=amazon
  • render_js: Request with Javascript rendering enabled. Example: render_js=true
  • wait: Tell the headless browser to wait a specific period of time before returning the response. Example: wait=3000
  • wait_for: Tell the headless browser to wait for a specific page element to appear before returning the response. Example: wait_for=.loading-done
  • scroll: Tell the headless browser to scroll the page down a defined number of pixels before returning the response. Example: scroll=5000
  • js_scenario: Send a sequence of commands to a headless browser before returning the response. Examples
  • premium: Request using premium proxy pools. Example: premium=true
  • residential: Request using residential proxy pools. Example: residential=true
  • mobile: Request using mobile proxy pools. Example: mobile=true
  • country: Make requests from a specific country. Example: country=us
  • keep_headers: Use your own custom headers when making the request. Example: keep_headers=true
  • device_type: Tell the API to use desktop vs mobile user-agents when making requests. Default is desktop. Example: device_type=mobile
  • session_number: Enable sticky sessions that use the same IP address for multiple requests by setting a session_number. Example: session_number=7
  • follow_redirects: Tell the API not to follow redirects by setting follow_redirects=false.
  • initial_status_code: Tell the API to return the initial status code the website responds with in the headers by setting initial_status_code=true.
  • final_status_code: Tell the API to return the final status code the website responds with in the headers by setting final_status_code=true.
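
Several of these parameters can be combined in a single request. For example, to render a Javascript-heavy page from a US IP address and wait for a specific element to appear:


import requests
from urllib.parse import urlencode

proxy_params = {
    'api_key': 'YOUR_API_KEY',
    'url': 'http://httpbin.org/anything',
    'render_js': 'true',          # enable Javascript rendering
    'country': 'us',              # route the request through a US IP
    'wait_for': '.loading-done',  # wait for this element before returning
}

response = requests.get(
    url='https://proxy.scrapeops.io/v1/',
    params=urlencode(proxy_params),
    timeout=120,
)

print(response.status_code)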

Check out this guide to see the full list of advanced functionality available.


Timeout

The ScrapeOps proxy keeps retrying a request for up to 2 minutes before returning a failed response to you.

To use the Proxy correctly, you should set the timeout on your requests to at least 2 minutes, so that you aren't charged for successful requests that timed out on your end before the Proxy API responded.