Getting Started
ScrapeOps Proxy API Aggregator is an easy-to-use proxy that gives you access to the best-performing proxies via a single endpoint. We take care of finding the best proxies, so you can focus on the data.
Authorisation - API Key
To use the ScrapeOps proxy, you first need an API key which you can get by signing up for a free account here.
Your API key must be included with every request using the api_key query parameter, otherwise the API will return a 403 Forbidden status code.
Integration Method 1 - API Endpoint
To make requests, send the URL you want to scrape to the ScrapeOps Proxy endpoint https://proxy.scrapeops.io/v1/ by adding your API key and target URL to the request using the api_key and url query parameters:
curl -k "https://proxy.scrapeops.io/v1/?api_key=YOUR_API_KEY&url=http://httpbin.org/anything"
The ScrapeOps Proxy supports GET and POST requests. For information on how to use POST requests, check out the documentation here.
The following is some example Python code to use with the Proxy API:
import requests
from urllib.parse import urlencode

proxy_params = {
    'api_key': 'YOUR_API_KEY',
    'url': 'https://httpbin.org/ip',
    'render_js': 'true',
}

# urlencode encodes the target URL along with the other parameters.
response = requests.get(
    url='https://proxy.scrapeops.io/v1/',
    params=urlencode(proxy_params),
    timeout=120,
)

print('Body: ', response.content)
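POST requests follow the same pattern. The exact behaviour is covered in the POST documentation linked above, but as a rough sketch (assuming the proxy forwards your request body to the target URL as-is):

import requests

# Hedged sketch: assumes the request body is forwarded to the target URL.
# See the POST documentation linked above for the exact behaviour.
proxy_params = {
    'api_key': 'YOUR_API_KEY',
    'url': 'https://httpbin.org/post',
}
response = requests.post(
    url='https://proxy.scrapeops.io/v1/',
    params=proxy_params,
    json={'example_field': 'example_value'},
    timeout=120,
)
print(response.content)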
ScrapeOps will take care of the proxy selection and rotation for you so you just need to send us the URL you want to scrape.
When using the ScrapeOps Proxy API Aggregator API integration method, you should always encode your target URL.
If you send an unencoded URL that contains query parameters, the API may interpret those query parameters as being meant for the API itself rather than as part of your target URL.
Here is documentation on how to encode URLs in various programming languages.
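In Python, for example, urllib.parse.quote can be used to encode the target URL before it is added to the proxy URL:

from urllib.parse import quote

target_url = 'http://httpbin.org/anything?foo=bar&baz=qux'

# Percent-encode the target URL so its own query parameters aren't
# mistaken for parameters meant for the Proxy API.
proxy_url = (
    'https://proxy.scrapeops.io/v1/'
    '?api_key=YOUR_API_KEY'
    '&url=' + quote(target_url, safe='')
)
print(proxy_url)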
Integration Method 2 - Proxy Port
For those of you with existing proxy pools, we offer an easy-to-use proxy port solution that takes your requests and passes them through to the API endpoint, which then looks after proxy rotation, captchas, and retries.
The proxy port is a light front-end for the API and has all the same functionality and performance as sending requests to the API endpoint.
The username for the proxy is scrapeops and the password is your API key.
curl -k -x "http://scrapeops:YOUR_API_KEY@proxy.scrapeops.io:5353" "https://httpbin.org/ip"
Here are the individual connection details:
- Proxy: proxy.scrapeops.io
- Port: 5353
- Username: scrapeops
- Password: YOUR_API_KEY
Note: So that we can properly direct your requests through the API, your code must be configured to not verify SSL certificates.
To enable extra/advanced functionality, you can pass parameters by adding them to the username, separated by periods.
For example, if you want to enable Javascript rendering with a request, the username would be scrapeops.render_js=true.
Multiple parameters can be included by separating them with periods. For example, to make requests from US-based proxies with Javascript rendering enabled:
curl -k -x "http://scrapeops.country=us.render_js=true:YOUR_API_KEY@proxy.scrapeops.io:5353" "https://httpbin.org/ip"
Below we have an example of how you would use our proxy port with Python Requests.
import requests

# Route both http and https traffic through the proxy port.
proxies = {
    "http": "http://scrapeops:YOUR_API_KEY@proxy.scrapeops.io:5353",
    "https": "http://scrapeops:YOUR_API_KEY@proxy.scrapeops.io:5353",
}
response = requests.get('https://httpbin.org/ip', proxies=proxies, verify=False)
print(response.text)
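One side effect of verify=False is that Requests will emit an InsecureRequestWarning on every call. This is standard urllib3 behaviour, and you can silence it if the noise is unwanted:

import urllib3

# Suppress the InsecureRequestWarning triggered by verify=False.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)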
Scrapy users can likewise simply pass the proxy details via the meta object.
import scrapy

class IpSpider(scrapy.Spider):
    name = 'ip'  # ...other scrapy setup code
    start_urls = ['https://httpbin.org/ip']
    meta = {"proxy": "http://scrapeops:YOUR_API_KEY@proxy.scrapeops.io:5353"}

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, meta=self.meta)

    def parse(self, response):
        # ...your parsing logic here
        pass
Note: Scrapy skips SSL verification by default so you don't need to worry about switching it off.
Response Formats
The ScrapeOps Proxy API Aggregator offers two possible formats:
Target Server Response (Default)
The default response from our Proxy API endpoint and Proxy Port is the response returned by the target URL you request.
This response could be in HTML, JSON, XML, etc. format depending on the response returned by the website's server.
Example response:
<html>
<head>
...
</head>
<body>
...
</body>
</html>
The response will contain the body (HTML, JSON, etc.) returned by the target website along with any response headers. (Note: cookies aren't returned in this format.)
JSON Response
If you add the parameter json_response=true to your request, the proxy will return an extended JSON response with additional information about the request and response.
You can use this functionality when you would like to access additional response information such as cookies and XHR requests/responses.
The JSON response contains the following fields:
Key | Description |
---|---|
successful | Boolean value indicating if the request was successful or not. true if successful. |
body | The HTML, JSON, XML, etc. response from the target website. |
url | The requested URL. |
status_code | The ScrapeOps status code. |
sops_api_credits | The number of ScrapeOps API credits consumed for the request. |
content_type | The content type of the website's response. |
headers | Any headers returned with the server response. |
cookies | Any cookies returned with the server response. |
xhr | An array of the XHR requests/responses made by the headless browser when making the request. Only works when render_js=true is enabled. |
The following is an example response:
{
"successful": true,
"url": "https://www.example.com/",
"content_type": "text/html;charset=UTF-8",
"sops_api_credits": 1,
"status_code": 200,
"headers": {
"Accept-Ch": "ect,rtt,downlink,device-memory,sec-ch-device-memory,viewport-width,sec-ch-viewport-width,dpr,sec-ch-dpr,sec-ch-ua-platform,sec-ch-ua-platform-version",
"Content-Language": "en-GB",
"Content-Security-Policy": "upgrade-insecure-requests;report-uri https://metrics.media-amazon.com/",
"Content-Type": "text/html;charset=UTF-8",
"X-Amz-Rid": "C347MNWAT7SE9XS4MJCN",
"X-Cache": "Miss from cloudfront",
"X-Content-Type-Options": "nosniff",
"X-Frame-Options": "SAMEORIGIN",
"X-Ua-Compatible": "IE=edge",
"X-Xss-Protection": "1;"
},
"xhr": null,
"cookies": [
{
"session-id": "258-3775233-4286404;"
}
],
"body": "<html><head>...</head><body>...</body></html>",
}
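For illustration, here is how you might request the extended JSON response with Python Requests and read the fields described above:

import requests

proxy_params = {
    'api_key': 'YOUR_API_KEY',
    'url': 'https://httpbin.org/ip',
    'json_response': 'true',
}
response = requests.get(
    url='https://proxy.scrapeops.io/v1/',
    params=proxy_params,
    timeout=120,
)

data = response.json()
print('Successful: ', data['successful'])
print('Status code: ', data['status_code'])
print('Cookies: ', data['cookies'])
print('Body: ', data['body'])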
Status Codes
The ScrapeOps Proxy API Aggregator will return a 200 status code when it successfully gets a response from the website that also passes response validation, or a 404 status code if the website responds with a 404 status code. Both of these status codes are considered successful requests.
Here is the full list of status codes the Proxy API returns.
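Since both 200 and 404 count as successful requests, a common client-side pattern is to treat those as final and handle everything else as a failure. A minimal sketch, where process_response and handle_failure stand in for your own hypothetical handlers:

import requests

response = requests.get(
    url='https://proxy.scrapeops.io/v1/',
    params={'api_key': 'YOUR_API_KEY', 'url': 'http://httpbin.org/anything'},
    timeout=120,
)

if response.status_code in (200, 404):
    process_response(response)  # hypothetical: your own parsing logic
else:
    handle_failure(response)  # hypothetical: log or queue for investigation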
Request Optimization
Certain domains are very hard to scrape and require you to use more advanced/expensive functionality to scrape them reliably at scale.
The ScrapeOps Proxy API Aggregator provides automatic Request Optimization functionality that, when enabled, tells the API to find the optimal request settings to give you the best performance at the lowest cost.
Instead of you having to decide which features and proxies to use, the API will enable/disable the following features for you:
- JS Rendering
- Country Geotargeting
- Premium Proxies
- Residential Proxies
- Mobile Proxies
- Anti-Bot Bypasses
- Waits
To enable Request Optimization, simply add optimize_request=true to your request and the Proxy API will take care of the rest.
curl -k "https://proxy.scrapeops.io/v1/?api_key=YOUR_API_KEY&url=http://httpbin.org/anything&optimize_request=true"
For more details on how Request Optimization works, check out the documentation here.
Advanced Functionality
To manually enable other API functionality when using the Proxy API endpoint you need to add the appropriate query parameters to the ScrapeOps Proxy URL.
For example, if you want to enable Javascript rendering with a request, add render_js=true to the request:
curl -k "https://proxy.scrapeops.io/v1/?api_key=YOUR_API_KEY&url=http://httpbin.org/anything&render_js=true"
The API will accept the following parameters:
Parameter | Description |
---|---|
json_response | Return an extended JSON response with additional information about the request and response such as cookies and XHR requests/responses. Example: json_response=true . More info on response formats |
optimize_request | Request with request optimization enabled. Example: optimize_request=true |
max_request_cost | Used in conjunction with optimize_request to set the maximum number of API credits a request can use. Example: max_request_cost=30 |
bypass | Request with anti-bot bypass enabled. List of bypasses. Example: bypass=cloudflare_level_1 |
auto_extract | Use maintained parsers to automatically extract data from HTML and return data in JSON format. List of parsers. Example: auto_extract=amazon |
render_js | Request with Javascript rendering enabled. Example: render_js=true |
wait | Tell the headless browser to wait a specific period of time before returning the response. Example: wait=3000 |
wait_for | Tell the headless browser to wait for a specific page element to appear before returning the response. Example: wait_for=.loading-done |
scroll | Tell headless browser to scroll the page down a defined number of pixels before returning the response. Example: scroll=5000 |
screenshot | Tell the headless browser to take a screenshot of the rendered page. The screenshot will be returned in a base64 encoded string. Parameters needed: screenshot=true&render_js=true&json_response=true . |
js_scenario | Send a sequence of commands to a headless browser before returning the response. Examples |
premium | Request using premium proxy pools. Example: premium=true |
residential | Request using residential proxy pools. Example: residential=true |
mobile | Request using mobile proxy pools. Example: mobile=true |
country | Make requests from specific country. Example: country=us |
keep_headers | Use your own custom headers when making the request. Example: keep_headers=true |
device_type | Tell API to use desktop vs mobile user-agents when making requests. Default is desktop . Example: device_type=mobile |
session_number | Enable sticky sessions that use the same IP address for multiple requests by setting a session_number . Example: session_number=7 |
follow_redirects | Tell the API not to follow redirects by setting follow_redirects=false. |
initial_status_code | Tell the API to return the initial status code the website responds with in the headers by setting initial_status_code=true. |
final_status_code | Tell the API to return the final status code the website responds with in the headers by setting final_status_code=true. |
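As a sketch of how several of these parameters combine in practice (the values here are purely illustrative):

import requests

proxy_params = {
    'api_key': 'YOUR_API_KEY',
    'url': 'http://httpbin.org/anything',
    'render_js': 'true',  # render the page in a headless browser
    'country': 'us',      # make the request from a US proxy
    'wait': '3000',       # wait before returning the response
}
response = requests.get(
    url='https://proxy.scrapeops.io/v1/',
    params=proxy_params,
    timeout=120,
)
print(response.content)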
Check out this guide to see the full list of advanced functionality available.
Timeout
The ScrapeOps proxy keeps retrying a request for up to 2 minutes before returning a failed response to you.
To use the Proxy correctly, you should set the timeout on your request to at least 2 minutes to avoid being charged for a successful request that timed out on your end before the Proxy API responded.
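In Python Requests, that simply means keeping timeout=120 (or higher) on every proxy call, as in the earlier examples:

import requests

# Give the proxy the full 2 minutes to retry before giving up client-side.
response = requests.get(
    url='https://proxy.scrapeops.io/v1/',
    params={'api_key': 'YOUR_API_KEY', 'url': 'http://httpbin.org/anything'},
    timeout=120,
)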