
Scrape.do: Web Scraping Integration Guide

Scrape.do is a powerful platform that simplifies web scraping by offering a seamless API integration for extracting data from websites, without the hassle of dealing with complex setups or getting blocked.

In this guide, we'll walk you through how to integrate Scrape.do into your projects, enabling you to scrape data effortlessly while maintaining compliance and performance.


TLDR: Web Scraping With Scrape.do?

Starting with Scrape.do is pretty easy. The function below will get you started. To customize your proxy, you can check out the additional params in their docs.

from urllib.parse import urlencode

API_KEY = "YOUR_TOKEN"  # your Scrape.do API key

def get_scrapedo_url(url):
    payload = {
        "token": API_KEY,
        "url": url,
    }
    proxy_url = "https://api.scrape.do/?" + urlencode(payload)
    return proxy_url
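
For example, you can plug the helper straight into a normal Requests call. This is just a quick usage sketch; the target URL is a placeholder, and it assumes the get_scrapedo_url() function above is in scope.

import requests

# route a normal GET through Scrape.do using the helper above
response = requests.get(get_scrapedo_url("https://quotes.toscrape.com/"))
print(response.status_code)
print(response.text[:500])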

What Is Scrape.do?

According to their landing page, Scrape.do is a "Rotating Proxy & Web Scraping API". This means that Scrape.do rotates between different proxies to assist in Web Scraping, much like the ScrapeOps Proxy Aggregator.

However, their providers are not listed on their site. Their use case is pretty similar to that of ScrapeOps, and the process is outlined below.

Scrape.do Homepage

Whenever we use a proxy provider, the process goes as follows.

  1. We send our url and our api_key to the proxy service.
  2. The provider attempts to get our url through one of their servers.
  3. The provider receives the response from the site.
  4. The provider sends the response back to us.

During a scrape like this, the proxy server can route our requests through multiple IP addresses. This makes our requests appear to come from many different sources, as if each request were made by a different user. When you use any scraping API, all of the following are true.

  • You tell the API which site you want to access.
  • Their servers access the site for you.
  • You scrape your desired site(s).

How Does Scrape.do API Work?

Each time we talk to Scrape.do, we need our API key. Along with the API key, there is a list of other parameters we can use to customize the request. We package the API key and any other parameters into a single request.

The table below contains a list of parameters commonly sent to Scrape.do.

| Parameter | Description |
| --- | --- |
| token (required) | Your Scrape.do API key (string) |
| url (required) | The URL you'd like to scrape (string) |
| super | Use residential and mobile IP addresses (boolean) |
| geoCode | Route the request through a specific country (string) |
| regionalGeoCode | Route the request through a specific continent (string) |
| sessionId | ID used for sticky sessions (integer) |
| customHeaders | Send custom headers in the request (bool) |
| extraHeaders | Change header values or add new headers over existing ones (bool) |
| forwardHeaders | Forward your own headers to the website (bool) |
| setCookies | Set custom cookies for the site (string) |
| disableRedirection | Disable redirects to other pages (bool) |
| callback | Send results to a specific domain/address via webhook (string) |
| timeout | Maximum time for a request (integer) |
| retryTimeout | Maximum time for a retry (integer) |
| disableRetry | Disable retry logic for your request (bool) |
| device | Device you'd like to use (string, desktop by default) |
| render | Render the content via a browser (bool, false by default) |
| waitUntil | Wait until a certain condition (string, domcontentloaded by default) |
| customWait | Wait an arbitrary amount of time (integer, 0 by default) |
| waitSelector | Wait for a CSS selector to appear on the page |
| width | Width of the browser in pixels (integer, 1920 by default) |
| height | Height of the browser in pixels (integer, 1080 by default) |
| blockResources | Block CSS and images from loading (boolean, true by default) |
| screenShot | Take a screenshot of the visible page (boolean, false by default) |
| fullScreenShot | Take a full screenshot of the page (boolean, false by default) |
| particularScreenShot | Take a screenshot of a certain location on the page (string) |
| playWithBrowser | Execute actions using the browser: scroll, click, etc. (string) |
| output | Return output in either raw HTML or Markdown (string, raw by default) |
| transparentResponse | Return only the target page (bool, false by default) |

Here is an example of a request with the Scrape.do API.

import requests

token = "YOUR_TOKEN"
targetUrl = "https://httpbin.co/anything"

url = "https://api.scrape.do"

payload = {
    "token": token,
    "url": targetUrl
}

response = requests.get(url, params=payload)
print(response.text)

Response Format

We can change our response format from HTML to JSON using the returnJSON parameter. returnJSON requires us to use "render": True.

import requests

token = "YOUR_TOKEN"
targetUrl = "https://httpbin.co/anything"

url = "https://api.scrape.do"

payload = {
    "token": token,
    "url": targetUrl,
    "render": True,
    "returnJSON": True
}

response = requests.get(url, params=payload)
print(response.text)
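
Since the body is now JSON instead of raw HTML, you can parse it directly. This is just a hedged inspection sketch that reuses the response object from the snippet above; it doesn't assume any particular field names in Scrape.do's JSON.

# parse the JSON body returned when returnJSON is enabled
data = response.json()

# inspect the top-level structure before relying on specific fields
if isinstance(data, dict):
    print(list(data.keys()))
else:
    print(type(data))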

We can also change our response format to Markdown with the output parameter.

import requests

token = "YOUR_TOKEN"
targetUrl = "https://httpbin.co/anything"

url = "https://api.scrape.do"

payload = {
    "token": token,
    "url": targetUrl,
    "output": "markdown"
}

response = requests.get(url, params=payload)
print(response.text)

Scrape.do API Pricing

Scrape.do has a smaller selection of plans than most proxy providers. Their lowest tier plan is Hobby at $29 per month. Their largest plan is Business at $249 per month. Anything beyond Business requires a custom plan and you need to contact them directly to work it out.

| Plan | API Credits per Month | Price per Month |
| --- | --- | --- |
| Hobby | 250,000 | $29 |
| Pro | 1,250,000 | $99 |
| Business | 3,500,000 | $249 |
| Custom | N/A | $249+ |

With each of these plans, you only pay for successful requests. If the API fails to get your page, you pay nothing. Each plan also includes the following:

  • Concurrency (limits vary based on plan)
  • Datacenter Proxies
  • Sticky Sessions
  • Unlimited Bandwidth
  • Email Support

As the price increases, you get additional features on top of the benefits listed above.

Response Status Codes

When using their API, there are numerous status codes we might get back. 200 is the one we want.

| Status Code | Type | Possible Causes/Solutions |
| --- | --- | --- |
| 200 | Success | It worked! |
| 400 | Bad Request | The request was invalid or malformed |
| 401 | Account Issue | No API credits or account suspended |
| 404 | Not Found | Site not found, page not found |
| 429 | Too Many Requests | Concurrency limit exceeded |
| 500 | Internal Server Error | Context cancelled, unknown error |
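
How you react to these codes is up to you, but a minimal sketch might look something like this. It assumes the get_scrapedo_url() helper from the TLDR section is defined; the backoff strategy is just a suggestion.

import time
import requests

# rough sketch of handling the status codes above;
# get_scrapedo_url() is the helper from the TLDR section
def fetch_with_retries(url, max_retries=3):
    for attempt in range(max_retries):
        response = requests.get(get_scrapedo_url(url))
        if response.status_code == 200:
            return response
        if response.status_code == 429:
            # concurrency limit exceeded: back off, then try again
            time.sleep(2 ** attempt)
            continue
        if response.status_code in (400, 401):
            # bad request or account issue: retrying won't help
            raise Exception(f"Request failed with status {response.status_code}")
    raise Exception(f"Gave up after {max_retries} attempts")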

Setting Up Scrape.do API

To actually get started, we need to create an account and obtain an API key. We can sign up using any of the following options.

  • Create an account with Google
  • Create an account with Github
  • Create an account with LinkedIn
  • Create an account with an email address and password

Signup

After generating an account, you can view the dashboard. The dashboard contains information about your plan and some analytics tools toward the bottom of the page.

Dashboard

Unlike ScrapeOps and some of the other sites we've explored, Scrape.do does not appear to have a request builder or generator anywhere on their site.

While taking the dashboard screenshot, I exposed my API key. This actually isn't a big deal. As any good API service should, Scrape.do gives us the ability to change our API key. To update your key, you need to enter your password and complete a CAPTCHA.


Once you've got your API key, you're all set to start using the Scrape.do API.

API Endpoint Integration

When dealing with the Scrape.do API, we don't have to worry about any custom endpoints, which means there's less to keep track of. Instead of feature-specific endpoints, we use the apex domain scrape.do with the api subdomain.

All of our requests go to the same place: "https://api.scrape.do".

Proxy Port Integration

Proxy Port Integration is a great tool for beginners and people who just want to set the proxy and forget it. With Proxy Port Integration, you can tell Requests to use a specific proxy and then just worry about coding as normal.

Scrape.do requires that we set verify to False. This way, we don't have to worry about our HTTP client rejecting the Scrape.do CA certificate. All requests go through port 8080.

http://YOUR-API-KEY:@proxy.scrape.do:8080

Below is an example of how to do this using Python Requests.

import requests
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

url = "https://httpbin.co/anything"
token = "YOUR_TOKEN"

proxyModeUrl = f"http://{token}:@proxy.scrape.do:8080"

proxies = {
    "http": proxyModeUrl,
    "https": proxyModeUrl,
}

response = requests.get(url, proxies=proxies, verify=False)

print(response.text)

When you set your proxy like the example above, you can just continue coding like normal; you don't have to worry about custom proxy settings or finer-grained control.
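
For instance, here's a hedged sketch of the "set it and forget it" approach using a requests.Session (quotes.toscrape.com is just a sample target):

import requests
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

token = "YOUR_TOKEN"
proxy_mode_url = f"http://{token}:@proxy.scrape.do:8080"

# configure the proxy once on a Session, then scrape as if the proxy weren't there
session = requests.Session()
session.proxies = {"http": proxy_mode_url, "https": proxy_mode_url}
session.verify = False

for page in range(1, 4):
    response = session.get(f"https://quotes.toscrape.com/page/{page}/")
    print(page, response.status_code)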

Managing Concurrency

With Scrape.do, we're given at least some concurrency with each plan. Concurrency allows us to make multiple requests simultaneously.

For example, we could send a request to https://quotes.toscrape.com/page/1/ and while we're still awaiting that request, we can send another one to https://quotes.toscrape.com/page/2/.

Even on the free trial we get a concurrency limit of 5. This is pretty generous.


import requests
from bs4 import BeautifulSoup
import concurrent.futures
from urllib.parse import urlencode

API_KEY = 'YOUR_API_KEY'
NUM_THREADS = 5

def get_proxy_url(url):
    payload = {"token": API_KEY, "url": url}
    proxy_url = 'https://api.scrape.do/?' + urlencode(payload)
    return proxy_url

## Example list of urls to scrape
list_of_urls = [
    "https://quotes.toscrape.com/page/1/",
    "https://quotes.toscrape.com/page/2/",
    "http://quotes.toscrape.com/page/3/",
]

output_data_list = []

def scrape_page(url):
    try:
        response = requests.get(get_proxy_url(url))
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, "html.parser")
            title = soup.find("h1").text

            ## add scraped data to "output_data_list" list
            output_data_list.append({
                'title': title,
            })

    except Exception as e:
        print('Error', e)


with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    executor.map(scrape_page, list_of_urls)

print(output_data_list)

Pay attention to executor.map(); this is where all of our concurrency happens.

  • The first argument is the function we want to call on each thread, scrape_page.
  • list_of_urls is the list of arguments we want to pass into each call of scrape_page.

Any additional arguments to the function are also passed in as iterables (one per argument), zipped together element-wise, as shown in the sketch below.
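
Here's a small sketch of that behavior using a hypothetical variant of scrape_page that takes a second argument; this is just how ThreadPoolExecutor.map() works in the standard library, not anything specific to Scrape.do.

import concurrent.futures

# hypothetical variant of scrape_page that takes a second argument
def scrape_page_with_retries(url, retries):
    print(f"Scraping {url} with {retries} retries")

urls = [
    "https://quotes.toscrape.com/page/1/",
    "https://quotes.toscrape.com/page/2/",
]
retry_counts = [3, 3]

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    # executor.map() zips the iterables together: (urls[0], retry_counts[0]), ...
    executor.map(scrape_page_with_retries, urls, retry_counts)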


Advanced Functionality

Now, we're going to delve into the advanced functionality of Scrape.do.

As you saw in one of the tables earlier, Scrape.do offers a ton of advanced options we can use to customize our scrape, everything from custom countries and IP addresses to full-blown browser control through the API.

You already saw a breakdown of their advanced functionality earlier here. The pricing for these features is rather simple. You can see a credit breakdown below.

| Request Type | API Credits |
| --- | --- |
| Normal Request (Datacenter IP) | 1 |
| Normal + Headless Browser | 5 |
| Residential + Mobile (Super) | 10 |
| Super + Headless Browser | 25 |
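
To get a feel for how these multipliers interact with the plans above, here's a small helper of my own (not part of the API) that estimates credit usage from the table:

def estimate_credits(request_count, super_proxy=False, render=False):
    """Rough per-request credit cost, taken from the table above."""
    if super_proxy and render:
        cost = 25
    elif super_proxy:
        cost = 10
    elif render:
        cost = 5
    else:
        cost = 1
    return request_count * cost

# 10,000 headless-browser requests through residential/mobile IPs
print(estimate_credits(10_000, super_proxy=True, render=True))  # 250,000 credits

At 25 credits apiece, those 10,000 Super + Headless requests would consume a full Hobby plan's 250,000 monthly credits.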

If you need to review the actual requests, you can see the table again below.

| Parameter | Description |
| --- | --- |
| token (required) | Your Scrape.do API key (string) |
| url (required) | The URL you'd like to scrape (string) |
| super | Use residential and mobile IP addresses (boolean) |
| geoCode | Route the request through a specific country (string) |
| regionalGeoCode | Route the request through a specific continent (string) |
| sessionId | ID used for sticky sessions (integer) |
| customHeaders | Send custom headers in the request (bool) |
| extraHeaders | Change header values or add new headers over existing ones (bool) |
| forwardHeaders | Forward your own headers to the website (bool) |
| setCookies | Set custom cookies for the site (string) |
| disableRedirection | Disable redirects to other pages (bool) |
| callback | Send results to a specific domain/address via webhook (string) |
| timeout | Maximum time for a request (integer) |
| retryTimeout | Maximum time for a retry (integer) |
| disableRetry | Disable retry logic for your request (bool) |
| device | Device you'd like to use (string, desktop by default) |
| render | Render the content via a browser (bool, false by default) |
| waitUntil | Wait until a certain condition (string, domcontentloaded by default) |
| customWait | Wait an arbitrary amount of time (integer, 0 by default) |
| waitSelector | Wait for a CSS selector to appear on the page |
| width | Width of the browser in pixels (integer, 1920 by default) |
| height | Height of the browser in pixels (integer, 1080 by default) |
| blockResources | Block CSS and images from loading (boolean, true by default) |
| screenShot | Take a screenshot of the visible page (boolean, false by default) |
| fullScreenShot | Take a full screenshot of the page (boolean, false by default) |
| particularScreenShot | Take a screenshot of a certain location on the page (string) |
| playWithBrowser | Execute actions using the browser: scroll, click, etc. (string) |
| output | Return output in either raw HTML or Markdown (string, raw by default) |
| transparentResponse | Return only the target page (bool, false by default) |

They also have a special pricing structure for certain sites that are more difficult to scrape. The breakdown for that is available on this page.

You can view their full API documentation here.


JavaScript Rendering

JavaScript Rendering is a functionality that allows web scraping tools or browsers to execute and render JavaScript code on a webpage before extracting its content.

Many modern websites rely heavily on JavaScript to load dynamic content, such as product listings, user-generated content, or ads. This means that simply scraping the static HTML of a webpage may not capture all the data, especially if the data is loaded asynchronously via JavaScript.

To render JavaScript using their headless browser, we can use the render param. When set to True, this tells Scrape.do that we want to run the browser and render JavaScript.

import requests

token = "YOUR_TOKEN"
targetUrl = "https://httpbin.co/anything"

url = "https://api.scrape.do"

payload = {
    "token": token,
    "url": targetUrl,
    "render": True
}

response = requests.get(url, params=payload)
print(response.text)

You can view the documentation for this here.

Controlling The Browser

To control the browser, we can use the playWithBrowser parameter. It tells Scrape.do that we want to execute browser actions on the page. We pass our browser actions in as a JSON array, and we need to set render to True.

import requests

url = "https://api.scrape.do"
params = {
    "render": True,
    "playWithBrowser": '[{"Action":"Click","Selector":"#html-page"}]',
    "token": "YOUR_TOKEN",
    "url": "https://httpbin.co/"
}

response = requests.get(url, params=params)

print(response.text)
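
Rather than hand-writing the JSON string, you can build the action list as Python objects and serialize it with json.dumps(). The Click action below simply mirrors the example above; any other action names are up to the Scrape.do docs.

import json
import requests

# build the action list as Python objects, then serialize it for the API
actions = [
    {"Action": "Click", "Selector": "#html-page"},
]

params = {
    "render": True,
    "playWithBrowser": json.dumps(actions),
    "token": "YOUR_TOKEN",
    "url": "https://httpbin.co/"
}

response = requests.get("https://api.scrape.do", params=params)
print(response.status_code)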

The browser control docs are available here.


Country Geotargeting

Country Geotargeting is a feature in proxy and web scraping services that allows users to access and extract data from websites as if they were located in a specific country.

By routing requests through IP addresses from different geographic locations, this functionality lets you appear as if you're browsing from a particular country, enabling access to location-specific content.

Country geotargeting is useful for extracting localized data, accessing geo-restricted content, and conducting region-specific analysis for marketing, pricing, and competitive insights.

Setting our geolocation with Scrape.do is extremely easy. We do it almost exactly the same way that we would with ScrapeOps. The major difference is that we can control our location by country or by continent.

To control our country, we use the geoCode parameter. This is only available with the Pro Plan or higher.

import requests

url = "https://api.scrape.do"
params = {
    "token": "YOUR_TOKEN",
    "url": "https://httpbin.co/",
    "geoCode": "us"
}

response = requests.get(url, params=params)

print(response.text)

Scrape.do supports a decent list of countries when using a Datacenter IP. If you're choosing a Super Proxy, the list is even bigger than this one!

| Country | Country Code |
| --- | --- |
| United States | "us" |
| Great Britain | "gb" |
| Germany | "de" |
| Turkey | "tr" |
| Russia | "ru" |
| France | "fr" |
| Israel | "il" |
| India | "in" |
| Brazil | "br" |
| Ukraine | "ua" |
| Pakistan | "pk" |
| Netherlands | "nl" |
| United Arab Emirates | "ae" |
| Saudi Arabia | "sa" |
| Mexico | "mx" |
| Egypt | "eg" |
| Slovakia | "sk" |
| Italy | "it" |
| Singapore | "sg" |
To control our location by continent, we use the regionalGeoCode parameter instead. Regional Geotargeting requires a Super Proxy.

import requests

url = "https://api.scrape.do"
params = {
    "token": "YOUR_TOKEN",
    "url": "https://httpbin.co/",
    "super": True,
    "regionalGeoCode": "northamerica"
}

response = requests.get(url, params=params)

print(response.text)

Here is their list of regional geocodes.

| Continent | Geocode |
| --- | --- |
| Europe | europe |
| Asia | asia |
| Africa | africa |
| Oceania | oceania |
| North America | northamerica |
| South America | southamerica |

You can view the full documentation for this here.


Residential Proxies

Residential proxies are proxy servers that use IP addresses assigned to real residential devices, such as computers, smartphones, or smart TVs, by Internet Service Providers (ISPs).

These proxies are tied to actual, physical locations and appear as normal, everyday users to the websites they access. This makes them highly reliable and difficult to detect as proxies, especially compared to data center proxies.

Residential proxies provide a high level of anonymity and credibility, as they mimic real user behavior by using genuine IPs from ISPs.

To use residential proxies with Scrape.do, we use the super param. super tells Scrape.do that we'd like to use the Residential & Mobile Proxy service.

Here's a code example of how to use them. If you do not set a geoCode (as seen in the geolocation examples above), your location will default to the US.

import requests

url = "https://api.scrape.do"
params = {
    "token": "YOUR_TOKEN",
    "url": "https://httpbin.co/",
    "super": True,
}

response = requests.get(url, params=params)

print(response.text)

You can view their full documentation for Super Proxies here.


Custom Headers

The custom header functionality in proxy APIs allows users to manually specify the HTTP request headers sent during web scraping or data collection tasks.

By default, proxy APIs manage these headers automatically, optimizing them for the best performance and minimizing detection. However, some proxy services give users the option to customize headers for specific needs, offering greater control over the data extraction process.

Why Use Custom Headers?

  • Target Specific Data: Some websites require specific headers (such as user-agent, authorization, or referrer) to access certain content or retrieve accurate data.
  • POST Requests: When sending POST requests, many websites expect certain headers like Content-Type or Accept. Custom headers ensure that your request is formatted correctly for the server to process.
  • Bypass Anti-Bot Systems: Custom headers can help trick anti-bot systems by mimicking real browsers or users. This can include rotating user-agent strings, referring URLs, or cookies to make requests appear more legitimate.

Word of Caution

  • Potential to Reduce Performance: Using static or improperly configured custom headers can make your requests appear automated, increasing the likelihood of detection and blocks. Proxy APIs are often better at dynamically adjusting headers for optimal performance.
  • Risk of Getting Blocked: For large-scale web scraping, sending the same custom headers repeatedly can raise red flags. You'll need a system to continuously rotate and clean headers to avoid being blocked.
  • Use Only When Necessary: In most cases, it's better to rely on the proxy service’s optimized headers unless you have a specific need. Custom headers should be used sparingly and strategically.

In summary, custom headers provide flexibility but should be used with caution to maintain proxy performance and avoid detection.

Custom headers are pretty easy to set. All we need to do is use the customHeaders parameter, which is a boolean. When we set customHeaders to True, Scrape.do knows that we want to use the custom headers we send along with the request.

import requests

url = "https://api.scrape.do"

params = {
    "token": "YOUR_TOKEN",
    "url": "https://httpbin.co/",
    "customHeaders": True,
}

headers = {
    "Test-Header-Key": "Test Header Value"
}

response = requests.get(url, params=params, headers=headers)

print(response.text)

Take a look at the docs here.


Static Proxies

Static proxies allow a user to maintain the same IP address for an extended period when making multiple requests.

Unlike rotating proxies, where the IP address changes with every request, a static proxy gives you consistent access to the same IP for a set duration, usually between a few minutes to hours, depending on the service.

Static Proxies are ideal for maintaining sessions over multiple requests. These are also called Sticky Sessions.

To use a sticky session, we use the sessionId param. In the example below, we set our sessionId to 1234. Scrape.do keeps track of the session via our API key, which ties the session's activity strictly to us.

import requests

url = "https://api.scrape.do"

params = {
    "token": "YOUR_TOKEN",
    "url": "https://httpbin.co/",
    "sessionId": 1234,
}

response = requests.get(url, params=params)

print(response.text)
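
As a quick sanity check, you could send two requests with the same sessionId and compare the IP each one reports. This sketch assumes httpbin.co/anything echoes the caller's IP in an "origin" field the way httpbin.org does.

import requests

url = "https://api.scrape.do"

params = {
    "token": "YOUR_TOKEN",
    "url": "https://httpbin.co/anything",
    "sessionId": 1234,
}

# two requests in the same sticky session should be routed through the same IP;
# "origin" is the field httpbin.org uses for the caller's IP (assumed here)
for attempt in range(2):
    data = requests.get(url, params=params).json()
    print(attempt, data.get("origin"))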

Screenshot Functionality

The screenshot functionality allows users to capture images of web pages at any given point during the scraping process. This feature takes a visual snapshot of the rendered web page, preserving the exact layout, content, and appearance as seen by a user.

We get several options for screenshots with Scrape.do. Each option requires us to set render and returnJSON to True, since a real browser is needed to take the screenshot.

Here is a standard screenshot, it uses screenShot.

import requests

url = "https://api.scrape.do"

params = {
    "token": "YOUR_TOKEN",
    "url": "https://httpbin.co/",
    "render": True,
    "screenShot": True,
    "returnJSON": True
}

response = requests.get(url, params=params)

print(response.text)

Here is fullScreenShot.

import requests

url = "https://api.scrape.do"

params = {
    "token": "YOUR_TOKEN",
    "url": "https://httpbin.co/",
    "render": True,
    "fullScreenShot": True,
    "returnJSON": True
}

response = requests.get(url, params=params)

print(response.text)

Here is particularScreenShot. We use this one to take a screenshot of a specific element on the page, identified by a CSS selector.

import requests

url = "https://api.scrape.do"

params = {
    "token": "YOUR_TOKEN",
    "url": "https://httpbin.co/",
    "render": True,
    "particularScreenShot": "h1",
    "returnJSON": True
}

response = requests.get(url, params=params)

print(response.text)

Auto Parsing

Auto parsing is a really cool feature in which the API actually tries to extract structured data from the site for you. However, Scrape.do does not support auto parsing of any kind.

ScrapeOps and a few other sites support Auto Parsing. You can view their auto parsing features on the links below.


Case Study: Using Scrape.do on IMDb Top 250 Movies

Now, we're going to scrape the top 250 movies from IMDb. Our scrapers will be pretty much identical. The major difference is the param used for the API key: with ScrapeOps, we use api_key; with Scrape.do, we use token. Pretty much everything else in the code remains the same.

Take a look at the snippets below, you'll notice the subtle difference between the proxy functions.

Here is the proxy function for ScrapeOps:

def get_scrapeops_url(url):
    payload = {
        "api_key": API_KEY,
        "url": url,
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url

Here is our proxy function for Scrape.do.

def get_scrapedo_url(url):
    payload = {
        "token": API_KEY,
        "url": url,
    }
    proxy_url = "https://api.scrape.do/?" + urlencode(payload)
    return proxy_url

The full ScrapeOps code is available for you below.

import os
import requests
from bs4 import BeautifulSoup
import json
import csv
import logging
from urllib.parse import urlencode
import concurrent.futures

## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["scrapeops_api_key"]

def get_scrapeops_url(url):
    payload = {
        "api_key": API_KEY,
        "url": url,
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url


def scrape_movies(url, location="us", retries=3):
    success = False
    tries = 0

    while not success and tries <= retries:
        response = requests.get(get_scrapeops_url(url))

        try:
            if response.status_code != 200:
                raise Exception(f"Failed response from server, status code: {response.status_code}")

            soup = BeautifulSoup(response.text, "html.parser")
            json_tag = soup.select_one("script[type='application/ld+json']")
            json_data = json.loads(json_tag.text)["itemListElement"]

            movie_list_length = 0
            movie_list = []

            for item in json_data:
                movie_list.append(item["item"])
            movie_list_length += len(json_data)

            print(f"Movie list length: {len(json_data)}")
            with open("scrapeops-top-250.json", "w") as file:
                json.dump(movie_list, file, indent=4)
            success = True
        except Exception as e:
            logger.error(f"Failed to process page: {e}, retries left: {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Failed to scrape page, MAX RETRIES {retries} EXCEEDED!!!")


if __name__ == "__main__":

    MAX_RETRIES = 3

    logger.info("Starting IMDB scrape")

    url = "https://www.imdb.com/chart/top/"

    scrape_movies(url, retries=MAX_RETRIES)

    logger.info("Scrape complete")

Here are the results from the ScrapeOps Proxy Aggregator. It took 6.159 seconds.

ScrapeOps Results

Here is the full Scrape.do code.

import os
import requests
from bs4 import BeautifulSoup
import json
import csv
import logging
from urllib.parse import urlencode
import concurrent.futures

## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["scrape_do_api_key"]

def get_scrapedo_url(url):
    payload = {
        "token": API_KEY,
        "url": url,
    }
    proxy_url = "https://api.scrape.do/?" + urlencode(payload)
    return proxy_url


def scrape_movies(url, location="us", retries=3):
    success = False
    tries = 0

    while not success and tries <= retries:
        response = requests.get(get_scrapedo_url(url))

        try:
            if response.status_code != 200:
                raise Exception(f"Failed response from server, status code: {response.status_code}")

            soup = BeautifulSoup(response.text, "html.parser")
            json_tag = soup.select_one("script[type='application/ld+json']")
            json_data = json.loads(json_tag.text)["itemListElement"]

            movie_list_length = 0
            movie_list = []

            for item in json_data:
                movie_list.append(item["item"])
            movie_list_length += len(json_data)

            print(f"Movie list length: {len(json_data)}")
            with open("scrape-do-top-250.json", "w") as file:
                json.dump(movie_list, file, indent=4)
            success = True
        except Exception as e:
            logger.error(f"Failed to process page: {e}, retries left: {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Failed to scrape page, MAX RETRIES {retries} EXCEEDED!!!")


if __name__ == "__main__":

    MAX_RETRIES = 3

    logger.info("Starting IMDB scrape")

    url = "https://www.imdb.com/chart/top/"

    scrape_movies(url, retries=MAX_RETRIES)

    logger.info("Scrape complete")

Go ahead and compare them to the results from Scrape.do, which took 5.256 seconds.

Scrape.do Results

Scrape.do was slightly faster than ScrapeOps: 6.159 - 5.256 = 0.903 seconds, or just under a second. Depending on your conditions (location, time of day, and internet connection), ScrapeOps could just as easily be faster. In fact, on another run an hour later, ScrapeOps clocked in at 5.12 seconds and Scrape.do came in at 6.424 seconds.

Here is Scrape.do's other run.

Scrape.do Second Run Results

Here is the other one for ScrapeOps.

Scrapeops Second Run Results
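
If you want to reproduce this comparison on your own connection, one simple option is to time the scrape_movies() call from either script above. This is just one way to measure it, not the exact harness used for the numbers here.

import time

# time a single run of the scrape_movies() function from the scripts above
start = time.time()
scrape_movies("https://www.imdb.com/chart/top/", retries=3)
print(f"Scrape took {time.time() - start:.3f} seconds")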


Alternative: ScrapeOps Proxy API Aggregator

ScrapeOps and Scrape.do offer some pretty similar products. However, ScrapeOps really shines with our variety of pricing plans. With Scrape.do, you're stuck with one of three options, or you need to contact them directly to work out a custom plan.

With ScrapeOps, you get to choose between 8 different plans ranging in price from $9 per month to $249 per month.

ScrapeOps Pricing

Not only does ScrapeOps offer plans comparable to those of Scrape.do, we offer a wider variety with a lower barrier to entry (starting at $9/month).


Troubleshooting

Issue #1: Request Timeouts

Sometimes our requests hang or take too long to come back. To fix this, we can set a custom timeout argument with Python Requests.

import requests

# 5 second timeout
response = requests.get("https://httpbin.org/get", timeout=5)

Issue #2: Handling CAPTCHAs

Proxies are supposed to get us past CAPTCHAs. If you're receiving CAPTCHAs, your scraper has already failed to appear human. However, this does sometimes happen in the wild. To get through CAPTCHAs, first retry your request. If that doesn't work, change your location and/or consider using a residential IP address.

If the solutions outlined above fail (they shouldn't), you can always use 2captcha. We have an excellent article on bypassing CAPTCHAs here.
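
One hedged way to implement that escalation is a loop that retries with progressively stronger settings. The parameter names come from the table earlier; the order (and the "gb" geoCode) is just a suggestion, and the CAPTCHA check is a crude text match.

import requests

API_URL = "https://api.scrape.do"

def fetch_with_escalation(token, target_url):
    # escalate from a plain retry, to a different location, to residential IPs
    attempts = [
        {},                 # 1. plain retry with default settings
        {"geoCode": "gb"},  # 2. route through a different country
        {"super": True},    # 3. switch to residential/mobile proxies
    ]
    for extra in attempts:
        params = {"token": token, "url": target_url, **extra}
        response = requests.get(API_URL, params=params)
        # crude CAPTCHA check: look for the word in the returned HTML
        if response.status_code == 200 and "captcha" not in response.text.lower():
            return response
    return None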

Issue #3: Invalid Response Data

Error codes are a common occurrence in all facets of web development. To handle error codes, we need to know why they're occurring. You can view the Scrape.do error codes here. The ScrapeOps error codes are available for review here.

In most cases, you need to double-check your parameters or make sure your bill is paid. Every once in a while, you may receive a different error code that you can look up in the links above.


Legal and Ethical Considerations

Web scraping is generally legal for public data. Private data is subject to numerous privacy laws and intellectual property policies. Public data is any data that is not gated behind a login.

You should also take into account the Terms and Conditions and the robots.txt of the site you're scraping. You can view these documents from IMDB below.

Consequences of Misuse

Violating either the terms of service or privacy policies of a website can lead to several consequences:

  • Account Suspension or IP Blocking: Scraping websites without regard for their policies often leads to being blocked from accessing the site. For authenticated platforms, this may result in account suspension, making further interactions with the site impossible from that account.

  • Legal Penalties: Violating a website's ToS or scraping data unlawfully can lead to legal action. Laws such as the Computer Fraud and Abuse Act (CFAA) in the U.S. have been used to pursue lawsuits against unauthorized scraping, especially if it's done at scale or causes harm (such as server overload). Companies can face lawsuits for unauthorized use of proprietary data or violating intellectual property rights.

  • Data Breaches and Privacy Violations: If scraping is used to collect personal or sensitive data without consent, it can lead to severe privacy violations. This can expose businesses to penalties under regulations like GDPR, which can impose heavy fines for non-compliance, and reputational damage.

  • Server Overload: Excessive scraping can strain a website’s servers, especially if done without rate-limiting or throttling. This can cause performance issues for the website, leading to possible financial or legal claims against the scraper for damages caused by server downtime.

Ethical Considerations

  • Fair Use: Even if scraping is legal, it's important to consider the ethical use of the data. For instance, scraping content to directly copy and republish it for profit without adding value is generally unethical and may infringe on copyright laws. Ethical scraping should aim to provide new insights, analysis, or utility from the data.

  • User Consent: Scraping platforms that collect user-generated content (like social media) should consider user privacy and consent. Even if the content is publicly available, using it in ways that violate privacy expectations can lead to ethical concerns and backlash.

  • Transparency: Scrapers should be transparent about their intentions, especially if the scraping is for commercial purposes. Providing appropriate attributions or using data responsibly demonstrates ethical integrity.


Conclusion

ScrapeOps and Scrape.do offer very similar products. Both of these solutions give you a reliable rotating proxy with residential and mobile options all over the world. They're also very similar in terms of cost.

ScrapeOps offers a wider variety of plans, and both APIs returned responses in a similar amount of time.

Both of these solutions will help you get the data you need.


More Web Scraping Guides

Whether you're brand new to scraping or you're a hardened web developer, we have something for you. We wrote the playbook on scraping with Python.

Bookmark one of the articles below and level up your scraping toolbox!