Scrape.do: Web Scraping Integration Guide
Scrape.do is a powerful platform that simplifies web scraping by offering a seamless API integration for extracting data from websites, without the hassle of dealing with complex setups or getting blocked.
In this guide, we'll walk you through how to integrate Scrape.do into your projects, enabling you to scrape data effortlessly while maintaining compliance and performance.
- TLDR: Scraping With Scrape.do
- What is Scrape.do?
- Setting Up the Scrape.do API
- Advanced Functionality
- JavaScript Rendering
- Country Geotargeting
- Residential Proxies
- Custom Headers
- Static Proxies
- Screenshot Functionality
- Auto Parsing
- Case Study: IMDB Top 250 Movies
- Alternative: ScrapeOps Proxy Aggregator
- Troubleshooting
- Conclusion
- More Web Scraping Guides
Need help scraping the web?
Then check out ScrapeOps, the complete toolkit for web scraping.
TLDR: Web Scraping With Scrape.do
Starting with Scrape.do is pretty easy. The function below will get you started. To customize your proxy, you can check out the additional parameters in their docs.
from urllib.parse import urlencode

API_KEY = "YOUR_TOKEN"

def get_scrapedo_url(url):
    payload = {
        "token": API_KEY,
        "url": url,
    }
    proxy_url = "https://api.scrape.do/?" + urlencode(payload)
    return proxy_url
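Once the helper above is defined, using it is just a normal Requests call. Here's a minimal sketch (the quotes.toscrape.com URL is only an example target):

import requests

# Route the request through Scrape.do instead of hitting the target site directly
response = requests.get(get_scrapedo_url("https://quotes.toscrape.com/"))
print(response.status_code)
print(response.text[:500])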
What Is Scrape.do?
According to their landing page, Scrape.do is a "Rotating Proxy & Web Scraping API". This means that Scrape.do rotates between different proxies to assist in Web Scraping, much like the ScrapeOps Proxy Aggregator.
However, their providers are not listed on their site. Their use case is pretty similar to that of ScrapeOps, and the process is outlined below.
Whenever we use a proxy provider, the process goes as follows.
- We send our url and our api_key to the proxy service.
- The provider attempts to get our url through one of their servers.
- The provider receives the response.
- The provider sends the response back to us.
During a scrape like this, the proxy server can route all of our requests through multiple IP addresses. This makes our requests appear as if they're coming from a bunch of different sources and that each request is coming from a different user. When you use any scraping API, all of the following are true.
- You tell the API which site you want to access.
- Their servers access the site for you.
- You scrape your desired site(s).
How Does the Scrape.do API Work?
Each time we talk to Scrape.do, we need our API key. Along with our API key, there is a list of other parameters we can use in order to customize our request. We package up a request containing our API key and any other parameters to customize our scrape.
The table below contains a list of parameters commonly sent to Scrape.do.
Parameter | Description |
---|---|
token (required) | Your Scrape.do API key (string) |
url (required) | The url you'd like to scrape (string) |
super | Use residential and mobile IP addresses (boolean) |
geoCode | Route the request through a specific country (string) |
regionalGeoCode | Route the request through a specific continent (string) |
sessionId | Id used for sticky sessions (integer) |
customHeaders | Send custom headers in the request (bool) |
extraHeaders | Change header values or add new headers on top of existing ones (bool) |
forwardHeaders | Forward your own headers to the website (bool) |
setCookies | Set custom cookies for the site (string) |
disableRedirection | Disable redirects to other pages (bool) |
callback | Send results to a specific domain/address via webhook (string) |
timeout | Maximum time for a request (integer) |
retryTimeout | Maximum time for a retry (integer) |
disableRetry | Disable retry logic for your request (bool) |
device | Device you'd like to use (string, desktop by default) |
render | Render the content via a browser (bool, false by default) |
waitUntil | Wait until a certain condition (string, domcontentloaded by default) |
customWait | Wait an arbitrary amount of time (integer, 0 by default) |
waitSelector | Wait for a CSS selector to appear on the page |
width | Width of the browser in pixels (integer, 1920 by default) |
height | Height of the browser in pixels (integer, 1080 by default) |
blockResources | Block CSS and images from loading (boolean, true by default) |
screenShot | Take a screenshot of the visible page (boolean, false by default) |
fullScreenShot | Take a full screenshot of the page (boolean, false by default) |
particularScreenShot | Take a screenshot of a certain location on the page (string) |
playWithBrowser | Execute actions using the browser: scroll, click, etc. (string) |
output | Return output in either raw HTML or Markdown (string, raw by default) |
transparentResponse | Return only the target page (bool, false by default) |
Here is an example of a request with the Scrape.do API.
import requests
token = "YOUR_TOKEN"
targetUrl = "https://httpbin.co/anything"
url = "https://api.scrape.do"
payload = {
    "token": token,
    "url": targetUrl
}
response = requests.get(url, params=payload)
print(response.text)
Response Format
We can change our response format from HTML to JSON using the returnJSON parameter. returnJSON requires us to use "render": True.
import requests
token = "YOUR_TOKEN"
targetUrl = "https://httpbin.co/anything"
url = "https://api.scrape.do"
payload = {
    "token": token,
    "url": targetUrl,
    "render": True,
    "returnJSON": True
}
response = requests.get(url, params=payload)
print(response.text)
We can also change our response format to Markdown with the output parameter.
import requests
token = "YOUR_TOKEN"
targetUrl = "https://httpbin.co/anything"
url = "https://api.scrape.do"
payload = {
    "token": token,
    "url": targetUrl,
    "output": "markdown"
}
response = requests.get(url, params=payload)
print(response.text)
Scrape.do API Pricing
Scrape.do has a smaller selection of plans than most proxy providers. Their lowest tier plan is Hobby at $29 per month. Their largest plan is Business at $249 per month. Anything beyond Business requires a custom plan and you need to contact them directly to work it out.
Plan | API Credits per Month | Price per Month |
---|---|---|
Hobby | 250,000 | $29 |
Pro | 1,250,000 | $99 |
Business | 3,500,000 | $249 |
Custom | N/A | $249+ |
With each of these plans, you only pay for successful requests. If the API fails to get your page, you pay nothing. Each plan also includes the following:
- Concurrency (limits vary based on plan)
- Datacenter Proxies
- Sticky Sessions
- Unlimited Bandwidth
- Email Support
As the price increases, additional features are included on top of the benefits listed above.
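To compare the plans on equal footing, you can work out the effective cost per 1,000 API credits yourself. This is just a back-of-the-envelope sketch using the prices listed above (it assumes you use the full monthly allowance and that every request costs a single credit):

# Rough cost-per-credit comparison for the published Scrape.do plans
plans = {
    "Hobby": (29, 250_000),
    "Pro": (99, 1_250_000),
    "Business": (249, 3_500_000),
}

for name, (price, credits) in plans.items():
    cost_per_1k = price / (credits / 1000)
    print(f"{name}: ${cost_per_1k:.3f} per 1,000 credits")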
Response Status Codes
When using their API, there are numerous status codes we might get back. 200 is the one we want.
Status Code | Type | Possible Causes/Solutions |
---|---|---|
200 | Success | It worked! |
400 | Bad Request | The Request Was Invalid or Malformed |
401 | Account Issue | No API credits or Account Suspended |
404 | Not Found | Site Not Found, Page Not Found |
429 | Too Many Requests | Concurrency Limit Exceeded |
500 | Internal Server Error | Context Cancelled, Unknown Error |
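In practice, you usually want your code to branch on these codes instead of assuming success. Below is a minimal, hedged sketch of how you might do that with Python Requests; the retry and backoff numbers are arbitrary choices, not Scrape.do recommendations:

import time
import requests

def fetch_with_retries(proxy_url, max_retries=3):
    for attempt in range(max_retries):
        response = requests.get(proxy_url)
        if response.status_code == 200:
            return response                     # success, hand the page back
        if response.status_code in (400, 401, 404):
            # Bad request, account issue, or missing page: retrying won't help
            raise Exception(f"Unrecoverable status: {response.status_code}")
        # 429 / 500: back off briefly and try again
        time.sleep(2 ** attempt)
    raise Exception(f"Failed after {max_retries} attempts")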
Setting Up Scrape.do API
To actually get started, we need to create an account and obtain an API key. We can sign up using any of the following options.
- Create an account with Google
- Create an account with GitHub
- Create an account with LinkedIn
- Create an account with an email address and password
After generating an account, you can view the dashboard. The dashboard contains information about your plan and some analytics tools toward the bottom of the page.
Unlike ScrapeOps and some of the other sites we've explored, Scrape.do does not appear to have a request builder or generator anywhere on their site.
While taking the dashboard screenshot, I exposed my API key. This actually isn't a big deal. As any good API service should, Scrape.do gives us the ability to change our API key. To update your key, you need to enter your password and complete a CAPTCHA.
Once you've got your API key, you're all set to start using the Scrape.do API.
API Endpoint Integration
When dealing with the Scrape.do API, we don't have to worry about any custom endpoints, which leaves us with less to keep track of. All of our requests are made to "https://api.scrape.do". Instead of a custom endpoint, we're using the apex domain of scrape.do and the subdomain of api.
With no custom endpoints to worry about, all requests simply go to "https://api.scrape.do".
Proxy Port Integration
Proxy Port Integration is a great tool for beginners and people who just want to set the proxy and forget it. With Proxy Port Integration, you can tell Requests to use a specific proxy and then just worry about coding as normal.
Scrape.do requires that we set verify to False. This way, we don't have to worry about our HTTP client rejecting the Scrape.do CA certificate. All requests go through port 8080.
http://YOUR-API-KEY:@proxy.scrape.do:8080
Below is an example of how to do this using Python Requests.
import json
import requests
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
url = "https://httpbin.co/anything"
token = "YOUR_TOKEN"
proxyModeUrl = f"http://{token}:@proxy.scrape.do:8080"
proxies = {
    "http": proxyModeUrl,
    "https": proxyModeUrl,
}
response = requests.get(url, proxies=proxies, verify=False)
print(response.text)
When you set your proxy like the example above, you can just continue coding as normal, and you don't have to worry about custom proxy settings or finer control.
Managing Concurrency
With Scrape.do, we're given at least some concurrency with each plan. Concurrency allows us to make multiple requests simultaneously.
For example, we could send a request to https://quotes.toscrape.com/page/1/ and, while we're still awaiting that request, send another one to https://quotes.toscrape.com/page/2/.
Even on the free trial we get a concurrency limit of 5. This is pretty generous.
import requests
from bs4 import BeautifulSoup
import concurrent.futures
from urllib.parse import urlencode
API_KEY = 'YOUR_API_KEY'
NUM_THREADS = 5
def get_proxy_url(url):
    payload = {"token": API_KEY, "url": url}
    proxy_url = 'https://api.scrape.do/?' + urlencode(payload)
    return proxy_url

## Example list of urls to scrape
list_of_urls = [
    "https://quotes.toscrape.com/page/1/",
    "https://quotes.toscrape.com/page/2/",
    "http://quotes.toscrape.com/page/3/",
]

output_data_list = []

def scrape_page(url):
    try:
        response = requests.get(get_proxy_url(url))
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, "html.parser")
            title = soup.find("h1").text

            ## add scraped data to "output_data_list" list
            output_data_list.append({
                'title': title,
            })
    except Exception as e:
        print('Error', e)

with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    executor.map(scrape_page, list_of_urls)

print(output_data_list)
Pay attention to executor.map(); this is where all of our concurrency happens.
- The first argument is the function we want to call on each thread, scrape_page.
- list_of_urls is the list of arguments we want to pass into each instance of scrape_page.
- Any other arguments to the function also get passed in as arrays, as shown in the sketch below.
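For instance, if scrape_page took a second argument, you would pass one iterable per positional argument and executor.map would pair them up call by call. This is a hypothetical sketch to illustrate the pattern, not part of the scraper above:

import concurrent.futures

def scrape_page(url, retries):
    # Placeholder body: a real version would fetch and parse the url
    print(f"scraping {url} with {retries} retries")

urls = [
    "https://quotes.toscrape.com/page/1/",
    "https://quotes.toscrape.com/page/2/",
]
retry_counts = [3, 3]   # one entry per url

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    # Each positional argument gets its own iterable
    executor.map(scrape_page, urls, retry_counts)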
Advanced Functionality
Now, we're going to delve into the advanced functionality of Scrape.do.
As you saw in one of our tables earlier, Scrape.do offers a ton of different advanced options we can use to customize our scrape. We get everything from custom countries and IP addresses to full blown browser control using the API.
You already saw a breakdown of their advanced functionality in the parameter table earlier. The pricing for all of these features is rather simple; you can see a pricing breakdown below.
Request Type | API Credits |
---|---|
Normal Request (Datacenter IP) | 1 |
Normal + Headless Browser | 5 |
Residential + Mobile (Super) | 10 |
Super + Headless Browser | 25 |
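Based on the table above, you can estimate how many credits a scrape will burn before you run it. A small helper like the one below (a sketch based purely on the published table, not an official Scrape.do calculator) makes that easy:

def credits_per_request(use_super=False, use_browser=False):
    # Credit costs taken from the pricing table above
    if use_super and use_browser:
        return 25   # Super (residential/mobile) + headless browser
    if use_super:
        return 10   # Super proxy, no browser
    if use_browser:
        return 5    # Datacenter IP + headless browser
    return 1        # Plain datacenter request

# Example: 10,000 rendered requests through residential proxies
print(credits_per_request(use_super=True, use_browser=True) * 10_000)  # 250000 credits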
If you need to review the actual requests, you can see the table again below.
Parameter | Description |
---|---|
token (required) | Your Scrape.do API key (string) |
url (required) | The url you'd like to scrape (string) |
super | Use residential and mobile IP addresses (boolean) |
geoCode | Route the request through a specific country (string) |
regionalGeoCode | Route the request through a specific continent (string) |
sessionId | Id used for sticky sessions (integer) |
customHeaders | Send custom headers in the request (bool) |
extraHeaders | Change header values or add new headers on top of existing ones (bool) |
forwardHeaders | Forward your own headers to the website (bool) |
setCookies | Set custom cookies for the site (string) |
disableRedirection | Disable redirects to other pages (bool) |
callback | Send results to a specific domain/address via webhook (string) |
timeout | Maximum time for a request (integer) |
retryTimeout | Maximum time for a retry (integer) |
disableRetry | Disable retry logic for your request (bool) |
device | Device you'd like to use (string, desktop by default) |
render | Render the content via a browser (bool, false by default) |
waitUntil | Wait until a certain condition (string, domcontentloaded by default) |
customWait | Wait an arbitrary amount of time (integer, 0 by default) |
waitSelector | Wait for a CSS selector to appear on the page |
width | Width of the browser in pixels (integer, 1920 by default) |
height | Height of the browser in pixels (integer, 1080 by default) |
blockResources | Block CSS and images from loading (boolean, true by default) |
screenShot | Take a screenshot of the visible page (boolean, false by default) |
fullScreenShot | Take a full screenshot of the page (boolean, false by default) |
particularScreenShot | Take a screenshot of a certain location on the page (string) |
playWithBrowser | Execute actions using the browser: scroll, click, etc. (string) |
output | Return output in either raw HTML or Markdown (string, raw by default) |
transparentResponse | Return only the target page (bool, false by default) |
They also have a special pricing structure for certain sites that are more difficult to scrape. The breakdown for that is available on this page.
You can view their full API documentation here.
JavaScript Rendering
JavaScript Rendering is a functionality that allows web scraping tools or browsers to execute and render JavaScript code on a webpage before extracting its content.
Many modern websites rely heavily on JavaScript to load dynamic content, such as product listings, user-generated content, or ads. This means that simply scraping the static HTML of a webpage may not capture all the data, especially if the data is loaded asynchronously via JavaScript.
To render JavaScript using their headless browser, we can use the render parameter. When set to True, this tells Scrape.do that we want to run the browser and render JavaScript.
import requests
token = "YOUR_TOKEN"
targetUrl = "https://httpbin.co/anything"
url = "https://api.scrape.do"
payload = {
    "token": token,
    "url": targetUrl,
    "render": True
}
response = requests.get(url, params=payload)
print(response.text)
You can view the documentation for this here.
Controlling The Browser
To control the browser, we can use the playWithBrowser parameter. This parameter is pretty self-explanatory: it tells Scrape.do that we want to play with the browser. We pass our browser actions in as an array, and we need to set render to True.
import requests
url = "https://api.scrape.do"
params = {
    "render": True,
    "playWithBrowser": '[{"Action":"Click","Selector":"#html-page"}]',
    "token": "YOUR_TOKEN",
    "url": "https://httpbin.co/"
}
response = requests.get(url, params=params)
print(response.text)
The browser control docs are available here.
Country Geotargeting
Country Geotargeting is a feature in proxy and web scraping services that allows users to access and extract data from websites as if they were located in a specific country.
By routing requests through IP addresses from different geographic locations, this functionality lets you appear as if you're browsing from a particular country, enabling access to location-specific content.
Country geotargeting is useful for extracting localized data, accessing geo-restricted content, and conducting region-specific analysis for marketing, pricing, and competitive insights.
Setting our geolocation with Scrape.do is extremely easy. We do it almost exactly the same way that we would with ScrapeOps. The major difference is that we can control our location by country or by continent.
To control our country, we use the geoCode parameter. This is only available with the Pro Plan or higher.
import requests
url = "https://api.scrape.do"
params = {
    "token": "YOUR_TOKEN",
    "url": "https://httpbin.co/",
    "geoCode": "us"
}
response = requests.get(url, params=params)
print(response.text)
Scrape.do supports a decent list of countries when using a Datacenter IP. If you're choosing a Super Proxy, the list is even bigger than this one!
Country | Country Code |
---|---|
United States | "us" |
Great Britain | "gb" |
Germany | "de" |
Turkey | "tr" |
Russia | "ru" |
France | "fr" |
Israel | "il" |
India | "in" |
Brazil | "br" |
Ukraine | "ua" |
Pakistan | "pk" |
Netherlands | "nl" |
United Arab Emirates | "ae" |
Saudi Arabia | "sa" |
Mexico | "mx" |
Egypt | "eg" |
Slovakia | "sk" |
Italy | "it" |
Singapore | "sg" |
To control our location by continent, we use the regionalGeoCode parameter instead. Regional Geotargeting requires a Super Proxy.
import requests
url = "https://api.scrape.do"
params = {
    "token": "YOUR_TOKEN",
    "url": "https://httpbin.co/",
    "super": True,
    "regionalGeoCode": "northamerica"
}
response = requests.get(url, params=params)
print(response.text)
Here is their list of regional geocodes.
Continent | Geocode |
---|---|
Europe | europe |
Asia | asia |
Africa | africa |
Oceania | oceania |
North America | northamerica |
South America | southamerica |
You can view the full documentation for this here.
Residential Proxies
Residential proxies are proxy servers that use IP addresses assigned to real residential devices, such as computers, smartphones, or smart TVs, by Internet Service Providers (ISPs).
These proxies are tied to actual, physical locations and appear as normal, everyday users to the websites they access. This makes them highly reliable and difficult to detect as proxies, especially compared to data center proxies.
Residential proxies provide a high level of anonymity and credibility, as they mimic real user behavior by using genuine IPs from ISPs.
To use residential proxies with Scrape.do, we use the super parameter. super tells Scrape.do that we'd like to use the Residential & Mobile Proxy service.
Here's a code example of how to use them. If you do not set a geoCode (as seen in the geolocation examples above), your location will default to the US.
import requests
url = "https://api.scrape.do"
params = {
    "token": "YOUR_TOKEN",
    "url": "https://httpbin.co/",
    "super": True,
}
response = requests.get(url, params=params)
print(response.text)
You can view their full documentation for Super Proxies here.
Custom Headers
The custom header functionality in proxy APIs allows users to manually specify the HTTP request headers sent during web scraping or data collection tasks.
By default, proxy APIs manage these headers automatically, optimizing them for the best performance and minimizing detection. However, some proxy services give users the option to customize headers for specific needs, offering greater control over the data extraction process.
Why Use Custom Headers?
- Target Specific Data: Some websites require specific headers (such as user-agent, authorization, or referrer) to access certain content or retrieve accurate data.
- POST Requests: When sending POST requests, many websites expect certain headers like Content-Type or Accept. Custom headers ensure that your request is formatted correctly for the server to process.
- Bypass Anti-Bot Systems: Custom headers can help trick anti-bot systems by mimicking real browsers or users. This can include rotating user-agent strings, referring URLs, or cookies to make requests appear more legitimate.
Word of Caution
- Potential to Reduce Performance: Using static or improperly configured custom headers can make your requests appear automated, increasing the likelihood of detection and blocks. Proxy APIs are often better at dynamically adjusting headers for optimal performance.
- Risk of Getting Blocked: For large-scale web scraping, sending the same custom headers repeatedly can raise red flags. You'll need a system to continuously rotate and clean headers to avoid being blocked.
- Use Only When Necessary: In most cases, it's better to rely on the proxy service’s optimized headers unless you have a specific need. Custom headers should be used sparingly and strategically.
In summary, custom headers provide flexibility but should be used with caution to maintain proxy performance and avoid detection.
Custom headers are pretty easy to set. All we need to do is use customHeaders. This one is a boolean. When we set customHeaders to True, Scrape.do knows that we want to use custom headers.
import requests
url = "https://api.scrape.do"
params = {
    "token": "YOUR_TOKEN",
    "url": "https://httpbin.co/",
    "customHeaders": True,
}
headers = {
    "Test-Header-Key": "Test Header Value"
}
response = requests.get(url, params=params, headers=headers)
print(response.text)
Take a look at the docs here.
Static Proxies
Static proxies allow a user to maintain the same IP address for an extended period when making multiple requests.
Unlike rotating proxies, where the IP address changes with every request, a static proxy gives you consistent access to the same IP for a set duration, usually between a few minutes to hours, depending on the service.
Static Proxies are ideal for maintaining sessions over multiple requests. These are also called Sticky Sessions.
To use a Sticky Session, we use the sessionId parameter. In the example below, we set our sessionId to 1234. Scrape.do keeps track of the session via our API key, which ties the session's activity strictly to us.
import requests
url = "https://api.scrape.do"
params = {
    "token": "YOUR_TOKEN",
    "url": "https://httpbin.co/",
    "sessionId": 1234,
}
response = requests.get(url, params=params)
print(response.text)
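To convince yourself the session is actually sticky, you can request an IP-echo endpoint twice with the same sessionId and compare the results. This is only an illustrative sketch; it assumes httpbin.org/ip echoes back the IP address the request came from:

import requests

url = "https://api.scrape.do"

def get_ip(session_id):
    params = {
        "token": "YOUR_TOKEN",
        "url": "https://httpbin.org/ip",   # echoes back the requesting IP
        "sessionId": session_id,
    }
    return requests.get(url, params=params).text

# Two requests with the same sessionId should report the same IP
print(get_ip(1234))
print(get_ip(1234))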
Screenshot Functionality
The screenshot functionality allows users to capture images of web pages at any given point during the scraping process. This feature takes a visual snapshot of the rendered web page, preserving the exact layout, content, and appearance as seen by a user.
We get several options for screenshots with Scrape.do. Each option requires us to set render and returnJSON to True. In order to take the screenshot, we need a real browser to do it.
Here is a standard screenshot; it uses screenShot.
import requests
url = "https://api.scrape.do"
params = {
    "token": "YOUR_TOKEN",
    "url": "https://httpbin.co/",
    "render": True,
    "screenShot": True,
    "returnJSON": True
}
response = requests.get(url, params=params)
print(response.text)
Here is fullScreenShot.
import requests
url = "https://api.scrape.do"
params = {
    "token": "YOUR_TOKEN",
    "url": "https://httpbin.co/",
    "render": True,
    "fullScreenShot": True,
    "returnJSON": True
}
response = requests.get(url, params=params)
print(response.text)
Here is particularScreenShot. We use this one to take a shot of a specific CSS selector.
import requests
url = "https://api.scrape.do"
params = {
    "token": "YOUR_TOKEN",
    "url": "https://httpbin.co/",
    "render": True,
    "particularScreenShot": "h1",
    "returnJSON": True
}
response = requests.get(url, params=params)
print(response.text)
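Because these screenshot requests come back as JSON rather than raw HTML, it's worth inspecting the response structure before wiring it into a pipeline. The snippet below is only an exploratory sketch: it prints the top-level keys so you can see where Scrape.do puts the screenshot data, rather than assuming a particular field name:

import json

# 'response' is the result of one of the screenshot requests above
data = json.loads(response.text)

if isinstance(data, dict):
    print(list(data.keys()))           # inspect the top-level structure
elif isinstance(data, list):
    print(len(data), "items returned")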
Auto Parsing
Auto Parsing is a really cool feature in which the API will actually try to parse the site for you and return structured data. However, Scrape.do does not support auto parsing of any kind.
ScrapeOps and a few other sites support Auto Parsing. You can view their auto parsing features on the links below.
Case Study: IMDB Top 250 Movies
Now, we're going to scrape the top 250 movies from IMDB. Our scrapers will be pretty much identical. The major difference is the parameter for our API key. With ScrapeOps, we use api_key. With Scrape.do, we use token. Pretty much everything else in the code remains the same.
Take a look at the snippets below, you'll notice the subtle difference between the proxy functions.
Here is the proxy function for ScrapeOps:
def get_scrapeops_url(url):
    payload = {
        "api_key": API_KEY,
        "url": url,
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url
Here is our proxy function for Scrape.do.
def get_scrapedo_url(url):
    payload = {
        "token": API_KEY,
        "url": url,
    }
    proxy_url = "https://api.scrape.do/?" + urlencode(payload)
    return proxy_url
The full ScrapeOps code is available for you below.
import os
import requests
from bs4 import BeautifulSoup
import json
import csv
import logging
from urllib.parse import urlencode
import concurrent.futures
## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
API_KEY = ""
with open("config.json", "r") as config_file:
config = json.load(config_file)
API_KEY = config["scrapeops_api_key"]
def get_scrapeops_url(url):
payload = {
"api_key": API_KEY,
"url": url,
}
proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
return proxy_url
def scrape_movies(url, location="us", retries=3):
success = False
tries = 0
while not success and tries <= retries:
response = requests.get(get_scrapeops_url(url))
try:
if response.status_code != 200:
raise Exception(f"Failed response from server, status code: {response.status_code}")
soup = BeautifulSoup(response.text, "html.parser")
json_tag = soup.select_one("script[type='application/ld+json']")
json_data = json.loads(json_tag.text)["itemListElement"]
movie_list_length = 0
movie_list = []
for item in json_data:
movie_list.append(item["item"])
movie_list_length+=len(json_data)
print(f"Movie list length: {len(json_data)}")
with open("scrapeops-top-250.json", "w") as file:
json.dump(movie_list, file, indent=4)
success = True
except Exception as e:
logger.error(f"Failed to process page: {e}, retries left: {retries-tries}")
tries+=1
if not success:
raise Exception(f"Failed to scrape page, MAX RETRIES {retries} EXCEEDED!!!")
if __name__ == "__main__":
MAX_RETRIES = 3
logger.info("Starting IMDB scrape")
url = "https://www.imdb.com/chart/top/"
scrape_movies(url, retries=MAX_RETRIES)
logger.info("Scrape complete")
Here are the results from the ScrapeOps Proxy Aggregator. It took 6.159 seconds.
Here is the full Scrape.do code.
import os
import requests
from bs4 import BeautifulSoup
import json
import csv
import logging
from urllib.parse import urlencode
import concurrent.futures
## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
API_KEY = ""
with open("config.json", "r") as config_file:
config = json.load(config_file)
API_KEY = config["scrape_do_api_key"]
def get_scrapedo_url(url):
payload = {
"token": API_KEY,
"url": url,
}
proxy_url = "https://api.scrape.do/?" + urlencode(payload)
return proxy_url
def scrape_movies(url, location="us", retries=3):
success = False
tries = 0
while not success and tries <= retries:
response = requests.get(get_scrapedo_url(url))
try:
if response.status_code != 200:
raise Exception(f"Failed response from server, status code: {response.status_code}")
soup = BeautifulSoup(response.text, "html.parser")
json_tag = soup.select_one("script[type='application/ld+json']")
json_data = json.loads(json_tag.text)["itemListElement"]
movie_list_length = 0
movie_list = []
for item in json_data:
movie_list.append(item["item"])
movie_list_length+=len(json_data)
print(f"Movie list length: {len(json_data)}")
with open("scrape-do-top-250.json", "w") as file:
json.dump(movie_list, file, indent=4)
success = True
except Exception as e:
logger.error(f"Failed to process page: {e}, retries left: {retries-tries}")
tries+=1
if not success:
raise Exception(f"Failed to scrape page, MAX RETRIES {retries} EXCEEDED!!!")
if __name__ == "__main__":
MAX_RETRIES = 3
logger.info("Starting IMDB scrape")
url = "https://www.imdb.com/chart/top/"
scrape_movies(url, retries=MAX_RETRIES)
logger.info("Scrape complete")
Go ahead and compare them to the results from Scrape.do which took 5.256 seconds.
Scrape.do was slightly faster than ScrapeOps: 6.159 - 5.256 = 0.903 seconds, or just under a second. Depending on your conditions (location, time of day, and internet connection), ScrapeOps could just as easily be faster. In fact, on another run an hour later, ScrapeOps clocked in at 5.12 seconds and Scrape.do came in at 6.424 seconds.
Here is Scrape.do's other run.
Here is the other one for ScrapeOps.
Alternative: ScrapeOps Proxy API Aggregator
ScrapeOps and Scrape.do offer some pretty similar products. However, ScrapeOps really shines with our variety of pricing plans. With Scrape.do, you're stuck with one of three options or you need to arrange a custom plan.
With ScrapeOps, you get to choose between 8 different plans ranging in price from $9 per month to $249 per month.
Not only does ScrapeOps offer plans comparable to those of Scrape.do, we offer a wider variety with a lower barrier to entry (starting at $9/month).
Troubleshooting
Issue #1: Request Timeouts
We can set a timeout argument with Python Requests. Sometimes we run into issues where our requests time out. To fix this, just set a custom timeout.
import requests
# 5 second timeout
response = requests.get("https://httpbin.org/get", timeout=5)
Issue #2: Handling CAPTCHAs
Proxies are supposed to get us past CAPTCHAs. If you're receiving CAPTCHAs, your scraper has already failed to appear human. However, this does sometimes happen in the wild. To get through CAPTCHAs, first retry your request. If that doesn't work, change your location and/or consider using a residential IP address.
If the solutions outlined above fail (they shouldn't), you can always use 2captcha. We have an excellent article on bypassing CAPTCHAs here.
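If you want to automate that escalation (retry, then switch location, then move to a residential IP), a rough pattern looks like the sketch below. The CAPTCHA check here is a naive substring test and the parameter combinations are just the super/geoCode options described earlier; adjust both to your target site:

import requests
from urllib.parse import urlencode

API_KEY = "YOUR_TOKEN"

def fetch(url, extra_params=None):
    payload = {"token": API_KEY, "url": url}
    if extra_params:
        payload.update(extra_params)
    return requests.get("https://api.scrape.do/?" + urlencode(payload))

def fetch_avoiding_captcha(url):
    # Escalate through progressively "heavier" proxy settings
    attempts = [
        None,                              # plain datacenter request
        {"geoCode": "gb"},                 # try a different location
        {"super": True, "geoCode": "us"},  # residential/mobile IP
    ]
    for params in attempts:
        response = fetch(url, params)
        if response.status_code == 200 and "captcha" not in response.text.lower():
            return response
    raise Exception("Still hitting CAPTCHAs; consider a solving service like 2captcha")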
Issue #3: Invalid Response Data
Error codes are a common occurrence in all facets of web development. To handle error codes, we need to know why they're occurring. You can view the Scrape.do error codes here. The ScrapeOps error codes are available for review here.
In most cases, you need to double check your parameters or make sure your bill is paid. Every once in a while, you may receive a different error code that you can look up in the links above.
The Legal & Ethical Implications of Web Scraping
Web scraping is generally legal for public data. Private data is subject to numerous privacy laws and intellectual property policies. Public data is any data that is not gated behind a login.
You should also take into account the Terms and Conditions and the robots.txt
of the site you're scraping. You can view these documents from IMDB below.
Consequences of Misuse
Violating either the terms of service or privacy policies of a website can lead to several consequences:
- Account Suspension or IP Blocking: Scraping websites without regard for their policies often leads to being blocked from accessing the site. For authenticated platforms, this may result in account suspension, making further interactions with the site impossible from that account.
- Legal Penalties: Violating a website's ToS or scraping data unlawfully can lead to legal action. Laws such as the Computer Fraud and Abuse Act (CFAA) in the U.S. have been used to pursue lawsuits against unauthorized scraping, especially if it's done at scale or causes harm (such as server overload). Companies can face lawsuits for unauthorized use of proprietary data or violating intellectual property rights.
- Data Breaches and Privacy Violations: If scraping is used to collect personal or sensitive data without consent, it can lead to severe privacy violations. This can expose businesses to penalties under regulations like GDPR, which can impose heavy fines for non-compliance, and reputational damage.
- Server Overload: Excessive scraping can strain a website's servers, especially if done without rate-limiting or throttling. This can cause performance issues for the website, leading to possible financial or legal claims against the scraper for damages caused by server downtime.
Ethical Considerations
- Fair Use: Even if scraping is legal, it's important to consider the ethical use of the data. For instance, scraping content to directly copy and republish it for profit without adding value is generally unethical and may infringe on copyright laws. Ethical scraping should aim to provide new insights, analysis, or utility from the data.
- User Consent: Scraping platforms that collect user-generated content (like social media) should consider user privacy and consent. Even if the content is publicly available, using it in ways that violate privacy expectations can lead to ethical concerns and backlash.
- Transparency: Scrapers should be transparent about their intentions, especially if the scraping is for commercial purposes. Providing appropriate attributions or using data responsibly demonstrates ethical integrity.
Conclusion
ScrapeOps and Scrape.do offer very similar products. Both of these solutions give you a reliable rotating proxy with residential and mobile options all over the world. They're also very similar in terms of cost.
ScrapeOps offers a wider variety of plans and both APIs take a similar amount of time for our responses.
Both of these solutions will help you get the data you need.
More Web Scraping Guides
Whether you're brand new to scraping or you're a hardened web developer, we have something for you. We wrote the playbook on scraping with Python.
Bookmark one of the articles below and level up your scraping toolbox!