ScrapingAnt: Web Scraping Integration Guide
ScrapingAnt is a great proxy provider. They are one of the many providers used in our ScrapeOps Proxy Aggregator.
In this article, we're going to go through their proxy bit by bit and see how it stacks up against the ScrapeOps Proxy Aggregator.
- TLDR: Scraping With ScrapingAnt
- What is ScrapingAnt?
- Setting Up the ScrapingAnt API
- Advanced Functionality
- JavaScript Rendering
- Country Geotargeting
- Residential Proxies
- Custom Headers
- Static Proxies
- Screenshot Functionality
- Auto Parsing
- Case Study: IMDB Top 250 Movies
- Alternative: ScrapeOps Proxy Aggregator
- Troubleshooting
- Conclusion
- More Web Scraping Guides
Need help scraping the web?
Then check out ScrapeOps, the complete toolkit for web scraping.
TLDR: Web Scraping With ScrapingAnt
Getting started with ScrapingAnt is super easy. All you need is the proxy function below.
- This code takes in a URL and returns a ScrapingAnt proxied URL ready for use.
- We also set `browser` to `False`; this way you're paying 1 API credit per request instead of 10.
- To customize your proxy further, you can read their additional params here.
def get_proxy_url(url):
payload = {
"x-api-key": API_KEY,
"url": url,
"browser": False
}
proxy_url = 'https://api.scrapingant.com/v2/general?' + urlencode(payload)
return proxy_url
What Is ScrapingAnt?
As mentioned above, ScrapingAnt is one of the providers used in the ScrapeOps Proxy Aggregator. It gives us numerous options for customizing our scrape and has a very upfront pricing structure. We can use ScrapingAnt to access countless sites across the web that would normally block our scraper.
Whenever we use a proxy provider, the process goes as follows.
- We send our `url` and our `api_key` to the proxy service.
- The provider attempts to get our `url` through one of their servers.
- The provider receives their response.
- The provider sends the response back to us.
During a scrape like this, the proxy server can route our requests through multiple IP addresses. This makes our requests look like they're coming from many different sources, as if each one came from a different user. When you use any scraping API, all of the following are true.
- You tell the API which site you want to access.
- Their servers access the site for you.
- You scrape your desired site(s).
How Does ScrapingAnt API Work?
When we use the ScrapingAnt API, we send them a URL and our API key. The URL tells ScrapingAnt the site we'd like to access. Our API key tells ScrapingAnt who we are. This way, their servers can tell how many credits we have left on our plan and what our plan allows us to do.
The table below contains a full list of parameters we can send to ScrapingAnt using a GET request.
Parameter | Description |
---|---|
x-api-key (required) | Your ScrapingAnt API key (string) |
url (required) | The url you'd like to scrape (string) |
browser | Render the page with a headless browser (boolean, true by default) |
return_page_source | Return the unaltered page (boolean, false by default, requires browser) |
cookies | Pass cookies in with a request for authentication (string) |
js_snippet | Execute a JavaScript snippet (string, requires browser) |
proxy_type | Specify your IP type (string, datacenter by default) |
proxy_country | The country you want to be routed through (string) |
wait_for_selector | Wait for a specific CSS selector to show (string, requires browser) |
block_resource | Block resources from loading (images, media, etc.) (string) |
Here is an example of a request with the ScrapingAnt API.
import requests
url = "https://api.scrapingant.com/v2/general"
params = {
'url': 'https://example.com',
'x-api-key': 'your-super-secret-api-key'
}
response = requests.get(url, params=params)
print(response.text)
Response Format
ScrapingAnt allows us to retrieve our data as JSON using the `extended` endpoint. In our example from earlier, we can alter it to retrieve JSON in the following way.
import requests
url = "https://api.scrapingant.com/v2/extended"
params = {
'url': 'https://example.com',
'x-api-key': 'your-super-secret-api-key'
}
response = requests.get(url, params=params)
print(response.text)
To receive your response as JSON, simply change endpoints from `general` to `extended`.
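Since the `extended` endpoint returns JSON, you can load it with `response.json()` instead of treating it as raw text. The field names used below (such as `content` for the page HTML) are assumptions, not confirmed from the docs; print the full object once to confirm what your plan actually returns.
import requests
url = "https://api.scrapingant.com/v2/extended"
params = {
    'url': 'https://example.com',
    'x-api-key': 'your-super-secret-api-key'
}
response = requests.get(url, params=params)
data = response.json()
# Field names are assumptions -- inspect the full payload to confirm them.
print(list(data.keys()))
print(data.get("content", "")[:200])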
ScrapingAnt API Pricing
You can view the lower tier price options from ScrapingAnt below.
Their higher cost plans are in the next image.
Plan | API Credits per Month | Price per Month |
---|---|---|
Enthusiast | 100,000 | $19 |
Startup | 500,000 | $49 |
Business | 3,000,000 | $249 |
Business Pro | 8,000,000 | $599 |
Custom | N/A | $699+ |
With each of these plans, you only pay for successful requests. If the API fails to get your page, you pay nothing. Each plan also includes the following:
- Page Rendering
- Rotating Proxies
- JavaScript Execution
- Custom Cookies
- Fastest AWS and Hetzner Servers
- Unlimited Parallel Requests
- Residential Proxies
- Supports All Programming Languages
- CAPTCHA Avoidance
Response Status Codes
When using their API, there are a series of status codes we might get back. 200 is the one we want; a sketch for handling the others follows the table below.
Status Code | Type | Possible Causes/Solutions |
---|---|---|
200 | Success | It worked! |
400 | Bad Request | Improper Request Format |
403 | API Key | Usage Exceeded, or Wrong API Key |
404 | Not Found | Site Not Found, Page Not Found |
405 | Not Allowed | Method Not Allowed |
409 | Concurrency Limit | Exceeded Concurrency Limit |
422 | Invalid | Invalid Value Provided |
423 | Detected by Anti-bot | Please Change/Retry the Request |
500 | Internal Server Error | Context Cancelled, Unknown Error |
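Below is a minimal sketch of acting on these codes: retry the transient ones (409, 423, 500) with a short backoff and fail fast on the rest. The retry count and sleep times are arbitrary choices for illustration, not ScrapingAnt recommendations.
import time
import requests
RETRYABLE_CODES = {409, 423, 500}
def fetch_with_retries(params, max_retries=3):
    # Call the general endpoint, retrying transient failures with exponential backoff.
    for attempt in range(max_retries):
        response = requests.get("https://api.scrapingant.com/v2/general", params=params)
        if response.status_code == 200:
            return response
        if response.status_code in RETRYABLE_CODES:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s...
            continue
        response.raise_for_status()  # 400/403/404/405/422: fix the request rather than retry
    raise Exception(f"Request still failing after {max_retries} attempts")
response = fetch_with_retries({
    "url": "https://example.com",
    "x-api-key": "your-super-secret-api-key",
    "browser": False
})
print(response.status_code)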
Setting Up ScrapingAnt API
Before we get our ScrapingAnt API key, we need to create an account. If you haven't already, you can do that here.
You can use any of the following methods to create your new ScrapingAnt account.
- Create an account with Google
- Create an account with Github
- Create an account with an email address and password
Once you have an account, you can go to their dashboard and gain access to everything you need from ScrapingAnt. The dashboard includes all of your account management along with a request generator and links to their documentation.
Here is the request generator.
On the dashboard screenshot, I exposed my API key. This may seem like a big deal, but it's really not. If you navigate to the profile tab, you'll see a button called `GENERATE NEW API TOKEN`. I can click this button (like in the screenshot below) and I'll receive a new key that you don't have access to.
Once you've got an API key, you're all set to start using the ScrapingAnt API.
API Endpoint Integration
With the ScrapingAnt API, we're really only using two endpoints: one for standard HTML and one for a JSON response. We use the `general` endpoint for a standard request and the `extended` endpoint for a JSON response. This gives developers some flexibility and lets them choose the format they prefer.
While we posted them above in separate examples, you can view them both below for convenience.
HTML Response
import requests
url = "https://api.scrapingant.com/v2/general"
params = {
'url': 'https://example.com',
'x-api-key': 'your-super-secret-api-key'
}
response = requests.get(url, params=params)
print(response.text)
JSON Response
import requests
url = "https://api.scrapingant.com/v2/extended"
params = {
'url': 'https://example.com',
'x-api-key': 'your-super-secret-api-key'
}
response = requests.get(url, params=params)
print(response.text)
As you can see in the examples above, we use these endpoints to control our response type.
Proxy Port Integration
When we use a proxy port, our browser or HTTP client passes all requests through a specific location by default. For standard HTTP, we use port `8080`. When using HTTPS, we use port `443`. You can view the full URL structure below.
'http': 'http://scrapingant:your-super-secret-api-key@proxy.scrapingant.com:8080'
'https': 'https://scrapingant:your-super-secret-api-key@proxy.scrapingant.com:443'
Below is an example of how to do this using Python Requests.
# pip install requests
import requests
API_KEY = "your-super-secret-api-key"
url = "https://quotes.toscrape.com"
proxy_url = f"scrapingant:{API_KEY}@proxy.scrapingant.com"  # username "scrapingant", password = API key, matching the URL structure above
proxies = {
"http": f"http://{proxy_url}:8080",
"https": f"https://{proxy_url}:443"
}
response = requests.get(url, proxies=proxies, verify=False)
print(response.text)
Proxy ports are best when you just want to set it and forget it. If you don't need to make special requests through your proxy or customize it at all, they can be a very convenient option. This sort of thing is best for newbies and people who don't want to think about their proxy logic.
SDK Integration
ScrapingAnt has an SDK (Software Development Kit) available for anyone who wants to use it. SDKs are far easier for beginners and people who aren't familiar with web development, since the SDK handles the low-level requests for you.
You can install it via `pip`.
pip install scrapingant-client
Here's an example of it in action.
from scrapingant_client import ScrapingAntClient
client = ScrapingAntClient(token='<YOUR-SCRAPINGANT-API-TOKEN>')
# Scrape the example.com site.
result = client.general_request('https://example.com')
print(result.content)
As you can see above, this approach has a much lower barrier to entry.
Managing Concurrency
Managing concurrency is pretty straightforward if you're familiar with `ThreadPoolExecutor`. `ThreadPoolExecutor` allows us to open a new thread pool with `x` number of threads. On each open thread, we can run a function of our choosing.
import requests
from bs4 import BeautifulSoup
import concurrent.futures
from urllib.parse import urlencode
API_KEY = 'YOUR_API_KEY'
NUM_THREADS = 5
def get_proxy_url(url):
payload = {"x-api-key": API_KEY, "url": url}
proxy_url = 'https://api.scrapingant.com/v2/general?' + urlencode(payload)
return proxy_url
## Example list of urls to scrape
list_of_urls = [
"https://quotes.toscrape.com/page/1/",
"https://quotes.toscrape.com/page/2/",
"http://quotes.toscrape.com/page/3/",
]
output_data_list = []
def scrape_page(url):
try:
response = requests.get(get_proxy_url(url))
if response.status_code == 200:
soup = BeautifulSoup(response.text, "html.parser")
title = soup.find("h1").text
## add scraped data to "output_data_list" list
output_data_list.append({
'title': title,
})
except Exception as e:
print('Error', e)
with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
executor.map(scrape_page, list_of_urls)
print(output_data_list)
`executor.map()` holds all the keys here:
- Our first argument is `scrape_page`: the function we want to call on each thread.
- Our second is `list_of_urls`: the list of arguments we want to pass into `scrape_page`.
Any other arguments to the function also get passed in as arrays, as shown in the sketch below.
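For example, if `scrape_page` also accepted a location and a retry count, you would pass one list per extra parameter, each the same length as `list_of_urls`. The three-argument `scrape_page` below is a hypothetical variant, included only to illustrate the call shape.
import concurrent.futures
list_of_urls = [
    "https://quotes.toscrape.com/page/1/",
    "https://quotes.toscrape.com/page/2/",
    "https://quotes.toscrape.com/page/3/",
]
def scrape_page(url, location, retries):
    # Placeholder body -- your real scraping logic goes here.
    print(url, location, retries)
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # One iterable per parameter: urls, then locations, then retry counts.
    executor.map(
        scrape_page,
        list_of_urls,
        ["us"] * len(list_of_urls),
        [3] * len(list_of_urls),
    )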
Advanced Functionality
We briefly touched on ScrapingAnt's advanced functionality earlier in this piece. Now, we'll look at it in more detail. Take a look at the table below for a breakdown of it all.
Parameter | API Credits | Description |
---|---|---|
browser | 10 | use a headless browser, true by default |
cookies | 1 | pass cookies with the request for authentication |
custom_headers | 1 | send custom headers to the server |
proxy_type | 1 or 25 (residential) | choose your IP address type |
proxy_country | 1 | set a custom geolocation |
js_snippet | 10 | execute JavaScript snippet, requires browser |
wait_for_selector | 10 | waits for a CSS selector, requires browser |
You can view their full API documentation here.
JavaScript Rendering
Many modern websites rely heavily on JavaScript to dynamically load content, manipulate the DOM, and make API calls.
JavaScript rendering functionality refers to the capability of web scraping tools or browsers to fully load and execute JavaScript on a web page.
JavaScript rendering is essential for scraping dynamic websites that load content client-side, allowing for accurate data extraction and better handling of interactive features.
ScrapingAnt renders JavaScript by default using a headless browser. To turn off the headless browser, simply include `"browser": False` inside your payload. If you're looking to save API credits, this is a really important parameter to remember: requests without the browser cost 1 API credit, while requests with the browser cost 10.
The following code renders a page using the browser.
# pip install requests
import requests
url = "https://quotes.toscrape.com"
api_key = "your-super-secret-api-key"
params = {
"url": url,
"x-api-key": api_key,
}
response = requests.get('https://api.scrapingant.com/v2/general', params=params)
print(response.text)
To turn this off, we would use the snippet below instead.
# pip install requests
import requests
url = "https://quotes.toscrape.com"
api_key = "your-super-secret-api-key"
params = {
"url": url,
"x-api-key": api_key,
"browser": False
}
response = requests.get('https://api.scrapingant.com/v2/general', params=params)
print(response.text)
You can view the documentation for this here.
Controlling The Browser
To control the browser, we have ScrapingAnt execute JavaScript for us via the `js_snippet` parameter. We first write our JavaScript as a string, encode it to bytes using `utf-8`, Base64-encode those bytes, and then decode the result back to a `utf-8` string so it can be sent to the ScrapingAnt API.
# pip install requests
import requests
import base64
url = "https://api.scrapingant.com/v2/general"
js_action = "document.getElementById('myButton').click();"
encoded_js = base64.b64encode(js_action.encode("utf-8")).decode("utf-8")
params = {
"url": "https://example.com",
"x-api-key": "your-super-secret-api-key",
"js_snippet": encoded_js,
}
response = requests.get(url, params=params)
print(response.text)
The browser control docs are available here.
Country Geotargeting
Country geotargeting functionality allows web scraping tools or proxies to simulate requests from specific geographic locations or countries.
By using IP addresses tied to certain regions, this feature enables users to access location-specific content, services, and pricing as if they were physically present in that country.
Country geotargeting allows users to access and interact with region-specific content, monitor pricing differences, verify ads, and test localized services, making it crucial for global business operations, competitive analysis, and compliance.
Geolocation is really easy to control, and it costs us nothing extra to set a custom country! If you turn your browser off and set a custom location, you're still only paying 1 API credit for each request.
# pip install requests
import requests
url = "https://quotes.toscrape.com"
api_key = "your-super-secret-api-key"
params = {
"url": url,
"x-api-key": api_key,
"browser": False,
"proxy_country": "US"
}
response = requests.get('https://api.scrapingant.com/v2/general', params=params)
print(response.text)
On top of that, their country list is huge compared to other providers.
Country | Country Code |
---|---|
Brazil | "BR" |
Canada | "CA" |
China | "CN" |
Czech Republic | "CZ" |
France | "FR" |
Germany | "DE" |
Hong Kong | "HK" |
India | "IN" |
Indonesia | "ID" |
Italy | "IT" |
Israel | "IL" |
Japan | "JP" |
Netherlands | "NL" |
Poland | "PL" |
Russia | "RU" |
Saudi Arabia | "SA" |
Singapore | "SG" |
South Korea | "KR" |
Spain | "ES" |
United Kingdom | "GB" |
United Arab Emirates | "AE" |
United States | "US" |
Vietnam | "VN" |
You can view the full documentation for this here.
Residential Proxies
Unlike data center proxies, which originate from cloud servers or hosting providers, residential proxies appear more legitimate to websites because they come from real user devices.
Residential proxies are ideal for avoiding detection, bypassing geo-restrictions, accessing localized content, and improving the success rate of web scraping or automated tasks. Their ability to mimic genuine users makes them essential for tasks requiring high reliability and low chances of being blocked.
Residential requests use 25 API credits, as opposed to the 1 credit used by a standard datacenter IP address. We can switch to a residential proxy using the `proxy_type` parameter. This is set to `datacenter` by default, but we can simply change it to `residential`.
Here's a code example of how to use them.
# pip install requests
import requests
url = "https://quotes.toscrape.com"
api_key = "your-super-secret-api-key"
params = {
"url": url,
"x-api-key": api_key,
"browser": False,
"proxy_type": "residential"
}
response = requests.get('https://api.scrapingant.com/v2/general', params=params)
print(response.text)
You can view their full Residential Proxy Port integration guide here.
Custom Headers
Custom header functionality allows users to manually set and modify the HTTP request headers sent with web scraping or API requests. Typically, proxy APIs automatically manage these headers for optimal performance, but many proxy APIs also provide the option to send custom headers when needed.
Why Use Custom Headers?
-
Access Specific Data: Some websites or APIs require certain headers to provide access to specific data. For example, they may require an Authorization header or a special token to authenticate the request.
-
POST Requests: When sending POST requests, specific headers like Content-Type or Accept might be necessary to ensure that the target server processes the request correctly.
-
Bypass Anti-Bot Systems: Custom headers can help mimic real user behavior, making it easier to bypass certain anti-bot systems. Modifying headers like User-Agent, Referer, or Accept-Language can make your requests look like they’re coming from a genuine browser session.
Word of Caution
-
Impact on Performance: If used incorrectly, custom headers can reduce proxy performance. Sending the same static headers repeatedly may give away the fact that the requests are automated, increasing the likelihood of detection by anti-bot systems.
-
Need for Header Rotation: For large-scale web scraping, you need a system to continuously generate clean, dynamic headers to avoid being blocked. Static headers make your scraper more detectable and vulnerable to being flagged.
-
Only When Necessary: Custom headers should only be used if required. Letting proxy APIs handle headers is often more efficient since they are optimized to avoid detection and ensure higher success rates.
Adding custom headers is very simple. We just add the prefix `Ant-` to our header name. ScrapingAnt then picks these headers up and passes them on to the target server.
import requests
url = "https://api.scrapingant.com/v2/general"
params = {
"url": "https://httpbin.org/headers",
"x-api-key": "<YOUR_SCRAPINGANT_API_KEY>"
}
headers = {
"Ant-Custom-Header": "I <3 ScrapingAnt"
}
response = requests.get(url, params=params, headers=headers)
print(response.text)
Take a look at their docs here.
Static Proxies
Static proxy functionality, also known as sticky sessions, allows users to maintain the same IP address for an extended period when sending multiple requests.
Instead of switching IPs with each request (as rotating proxies do), static proxies ensure that the IP remains consistent for the duration of the session, making it appear as though all requests are coming from the same user.
ScrapingAnt does not give us the ability to run a static proxy. Static proxies are often used for session management (staying logged in over a period of time).
However, ScrapingAnt does give us the ability to pass cookies along to the site we're scraping. With most sites, once you login, your browser receives a cookie and this cookie is used to tell the website who you are and that you're logged in.
To pass cookies with ScrapingAnt, we can simply use the `cookies` parameter.
import requests
url = "https://api.scrapingant.com/v2/general"
params = {
"url": "https://example.com",
"x-api-key": "your-super-secret-api-key",
"cookies": "cookie_1=cookie_value_1",
"browser": "false"
}
response = requests.get(url, params=params)
print(response.text)
Screenshot Functionality
Screenshot functionality allows web scraping tools or automation software to capture a visual snapshot of a web page as it appears during the scraping process.
When you scrape the web, screenshots can be an irreplaceable debugging tool. However, ScrapingAnt sadly doesn't support screenshots. Several other providers do.
Auto Parsing
Auto Parsing is an excellent feature for a scraping API. With Auto Parsing, we can actually tell ScrapingAnt to try and parse the site for us! With this functionality, we only need to focus on our jobs as developers; we don't need to pick through all the nasty HTML. That said, it's good to exercise caution: ScrapingAnt uses AI to attempt the parse, and AI is sometimes prone to errors.
On top of that, we're not given an upfront cost model for the AI parser. ScrapingAnt executes the request and then charges our account based on the parse, so we don't know the cost until after the parse has been executed.
The following snippet tells ScrapingAnt that we want it to parse the page using AI.
import requests
url = "https://api.scrapingant.com/v2/extract"
params = {
"url": "https://example.com",
"x-api-key": "your-super-secret-api-key",
"browser": "false",
"extract_properties": "title, content"
}
response = requests.get(url, params=params)
print(response.text)
Unlike APIs that ship pre-built parsers for specific sites, ScrapingAnt uses an AI parser. There is no list of supported sites because (theoretically) they're all supported. However, AI is prone to errors, so don't expect perfection.
Case Study: Using Scraper APIs on IMDb Top 250 Movies
Time to scrape the top 250 movies from IMDB. We'll be using two virtually identical scrapers; the only difference will be the proxy function. Aside from the base domain name, the only difference between the proxy functions is the API key parameter: with ScrapeOps, we use `api_key`, while with ScrapingAnt, we use `x-api-key`.
Take a look at the snippets below and you'll notice the subtle difference between the proxy functions.
Here is the proxy function for ScrapeOps:
def get_scrapeops_url(url):
payload = {
"api_key": API_KEY,
"url": url,
}
proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
return proxy_url
Here is the same function for ScrapingAnt.
def get_proxy_url(url):
payload = {
"x-api-key": API_KEY,
"url": url,
"browser": False
}
proxy_url = 'https://api.scrapingant.com/v2/general?' + urlencode(payload)
return proxy_url
The full ScrapeOps code is available for you below.
import os
import requests
from bs4 import BeautifulSoup
import json
import csv
import logging
from urllib.parse import urlencode
import concurrent.futures
## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
API_KEY = ""
with open("config.json", "r") as config_file:
config = json.load(config_file)
API_KEY = config["scrapeops_api_key"]
def get_scrapeops_url(url):
payload = {
"api_key": API_KEY,
"url": url,
}
proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
return proxy_url
def scrape_movies(url, location="us", retries=3):
success = False
tries = 0
while not success and tries <= retries:
response = requests.get(get_scrapeops_url(url))
try:
if response.status_code != 200:
raise Exception(f"Failed response from server, status code: {response.status_code}")
soup = BeautifulSoup(response.text, "html.parser")
json_tag = soup.select_one("script[type='application/ld+json']")
json_data = json.loads(json_tag.text)["itemListElement"]
movie_list_length = 0
movie_list = []
for item in json_data:
movie_list.append(item["item"])
movie_list_length+=len(json_data)
print(f"Movie list length: {len(json_data)}")
with open("scrapeops-top-250.json", "w") as file:
json.dump(movie_list, file, indent=4)
success = True
except Exception as e:
logger.error(f"Failed to process page: {e}, retries left: {retries-tries}")
tries+=1
if not success:
raise Exception(f"Failed to scrape page, MAX RETRIES {retries} EXCEEDED!!!")
if __name__ == "__main__":
MAX_RETRIES = 3
logger.info("Starting IMDB scrape")
url = "https://www.imdb.com/chart/top/"
scrape_movies(url, retries=MAX_RETRIES)
logger.info("Scrape complete")
Take a look at the ScrapeOps results. The run took 4.335 seconds.
Here is our ScrapingAnt code as well.
import os
import requests
from bs4 import BeautifulSoup
import json
import csv
import logging
from urllib.parse import urlencode
import concurrent.futures
## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
API_KEY = ""
with open("config.json", "r") as config_file:
config = json.load(config_file)
API_KEY = config["scrapingant_api_key"]
def get_proxy_url(url):
payload = {
"x-api-key": API_KEY,
"url": url,
"browser": False
}
proxy_url = 'https://api.scrapingant.com/v2/general?' + urlencode(payload)
return proxy_url
def scrape_movies(url, location="us", retries=3):
success = False
tries = 0
while not success and tries <= retries:
response = requests.get(get_proxy_url(url))
try:
if response.status_code != 200:
raise Exception(f"Failed response from server, status code: {response.status_code}")
soup = BeautifulSoup(response.text, "html.parser")
json_tag = soup.select_one("script[type='application/ld+json']")
json_data = json.loads(json_tag.text)["itemListElement"]
movie_list_length = 0
movie_list = []
for item in json_data:
movie_list.append(item["item"])
movie_list_length+=len(json_data)
print(f"Movie list length: {len(json_data)}")
with open("scrapingant-top-250.json", "w") as file:
json.dump(movie_list, file, indent=4)
success = True
except Exception as e:
logger.error(f"Failed to process page: {e}, retries left: {retries-tries}")
tries+=1
if not success:
raise Exception(f"Failed to scrape page, MAX RETRIES {retries} EXCEEDED!!!")
if __name__ == "__main__":
MAX_RETRIES = 3
logger.info("Starting IMDB scrape")
url = "https://www.imdb.com/chart/top/"
scrape_movies(url, retries=MAX_RETRIES)
logger.info("Scrape complete")
Below is the output from ScrapingAnt. Our run took 5.304 seconds.
ScrapeOps was slightly faster than ScrapingAnt: 5.304 - 4.335 = 0.969, a difference of approximately one second. Depending on your location, hardware, and internet connection, you might receive different results.
Alternative: ScrapeOps Proxy API Aggregator
ScrapeOps Proxy API Aggregator is a service that combines the power of multiple top-tier proxy providers into a single solution, offering a variety of benefits for web scraping and automation tasks.
Here’s why you might want to use it:
-
Access to Multiple Proxy Providers: ScrapeOps integrates with over 20 leading residential and data center proxy providers, including popular names like Smartproxy, Bright Data, and Oxylabs. This means you don’t need to juggle multiple accounts or services; you can manage all your proxy needs through one platform.
-
Automatic Proxy Switching: The aggregator automatically switches between proxy providers based on performance, ensuring that you’re always using the best proxy for your task. This results in a 98% success rate, as it continuously optimizes the proxies used, reducing the chances of being blocked or flagged.
-
Bypass Anti-Bot Measures: With ScrapeOps, you can rotate through multiple proxies and user agents, making it easier to avoid detection by anti-bot systems. This is crucial for large-scale web scraping projects where sites are heavily guarded against automated requests.
-
Cost Optimization: ScrapeOps monitors proxy provider performance and pricing, helping you choose the most cost-effective option for your specific task. This ensures that you get the best balance of price and performance, which is especially useful for businesses working with large volumes of data.
-
Competitive Pricing Plans: The platform offers flexible pricing with bandwidth-based plans, starting with 500 MB of free bandwidth credits. Paid plans start at $9 per month, making it accessible for both small and large scraping projects. This flexibility allows you to scale your proxy usage as needed.
- Streamlined Management: Instead of managing multiple proxy providers, credentials, and payments, ScrapeOps centralizes everything, making it easier to maintain control over your proxy usage. It also offers reporting and analytics, so you can track proxy performance and optimize your scraping strategy.
ScrapeOps offers a larger variety of plans and costs much less to get started on a premium plan ($9 per month). On top of that, ScrapeOps doesn't just use ScrapingAnt as a provider; we have over 20 providers and we're adding new ones each week. This gives ScrapeOps far better reliability than other centralized solutions. If one provider fails, we simply route you through another.
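If you want to try the aggregator with the same request style used throughout this article, the call looks almost identical to the ScrapingAnt examples; only the endpoint and the key parameter change. The optional `country` and `residential` parameters below are assumptions about the ScrapeOps options, so verify the exact names against the ScrapeOps documentation before relying on them.
# pip install requests
import requests
from urllib.parse import urlencode
API_KEY = "your-scrapeops-api-key"
payload = {
    "api_key": API_KEY,
    "url": "https://quotes.toscrape.com",
    "country": "us",        # optional geotargeting (assumed parameter name)
    "residential": "true",  # optional residential pool (assumed parameter name)
}
response = requests.get("https://proxy.scrapeops.io/v1/?" + urlencode(payload))
print(response.status_code)
print(response.text[:500])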
Troubleshooting
Issue #1: Request Timeouts
We can set a `timeout` argument with Python Requests. Sometimes we run into issues where our requests time out. To fix this, just set a custom `timeout`.
import requests
# 5 second timeout
response = requests.get("https://httpbin.org/get", timeout=5)
Issue #2: Handling CAPTCHAs
If your proxy service is serving you CAPTCHAs, something is wrong. Both ScrapeOps and ScrapingAnt are built to avoid CAPTCHAs by default, but sometimes proxy providers can fail. If you run into a CAPTCHA, first try submitting the request again; this will often take care of it (the proxy provider will usually give you a new IP address). If that fails, try using a residential proxy (both ScrapeOps and ScrapingAnt offer these).
If the solutions outlined above fail (they shouldn't), you can always use 2captcha. We have an excellent article on bypassing CAPTCHAs here.
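One way to wire that advice into code is to retry a couple of times and switch `proxy_type` to `residential` on the final attempt. This is just a sketch built from the ScrapingAnt parameters covered earlier, and the CAPTCHA check is a naive substring test you would swap out for a detector suited to your target site.
# pip install requests
import requests
API_URL = "https://api.scrapingant.com/v2/general"
API_KEY = "your-super-secret-api-key"
def fetch_avoiding_captcha(url, max_retries=3):
    for attempt in range(max_retries):
        params = {"url": url, "x-api-key": API_KEY, "browser": False}
        if attempt == max_retries - 1:
            # Last resort: residential IPs (25 credits per request).
            params["proxy_type"] = "residential"
        response = requests.get(API_URL, params=params)
        # Naive check -- replace with something specific to the site you're scraping.
        if response.status_code == 200 and "captcha" not in response.text.lower():
            return response
    raise Exception(f"Could not get a clean response for {url}")
print(fetch_avoiding_captcha("https://quotes.toscrape.com").status_code)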
Issue #3: Headless Browser Integrations
When using headless browsers like Puppeteer or Playwright with Proxy APIs, there are often integration challenges. A headless browser operates without a graphical user interface (GUI) and is typically used for automation, web scraping, or testing tasks. However, these tools can run into issues when interacting with proxy APIs, leading to inefficient requests or failures.
Headless browsers typically aren't well suited to request-based Proxy APIs like ScrapingAnt because:
- There can be compatibility issues and unforeseen bugs when the browser makes background network requests, as headers and cookies don't get maintained across those requests.
- Proxy APIs charge per successful request, and to scrape one page a headless browser can make 10-100+ requests.
- If you want to use a headless browser, you need to use the proxy port integration method instead, as sketched below.
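If you do go that route, the credentials from the Proxy Port Integration section plug straight into a headless browser's proxy settings. Below is a minimal Playwright sketch assuming the `scrapingant` username / API-key password scheme shown earlier; treat it as a starting point rather than an official integration.
# pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright
API_KEY = "your-super-secret-api-key"
with sync_playwright() as p:
    browser = p.chromium.launch(
        proxy={
            "server": "http://proxy.scrapingant.com:8080",
            "username": "scrapingant",  # credential scheme from the proxy port section (assumed)
            "password": API_KEY,
        }
    )
    page = browser.new_page()
    page.goto("https://quotes.toscrape.com")
    print(page.title())
    browser.close()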
Issue #4: Invalid Response Data
Anytime you're dealing with web scraping, you'll sometimes run into invalid responses. To handle an invalid response, you need to understand the error code and what it means. ScrapingAnt error codes are available for review here.
The ScrapeOps error codes are available here.
In most cases, you need to double-check your parameters or make sure your bill is paid. Every once in a while, you may receive a different error code that you can look up in the links above.
The Legal & Ethical Implications of Web Scraping
When we scrape public data, we're typically in the clear legally. Public data is any data that is not gated behind a login. Private data is a completely different story: when dealing with it, you're subject to a whole slew of privacy laws and intellectual property regulations. The data we scraped in this article was public.
You should also take into account the Terms and Conditions and the `robots.txt` of the site you're scraping. You can view these documents from IMDB below.
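Python's standard library can check a `robots.txt` file for you before you start scraping. Here is a small example using `urllib.robotparser` against IMDB's robots file; note that it only tells you what the file allows, not whether the site's Terms of Service permit scraping.
from urllib import robotparser
rp = robotparser.RobotFileParser()
rp.set_url("https://www.imdb.com/robots.txt")
rp.read()
# True if the rules allow a generic crawler to fetch the Top 250 chart page.
print(rp.can_fetch("*", "https://www.imdb.com/chart/top/"))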
Consequences of Misuse
Violating either the terms of service or privacy policies of a website can lead to several consequences:
-
Account Suspension or IP Blocking: Scraping websites without regard for their policies often leads to being blocked from accessing the site. For authenticated platforms, this may result in account suspension, making further interactions with the site impossible from that account.
-
Legal Penalties: Violating a website's ToS or scraping data unlawfully can lead to legal action. Laws such as the Computer Fraud and Abuse Act (CFAA) in the U.S. have been used to pursue lawsuits against unauthorized scraping, especially if it's done at scale or causes harm (such as server overload). Companies can face lawsuits for unauthorized use of proprietary data or violating intellectual property rights.
-
Data Breaches and Privacy Violations: If scraping is used to collect personal or sensitive data without consent, it can lead to severe privacy violations. This can expose businesses to penalties under regulations like GDPR, which can impose heavy fines for non-compliance, and reputational damage.
-
Server Overload: Excessive scraping can strain a website’s servers, especially if done without rate-limiting or throttling. This can cause performance issues for the website, leading to possible financial or legal claims against the scraper for damages caused by server downtime.
Ethical Considerations
-
Fair Use: Even if scraping is legal, it's important to consider the ethical use of the data. For instance, scraping content to directly copy and republish it for profit without adding value is generally unethical and may infringe on copyright laws. Ethical scraping should aim to provide new insights, analysis, or utility from the data.
-
User Consent: Scraping platforms that collect user-generated content (like social media) should consider user privacy and consent. Even if the content is publicly available, using it in ways that violate privacy expectations can lead to ethical concerns and backlash.
-
Transparency: Scrapers should be transparent about their intentions, especially if the scraping is for commercial purposes. Providing appropriate attributions or using data responsibly demonstrates ethical integrity.
Conclusion
Both ScrapeOps and ScrapingAnt give us convenient and reliable ways to scrape the web. ScrapeOps has a bit more functionality, but ScrapingAnt provides a great experience as well, and both proxies are similar in terms of speed and efficiency. ScrapingAnt's headless-by-default behavior might annoy some users: by default, you're paying 10x the normal API credits for each request, but you can turn this off by setting `"browser": False`.
Both of these solutions will help you get the data you need.
More Web Scraping Guides
Whether you're brand new to scraping or a hardened developer, we have something for you. We wrote the playbook on web scraping. Bookmark one of the articles below and level up your scraping toolbox!