ScrapingAnt: Web Scraping Integration Guide
ScrapingAnt is a great proxy provider. They are one of the many providers used in our ScrapeOps Proxy Aggregator.
In this article, we're going to go through their proxy bit by bit and see how it stacks up against the ScrapeOps Proxy Aggregator.
- TLDR: Scraping With ScrapingAnt
- What is ScrapingAnt?
- Setting Up the ScrapingAnt API
- Advanced Functionality
- JavaScript Rendering
- Country Geotargeting
- Residential Proxies
- Custom Headers
- Static Proxies
- Screenshot Functionality
- Auto Parsing
- Case Study: IMDB Top 250 Movies
- Alternative: ScrapeOps Proxy Aggregator
- Troubleshooting
- Conclusion
- More Web Scraping Guides
Need help scraping the web?
Then check out ScrapeOps, the complete toolkit for web scraping.
TLDR: Web Scraping With ScrapingAnt
Getting started with ScrapingAnt is super easy. All you need is the proxy function below.
- This code takes in a URL and returns a ScrapingAnt proxied URL ready for use.
- We also set `browser` to `False`; this way you're paying 1 API credit per request instead of 10.
- To customize your proxy further, you can read their additional params here.
def get_proxy_url(url):
payload = {
"x-api-key": API_KEY,
"url": url,
"browser": False
}
proxy_url = 'https://api.scrapingant.com/v2/general?' + urlencode(payload)
return proxy_url
What Is ScrapingAnt?
As mentioned above, ScrapingAnt is one of the providers used in the ScrapeOps Proxy Aggregator. It gives us numerous options for customizing our scrape and has a very upfront pricing structure. We can use ScrapingAnt to access countless sites across the web that would normally block our scraper.
Whenever we use a proxy provider, the process goes as follows.
- We send our `url` and our `api_key` to the proxy service.
- The provider attempts to get our `url` through one of their servers.
- The provider receives their response.
- The provider sends the response back to us.
During a scrape like this, the proxy server can route our requests through multiple IP addresses. This makes our requests look like they're coming from many different sources, as if each one came from a different user. When you use any scraping API, all of the following are true.
- You tell the API which site you want to access.
- Their servers access the site for you.
- You scrape your desired site(s).
How Does ScrapingAnt API Work?
When we use the ScrapingAnt API, we send them a URL and our API key. The URL tells ScrapingAnt the site we'd like to access. Our API key tells ScrapingAnt who we are. This way, their servers can tell how many credits we have left on our plan and what our plan allows us to do.
The table below contains a full list of parameters we can send to ScrapingAnt using a GET request.
Parameter | Description |
---|---|
x-api-key (required) | Your ScrapingAnt API key (string) |
url (required) | The url you'd like to scrape (string) |
browser | Render the page with a headless browser (boolean, true by default) |
return_page_source | Return the unaltered page (boolean, false by default, requires browser) |
cookies | Pass cookies in with a request for authentication (string) |
js_snippet | Execute a JavaScript snippet (string, requires browser) |
proxy_type | Specify your IP type (string, datacenter by default) |
proxy_country | The country you want to be routed through (string) |
wait_for_selector | Wait for a specific CSS selector to show (string, requires browser) |
block_resource | Block resources from loading (images, media, etc.) (string) |
Here is an example of a request with the ScrapingAnt API.
import requests
url = "https://api.scrapingant.com/v2/general"
params = {
'url': 'https://example.com',
'x-api-key': 'your-super-secret-api-key'
}
response = requests.get(url, params=params)
print(response.text)
Response Format
ScrapingAnt allows us to retrieve our data as JSON using the `extended` endpoint. In our example from earlier, we can alter it to retrieve JSON in the following way.
import requests
url = "https://api.scrapingant.com/v2/extended"
params = {
'url': 'https://example.com',
'x-api-key': 'your-super-secret-api-key'
}
response = requests.get(url, params=params)
print(response.text)
To receive your response as JSON, simply change endpoints from `general` to `extended`.
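Since the `extended` endpoint returns JSON, you can load it with `response.json()` instead of treating it as raw text. The field names used below (such as `content` for the page HTML) are assumptions, not confirmed from the docs; print the full object once to confirm what your plan actually returns.
import requests
url = "https://api.scrapingant.com/v2/extended"
params = {
    'url': 'https://example.com',
    'x-api-key': 'your-super-secret-api-key'
}
response = requests.get(url, params=params)
data = response.json()
# Field names are assumptions -- inspect the full payload to confirm them.
print(list(data.keys()))
print(data.get("content", "")[:200])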
ScrapingAnt API Pricing
You can view the lower tier price options from ScrapingAnt below.
Their higher cost plans are in the next image.
Plan | API Credits per Month | Price per Month |
---|---|---|
Enthusiast | 100,000 | $19 |
Startup | 500,000 | $49 |
Business | 3,000,000 | $249 |
Business Pro | 8,000,000 | $599 |
Custom | N/A | $699+ |
With each of these plans, you only pay for successful requests. If the API fails to get your page, you pay nothing. Each plan also includes the following:
- Page Rendering
- Rotating Proxies
- JavaScript Execution
- Custom Cookies
- Fastest AWS and Hetzner Servers
- Unlimited Parallel Requests
- Residential Proxies
- Supports All Programming Languages
- CAPTCHA Avoidance
Response Status Codes
When using their API, there are a series of status codes we might get back. 200 is the one we want; a sketch for handling the others follows the table below.
Status Code | Type | Possible Causes/Solutions |
---|---|---|
200 | Success | It worked! |
400 | Bad Request | Improper Request Format |
403 | API Key | Usage Exceeded, or Wrong API Key |
404 | Not Found | Site Not Found, Page Not Found |
405 | Not Allowed | Method Not Allowed |
409 | Concurrency Limit | Exceeded Concurrency Limit |
422 | Invalid | Invalid Value Provided |
423 | Detected by Anti-bot | Please Change/Retry the Request |
500 | Internal Server Error | Context Cancelled, Unknown Error |
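Below is a minimal sketch of acting on these codes: retry the transient ones (409, 423, 500) with a short backoff and fail fast on the rest. The retry count and sleep times are arbitrary choices for illustration, not ScrapingAnt recommendations.
import time
import requests
RETRYABLE_CODES = {409, 423, 500}
def fetch_with_retries(params, max_retries=3):
    # Call the general endpoint, retrying transient failures with exponential backoff.
    for attempt in range(max_retries):
        response = requests.get("https://api.scrapingant.com/v2/general", params=params)
        if response.status_code == 200:
            return response
        if response.status_code in RETRYABLE_CODES:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s...
            continue
        response.raise_for_status()  # 400/403/404/405/422: fix the request rather than retry
    raise Exception(f"Request still failing after {max_retries} attempts")
response = fetch_with_retries({
    "url": "https://example.com",
    "x-api-key": "your-super-secret-api-key",
    "browser": False
})
print(response.status_code)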
Setting Up ScrapingAnt API
Before we get our ScrapingAnt API key, we need to create an account. If you haven't already, you can do that here.
You can use any of the following methods to create your new ScrapingAnt account.
- Create an account with Google
- Create an account with Github
- Create an account with an email address and password
Once you have an account, you can go to their dashboard and gain access to everything you need from ScrapingAnt. The dashboard includes all of your account management along with a request generator and links to their documentation.
Here is the request generator.
On the dashboard screenshot, I exposed my API key. This may seem like a big deal, but it's really not. If you navigate to the profile tab, you'll see a button called `GENERATE NEW API TOKEN`. I can click this button (like in the screenshot below) and I'll receive a new key that you don't have access to.
Once you've got an API key, you're all set to start using the ScrapingAnt API.
API Endpoint Integration
With the ScrapingAnt API, we're really only using two endpoints: one for standard HTML and one for a JSON response. We use the `general` endpoint for a standard request and the `extended` endpoint for a JSON response. This gives developers some flexibility and lets them choose the format they prefer.
While we posted them above in separate examples, you can view them both below for convenience.
HTML Response
import requests
url = "https://api.scrapingant.com/v2/general"
params = {
'url': 'https://example.com',
'x-api-key': 'your-super-secret-api-key'
}
response = requests.get(url, params=params)
print(response.text)
JSON Response
import requests
url = "https://api.scrapingant.com/v2/extended"
params = {
'url': 'https://example.com',
'x-api-key': 'your-super-secret-api-key'
}
response = requests.get(url, params=params)
print(response.text)
As you can see in the examples above, we use these endpoints to control our response type.
Proxy Port Integration
When we use a proxy port, our browser or HTTP client passes all requests through a specific location by default. For standard HTTP, we use port `8080`. When using HTTPS, we use port `443`. You can view the full URL structure below.
'http': 'http://scrapingant:your-super-secret-api-key@proxy.scrapingant.com:8080'
'https': 'https://scrapingant:your-super-secret-api-key@proxy.scrapingant.com:443'
Below is an example of how to do this using Python Requests.
# pip install requests
import requests
API_KEY = "your-super-secret-api-key"
url = "https://quotes.toscrape.com"
proxy_url = f"scrapingant:{API_KEY}@proxy.scrapingant.com"  # username "scrapingant", password = API key, matching the URL structure above
proxies = {
"http": f"http://{proxy_url}:8080",
"https": f"https://{proxy_url}:443"
}
response = requests.get(url, proxies=proxies, verify=False)
print(response.text)
Proxy ports are best when you just want to set it and forget it. If you don't need to make special requests through your proxy or customize it at all, they can be a very convenient option. This sort of thing is best for newbies and people who don't want to think about their proxy logic.
SDK Integration
ScrapingAnt has an SDK (Software Development Kit) available for anyone who wants to use it. SDKs are far easier for beginners and people who aren't familiar with web development, since the SDK handles the low-level requests for you.
You can install it via `pip`.
pip install scrapingant-client
Here's an example of it in action.
from scrapingant_client import ScrapingAntClient
client = ScrapingAntClient(token='<YOUR-SCRAPINGANT-API-TOKEN>')
# Scrape the example.com site.
result = client.general_request('https://example.com')
print(result.content)
As you can see above, this approach has a much lower barrier to entry.
Managing Concurrency
Managing concurrency is pretty straightforward if you're familiar with `ThreadPoolExecutor`. `ThreadPoolExecutor` allows us to open a new thread pool with `x` number of threads. On each open thread, we can run a function of our choosing.
import requests
from bs4 import BeautifulSoup
import concurrent.futures
from urllib.parse import urlencode
API_KEY = 'YOUR_API_KEY'
NUM_THREADS = 5
def get_proxy_url(url):
payload = {"x-api-key": API_KEY, "url": url}
proxy_url = 'https://api.scrapingant.com/v2/general?' + urlencode(payload)
return proxy_url
## Example list of urls to scrape
list_of_urls = [
"https://quotes.toscrape.com/page/1/",
"https://quotes.toscrape.com/page/2/",
"http://quotes.toscrape.com/page/3/",
]
output_data_list = []
def scrape_page(url):
try:
response = requests.get(get_proxy_url(url))
if response.status_code == 200:
soup = BeautifulSoup(response.text, "html.parser")
title = soup.find("h1").text
## add scraped data to "output_data_list" list
output_data_list.append({
'title': title,
})
except Exception as e:
print('Error', e)
with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
executor.map(scrape_page, list_of_urls)
print(output_data_list)
`executor.map()` holds all the keys here:
- Our first argument is `scrape_page`: the function we want to call on each thread.
- Our second is `list_of_urls`: the list of arguments we want to pass into `scrape_page`.
Any other arguments to the function also get passed in as arrays, as shown in the sketch below.
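For example, if `scrape_page` also accepted a location and a retry count, you would pass one list per extra parameter, each the same length as `list_of_urls`. The three-argument `scrape_page` below is a hypothetical variant, included only to illustrate the call shape.
import concurrent.futures
list_of_urls = [
    "https://quotes.toscrape.com/page/1/",
    "https://quotes.toscrape.com/page/2/",
    "https://quotes.toscrape.com/page/3/",
]
def scrape_page(url, location, retries):
    # Placeholder body -- your real scraping logic goes here.
    print(url, location, retries)
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # One iterable per parameter: urls, then locations, then retry counts.
    executor.map(
        scrape_page,
        list_of_urls,
        ["us"] * len(list_of_urls),
        [3] * len(list_of_urls),
    )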
Advanced Functionality
We briefly touched on ScrapingAnt's advanced functionality earlier in this piece. Now, we'll look at it in more detail. Take a look at the table below for a breakdown of it all.
Parameter | API Credits | Description |
---|---|---|
browser | 10 | use a headless browser, true by default |
cookies | 1 | pass cookies with the request for authentication |
custom_headers | 1 | send custom headers to the server |
proxy_type | 1 or 25 (residential) | choose your IP address type |
proxy_country | 1 | set a custom geolocation |
js_snippet | 10 | execute JavaScript snippet, requires browser |
wait_for_selector | 10 | waits for a CSS selector, requires browser |
You can view their full API documentation here.
JavaScript Rendering
Many modern websites rely heavily on JavaScript to dynamically load content, manipulate the DOM, and make API calls.
JavaScript rendering functionality refers to the capability of web scraping tools or browsers to fully load and execute JavaScript on a web page.
JavaScript rendering is essential for scraping dynamic websites that load content client-side, allowing for accurate data extraction and better handling of interactive features.
ScrapingAnt renders JavaScript by default using a headless browser. To turn off the headless browser, simply include `"browser": False` inside your payload. If you're looking to save API credits, this is a really important parameter to remember: requests without the browser cost 1 API credit, while requests with the browser cost 10.
The following code renders a page using the browser.
# pip install requests
import requests
url = "https://quotes.toscrape.com"
api_key = "your-super-secret-api-key"
params = {
"url": url,
"x-api-key": api_key,
}
response = requests.get('https://api.scrapingant.com/v2/general', params=params)
print(response.text)
To turn this off, we would use the snippet below instead.
# pip install requests
import requests
url = "https://quotes.toscrape.com"
api_key = "your-super-secret-api-key"
params = {
"url": url,
"x-api-key": api_key,
"browser": False
}
response = requests.get('https://api.scrapingant.com/v2/general', params=params)
print(response.text)
You can view the documentation for this here.
Controlling The Browser
To control the browser, we have ScrapingAnt execute JavaScript for us via the `js_snippet` parameter. We first write our JavaScript as a string, encode it to bytes using `utf-8`, Base64-encode those bytes, and then decode the result back to a `utf-8` string so it can be sent to the ScrapingAnt API.
# pip install requests
import requests
import base64
url = "https://api.scrapingant.com/v2/general"
js_action = "document.getElementById('myButton').click();"
encoded_js = base64.b64encode(js_action.encode("utf-8")).decode("utf-8")
params = {
"url": "https://example.com",
"x-api-key": "your-super-secret-api-key",
"js_snippet": encoded_js,
}
response = requests.get(url, params=params)
print(response.text)
The browser control docs are available here.
Country Geotargeting
Country geotargeting functionality allows web scraping tools or proxies to simulate requests from specific geographic locations or countries.
By using IP addresses tied to certain regions, this feature enables users to access location-specific content, services, and pricing as if they were physically present in that country.
Country geotargeting allows users to access and interact with region-specific content, monitor pricing differences, verify ads, and test localized services, making it crucial for global business operations, competitive analysis, and compliance.
Geolocation is really easy to control, and it costs us nothing extra to set a custom country! If you turn your browser off and set a custom location, you're still only paying 1 API credit for each request.
# pip install requests
import requests
url = "https://quotes.toscrape.com"
api_key = "your-super-secret-api-key"
params = {
"url": url,
"x-api-key": api_key,
"browser": False,
"proxy_country": "US"
}
response = requests.get('https://api.scrapingant.com/v2/general', params=params)
print(response.text)
On top of that, their country list is huge compared to other providers.
Country | Country Code |
---|---|
Brazil | "BR" |
Canada | "CA" |
China | "CN" |
Czech Republic | "CZ" |
France | "FR" |
Germany | "DE" |
Hong Kong | "HK" |
India | "IN" |
Indonesia | "ID" |
Italy | "IT" |
Israel | "IL" |
Japan | "JP" |
Netherlands | "NL" |
Poland | "PL" |
Russia | "RU" |
Saudi Arabia | "SA" |
Singapore | "SG" |
South Korea | "KR" |
Spain | "ES" |
United Kingdom | "GB" |
United Arab Emirates | "AE" |
United States | "US" |
Vietnam | "VN" |
You can view the full documentation for this here.
Residential Proxies
Unlike data center proxies, which originate from cloud servers or hosting providers, residential proxies appear more legitimate to websites because they come from real user devices.
Residential proxies are ideal for avoiding detection, bypassing geo-restrictions, accessing localized content, and improving the success rate of web scraping or automated tasks. Their ability to mimic genuine users makes them essential for tasks requiring high reliability and low chances of being blocked.
Residential requests use 25 API credits, as opposed to the 1 credit used by a standard datacenter IP address. We can switch to a residential proxy using the `proxy_type` parameter. This is set to `datacenter` by default, but we can simply change it to `residential`.
Here's a code example of how to use them.
# pip install requests
import requests
url = "https://quotes.toscrape.com"
api_key = "your-super-secret-api-key"
params = {
"url": url,
"x-api-key": api_key,
"browser": False,
"proxy_type": "residential"
}
response = requests.get('https://api.scrapingant.com/v2/general', params=params)
print(response.text)
You can view their full Residential Proxy Port integration guide here.
Custom Headers
Custom header functionality allows users to manually set and modify the HTTP request headers sent with web scraping or API requests. Typically, proxy APIs automatically manage these headers for optimal performance, but many proxy APIs also provide the option to send custom headers when needed.
Why Use Custom Headers?
-
Access Specific Data: Some websites or APIs require certain headers to provide access to specific data. For example, they may require an Authorization header or a special token to authenticate the request.
-
POST Requests: When sending POST requests, specific headers like Content-Type or Accept might be necessary to ensure that the target server processes the request correctly.
-
Bypass Anti-Bot Systems: Custom headers can help mimic real user behavior, making it easier to bypass certain anti-bot systems. Modifying headers like User-Agent, Referer, or Accept-Language can make your requests look like they’re coming from a genuine browser session.
Word of Caution
-
Impact on Performance: If used incorrectly, custom headers can reduce proxy performance. Sending the same static headers repeatedly may give away the fact that the requests are automated, increasing the likelihood of detection by anti-bot systems.
-
Need for Header Rotation: For large-scale web scraping, you need a system to continuously generate clean, dynamic headers to avoid being blocked. Static headers make your scraper more detectable and vulnerable to being flagged.
-
Only When Necessary: Custom headers should only be used if required. Letting proxy APIs handle headers is often more efficient since they are optimized to avoid detection and ensure higher success rates.
Adding custom headers is very simple. We just add the prefix `Ant-` to our header name. ScrapingAnt then picks these headers up and passes them on to the target server.
import requests
url = "https://api.scrapingant.com/v2/general"
params = {
"url": "https://httpbin.org/headers",
"x-api-key": "<YOUR_SCRAPINGANT_API_KEY>"
}
headers = {
"Ant-Custom-Header": "I <3 ScrapingAnt"
}
response = requests.get(url, params=params, headers=headers)
print(response.text)
Take a look at their docs here.
Static Proxies
Static proxy functionality, also known as sticky sessions, allows users to maintain the same IP address for an extended period when sending multiple requests.
Instead of switching IPs with each request (as rotating proxies do), static proxies ensure that the IP remains consistent for the duration of the session, making it appear as though all requests are coming from the same user.
ScrapingAnt does not give us the ability to run a static proxy. Static proxies are often used for session management (staying logged in over a period of time).
However, ScrapingAnt does give us the ability to pass cookies along to the site we're scraping. With most sites, once you login, your browser receives a cookie and this cookie is used to tell the website who you are and that you're logged in.
To pass cookies with ScrapingAnt, we can simply use the `cookies` parameter.
import requests
url = "https://api.scrapingant.com/v2/general"
params = {
"url": "https://example.com",
"x-api-key": "your-super-secret-api-key",
"cookies": "cookie_1=cookie_value_1",
"browser": "false"
}
response = requests.get(url, params=params)
print(response.text)
Screenshot Functionality
Screenshot functionality allows web scraping tools or automation software to capture a visual snapshot of a web page as it appears during the scraping process.
When you scrape the web, screenshots can be an irreplaceable debugging tool. However, ScrapingAnt sadly doesn't support screenshots. Several other providers do.
Auto Parsing
Auto Parsing is an excellent feature for a scraping API. With Auto Parsing, we can actually tell ScrapingAnt to try and parse the site for us! With this functionality, we only need to focus on our jobs as developers; we don't need to pick through all the nasty HTML. That said, it's good to exercise caution: ScrapingAnt uses AI to attempt the parse, and AI is sometimes prone to errors.
On top of that, we're not given an upfront cost model for the AI parser. ScrapingAnt executes the request and then charges our account based on the parse, so we don't know the cost until after the parse has been executed.
The following snippet tells ScrapingAnt that we want it to parse the page using AI.
import requests
url = "https://api.scrapingant.com/v2/extract"
params = {
"url": "https://example.com",
"x-api-key": "your-super-secret-api-key",
"browser": "false",
"extract_properties": "title, content"
}
response = requests.get(url, params=params)
print(response.text)
Unlike APIs that ship pre-built parsers for specific sites, ScrapingAnt uses an AI parser. There is no list of supported sites because (theoretically) they're all supported. However, AI is prone to errors, so don't expect perfection.
Case Study: Using Scraper APIs on IMDb Top 250 Movies
Time to scrape the top 250 movies from IMDB. We'll be using two virtually identical scrapers; the only difference will be the proxy function. Aside from the base domain name, the only difference between the proxy functions is the API key parameter: with ScrapeOps, we use `api_key`, while with ScrapingAnt, we use `x-api-key`.
Take a look at the snippets below and you'll notice the subtle difference between the proxy functions.
Here is the proxy function for ScrapeOps:
def get_scrapeops_url(url):
payload = {
"api_key": API_KEY,
"url": url,
}
proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
return proxy_url
Here is the same function for ScrapingAnt.
def get_proxy_url(url):
payload = {
"x-api-key": API_KEY,
"url": url,
"browser": False
}
proxy_url = 'https://api.scrapingant.com/v2/general?' + urlencode(payload)
return proxy_url
The full ScrapeOps code is available for you below.
import os
import requests
from bs4 import BeautifulSoup
import json
import csv
import logging
from urllib.parse import urlencode
import concurrent.futures
## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
API_KEY = ""
with open("config.json", "r") as config_file:
config = json.load(config_file)
API_KEY = config["scrapeops_api_key"]
def get_scrapeops_url(url):
payload = {
"api_key": API_KEY,
"url": url,
}
proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
return proxy_url
def scrape_movies(url, location="us", retries=3):
success = False
tries = 0
while not success and tries <= retries:
response = requests.get(get_scrapeops_url(url))
try:
if response.status_code != 200:
raise Exception(f"Failed response from server, status code: {response.status_code}")
soup = BeautifulSoup(response.text, "html.parser")
json_tag = soup.select_one("script[type='application/ld+json']")
json_data = json.loads(json_tag.text)["itemListElement"]
movie_list_length = 0
movie_list = []
for item in json_data:
movie_list.append(item["item"])
movie_list_length+=len(json_data)
print(f"Movie list length: {len(json_data)}")
with open("scrapeops-top-250.json", "w") as file:
json.dump(movie_list, file, indent=4)
success = True
except Exception as e:
logger.error(f"Failed to process page: {e}, retries left: {retries-tries}")
tries+=1
if not success:
raise Exception(f"Failed to scrape page, MAX RETRIES {retries} EXCEEDED!!!")
if __name__ == "__main__":
MAX_RETRIES = 3
logger.info("Starting IMDB scrape")
url = "https://www.imdb.com/chart/top/"
scrape_movies(url, retries=MAX_RETRIES)
logger.info("Scrape complete")
Take a look at the ScrapeOps results. The run took 4.335 seconds.
Here is our ScrapingAnt code as well.
import os
import requests
from bs4 import BeautifulSoup
import json
import csv
import logging
from urllib.parse import urlencode
import concurrent.futures
## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
API_KEY = ""
with open("config.json", "r") as config_file:
config = json.load(config_file)
API_KEY = config["scrapingant_api_key"]
def get_proxy_url(url):
payload = {
"x-api-key": API_KEY,
"url": url,
"browser": False
}
proxy_url = 'https://api.scrapingant.com/v2/general?' + urlencode(payload)
return proxy_url
def scrape_movies(url, location="us", retries=3):
success = False
tries = 0
while not success and tries <= retries:
response = requests.get(get_proxy_url(url))
try:
if response.status_code != 200:
raise Exception(f"Failed response from server, status code: {response.status_code}")
soup = BeautifulSoup(response.text, "html.parser")
json_tag = soup.select_one("script[type='application/ld+json']")
json_data = json.loads(json_tag.text)["itemListElement"]
movie_list_length = 0
movie_list = []
for item in json_data:
movie_list.append(item["item"])
movie_list_length+=len(json_data)
print(f"Movie list length: {len(json_data)}")
with open("scrapingant-top-250.json", "w") as file:
json.dump(movie_list, file, indent=4)
success = True
except Exception as e:
logger.error(f"Failed to process page: {e}, retries left: {retries-tries}")
tries+=1
if not success:
raise Exception(f"Failed to scrape page, MAX RETRIES {retries} EXCEEDED!!!")
if __name__ == "__main__":
MAX_RETRIES = 3
logger.info("Starting IMDB scrape")
url = "https://www.imdb.com/chart/top/"
scrape_movies(url, retries=MAX_RETRIES)
logger.info("Scrape complete")
Below is the output from ScrapingAnt. Our run took 5.304 seconds.
ScrapeOps was slightly faster than ScrapingAnt: 5.304 - 4.335 = 0.969, a difference of approximately one second. Depending on your location, hardware, and internet connection, you might receive different results.
Alternative: ScrapeOps Proxy API Aggregator
ScrapeOps Proxy API Aggregator is a service that combines the power of multiple top-tier proxy providers into a single solution, offering a variety of benefits for web scraping and automation tasks.
Here’s why you might want to use it:
-
Access to Multiple Proxy Providers: ScrapeOps integrates with over 20 leading residential and data center proxy providers, including popular names like Smartproxy, Bright Data, and Oxylabs. This means you don’t need to juggle multiple accounts or services; you can manage all your proxy needs through one platform.
-
Automatic Proxy Switching: The aggregator automatically switches between proxy providers based on performance, ensuring that you’re always using the best proxy for your task. This results in a 98% success rate, as it continuously optimizes the proxies used, reducing the chances of being blocked or flagged.
-
Bypass Anti-Bot Measures: With ScrapeOps, you can rotate through multiple proxies and user agents, making it easier to avoid detection by anti-bot systems. This is crucial for large-scale web scraping projects where sites are heavily guarded against automated requests.
-
Cost Optimization: ScrapeOps monitors proxy provider performance and pricing, helping you choose the most cost-effective option for your specific task. This ensures that you get the best balance of price and performance, which is especially useful for businesses working with large volumes of data.
-
Competitive Pricing Plans: The platform offers flexible pricing with bandwidth-based plans, starting with 500 MB of free bandwidth credits. Paid plans start at $9 per month, making it accessible for both small and large scraping projects. This flexibility allows you to scale your proxy usage as needed.
- Streamlined Management: Instead of managing multiple proxy providers, credentials, and payments, ScrapeOps centralizes everything, making it easier to maintain control over your proxy usage. It also offers reporting and analytics, so you can track proxy performance and optimize your scraping strategy.
ScrapeOps offers a larger variety of plans and costs much less to get started on a premium plan ($9 per month). On top of that, ScrapeOps doesn't just use ScrapingAnt as a provider; we have over 20 providers and we're adding new ones each week. This gives ScrapeOps far better reliability than other centralized solutions. If one provider fails, we simply route you through another.
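If you want to try the aggregator with the same request style used throughout this article, the call looks almost identical to the ScrapingAnt examples; only the endpoint and the key parameter change. The optional `country` and `residential` parameters below are assumptions about the ScrapeOps options, so verify the exact names against the ScrapeOps documentation before relying on them.
# pip install requests
import requests
from urllib.parse import urlencode
API_KEY = "your-scrapeops-api-key"
payload = {
    "api_key": API_KEY,
    "url": "https://quotes.toscrape.com",
    "country": "us",        # optional geotargeting (assumed parameter name)
    "residential": "true",  # optional residential pool (assumed parameter name)
}
response = requests.get("https://proxy.scrapeops.io/v1/?" + urlencode(payload))
print(response.status_code)
print(response.text[:500])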
Troubleshooting
Issue #1: Request Timeouts
We can set a `timeout` argument with Python Requests. Sometimes we run into issues where our requests time out. To fix this, just set a custom `timeout`.
import requests
# 5 second timeout
response = requests.get("https://httpbin.org/get", timeout=5)
Issue #2: Handling CAPTCHAs
If your proxy service is serving you CAPTCHAs, something is wrong. Both ScrapeOps and ScrapingAnt are built to avoid CAPTCHAs by default, but sometimes proxy providers can fail. If you run into a CAPTCHA, first try submitting the request again; this will often take care of it (the proxy provider will usually give you a new IP address). If that fails, try using a residential proxy (both ScrapeOps and ScrapingAnt offer these).
If the solutions outlined above fail (they shouldn't), you can always use 2captcha. We have an excellent article on bypassing CAPTCHAs here.
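One way to wire that advice into code is to retry a couple of times and switch `proxy_type` to `residential` on the final attempt. This is just a sketch built from the ScrapingAnt parameters covered earlier, and the CAPTCHA check is a naive substring test you would swap out for a detector suited to your target site.
# pip install requests
import requests
API_URL = "https://api.scrapingant.com/v2/general"
API_KEY = "your-super-secret-api-key"
def fetch_avoiding_captcha(url, max_retries=3):
    for attempt in range(max_retries):
        params = {"url": url, "x-api-key": API_KEY, "browser": False}
        if attempt == max_retries - 1:
            # Last resort: residential IPs (25 credits per request).
            params["proxy_type"] = "residential"
        response = requests.get(API_URL, params=params)
        # Naive check -- replace with something specific to the site you're scraping.
        if response.status_code == 200 and "captcha" not in response.text.lower():
            return response
    raise Exception(f"Could not get a clean response for {url}")
print(fetch_avoiding_captcha("https://quotes.toscrape.com").status_code)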
Issue #3: Headless Browser Integrations
When using headless browsers like Puppeteer or Playwright with Proxy APIs, there are often integration challenges. A headless browser operates without a graphical user interface (GUI) and is typically used for automation, web scraping, or testing tasks. However, these tools can run into issues when interacting with proxy APIs, leading to inefficient requests or failures.
Headless browsers typically aren't well suited to request-based Proxy APIs like ScrapingAnt because:
- There can be compatibility issues and unforeseen bugs when the browser makes background network requests, as headers and cookies don't get maintained across those requests.
- Proxy APIs charge per successful request, and to scrape one page a headless browser can make 10-100+ requests.
- If you want to use a headless browser, you need to use the proxy port integration method instead, as sketched below.
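If you do go that route, the credentials from the Proxy Port Integration section plug straight into a headless browser's proxy settings. Below is a minimal Playwright sketch assuming the `scrapingant` username / API-key password scheme shown earlier; treat it as a starting point rather than an official integration.
# pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright
API_KEY = "your-super-secret-api-key"
with sync_playwright() as p:
    browser = p.chromium.launch(
        proxy={
            "server": "http://proxy.scrapingant.com:8080",
            "username": "scrapingant",  # credential scheme from the proxy port section (assumed)
            "password": API_KEY,
        }
    )
    page = browser.new_page()
    page.goto("https://quotes.toscrape.com")
    print(page.title())
    browser.close()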
Issue #4: Invalid Response Data
Anytime you're dealing with web scraping, you'll sometimes run into invalid responses. To handle an invalid response, you need to understand the error code and what it means. ScrapingAnt error codes are available for review here.
The ScrapeOps error codes are available here.
In most cases, you need to double-check your parameters or make sure your bill is paid. Every once in a while, you may receive a different error code that you can look up in the links above.
The Legal & Ethical Implications of Web Scraping
When we scrape public data, we're typically in the clear legally. Public data is any data that is not gated behind a login. Private data is a completely different story: when dealing with it, you're subject to a whole slew of privacy laws and intellectual property regulations. The data we scraped in this article was public.
You should also take into account the Terms and Conditions and the `robots.txt` of the site you're scraping. You can view these documents from IMDB below.
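Python's standard library can check a `robots.txt` file for you before you start scraping. Here is a small example using `urllib.robotparser` against IMDB's robots file; note that it only tells you what the file allows, not whether the site's Terms of Service permit scraping.
from urllib import robotparser
rp = robotparser.RobotFileParser()
rp.set_url("https://www.imdb.com/robots.txt")
rp.read()
# True if the rules allow a generic crawler to fetch the Top 250 chart page.
print(rp.can_fetch("*", "https://www.imdb.com/chart/top/"))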
Consequences of Misuse
Violating either the terms of service or privacy policies of a website can lead to several consequences:
-
Account Suspension or IP Blocking: Scraping websites without regard for their policies often leads to being blocked from accessing the site. For authenticated platforms, this may result in account suspension, making further interactions with the site impossible from that account.
-
Legal Penalties: Violating a website's ToS or scraping data unlawfully can lead to legal action. Laws such as the Computer Fraud and Abuse Act (CFAA) in the U.S. have been used to pursue lawsuits against unauthorized scraping, especially if it's done at scale or causes harm (such as server overload). Companies can face lawsuits for unauthorized use of proprietary data or violating intellectual property rights.
-
Data Breaches and Privacy Violations: If scraping is used to collect personal or sensitive data without consent, it can lead to severe privacy violations. This can expose businesses to penalties under regulations like GDPR, which can impose heavy fines for non-compliance, and reputational damage.
-
Server Overload: Excessive scraping can strain a website’s servers, especially if done without rate-limiting or throttling. This can cause performance issues for the website, leading to possible financial or legal claims against the scraper for damages caused by server downtime.
Ethical Considerations
-
Fair Use: Even if scraping is legal, it's important to consider the ethical use of the data. For instance, scraping content to directly copy and republish it for profit without adding value is generally unethical and may infringe on copyright laws. Ethical scraping should aim to provide new insights, analysis, or utility from the data.
-
User Consent: Scraping platforms that collect user-generated content (like social media) should consider user privacy and consent. Even if the content is publicly available, using it in ways that violate privacy expectations can lead to ethical concerns and backlash.
-
Transparency: Scrapers should be transparent about their intentions, especially if the scraping is for commercial purposes. Providing appropriate attributions or using data responsibly demonstrates ethical integrity.
Conclusion
Both ScrapeOps and ScrapingAnt give us convenient and reliable ways to scrape the web. ScrapeOps has a bit more functionality, but ScrapingAnt provides a great experience as well, and both proxies are similar in terms of speed and efficiency. ScrapingAnt's headless-by-default behavior might annoy some users: by default, you're paying 10x the normal API credits for each request, but you can turn this off by setting `"browser": False`.
Both of these solutions will help you get the data you need.
More Web Scraping Guides
Whether you're brand new to scraping or a hardened developer, we have something for you. We wrote the playbook on web scraping. Bookmark one of the articles below and level up your scraping toolbox!