

Scrapfly: Web Scraping Integration Guide

Scrapfly is designed to be a one-stop shop for scraping. Scrapfly is also one of the many providers behind the ScrapeOps Proxy Aggregator. It offers many of the same features we do, such as automated proxy management, JavaScript rendering, geotargeting, browser controls, screenshots, and auto parsing.

Today, we're going to walk through their process from start to finish, from the initial signup all the way to a real-world case study, so we can get a solid grasp of how to use the Scrapfly API and how it compares to the ScrapeOps Proxy Aggregator.


TLDR: Web Scraping With Scrapfly?

Getting started with Scrapfly is pretty easy once you've set up your account and you've got an API key.

  • Create a new config.json file with your API key.
  • Then, write a script that reads your key and start scraping!

The example below is minimal, but it contains everything you need to use your API key and make requests to Scrapfly.

import requests
import json
from urllib.parse import urlencode

API_KEY = ""

with open("config.json") as file:
    config = json.load(file)
    API_KEY = config["scrapfly_api_key"]

def get_scrapfly_url(url):
    payload = {
        "key": API_KEY,
        "url": url,
    }
    proxy_url = "https://api.scrapfly.io/scrape?" + urlencode(payload)
    return proxy_url

url = "https://quotes.toscrape.com"
response = requests.get(get_scrapfly_url(url))
data = response.json()
print(json.dumps(data, indent=4))

With the snippet above, you can get started very quickly with Scrapfly. If you need customization such as JavaScript rendering or geotargeting, simply add the corresponding parameters to the payload inside the function.

You can view additional parameters in the API documentation here.


What Is Scrapfly?

Like ScrapeOps, Scrapfly is built as a one-stop shop for all of your scraping needs. They manage proxy pools so you don't have to manage individual proxies.

Scrapfly routes you to the best available proxy for your scrape. Also, like ScrapeOps, it gives you numerous options to customize your proxy connection to make your scraping job easier.

Scrapfly Homepage

The reasons to use Scrapfly are very similar to the reasons to use the ScrapeOps Proxy Aggregator: JavaScript rendering, browser actions, geotargeting, and much more. On top of all that, both services are built for reliability.


How Does Scrapfly Work?

Scrapfly maintains a large pool of datacenter and residential proxies all over the world. By default, it first attempts your request through a datacenter proxy; if that request fails, it retries with a better (often residential) proxy. Once the Scrapfly server receives a successful response, it sends the page you wanted to scrape back to you.

When you scrape a site with a service like this, here is the basic process.

  1. You make a request to Scrapfly using your API key, target url, and any other custom parameters you wish to pass.
  2. Scrapfly receives your request and attempts to retrieve your target url.
  3. If Scrapfly receives a failed response, they retry with a better proxy until they either time out, reach a retry limit, or get a successful response.
  4. After they receive the requested content, they send it back to your scraper.

Response Format

All Scrapfly responses come back as JSON by default. Think back to our code example in the TLDR section. You can view it again below.

We retrieve our data with response.json(). If you run it yourself, you'll receive all sorts of useful information in your response such as config, context, result, content, request_headers, and response_headers.

import requests
import json
from urllib.parse import urlencode

API_KEY = ""

with open("config.json") as file:
    config = json.load(file)
    API_KEY = config["scrapfly_api_key"]

def get_scrapfly_url(url):
    payload = {
        "key": API_KEY,
        "url": url,
    }
    proxy_url = "https://api.scrapfly.io/scrape?" + urlencode(payload)
    return proxy_url

url = "https://quotes.toscrape.com"
response = requests.get(get_scrapfly_url(url))
data = response.json()
print(json.dumps(data, indent=4))
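
As a minimal sketch (reusing the same config.json setup), here is how you might drill into that JSON envelope instead of printing the whole thing. The scraped HTML lives under result.content, which the later examples in this guide rely on as well.

import requests
import json
from urllib.parse import urlencode

with open("config.json") as file:
    API_KEY = json.load(file)["scrapfly_api_key"]

payload = {
    "key": API_KEY,
    "url": "https://quotes.toscrape.com",
}
response = requests.get("https://api.scrapfly.io/scrape?" + urlencode(payload))
data = response.json()

# Top-level keys of the JSON envelope (config, context, result, etc.)
print(list(data.keys()))

# The scraped HTML itself sits inside the "result" object.
html = data["result"]["content"]
print(html[:500])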

Scrapfly Pricing

Scrapfly offers 4 separate pricing plans. Their most affordable plan comes in at $30 per month and their highest tier plan runs $500 per month.

Scrapfly Pricing

If none of these plans meets your needs, they also give you the option to set up a custom plan. The table below gives a solid breakdown of Scrapfly's available plans.

| Plan | API Credits | Cost Per Normal Request | Monthly Price |
| --- | --- | --- | --- |
| Discovery | 200,000 | $0.00015 | $30 |
| Pro | 1,000,000 | $0.0001 | $100 |
| Startup | 2,500,000 | $0.0001 | $250 |
| Enterprise | 5,500,000 | $0.00009 | $500 |

Like the ScrapeOps Proxy Aggregator, Scrapfly typically only charges you for successful requests.

Response Status Codes

Status codes are very important. If you're receiving anything other than a 200, something is wrong. To properly troubleshoot these status codes, we need a place to reference them.

The table below outlines the status codes you'll run into when using Scrapfly.

| Status Code | Type | Description |
| --- | --- | --- |
| 200 | Success | Everything is working! |
| 400 | Bad Request | Double-check your parameters. |
| 404 | Not Found | Double-check your url; the site wasn't found. |
| 422 | Unprocessable Entity | Unable to process the request. |
| 429 | Too Many Requests | Slow down your requests. |
| 500 | Internal Server Error | Scrapfly is having an internal issue. |
| 502 | Service Error | Scrapfly's host is having an internal error. |
| 503 | Temporarily Unavailable | Scrapfly is undergoing maintenance. |
| 504 | Not Reachable | Scrapfly is not reachable or timed out. |
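
As a rough sketch (not an official recommendation), you can use these codes to decide whether a retry is worthwhile: 429 and the 5xx codes are usually transient, while 400, 404, and 422 point to a problem with the request itself.

import requests
import time

# Status codes from the table above that are usually worth retrying.
RETRYABLE_CODES = {429, 500, 502, 503, 504}

def fetch_with_retries(proxy_url, retries=3):
    for attempt in range(retries):
        response = requests.get(proxy_url)
        if response.status_code == 200:
            return response
        if response.status_code in RETRYABLE_CODES:
            # Transient error: back off briefly, then try again.
            time.sleep(2 ** attempt)
            continue
        # 400, 404, 422: retrying won't help, so surface the error immediately.
        response.raise_for_status()
    raise Exception(f"Request failed after {retries} attempts")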

Setting Up Scrapfly

Signing up for Scrapfly is a bit tedious. Unlike many other providers, they have a somewhat in-depth KYC (know your customer) process. They collect information about your employment, your reasons for using their service, the sites you want to scrape, and any other proxy products you've used in the past. They also collect your email and phone number.

To get started:

  • You need to fill in some personal information and complete a CAPTCHA.
  • You'll need your basic contact information (phone number and email address).
  • You will also need to disclose your reasons for using the site, the sites you wish to scrape, and any other proxy providers you've used or tested in the past.

Signup Scrapfly

After a somewhat intrusive (but arguably justified) signup process, you'll receive a confirmation email. Once you've confirmed your email, you can access the dashboard and you're ready to go with 1,000 API credits.


With Scrapfly, we can use either their REST API or their SDK. As mentioned here, Scrapfly does not support HTTP proxy port integration. Our two ways of access are as follows:

  • REST API: We use the REST API when we're comfortable with an HTTP library (in our case, Python Requests) and we'd like to build our requests ourselves.
  • SDK: Using their SDK is a great way to get started... especially for beginners. The SDK allows us full connection to the REST API, but much of the underlying HTTP has been abstracted away so we don't need to think about it as much.

If you click the API Player tab, you'll be taken to their Request builder. When dealing with any new scraping API, builders like this are an incredibly useful tool.

These builders allow us to create custom API requests using a variety of different frameworks and HTTP clients.

API Endpoint Integration

We've already performed API endpoint integration in our previous code example. With endpoint integration, we send all of our requests, along with our parameters, to a specific API endpoint. The API then reads those parameters and executes our request accordingly.

Let's look at this basic request one more time.

import requests
import json
from urllib.parse import urlencode

API_KEY = ""

with open("config.json") as file:
    config = json.load(file)
    API_KEY = config["scrapfly_api_key"]

def get_scrapfly_url(url):
    payload = {
        "key": API_KEY,
        "url": url,
    }
    proxy_url = "https://api.scrapfly.io/scrape?" + urlencode(payload)
    return proxy_url

url = "https://quotes.toscrape.com"
response = requests.get(get_scrapfly_url(url))
data = response.json()
print(json.dumps(data, indent=4))

Take a look at our proxy URL without the payload: "https://api.scrapfly.io/scrape?". All of our requests go to the /scrape endpoint.

Whenever we make a request to the API (no matter what our parameters are), it gets sent to this specific endpoint. All of our parameters get url-encoded and appended to "https://api.scrapfly.io/scrape?".

Their full API documentation is available here.

SDK Integration

Scrapfly also gives us the option to use their SDK. The SDK abstracts away much of the lower level HTTP code that we deal with when using Endpoint Integration.

To install the Python SDK, run the following command.

pip install 'scrapfly-sdk'

You can then test your proxy connection with the following code. Make sure to replace the API key with your own.

from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key='YOUR-SUPER-SECRET-API-KEY')

api_response:ScrapeApiResponse = scrapfly.scrape(scrape_config=ScrapeConfig(url='https://httpbin.dev/anything'))

# Automatic retry errors marked "retryable" and wait delay recommended before retrying
api_response:ScrapeApiResponse = scrapfly.resilient_scrape(scrape_config=ScrapeConfig(url='https://httpbin.dev/anything'))

# Automatic retry error based on status code
api_response:ScrapeApiResponse = scrapfly.resilient_scrape(scrape_config=ScrapeConfig(url='https://httpbin.dev/status/500'), retry_on_status_code=[500])

# scrape result, content, iframes, response headers, response cookies states, screenshots, ssl, dns etc
print(api_response.scrape_result)

# html content
print(api_response.scrape_result['content'])

# Context of scrape, session, webhook, asp, cache, debug
print(api_response.context)

# raw api result
print(api_response.content)

# True if the scrape respond with >= 200 < 300 http status
print(api_response.success)

# Api status code /!\ Not the api status code of the scrape!
print(api_response.status_code)

# Upstream website status code
print(api_response.upstream_status_code)

# Convert API Scrape Result into well known requests.Response object
print(api_response.upstream_result_into_response())

The full documentation for Scrapfly's Python SDK is available here.

Managing Concurrency


With Scrapfly, you can make up to 5 concurrent requests even on the free plan!

To make use of your concurrency, you can use ThreadPoolExecutor to execute multiple requests at once. In the code below, we define a function to scrape the h1 from each page, scrape_page(). We then pass this function and our list_of_urls into executor.map().

import requests
from bs4 import BeautifulSoup
import concurrent.futures
import json
from urllib.parse import urlencode

API_KEY = ""
NUM_THREADS = 3

with open("config.json") as file:
    config = json.load(file)
    API_KEY = config["scrapfly_api_key"]

def get_scrapfly_url(url):
    payload = {
        "key": API_KEY,
        "url": url,
    }
    proxy_url = "https://api.scrapfly.io/scrape?" + urlencode(payload)
    return proxy_url

list_of_urls = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
    'http://quotes.toscrape.com/page/3/',
]

output_data_list = []

def scrape_page(url):
    try:
        response = requests.get(get_scrapfly_url(url))
        if response.status_code == 200:
            soup = BeautifulSoup(response.json()["result"]["content"], "html.parser")
            title = soup.find('h1').text

            ## add scraped data to "output_data_list" list
            output_data_list.append({
                'title': title,
            })

    except Exception as e:
        print('Error', e)


with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    executor.map(scrape_page, list_of_urls)

print(output_data_list)

When using ThreadPoolExecutor, we open up a pool of threads with the max_workers argument. We then pass the following into executor.map():

  • scrape_page: the function we wish to call on all open threads.
  • list_of_urls: a list of arguments to be passed into the function above.

Advanced Functionality

Scrapfly offers a pretty large set of advanced functionalities.

From JavaScript to geotargeting and all the way to auto extraction, Scrapfly claims to be able to pretty much do it all! We outline most of this functionality in the table below.

NOTE: some features cost extra API credits to use.

| Parameter | API Credits | Description |
| --- | --- | --- |
| proxy_pool | 1 - 25 | Use either residential or datacenter proxies. |
| headers | None | Send custom headers to your target. |
| country | None | Route your request through a specific location. |
| lang | None | Set a custom language for your response. |
| os | None | Set a custom OS for your scraper. (Not recommended) |
| timeout | None | Custom timeout for the response. (Not recommended) |
| format | None | Format in which to receive your response. |
| retry | None | Retry in the event of a failed response. |
| proxified_response | None | Return the HTML directly as the response body. |
| debug | None | Store the scraped data and a screenshot of the response. |
| correlation_id | None | Correlation ID for a group of scrapes in progress. |
| tags | None | Add tags to a group item. |
| dns | None | Query and retrieve DNS info for the target site. |
| ssl | None | Collect the target site's SSL data. |
| webhook_name | None | Make a request to a webhook after retrieving the page. |
| extraction_template | 1 | Attempt to parse the page automatically. |
| extraction_prompt | 5 | Prompt an AI to extract the data. |
| extraction_model | 5 | Attempt to parse the page using a specific model. |
| asp | Variable | Get past anti-bots. |
| cost_budget | N/A | Set a price limit when attempting ASP. |
| render_js | 5 | Open a browser and render JavaScript content. |
| wait_for_selector | 5 | Wait for a specific selector to appear on the page. |
| js | 5 | Execute a set of JavaScript instructions on the page. |
| screenshot | 5 | Take a screenshot of the page or an HTML element. |
| screenshot_flags | 5 | Flags to customize a screenshot. |
| js_scenario | 5 | Execute a set of JS actions (scroll, click, etc.). |
| geolocation | None | Set a custom location. |
| auto_scroll | 5 | Scroll to the bottom of the page and load JS. |
| rendering_stage | 5 | Wait until domcontentloaded or complete. |
| cache | None | Store the scraped content on Scrapfly servers. |
| cache_ttl | None | Cache time to live. |
| cache_clear | None | Force-clear the cache, then scrape and replace. |
| session | None | Reuse a browsing session (cookies and fingerprint). |
| session_sticky_proxy | None | Reuse a browsing session (actual IP address). |

These functionalities can be reviewed here.
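
As a hedged example of how these parameters combine (a sketch, not an official recipe), the payload below mixes geotargeting, JavaScript rendering, and anti-bot bypass; render_js and asp both consume extra API credits, as noted in the table.

import requests
import json
from urllib.parse import urlencode

with open("config.json") as file:
    API_KEY = json.load(file)["scrapfly_api_key"]

payload = {
    "key": API_KEY,
    "url": "https://quotes.toscrape.com/js/",  # JavaScript-rendered version of the demo site
    "country": "us",        # geotargeting
    "render_js": "true",    # open a browser and render JavaScript
    "asp": "true",          # anti-bot bypass (variable credit cost)
}
response = requests.get("https://api.scrapfly.io/scrape?" + urlencode(payload))
print(response.json()["result"]["content"][:500])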


Javascript Rendering

JavaScript Rendering is the process of executing JavaScript code to dynamically generate or modify the content on a web page.

Unlike traditional server-side rendering, where the server sends a fully constructed HTML page to the browser, JavaScript rendering often involves loading a skeleton HTML page and then using JavaScript to build or enhance the content on the client side after the page has loaded.

JavaScript rendering plays a critical role in delivering modern, fast, and interactive web applications.

  • Dynamic Content: Enables dynamic updates without reloading the page, enhancing user interactivity.
  • Single Page Applications (SPAs): JavaScript rendering is essential for creating SPAs where content changes without navigating away from the current page.
  • Improved User Experience: Faster, smoother, and more interactive web applications that respond instantly to user input.
  • SEO and Search Engine Crawling: Modern websites that rely on JavaScript rendering may use SSR or dynamic rendering to ensure that search engines can index content properly.

We can render JavaScript with the render_js parameter. It does exactly what it sounds like. It renders JavaScript.

In the snippet below, we visit WhatIsMyIp. This site initially gives us no content. It then uses JavaScript to dynamically load our IP address onto the page. Without JavaScript support, we'll be unable to scrape our IP.

Make sure to set render_js to the string "true". If you pass the Python boolean True and urlencode it directly, it gets encoded as "True", which the server doesn't read correctly, and Scrapfly ignores the request to render JavaScript.

import requests
import json
from bs4 import BeautifulSoup
from urllib.parse import urlencode

API_KEY = ""

with open("config.json") as file:
    config = json.load(file)
    API_KEY = config["scrapfly_api_key"]

def get_scrapfly_url(url):
    payload = {
        "key": API_KEY,
        "url": url,
        "render_js": "true",
        "rendering_wait": 2000
    }
    proxy_url = "https://api.scrapfly.io/scrape?" + urlencode(payload)
    return proxy_url

url = "https://www.whatismyip.com/"

response = requests.get(get_scrapfly_url(url))
content = response.json()["result"]["content"]

soup = BeautifulSoup(content, "html.parser")
ip_info = soup.select_one("a[id='ipv4']").get("title")
print(ip_info)
  • "render_js": "true" tells Scrapfly that we wish to open a browser and render JavaScript content.
  • "rendering_wait": 2000 tells Scrapfly to wait for 2 seconds (2,000 milliseconds) for our content to render and then send the response back to us.

Their documentation on rendering JavaScript is available here.

Controlling The Browser

With Scrapfly, not only can we open a browser, but we can also control one! We can use js_scenario to control our browser.

We write all of our JavaScript actions as an array of JSON objects. Then, we Base64-encode that JSON so it can be passed safely as a URL parameter without getting mangled in transit.

Using js_scenario below, we once again wait for the page to render, but instead of calling rendering_wait, we give the instructions directly inside the js_scenario.

import requests
import json
from bs4 import BeautifulSoup
from base64 import b64encode
from urllib.parse import urlencode

API_KEY = ""

with open("config.json") as file:
    config = json.load(file)
    API_KEY = config["scrapfly_api_key"]

def get_scrapfly_url(url):
    payload = {
        "key": API_KEY,
        "url": url,
        "render_js": "true",
        "js_scenario": b64encode(b"""
        [
            { "wait": 2500 }
        ]
        """).decode("utf-8")
    }
    proxy_url = "https://api.scrapfly.io/scrape?" + urlencode(payload)
    return proxy_url

url = "https://www.whatismyip.com/"

response = requests.get(get_scrapfly_url(url))
content = response.json()["result"]["content"]

soup = BeautifulSoup(content, "html.parser")
ip_info = soup.select_one("a[id='ipv4']").get("title")
print(ip_info)
  • js_scenario tells Scrapfly that we'd like to perform a list of JavaScript actions.
  • We create an array of JSON objects and Base64-encode it so it can be passed safely in the URL without getting corrupted in transit.
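
If you'd rather not hand-write the Base64 string, a small helper of our own (not part of Scrapfly) can build it from a plain Python list. Only the "wait" step shown above is used here; consult Scrapfly's js_scenario documentation for the full set of supported actions.

import json
from base64 import b64encode

def encode_js_scenario(steps):
    # Serialize the list of scenario steps to JSON, then Base64-encode it
    # so it can be passed safely as the js_scenario parameter.
    return b64encode(json.dumps(steps).encode("utf-8")).decode("utf-8")

payload_fragment = {
    "render_js": "true",
    "js_scenario": encode_js_scenario([
        {"wait": 2500},
    ]),
}
print(payload_fragment)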

Country Geotargeting

Country Geotargeting is a technique used to deliver tailored content, services, or advertisements to users based on their geographic location, specifically targeting users from certain countries. This is accomplished by detecting the user's IP address, which provides an approximation of their location.

Proxy services like Scrapfly route user traffic through proxy servers located in specific countries.

This allows users to access content, services, or websites as if they are browsing from the targeted country. This is particularly useful for accessing region-specific content or bypassing geo-restrictions.

Country geotargeting can help:

  • Bypass Geo-Restrictions: Access region-locked content like streaming services, websites, or apps that are restricted to specific countries.
  • Localized SEO and Ad Verification: Ensure that ads or search engine results are accurately displayed in different regions by simulating user traffic from target countries.
  • Access Regional Deals and Pricing: Take advantage of country-specific promotions, pricing, or services that vary based on user location.
  • Test Localization: Developers and testers use country-specific proxies to verify that websites and apps function properly across various regions with correct localization.
  • Avoid IP-based Blocks: Bypass IP restrictions on websites or services that limit access based on geographic location.
  • Enhanced Privacy: For users in restrictive regions, using a proxy from another country helps bypass censorship and provides anonymity.

To control our country, we can use the country parameter. Take a look at the code below. We add "country": "us" to our payload.

Once we've got our parameters set up, we make one request from our actual location and one through the proxy so we can compare the two.

import requests
import json
from bs4 import BeautifulSoup
from base64 import b64encode
from urllib.parse import urlencode

API_KEY = ""

with open("config.json") as file:
    config = json.load(file)
    API_KEY = config["scrapfly_api_key"]

def get_scrapfly_url(url):
    payload = {
        "key": API_KEY,
        "url": url,
        "country": "us"
    }

    proxy_url = "https://api.scrapfly.io/scrape?" + urlencode(payload)
    return proxy_url

url = "https://httpbin.org/ip"


test_response = requests.get(url)
print("real location", test_response.json())
response = requests.get(get_scrapfly_url(url))

content = response.json()["result"]["content"]
print("proxy location:", content)

Here is the output; it shows our real location alongside the location reported through the proxy.

Checking the proxy's IP confirms that it is inside the United States. In our test, we showed up in Newark, New Jersey.

Geotargeting is a staple when you're scraping the web. You can view Scrapfly's country documentation here. Here are some of the country codes you can use with Scrapfly.

| Country | Code |
| --- | --- |
| United Arab Emirates | ae |
| Australia | au |
| Brazil | br |
| Canada | ca |
| China | cn |
| Germany | de |
| Spain | es |
| United Kingdom | gb |
| India | in |
| Japan | jp |
| Mexico | mx |
| Portugal | pt |
| Russia | ru |
| Turkey | tr |
| United States | us |

The list above is non-exhaustive. If you wish to view their full list of countries (there are a lot), you can view it here.
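
As a quick usage sketch, you can reuse the country parameter with any of the codes above; the loop below (assuming the same config.json setup) prints the IP Scrapfly exits from in a few different regions.

import requests
import json
from urllib.parse import urlencode

with open("config.json") as file:
    API_KEY = json.load(file)["scrapfly_api_key"]

for country in ["us", "gb", "de"]:
    payload = {
        "key": API_KEY,
        "url": "https://httpbin.org/ip",
        "country": country,
    }
    response = requests.get("https://api.scrapfly.io/scrape?" + urlencode(payload))
    # The content field holds httpbin's JSON, which reports the exit IP.
    print(country, response.json()["result"]["content"])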


Residential Proxies

Residential proxies are another important staple in web scraping.

A residential proxy is a type of proxy server that routes internet traffic through real residential IP addresses provided by Internet Service Providers (ISPs).

These IPs are linked to physical locations (homes) and are associated with actual devices such as computers, mobile phones, or routers, making them appear as legitimate users online.

Residential proxies are particularly useful when you need to mimic real users and access geographically restricted content, or when performing tasks like web scraping, ad verification, or managing multiple accounts with a lower risk of detection.

We can use the proxy_pool argument to specify that we want to use Scrapfly's residential proxy pool. When forwarding our request, Scrapfly will then automatically route our request through their residential pool.

import requests
import json
from bs4 import BeautifulSoup
from base64 import b64encode
from urllib.parse import urlencode

API_KEY = ""

with open("config.json") as file:
    config = json.load(file)
    API_KEY = config["scrapfly_api_key"]

def get_scrapfly_url(url):
    payload = {
        "key": API_KEY,
        "url": url,
        "proxy_pool": "public_residential_pool"
    }

    proxy_url = "https://api.scrapfly.io/scrape?" + urlencode(payload)
    return proxy_url

url = "https://httpbin.org/ip"

response = requests.get(get_scrapfly_url(url))

content = response.json()["result"]["content"]
print("proxy location:", content)

Here is our output; the request now reports the IP address of the residential proxy that handled it.

You can view their proxy_pool docs here.


Custom Headers

HTTP headers provide additional information about the request, such as authentication tokens, content type, and user-agent data, and are crucial for proper communication between clients and servers.

By default, most proxy APIs or systems manage request headers automatically to optimize performance. However, some allow users to customize headers if specific data is needed to access the desired target.

Custom headers can be pretty useful when you're scraping the web. Sometimes the site you're accessing needs special or additional headers that your proxy service isn't aware of.

Why Use Custom Headers?

Custom headers are a powerful tool for more advanced or specific use cases, especially when interacting with APIs, bypassing detection systems, or replicating the behavior of real users.

Word of Caution

Custom headers require careful management to avoid performance degradation or triggering blocks.

  • Incorrect or static custom headers can negatively impact proxy performance.
  • If custom headers are not properly rotated or randomized, websites may detect repetitive behavior and block access.
  • For large-scale tasks, a system for continuously generating clean and randomized headers is essential to avoid detection and ensure smooth operation.

Proxy services typically optimize default headers for best performance, so custom headers should only be used when necessary.

With Scrapfly, we can set custom headers with the headers parameter. You put the header's name inside square brackets and give it the headers prefix. For example, to set a custom header, we pass "headers[Your Header Name]": "Your Header Value".

import requests
import json
from bs4 import BeautifulSoup
from base64 import b64encode
from urllib.parse import urlencode

API_KEY = ""

with open("config.json") as file:
    config = json.load(file)
    API_KEY = config["scrapfly_api_key"]

def get_scrapfly_url(url):
    payload = {
        "key": API_KEY,
        "url": url,
        "headers[Your Header Name]": "Your Header Value"
    }

    proxy_url = "https://api.scrapfly.io/scrape?" + urlencode(payload)
    return proxy_url

url = "https://httpbin.org/ip"

response = requests.get(get_scrapfly_url(url))

content = response.json()["result"]["content"]
print("proxy location:", content)

Their header documentation is available here.


Static Proxies

Static Proxy Functionality (also called sticky session proxies) is a type of proxy service where the user is assigned a single, static IP address that remains consistent throughout the session or for a specific period.

Unlike rotating proxies, which switch IPs with every request, static proxies maintain the same IP for multiple requests.

Static proxies are ideal for scenarios that require consistent interactions with websites, maintaining user sessions, and avoiding detection, making them an excellent tool for tasks like account management, ad verification, and market research.

For this, we use the session argument. Give your session a name and Scrapfly will keep it for up to 7 days; after that, all sessions are discarded.

Here is the code to set a session.

import requests
import json
from bs4 import BeautifulSoup
from base64 import b64encode
from urllib.parse import urlencode

API_KEY = ""

with open("config.json") as file:
    config = json.load(file)
    API_KEY = config["scrapfly_api_key"]

def get_scrapfly_url(url):
    payload = {
        "key": API_KEY,
        "url": url,
        "session": "Name of Your Session"
    }

    proxy_url = "https://api.scrapfly.io/scrape?" + urlencode(payload)
    return proxy_url

url = "https://httpbin.org/ip"

response = requests.get(get_scrapfly_url(url))

content = response.json()["result"]["content"]
print("proxy location:", content)

The full info on session can be viewed here.


Screenshot Functionality

Screenshot functionality in proxy services allows users to capture an image or visual representation of a web page as it appears at a specific moment in time.

Screenshots are incredibly important when you're scraping the web. Screenshot functionality is an essential tool for verifying content, visual elements, and user experience across different contexts, providing valuable insights for businesses, marketers, and developers.

From data verification to debugging, we use them all the time. To take a screenshot with Scrapfly, we can use the screenshots parameter. We can even take multiple screenshots in a single request.

Take a look at the example below.

import requests
import json
from bs4 import BeautifulSoup
from base64 import b64encode
from urllib.parse import urlencode

API_KEY = ""

with open("config.json") as file:
    config = json.load(file)
    API_KEY = config["scrapfly_api_key"]

def get_scrapfly_url(url):
    payload = {
        "key": API_KEY,
        "url": url,
        "screenshots[all]": "fullPage",
        "screenshots[reviews]": "#reviews"
    }

    proxy_url = "https://api.scrapfly.io/scrape?" + urlencode(payload)
    return proxy_url

url = "https://web-scraping.dev/product/1"

response = requests.get(get_scrapfly_url(url))

print(response.json())
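
The response is plain JSON, so you can inspect where the screenshot entries ended up. The snippet below assumes the screenshot metadata is keyed by the names used in the request ("all" and "reviews") somewhere under result; that layout is an assumption, so print the raw response and confirm it for yourself.

import requests
import json
from urllib.parse import urlencode

with open("config.json") as file:
    API_KEY = json.load(file)["scrapfly_api_key"]

payload = {
    "key": API_KEY,
    "url": "https://web-scraping.dev/product/1",
    "screenshots[all]": "fullPage",
    "screenshots[reviews]": "#reviews",
}
response = requests.get("https://api.scrapfly.io/scrape?" + urlencode(payload))
data = response.json()

# Assumption: screenshot metadata lives under result["screenshots"]; if this
# prints nothing, dump the full JSON and inspect the structure yourself.
for name, info in data.get("result", {}).get("screenshots", {}).items():
    print(name, info)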

The documentation for screenshots is available here.


Auto Parsing

Auto parsing is a relatively new but increasingly popular feature among web scraping providers.

Auto Parsing (also known as Auto Extract) refers to the automated process of extracting specific data or elements from a web page or document without needing manual coding or complex scraping techniques.

With auto parsing, the system intelligently identifies and extracts structured data (e.g., text, images, prices, or product details) from the HTML or JSON of a webpage.

Auto parsing functionality is highly beneficial for those who need to automate data extraction without extensive coding knowledge. It simplifies data collection and improves efficiency, making it ideal for businesses, marketers, analysts, and researchers.

With auto parsing, you send a request to the API. The API then goes through and attempts to parse the page for you. If successful, they send you a response back containing the extracted data.

To do this with Scrapfly, we can use any of the following parameters: extraction_prompt, extraction_model, extraction_template.

You can view an example of LLM extraction below.

import requests
import json
from bs4 import BeautifulSoup
from base64 import b64encode
from urllib.parse import urlencode

API_KEY = ""

with open("config.json") as file:
    config = json.load(file)
    API_KEY = config["scrapfly_api_key"]

def get_scrapfly_url(url):
    payload = {
        "key": API_KEY,
        "url": url,
        "extraction_prompt": "Please find all reviews for this product"
    }

    proxy_url = "https://api.scrapfly.io/scrape?" + urlencode(payload)
    return proxy_url

url = "https://web-scraping.dev/product/1"

response = requests.get(get_scrapfly_url(url))

print(response.json())

With extraction_prompt, you pass a prompt that gets fed to an LLM. It's like asking ChatGPT to scrape the page for you.
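
As a hedged sketch of the extraction_model parameter, the example below asks Scrapfly to parse the page with a predefined model instead of a free-form prompt. The model name "product" and the "extracted_data" key are assumptions on our part; check the extraction docs linked below for the supported models and the exact response layout.

import requests
import json
from urllib.parse import urlencode

with open("config.json") as file:
    API_KEY = json.load(file)["scrapfly_api_key"]

payload = {
    "key": API_KEY,
    "url": "https://web-scraping.dev/product/1",
    "extraction_model": "product",  # assumption: one of the predefined model names
}
response = requests.get("https://api.scrapfly.io/scrape?" + urlencode(payload))
result = response.json()["result"]

# Print the keys first, then attempt the assumed location of the parsed data.
print(list(result.keys()))
print(json.dumps(result.get("extracted_data"), indent=4))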

The full documentation on extraction can be found here.


Case Study: Using Scrapfly on IMDb Top 250 Movies

Now, it's time for a little comparison.

Here, we'll use Scrapfly and the ScrapeOps Proxy Aggregator to scrape the top 250 movies from IMDB.

Our two scripts are almost exactly the same. The major difference is that we use "api_key" with ScrapeOps and "key" with Scrapfly.

Scrapfly

Here is our proxy function for Scrapfly.

def get_scrapfly_url(url):
    payload = {
        "key": API_KEY,
        "url": url,
    }
    proxy_url = "https://api.scrapfly.io/scrape?" + urlencode(payload)
    return proxy_url

Here is the full code.

import os
import requests
from bs4 import BeautifulSoup
import json
import csv
import logging
from urllib.parse import urlencode

## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["scrapfly_api_key"]

def get_scrapfly_url(url):
    payload = {
        "key": API_KEY,
        "url": url,
    }
    proxy_url = "https://api.scrapfly.io/scrape?" + urlencode(payload)
    return proxy_url


def scrape_movies(url, location="us", retries=3):
    success = False
    tries = 0

    while not success and tries <= retries:
        response = requests.get(get_scrapfly_url(url))

        try:
            if response.status_code != 200:
                raise Exception(f"Failed response from server, status code: {response.status_code}")

            soup = BeautifulSoup(response.json()["result"]["content"], "html.parser")
            json_tag = soup.select_one("script[type='application/ld+json']")
            json_data = json.loads(json_tag.text)["itemListElement"]

            movie_list_length = 0

            movie_list = []

            for item in json_data:
                movie_list.append(item["item"])
                movie_list_length += len(json_data)

            print(f"Movie list length: {len(json_data)}")
            with open("scrapfly-top-250.json", "w") as file:
                json.dump(movie_list, file, indent=4)
            success = True
        except Exception as e:
            logger.error(f"Failed to process page: {e}, retries left: {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Failed to scrape page, MAX RETRIES {retries} EXCEEDED!!!")



if __name__ == "__main__":

    MAX_RETRIES = 3

    logger.info("Starting IMDB scrape")

    url = "https://www.imdb.com/chart/top/"

    scrape_movies(url, retries=MAX_RETRIES)

    logger.info("Scrape complete")

Here are the results: Scrapfly finished in 7.483 seconds.

Scrapfly Performance Results

ScrapeOps Proxy Aggregator

With the ScrapeOps Proxy Aggregator, we use this proxy function instead. It's largely the same as the proxy function from the Scrapfly example. The main difference is the api_key parameter; with Scrapfly, it's simply called key.

def get_scrapeops_url(url):
    payload = {
        "api_key": API_KEY,
        "url": url,
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url

Here is the full code for scraping IMDB with ScrapeOps.

import os
import requests
from bs4 import BeautifulSoup
import json
import csv
import logging
from urllib.parse import urlencode

## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["scrapeops_api_key"]

def get_scrapeops_url(url):
    payload = {
        "api_key": API_KEY,
        "url": url,
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url



def scrape_movies(url, location="us", retries=3):
    success = False
    tries = 0

    while not success and tries <= retries:
        response = requests.get(get_scrapeops_url(url))

        try:
            if response.status_code != 200:
                raise Exception(f"Failed response from server, status code: {response.status_code}")

            soup = BeautifulSoup(response.text, "html.parser")
            json_tag = soup.select_one("script[type='application/ld+json']")
            json_data = json.loads(json_tag.text)["itemListElement"]

            movie_list_length = 0

            movie_list = []

            for item in json_data:
                movie_list.append(item["item"])
                movie_list_length += len(json_data)

            print(f"Movie list length: {len(json_data)}")
            with open("scrapeops-top-250.json", "w") as file:
                json.dump(movie_list, file, indent=4)
            success = True
        except Exception as e:
            logger.error(f"Failed to process page: {e}, retries left: {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Failed to scrape page, MAX RETRIES {retries} EXCEEDED!!!")



if __name__ == "__main__":

    MAX_RETRIES = 3

    logger.info("Starting IMDB scrape")

    url = "https://www.imdb.com/chart/top/"

    scrape_movies(url, retries=MAX_RETRIES)

    logger.info("Scrape complete")

The full scrape took 5.728 seconds using the ScrapeOps Proxy Aggregator.

ScrapeOps Performance Results

Results

The ScrapeOps Proxy Aggregator was significantly faster, coming in at 5.728 seconds, while Scrapfly took 8.313 seconds for the same scrape. While results may vary, in our initial testing ScrapeOps was roughly 37% faster.

If you choose to follow along, your results will differ depending on your hardware and internet connection.

That being said, in our testing, the ScrapeOps Proxy Aggregator was quite a bit faster than Scrapfly for the same scraping job.


Alternative: ScrapeOps Proxy API Aggregator

The ScrapeOps Proxy Aggregator gets us access to just about all the same functionality as Scrapfly with a much larger selection of plans.

  • With Scrapfly, we can choose from 4 plans. With ScrapeOps, we get to choose from 8 different plans.
  • The lowest tier plan with Scrapfly costs $30 per month while the lowest tier plan from ScrapeOps costs $9 per month.
  • At the highest tier, Scrapfly costs $500 per month and ScrapeOps only costs $249 per month.

ScrapeOps Pricing Plans

Scrapfly Plans

| Plan | API Credits | Cost Per Normal Request | Monthly Price |
| --- | --- | --- | --- |
| Discovery | 200,000 | $0.00015 | $30 |
| Pro | 1,000,000 | $0.0001 | $100 |
| Startup | 2,500,000 | $0.0001 | $250 |
| Enterprise | 5,500,000 | $0.00009 | $500 |

ScrapeOps Plans

| API Credits | Cost Per Normal Request | Monthly Price | Scrapfly Equivalent |
| --- | --- | --- | --- |
| 25,000 | $0.00036 | $9 | None |
| 50,000 | $0.0003 | $15 | None |
| 100,000 | $0.00019 | $19 | None |
| 250,000 | $0.000116 | $29 | Discovery: $30 ($0.00015/request) |
| 500,000 | $0.000108 | $54 | None |
| 1,000,000 | $0.000099 | $99 | Pro: $100 ($0.0001/request) |
| 2,000,000 | $0.0000995 | $199 | None |
| 3,000,000 | $0.000083 | $249 | Startup: $250 ($0.0001/request) |

As you can see in the table above, at every tier where Scrapfly offers an equivalent, ScrapeOps costs less per request. We also offer 5 plans that have no Scrapfly equivalent at all.


Troubleshooting

Issue #1: Request Timeouts

Request timeouts can be a real pain. Luckily, it's pretty easy to set a custom timeout with Python Requests. To handle these timeouts, we simply use the timeout keyword argument.

Take a look at the example snippet below. We set a timeout of 5 seconds.

import requests

# 5 second timeout
response = requests.get("https://httpbin.org/get", timeout=5)

Issue #2: Handling CAPTCHAs

Dealing with CAPTCHAs can be a bit more difficult than timeout errors. Honestly, if you're receiving a CAPTCHA, something is likely not right with your scraper. Both Scrapfly and ScrapeOps are built to specifically avoid CAPTCHAs and bypass anti-bots.

First, retry your request. If you are consistently receiving CAPTCHAs with Scrapfly, enable asp. If you are consistently receiving them with ScrapeOps, you should use the bypass argument.
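
As a minimal sketch (reusing the config.json setup from earlier in this guide), enabling asp with Scrapfly is just one more entry in the payload; keep in mind that it consumes a variable number of extra credits.

import requests
import json
from urllib.parse import urlencode

with open("config.json") as file:
    API_KEY = json.load(file)["scrapfly_api_key"]

payload = {
    "key": API_KEY,
    "url": "https://quotes.toscrape.com",
    "asp": "true",  # anti-bot bypass; credit cost varies by target
}
response = requests.get("https://api.scrapfly.io/scrape?" + urlencode(payload))
print(response.status_code)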

Another way of resolving this issue is with a 3rd party service like 2Captcha.

We also have a great article devoted entirely to CAPTCHAs here.

Issue #3: Invalid Response Data

Invalid response data is a really common issue in all areas of web development. To take care of these sorts of errors, you need to be aware of the status code that was sent. We've got a cheat sheet here.

Most importantly, understand your status code and solve the problem accordingly.


Legal and Ethical Considerations

Here at ScrapeOps, we only scrape public data. This is a very important part of scraping the web legally. Public data is public information, much like a billboard.

If you scrape private data (data gated behind a login), this falls under a completely separate set of IP and privacy laws.

If you choose to scrape private data, there are many potential consequences including:

  • Terms of Service Violations: These can result in all sorts of headaches, including court orders and civil lawsuits.

  • Computer Fraud and Other Hacking Charges: Depending on how you access your data and the rules governing that data, you can even face prison time. Violations of this sort don't always end with a financial penalty; some people are required to actually go to prison and serve time.

  • Other Legal Consequences: Depending on what you do with the data, you can face all sorts of other legal headaches stemming from IP (intellectual property) and privacy laws, which vary by jurisdiction.

Ethical Consequences

When you agree to a site's Terms, it is usually treated as a legally binding contract. Websites have Terms and Conditions because they want you to follow a certain set of rules when accessing their product. Alongside the site's Terms, we should also take the target site's robots.txt into consideration.

  • Terms Violations: When you violate a legally binding contract, you are subject to any repercussions defined in that contract, including suspension and even a permanent ban. Depending on the terms, the target site might even have grounds to sue you.

  • robots.txt Violations: Violating a site's robots.txt policies is not technically illegal, but it can still hurt you in other ways, such as reputational damage to you and your company. No company wants to be the next headline about unethical data practices.


Conclusion

In conclusion, you now know how to use both Scrapfly and the ScrapeOps Proxy Aggregator, and you've seen all sorts of reasons to apply Scrapfly's advanced functionality when scraping.

You should also understand that ScrapeOps supports almost all of the same functionality at a lower price and with typically faster request speeds. Take these new tools and go build something!


More Web Scraping Guides

ScrapeOps is loaded with learning resources. We even wrote the playbook on web scraping in Python. You can view it here. To view more of our proxy integration guides, take a look at the articles below.