Scrapfly: Web Scraping Integration Guide
Scrapfly is designed to be a one-stop shop for scraping, and it is also one of the many providers behind the ScrapeOps Proxy Aggregator. It offers many of the same features, such as automated proxy management, JavaScript rendering, geotargeting, browser controls, screenshots, and auto parsing.
Today, we're going to walk through their product from start to finish. From the initial signup all the way to a real-world case study, we're going to build a solid grasp of how to use the Scrapfly API and how it compares to the ScrapeOps Proxy Aggregator.
- TLDR: Scraping With Scrapfly
- What is Scrapfly?
- Setting Up
- Advanced Functionality
- JavaScript Rendering
- Geotargeting
- Residential Proxies
- Custom Headers
- Static Proxies
- Screenshots
- Auto Parsing
- Case Study: Top 250 Movies from IMDB
- Alternative: ScrapeOps Proxy Aggregator
- Conclusion
- More Web Scraping Guides
TLDR: Web Scraping With Scrapfly
Getting started with Scrapfly is pretty easy once you've set up your account and you've got an API key.
- Create a new `config.json` file with your API key.
- Then, write a script that reads your key and start scraping!
The snippet below is a minimal example, but it holds everything you need to use your API key and make requests to Scrapfly.
import requests
import json
from urllib.parse import urlencode
API_KEY = ""
with open("config.json") as file:
config = json.load(file)
API_KEY = config["scrapfly_api_key"]
def get_scrapfly_url(url):
payload = {
"key": API_KEY,
"url": url,
}
proxy_url = "https://api.scrapfly.io/scrape?" + urlencode(payload)
return proxy_url
url = "https://quotes.toscrape.com"
response = requests.get(get_scrapfly_url(url))
data = response.json()
print(json.dumps(data, indent=4))
In the snippet above, you can get started very quickly with Scrapfly. If you need to add customization such as JavaScript rendering or geotargeting, simply add parameters to the `payload` dictionary inside the function.
You can view additional parameters in the API documentation here.
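For instance, here is a minimal sketch of what that payload might look like with a couple of the optional parameters covered later in this guide (`render_js` and `country`); treat the exact values as placeholders and check Scrapfly's docs for the full parameter list.

```python
import requests
import json
from urllib.parse import urlencode

API_KEY = ""

with open("config.json") as file:
    config = json.load(file)
    API_KEY = config["scrapfly_api_key"]

def get_scrapfly_url(url):
    # Extra options are just more key/value pairs in the payload
    payload = {
        "key": API_KEY,
        "url": url,
        "render_js": "true",   # open a browser and render JavaScript
        "country": "us",       # route the request through a US proxy
    }
    return "https://api.scrapfly.io/scrape?" + urlencode(payload)

response = requests.get(get_scrapfly_url("https://quotes.toscrape.com"))
print(response.status_code)
```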
What Is Scrapfly?
Like ScrapeOps, Scrapfly is built as a one-stop shop for all of your scraping needs. They manage proxy pools so you don't have to manage individual proxies.
Scrapfly goes through and connects you to the best proxy for your scrape. Also, like ScrapeOps, they give you numerous options to customize your proxy connection to make your scraping job easier.
The reasons to use Scrapfly are very similar to the reasons to use the ScrapeOps Proxy Aggregator: JavaScript rendering, browser actions, geotargeting, and much more. On top of all that, whether you're using Scrapfly or ScrapeOps, the biggest draw is reliability.
How Does Scrapfly Work?
Scrapfly maintains a large pool of datacenter and residential proxies all over the world. By default, it first attempts your request through a datacenter proxy and if the request is unsuccessful, it will retry with a better (often residential) proxy. Then, after the Scrapfly server receives its response, it sends a response back to you that includes the page you wanted to scrape.
When you scrape a site with a service like this, here is the basic process.
- You make a request to Scrapfly using your API key, target url, and any other custom parameters you wish to pass.
- Scrapfly receives your request and attempts to retrieve your target url.
- If Scrapfly receives a failed response, they retry with a better proxy until they either time out, hit a retry limit, or get a successful response.
- After they receive the requested content, they send it back to your scraper.
Response Format
All of our responses come as JSON by default. Think back to our code example in the TLDR section. You can view it again below.
We retrieve our data with `response.json()`. If you run it yourself, you'll receive all sorts of useful information in your response, such as `config`, `context`, `result`, `content`, `request_headers`, and `response_headers`.
import requests
import json
from urllib.parse import urlencode
API_KEY = ""
with open("config.json") as file:
config = json.load(file)
API_KEY = config["scrapfly_api_key"]
def get_scrapfly_url(url):
payload = {
"key": API_KEY,
"url": url,
}
proxy_url = "https://api.scrapfly.io/scrape?" + urlencode(payload)
return proxy_url
url = "https://quotes.toscrape.com"
response = requests.get(get_scrapfly_url(url))
data = response.json()
print(json.dumps(data, indent=4))
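As a quick sketch of how you might dig into that JSON once it comes back: the scraped HTML lives under `result` → `content`, which is the key we use throughout the rest of this guide.

```python
import requests
import json
from urllib.parse import urlencode

API_KEY = ""

with open("config.json") as file:
    config = json.load(file)
    API_KEY = config["scrapfly_api_key"]

payload = {"key": API_KEY, "url": "https://quotes.toscrape.com"}
response = requests.get("https://api.scrapfly.io/scrape?" + urlencode(payload))
data = response.json()

print(data.keys())                 # top-level keys returned by Scrapfly
html = data["result"]["content"]   # the scraped page's HTML
print(html[:500])                  # preview the first 500 characters
```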
Scrapfly Pricing
Scrapfly offers 4 separate pricing plans. Their most affordable plan comes in at $30 per month and their highest tier plan runs $500 per month.
If none of these plans meet your needs, they also give you the option to set up a custom plan. The table below gives a solid breakdown of Scrapfly's available plans.
Plan | API Credits | Cost Per Normal Request | Monthly Price |
---|---|---|---|
Discovery | 200,000 | $0.00015 | $30 |
Pro | 1,000,000 | $0.0001 | $100 |
Startup | 2,500,000 | $0.0001 | $250 |
Enterprise | 5,500,000 | $0.00009 | $500 |
Like ScrapeOps Proxy Aggregator, when using Scrapfly, you typically only pay for successful requests.
Response Status Codes
Status codes are very important. If you're receiving anything other than a 200, something is wrong. To properly troubleshoot these status codes, we need a place to reference them.
The table below outlines the status codes you'll run into when using Scrapfly.
Status Code | Type | Description |
---|---|---|
200 | Success | Everything is working! |
400 | Bad Request | You need to double check your parameters. |
404 | Not Found | Double-check your url; the page wasn't found. |
422 | Unprocessable Entity | Unable to process the response. |
429 | Too Many Requests | Slow down your requests. |
500 | Internal Server Error | Scrapfly is having an internal issue. |
502 | Service Error | Scrapfly's host is having an internal error. |
503 | Temporarily Unavailable | Scrapfly is undergoing maintenance. |
504 | Not Reachable | Scrapfly is not reachable or timed out. |
Setting Up Scrapfly
Signing up for Scrapfly is a bit tedious. Unlike many other providers, they have a somewhat in-depth KYC (know your customer) process. They collect information about your employment, your reasons for using their site, the sites you want to scrape, and any other proxy products you've used in the past. They also collect your email and phone number.
To get started:
- You need to fill in some personal information and complete a CAPTCHA.
- You'll need your basic contact information (phone number and email address).
- You will also need to disclose your reasons for using the site, the sites you wish to scrape and any other proxy providers you've used or tested in the past.
After a somewhat intrusive (but arguably justified) signup process, you'll receive a confirmation email. Once you've confirmed your email, you can access the dashboard and you're ready to go with 1,000 API credits.
With Scrapfly, we can use either their REST API or their SDK. As mentioned here, Scrapfly does not support HTTP proxy port integration. Our two ways of access are as follows:
- REST API: We use the REST API when we're comfortable with an HTTP library (in our case, Python Requests) and we'd like to build our requests ourselves.
- SDK: Using their SDK is a great way to get started, especially for beginners. The SDK gives us full access to the REST API, but much of the underlying HTTP has been abstracted away so we don't need to think about it as much.
If you click the API Player tab, you'll be taken to their Request builder. When dealing with any new scraping API, builders like this are an incredibly useful tool.
These builders allow us to create custom API requests using a variety of different frameworks and HTTP clients.
API Endpoint Integration
We've already performed API endpoint integration in our previous code example. With Endpoint Integration, we send all of our requests, along with their parameters, to a specific API endpoint. The API then reads these parameters and executes our request accordingly.
Let's look at this basic request one more time.
import requests
import json
from urllib.parse import urlencode
API_KEY = ""
with open("config.json") as file:
config = json.load(file)
API_KEY = config["scrapfly_api_key"]
def get_scrapfly_url(url):
payload = {
"key": API_KEY,
"url": url,
}
proxy_url = "https://api.scrapfly.io/scrape?" + urlencode(payload)
return proxy_url
url = "https://quotes.toscrape.com"
response = requests.get(get_scrapfly_url(url))
data = response.json()
print(json.dumps(data, indent=4))
Take a look at our proxy URL without the `payload`: `https://api.scrapfly.io/scrape?`. All of our requests go to the `/scrape` endpoint.
Whenever we make a request to the API (no matter what our parameters are), it gets sent to this specific endpoint. All of our parameters get url-encoded and appended to `https://api.scrapfly.io/scrape?`.
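To make that concrete, here is a small sketch of what `urlencode` does to the payload; the API key shown is just a placeholder.

```python
from urllib.parse import urlencode

payload = {
    "key": "YOUR-API-KEY",
    "url": "https://quotes.toscrape.com",
}

# urlencode escapes each value and joins the pairs with "&"
print(urlencode(payload))
# key=YOUR-API-KEY&url=https%3A%2F%2Fquotes.toscrape.com

# The final request URL is simply the /scrape endpoint plus this query string
print("https://api.scrapfly.io/scrape?" + urlencode(payload))
```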
Their full API documentation is available here.
SDK Integration
Scrapfly also gives us the option to use their SDK. The SDK abstracts away much of the lower level HTTP code that we deal with when using Endpoint Integration.
To install the Python SDK, run the following command.
pip install 'scrapfly-sdk'
You can then test your proxy connection with the following code. Make sure to replace the API key with your own.
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
scrapfly = ScrapflyClient(key='YOUR-SUPER-SECRET-API-KEY')
api_response:ScrapeApiResponse = scrapfly.scrape(scrape_config=ScrapeConfig(url='https://httpbin.dev/anything'))
# Automatically retry errors marked "retryable", waiting the recommended delay before retrying
api_response:ScrapeApiResponse = scrapfly.resilient_scrape(scrape_config=ScrapeConfig(url='https://httpbin.dev/anything'))
# Automatic retry error based on status code
api_response:ScrapeApiResponse = scrapfly.resilient_scrape(scrape_config=ScrapeConfig(url='https://httpbin.dev/status/500'), retry_on_status_code=[500])
# scrape result, content, iframes, response headers, response cookies states, screenshots, ssl, dns etc
print(api_response.scrape_result)
# html content
print(api_response.scrape_result['content'])
# Context of scrape, session, webhook, asp, cache, debug
print(api_response.context)
# raw api result
print(api_response.content)
# True if the scrape responded with a 2xx HTTP status
print(api_response.success)
# Scrapfly API status code /!\ Not the status code of the scraped page!
print(api_response.status_code)
# Upstream website status code
print(api_response.upstream_status_code)
# Convert API Scrape Result into well known requests.Response object
print(api_response.upstream_result_into_response())
The full documentation for Scrapfly's Python SDK is available here.
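Once you have an `api_response`, you can feed the returned HTML straight into a parser. Here is a minimal sketch, assuming the SDK client from the snippet above and BeautifulSoup installed:

```python
from bs4 import BeautifulSoup
from scrapfly import ScrapeConfig, ScrapflyClient

scrapfly = ScrapflyClient(key="YOUR-SUPER-SECRET-API-KEY")
api_response = scrapfly.scrape(scrape_config=ScrapeConfig(url="https://quotes.toscrape.com"))

# scrape_result['content'] holds the page HTML, much like result.content in the REST API
soup = BeautifulSoup(api_response.scrape_result["content"], "html.parser")
print(soup.find("h1").text)
```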
Managing Concurrency
With Scrapfly, you can make up to 5 concurrent requests even on the free plan!
To make use of your concurrency, you can use `ThreadPoolExecutor` to execute multiple requests at once. In the code below, we define a function, `scrape_page()`, that scrapes the `h1` from each page. We then pass this function and our `list_of_urls` into `executor.map()`.
import requests
from bs4 import BeautifulSoup
import concurrent.futures
import json
from urllib.parse import urlencode
API_KEY = ""
NUM_THREADS = 3
with open("config.json") as file:
config = json.load(file)
API_KEY = config["scrapfly_api_key"]
def get_scrapfly_url(url):
payload = {
"key": API_KEY,
"url": url,
}
proxy_url = "https://api.scrapfly.io/scrape?" + urlencode(payload)
return proxy_url
list_of_urls = [
'http://quotes.toscrape.com/page/1/',
'http://quotes.toscrape.com/page/2/',
'http://quotes.toscrape.com/page/3/',
]
output_data_list = []
def scrape_page(url):
try:
response = requests.get(get_scrapfly_url(url))
if response.status_code == 200:
soup = BeautifulSoup(response.json()["result"]["content"], "html.parser")
title = soup.find('h1').text
## add scraped data to "output_data_list" list
output_data_list.append({
'title': title,
})
except Exception as e:
print('Error', e)
with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
executor.map(scrape_page, list_of_urls)
print(output_data_list)
When using `ThreadPoolExecutor`, we open up a pool of threads with the `max_workers` argument. We then pass the following into `executor.map()`:
- `scrape_page`: the function we wish to call on all open threads.
- `list_of_urls`: a list of arguments to be passed into the function above.
Advanced Functionality
Scrapfly is host to a pretty large set of advanced functionalities.
From JavaScript to geotargeting and all the way to auto extraction, Scrapfly claims to be able to pretty much do it all! We outline most of this functionality in the table below.
NOTE: some features will cost extra to perform.
Parameter | API Credits | Description |
---|---|---|
proxy_pool | 1 - 25 | Use either residential or datacenter proxies. |
headers | None | Send custom headers to your target. |
country | None | Route your request through a specific location. |
lang | None | Set a custom language for your response. |
os | None | Set a custom OS for your scraper. (Not Recommended) |
timeout | None | Custom timeout for the response (Not Recommended) |
format | None | Format in which to receive your response. |
retry | None | Retry in the event of a failed response. |
proxified_response | None | Return HTML response directly as the response body |
debug | None | Store the scraped data and a screenshot of the response. |
correlation_id | None | Correlation ID for a group of scrapes in progress. |
tags | None | Add tags to a group item. |
dns | None | Query and retrieve DNS info for the target site. |
ssl | None | Collect the target site's SSL data. |
webhook_name | None | Make a request to a webhook after retrieving page. |
extraction_template | 1 | Attempt to parse the page automatically. |
extraction_prompt | 5 | Prompt an AI to extract the data. |
extraction_model | 5 | Attempt to parse the page using a specific model. |
asp | Variable | Get past anti-bots. |
cost_budget | N/A | Set a price limit when attempting ASP. |
render_js | 5 | Open a browser and render JavaScript content. |
wait_for_selector | 5 | Wait for specific selector to appear on the page. |
js | 5 | Execute a set of JavaScript instructions on page. |
screenshot | 5 | Take a screenshot of the page or an HTML element. |
screenshot_flags | 5 | Flags to customize a screenshot. |
js_scenario | 5 | Execute a set of JS actions (scroll, click etc.) |
geolocation | None | Set a custom location. |
auto_scroll | 5 | Scroll to the bottom of the page and load JS. |
rendering_stage | 5 | Wait until domcontentloaded or complete |
cache | None | Store the scraped content on Scrapfly servers. |
cache_ttl | None | Cache time to live. |
cache_clear | None | Force clear the cache, then scrape and replace. |
session | None | Reuse a browsing session (cookies and fingerprint). |
session_sticky_proxy | None | Reuse a browsing session (actual IP addresses). |
These functionalities can be reviewed here.
Javascript Rendering
JavaScript Rendering is the process of executing JavaScript code to dynamically generate or modify the content on a web page.
Unlike traditional server-side rendering, where the server sends a fully constructed HTML page to the browser, JavaScript rendering often involves loading a skeleton HTML page and then using JavaScript to build or enhance the content on the client side after the page has loaded.
JavaScript rendering plays a critical role in delivering modern, fast, and interactive web applications.
- Dynamic Content: Enables dynamic updates without reloading the page, enhancing user interactivity.
- Single Page Applications (SPAs): JavaScript rendering is essential for creating SPAs where content changes without navigating away from the current page.
- Improved User Experience: Faster, smoother, and more interactive web applications that respond instantly to user input.
- SEO and Search Engine Crawling: Modern websites that rely on JavaScript rendering may use SSR or dynamic rendering to ensure that search engines can index content properly.
We can render JavaScript with the `render_js` parameter. It does exactly what it sounds like: it renders JavaScript.
In the snippet below, we visit WhatIsMyIp. This site initially gives us no content. It then uses JavaScript to dynamically load our IP address onto the page. Without JavaScript support, we'll be unable to scrape our IP.
Make sure to set `render_js` to the string `"true"`. If you pass the Python boolean `True` directly into `urlencode()`, it gets encoded in a way the server reads incorrectly, and Scrapfly ignores the request to render JavaScript.
import requests
import json
from bs4 import BeautifulSoup
from urllib.parse import urlencode
API_KEY = ""
with open("config.json") as file:
config = json.load(file)
API_KEY = config["scrapfly_api_key"]
def get_scrapfly_url(url):
payload = {
"key": API_KEY,
"url": url,
"render_js": "true",
"rendering_wait": 2000
}
proxy_url = "https://api.scrapfly.io/scrape?" + urlencode(payload)
return proxy_url
url = "https://www.whatismyip.com/"
response = requests.get(get_scrapfly_url(url))
content = response.json()["result"]["content"]
soup = BeautifulSoup(content, "html.parser")
ip_info = soup.select_one("a[id='ipv4']").get("title")
print(ip_info)
"render_js": "true"
tells Scrapfly that we wish to open a browser and render JavaScript content."rendering_wait": 2000
tells Scrapfly to wait for 2 seconds (2,000 milliseconds) for our content to render and then send the response back to us.
Their documentation on rendering JavaScript is available here.
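If you would rather wait for a specific element than a fixed delay, the `wait_for_selector` parameter from the table above can be used instead of `rendering_wait`. Here is a minimal sketch, reusing the same `config.json` setup and the `a#ipv4` element from the example above:

```python
import requests
import json
from bs4 import BeautifulSoup
from urllib.parse import urlencode

API_KEY = ""

with open("config.json") as file:
    config = json.load(file)
    API_KEY = config["scrapfly_api_key"]

def get_scrapfly_url(url):
    payload = {
        "key": API_KEY,
        "url": url,
        "render_js": "true",
        # wait until the IP element has actually been rendered
        "wait_for_selector": "a#ipv4",
    }
    return "https://api.scrapfly.io/scrape?" + urlencode(payload)

response = requests.get(get_scrapfly_url("https://www.whatismyip.com/"))
content = response.json()["result"]["content"]
soup = BeautifulSoup(content, "html.parser")
print(soup.select_one("a[id='ipv4']").get("title"))
```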
Controlling The Browser
With Scrapfly, not only can we open a browser, but we can also control one! We can use `js_scenario` to control our browser.
We write all of our JavaScript actions as an array of JSON objects. Then, we use Base64 encoding to convert the JSON to a binary format. This Base64 encoding prevents the data from getting corrupted in transit.
Using `js_scenario` below, we once again wait for the page to render, but instead of calling `rendering_wait`, we give the instructions directly inside the `js_scenario`.
import requests
import json
from bs4 import BeautifulSoup
from base64 import b64encode
from urllib.parse import urlencode
API_KEY = ""
with open("config.json") as file:
config = json.load(file)
API_KEY = config["scrapfly_api_key"]
def get_scrapfly_url(url):
payload = {
"key": API_KEY,
"url": url,
"render_js": "true",
"js_scenario": b64encode(b"""
[
{ "wait": 2500 }
]
""").decode("utf-8")
}
proxy_url = "https://api.scrapfly.io/scrape?" + urlencode(payload)
return proxy_url
url = "https://www.whatismyip.com/"
response = requests.get(get_scrapfly_url(url))
content = response.json()["result"]["content"]
soup = BeautifulSoup(content, "html.parser")
ip_info = soup.select_one("a[id='ipv4']").get("title")
print(ip_info)
- `js_scenario` tells Scrapfly that we'd like to perform a list of JavaScript actions.
- We create an array of JSON objects and encode them in Base64 to prevent them from getting corrupted in transit.
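Embedding raw JSON in a string works, but it can be easier to build the scenario as Python objects and serialize it. Below is a sketch of the same single-step `wait` scenario built with `json.dumps()`; any additional actions you add should be checked against Scrapfly's `js_scenario` documentation.

```python
import json
from base64 import b64encode

# Build the scenario as a Python list of dicts, then serialize and Base64-encode it
scenario = [
    {"wait": 2500},  # wait 2.5 seconds for the page to finish rendering
]

encoded_scenario = b64encode(json.dumps(scenario).encode("utf-8")).decode("utf-8")

payload = {
    "key": "YOUR-API-KEY",
    "url": "https://www.whatismyip.com/",
    "render_js": "true",
    "js_scenario": encoded_scenario,
}
```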
Country Geotargeting
Country Geotargeting is a technique used to deliver tailored content, services, or advertisements to users based on their geographic location, specifically targeting users from certain countries. This is accomplished by detecting the user's IP address, which provides an approximation of their location.
Proxy services like Scrapfly route user traffic through proxy servers located in specific countries.
This allows users to access content, services, or websites as if they are browsing from the targeted country. This is particularly useful for accessing region-specific content or bypassing geo-restrictions.
Country geotargeting can help:
- Bypass Geo-Restrictions: Access region-locked content like streaming services, websites, or apps that are restricted to specific countries.
- Localized SEO and Ad Verification: Ensure that ads or search engine results are accurately displayed in different regions by simulating user traffic from target countries.
- Access Regional Deals and Pricing: Take advantage of country-specific promotions, pricing, or services that vary based on user location.
- Test Localization: Developers and testers use country-specific proxies to verify that websites and apps function properly across various regions with correct localization.
- Avoid IP-based Blocks: Bypass IP restrictions on websites or services that limit access based on geographic location.
- Enhanced Privacy: For users in restrictive regions, using a proxy from another country helps bypass censorship and provides anonymity.
To control our country, we can use the `country` parameter. Take a look at the code below. We add `"country": "us"` to our `payload`.
Once we've got our parameters set up, we make one request from our actual location and one through the proxy so we can compare the two.
import requests
import json
from bs4 import BeautifulSoup
from base64 import b64encode
from urllib.parse import urlencode
API_KEY = ""
with open("config.json") as file:
config = json.load(file)
API_KEY = config["scrapfly_api_key"]
def get_scrapfly_url(url):
payload = {
"key": API_KEY,
"url": url,
"country": "us"
}
proxy_url = "https://api.scrapfly.io/scrape?" + urlencode(payload)
return proxy_url
url = "https://httpbin.org/ip"
test_response = requests.get(url)
print("real location", test_response.json())
response = requests.get(get_scrapfly_url(url))
content = response.json()["result"]["content"]
print("proxy location:", content)
Here is the output.
Now it's time to check and make sure that the proxy location is inside the United States. We show up in Newark, New Jersey.
Geotargeting is a staple when you're scraping the web. You can view Scrapfly's `country` documentation here. Here are some of the country codes you can use with Scrapfly.
Country | Code |
---|---|
United Arab Emirates | ae |
Australia | au |
Brazil | br |
Canada | ca |
China | cn |
Germany | de |
Spain | es |
United Kingdom | gb |
India | in |
Japan | jp |
Mexico | mx |
Portugal | pt |
Russia | ru |
Turkey | tr |
United States | us |
The list above is non-exhaustive. If you wish to view their full list of countries (there are a lot), you can view it here.
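If you want to see geotargeting in action, a quick sketch like the one below (reusing the `config.json` setup from earlier) loops over a few of the codes from the table and prints the IP that httpbin reports for each.

```python
import requests
import json
from urllib.parse import urlencode

API_KEY = ""

with open("config.json") as file:
    config = json.load(file)
    API_KEY = config["scrapfly_api_key"]

def get_scrapfly_url(url, country):
    payload = {
        "key": API_KEY,
        "url": url,
        "country": country,
    }
    return "https://api.scrapfly.io/scrape?" + urlencode(payload)

# A few codes from the table above
for country in ["us", "gb", "de"]:
    response = requests.get(get_scrapfly_url("https://httpbin.org/ip", country))
    content = response.json()["result"]["content"]
    print(country, content)
```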
Residential Proxies
Residential proxies are another important staple in web scraping.
A residential proxy is a type of proxy server that routes internet traffic through real residential IP addresses provided by Internet Service Providers (ISPs).
These IPs are linked to physical locations (homes) and are associated with actual devices such as computers, mobile phones, or routers, making them appear as legitimate users online.
Residential proxies are particularly useful when you need to mimic real users and access geographically restricted content, or when performing tasks like web scraping, ad verification, or managing multiple accounts with a lower risk of detection.
We can use the `proxy_pool` argument to specify that we want to use Scrapfly's residential proxy pool. When forwarding our request, Scrapfly will then automatically route our request through their residential pool.
import requests
import json
from bs4 import BeautifulSoup
from base64 import b64encode
from urllib.parse import urlencode
API_KEY = ""
with open("config.json") as file:
config = json.load(file)
API_KEY = config["scrapfly_api_key"]
def get_scrapfly_url(url):
payload = {
"key": API_KEY,
"url": url,
"proxy_pool": "public_residential_pool"
}
proxy_url = "https://api.scrapfly.io/scrape?" + urlencode(payload)
return proxy_url
url = "https://httpbin.org/ip"
response = requests.get(get_scrapfly_url(url))
content = response.json()["result"]["content"]
print("proxy location:", content)
Here is our output.
You can view their `proxy_pool` docs here.
Custom Headers
HTTP headers provide additional information about the request, such as authentication tokens, content type, and user-agent data, and are crucial for proper communication between clients and servers.
By default, most proxy APIs or systems manage request headers automatically to optimize performance. However, some allow users to customize headers if specific data is needed to access the desired target.
Custom headers can be pretty useful when you're scraping the web. Sometimes you're accessing a site that needs special or additional headers that your proxy service isn't aware of.
Why Use Custom Headers?
Custom headers are a powerful tool for more advanced or specific use cases, especially when interacting with APIs, bypassing detection systems, or replicating the behavior of real users.
Word of Caution
Custom headers require careful management to avoid performance degradation or triggering blocks.
- Incorrect or static custom headers can negatively impact proxy performance.
- If custom headers are not properly rotated or randomized, websites may detect repetitive behavior and block access.
- For large-scale tasks, a system for continuously generating clean and randomized headers is essential to avoid detection and ensure smooth operation.
Proxy services typically optimize default headers for best performance, so custom headers should only be used when necessary.
With Scrapfly, we can set custom headers with the `headers` parameter. You put the actual header name inside square brackets after the `headers` prefix. To set a custom header, we pass `"headers[Your Header Name]": "Your Header Value"`.
import requests
import json
from bs4 import BeautifulSoup
from base64 import b64encode
from urllib.parse import urlencode
API_KEY = ""
with open("config.json") as file:
config = json.load(file)
API_KEY = config["scrapfly_api_key"]
def get_scrapfly_url(url):
payload = {
"key": API_KEY,
"url": url,
"headers[Your Header Name]": "Your Header Value"
}
proxy_url = "https://api.scrapfly.io/scrape?" + urlencode(payload)
return proxy_url
url = "https://httpbin.org/ip"
response = requests.get(get_scrapfly_url(url))
content = response.json()["result"]["content"]
print("proxy location:", content)
Their header documentation is available here.
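One easy way to confirm your headers are actually being forwarded is to scrape httpbin's header echo endpoint. This sketch assumes the same `config.json` setup as above and uses a made-up `X-Example-Header` purely for illustration.

```python
import requests
import json
from urllib.parse import urlencode

API_KEY = ""

with open("config.json") as file:
    config = json.load(file)
    API_KEY = config["scrapfly_api_key"]

def get_scrapfly_url(url):
    payload = {
        "key": API_KEY,
        "url": url,
        # Hypothetical header, purely to demonstrate the syntax
        "headers[X-Example-Header]": "example-value",
    }
    return "https://api.scrapfly.io/scrape?" + urlencode(payload)

# https://httpbin.org/headers echoes back the headers it received
response = requests.get(get_scrapfly_url("https://httpbin.org/headers"))
print(response.json()["result"]["content"])
```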
Static Proxies
Static Proxy Functionality (also called sticky session proxies) is a type of proxy service where the user is assigned a single, static IP address that remains consistent throughout the session or for a specific period.
Unlike rotating proxies, which switch IPs with every request, static proxies maintain the same IP for multiple requests.
Static proxies are ideal for scenarios that require consistent interactions with websites, maintaining user sessions, and avoiding detection, making them an excellent tool for tasks like account management, ad verification, and market research.
For this, we use the `session` argument. Give your session a name and it will be saved by Scrapfly for up to 7 days; after that, Scrapfly discards it.
Here is the code to set a session.
import requests
import json
from bs4 import BeautifulSoup
from base64 import b64encode
from urllib.parse import urlencode
API_KEY = ""
with open("config.json") as file:
config = json.load(file)
API_KEY = config["scrapfly_api_key"]
def get_scrapfly_url(url):
payload = {
"key": API_KEY,
"url": url,
"session": "Name of Your Session"
}
proxy_url = "https://api.scrapfly.io/scrape?" + urlencode(payload)
return proxy_url
url = "https://httpbin.org/ip"
response = requests.get(get_scrapfly_url(url))
content = response.json()["result"]["content"]
print("proxy location:", content)
The full info on `session` can be viewed here.
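To see a sticky session at work, you can make two requests with the same session name and compare the IPs that httpbin reports. Here is a sketch, assuming the usual `config.json` setup; per the parameter table earlier, `session_sticky_proxy` is the option tied to reusing the actual IP, so we set it explicitly here.

```python
import requests
import json
from urllib.parse import urlencode

API_KEY = ""

with open("config.json") as file:
    config = json.load(file)
    API_KEY = config["scrapfly_api_key"]

def get_scrapfly_url(url):
    payload = {
        "key": API_KEY,
        "url": url,
        "session": "my_sticky_session",    # same session name on every request
        "session_sticky_proxy": "true",    # per the table above, reuse the same IP
    }
    return "https://api.scrapfly.io/scrape?" + urlencode(payload)

# Both requests reuse the same session, so the reported IP should stay consistent
for i in range(2):
    response = requests.get(get_scrapfly_url("https://httpbin.org/ip"))
    print(f"request {i + 1}:", response.json()["result"]["content"])
```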
Screenshot Functionality
Screenshot functionality in proxy services allows users to capture an image or visual representation of a web page as it appears at a specific moment in time.
Screenshots are incredibly important when you're scraping the web. Screenshot functionality is an essential tool for verifying content, visual elements, and user experience across different contexts, providing valuable insights for businesses, marketers, and developers.
From data verification to debugging, we use them all the time. To take a screenshot with Scrapfly, we can use the `screenshots` parameter. We can actually take multiple screenshots in a single request.
Take a look at the example below.
import requests
import json
from bs4 import BeautifulSoup
from base64 import b64encode
from urllib.parse import urlencode
API_KEY = ""
with open("config.json") as file:
config = json.load(file)
API_KEY = config["scrapfly_api_key"]
def get_scrapfly_url(url):
payload = {
"key": API_KEY,
"url": url,
"screenshots[all]": "fullPage",
"screenshots[reviews]": "#reviews"
}
proxy_url = "https://api.scrapfly.io/scrape?" + urlencode(payload)
return proxy_url
url = "https://web-scraping.dev/product/1"
response = requests.get(get_scrapfly_url(url))
print(response.json())
The documentation for `screenshots` is available here.
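The request above returns JSON rather than an image file. Exactly where the screenshot data lives in that payload is best confirmed against the response itself and the documentation linked above, so a simple first step is to dump the result and inspect its keys; the sketch below continues from the `response` object in the previous snippet.

```python
import json

# Continuing from the snippet above, where `response` holds the screenshot request
data = response.json()

# Save the full response so you can inspect the screenshot metadata
with open("screenshot-response.json", "w") as file:
    json.dump(data, file, indent=4)

# Print the result keys to find where the screenshot data is stored
print(data["result"].keys())
```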
Auto Parsing
Auto parsing is a rather new but fashionable feature with web scrapers.
Auto Parsing (also known as Auto Extract) refers to the automated process of extracting specific data or elements from a web page or document without needing manual coding or complex scraping techniques.
With auto parsing, the system intelligently identifies and extracts structured data (e.g., text, images, prices, or product details) from the HTML or JSON of a webpage.
Auto parsing functionality is highly beneficial for those who need to automate data extraction without extensive coding knowledge. It simplifies data collection and improves efficiency, making it ideal for businesses, marketers, analysts, and researchers.
With auto parsing, you send a request to the API. The API then goes through and attempts to parse the page for you. If successful, they send you a response back containing the extracted data.
To do this with Scrapfly, we can use any of the following parameters: `extraction_prompt`, `extraction_model`, or `extraction_template`.
You can view an example of LLM extraction below.
import requests
import json
from bs4 import BeautifulSoup
from base64 import b64encode
from urllib.parse import urlencode
API_KEY = ""
with open("config.json") as file:
config = json.load(file)
API_KEY = config["scrapfly_api_key"]
def get_scrapfly_url(url):
payload = {
"key": API_KEY,
"url": url,
"extraction_prompt": "Please find all reviews for this product"
}
proxy_url = "https://api.scrapfly.io/scrape?" + urlencode(payload)
return proxy_url
url = "https://web-scraping.dev/product/1"
response = requests.get(get_scrapfly_url(url))
print(response.json())
With `extraction_prompt`, you give the prompt to be entered into an LLM. It's like asking ChatGPT to scrape for you.
The full documentation on extraction can be found here.
Case Study: Using Scrapfly on IMDb Top 250 Movies
Now, it's time for a little comparison.
Here, we'll use Scrapfly and the ScrapeOps Proxy Aggregator to scrape the top 250 movies from IMDB.
Our two scripts are almost exactly the same. The major difference is that we use `"api_key"` with ScrapeOps and `"key"` with Scrapfly.
Scrapfly
Here is our proxy function for Scrapfly.
def get_scrapfly_url(url):
payload = {
"key": API_KEY,
"url": url,
}
proxy_url = "https://api.scrapfly.io/scrape?" + urlencode(payload)
return proxy_url
Here is the full code.
import os
import requests
from bs4 import BeautifulSoup
import json
import csv
import logging
from urllib.parse import urlencode
## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
API_KEY = ""
with open("config.json", "r") as config_file:
config = json.load(config_file)
API_KEY = config["scrapfly_api_key"]
def get_scrapfly_url(url):
payload = {
"key": API_KEY,
"url": url,
}
proxy_url = "https://api.scrapfly.io/scrape?" + urlencode(payload)
return proxy_url
def scrape_movies(url, location="us", retries=3):
success = False
tries = 0
while not success and tries <= retries:
response = requests.get(get_scrapfly_url(url))
try:
if response.status_code != 200:
raise Exception(f"Failed response from server, status code: {response.status_code}")
soup = BeautifulSoup(response.json()["result"]["content"], "html.parser")
json_tag = soup.select_one("script[type='application/ld+json']")
json_data = json.loads(json_tag.text)["itemListElement"]
movie_list_length = 0
movie_list = []
for item in json_data:
movie_list.append(item["item"])
movie_list_length+=len(json_data)
print(f"Movie list length: {len(json_data)}")
with open("scrapfly-top-250.json", "w") as file:
json.dump(movie_list, file, indent=4)
success = True
except Exception as e:
logger.error(f"Failed to process page: {e}, retries left: {retries-tries}")
tries+=1
if not success:
raise Exception(f"Failed to scrape page, MAX RETRIES {retries} EXCEEDED!!!")
if __name__ == "__main__":
MAX_RETRIES = 3
logger.info("Starting IMDB scrape")
url = "https://www.imdb.com/chart/top/"
scrape_movies(url, retries=MAX_RETRIES)
logger.info("Scrape complete")
Here are the results: Scrapfly finished in 7.483 seconds.
ScrapeOps Proxy Aggregator
With the ScrapeOps Proxy Aggregator, we use this proxy function instead. It's largely the same as our proxy function from the Scrapfly example. The main difference here is the `api_key` parameter; with Scrapfly, it's simply called `key`.
def get_scrapeops_url(url):
payload = {
"api_key": API_KEY,
"url": url,
}
proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
return proxy_url
Here is the full code for scraping IMDB with ScrapeOps.
import os
import requests
from bs4 import BeautifulSoup
import json
import csv
import logging
from urllib.parse import urlencode
## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
API_KEY = ""
with open("config.json", "r") as config_file:
config = json.load(config_file)
API_KEY = config["scrapeops_api_key"]
def get_scrapeops_url(url):
payload = {
"api_key": API_KEY,
"url": url,
}
proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
return proxy_url
def scrape_movies(url, location="us", retries=3):
success = False
tries = 0
while not success and tries <= retries:
response = requests.get(get_scrapeops_url(url))
try:
if response.status_code != 200:
raise Exception(f"Failed response from server, status code: {response.status_code}")
soup = BeautifulSoup(response.text, "html.parser")
json_tag = soup.select_one("script[type='application/ld+json']")
json_data = json.loads(json_tag.text)["itemListElement"]
movie_list_length = 0
movie_list = []
for item in json_data:
movie_list.append(item["item"])
movie_list_length+=len(json_data)
print(f"Movie list length: {len(json_data)}")
with open("scrapeops-top-250.json", "w") as file:
json.dump(movie_list, file, indent=4)
success = True
except Exception as e:
logger.error(f"Failed to process page: {e}, retries left: {retries-tries}")
tries+=1
if not success:
raise Exception(f"Failed to scrape page, MAX RETRIES {retries} EXCEEDED!!!")
if __name__ == "__main__":
MAX_RETRIES = 3
logger.info("Starting IMDB scrape")
url = "https://www.imdb.com/chart/top/"
scrape_movies(url, retries=MAX_RETRIES)
logger.info("Scrape complete")
The full scrape took 5.728 seconds using the ScrapeOps Proxy Aggregator.
Results
The ScrapeOps Proxy Aggregator was significantly faster, completing the scrape in 5.728 seconds while Scrapfly took 7.483 seconds for the same job. While results may vary, in our initial testing ScrapeOps was roughly 30% faster.
If you chose to follow along, your results will be different based on your hardware and internet connection.
That being said, in our testing, the ScrapeOps Proxy Aggregator was quite a bit faster than Scrapfly for the same scraping job.
Alternative: ScrapeOps Proxy API Aggregator
The ScrapeOps Proxy Aggregator gets us access to just about all the same functionality as Scrapfly with a much larger selection of plans.
- With Scrapfly, we can choose from 4 plans. With ScrapeOps, we get to choose from 8 different plans.
- The lowest tier plan with Scrapfly costs $30 per month while the lowest tier plan from ScrapeOps costs $9 per month.
- At the highest tier, Scrapfly costs $500 per month and ScrapeOps only costs $249 per month.
Scrapfly Plans
Plan | API Credits | Cost Per Normal Request | Monthly Price |
---|---|---|---|
Discovery | 200,000 | $0.00015 | $30 |
Pro | 1,000,000 | $0.0001 | $100 |
Startup | 2,500,000 | $0.0001 | $250 |
Enterprise | 5,500,000 | $0.00009 | $500 |
ScrapeOps Plans
API Credits | Cost Per Normal Request | Monthly Price | Scrapfly Equivalent |
---|---|---|---|
25,000 | $0.00036 | $9 | None |
50,000 | $0.0003 | $15 | None |
100,000 | $0.00019 | $19 | None |
250,000 | $0.000116 | $29 | Discovery: $30 ($0.00015/request) |
500,000 | $0.000108 | $54 | None |
1,000,000 | $0.000099 | $99 | Pro: $100 ($0.0001/request) |
2,000,000 | $0.0000995 | $199 | None |
3,000,000 | $0.000083 | $249 | Startup: $250 ($0.0001/request) |
As you can see in the table above, at every tier where Scrapfly offers an equivalent, ScrapeOps costs less per request. We also offer 5 plans that have no Scrapfly equivalent at all.
Troubleshooting
Issue #1: Request Timeouts
Request timeouts can be a real pain. Luckily, it's pretty easy to set a custom timeout with Python Requests. To handle these timeouts, we simply need to use the `timeout` keyword argument.
Take a look at the example snippet below. We set a timeout of 5 seconds.
import requests
# 5 second timeout
response = requests.get("https://httpbin.org/get", timeout=5)
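When you're going through a proxy, the round trip takes longer than a direct request, so it's worth pairing a more generous timeout with the proxy function from earlier and catching the exception. A sketch, assuming `get_scrapfly_url()` is defined as in the examples above:

```python
import requests

try:
    # Allow extra time since the request is relayed through Scrapfly
    response = requests.get(get_scrapfly_url("https://quotes.toscrape.com"), timeout=60)
    print(response.status_code)
except requests.exceptions.Timeout:
    print("Request timed out, consider retrying or raising the timeout")
```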
Issue #2: Handling CAPTCHAs
Dealing with CAPTCHAs can be a bit more difficult than timeout errors. Honestly, if you're receiving a CAPTCHA, something is likely not right with your scraper. Both Scrapfly and ScrapeOps are built to specifically avoid CAPTCHAs and bypass anti-bots.
First, retry your request. If you are consistently receiving CAPTCHAs with Scrapfly, enable `asp`. If you are consistently receiving them with ScrapeOps, use the `bypass` argument.
Another way of resolving this issue is with a 3rd party service like 2Captcha.
We also have a great article devoted entirely to CAPTCHAs here.
Issue #3: Invalid Response Data
Invalid response data is a really common issue in all areas of web development. To take care of these sorts of errors, you need to be aware of the status code that was sent. We've got a cheat sheet here.
Most importantly, understand your status code and solve the problem accordingly.
The Legal & Ethical Implications of Web Scraping
Legal Considerations
Here at ScrapeOps, we only scrape public data. This is a very important part of scraping the web legally. Public data is information that is freely visible to anyone, much like a billboard.
If you scrape private data (data gated behind a login), this falls under a completely separate set of IP and privacy laws.
If you choose to scrape private data, there are many potential consequences including:
- Terms of Service Violations: These can result in all sorts of headaches, including court orders and civil lawsuits.
- Computer Fraud and Other Hacking Charges: Depending on how you access your data and the rules governing that data, you can even face prison time. Violations of this sort don't always end with a financial penalty; some people are required to actually go to prison and serve hard time.
- Other Legal Consequences: Depending on what you do with said data, you can face all sorts of other legal headaches stemming from IP (intellectual property) and privacy laws that vary based on jurisdiction.
Ethical Consequences
When you agree to a site's Terms, it is usually treated as a legally binding contract. Websites have Terms and Conditions because they want you to follow a certain set of rules when accessing their product. Alongside the site's Terms, we should also take the `robots.txt` of the target site into consideration.
- Terms Violations: When you violate a legally binding contract, you are subject to any repercussions defined in that contract, including suspension and even a permanent ban. Depending on the terms, the target site might even have grounds to sue you.
- robots.txt Violations: Violating a site's robots.txt policies is not technically illegal. However, there are other consequences, such as reputational damage to you and your company. No company wants to be the next headline about unethical scraping practices.
Conclusion
In conclusion, you now know how to use both Scrapfly and the ScrapeOps Proxy Aggregator, and you've seen plenty of reasons to apply Scrapfly's advanced functionality when scraping.
You should also understand that ScrapeOps supports almost all the same functionalities at a lower price and a typically faster request speed. Take these new tools and go build something!
More Web Scraping Guides
ScrapeOps is loaded with learning resources. We even wrote the playbook on web scraping in Python. You can view it here. To view more of our proxy integration guides, take a look at the articles below.