ZenRows: Web Scraping Integration Guide
ZenRows is a powerful web scraping solution that simplifies extracting data from websites by handling common challenges like CAPTCHAs, IP blocks, and dynamic content. ZenRows offers a Scraper API and Residential Proxies, and they are releasing a Scraping Browser soon.
In this guide, we'll explore how to integrate ZenRows into your web scraping projects, enabling you to scrape data effortlessly while maintaining compliance and performance.
- TLDR: Scraping With ZenRows
- What is ZenRows?
- Setting Up the ZenRows API
- Advanced Functionality
- JavaScript Rendering
- Country Geotargeting
- Residential Proxies
- Custom Headers
- Static Proxies
- Screenshot Functionality
- Auto Parsing
- Case Study: IMDB Top 250 Movies
- Alternative: ScrapeOps Proxy Aggregator
- Troubleshooting
- Conclusion
- More Web Scraping Guides
TLDR: Web Scraping With ZenRows
Scraping with ZenRows is almost the same as it is with ScrapeOps. If you're just looking for a quick way to get started with it, go ahead and use the function below.
from urllib.parse import urlencode

API_KEY = "YOUR-ZENROWS-API-KEY"

def get_zenrows_url(url):
    payload = {
        "apikey": API_KEY,
        "url": url,
    }
    proxy_url = "https://api.zenrows.com/v1/?" + urlencode(payload)
    return proxy_url
To customize your proxy, take a look at the API docs here.
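As a quick usage sketch (this assumes the get_zenrows_url() helper and API_KEY defined above, and uses quotes.toscrape.com purely as a stand-in target), you can pass the proxied URL straight to Python Requests:

import requests

# Fetch a page through the ZenRows proxy URL built by get_zenrows_url() above.
response = requests.get(get_zenrows_url("https://quotes.toscrape.com/"))
print(response.status_code)
print(response.text[:500])  # preview the first 500 characters of the returned HTML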
What Is ZenRows?
Much like ScrapeOps, ZenRows is something of an all-in-one proxy solution. With their API, we can bypass anti-bots, rotate proxies, run a headless browser, and much more.
As you can see highlighted below, ZenRows is actually one of our providers for the ScrapeOps Proxy Aggregator. Their product is very similar to ours here at ScrapeOps.
Much like ScrapeOps, they allow us to set custom countries, wait for content to render, pass custom headers, set premium proxies and much more.
When we use the ZenRows API (much like the ScrapeOps API), here is how the base process goes:
- We send our `url` and our `apikey` to ZenRows.
- ZenRows attempts to get our `url` through one of their servers.
- ZenRows gets their response.
- ZenRows forwards the response back to us.
Throughout this process, ZenRows can rotate IP addresses and make all of our requests look like they're coming from somewhere else. Just like with ScrapeOps, there are many other bells and whistles we can use with the API, but the overall process remains pretty much the same.
- You tell the API which site you want to access.
- Their servers access the site for you.
- You scrape your desired site(s).
How Does the ZenRows API Work?
ZenRows is a proxy provider. This means that we send them a `url` and our `apikey`, and they send back the response from the website. They accomplish this by using different IP addresses to access the site.
There are numerous options we can use to customize our request, but overall the process remains much the same.
The table below contains a list of common parameters used with the ZenRows API. This list is non-exhaustive; you may view their full API documentation here.
Parameter | Description |
---|---|
apikey (required) | Your ZenRows API key (string) |
url (required) | The url you'd like to scrape (string) |
js_render | Render JavaScript components on the page (boolean) |
premium_proxy | Use a premium proxy (boolean) |
proxy_country | Use with a premium proxy to set your geolocation (string) |
session_id | Reuses an IP address for sticky sessions (int) |
device | Either "mobile" or "desktop" (string) |
original_status | Show the original status returned by the website (bool) |
wait_for | Wait for a specific CSS selector to show up on the page (string) |
wait | Wait a certain amount of time before returning the response (int) |
screenshot | Take a screenshot of the page (boolean) |
screenshot_fullpage | Take a screenshot of the full page (boolean) |
screenshot_selector | Take a screenshot of a certain CSS selector (string) |
Here is an example of a request you might make with the ZenRows API.
# pip install requests
import requests
url = "https://quotes.toscrape.com/"
api_key = "YOUR-ZENROWS-API-KEY"
params = {
"url": url,
"apikey": api_key,
}
response = requests.get('https://api.zenrows.com/v1/', params=params)
print(response.text)
Response Format
With ZenRows, we get the option to take our response as either JSON or HTML. This gives us the ability to better fine tune our scrape. Our responses come as HTML by default, but we can use an additional parameter to set our response to JSON.
Remember the code snippet from above? We'll make a small change to it.
# pip install requests
import requests
url = "https://quotes.toscrape.com/"
api_key = "YOUR-ZENROWS-API-KEY"
params = {
"url": url,
"apikey": api_key,
"json_response": "true"
}
response = requests.get('https://api.zenrows.com/v1/', params=params)
print(response.text)
To get a JSON response, we only need to add one parameter: `"json_response": "true"`. For a regular HTML response, we don't need to add anything.
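As a rough sketch of what you might do with that JSON response (the exact keys in the payload are defined by ZenRows, so inspect them or check the docs rather than relying on any particular field name):

# pip install requests
import requests

url = "https://quotes.toscrape.com/"
api_key = "YOUR-ZENROWS-API-KEY"

params = {
    "url": url,
    "apikey": api_key,
    "json_response": "true",
}

response = requests.get("https://api.zenrows.com/v1/", params=params)

# Parse the JSON body and see which fields ZenRows returned
# (typically the page HTML plus extra metadata; verify against the docs).
data = response.json()
print(list(data.keys()))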
ZenRows API Pricing
You can view ZenRows' pricing options below, from their lowest-cost plan up to their higher-cost tiers.
Plan | Residential Bandwidth | URL Limit | Price per Month |
---|---|---|---|
Developer | 12.73GB ($5.50/GB) | 250,000 ($0.28/1,000) | $69 |
Startup | 24.76GB ($5.25/GB) | 1,000,000 ($0.13/1,000) | $129 |
Business | 60GB ($5.00/GB) | 3,000,000 ($0.10/1,000) | $299 |
Business 500 | 111.11GB ($4.50/GB) | 6,000,000 ($0.08/1,000) | $499 |
Business 1K | 285.71GB ($3.50/GB) | 12,000,000 ($0.08/1,000) | $999 |
Business 2K | 643.92GB ($3.15/GB) | 25,000,000 ($0.08/1,000) | $1,999 |
Business 3K | 1,071.43GB ($2.80/GB) | 38,000,000 ($0.08/1,000) | $2,999 |
Custom | N/A | N/A | N/A |
With each of these plans, you only pay for successful requests. If the API fails to get your page, you pay nothing. Each plan also includes the following:
- Proxy Rotator
- User-Agent Rotator
- WAF Bypass
- Basic Analytics
- CAPTCHA Bypass
- Auto-parsing
- JavaScript Rendering
Response Status Codes
When using their API, there is a series of status codes we might get back. 200 is the one we want.
Status Code | Type | Possible Causes |
---|---|---|
200 | Success | It worked! |
400 | Bad Request | Forbidden Domain, Invalid Parameters |
401 | Unauthorized | Missing API Key, Invalid API Key |
402 | Payment Required | Usage Exceeded, Didn't Pay the Bill |
403 | Forbidden | User Not Verified, IP Address Blocked |
404 | Not Found | Site Not Found, Page Not Found |
405 | Not Allowed | Method Not Allowed |
407 | Proxy Authentication | Invalid Authorization Header |
413 | Content Too Large | Response Size Greater Than Limit |
422 | Unprocessable Entity | Failed to Retrieve Content |
424 | Failed Dependency | Failed to Solve CAPTCHA |
429 | Too Many Requests | Concurrency or Rate Limit Exceeded |
500 | Internal Server Error | Context Cancelled, Unknown Error |
502 | Bad Gateway | Could not parse Content |
504 | Gateway Timeout | Operation Exceeded Time Limit |
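Most of the failure codes above are either transient (429, 5xx) or configuration problems (400, 401, 402), so a simple wrapper that only retries the transient ones can save a lot of wasted requests. Here is a minimal sketch, not an official ZenRows pattern:

import time
import requests

RETRYABLE = {429, 500, 502, 504}  # transient errors that are worth retrying

def fetch_with_retries(params, retries=3, backoff=2):
    # Call the ZenRows API, retrying only on transient status codes.
    for attempt in range(retries):
        response = requests.get("https://api.zenrows.com/v1/", params=params)
        if response.status_code == 200:
            return response
        if response.status_code not in RETRYABLE:
            # Codes like 400, 401 and 402 won't fix themselves, so fail fast.
            raise Exception(f"Non-retryable status: {response.status_code}")
        time.sleep(backoff ** attempt)
    raise Exception(f"Still failing after {retries} attempts")

# Usage sketch:
# response = fetch_with_retries({"apikey": "YOUR-ZENROWS-API-KEY", "url": "https://quotes.toscrape.com/"})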
Setting Up the ZenRows API
We'll get started by setting up an account. Once you've signed up, you are given an API key. You can sign up with any of the following methods:
- GitHub
- Create an account with an email address and password
After signing up, you can navigate to their dashboard and see your API key located in the upper right. In the lower right portion of the screen, they also have a nifty little request builder. This is perfect for testing.
As you probably noticed, in the screenshot above, I exposed my API key for all you readers to see. No worries! We can change our API key very easily from the account settings tab.
Once you've got an API key, you're all set to start using the ZenRows API.
API Endpoint Integration
Now, let's talk about the API endpoints. We're only going to use one endpoint, very similar to how we use only one with ScrapeOps. Take a look at the line below from some of our earlier examples.
response = requests.get('https://api.zenrows.com/v1/', params=params)
Our base domain is `https://api.zenrows.com`. Pretty simple, right?
Our endpoint is `/v1`. To customize our requests, we send different parameters to this endpoint. Think back to the following snippet from earlier.
# pip install requests
import requests
url = "https://quotes.toscrape.com/"
api_key = "YOUR-ZENROWS-API-KEY"
params = {
"url": url,
"apikey": api_key,
"json_response":
}
response = requests.get('https://api.zenrows.com/v1/', params=params)
print(response.text)
The params we send in this case are `"url"`, `"apikey"`, and `"json_response"`. We'll send all of our custom parameters to this `v1` endpoint.
Proxy Port Integration
Proxy Port Integration tells our HTTP client (if it supports this) to route all of its requests through a certain location. This allows us to forward all of our requests through said proxy.
http://<YOUR_ZENROWS_API_KEY>:premium_proxy=true@proxy.zenrows.com:8001
Below is an example of how to do this using Python Requests.
# pip install requests
import requests
url = "https://https://quotes.toscrape.com"
proxy = "http://YOUR-SUPER-SECRET-API-KEY:@proxy.zenrows.com:8001"
proxies = {"http": proxy, "https": proxy}
response = requests.get(url, proxies=proxies, verify=False)
print(response.text)
This form of proxy integration is best used when you're dealing with tons of different functionality and you don't necessarily want fine control over the proxy. You just want to use the proxy and get on with your day.
SDK Integration
SDK (Software Development Kit) integration is an excellent option for developers, particularly beginners, who want to streamline their web scraping process without diving deep into the complexities of HTTP requests, handling proxies, and managing response parsing.
SDK integration is ideal in various scenarios, particularly when you want a quick, efficient, and user-friendly way to interact with a web scraping service. Here's when you should consider using it:
-
Beginner-Friendly Projects: If you're new to web scraping or API integration, using an SDK can significantly lower the learning curve. It allows you to focus on the core aspects of your project without getting bogged down by technical complexities.
-
Rapid Prototyping: When you're looking to build a prototype or proof of concept quickly, SDKs can help you deliver faster since you won't need to manually code every interaction with the scraping service.
-
Standard Use Cases: If your scraping needs fall within standard scenarios—like scraping eCommerce data, monitoring competitors, or collecting blog posts—SDK integration provides a ready-made solution that works out of the box.
-
Consistent Maintenance: If you need ongoing support and updates to handle changes in website structure, rate limits, or captcha systems, using an SDK ensures that your integration remains functional and up-to-date with minimal effort.
ZenRows also has an SDK (Software Development Kit). This method is much easier for beginners who might not be familiar with HTTP clients yet. These SDKs abstract away a large portion of the lower level HTTP work.
Take a look at the example below.
# pip install zenrows
from zenrows import ZenRowsClient
client = ZenRowsClient("YOUR-SUPER-SECRET-API-KEY")
url = "https://quotes.toscrape.com"
response = client.get(url)
print(response.text)
As you can see above, this approach has a much lower barrier to entry.
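The SDK accepts the same parameters as the raw API. The snippet below assumes the client forwards a params dict the way the SDK documentation describes, so double-check it against the current zenrows package before relying on it:

# pip install zenrows
from zenrows import ZenRowsClient

client = ZenRowsClient("YOUR-SUPER-SECRET-API-KEY")
url = "https://quotes.toscrape.com"

# Assumption: the SDK passes these params through to the API, just like a raw request.
params = {"js_render": "true", "premium_proxy": "true"}

response = client.get(url, params=params)
print(response.text)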
Managing Concurrency
Managing concurrency is pretty straightforward if you know what you're doing. One of the easiest ways to do this with Python Requests is to make use of `ThreadPoolExecutor`.
`ThreadPoolExecutor` gives us the ability to open a new pool with `x` number of threads. On each available thread, we call a function of our choosing.
import requests
from bs4 import BeautifulSoup
import concurrent.futures
from urllib.parse import urlencode
API_KEY = 'YOUR_API_KEY'
NUM_THREADS = 5
def get_proxy_url(url):
payload = {"api_key": API_KEY, "url": url}
proxy_url = 'https://api.zenrows.com/v1/' + urlencode(payload)
return proxy_url
## Example list of urls to scrape
list_of_urls = [
"https://quotes.toscrape.com/page/1/",
"https://quotes.toscrape.com/page/2/",
"http://quotes.toscrape.com/page/3/",
]
output_data_list = []
def scrape_page(url):
try:
response = requests.get(get_proxy_url(url))
if response.status_code == 200:
soup = BeautifulSoup(response.text, "html.parser")
title = soup.find("h1").text
## add scraped data to "output_data_list" list
output_data_list.append({
'title': title,
})
except Exception as e:
print('Error', e)
with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
executor.map(scrape_page, list_of_urls)
print(output_data_list)
Pay close attention to `executor.map()` in this situation.
- Our first argument is `scrape_page`: the function we want to call on each thread.
- Our second is `list_of_urls`: the list of arguments we want to pass into `scrape_page`.
Any other arguments to the function also get passed in as additional lists.
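To illustrate that last point, here is a small sketch (using a placeholder worker rather than the scraper above) showing how extra argument lists line up positionally in executor.map():

import concurrent.futures

def scrape_page(url, location):
    # Placeholder worker: a real scraper would fetch and parse `url` here.
    return f"{url} scraped from {location}"

urls = [
    "https://quotes.toscrape.com/page/1/",
    "https://quotes.toscrape.com/page/2/",
]
locations = ["us", "uk"]  # second argument list, paired with `urls` by position

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    for result in executor.map(scrape_page, urls, locations):
        print(result)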
Advanced Functionality
Advanced functionality was touched on briefly earlier. With advanced functionality, we can customize our scrape to do things like set our geolocation, render JavaScript, and much more.
There's a bit of a hangup when using these advanced functionalities, though. They cost extra... a lot extra. Check out the table below for a breakdown of these functionalities and their cost.
Parameter | API Cost X Normal | Description |
---|---|---|
js_render | 5x | render JavaScript on the webpage |
custom_headers | 1x | set custom headers to the server |
premium_proxy | 10x | use premium IP addresses |
proxy_country | 10x - 25x | set a custom geolocation, requires premium_proxy |
session_id | 1x | use to keep browsing sessions intact between requests |
device | 1x | "mobile" or "desktop" , "desktop" by default |
original_status | 1x | return the original status code from the site |
allowed_status_codes | 1x | return the content even when the status code is in this list |
wait_for | 5x | waits for a CSS selector, requires js_render |
wait | 5x | wait for a period of time, requires js_render |
block_resources | 5x | block certain resources, requires js_render |
json_response | 1x | return response as JSON instead of HTML |
css_extractor | Not Specified | extract elements with a certain CSS selector |
auto_parse | Not Specified | attempt to automatically parse the page |
markdown_response | Not Specified | return the parsed content as a markdown file |
screenshot | 5x | requires js_render , takes a screenshot of the page |
screenshot_fullpage | 5x | requires js_render , take a full page screenshot |
screenshot_selector | 5x | requires js_render , screenshot a certain element |
You can view their full API documentation here.
JavaScript Rendering
Many modern websites, especially those using JavaScript frameworks like React, Angular, or Vue, load data dynamically after the initial HTML is served. This means the content you're trying to scrape might not be immediately visible in the static HTML, requiring JavaScript to run before the desired data is accessible.
JavaScript rendering is essential when dealing with dynamic websites that rely on JavaScript to load content. Here are the key reasons to use it:
-
Access Dynamic Content: Many modern websites use JavaScript to load important data, such as product listings, reviews, or stock availability. Without JavaScript rendering, you’ll miss this dynamically loaded content because it doesn't appear in the initial HTML.
-
Scrape JavaScript-Heavy Websites: Sites built with frameworks like React, Angular, or Vue often deliver content dynamically through JavaScript. Rendering ensures you can scrape the full page, including elements that only appear after JavaScript execution.
-
Avoid Incomplete Data: If a page loads data asynchronously (e.g., product prices or user comments), traditional scraping may return empty or incomplete results. JavaScript rendering ensures all page elements are fully loaded before scraping.
-
Handle Single-Page Applications (SPAs): SPAs dynamically update the page without reloading it, making traditional scraping methods ineffective. JavaScript rendering allows you to scrape these applications by ensuring that all components are fully visible.
-
Improve Scraping Accuracy: By rendering JavaScript, you reduce the risk of missing critical information or encountering incomplete data, leading to more accurate and reliable scraping results.
When we tell ZenRows to render JavaScript, the browser will render JavaScript content on the page. We do this by setting the `js_render` param to `true`.
Here's an example in Python.
# pip install requests
import requests
url = "https://quotes.toscrape.com"
apikey = "YOUR_ZENROWS_API_KEY"
params = {
"url": url,
"apikey": apikey,
"js_render": 'true',
}
response = requests.get('https://api.zenrows.com/v1/', params=params)
print(response.text)
You can view the documentation for this here.
Controlling The Browser
ZenRows comes with a built-in headless browser. We can send instructions to this browser using their API. The instruction set is relatively simple.
# pip install requests
import requests
url = "https://httpbin.io/anything"
apikey = "YOUR_ZENROWS_API_KEY"
params = {
"url": url,
"apikey": apikey,
"js_render": "true",
}
response = requests.get("https://api.zenrows.com/v1/", params=params)
print(response.text)
Parameter | Description |
---|---|
wait | wait for a period of time |
wait_for | wait for a CSS selector |
json_response | return the response as JSON instead of HTML |
block_resources | block resources from loading |
js_instructions | instructions to run on the page, such as click |
screenshot | take a screenshot of the page |
screenshot_fullpage | take a full page screenshot |
Here is a snippet that contains `js_instructions` to click a button and wait a half second.
# pip install requests
import requests
url = 'https://www.example.com'
apikey = 'YOUR_ZENROWS_API_KEY'
params = {
'url': url,
'apikey': apikey,
'js_render': 'true',
'js_instructions': """[{"click":".button-selector"},{"wait":500}]""",
}
response = requests.get('https://api.zenrows.com/v1/', params=params)
print(response.text)
The browser control docs are available here.
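The `wait` and `wait_for` options from the table above follow the same pattern as the other parameters. Here is a sketch using `wait_for` (the `.quote` selector fits quotes.toscrape.com; swap in whatever selector your target page actually renders):

# pip install requests
import requests

url = "https://quotes.toscrape.com"
apikey = "YOUR_ZENROWS_API_KEY"

params = {
    "url": url,
    "apikey": apikey,
    "js_render": "true",   # wait_for requires JavaScript rendering
    "wait_for": ".quote",  # only return once this CSS selector appears
}

response = requests.get("https://api.zenrows.com/v1/", params=params)
print(response.text)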
Country Geotargeting
Websites often serve different data, pricing, or availability based on the region from which a visitor is accessing the site, and country geotargeting ensures that your scraper can access the exact content relevant to the target location.
Here are the main reasons to use geotargeting:
-
Access Location-Specific Content: Many websites deliver different content based on the visitor's location. Geotargeting allows you to scrape the exact content that users in specific regions would see.
-
Bypass Regional Restrictions: Some websites restrict access to certain data, features, or services based on the user's geographic location. Country geotargeting lets you bypass these restrictions by routing your scraping requests through proxies located in the target region.
-
Monitor International Competitors: If you're tracking competitors in multiple countries, geotargeting allows you to collect data on how they operate in different markets. This includes variations in pricing strategies, localized offerings, and marketing campaigns tailored to specific regions.
-
Perform Regional Market Research: Country geotargeting helps businesses gather insights for different markets. It allows you to scrape data specific to a target region, such as local customer reviews, product availability, or localized marketing strategies.
-
Localized SEO and Ad Tracking: If you're conducting SEO research, country geotargeting lets you see how websites rank in different countries, track regional keywords, or observe location-specific ads. It’s also useful for tracking how brands adjust their advertising and SEO strategies in various locations.
-
Test Website Localization: Developers can use geotargeting to ensure that websites are properly localized for different regions. This includes testing localized language versions, currency displays, and regional features to ensure they work correctly based on the user's location.
We can also use the proxy to choose our geolocation. We can do this by using the `premium_proxy` and `proxy_country` parameters.
# pip install requests
import requests
url = "https://www.example.com"
apikey = "YOUR_ZENROWS_API_KEY"
params = {
"url": url,
"apikey": apikey,
"premium_proxy": "true",
"proxy_country": "us"
}
response = requests.get('https://api.zenrows.com/v1/', params=params)
print(response.text)
Here is their list of country codes.
Country | Country Code |
---|---|
United States | "us" |
Canada | "ca" |
United Kingdom | "gb" |
Germany | "de" |
France | "fr" |
Spain | "es" |
Brazil | "br" |
Mexico | "mx" |
India | "in" |
Japan | "jp" |
China | "cn" |
You can view the full documentation for this here.
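If you want to compare what a page serves to different regions, one rough approach is to loop over a few of the country codes above. Keep in mind that every request here uses `premium_proxy`, so each one is billed at the premium rate:

# pip install requests
import requests

apikey = "YOUR_ZENROWS_API_KEY"
url = "https://www.example.com"

# Compare the response served to a few different countries.
for country in ["us", "gb", "de"]:
    params = {
        "url": url,
        "apikey": apikey,
        "premium_proxy": "true",
        "proxy_country": country,
    }
    response = requests.get("https://api.zenrows.com/v1/", params=params)
    print(country, response.status_code, len(response.text))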
Residential Proxies
A residential proxy is a type of proxy server that uses IP addresses assigned to real residential homes by internet service providers (ISPs). These IPs appear as if they come from everyday users rather than data centers, making them more difficult for websites to detect and block.
They provide more reliability and fewer interruptions, making them ideal for scraping websites with strict anti-bot protections.
Here are some solid reasons why residential proxies are used:
-
Avoid IP Blocking and Bans: Websites often block data center IPs or proxies because they are easily identifiable as non-human traffic. Residential proxies appear as real users, reducing the risk of being blocked or flagged as suspicious activity.
-
Access Geo-Restricted Content: Residential proxies can be used to simulate traffic from specific geographic regions, helping you access region-locked content, such as location-specific versions of websites, prices, or products.
-
Bypass CAPTCHAs and Anti-Scraping Measures: Many websites deploy sophisticated anti-scraping techniques like CAPTCHAs or rate limits to stop automated traffic. Residential proxies can bypass these measures by making the traffic appear to come from legitimate users, which reduces the likelihood of encountering CAPTCHAs or other obstacles.
-
Improve Scraping Success Rates: For large-scale web scraping projects, residential proxies increase the chances of successfully gathering data without interruptions or blocks.
-
High Anonymity: Residential proxies provide a high level of anonymity, as they obscure the identity and origin of the scraper. This allows for stealthy data collection while maintaining the appearance of regular user activity.
-
Consistent Web Sessions: Some residential proxies offer static IPs that allow you to maintain consistent sessions, which is important for tasks like logging into accounts, managing cookies, or scraping data that requires a persistent connection.
We can tell ZenRows to use a Premium Proxy (residential proxy) by setting `premium_proxy` to `true`. This tells ZenRows to use a residential IP address instead of a datacenter IP. Sites with stringent anti-bot measures tend to block datacenter IP addresses.
Here's a code example of how to use them.
# pip install requests
import requests
url = "https://www.example.com"
apikey = "YOUR_ZENROWS_API_KEY"
params = {
"url": url,
"apikey": apikey,
"premium_proxy": "true",
}
response = requests.get('https://api.zenrows.com/v1/', params=params)
print(response.text)
You can view their full Premium Proxy documentation here.
Custom Headers
When we're making requests, we sometimes need to add special headers. By default, proxy APIs manage these headers to optimize performance and ensure requests appear as normal user traffic.
However, most proxy APIs offer the flexibility to send custom headers when specific conditions or requirements arise.
The custom header functionality in proxy APIs allows you to manually specify HTTP request headers, which are key-value pairs sent from a client (like a web scraper) to a server.
Why Use Custom Headers?
Custom headers are essential in web scraping and API requests when you need more control over how your requests are handled by the target server. While proxy APIs optimize headers automatically for performance, there are specific situations where custom headers become necessary:
- Requesting Specific Data: Certain websites require specific headers to return the desired data. For example, sending the correct `User-Agent` or `Accept-Language` header can make the server respond with the appropriate content.
- POST Requests: When making POST requests, especially for form submissions or API interactions, custom headers like `Content-Type`, `Authorization`, and `X-Requested-With` are often required to ensure the request is processed correctly.
-
Bypassing Anti-Bot Systems: Some websites use sophisticated anti-bot systems that check headers to detect automated traffic. Custom headers can help you mimic real user behavior, making it harder for these systems to block your requests.
-
Handling Authorization: For websites or APIs that require authentication, custom headers like Authorization (e.g., for tokens or API keys) are essential for gaining access to protected resources.
Word of Caution
Misuse of custom headers can reduce the effectiveness of your scraping, making it crucial to implement custom headers carefully and only when absolutely necessary.
-
Decreased Performance: Using custom headers incorrectly can reduce the performance of your proxy requests. For example, sending static or poorly chosen headers can make your requests more likely to be flagged as automated, leading to blocks or captchas.
-
Static Headers Can Trigger Detection: If your custom headers remain the same across multiple requests, this consistency can trigger anti-scraping measures. Websites may detect that the requests are automated, leading to IP bans or additional verification steps.
-
Need for Dynamic Header Generation: For large-scale scraping operations, it’s important to continuously generate fresh, dynamic headers to avoid detection. Automated systems should be in place to ensure headers vary and appear authentic.
-
Use Only When Necessary: Custom headers should be used only when essential. In most cases, proxy APIs manage headers better, optimizing for performance and evading detection. Overriding these can sometimes do more harm than good if not handled carefully.
When we send requests through a proxy server, headers meant for the target site and headers meant for the proxy can sometimes get mixed up. To keep our custom headers, we set the `custom_headers` parameter to `"true"` and pass our headers along with the request as usual.
You can view an example of this below.
import requests
# Set the URL and API key
url = "https://api.zenrows.com/v1/"
params = {
"apikey": "YOUR_ZENROWS_API_KEY",
"url": "https://httpbin.io/anything",
"custom_headers": "true"
}
# Set the headers
headers = {
"Referer": "https://www.google.com"
}
# Make the request
response = requests.get(url, headers=headers, params=params)
# Print the response
print(response.text)
Take a look at their docs here.
Static Proxies
Static proxies, often referred to as sticky sessions, are proxy servers that maintain a consistent IP address for an extended period or for the duration of a user session. For instance, if you want to log in on a site and remain logged in through the ZenRows Proxy, you'll need a Static Proxy.
Static proxies offer several advantages that make them valuable for specific use cases in web scraping and online activities. Here are the key reasons to use static proxies:
-
Session Consistency: Static proxies are ideal for tasks that require maintaining session state, such as logging into a website, managing cookies, or interacting with user accounts.
-
Avoiding CAPTCHA and Verification: Using the same IP address consistently can help reduce the likelihood of triggering CAPTCHA challenges or account verification processes.
-
Long-Term Data Collection: For projects that involve long-term scraping or data collection, static proxies allow you to accumulate data over time without losing context or identity.
-
Improved Performance: Since static proxies do not change IPs frequently, they can reduce latency and improve the speed of requests, as you won’t have to negotiate new sessions or face interruptions caused by IP switching.
-
Easier Account Management: When managing multiple accounts on platforms that have strict anti-bot measures, using static proxies can help you operate multiple accounts without raising flags due to IP changes, making it easier to manage activities associated with each account.
-
Reduced Risk of IP Blacklisting: With static proxies, the risk of getting your IP blacklisted is lower compared to using a pool of rotating proxies, where frequent changes may draw attention and result in blocks.
In order to make use of a Static Proxy, you need a session ID. To do this, we pass `session_id` as `12345`. This tells the ZenRows server that you'd like to keep your session intact.
But how does ZenRows remember which session is mine?
ZenRows tracks your session using your API key.
Here is an example.
import requests
# Set the URL and API key
url = "https://api.zenrows.com/v1/"
params = {
"apikey": "YOUR_ZENROWS_API_KEY",
"url": "https://quotes.toscrape.com",
"session_id": 12345
}
# Make the request
response = requests.get(url, params=params)
# Print the response
print(response.text)
Take a look at the docs here.
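To see the sticky session in action, one rough check is to request an IP-echo endpoint twice with the same `session_id` and confirm the address doesn't change (httpbin.io/ip is just a convenient test target):

import requests

params = {
    "apikey": "YOUR_ZENROWS_API_KEY",
    "url": "https://httpbin.io/ip",
    "session_id": 12345,
}

# Two requests with the same session_id should report the same IP address.
for i in range(2):
    response = requests.get("https://api.zenrows.com/v1/", params=params)
    print(f"Request {i + 1}: {response.text.strip()}")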
Screenshot Functionality
The screenshot functionality in proxy APIs allows you to capture images of web pages as they are rendered in a browser. This feature typically takes a snapshot of the page at a specific moment, including all visible content, styles, and layouts, providing a visual representation of the website.
The screenshot functionality offers several key benefits that enhance the effectiveness of web scraping and data collection projects. Here are the primary reasons to utilize this feature:
-
Visual Documentation: Screenshots serve as a visual record of web pages at specific points in time, making them useful for audits, compliance checks, and tracking changes in design or content.
-
Error and Bug Reporting: When issues arise during scraping or site interaction, screenshots can help document errors, layout problems, or unexpected behavior.
-
Competitive Analysis: Capturing screenshots of competitors' websites enables businesses to analyze their design, layout, and content strategies.
-
Content Verification: Screenshots can provide proof of the scraped content, ensuring that it matches expectations or contractual obligations.
-
User Experience Testing: In usability testing, screenshots can be used to evaluate the design and layout of web applications or websites.
-
Monitoring Changes: Regularly capturing screenshots of a webpage allows you to track changes over time. This is particularly useful for monitoring dynamic content, such as pricing updates or promotional changes.
With ZenRows, screenshots are really easy. We get a `screenshot` argument, a `screenshot_fullpage` argument, and on top of all that, we have `screenshot_selector` to take a shot of a specific element on the page. Go ahead and take a look at the code examples below.
Here, we take a regular screenshot.
import requests
# Set the URL and API key
url = "https://api.zenrows.com/v1/"
params = {
"apikey": "YOUR_ZENROWS_API_KEY",
"url": "https://httpbin.io/anything",
"js_render": "true",
"screenshot": "true"
}
# Make the request
response = requests.get(url, params=params)
# Print the response
print(response.text)
Here is a full page screenshot.
import requests
# Set the URL and API key
url = "https://api.zenrows.com/v1/"
params = {
"apikey": "YOUR_ZENROWS_API_KEY",
"url": "https://httpbin.io/anything",
"js_render": "true",
"screenshot_fullpage": "true"
}
# Make the request
response = requests.get(url, params=params)
# Print the response
print(response.text)
Our final example here is a screenshot of a specific page element.
import requests
# Set the URL and API key
url = "https://api.zenrows.com/v1/"
params = {
"apikey": "YOUR_ZENROWS_API_KEY",
"url": "https://quotes.toscrape.com",
"js_render": "true",
"screenshot_selector": "div.container"
}
# Make the request
response = requests.get(url, params=params)
# Print the response
print(response.text)
The full documentation for screenshots is available here.
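Printing response.text isn't very useful for an image, so in practice you'll usually want to write the bytes to disk instead. The sketch below assumes the API returns the screenshot as the raw response body; if your response comes back wrapped in JSON, pull the image data out of that payload per the docs above:

import requests

params = {
    "apikey": "YOUR_ZENROWS_API_KEY",
    "url": "https://quotes.toscrape.com",
    "js_render": "true",
    "screenshot": "true"
}

response = requests.get("https://api.zenrows.com/v1/", params=params)

# Save the raw bytes of the response as a PNG file.
with open("screenshot.png", "wb") as file:
    file.write(response.content)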
Auto Parsing
Auto parsing (also known as auto extract) is a feature that automatically identifies and extracts key data from web pages without the need for manual coding to locate specific HTML elements.
Auto Parsing is an excellent feature. With Auto Parsing, we can actually tell ZenRows to scrape the site for us! With this functionality, we only need to focus on our jobs as developers. We don't need to pick through all the nasty HTML.
Auto parsing is really handy when you need a quick, user-friendly, and efficient way to extract data, especially for dynamic or complex websites. It reduces time spent on setup and maintenance, making it ideal for large-scale projects or for those without extensive technical skills.
This snippet tells ZenRows to parse the site for us.
import requests
# Set the URL and API key
url = "https://api.zenrows.com/v1/"
params = {
"apikey": "YOUR_ZENROWS_API_KEY",
"url": "https://www.amazon.com/dp/B01LD5GO7I/",
"autoparse": "true"
}
# Make the request
response = requests.get(url, params=params)
# Print the response
print(response.text)
You can use this feature to parse Amazon, YouTube, Zillow and many many more sites. You can view the full list here. The docs for this feature are available here.
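Since auto parsing returns structured data rather than raw HTML, a reasonable next step is to load it as JSON and inspect the fields. The exact keys depend on the site being parsed, so this is only a sketch; don't hard-code field names without checking the output first:

import requests
import json

params = {
    "apikey": "YOUR_ZENROWS_API_KEY",
    "url": "https://www.amazon.com/dp/B01LD5GO7I/",
    "autoparse": "true"
}

response = requests.get("https://api.zenrows.com/v1/", params=params)

# The parsed result is JSON; the available fields vary by site.
data = response.json()
print(json.dumps(data, indent=2)[:1000])  # preview the structure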
Case Study: Using Scraper APIs on IMDb Top 250 Movies
Now, it's time for a case study. We're gonna pit ScrapeOps and ZenRows head to head and see how they match up.
We're going to scrape the top 250 movies from IMDB. Once we've scraped our data, we'll save it to a JSON file.
The two code examples below are virtually identical. The major difference is the proxy function. Aside from the base domain that we're pinging, we use the `api_key` param with ScrapeOps, and with ZenRows we use `apikey`.
Here is the proxy function for ScrapeOps:
def get_scrapeops_url(url):
payload = {
"api_key": API_KEY,
"url": url,
}
proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
return proxy_url
Here is the same function for ZenRows.
def get_zenrows_url(url):
payload = {
"apikey": API_KEY,
"url": url,
}
proxy_url = "https://api.zenrows.com/v1/?" + urlencode(payload)
return proxy_url
The full ScrapeOps code is available for you below.
import os
import requests
from bs4 import BeautifulSoup
import json
import csv
import logging
from urllib.parse import urlencode
import concurrent.futures
## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
API_KEY = ""
with open("config.json", "r") as config_file:
config = json.load(config_file)
API_KEY = config["scrapeops_api_key"]
def get_scrapeops_url(url):
payload = {
"api_key": API_KEY,
"url": url,
}
proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
return proxy_url
def scrape_movies(url, location="us", retries=3):
success = False
tries = 0
while not success and tries <= retries:
response = requests.get(get_scrapeops_url(url))
try:
if response.status_code != 200:
raise Exception(f"Failed response from server, status code: {e}")
soup = BeautifulSoup(response.text, "html.parser")
json_tag = soup.select_one("script[type='application/ld+json']")
json_data = json.loads(json_tag.text)["itemListElement"]
movie_list_length = 0
movie_list = []
for item in json_data:
movie_list.append(item["item"])
movie_list_length+=len(json_data)
print(f"Movie list length: {len(json_data)}")
with open("scrapeops-top-250.json", "w") as file:
json.dump(movie_list, file, indent=4)
success = True
except Exception as e:
logger.error(f"Failed to process page: {e}, retries left: {retries-tries}")
tries+=1
if not success:
raise Exception(f"Failed to scrape page, MAX RETRIES {retries} EXCEEDED!!!")
if __name__ == "__main__":
MAX_RETRIES = 3
logger.info("Starting IMDB scrape")
url = "https://www.imdb.com/chart/top/"
scrape_movies(url, retries=MAX_RETRIES)
logger.info("Scrape complete")
This code took 5.401 seconds to run.
Here is our ZenRows example.
import os
import requests
from bs4 import BeautifulSoup
import json
import csv
import logging
from urllib.parse import urlencode
import concurrent.futures
## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
API_KEY = ""
with open("config.json", "r") as config_file:
config = json.load(config_file)
API_KEY = config["zenrows_api_key"]
def get_zenrows_url(url):
payload = {
"apikey": API_KEY,
"url": url,
}
proxy_url = "https://api.zenrows.com/v1/?" + urlencode(payload)
return proxy_url
def scrape_movies(url, location="us", retries=3):
success = False
tries = 0
while not success and tries <= retries:
response = requests.get(get_zenrows_url(url))
try:
if response.status_code != 200:
raise Exception(f"Failed response from server, status code: {response.status_code}")
soup = BeautifulSoup(response.text, "html.parser")
json_tag = soup.select_one("script[type='application/ld+json']")
json_data = json.loads(json_tag.text)["itemListElement"]
movie_list_length = 0
movie_list = []
for item in json_data:
movie_list.append(item["item"])
movie_list_length+=len(json_data)
print(f"Movie list length: {len(json_data)}")
with open("zenrows-top-250.json", "w") as file:
json.dump(movie_list, file, indent=4)
success = True
except Exception as e:
logger.error(f"Failed to process page: {e}, retries left: {retries-tries}")
tries+=1
if not success:
raise Exception(f"Failed to scrape page, MAX RETRIES {retries} EXCEEDED!!!")
if __name__ == "__main__":
MAX_RETRIES = 3
logger.info("Starting IMDB scrape")
url = "https://www.imdb.com/chart/top/"
scrape_movies(url, retries=MAX_RETRIES)
logger.info("Scrape complete")
Below is the output from the ZenRows example.
ZenRows was barely faster: 5.401 - 4.878 = 0.523 seconds. With roughly a half-second difference, this is negligible. Depending on time and location, either proxy could come out faster.
Since the ScrapeOps API uses ZenRows as one of its providers under the hood, you actually will probably get more reliability out of ScrapeOps. If ZenRows fails, ScrapeOps will try a different provider.
Alternative: ScrapeOps Proxy API Aggregator
The ScrapeOps Proxy API provides an excellent alternative to ZenRows. With ScrapeOps, you actually get access to ZenRows under the hood along with a boatload of other proxy providers.
We also get better pricing from ScrapeOps. As you saw earlier in this article, ZenRows' lowest-tier subscription costs $69 per month at $0.28 per 1,000 URLs. With ScrapeOps, we can get basically the same plan for $29 per month.
With the ScrapeOps Proxy API, you can get virtually the same plan for less than half the price. Since ScrapeOps uses ZenRows as a provider, you still get access to ZenRows as well.
Troubleshooting
Issue #1: Request Timeouts
When scraping, timeouts can be an unending source of headaches. To handle timeouts with Python Requests, we can use the `timeout` argument. The snippet below shows how to properly set a timeout.
import requests
# 5 second timeout
response = requests.get("https://httpbin.org/get", timeout=5)
Issue #2: Handling CAPTCHAs
If your proxy service is making you submit CAPTCHA requests, something is wrong. Both ScrapeOps and ZenRows are built to bypass CAPTCHAs for you by default. However, sometimes proxy providers can fail. If you run into a CAPTCHA, first, try to submit the request again. If you are consistently being prompted to complete a CAPTCHA, ZenRows allows you to pass any of the following arguments:
[
{"solve_captcha": {"type": "hcaptcha"}},
{"solve_captcha": {"type": "recaptcha"}},
{"solve_captcha": {"type": "cloudflare_turnstile"}},
{"solve_captcha": {"type": "hcaptcha", "options": {"solve_inactive": true}}},
{"solve_captcha": {"type": "recaptcha", "options": {"solve_inactive": true}}}
]
You can also use a service like 2captcha. We have an excellent article on bypassing CAPTCHAs here.
Issue #3: Invalid Response Data
To deal with invalid responses, you need to check the status code. Check out ZenRows error codes here. The ScrapeOps error codes are available here.
In most cases, you need to double-check your parameters or make sure your bill is paid. Every once in a while, you may receive a different error code that you can find in the links above.
The Legal & Ethical Implications of Web Scraping
Scraping the web is generally considered legal as long as you're scraping public data. If you don't have to log in to view the data, it is considered public information and therefore public data, much like a sign posted in the middle of your town: reading it (and even taking a picture) is perfectly fine because it's public information.
Scraping private data (data gated behind a login) from the web is completely different legal territory. When data is private, you're subject to the same laws and intellectual property policies as the sites you're scraping.
However, even when we scrape public data, we're subject to both a website's terms and conditions and their `robots.txt` file. You can view those for IMDB below.
Violating either of these could result in either suspension or even permanent banning.
Potential Consequences of Misuse
-
Account Suspension or Blocking: If scraping is done excessively or against the site's ToS, your IP address or account could be banned. This can permanently prevent access to the target site.
-
Legal Penalties: Improper scraping can result in lawsuits, hefty fines, and legal penalties. For example, companies like LinkedIn have taken legal actions against unauthorized scrapers for violating their ToS, claiming damages for lost revenue and resources.
-
Reputation Damage: Misusing web scraping tools can damage the reputation of individuals or businesses involved, especially in cases where scraping leads to publicized legal disputes or privacy violations.
-
Risk to Users: If scraped data contains personal or sensitive information, misuse can harm the individuals involved. This may expose scrapers to lawsuits or fines under data protection laws, making it critical to anonymize or aggregate sensitive data to avoid direct harm to users.
Web scraping can provide immense value for business insights, competitive research, and data analysis, but it must be done responsibly.
Always respect website terms of service, comply with privacy policies, and ensure that your scraping activities remain within legal and ethical boundaries. By doing so, you protect yourself from legal repercussions and help foster a fair, transparent digital ecosystem.
Conclusion
When using ZenRows, there are numerous ways we can access websites. Whether you're using their SDK, calling the API directly, or configuring the proxy straight into your HTTP client, you have an easy and reliable way to get your data.
While the price barrier to ZenRows might seem pretty high, we can also gain access to ZenRows under the hood by using ScrapeOps for about half the price.
Both of these solutions will help you get the data you need.
More Web Scraping Guides
At ScrapeOps, we've got tons of learning resources. Whether you're brand new to scraping or a hardened developer, we have something for you. We wrote the playbook on scraping with Python.
Bookmark one of the articles below and level up your skillset!