Zyte API: Web Scraping Integration Guide
Proxy management is an integral part of the Zyte API. Their Smart Proxy Manager, which is currently being merged into the Zyte API, "automatically selects the leanest set of proxies and techniques to keep your crawl healthy". This automates your proxy connections so all you have to focus on is writing your scraper.
Here, we'll go through the process of signing up for and using the Zyte API, and then pit it head to head against the ScrapeOps Proxy Aggregator.
- TLDR: Scraping With Zyte API
- What is the Zyte API?
- Setting Up
- Advanced Functionality
- JavaScript Rendering
- Geotargeting
- Residential Proxies
- Custom Headers
- Static Proxies
- Screenshots
- Auto Parsing
- Case Study: Top 250 Movies from IMDB
- Alternative: ScrapeOps Proxy Aggregator
- Conclusion
- More Web Scraping Guides
Need help scraping the web?
Then check out ScrapeOps, the complete toolkit for web scraping.
TLDR: Web Scraping With Zyte API
With proxy ports, getting started with the Zyte API is pretty easy.
Download their CA Certificate and configure it using their instructions for your OS. Once you've done that, you're ready to go.
Here's an example to get you started. If you run into SSL issues, you can pass `verify=False` into your requests.
import requests
import json
config = {}
with open("config.json") as file:
config = json.load(file)
ca_cert_path = "zyte-ca.crt"
proxies = {
"http": f"http://{config['zyte_api_key']}:@api.zyte.com:8011",
"https": f"http://{config['zyte_api_key']}:@api.zyte.com:8011"
}
response = requests.get("https://toscrape.com", proxies=proxies, verify=ca_cert_path)
print(response.content)
When you hook into their proxy port, you can set it and forget it.
- Make sure to pass `proxies=proxies` and `verify="path-to-your-ca-certificate"`, and you can continue building everything else like you would normally.
- Make sure to use these proxies for responsible web scraping. Don't disseminate private data, and don't use the API to violate any other site's terms and conditions.
What Is The Zyte API?
The Zyte Smart Proxy Manager is a portion of the Zyte API. Their Smart Proxy Manager automatically keeps a list of available, healthy proxies and selects the best one for your specific scrape.
When we're scraping the web, more often than not, we're bogged down writing parsers and trying to extract data from our target sites. With both the Zyte API and the ScrapeOps Proxy Aggregator, the proxy management gets handled for you.
Both of these solutions use rotating proxies. They allow us to scrape more efficiently by getting us past CAPTCHAs and anti-bot systems, and they even allow us to render JavaScript content on the page before sending our response back.
All in all, this makes scraping a site far easier than manual proxy management. When you manage proxies manually, you have to create them, maintain a list of them, and select the best one.
When we use proxy managers like the ones mentioned above, all we have to worry about is our scrape and our normal code. We don't need to write tons of boilerplate or manage infrastructure; these products handle it for us.
How Does The Zyte API Work?
Zyte's API has a pretty simple function when we examine it at the highest level. We use it to gain access to the target site. When we break down what's actually going on, their Smart Proxy Manager is doing a ton more than you would think.
When you make a request to the Smart Proxy Manager through the Zyte API, the following happens:
- Zyte picks the best available proxy out of its pool.
- Using that proxy, Zyte fetches the page and executes any additional instructions we gave it (like rendering JavaScript).
- Zyte ensures that we received a valid response. If we did not, it will repeat steps 1 and 2 until we get one.
- Zyte sends the response back to us.
Let's make a simple request using the Zyte API. Start by creating a `config.json` file. We'll use this to hold our API keys. Here are its contents.
Our scrapers will read our API keys from this file so we don't have to hardcode them into the scraper (it's bad practice to hardcode API keys!).
{
"scrapeops_api_key": "YOUR-SCRAPEOPS-API-KEY",
"zyte_api_key": "YOUR-ZYTE-API-KEY"
}
Once we've got our API keys stored, we need to do something with them. Before we can use the API successfully, we need to set up the Zyte CA Certificate.
You can find instructions for that here. Follow the instructions specific to your OS.
- If you are on Windows, follow the instructions for Windows.
- If you are on Linux, follow the Linux instructions.
- If you are on Mac, follow the Mac instructions.
The easiest way to use the certificate with Requests is to simply specify the path to it in your code.
For this tutorial, I'm just going to keep the certificate inside my project folder; that makes it easy to find. Here's the output from my `ls` command.
You can see that the certificate is highlighted.
Now, to make a simple request, we need to read both our config file and our CA certificate.
import requests
import json
config = {}
with open("config.json") as file:
config = json.load(file)
ca_cert_path = "zyte-ca.crt"
proxies = {
"http": f"http://{config['zyte_api_key']}:@api.zyte.com:8011",
"https": f"http://{config['zyte_api_key']}:@api.zyte.com:8011"
}
response = requests.get("https://toscrape.com", proxies=proxies, verify=ca_cert_path)
print(response.text)
- First, we create a variable to hold our configuration.
- Then, we read the config file and load our configuration into our program using `json.load()`.
- We specify the path to both the config file and our CA Certificate.
- When we make a request, we need to pass these things in along with the request:
  - `proxies=proxies` tells Requests to use the proxies we set up.
  - `verify=ca_cert_path` tells Requests to use the CA Certificate we downloaded for verification.
Response Format
In the example above, our response came in HTML format by default. You can view the full HTML below.
<!DOCTYPE html>
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Scraping Sandbox</title>
<link href="./css/bootstrap.min.css" rel="stylesheet">
<link href="./css/main.css" rel="stylesheet">
</head>
<body>
<div class="container">
<div class="row">
<div class="col-md-1"></div>
<div class="col-md-10 well">
<img class="logo" src="img/zyte.png" width="200px">
<h1 class="text-right">Web Scraping Sandbox</h1>
</div>
</div>
<div class="row">
<div class="col-md-1"></div>
<div class="col-md-10">
<h2>Books</h2>
<p>A <a href="http://books.toscrape.com">fictional bookstore</a> that desperately wants to be scraped. It's a safe place for beginners learning web scraping and for developers validating their scraping technologies as well. Available at: <a href="http://books.toscrape.com">books.toscrape.com</a></p>
<div class="col-md-6">
<a href="http://books.toscrape.com"><img src="./img/books.png" class="img-thumbnail"></a>
</div>
<div class="col-md-6">
<table class="table table-hover">
<tr><th colspan="2">Details</th></tr>
<tr><td>Amount of items </td><td>1000</td></tr>
<tr><td>Pagination </td><td>✔</td></tr>
<tr><td>Items per page </td><td>max 20</td></tr>
<tr><td>Requires JavaScript </td><td>✘</td></tr>
</table>
</div>
</div>
</div>
<div class="row">
<div class="col-md-1"></div>
<div class="col-md-10">
<h2>Quotes</h2>
<p><a href="http://quotes.toscrape.com/">A website</a> that lists quotes from famous people. It has many endpoints showing the quotes in many different ways, each of them including new scraping challenges for you, as described below.</p>
<div class="col-md-6">
<a href="http://quotes.toscrape.com"><img src="./img/quotes.png" class="img-thumbnail"></a>
</div>
<div class="col-md-6">
<table class="table table-hover">
<tr><th colspan="2">Endpoints</th></tr>
<tr><td><a href="http://quotes.toscrape.com/">Default</a></td><td>Microdata and pagination</td></tr>
<tr><td><a href="http://quotes.toscrape.com/scroll">Scroll</a> </td><td>infinite scrolling pagination</td></tr>
<tr><td><a href="http://quotes.toscrape.com/js">JavaScript</a> </td><td>JavaScript generated content</td></tr>
<tr><td><a href="http://quotes.toscrape.com/js-delayed">Delayed</a> </td><td>Same as JavaScript but with a delay (?delay=10000)</td></tr>
<tr><td><a href="http://quotes.toscrape.com/tableful">Tableful</a> </td><td>a table based messed-up layout</td></tr>
<tr><td><a href="http://quotes.toscrape.com/login">Login</a> </td><td>login with CSRF token (any user/passwd works)</td></tr>
<tr><td><a href="http://quotes.toscrape.com/search.aspx">ViewState</a> </td><td>an AJAX based filter form with ViewStates</td></tr>
<tr><td><a href="http://quotes.toscrape.com/random">Random</a> </td><td>a single random quote</td></tr>
</table>
</div>
</div>
</div>
</div>
</body>
</html>
We can use the Zyte API to customize our parameters. Take a look at the code example below. Something that might seem strange, but is actually more secure: the Zyte API uses POST requests instead of GET.
This keeps your API key from being exposed in a URL. Many APIs have you send a GET request with your API key in the query parameters; Zyte instead has you send your key in an Authorization header.
Our response already comes in JSON format, but the body is Base64 encoded. This encoding lets binary content travel safely inside the JSON response.
Once we've received our response, we can go ahead and decode it using Python's built-in `base64` library. The example below prints both the encoded and decoded responses so you can see the difference.
import requests
import json
from base64 import b64decode
with open("config.json") as file:
config = json.load(file)
api_response = requests.post(
"https://api.zyte.com/v1/extract",
auth=(config["zyte_api_key"], ""),
json={
"url": "https://toscrape.com",
"httpResponseBody": True
}
)
json_response = api_response.json()
print("------------------------Raw Response--------------------------")
print(json.dumps(json_response, indent=4))
if "httpResponseBody" in json_response:
json_response["httpResponseBody"] = b64decode(json_response["httpResponseBody"]).decode('utf-8')
print("------------------------Decoded Response----------------------")
print(json.dumps(json_response, indent=4))
Zyte API Pricing
Take a look at their pricing below. These are tiers for non-rendered responses. Depending on what you choose to do with their API, costs may vary.
With the PAYG (pay as you go) plan you only pay for the data you actually use. There are 4 other separate tiers if you choose to go with a monthly plan instead.
The table in the screenshot above can be a bit difficult to understand. Here's our breakdown of it.
- The price per request varies based on the difficulty of the website.
- There are 5 tiers of difficulty based on what resources are required to scrape the site; the more residential IPs and compute resources your scrape needs, the more the request will cost.
- In the costs section, we show the minimum and maximum price per 1000 requests.
Plan | Cost per 1000 Requests | Monthly Price |
---|---|---|
PAYG | $0.20 - $1.90 | Variable (based on usage) |
Monthly-100 | $0.10 - $0.95 | $100 |
Monthly-350 | $0.07 - $0.65 | $350 |
Monthly-1000 | $0.05 - $0.52 | $1000 |
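To put those numbers in perspective, here's a small, hypothetical cost estimate using the PAYG prices from the table above. The `estimate_cost()` helper and the request volume are made up for illustration; your actual cost depends on the difficulty tier Zyte assigns to each site.

```python
# Hypothetical cost estimate based on the per-1,000-request prices above.
def estimate_cost(total_requests, price_per_1000):
    return (total_requests / 1000) * price_per_1000

requests_needed = 100_000

# PAYG plan: cheapest vs. most expensive difficulty tier
print(estimate_cost(requests_needed, 0.20))  # $20.00
print(estimate_cost(requests_needed, 1.90))  # $190.00
```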
Response Status Codes
In all facets of web development (scraping included), status codes are important. As you might already know, 200 indicates a successful response.
Anything other than 200 typically indicates that something went wrong somewhere. To troubleshoot these status codes, take a look at the table below.
Here are the status codes that Zyte includes in their documentation.
Status Code | Meaning | Description |
---|---|---|
200 | Success | Everything worked! |
400 | Invalid | Invalid request or JSON information |
401 | Authorization Error | Issues with the way you're sending your API key. |
403 | Account Suspension | Account has been suspended from accessing the API. |
404 | Site Not Found | The site was not found at the requested domain. |
422 | Incompatible Parameters | Double check your parameters; they're conflicting. |
429 | Over User Limit | You've exceeded your rate limit. |
451 | Forbidden Domain | Zyte API doesn't permit access to the requested site. |
500 | Internal Server Error | Zyte experienced an internal issue. |
503 | Overloaded | Zyte is overloaded, try again later. |
520 | Download Error | Temporary error retrieving content, try again. |
521 | Download Error | Permanent download error, open a support ticket. |
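Since 429, 503, and 520 are documented as temporary conditions, it often makes sense to retry those requests after a short delay. Here's a minimal sketch of that idea, reusing the `/extract` endpoint and `config.json` file from the examples above; the backoff values are arbitrary.

```python
import time
import json
import requests

with open("config.json") as file:
    config = json.load(file)

# Status codes Zyte documents as temporary: retrying may succeed.
RETRYABLE = {429, 503, 520}

def fetch_with_retries(url, retries=3, backoff=2):
    for attempt in range(retries + 1):
        response = requests.post(
            "https://api.zyte.com/v1/extract",
            auth=(config["zyte_api_key"], ""),
            json={"url": url, "httpResponseBody": True},
        )
        if response.status_code == 200:
            return response.json()
        if response.status_code in RETRYABLE and attempt < retries:
            # Wait a little longer after each failed attempt.
            time.sleep(backoff * (attempt + 1))
            continue
        raise Exception(f"Request failed, status code: {response.status_code}")

result = fetch_with_retries("https://toscrape.com")
print(result.keys())
```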
Setting Up Zyte API
Now that we've got some background information on Zyte's API, let's get started.
- First, create an account, either with Google or using an email and password.
- Afterward, you can select a trial plan. You can choose either Zyte API or Smart Proxy Management.
- Smart Proxy Management was a separate product, but is now being merged with the Zyte API.
- Both of these options will give you access to the Zyte API via your API key and you'll receive a $5 credit to your account for a free trial.
- Free trials last until you either use the $5 or until 30 days have passed: whichever is sooner.
After setting up your free trial, you'll need to go through a checkout process where you enter your credit/debit card information. You will be charged a minimal amount (like $1) and it will be immediately refunded. The purpose for this is to ensure that your card works.
As we mentioned earlier, almost all the requests we send to Zyte will use the POST method. This is more secure, but it does make our code slightly more difficult to write. Our API keys are sent in a secure Authorization header. If you remember from earlier, Python Requests abstracts this away with the `auth` argument in the request.
To view your API key, select the Zyte API dropdown and click API access.
You'll then be taken to a screen where you can view and replace your keys.
The Zyte API supports integration via the following methods:
- `/extract` endpoint: This is where we sent our POST request in the second example.
- Proxy ports: Our first code example used proxy ports; these let you set up your connection and forget about managing the proxy afterward.
- SDKs: python-zyte-api and scrapy-zyte-api. These SDKs (software development kits) allow you to get started with their API quickly while abstracting away much of the HTTP handling and authentication.
API Endpoint Integration
We've already used the Zyte REST API once. With Endpoint Integration, we simply make all of our HTTP requests to a specific endpoint and receive standard HTTP responses from Zyte's server.
As mentioned before, all of our requests go to the `/extract` endpoint. You can view the Endpoint Integration snippet again below.
import requests
import json
from base64 import b64decode
with open("config.json") as file:
config = json.load(file)
api_response = requests.post(
"https://api.zyte.com/v1/extract",
auth=(config["zyte_api_key"], ""),
json={
"url": "https://toscrape.com",
"httpResponseBody": True
}
)
json_response = api_response.json()
print("------------------------Raw Response--------------------------")
print(json.dumps(json_response, indent=4))
if "httpResponseBody" in json_response:
json_response["httpResponseBody"] = b64decode(json_response["httpResponseBody"]).decode('utf-8')
print("------------------------Decoded Response----------------------")
print(json.dumps(json_response, indent=4))
- `auth` holds a tuple: our API key, and an empty string to use as our password.
- `json` holds the parameters we'd like to pass into the API:
  - `"url"`: the url that we'd like to scrape.
  - `"httpResponseBody"`: we want the body of the response.
Proxy Port Integration
Just like Endpoint Integration, we've actually already covered Proxy Port Integration. You can view our proxy port code again below.
- First, we read our API key from a config file.
- Then, we set our HTTP and HTTPS proxies to the port url: `f"http://{config['zyte_api_key']}:@api.zyte.com:8011"`. When we make our requests to the target site, they all get routed through this port.
As mentioned earlier, we use the `verify` keyword argument with the path to Zyte's CA Certificate. If you followed the installation steps for your OS, you might not need the `verify` argument.
This type of integration is great when you simply want to set up your proxy and forget about it. You're not worried about customization, you simply want access to a site.
import requests
import json
config = {}
with open("config.json") as file:
config = json.load(file)
ca_cert_path = "zyte-ca.crt"
proxies = {
"http": f"http://{config['zyte_api_key']}:@api.zyte.com:8011",
"https": f"http://{config['zyte_api_key']}:@api.zyte.com:8011"
}
response = requests.get("https://toscrape.com", proxies=proxies, verify=ca_cert_path)
print(response.content)
- `proxies=proxies` tells Requests that we want to integrate with a proxy port. We assign it the value of the `proxies` dict that we declared earlier in the code.
SDK Integration
Here, we'll look at `python-zyte-api`. This is a premade kit for you to use when scraping; all you need to do is install it and use your API key.
You can install it with pip.
pip install zyte-api
Here is the basic usage from their documentation.
from zyte_api import ZyteAPI
client = ZyteAPI(api_key="YOUR_API_KEY")
response = client.get({"url": "https://toscrape.com", "httpResponseBody": True})
You can view their full docs for this here.
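If you ask for `httpResponseBody` through the SDK, the body still comes back Base64 encoded, just like with the raw `/extract` endpoint. Here's a small sketch of decoding it, assuming the SDK returns the same response shape as the REST examples above.

```python
from base64 import b64decode

from zyte_api import ZyteAPI

client = ZyteAPI(api_key="YOUR_API_KEY")
response = client.get({"url": "https://toscrape.com", "httpResponseBody": True})

# The body is Base64 encoded, so decode it before parsing.
html = b64decode(response["httpResponseBody"]).decode("utf-8")
print(html[:500])
```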
Async Response Integration
Async (asynchronous) response integration is the ability to handle operations that run in the background without blocking the execution of other tasks.
An asynchronous response means that a request is initiated, but the system doesn't wait for the response to complete before continuing to execute other tasks. Instead, it processes the response once it's available.
Async response integration is essential for creating efficient, scalable, and responsive systems, especially when dealing with external APIs, real-time data processing, or user interface operations.
We can also get async responses with their Python SDK. The Python Requests library is synchronous, and making async/non-blocking requests with it requires quite a bit of overhead. With `zyte-api`, you can make async requests pretty much right out of the box!
In the code below, we import `asyncio`. Then, we define an async `main()` that uses `async`/`await` when scraping the site. The main function then gets run with `asyncio.run(main())`.
import asyncio
from zyte_api import AsyncZyteAPI
async def main():
client = AsyncZyteAPI(api_key="YOUR_API_KEY")
response = await client.get(
{"url": "https://toscrape.com", "httpResponseBody": True}
)
asyncio.run(main())
You can once again view the `zyte-api` docs here.
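Because `AsyncZyteAPI` calls are awaitable, you can also fire several requests concurrently with `asyncio.gather()`. Here's a minimal sketch of that pattern; the URLs are just examples.

```python
import asyncio

from zyte_api import AsyncZyteAPI

async def main():
    client = AsyncZyteAPI(api_key="YOUR_API_KEY")
    urls = [
        "https://quotes.toscrape.com/page/1/",
        "https://quotes.toscrape.com/page/2/",
        "https://quotes.toscrape.com/page/3/",
    ]
    # Launch all three requests at once and wait for them together.
    responses = await asyncio.gather(
        *[client.get({"url": url, "httpResponseBody": True}) for url in urls]
    )
    print(f"Fetched {len(responses)} pages")

asyncio.run(main())
```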
Managing Concurrency
Concurrency is not directly managed through the API. Their documentation doesn't mention concurrency other than rate limiting: if you are getting status 429, you'll need to decrease your concurrent threads because you are being rate limited.
In the code below, we use `ThreadPoolExecutor` to open up 3 threads. On each thread, we scrape a separate page using `executor.map()`.
import requests
from bs4 import BeautifulSoup
import json
import concurrent.futures
from base64 import b64decode
with open("config.json") as file:
config = json.load(file)
output_data = []
url = "https://api.zyte.com/v1/extract"
def scrape_page(page_number):
try:
response = requests.post(
url,
auth=(config["zyte_api_key"], ""),
json={
"url": f"http://quotes.toscrape.com/page/{page_number+1}/",
"httpResponseBody": True
})
if response.status_code != 200:
raise Exception(f"Failed Status code: {response.status_code}")
content = b64decode(response.json()["httpResponseBody"]).decode('utf-8')
soup = BeautifulSoup(content, "html.parser")
title = soup.find('h1').text
output_data.append({
'title': title.strip(),
})
except Exception as e:
print('Error', e)
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
executor.map(scrape_page, range(3))
print(output_data)
The most important things to note when managing concurrency come in this snippet:
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
executor.map(scrape_page, range(3))
- `max_workers=3` tells `ThreadPoolExecutor` that we want to use a maximum of 3 threads. To use more threads, increase this number. To use fewer threads, decrease this number.
- `scrape_page` is the function we want to call on all available threads.
- `range(3)` generates the numbers 0 through 2. This is the list of pages we wish to scrape. In our `scrape_page()` function, we adjust our url to `page_number+1` to account for this. The pages begin at 1, but we begin counting at 0.
Advanced Functionality
Now that we've got a feel for their API, we'll dive into some of Zyte's more advanced functionalities.
Zyte gives us plenty of flexibility when it comes to customization. Check out the table below for a full list of features we can use. Each feature gets passed in as a field in our JSON body.
The prices are not specifically listed in Zyte's documentation; they vary based on the site difficulty and resources used (as mentioned earlier in the pricing plan).
Field | Description | Default |
---|---|---|
browserHTML | Open a real browser and render the page. | False |
screenshot | Take a screenshot using the browser. | False |
article | Get article data from the page. | False |
articleList | Retrieve a list of articles. | False |
articleNavigation | Find the navigation through the articles. | False |
forumThread | Extract forum threads. | False |
jobPosting | Extract job postings. | False |
jobPostingList | Extract a list of job postings. | False |
product | Extract product data. | False |
productList | Extract a list of product data. | False |
customAttributes | Extract page elements based on criteria. | null |
geolocation | Make a request through a specific country. | Based on site server |
javascript | Forces JavaScript execution on the browser. | Based on site |
actions | List of actions to perform in the browser. | null |
session | Create a reusable session. | null |
networkCapture | Capture network requests from the browser. | null |
device | Emulate a specific device. | desktop |
cookieManagement | How cookies are managed in the browser. | auto |
requestCookies | List of cookies to send with a request. | null |
responseCookies | Show cookies from request in its response. | False |
serp | Search engine results of the domain. | False |
ipType | Use either a residential or datacenter IP. | datacenter |
There are tons of different functionalities we can enable. In the next few sections, we'll just go over the main ones used when scraping the web.
If we don't cover your specific need in this article, you can view a full list of the available functionality here.
JavaScript Rendering
JavaScript rendering is the process of executing JavaScript code within a web page to dynamically create and manipulate content before it is displayed to users. Very few modern sites deliver all of their content without JavaScript rendering.
- Enhanced User Experience: Provides dynamic, interactive content for improved user engagement.
- Reduced Server Load: Offloads rendering tasks to the client, minimizing server resource usage.
- Dynamic Content Updates: Enables real-time content changes without full page reloads.
- Responsive Interfaces: Allows for quick user feedback and seamless interactions.
- Rich Functionality: Supports complex features like forms, animations, and transitions.
- Single Page Applications (SPAs): Facilitates smooth navigation within a single web page.
- Client-Side Data Manipulation: Empowers users to interact with data directly in the browser.
To render JavaScript, we need to enable the browser. We can do this by using `browserHtml` inside of our JSON body. This tells the Zyte API that we want to open a real browser and render the page.
import requests
api_response = requests.post(
"https://api.zyte.com/v1/extract",
auth=("YOUR_API_KEY", ""),
json={
"url": "https://toscrape.com",
"browserHtml": True,
},
)
browser_html: str = api_response.json()["browserHtml"]
`browserHtml` tells Zyte that we want to render the content inside an actual browser and execute JavaScript. To use a browser, but forcibly disable JavaScript, we can pass `"javascript": False`.
import requests
api_response = requests.post(
"https://api.zyte.com/v1/extract",
auth=("YOUR_API_KEY", ""),
json={
"url": "https://toscrape.com",
"browserHtml": True,
"javascript": False
},
)
browser_html: str = api_response.json()["browserHtml"]
Docs for `browserHtml` are available here.
Controlling The Browser
We can control the browser with `actions`. `actions` allows us to pass a list of actions to execute from within the browser.
In the snippet below, we tell Zyte to scroll to the bottom of the page before returning our response. In our `actions` list, we have one action: `"action": "scrollBottom"`.
import requests
from parsel import Selector
api_response = requests.post(
"https://api.zyte.com/v1/extract",
auth=("YOUR_API_KEY", ""),
json={
"url": "https://quotes.toscrape.com/scroll",
"browserHtml": True,
"actions": [
{
"action": "scrollBottom",
},
],
},
)
browser_html = api_response.json()["browserHtml"]
quote_count = len(Selector(browser_html).css(".quote"))
You can view their `actions` documentation here.
Country Geotargeting
No scrape is complete without geotargeting. Geotargeting is used to route our request through a specific location and allows users to access and extract data from web services or websites based on specific geographical locations.
By utilizing proxies that are located in different countries, we can mimic the behavior of users from those regions, enabling us to retrieve localized content, conduct market research, or verify ads.
We can control this by using `geolocation` inside of the JSON body. When we pass this into the API, we need to pass in a specific country code for our location. Zyte will then make our request from an IP in that location.
import json
from base64 import b64decode
import requests
api_response = requests.post(
"https://api.zyte.com/v1/extract",
auth=("YOUR_API_KEY", ""),
json={
"url": "http://ip-api.com/json",
"httpResponseBody": True,
"geolocation": "AU",
},
)
http_response_body: bytes = b64decode(api_response.json()["httpResponseBody"])
response_data = json.loads(http_response_body)
country_code = response_data["countryCode"]
You can view a full list of country codes here. To control your location, pass `"geolocation": "COUNTRY-CODE"`. If you want to appear in the US, pass `"geolocation": "US"`.
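If you want to compare a few locations quickly, you can loop over several country codes using the same request as above. This is just a variation on the earlier example; the country list is arbitrary.

```python
import json
from base64 import b64decode

import requests

# Loop over a few arbitrary country codes and print where each request came from.
for country in ("US", "GB", "AU"):
    api_response = requests.post(
        "https://api.zyte.com/v1/extract",
        auth=("YOUR_API_KEY", ""),
        json={
            "url": "http://ip-api.com/json",
            "httpResponseBody": True,
            "geolocation": country,
        },
    )
    body = b64decode(api_response.json()["httpResponseBody"])
    print(country, json.loads(body)["countryCode"])
```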
Residential Proxies
Residential proxies are a staple in web scraping.
Residential proxies are IP addresses provided by internet service providers (ISPs) to homeowners, as opposed to data center proxies, which are generated in bulk by servers. Residential proxies are associated with real devices and real users, making them less likely to be flagged as suspicious by websites and online services.
Residential proxies in proxy APIs offer a valuable tool for businesses and individuals seeking to perform web scraping, access geotargeted content, and conduct market research without facing detection or blocking.
Their use of real user IPs enhances anonymity and reliability, making them essential for tasks that require legitimate user representation.
When Zyte or ScrapeOps makes a request and it fails from a datacenter IP, they typically retry using a residential IP address. To force a request to use a residential IP, we can add `ipType` to our JSON body.
Warning: To use strictly residential proxies, users are required to undergo Zyte's KYC (know your customer) process.
That said, here is the code to force a residential IP address. First, we check our provider type using a datacenter IP. Then, we check with a residential IP. The output of both gets printed to the terminal.
from base64 import b64decode
import requests
from parsel import Selector
for ip_type in ("datacenter", "residential"):
api_response = requests.post(
"https://api.zyte.com/v1/extract",
auth=("YOUR_API_KEY", ""),
json={
"url": "https://www.whatismyisp.com/",
"httpResponseBody": True,
"ipType": ip_type,
},
)
http_response_body_bytes = b64decode(api_response.json()["httpResponseBody"])
http_response_body = http_response_body_bytes.decode()
logout = Selector(http_response_body).css("h1 > span::text").get()
print(logout)
The documentation for `ipType` is available here.
Custom Headers
Custom header functionality in proxy APIs allows users to specify their own HTTP headers when making requests through a proxy.
While proxy APIs typically manage headers automatically to optimize performance, custom headers can be necessary in certain situations to achieve specific objectives.
Why Use Custom Headers?
- Specific Data Requirements: Necessary for requests that require particular headers to return desired data.
- POST Requests: Essential for including headers like Content-Type or Authorization for proper request processing.
- Bypassing Anti-Bot Systems: Helps mimic legitimate user behavior to avoid detection and blocking.
- Session Management: Facilitates the inclusion of cookies or tokens needed for authenticated requests.
- Targeted Marketing and Advertising: Enables the delivery of specific campaigns based on user context.
- Content Delivery: Ensures receipt of the most relevant content by indicating user preferences.
Word of Caution
- Performance Impacts: Misuse can lead to reduced performance and detection as automated traffic.
- Continuous Header Generation: Necessary for large-scale scraping to avoid blocks.
- Use Judiciously: Should be employed only when absolutely necessary to prevent complications.
To set custom headers, we can use `customHttpRequestHeaders`. We pass these headers in as an array of JSON objects.
import json
from base64 import b64decode
import requests
api_response = requests.post(
"https://api.zyte.com/v1/extract",
auth=("YOUR_API_KEY", ""),
json={
"url": "https://httpbin.org/anything",
"httpResponseBody": True,
"customHttpRequestHeaders": [
{
"name": "Accept-Language",
"value": "fa",
},
],
},
)
http_response_body = b64decode(api_response.json()["httpResponseBody"])
headers = json.loads(http_response_body)["headers"]
To set custom headers, we use `customHttpRequestHeaders` and set its value to an array of JSON objects. This feature is found in their documentation here. This feature is imperative when you're scraping at scale.
Sometimes sites require special headers, and you need to ensure that you're always using clean headers.
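One way to keep your headers clean at scale is to rotate through a small pool of realistic values. Here's a hypothetical sketch using the same `customHttpRequestHeaders` field shown above; the header pool and helper function are made up for illustration.

```python
import random
from base64 import b64decode

import requests

# A hypothetical pool of header sets to rotate through.
HEADER_POOL = [
    [{"name": "Accept-Language", "value": "en-US,en;q=0.9"}],
    [{"name": "Accept-Language", "value": "en-GB,en;q=0.8"}],
    [{"name": "Accept-Language", "value": "de-DE,de;q=0.7"}],
]

def fetch_with_rotating_headers(url):
    api_response = requests.post(
        "https://api.zyte.com/v1/extract",
        auth=("YOUR_API_KEY", ""),
        json={
            "url": url,
            "httpResponseBody": True,
            # Pick a different header set on each request.
            "customHttpRequestHeaders": random.choice(HEADER_POOL),
        },
    )
    return b64decode(api_response.json()["httpResponseBody"]).decode("utf-8")

print(fetch_with_rotating_headers("https://httpbin.org/anything")[:300])
```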
Static Proxies
While they only apply to a specific niche of scraping, Static Proxies (Sticky Sessions) are also a staple when scraping the web.
Static proxies, often referred to as sticky sessions, are a type of proxy server that maintains a consistent IP address for a user over multiple requests.
This means that once a user is assigned a specific proxy IP, they can continue using that same IP for subsequent requests, rather than having their IP address change with each new request.
When you scrape the web, sometimes you need to hang on to your session. Static proxies are a valuable tool for users needing consistent IP addresses for applications like web scraping, session management, and online research. The most common use for this is when you're logging into a site to view information.
To handle sessions, we need to deal with two parameters: `sessionContext` and `sessionContextParameters`.
from base64 import b64decode
import requests
api_response = requests.post(
"https://api.zyte.com/v1/extract",
auth=("YOUR_API_KEY", ""),
json={
"url": "http://httpbin.org/cookies",
"httpResponseBody": True,
"sessionContext": [
{
"name": "id",
"value": "cookies",
},
],
"sessionContextParameters": {
"actions": [
{
"action": "goto",
"url": "http://httpbin.org/cookies/set/foo/bar",
},
],
},
},
)
http_response_body_bytes = b64decode(api_response.json()["httpResponseBody"])
http_response_body = http_response_body_bytes.decode()
print(http_response_body)
The full documentation for this is available here.
You can use sessions when you need to keep your browsing session intact; for instance, if you don't want to be logged out from the site between requests.
Screenshot Functionality
Screenshot functionality in proxy APIs allows users to capture visual representations of web pages or specific content displayed on the internet. This feature is a powerful tool for enabling visual representation of web content for verification, monitoring, and analysis.
It enhances the ability to track changes, verify content, and support testing processes.
Zyte offers some very robust screenshot functionality. To take a screenshot using the Zyte API, we can use the `screenshot` parameter. There are also several options we can pass alongside it via `screenshotOptions`.
from base64 import b64decode
import requests
api_response = requests.post(
"https://api.zyte.com/v1/extract",
auth=("YOUR_API_KEY", ""),
json={
"url": "https://toscrape.com",
"screenshot": True,
},
)
screenshot: bytes = b64decode(api_response.json()["screenshot"])
Here is an example of taking a more customized screenshot of the full page.
from base64 import b64decode
import requests
api_response = requests.post(
"https://api.zyte.com/v1/extract",
auth=("YOUR_API_KEY", ""),
json={
"url": "https://toscrape.com",
"screenshot": True,
"screenshotOptions": {
"format": "png",
"fullPage": True
}
},
)
screenshot: bytes = b64decode(api_response.json()["screenshot"])
Their full documentation on screenshots is available here.
Auto Parsing
We can all admit that parsing is also one of the more difficult tasks when scraping the web.
Auto parsing, sometimes referred to as auto extract, is a functionality offered by some proxy APIs that automatically extracts and structures data from web pages without requiring users to write complex scraping scripts.
This feature simplifies the data retrieval process, making it accessible even to those with limited technical expertise.
Zyte offers one of the best Auto Parsing experiences on the entire web.
In the code below, we use Zyte's `product` parameter to automatically extract the book's product data from the page.
import requests
api_response = requests.post(
"https://api.zyte.com/v1/extract",
auth=("YOUR_API_KEY", ""),
json={
"url": (
"https://books.toscrape.com/catalogue"
"/a-light-in-the-attic_1000/index.html"
),
"product": True,
},
)
product = api_response.json()["product"]
print(product)
You can view a full list of their auto parsing features here.
Case Study: Using Zyte API on IMDb Top 250 Movies
Now, it's time for a bit of an experiment. We're going to scrape IMDB's top 250 movies using both the Zyte API and the ScrapeOps API.
While our requests to the respective APIs are done quite differently, the overall process is much the same. With Zyte, we send our API parameters in the JSON body of a POST request.
With ScrapeOps, we use a GET request, so we create a function that builds a ScrapeOps proxy url and then fetch our data through that url.
Here is the code we use to access the content with the ScrapeOps API. We create a function, `get_scrapeops_url()`. It takes in our API parameters and returns a url that takes us to our custom content.
def get_scrapeops_url(url):
payload = {
"api_key": API_KEY,
"url": url,
}
proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
return proxy_url
To access our content from the Zyte API, we use the following POST request. As we did earlier, we place our parameters inside the JSON body of the request.
response = requests.post(
"https://api.zyte.com/v1/extract",
auth=(API_KEY, ""),
json= {
"url": url,
"httpResponseBody": True
}
)
Then, we decode our Zyte response like we did earlier.
content = b64decode(response.json()["httpResponseBody"]).decode('utf-8')
Here is our full ScrapeOps code.
import os
import requests
from bs4 import BeautifulSoup
import json
import csv
import logging
from urllib.parse import urlencode
## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
API_KEY = ""
with open("config.json", "r") as config_file:
config = json.load(config_file)
API_KEY = config["scrapeops_api_key"]
def get_scrapeops_url(url):
payload = {
"api_key": API_KEY,
"url": url,
}
proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
return proxy_url
def scrape_movies(url, location="us", retries=3):
success = False
tries = 0
while not success and tries <= retries:
response = requests.get(get_scrapeops_url(url))
try:
if response.status_code != 200:
raise Exception(f"Failed response from server, status code: {response.status_code}")
soup = BeautifulSoup(response.text, "html.parser")
json_tag = soup.select_one("script[type='application/ld+json']")
json_data = json.loads(json_tag.text)["itemListElement"]
movie_list_length = 0
movie_list = []
for item in json_data:
movie_list.append(item["item"])
movie_list_length+=len(json_data)
print(f"Movie list length: {len(json_data)}")
with open("scrapeops-top-250.json", "w") as file:
json.dump(movie_list, file, indent=4)
success = True
except Exception as e:
logger.error(f"Failed to process page: {e}, retries left: {retries-tries}")
tries+=1
if not success:
raise Exception(f"Failed to scrape page, MAX RETRIES {retries} EXCEEDED!!!")
if __name__ == "__main__":
MAX_RETRIES = 3
logger.info("Starting IMDB scrape")
url = "https://www.imdb.com/chart/top/"
scrape_movies(url, retries=MAX_RETRIES)
logger.info("Scrape complete")
Here is our full code using the Zyte API.
import os
import requests
from bs4 import BeautifulSoup
import json
from base64 import b64decode
import logging
from urllib.parse import urlencode
## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
API_KEY = ""
with open("config.json", "r") as config_file:
config = json.load(config_file)
API_KEY = config["zyte_api_key"]
def scrape_movies(url, location="us", retries=3):
success = False
tries = 0
while not success and tries <= retries:
response = requests.post(
"https://api.zyte.com/v1/extract",
auth=(API_KEY, ""),
json= {
"url": url,
"httpResponseBody": True
}
)
try:
if response.status_code != 200:
raise Exception(f"Failed response from server, status code: {response.status_code}")
content = b64decode(response.json()["httpResponseBody"]).decode('utf-8')
soup = BeautifulSoup(content, "html.parser")
json_tag = soup.select_one("script[type='application/ld+json']")
json_data = json.loads(json_tag.text)["itemListElement"]
movie_list_length = 0
movie_list = []
for item in json_data:
movie_list.append(item["item"])
movie_list_length+=len(json_data)
print(f"Movie list length: {len(json_data)}")
with open("zyte-top-250.json", "w") as file:
json.dump(movie_list, file, indent=4)
success = True
except Exception as e:
logger.error(f"Failed to process page: {e}, retries left: {retries-tries}")
tries+=1
if not success:
raise Exception(f"Failed to scrape page, MAX RETRIES {retries} EXCEEDED!!!")
if __name__ == "__main__":
MAX_RETRIES = 3
logger.info("Starting IMDB scrape")
url = "https://www.imdb.com/chart/top/"
scrape_movies(url, retries=MAX_RETRIES)
logger.info("Scrape complete")
ScrapeOps parsed and saved the data in 4.955 seconds.
With Zyte, we completed the scrape in 4.199 seconds.
All in all, both of these APIs are pretty performant. While accessing the site via Zyte has a little bit higher learning curve, it is a little bit faster. 4.955 seconds - 4.199 seconds = 0.756 seconds difference. The difference here is pretty minimal.
There were no real challenges getting through. Both proxies made it through without using any of our retry logic.
Alternative: ScrapeOps Proxy API Aggregator
With the ScrapeOps Proxy Aggregator, you get access to a larger set of price plans and much of the same service and reliability that you get with the Zyte API. In fact, Zyte is one of the many providers we use in our Proxy Aggregator.
When you use the ScrapeOps Proxy Aggregator, you get access to Zyte and numerous other proxy providers as well. We're adding new service providers all the time. You can take a look at our plans below.
They range from as low as $9 per month to as high as $249 per month. In comparison to Zyte, our lower tier plans cost far less and our higher tier plans cost drastically less. The Zyte mid-tier plan is $350 per month and our highest tier plan at ScrapeOps is $249.
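Integration looks a lot like the proxy endpoint we used in the case study above: you send a GET request to the ScrapeOps proxy endpoint with your API key and target url as query parameters. Here's a minimal example reusing the same endpoint and parameters as the `get_scrapeops_url()` helper from the case study.

```python
import requests
from urllib.parse import urlencode

API_KEY = "YOUR-SCRAPEOPS-API-KEY"

# Build the proxy url, exactly as in the case study's get_scrapeops_url() helper.
payload = {"api_key": API_KEY, "url": "https://toscrape.com"}
proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)

response = requests.get(proxy_url)
print(response.status_code)
print(response.text[:300])
```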
Troubleshooting
Issue #1: Request Timeouts
Handling timeouts is really easy to do with Python Requests. All we need to do is pass the `timeout` keyword argument with our request. This tells Requests to wait up to our timeout limit before throwing an exception. Simply pass a number of seconds into the `timeout` argument.
import requests
# 5 second timeout
response = requests.get("https://httpbin.org/get", timeout=5)
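The same `timeout` argument works for the Zyte `/extract` requests we've been making. Here's a minimal sketch that catches the timeout and retries once; the 30-second limit and the single retry are arbitrary choices for illustration.

```python
import requests

def fetch_with_timeout(url, api_key, timeout=30):
    try:
        return requests.post(
            "https://api.zyte.com/v1/extract",
            auth=(api_key, ""),
            json={"url": url, "httpResponseBody": True},
            timeout=timeout,  # seconds to wait before giving up
        )
    except requests.exceptions.Timeout:
        # One simple retry; a real scraper might back off and retry more.
        return requests.post(
            "https://api.zyte.com/v1/extract",
            auth=(api_key, ""),
            json={"url": url, "httpResponseBody": True},
            timeout=timeout,
        )

response = fetch_with_timeout("https://toscrape.com", "YOUR_API_KEY")
print(response.status_code)
```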
Issue #2: Handling CAPTCHAs
With both ScrapeOps and Zyte, if you're running into CAPTCHAs, there's a problem. Both of these providers offer CAPTCHA avoidance and tend to bypass CAPTCHAs completely.
However, sometimes things happen and the anti-bots do trip us up. If you run into a CAPTCHA, first, retry the request.
If you're receiving CAPTCHAs consistently, you can use a service like 2Captcha. We've got a full article on getting past CAPTCHAs here. It goes through CAPTCHA solving libraries and even services like 2Captcha that we mentioned above.
Issue #3: Invalid Response Data
When dealing with invalid responses, we need to troubleshoot our status codes. As mentioned earlier, if you are receiving anything other than a 200, something is wrong.
You can view Zyte's status codes here. The ScrapeOps status codes are available for you to review here.
In most cases, once you understand the response code, you'll know exactly what to change.
- If you're receiving a 429, slow down your requests.
- If you receive a 404, you're looking for a page that doesn't exist.
To make a long story short, look up your status code and solve it accordingly.
The Legal & Ethical Implications of Web Scraping
All of the data scraped in this article has been public data. Public data is generally legal to scrape. Private data (data gated behind a login or some other type of authentication), is a completely different story.
When you scrape private data, you are subject to the same IP (intellectual property) and privacy laws that govern the site you're scraping.
You also need to be aware of your target site's terms and conditions and their robots.txt
file as well. IMDB's terms and robots.txt
are available for you to review in the links below.
Violating these terms can have severe consequences. If you choose to misuse a site or an API, it can result in:
- Account suspension or even termination if you have an account at the site. Your account is subject to their Terms and Conditions.
- Lawsuits and other legal penalties depending on which data you scrape and how it gets disseminated. If a company suffers damages because of your scraping, it can hold you liable and sue you.
- Reputation damage to the companies involved. As mentioned above, when these companies have their reputations damaged, they can sue you.
- Risks to users such as exposure of personal data. This can expose you to lawsuits or even prison time.
Conclusion
You now know how to integrate with Zyte API using proxy ports, their SDK, and their REST API. You got a crash course in how to make POST requests with authentication using Python Requests.
Whether you choose the ScrapeOps Proxy Aggregator or the Zyte API, you can get a stable and reliable connection to your data.
You can get one of the more expensive plans with all the bells and whistles from Zyte, or you can get a much more reasonable price plan from ScrapeOps with most of the same functionality.
If you are planning on using Auto Parsing features, Zyte is clearly ahead (albeit far more expensive), but if you're just looking to scrape the web normally, ScrapeOps is clearly the better value.
More Web Scraping Guides
At ScrapeOps, we've got tons of learning resources for anyone that wants to read them. If you're brand new to web development, we've got something for you. If you're a seasoned dev, we've also got something for you.
We love web scraping so much, we even wrote a playbook on it!
If you'd like to learn more about integration with other proxies, check out the links below!