
Zyte API: Web Scraping Integration Guide

Proxy management is an integral part of the Zyte API. Their Smart Proxy Manager, which is currently being merged into the Zyte API, "automatically selects the leanest set of proxies and techniques to keep your crawl healthy". This automates your proxy connections so all you have to focus on is writing your scraper.

Here, we'll go through the process of signing up for and using the Zyte API, and then pit it head to head against the ScrapeOps Proxy Aggregator.

Need help scraping the web?

Then check out ScrapeOps, the complete toolkit for web scraping.


TLDR: Web Scraping With Zyte API

With proxy ports, getting started with the Zyte API is pretty easy.

Download their CA Certificate and configure it using their instructions for your OS. Once you've done that, you're ready to go.

Here's an example to get you started. If you run into SSL issues, you can pass verify=False into your requests, though this disables certificate verification entirely.

import requests
import json

# Load our API key from config.json so it isn't hardcoded
config = {}

with open("config.json") as file:
    config = json.load(file)

# Path to the Zyte CA Certificate we downloaded
ca_cert_path = "zyte-ca.crt"

# Route both HTTP and HTTPS traffic through Zyte's proxy port
proxies = {
    "http": f"http://{config['zyte_api_key']}:@api.zyte.com:8011",
    "https": f"http://{config['zyte_api_key']}:@api.zyte.com:8011"
}

response = requests.get("https://toscrape.com", proxies=proxies, verify=ca_cert_path)

print(response.content)

When you hook into their proxy port, you can set it and forget it.

  • Pass proxies=proxies and verify="path-to-your-ca-certificate", and you can continue building everything else as you normally would.
  • Make sure to use these proxies for responsible web scraping. Don't disseminate private data, and don't use the API to violate any other site's terms and conditions.

What Is The Zyte API?

The Zyte Smart Proxy Manager is a portion of the Zyte API. Their Smart Proxy Manager automatically keeps a list of available, healthy proxies and selects the best one for your specific scrape.

Zyte Homepage

When we're scraping the web, more often than not, we're bogged down writing parsers and trying to extract data from our target sites. With both the Zyte API and the ScrapeOps Proxy Aggregator, the proxy management gets handled for you.

Both of these solutions use rotating proxies. They allow us to scrape more efficiently by getting us past CAPTCHAs and anti-bot systems, and they even allow us to render JavaScript content on the page before sending our response back.

All in all, this makes scraping a site far easier than manual proxy management. When you manage proxies manually, you have to create them, maintain a list of them, and select the best one.

When we use proxy managers like the ones mentioned above, all we have to worry about is our scrape and our normal code. We don't need to write tons of boilerplate and manage infrastructure, these products handle it for us.


How Does The Zyte API Work?

Zyte's API has a pretty simple function when we examine it at the highest level. We use it to gain access to the target site. When we break down what's actually going on, their Smart Proxy Manager is doing a ton more than you would think.

When you make a request to the Smart Proxy Manager through the Zyte API, the following happens:

  1. Zyte picks the best available proxy out of its pool.
  2. Using that proxy, Zyte fetches the page and executes any additional instructions we gave it (like rendering JavaScript).
  3. Zyte ensures that we received a valid response. If we did not, it will repeat steps 1 and 2 until we get one.
  4. Zyte sends the response back to us.

Let's make a simple request using the Zyte API. Start by creating a config.json file. We'll use this to hold our API keys. Here are its contents.

Our scrapers will read our API keys from this file so we don't have to hardcode them into the scraper (it's bad practice to hardcode API keys!).

{
    "scrapeops_api_key": "YOUR-SCRAPEOPS-API-KEY",
    "zyte_api_key": "YOUR-ZYTE-API-KEY"
}

Once we've got our API keys stored, we need to do something with them. Before we can use the API successfully, we need to set up the Zyte CA Certificate.

You can find instructions for that here. Follow the instructions specific to your OS (Windows, Linux, or Mac).

The easiest way to use the certificate with Requests is to simply specify the path to the certificate in your code.

For this tutorial, I'm just going to keep the certificate inside my project folder, that makes it easy to find. Here's the output from my ls command.

You can see that the certificate is highlighted.

CA Certificate

Now, to make a simple request, we need to read both our config file and our CA certificate.

import requests
import json

config = {}

with open("config.json") as file:
    config = json.load(file)

ca_cert_path = "zyte-ca.crt"

proxies = {
    "http": f"http://{config['zyte_api_key']}:@api.zyte.com:8011",
    "https": f"http://{config['zyte_api_key']}:@api.zyte.com:8011"
}

response = requests.get("https://toscrape.com", proxies=proxies, verify=ca_cert_path)

print(response.text)
  • First, we create a variable to hold our configuration.
  • Then, we read the config file and load our configuration into our program using json.load().
  • We specify the path to both the config file and our CA Certificate.
  • When we make a request, we need to pass these things in along with the request:
    • proxies=proxies tells Requests to use the proxies we set up.
    • verify=ca_cert_path tells Requests to use the CA Certificate we downloaded for verification.

Response Format

In the example above, our response came in HTML format by default. You can view the full HTML below.

<!DOCTYPE html>
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Scraping Sandbox</title>
<link href="./css/bootstrap.min.css" rel="stylesheet">
<link href="./css/main.css" rel="stylesheet">
</head>
<body>
<div class="container">
<div class="row">
<div class="col-md-1"></div>
<div class="col-md-10 well">
<img class="logo" src="img/zyte.png" width="200px">
<h1 class="text-right">Web Scraping Sandbox</h1>
</div>
</div>

<div class="row">
<div class="col-md-1"></div>
<div class="col-md-10">
<h2>Books</h2>
<p>A <a href="http://books.toscrape.com">fictional bookstore</a> that desperately wants to be scraped. It's a safe place for beginners learning web scraping and for developers validating their scraping technologies as well. Available at: <a href="http://books.toscrape.com">books.toscrape.com</a></p>
<div class="col-md-6">
<a href="http://books.toscrape.com"><img src="./img/books.png" class="img-thumbnail"></a>
</div>
<div class="col-md-6">
<table class="table table-hover">
<tr><th colspan="2">Details</th></tr>
<tr><td>Amount of items </td><td>1000</td></tr>
<tr><td>Pagination </td><td>&#10004;</td></tr>
<tr><td>Items per page </td><td>max 20</td></tr>
<tr><td>Requires JavaScript </td><td>&#10008;</td></tr>
</table>
</div>
</div>
</div>

<div class="row">
<div class="col-md-1"></div>
<div class="col-md-10">
<h2>Quotes</h2>
<p><a href="http://quotes.toscrape.com/">A website</a> that lists quotes from famous people. It has many endpoints showing the quotes in many different ways, each of them including new scraping challenges for you, as described below.</p>
<div class="col-md-6">
<a href="http://quotes.toscrape.com"><img src="./img/quotes.png" class="img-thumbnail"></a>
</div>
<div class="col-md-6">
<table class="table table-hover">
<tr><th colspan="2">Endpoints</th></tr>
<tr><td><a href="http://quotes.toscrape.com/">Default</a></td><td>Microdata and pagination</td></tr>
<tr><td><a href="http://quotes.toscrape.com/scroll">Scroll</a> </td><td>infinite scrolling pagination</td></tr>
<tr><td><a href="http://quotes.toscrape.com/js">JavaScript</a> </td><td>JavaScript generated content</td></tr>
<tr><td><a href="http://quotes.toscrape.com/js-delayed">Delayed</a> </td><td>Same as JavaScript but with a delay (?delay=10000)</td></tr>
<tr><td><a href="http://quotes.toscrape.com/tableful">Tableful</a> </td><td>a table based messed-up layout</td></tr>
<tr><td><a href="http://quotes.toscrape.com/login">Login</a> </td><td>login with CSRF token (any user/passwd works)</td></tr>
<tr><td><a href="http://quotes.toscrape.com/search.aspx">ViewState</a> </td><td>an AJAX based filter form with ViewStates</td></tr>
<tr><td><a href="http://quotes.toscrape.com/random">Random</a> </td><td>a single random quote</td></tr>
</table>
</div>
</div>
</div>
</div>
</body>
</html>

We can use the Zyte API to customize our parameters. Take a look at the code example below. Something that might seem strange, but is actually considered more secure: the Zyte API uses POST requests instead of GET.

This is a more secure method than leaving the API key exposed in a URL. Many APIs will have you send a GET with your API key in the parameters, Zyte instead has you send your key using a secure header.

Our response already comes in JSON format, but the body is Base64 encoded. This encoding lets Zyte safely embed the raw binary body inside a JSON response and hand the data back to us exactly as it was received.

Once we've received our response, we can go ahead and decode it using Python's builtin base64 library. The example below prints both the encoded and decoded responses so you can see the difference.

import requests
import json
from base64 import b64decode

with open("config.json") as file:
    config = json.load(file)

api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=(config["zyte_api_key"], ""),
    json={
        "url": "https://toscrape.com",
        "httpResponseBody": True
    }
)

json_response = api_response.json()

print("------------------------Raw Response--------------------------")
print(json.dumps(json_response, indent=4))

if "httpResponseBody" in json_response:
    json_response["httpResponseBody"] = b64decode(json_response["httpResponseBody"]).decode('utf-8')

print("------------------------Decoded Response----------------------")
print(json.dumps(json_response, indent=4))

Zyte API Pricing

Take a look at their pricing below. These are tiers for non-rendered responses. Depending on what you choose to do with their API, costs may vary.

With the PAYG (pay as you go) plan you only pay for the data you actually use. There are 4 other separate tiers if you choose to go with a monthly plan instead.

Zyte Pricing Tiers

The table in the screenshot above can be a bit difficult to understand. Here's our breakdown of it.

  • The price per request varies based on the difficulty of the website.
  • There are 5 tiers of difficulty based on what resources are required to scrape the site; the more residential IPs and compute resources your scrape needs, the more the request will cost.
  • In the costs section, we show the minimum and maximum price per 1000 requests.
| Plan | Cost per 1,000 Requests | Monthly Price |
| --- | --- | --- |
| PAYG | $0.20 - $1.90 | Variable (based on usage) |
| Monthly-100 | $0.10 - $0.95 | $100 |
| Monthly-350 | $0.07 - $0.65 | $350 |
| Monthly-1000 | $0.05 - $0.52 | $1000 |
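
To make the table concrete, here's a small, hypothetical helper (our own illustration, not part of any Zyte tooling) that estimates a PAYG cost range from the per-1,000-request prices above.

def estimate_payg_cost(num_requests, min_per_1000=0.20, max_per_1000=1.90):
    # PAYG prices from the table above: $0.20 - $1.90 per 1,000 requests
    low = num_requests / 1000 * min_per_1000
    high = num_requests / 1000 * max_per_1000
    return low, high

low, high = estimate_payg_cost(50_000)
print(f"50,000 requests: ${low:.2f} - ${high:.2f}")  # 50,000 requests: $10.00 - $95.00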

Response Status Codes

In all facets of web development (scraping included), status codes are important. As you might already know, 200 indicates a successful response.

Anything other than 200 typically indicates that something went wrong somewhere. To troubleshoot these status codes, take a look at the table below.

Here are the status codes that Zyte includes in their documentation.

| Status Code | Meaning | Description |
| --- | --- | --- |
| 200 | Success | Everything worked! |
| 400 | Invalid | Invalid request or JSON information. |
| 401 | Authorization Error | Issues with the way you're sending your API key. |
| 403 | Account Suspension | Account has been suspended from accessing the API. |
| 404 | Site Not Found | The site was not found at the requested domain. |
| 422 | Incompatible Parameters | Double-check your parameters; they're conflicting. |
| 429 | Over User Limit | You've exceeded your rate limit. |
| 451 | Forbidden Domain | Zyte API doesn't permit access to the requested site. |
| 500 | Internal Server Error | Zyte experienced an internal issue. |
| 503 | Overloaded | Zyte is overloaded, try again later. |
| 520 | Download Error | Temporary error retrieving content, try again. |
| 521 | Download Error | Permanent download error, open a support ticket. |
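
Codes like 429, 503, and 520 are transient, so retrying with a backoff usually resolves them. Here's a minimal sketch of that pattern (the helper name and backoff policy are our own, not part of the Zyte API):

import time
import requests

def post_with_retries(payload, api_key, max_retries=3):
    for attempt in range(max_retries):
        response = requests.post(
            "https://api.zyte.com/v1/extract",
            auth=(api_key, ""),
            json=payload,
        )
        if response.status_code == 200:
            return response
        if response.status_code in (429, 503, 520):
            # transient: back off exponentially, then retry
            time.sleep(2 ** attempt)
            continue
        # anything else (400, 401, 403, 422, 451, 521...) won't fix itself
        response.raise_for_status()
    raise Exception(f"Still failing after {max_retries} attempts")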

Setting Up Zyte API

Now that we've got some background information on Zyte's API, let's get started.

  1. First, create an account using either Google or an email and password.
  2. Afterward, you can select a trial plan. You can choose either Zyte API or Smart Proxy Management.
    • Smart Proxy Management was a separate product, but is now being merged with the Zyte API.
    • Both of these options will give you access to the Zyte API via your API key and you'll receive a $5 credit to your account for a free trial.
    • Free trials last until you either use the $5 or until 30 days have passed: whichever is sooner.

Zyte Free Trial

After setting up your free trial, you'll need to go through a checkout process where you enter your credit/debit card information. You'll be charged a minimal amount (around $1) that is immediately refunded; this is just to verify that your card works.

Zyte Checkout

As we mentioned earlier, almost all the requests we send to Zyte will use the POST method. This is more secure, but it does make our code slightly more difficult to write. Our API keys are sent in a secure Authorization header. If you remember from earlier, Python Requests abstracts this away with our auth argument in the request.
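
For illustration, here's roughly what Requests builds under the hood when we pass auth=(api_key, ""): a standard HTTP Basic Authorization header with the API key as the username and an empty password. This sketch constructs the header manually just to show what's happening; in practice, the auth argument is all you need.

import base64
import requests

api_key = "YOUR_API_KEY"

# auth=(api_key, "") is equivalent to building this Basic auth header
token = base64.b64encode(f"{api_key}:".encode()).decode()
headers = {"Authorization": f"Basic {token}"}

response = requests.post(
    "https://api.zyte.com/v1/extract",
    headers=headers,
    json={"url": "https://toscrape.com", "httpResponseBody": True},
)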

To view your API key, select the Zyte API dropdown and click API access.

Zyte API Access

You'll then be taken to a screen where you can view and replace your keys.

Zyte Key Management

The Zyte API supports integration with the following methods:

  • /extract endpoint: This is where we sent our POST request in the second example.
  • Proxy ports: Our first code example used proxy ports; you set up your connection once and forget about managing the proxy afterward.
  • SDKs: python-zyte-api and scrapy-zyte-api. These SDKs (software development kits) allow you to get started with their API quickly while abstracting away much of the HTTP handling and authentication.

API Endpoint Integration

We've already used the Zyte REST API once. With Endpoint Integration, we simply make all of our HTTP requests to a specific endpoint and receive standard HTTP responses from Zyte's server.

As mentioned before, all of our requests go to the /extract endpoint. You can view the Endpoint Integration snippet again below.

import requests
import json
from base64 import b64decode

with open("config.json") as file:
    config = json.load(file)

api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=(config["zyte_api_key"], ""),
    json={
        "url": "https://toscrape.com",
        "httpResponseBody": True
    }
)

json_response = api_response.json()

print("------------------------Raw Response--------------------------")
print(json.dumps(json_response, indent=4))

if "httpResponseBody" in json_response:
    json_response["httpResponseBody"] = b64decode(json_response["httpResponseBody"]).decode('utf-8')

print("------------------------Decoded Response----------------------")
print(json.dumps(json_response, indent=4))
  • auth holds a tuple: our API key, and an empty string to use as our password.
  • json holds the parameters we'd like to pass into the API:
    • "url": the url that we'd like to scrape.
    • "httpResponseBody": we want the body of the response.

Proxy Port Integration

Just like Endpoint Integration, we've actually already covered Proxy Port Integration. You can view our proxy port code again below.

  1. First, we read our API key from a config file.
  2. Then, we set our HTTP and HTTPS proxies to the proxy port URL: f"http://{config['zyte_api_key']}:@api.zyte.com:8011". When we make our requests to the target site, they all get routed through this port.

As mentioned earlier, we use the verify keyword argument with the path to Zyte's CA Certificate. If you followed the installation steps for your OS, you might not need the verify argument.

This type of integration is great when you simply want to set up your proxy and forget about it. You're not worried about customization; you simply want access to a site.

import requests
import json

config = {}

with open("config.json") as file:
    config = json.load(file)

ca_cert_path = "zyte-ca.crt"

proxies = {
    "http": f"http://{config['zyte_api_key']}:@api.zyte.com:8011",
    "https": f"http://{config['zyte_api_key']}:@api.zyte.com:8011"
}

response = requests.get("https://toscrape.com", proxies=proxies, verify=ca_cert_path)

print(response.content)
  • proxies=proxies tells Requests that we want to integrate with a proxy port. We assign it the value of the proxies dict that we declared earlier in the code.

SDK Integration

Here, we'll look at python-zyte-api. This is a premade kit for you to use when scraping; all you need to do is install it and use your API key.

You can install it with pip.

pip install zyte-api

Here is the basic usage from their documentation.

from zyte_api import ZyteAPI

client = ZyteAPI(api_key="YOUR_API_KEY")
response = client.get({"url": "https://toscrape.com", "httpResponseBody": True})

You can view their full docs for this here.
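
The SDK returns the same JSON structure as the /extract endpoint, so if you request httpResponseBody, it still arrives Base64 encoded. A short sketch (assuming the response behaves like the raw API response shown earlier):

from base64 import b64decode

from zyte_api import ZyteAPI

client = ZyteAPI(api_key="YOUR_API_KEY")
response = client.get({"url": "https://toscrape.com", "httpResponseBody": True})

# decode the Base64 body just like we did with the raw endpoint
html = b64decode(response["httpResponseBody"]).decode("utf-8")
print(html[:200])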

Async Response Integration

Async (asynchronous) response integration is the ability to handle operations that run in the background without blocking the execution of other tasks.

An asynchronous response means that a request is initiated, but the system doesn't wait for the response to complete before continuing to execute other tasks. Instead, it processes the response once it's available.

Async response integration is essential for creating efficient, scalable, and responsive systems, especially when dealing with external APIs, real-time data processing, or user interface operations.

We can also get async responses with their Python SDK. The Python Requests library is synchronous, and using async/non-blocking requests with it requires quite a bit of overhead. With zyte-api, you can make async requests pretty much right out of the box!

In the code below, we import asyncio. Then, we define an async main() that uses async/await when scraping the site. The main function then gets run with asyncio.run(main()).

import asyncio

from zyte_api import AsyncZyteAPI


async def main():
    client = AsyncZyteAPI(api_key="YOUR_API_KEY")
    response = await client.get(
        {"url": "https://toscrape.com", "httpResponseBody": True}
    )


asyncio.run(main())

You can once again view the zyte-api docs here.
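
A nice consequence of the async client is that we can fire off several requests at once with asyncio.gather(). Here's a sketch under the same assumptions as the example above:

import asyncio
from base64 import b64decode

from zyte_api import AsyncZyteAPI


async def main():
    client = AsyncZyteAPI(api_key="YOUR_API_KEY")
    urls = ["https://toscrape.com", "https://quotes.toscrape.com"]
    # run both requests concurrently instead of one after the other
    responses = await asyncio.gather(
        *[client.get({"url": url, "httpResponseBody": True}) for url in urls]
    )
    for response in responses:
        print(b64decode(response["httpResponseBody"]).decode("utf-8")[:100])


asyncio.run(main())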

Managing Concurrency

Concurrency is not directly managed through the API; there is no direct mention of it in Zyte's documentation other than rate limiting. If you are getting status 429, you are being rate limited and need to decrease your concurrent threads.

In the code below, we use ThreadPoolExecutor to open up 3 threads. On each thread, we scrape a separate page using executor.map().

import requests
from bs4 import BeautifulSoup
import json
import concurrent.futures
from base64 import b64decode

with open("config.json") as file:
    config = json.load(file)


output_data = []

url = "https://api.zyte.com/v1/extract"

def scrape_page(page_number):
    try:
        response = requests.post(
            url,
            auth=(config["zyte_api_key"], ""),
            json={
                "url": f"http://quotes.toscrape.com/page/{page_number+1}/",
                "httpResponseBody": True
            })
        if response.status_code != 200:
            raise Exception(f"Failed Status code: {response.status_code}")

        content = b64decode(response.json()["httpResponseBody"]).decode('utf-8')
        soup = BeautifulSoup(content, "html.parser")
        title = soup.find('h1').text

        output_data.append({
            'title': title.strip(),
        })

    except Exception as e:
        print('Error', e)


with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    executor.map(scrape_page, range(3))

print(output_data)

The most important part of managing concurrency comes in this snippet:

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    executor.map(scrape_page, range(3))
  • max_workers=3 tells ThreadPoolExecutor that we want to use a maximum of 3 threads. To use more threads, increase this number; to use fewer, decrease it.
  • scrape_page is the function we want to call on all available threads.
  • range(3) yields the numbers 0 through 2. This is the list of pages we wish to scrape. In our scrape_page() function, we adjust our url to page_number+1 to account for this: the pages begin at 1, but we begin counting at 0.

Advanced Functionality

Now that we've got a feel for their API, we'll dive into some of Zyte's more advanced functionalities.

Zyte gives us plenty of flexibility when it comes to customization. Check out the table below for a full list of features we can use. Each feature gets passed in as a field in our JSON body.

The prices are not specifically listed in Zyte's documentation, they vary based on the site difficulty and resources used (as mentioned earlier in the pricing plan).

| Field | Description | Default |
| --- | --- | --- |
| browserHtml | Open a real browser and render the page. | False |
| screenshot | Take a screenshot using the browser. | False |
| article | Get article data from the page. | False |
| articleList | Retrieve a list of articles. | False |
| articleNavigation | Find the navigation through the articles. | False |
| forumThread | Extract forum threads. | False |
| jobPosting | Extract job postings. | False |
| jobPostingList | Extract a list of job postings. | False |
| product | Extract product data. | False |
| productList | Extract a list of product data. | False |
| customAttributes | Extract page elements based on criteria. | null |
| geolocation | Make a request through a specific country. | Based on site server |
| javascript | Forces JavaScript execution in the browser. | Based on site |
| actions | List of actions to perform in the browser. | null |
| session | Create a reusable session. | null |
| networkCapture | Capture network requests from the browser. | null |
| device | Emulate a specific device. | desktop |
| cookieManagement | How cookies are managed in the browser. | auto |
| requestCookies | List of cookies to send with a request. | null |
| responseCookies | Show cookies from request in its response. | False |
| serp | Search engine results of the domain. | False |
| ipType | Use either a residential or datacenter IP. | datacenter |

There are tons of different functionalities we can enable. In the next few sections, we'll just go over the main ones used when scraping the web.

If we don't cover your specific need in this article, you can view a full list of the available functionality here.


JavaScript Rendering

JavaScript rendering is the process of executing JavaScript code within a web page to dynamically create and manipulate content before it is displayed to users. Many modern sites aren't fully usable without it.

  • Enhanced User Experience: Provides dynamic, interactive content for improved user engagement.
  • Reduced Server Load: Offloads rendering tasks to the client, minimizing server resource usage.
  • Dynamic Content Updates: Enables real-time content changes without full page reloads.
  • Responsive Interfaces: Allows for quick user feedback and seamless interactions.
  • Rich Functionality: Supports complex features like forms, animations, and transitions.
  • Single Page Applications (SPAs): Facilitates smooth navigation within a single web page.
  • Client-Side Data Manipulation: Empowers users to interact with data directly in the browser.

To render JavaScript, we need to enable the browser. We can do this by setting browserHtml inside of our JSON body. This tells the Zyte API that we want to open a real browser and render the page.

import requests

api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_API_KEY", ""),
    json={
        "url": "https://toscrape.com",
        "browserHtml": True,
    },
)
browser_html: str = api_response.json()["browserHtml"]

browserHtml tells Zyte that we want to render the content inside an actual browser and execute JavaScript. To use a browser but forcibly disable JavaScript, we can pass "javascript": False.

import requests

api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_API_KEY", ""),
    json={
        "url": "https://toscrape.com",
        "browserHtml": True,
        "javascript": False
    },
)
browser_html: str = api_response.json()["browserHtml"]

Docs for browserHtml are available here.

Controlling The Browser

We can control the browser with actions. The actions field allows us to pass a list of actions to execute from within the browser.

In the snippet below, we tell Zyte to scroll to the bottom of the page before returning our response. In our actions list, we have one action: "action": "scrollBottom".

import requests
from parsel import Selector

api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_API_KEY", ""),
    json={
        "url": "https://quotes.toscrape.com/scroll",
        "browserHtml": True,
        "actions": [
            {
                "action": "scrollBottom",
            },
        ],
    },
)
browser_html = api_response.json()["browserHtml"]
quote_count = len(Selector(browser_html).css(".quote"))

You can view their actions documentation here.


Country Geotargeting

No scrape is complete without geotargeting. Geotargeting is used to route our request through a specific location and allows users to access and extract data from web services or websites based on specific geographical locations.

By utilizing proxies that are located in different countries, we can mimic the behavior of users from those regions, enabling us to retrieve localized content, conduct market research, or verify ads.

We can control this by using geolocation inside of the JSON body. When we pass this into the API, we need to pass in a specific country code for our location. Zyte will then make our request from an IP in that location.

import json
from base64 import b64decode

import requests

api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_API_KEY", ""),
    json={
        "url": "http://ip-api.com/json",
        "httpResponseBody": True,
        "geolocation": "AU",
    },
)
http_response_body: bytes = b64decode(api_response.json()["httpResponseBody"])
response_data = json.loads(http_response_body)
country_code = response_data["countryCode"]

You can view a full list of country codes here. To control your location, pass "geolocation": "COUNTRY-CODE". If you want to appear in the US, pass "geolocation": "US".


Residential Proxies

Residential proxies are a staple in web scraping.

Residential proxies are IP addresses provided by internet service providers (ISPs) to homeowners, as opposed to data center proxies, which are generated in bulk by servers. Residential proxies are associated with real devices and real users, making them less likely to be flagged as suspicious by websites and online services.

Residential proxies in proxy APIs offer a valuable tool for businesses and individuals seeking to perform web scraping, access geotargeted content, and conduct market research without facing detection or blocking.

Their use of real user IPs enhances anonymity and reliability, making them essential for tasks that require legitimate user representation.

When Zyte or ScrapeOps makes a request and it fails from a datacenter IP, they typically retry using a residential IP address. To force a request to use a residential IP, we can add ipType to our JSON body.

Warning: To use strictly residential proxies, users are required to undergo Zyte's KYC (know your customer) process.

That said, here is the code to force a residential IP address. First, we check our provider type using a datacenter IP. Then, we check with a residential IP. The output of both gets printed to the terminal.

from base64 import b64decode

import requests
from parsel import Selector

for ip_type in ("datacenter", "residential"):
    api_response = requests.post(
        "https://api.zyte.com/v1/extract",
        auth=("YOUR_API_KEY", ""),
        json={
            "url": "https://www.whatismyisp.com/",
            "httpResponseBody": True,
            "ipType": ip_type,
        },
    )
    http_response_body_bytes = b64decode(api_response.json()["httpResponseBody"])
    http_response_body = http_response_body_bytes.decode()
    # the h1 span on this page holds the detected ISP name
    isp = Selector(http_response_body).css("h1 > span::text").get()
    print(isp)

The documentation for ipType is available here.


Custom Headers

Custom header functionality in proxy APIs allows users to specify their own HTTP headers when making requests through a proxy.

While proxy APIs typically manage headers automatically to optimize performance, custom headers can be necessary in certain situations to achieve specific objectives.

Why Use Custom Headers?

  • Specific Data Requirements: Necessary for requests that require particular headers to return desired data.
  • POST Requests: Essential for including headers like Content-Type or Authorization for proper request processing.
  • Bypassing Anti-Bot Systems: Helps mimic legitimate user behavior to avoid detection and blocking.
  • Session Management: Facilitates the inclusion of cookies or tokens needed for authenticated requests.
  • Targeted Marketing and Advertising: Enables the delivery of specific campaigns based on user context.
  • Content Delivery: Ensures receipt of the most relevant content by indicating user preferences.

Word of Caution

  • Performance Impacts: Misuse can lead to reduced performance and detection as automated traffic.
  • Continuous Header Generation: Necessary for large-scale scraping to avoid blocks.
  • Use Judiciously: Should be employed only when absolutely necessary to prevent complications.

To set custom headers, we can use customHttpRequestHeaders. We pass these headers in as an array of JSON objects.

import json
from base64 import b64decode

import requests

api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_API_KEY", ""),
    json={
        "url": "https://httpbin.org/anything",
        "httpResponseBody": True,
        "customHttpRequestHeaders": [
            {
                "name": "Accept-Language",
                "value": "fa",
            },
        ],
    },
)
http_response_body = b64decode(api_response.json()["httpResponseBody"])
headers = json.loads(http_response_body)["headers"]

As shown above, customHttpRequestHeaders takes an array of JSON objects. This feature is found in their documentation here, and it's imperative when you're scraping at scale.

Sometimes sites require special headers, and you need to ensure that you're always using clean headers.


Static Proxies

While they only apply to a specific niche of scraping, Static Proxies (Sticky Sessions) are also a staple when scraping the web.

Static proxies, often referred to as sticky sessions, are a type of proxy server that maintains a consistent IP address for a user over multiple requests.

This means that once a user is assigned a specific proxy IP, they can continue using that same IP for subsequent requests, rather than having their IP address change with each new request.

When you scrape the web, sometimes you need to hang on to your session. Static proxies are a valuable tool for users needing consistent IP addresses for applications like web scraping, session management, and online research. The most common use for this is when you're logging into a site to view information.

To handle sessions, we need to deal with two parameters, sessionContext and sessionContextParameters.

from base64 import b64decode

import requests

api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_API_KEY", ""),
    json={
        "url": "http://httpbin.org/cookies",
        "httpResponseBody": True,
        "sessionContext": [
            {
                "name": "id",
                "value": "cookies",
            },
        ],
        "sessionContextParameters": {
            "actions": [
                {
                    "action": "goto",
                    "url": "http://httpbin.org/cookies/set/foo/bar",
                },
            ],
        },
    },
)
http_response_body_bytes = b64decode(api_response.json()["httpResponseBody"])
http_response_body = http_response_body_bytes.decode()
print(http_response_body)

The full documentation for this is available here.

You can use sessions when you need to keep your browsing session intact, for instance, if you don't want to be logged out of the site between requests.


Screenshot Functionality

Screenshot functionality in proxy APIs allows users to capture visual representations of web pages or specific content displayed on the internet. This feature is a powerful tool for enabling visual representation of web content for verification, monitoring, and analysis.

It enhances the ability to track changes, verify content, and support testing processes.

Zyte offers some very robust screenshot functionality. To take a screenshot using the Zyte API, we can use the screenshot parameter. There are also several options we can use along with this screenshot param.

from base64 import b64decode

import requests

api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_API_KEY", ""),
    json={
        "url": "https://toscrape.com",
        "screenshot": True,
    },
)
screenshot: bytes = b64decode(api_response.json()["screenshot"])
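
The decoded screenshot is just raw image bytes, so you can write it straight to disk. The filename below is our own choice; match the extension to whatever format you request in screenshotOptions.

# persist the decoded bytes so you can inspect the capture
with open("screenshot.jpeg", "wb") as image_file:
    image_file.write(screenshot)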

Here is an example of taking a more customized screenshot of the full page.

from base64 import b64decode

import requests

api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_API_KEY", ""),
    json={
        "url": "https://toscrape.com",
        "screenshot": True,
        "screenshotOptions": {
            "format": "png",
            "fullPage": True
        }
    },
)
screenshot: bytes = b64decode(api_response.json()["screenshot"])

Their full documentation on screenshots is available here.


Auto Parsing

We can all admit that parsing is also one of the more difficult tasks when scraping the web.

Auto parsing, sometimes referred to as auto extract, is a functionality offered by some proxy APIs that automatically extracts and structures data from web pages without requiring users to write complex scraping scripts.

This feature simplifies the data retrieval process, making it accessible even to those with limited technical expertise.

Zyte offers one of the best Auto Parsing experiences on the entire web.

In the code below, we use Zyte's product parameter to automatically extract the product (book) data from the page.

import requests

api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_API_KEY", ""),
    json={
        "url": (
            "https://books.toscrape.com/catalogue"
            "/a-light-in-the-attic_1000/index.html"
        ),
        "product": True,
    },
)
product = api_response.json()["product"]
print(product)
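
The returned product is a plain dict, so you can pull individual fields out of it. The field names below (name, price) come from Zyte's product schema; treat them as an assumption and check the response for your own target page.

# hypothetical field access; inspect the dict for the fields your page returns
print(product.get("name"))
print(product.get("price"))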

You can view a full list of their auto parsing features here.


Case Study: Using Zyte API on IMDb Top 250 Movies

Now, it's time for a bit of an experiment. We're going to scrape IMDb's top 250 movies using both the Zyte API and the ScrapeOps API.

While our requests to the respective APIs are done quite differently, the overall process is much the same. With Zyte, we send our API parameters in the JSON body of a POST request.

With ScrapeOps, we use a GET request, so we create a function that converts any target URL into a ScrapeOps proxy URL, and then we request that URL.

Here is the code we use to access the content with the ScrapeOps API. We create a function, get_scrapeops_url(). It takes in our API parameters and returns a url that takes us to our custom content.

def get_scrapeops_url(url):
    payload = {
        "api_key": API_KEY,
        "url": url,
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url
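
Any URL we wrap with this function gets routed through the ScrapeOps proxy. A quick (hypothetical) usage example, assuming the imports and API_KEY from the full script below:

# fetch the IMDb chart through the ScrapeOps proxy
response = requests.get(get_scrapeops_url("https://www.imdb.com/chart/top/"))
print(response.status_code)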

To access our content from the Zyte API, we use the following POST request. As we did earlier, we place our parameters inside the JSON body of the request.

response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=(API_KEY, ""),
    json={
        "url": url,
        "httpResponseBody": True
    }
)

Then, we decode our Zyte response like we did earlier.

content = b64decode(response.json()["httpResponseBody"]).decode('utf-8')

Here is our full ScrapeOps code.

import requests
from bs4 import BeautifulSoup
import json
import logging
from urllib.parse import urlencode

## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["scrapeops_api_key"]

def get_scrapeops_url(url):
    payload = {
        "api_key": API_KEY,
        "url": url,
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url


def scrape_movies(url, location="us", retries=3):
    success = False
    tries = 0

    while not success and tries <= retries:
        response = requests.get(get_scrapeops_url(url))

        try:
            if response.status_code != 200:
                raise Exception(f"Failed response from server, status code: {response.status_code}")

            # IMDb embeds the chart data in a JSON-LD script tag
            soup = BeautifulSoup(response.text, "html.parser")
            json_tag = soup.select_one("script[type='application/ld+json']")
            json_data = json.loads(json_tag.text)["itemListElement"]

            movie_list = []
            for item in json_data:
                movie_list.append(item["item"])

            print(f"Movie list length: {len(json_data)}")
            with open("scrapeops-top-250.json", "w") as file:
                json.dump(movie_list, file, indent=4)
            success = True
        except Exception as e:
            logger.error(f"Failed to process page: {e}, retries left: {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Failed to scrape page, MAX RETRIES {retries} EXCEEDED!!!")


if __name__ == "__main__":

    MAX_RETRIES = 3

    logger.info("Starting IMDB scrape")

    url = "https://www.imdb.com/chart/top/"

    scrape_movies(url, retries=MAX_RETRIES)

    logger.info("Scrape complete")

Here is our full code using the Zyte API.

import requests
from bs4 import BeautifulSoup
import json
from base64 import b64decode
import logging

## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["zyte_api_key"]


def scrape_movies(url, location="us", retries=3):
    success = False
    tries = 0

    while not success and tries <= retries:
        response = requests.post(
            "https://api.zyte.com/v1/extract",
            auth=(API_KEY, ""),
            json={
                "url": url,
                "httpResponseBody": True
            }
        )

        try:
            if response.status_code != 200:
                raise Exception(f"Failed response from server, status code: {response.status_code}")

            # decode the Base64 body, then pull IMDb's JSON-LD movie list
            content = b64decode(response.json()["httpResponseBody"]).decode('utf-8')
            soup = BeautifulSoup(content, "html.parser")
            json_tag = soup.select_one("script[type='application/ld+json']")
            json_data = json.loads(json_tag.text)["itemListElement"]

            movie_list = []
            for item in json_data:
                movie_list.append(item["item"])

            print(f"Movie list length: {len(json_data)}")
            with open("zyte-top-250.json", "w") as file:
                json.dump(movie_list, file, indent=4)
            success = True
        except Exception as e:
            logger.error(f"Failed to process page: {e}, retries left: {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Failed to scrape page, MAX RETRIES {retries} EXCEEDED!!!")


if __name__ == "__main__":

    MAX_RETRIES = 3

    logger.info("Starting IMDB scrape")

    url = "https://www.imdb.com/chart/top/"

    scrape_movies(url, retries=MAX_RETRIES)

    logger.info("Scrape complete")

ScrapeOps parsed and saved the data in 4.955 seconds.

ScrapeOps Performance Results

With Zyte, we completed the scrape in 4.199 seconds.

Zyte Performance Results

All in all, both of these APIs are pretty performant. While accessing the site via Zyte has a slightly higher learning curve, it is a little bit faster: 4.955 seconds - 4.199 seconds = 0.756 seconds difference. The difference here is pretty minimal.

There were no real challenges getting through. Both proxies made it through without using any of our retry logic.


Alternative: ScrapeOps Proxy API Aggregator

With the ScrapeOps Proxy Aggregator, you get access to a larger set of price plans and much of the same service and reliability that you get with the Zyte API. In fact, Zyte is one of the many providers we use in our Proxy Aggregator.

ScrapeOps Proxy Providers

When you use the ScrapeOps Proxy Aggregator, you get access to Zyte and numerous other proxy providers as well. We're adding new service providers all the time. You can take a look at our plans below.

Plans range from as low as $9 per month to as high as $249 per month. In comparison to Zyte, our lower tier plans cost far less and our higher tier plans cost drastically less: the Zyte mid-tier plan is $350 per month, while our highest tier plan at ScrapeOps is $249.

ScrapeOps Pricing


Troubleshooting

Issue #1: Request Timeouts

Handling timeouts is really easy to do with Python Requests. All we need to do is pass the timeout keyword argument with our request. This tells Requests to wait up to our timeout limit before throwing an exception. Simply pass a number of seconds into the timeout argument.

import requests

# 5 second timeout
response = requests.get("https://httpbin.org/get", timeout=5)
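
If you'd rather handle the timeout gracefully than let the exception bubble up, catch requests.exceptions.Timeout. A minimal sketch:

import requests

try:
    response = requests.get("https://httpbin.org/get", timeout=5)
except requests.exceptions.Timeout:
    print("Request timed out after 5 seconds; consider retrying.")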

Issue #2: Handling CAPTCHAs

With both ScrapeOps and Zyte, if you're running into CAPTCHAs, there's a problem. Both of these APIs provide CAPTCHA avoidance and tend to bypass them completely.

However, sometimes things happen and the anti-bots do trip us up. If you run into a CAPTCHA, first, retry the request.

If you're receiving CAPTCHAs consistently, you can use a service like 2Captcha. We've got a full article on getting past CAPTCHAs here. It goes through CAPTCHA solving libraries and even services like 2Captcha that we mentioned above.

Issue #3: Invalid Response Data

When dealing with invalid responses, we need to troubleshoot our status codes. As mentioned earlier, if you are receiving anything other than a 200, something is wrong.

You can view Zyte's status codes here. The ScrapeOps status codes are available for you to review here.

In most cases, once you understand the response code, you'll know exactly what to change.

  • If you're receiving a 429, slow down your requests.
  • If you receive a 404, you're looking for a page that doesn't exist.

To make a long story short, look up your status code and solve it accordingly.


Legal and Ethical Considerations

All of the data scraped in this article has been public data. Public data is generally legal to scrape. Private data (data gated behind a login or some other type of authentication) is a completely different story.

When you scrape private data, you are subject to the same IP (intellectual property) and privacy laws that govern the site you're scraping.

You also need to be aware of your target site's terms and conditions and their robots.txt file as well. IMDB's terms and robots.txt are available for you to review in the links below.

Violating these terms can have severe consequences. If you choose to misuse a site or an API, it can result in:

  • Account suspension or even termination if you have an account at the site. Your account is subject to their Terms and Conditions.
  • Lawsuits and other legal penalties, depending on which data you scrape and how it gets disseminated.
  • Reputation damage to the companies involved; when a company suffers reputation damage because of your scrape, it can sue you for the resulting losses.
  • Risks to users, such as exposure of personal data. This can expose you to lawsuits or even prison time.

Conclusion

You now know how to integrate with Zyte API using proxy ports, their SDK, and their REST API. You got a crash course in how to make POST requests with authentication using Python Requests.

Whether you choose the ScrapeOps Proxy Aggregator or the Zyte API, you can get a stable and reliable connection to your data.

You can get one of the more expensive plans with all the bells and whistles from Zyte, or you can get a much more reasonable price plan from ScrapeOps with most of the same functionality.

If you are planning on using Auto Parsing features, Zyte is clearly ahead (albeit far more expensive), but if you're just looking to scrape the web normally, ScrapeOps is clearly the better value.


More Web Scraping Guides

At ScrapeOps, we've got tons of learning resources for anyone who wants to read them. If you're brand new to web development, we've got something for you. If you're a seasoned dev, we've also got something for you.

We love web scraping so much, we even wrote a playbook on it!

If you'd like to learn more about integration with other proxies, check out the links below!