
ScrapeOwl: Web Scraping Integration Guide

ScrapeOwl is a proxy API service much like the ScrapeOps Proxy Aggregator. It handles the low-level proxy management so you don't have to, leaving you free to focus on the scrape itself.

If you decide to follow along, by the end of this guide you'll have a solid understanding of proxy integration with both ScrapeOwl and our Proxy Aggregator, so you can make an informed decision when signing up for a proxy service.

Need help scraping the web?

Then check out ScrapeOps, the complete toolkit for web scraping.


TLDR: Web Scraping With ScrapeOwl?

Getting started with ScrapeOwl is a relatively simple process. Once you've got an API key, save it to a file rather than hardcoding it into your scraper (hardcoding is not recommended). Then, all you have to do is ping their REST API. The code example below does just that.

It came straight from their request builder; the only thing we modified was how the API key is handled.

The request builder writes the API key directly into the prebuilt request. We added a config.json file to hold our key and keep it out of the actual Python file.
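
If you want to follow the same pattern, config.json is just a small JSON object holding the key. The key name below matches what the scripts in this guide read (scrapeowl_api_key); the value is a placeholder:

{
    "scrapeowl_api_key": "YOUR-SUPER-SECRET-API-KEY"
}

With that file in place, the full request looks like this: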

import requests
import json

API_KEY = ""

with open("config.json") as file:
    config = json.load(file)
    API_KEY = config["scrapeowl_api_key"]

# API details
scrapeowl_url = "https://api.scrapeowl.com/v1/scrape"

# Object of the request
object_of_data = {
    "api_key": API_KEY,
    "url": "https://httpbin.org/ip",
    "json_response": True
}

# Convert object to JSON
data = json.dumps(object_of_data)

# Set headers
headers = {
    "Content-Type": "application/json"
}

# Making HTTP POST request
response = requests.post(scrapeowl_url, data, headers=headers)

# Print the JSON response from API
print(response.json())

When you decide to use ScrapeOwl (or any proxy service for that matter), it comes with a service agreement. ScrapeOwl's terms are available here. It's important to follow these guidelines if you want to continue using their product. Always respect the rules of the service you're using.

Also, don't use proxies for illegal activities. There are many reasons why, starting with the fact that it's immoral and unethical. Beyond that, anything illegal you do will be traced to the proxy IP you used, and from there to the proxy service.

At that point, just about any reasonable business will trace the illegal activity to your API key. So even if your morals don't keep your proxy usage ethical, law enforcement will.


What is ScrapeOwl?

ScrapeOwl acts as a middleman between your scraper and the site you wish to scrape. Websites are often configured to block scrapers and other types of bots.

To get around this, we use a service like ScrapeOwl to access the site. Once they've accessed the site, they send the page back to our scraper.

Services like this allow us to control our geolocation, JavaScript execution, proxy type (datacenter, residential/mobile) and more.

Homepage

We use services like this to gain access to the sites we want to scrape. These services automate our proxy management and allow us to focus on parsing our data.

With this type of service, you don't need to maintain a list of proxies. You don't need to rotate them, and you don't need to set them up manually. ScrapeOwl does all of this for you.


How Does ScrapeOwl Work?

As previously mentioned, ScrapeOwl is a proxy service. When you want access to a site, ScrapeOwl selects an IP address from its pool of proxies and makes the request for you. Then, you get the response back from ScrapeOwl. The overall process goes something like this:

  1. You make a request to ScrapeOwl using your api_key and your target url. For advanced functionality, you can use additional parameters.

  2. ScrapeOwl receives the request and tries to fetch your url. If unsuccessful, ScrapeOwl retries with a better (often mobile or residential) proxy.

  3. ScrapeOwl receives its response and executes any remaining instructions from your request, such as JavaScript rendering and timed waits.

  4. ScrapeOwl sends the HTML page back to you. This way, you've got access to the content you want. All you need to worry about is parsing that content.

Products like ScrapeOwl make our lives much easier when scraping the web. We don't need to worry about managing proxies at a granular level. All we need to do is access our target site.

Using ScrapeOwl requires the following components inside of our parameters to the API.

| Component | Description |
| --- | --- |
| API Key | Used to authenticate your account. |
| Target URL | The website you'd like to scrape. |
| Custom Options | Advanced functionality for your request (geotargeting, wait time, etc.) |

Response Format

By default, our response from ScrapeOwl comes in JSON format. This gives us all sorts of useful information along with the actual page we want to retrieve.

If you look below, you'll see an example (taken from their documentation) of their JSON schema. It includes all sorts of useful information about the request.

{
    "status": 200,
    "is_billed": true,
    "credits": {
        "available": 0,
        "used": 0,
        "request_cost": 0
    },
    "html": "{\n \"origin\": \"98.118.113.251\"\n}\n"
}
  • status: The status code returned by the API.
  • is_billed: Whether or not your account was charged for the request.
  • credits:
    • available: The available credits on your account.
    • used: The amount of credits you've used.
    • request_cost: The cost (in API credits) of this specific request.
  • html: The actual page from the website you're scraping. When you parse your data, it's all going to come from this field.
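
Since the page itself lives in the html field, parsing usually starts by pulling that field out of the JSON and handing it to an HTML parser. A minimal sketch, assuming response is the result of the requests.post() call from the example above:

from bs4 import BeautifulSoup

# The actual page content is nested inside the JSON response
page_html = response.json()["html"]

# Parse it like any other HTML document
soup = BeautifulSoup(page_html, "html.parser")
print(soup.get_text())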

ScrapeOwl Pricing

Take a look at the screenshot below. This contains the available plans you can get with ScrapeOwl. There are only three, but each one caters to a different type of user: Bootstrap, Startup, and Business.

When you look at the features, it might be a bit difficult to understand, but we'll break it down for you!

ScrapeOwl Plans

| Plan | API Credits | Monthly Cost | Basic Request Cost |
| --- | --- | --- | --- |
| Bootstrap | 250,000 | $29 | $0.00016 |
| Startup | 1,000,000 | $99 | $0.000099 |
| Business | 3,000,000 | $249 | $0.000083 |

While their selection is a bit limited, ScrapeOwl's cost for a basic request is extremely reasonable. When we're making more advanced requests, this price goes up.

Let's look at the cost breakdown for advanced functionality. The table below holds a cost breakdown of our requests.

| Feature | Cost (API Credits) | Description |
| --- | --- | --- |
| Basic Request | 1 | Simple request without render_js. |
| Basic with render_js | 5 | render_js with rotating proxies. |
| Premium Proxy | 10 | Premium proxy without render_js. |
| Premium with render_js | 25 | render_js with a premium proxy. |

Response Status Codes

Anytime you deal with web development, status codes are imperative.

Most of us know that a 200 indicates a successful request. If you didn't know, now you do! Successful means that your request was formatted properly, had the proper permissions, and received a proper response from the server.

Anything other than a 200 indicates a failed request. Whether you're out of credits, forgot your API key, or used the wrong parameters, a non-200 status requires your attention and usually means something needs fixing.

The table below outlines the status codes you might receive from ScrapeOwl when using their API. If you run into issues, this table will give you a good idea of what your status code means and how to fix it.

| Status Code | Meaning | Description |
| --- | --- | --- |
| 200 (billed) | Success | Everything worked! |
| 400 | Bad Request | Double check your params to make sure they're right. |
| 401 | Unauthorized | Your API key is incorrect or you forgot to use it. |
| 403 | Forbidden | You don't have enough API credits for this call. |
| 404 (billed) | Page Not Found | Double check your url, the page wasn't found. |
| 429 | Too Many Requests | Slow your requests down and space them apart. |
| 500 | Internal Error | Something failed with ScrapeOwl. Try again. |
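
In practice, you'll want your scraper to check this code before parsing and react differently to the errors it can recover from. Below is a minimal sketch of that pattern; treating 429 and 500 as retryable is our own choice here, not something ScrapeOwl prescribes:

import time
import requests

RETRYABLE = {429, 500}  # slow down / transient server error

def fetch_with_checks(scrapeowl_url, payload, headers, max_retries=3):
    for attempt in range(max_retries):
        response = requests.post(scrapeowl_url, payload, headers=headers)
        if response.status_code == 200:
            return response.json()
        if response.status_code in RETRYABLE:
            time.sleep(2 ** attempt)  # back off before retrying
            continue
        # 400/401/403/404 won't fix themselves, so fail fast
        raise Exception(f"Request failed with status {response.status_code}")
    raise Exception("Max retries exceeded")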

Setting Up ScrapeOwl

Getting started with ScrapeOwl is really easy. They offer a 1,000 credit free trial (just like ScrapeOps). Simply click the Sign Up button, and it will take you to the registration page.

Scrapeowl Offer

Registration is pretty straightforward. They offer several integrations including Google and Github. They also give you the option to create an account the old fashioned way with a username and password.

Register Scrapeowl

Once you've signed up, you'll be taken to their dashboard. This is sort of like the control center for everything to do with your account.

Scrapeowl Dashboard

If you click on the API Request Builder tab, you can get started making requests easily. The Request Builder automatically generates basic requests for us in cURL, NodeJS, Python, and PHP. Just put in your parameters, and it spits out a custom request just for you.

Scrapeowl Requests Builder

When we use ScrapeOwl, we get the option to connect using the following methods.

  • REST API: With this method, we make requests to the ScrapeOwl server using our api_key, url and any additional parameters. Use their REST API when you want a battle tested proxy connection. With the API, you also get the option for more granular control over your requests.

  • Proxy Mode: This method uses proxy port integration. You set up a basic connection to your proxy and continue with the rest of your code as normal. This feature is currently in beta, but it is usable.

API Endpoint Integration

You've actually already seen API endpoint integration with the ScrapeOwl API. When we use endpoint integration, we get fine, granular control over our proxied requests.

Let's look at this example again and break down how it works. Take a look at the snippet below.

import requests
import json

API_KEY = ""

with open("config.json") as file:
    config = json.load(file)
    API_KEY = config["scrapeowl_api_key"]

# API details
scrapeowl_url = "https://api.scrapeowl.com/v1/scrape"

# Object of the request
object_of_data = {
    "api_key": API_KEY,
    "url": "https://httpbin.org/ip",
    "json_response": True
}

# Convert object to JSON
data = json.dumps(object_of_data)

# Set headers
headers = {
    "Content-Type": "application/json"
}

# Making HTTP POST request
response = requests.post(scrapeowl_url, data, headers=headers)

# Print the JSON response from API
print(response.json())
  • https://api.scrapeowl.com/v1 is the base domain for all of our API requests.
  • These requests get sent to the /scrape endpoint using POST.
  • Anytime you make a request to this endpoint, you need the following:
    • api_key: Use this to authenticate your request and tie it to your account.
    • url: This is the url of the target site. The page you wish to scrape.
    • Any additional params, such as render_js or country.

You can find more information on making a basic request here.

Proxy Port Integration

Proxy Port integration is a must-have for many developers. ScrapeOwl calls this Proxy Mode. The Proxy Mode feature is currently in beta but is usable. This example comes straight from their documentation.

To avoid SSL errors, you need to set verify=False when making requests using Proxy Mode. verify=False tells Requests to ignore any SSL errors it encounters.

import requests

# API details
url_to_scrape = "https://httpbin.org/ip"

# ScrapeOwl proxy
proxies = {
    "http": "http://scrapeowl:YOUR_API_KEY@proxy.scrapeowl.com:9000",
    "https": "http://scrapeowl:YOUR_API_KEY@proxy.scrapeowl.com:9000",
}

# Making HTTP GET request
response = requests.get(url_to_scrape, proxies=proxies, verify=False)

# Print the response from site
print(response.text)

To use Proxy Mode, we:

  • Create a proxies object. This contains the url and protocols for our proxies (HTTP and HTTPS).
  • Make our request: requests.get(url_to_scrape, proxies=proxies, verify=False).
    • url_to_scrape: Pretty self explanatory. The url you want to scrape.
    • proxies=proxies: Tells Requests to route the request through the proxies object we created earlier.
    • verify=False: Tells Requests that we wish to ignore any SSL errors that might happen as a result of our proxy connection (see the note just below).
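
One side effect of verify=False is that Requests (via urllib3) prints an InsecureRequestWarning on every call. If that noise gets in the way, you can silence it; a minimal sketch:

import requests
import urllib3

# Suppress the InsecureRequestWarning that verify=False triggers
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

proxies = {
    "http": "http://scrapeowl:YOUR_API_KEY@proxy.scrapeowl.com:9000",
    "https": "http://scrapeowl:YOUR_API_KEY@proxy.scrapeowl.com:9000",
}

response = requests.get("https://httpbin.org/ip", proxies=proxies, verify=False)
print(response.text)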

Proxy Mode is a great setup for devs who aren't concerned with the finer details. When you use Proxy Mode, you're typically more focused on your code and just want access to the site. Proxy Mode gains that access for you, so all you need to worry about is your parsing code.

You can find their full documentation on Proxy Mode here.

Managing Concurrency

With the free trial from ScrapeOwl, you only get one concurrent thread. However, the lowest tier paid plan, Bootstrap, gives you access to up to 10 concurrent threads. This is an amazing feature if you're looking to accomplish a lot in a short amount of time.

With concurrency, we can make new requests while we're still awaiting the results of previous requests. In effect, we can make a whole batch of requests at essentially the same time.

In the example below, we write a function called scrape_page() and we use ThreadPoolExecutor to run it on multiple threads. This allows us to run scrape_page() on multiple urls simultaneously. Use concurrency to maximize the speed and efficiency of your scraper.

import requests
import json
from bs4 import BeautifulSoup
import concurrent.futures
from urllib.parse import urlencode

API_KEY = 'YOUR-SUPER-SECRET-API-KEY'

NUM_THREADS = 5

def get_proxy_url(url):
    payload = {'api_key': API_KEY, 'url': url}
    proxy_url = 'INSERT_PROXY_ENDPOINT' + urlencode(payload)
    return proxy_url

## Example list of urls to scrape
list_of_urls = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
    'http://quotes.toscrape.com/page/3/',
]

output_data_list = []

def scrape_page(url):

    scrapeowl_url = "https://api.scrapeowl.com/v1/scrape"

    payload = json.dumps({
        "api_key": API_KEY,
        "url": url,
        "json_response": True
    })

    headers = {
        "Content-Type": "application/json"
    }

    try:
        response = requests.post(scrapeowl_url, payload, headers=headers)
        if response.status_code == 200:
            soup = BeautifulSoup(response.json()["html"], "html.parser")
            title = soup.find('h1').text

            ## add scraped data to "output_data_list" list
            output_data_list.append({
                'title': title,
            })

    except Exception as e:
        print('Error', e)


with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    executor.map(scrape_page, list_of_urls)

print(output_data_list)
  • ThreadPoolExecutor opens up a new pool of threads with a max_workers set by us.
  • scrape_page is the function we'd like to run on each thread.
  • list_of_urls is actually the list of urls we want to run scrape_page on. ThreadPoolExecutor takes each one of these urls and passes it into its own instance of scrape_page.
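
One design note: the example above appends to a shared output_data_list from multiple threads. That works here, but a cleaner variant is to have scrape_page() return its result and let executor.map() collect everything for you. A minimal sketch of that approach, under the same assumptions (valid API key, same target pages):

import concurrent.futures
import json
import requests
from bs4 import BeautifulSoup

API_KEY = "YOUR-SUPER-SECRET-API-KEY"
NUM_THREADS = 5

list_of_urls = [
    "http://quotes.toscrape.com/page/1/",
    "http://quotes.toscrape.com/page/2/",
    "http://quotes.toscrape.com/page/3/",
]

def scrape_page(url):
    payload = json.dumps({"api_key": API_KEY, "url": url, "json_response": True})
    headers = {"Content-Type": "application/json"}
    response = requests.post("https://api.scrapeowl.com/v1/scrape", payload, headers=headers)
    soup = BeautifulSoup(response.json()["html"], "html.parser")
    # Return the parsed result instead of appending to a shared list
    return {"title": soup.find("h1").text}

with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    # map() gathers each return value and preserves input order
    output_data_list = list(executor.map(scrape_page, list_of_urls))

print(output_data_list)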

On the paid plans, ScrapeOwl gives you access to some very generous multithreading. If you're doing giant crawls and scraping at scale, this feature is very helpful.

Why scrape one page at a time when you could scrape 30?


Advanced Functionality

ScrapeOwl gives us a ton of advanced functionality to work with. Whether you're looking to appear in a custom country or you need to render JavaScript, ScrapeOwl appears to have your needs covered.

The table below lays out the different parameters you can use for advanced functionalities.

| Parameter | Additional Cost (API credits) | Description |
| --- | --- | --- |
| api_key | 0 | Your API key for authentication. |
| url | 0 | The url you want to scrape. |
| elements | 0 | List of elements to extract from the page. |
| html | 0 | Return only the HTML. (defaults to False) |
| return_headers | 0 | Return the headers sent from the target site. |
| return_cookies | 0 | Return cookies from the target site. |
| cookies | 0 | Cookies to send to the target url. |
| headers | 0 | Headers to send to the target url. |
| request_method | 0 | GET, POST, or PUT (defaults to GET) |
| post_data | 0 | JSON body to be sent with a POST request. |
| premium_proxies | 10 (basic) / 25 (render_js) | Use residential proxies to scrape. |
| country | 10/25 (requires premium_proxies) | The country you wish to appear in. |
| render_js | 5 (basic) / 25 (premium_proxies) | Open a headless browser and render JavaScript. |
| custom_js | 5/25 (requires render_js) | Execute a set of custom JavaScript instructions. |
| wait_for | 5/25 (requires render_js) | Wait [x] seconds or for [x] element to appear. |
| reject_requests | 5/25 (requires render_js) | Block requests from executing on the page. |
| json_response | 0 | Return our response as JSON. (defaults to True) |
| screenshot | 5/25 (requires render_js) | Take a screenshot. (defaults to False) |
| block_resources | 5/25 (requires render_js) | Block resources from loading. (defaults to True) |

You can find their full list of advanced functionality in the docs here.


Javascript Rendering

JavaScript Rendering is the process of executing JavaScript code to generate or manipulate content on a web page, often dynamically. Many modern web applications rely on JavaScript to render content in the browser after the initial HTML page has loaded.

  • Enables dynamic content generation without page reloads.
  • Powers Single Page Applications (SPAs) for seamless user experience.
  • Handles real-time user interactions and updates.
  • Loads and renders asynchronous data (e.g., API calls) efficiently.
  • Enhances modern SEO practices for JavaScript-heavy websites.
  • Improves performance through smooth transitions and partial updates.

Rendering JavaScript is really easy. All we need is the render_js parameter. When we set this parameter to True, it tells ScrapeOwl to open a browser and render JavaScript content when loading the page.

import requests
import json

# API details
scrapeowl_url = "https://api.scrapeowl.com/v1/scrape"

# Object of the request
object_of_data = {
    "api_key": "YOUR-SUPER-SECRET-API-KEY",
    "url": "https://httpbin.org/ip",
    "json_response": True,
    "render_js": True
}

# Convert object to JSON
data = json.dumps(object_of_data)

# Set headers
headers = {
    "Content-Type": "application/json"
}

# Making HTTP POST request
response = requests.post(scrapeowl_url, data, headers=headers)

# Print the JSON response from API
print(response.json())

The full documentation for JavaScript rendering is available here. You need to scroll down to view the render_js documentation.

Controlling The Browser

render_js actually gives us the ability to control the browser as well. When we use render_js, we can also pass an additional parameter, custom_js.

When we pass these two commands in together, ScrapeOwl will open a browser and execute our custom_js instructions.

import requests
import json

# API details
scrapeowl_url = "https://api.scrapeowl.com/v1/scrape"

# Object of the request
object_of_data = {
    "api_key": "your-super-secret-api-key",
    "url": "https://httpbin.org/ip",
    "json_response": True,
    "render_js": True,
    "custom_js": "window.scrollTo(0,document.body.scrollHeight);"
}

# Convert object to JSON
data = json.dumps(object_of_data)

# Set headers
headers = {
    "Content-Type": "application/json"
}

# Making HTTP POST request
response = requests.post(scrapeowl_url, data, headers=headers)

# Print the JSON response from API
print(response.json())

The documentation for this is available here.


Country Geotargeting

Country Geotargeting allows websites or online platforms to deliver content or advertisements based on the geographic location of a user, specifically targeting users in certain countries.

This is typically achieved by detecting the user's IP address and determining their location, then adjusting the content accordingly.

Geotargeting is one of the primary reasons to use any proxy service to begin with. When we use geotargeting, our requests get routed through whatever proxy location we choose. This means that if you want the site to think you're in Brazil, ScrapeOwl will make your request through a Brazilian proxy.

ScrapeOwl gives a pretty large list of locations we can use. Take a look below.

| Country | Country Code |
| --- | --- |
| Brazil | br |
| Canada | ca |
| France | fr |
| Germany | de |
| Greece | ge |
| Israel | il |
| India | in |
| Italy | it |
| Mexico | mx |
| Netherlands | nl |
| Russia | ru |
| Spain | es |
| Sweden | se |
| United Kingdom | gb |
| United States | us |

import requests
import json

# API details
scrapeowl_url = "https://api.scrapeowl.com/v1/scrape"

# Object of the request
object_of_data = {
    "api_key": "your-super-secret-api-key",
    "url": "https://httpbin.org/ip",
    "premium_proxies": True,
    "country": "us",
    "json_response": True
}

# Convert object to JSON
data = json.dumps(object_of_data)

# Set headers
headers = {
    "Content-Type": "application/json"
}

# Making HTTP POST request
response = requests.post(scrapeowl_url, data, headers=headers)

# Print the JSON response from API
print(response.json())

You can view the full documentation for this feature here. Remember to scroll down to the list of countries.

Geotargeting with ScrapeOwl is nice; however, ScrapeOwl requires a premium proxy in order to use a custom geolocation. This means you're paying a minimum of 10 API credits to use this feature.

Most other providers (ScrapeOps included) will let you use geolocation with their datacenter proxies for no additional charge.


Residential Proxies

We've already talked a little bit about premium proxies.

Premium (residential) proxies will route your request through a residential IP instead of using a datacenter. These proxies are associated with physical devices, like home computers or mobile phones, giving them the appearance of legitimate residential users rather than data centers.

This feature is really important when dealing with some sites. Some websites block datacenter IPs altogether. To use premium proxies, simply pass the premium_proxies parameter into the body of your request.

When using premium proxies with ScrapeOwl, you should pass a country in as we did above in our geotargeting example. If you don't pass in a location, it will default to us. In the example below, ScrapeOwl will automatically set our location to us.

import requests
import json

# API details
scrapeowl_url = "https://api.scrapeowl.com/v1/scrape"

# Object of the request
object_of_data = {
    "api_key": "your-super-secret-api-key",
    "url": "https://httpbin.org/ip",
    "premium_proxies": True,
    "json_response": True
}

# Convert object to JSON
data = json.dumps(object_of_data)

# Set headers
headers = {
    "Content-Type": "application/json"
}

# Making HTTP POST request
response = requests.post(scrapeowl_url, data, headers=headers)

# Print the JSON response from API
print(response.json())

The full docs for premium proxies are available here.


Custom Headers

The Custom Header functionality in proxy APIs allows users to define and send their own HTTP request headers instead of relying on the default headers managed by the proxy service.

Headers are part of the metadata sent in HTTP requests that provide important information about the request itself, such as content type, user agent, authentication, etc.

While proxy APIs typically optimize headers for performance, this feature offers flexibility for users who need more control.

Why Use Custom Headers?

  • Accessing Specific Data: Certain web services or APIs require specific headers (like custom authentication tokens or content types) to access the desired data.

  • POST Requests: When making POST requests, which involve sending data to the server, some headers (e.g., Content-Type or Authorization) are necessary to successfully send and retrieve data.

  • Bypassing Anti-Bot Systems: Some websites use sophisticated anti-bot measures that detect automated requests. Custom headers (like rotating User-Agents, Referrer, etc.) can be used to mimic real human browsing behavior and bypass these systems.

Word of Caution

  • Performance Reduction: Using static or incorrect custom headers can negatively impact the proxy performance. If the same headers are repeatedly sent, it might give away automated behavior, making it easier for websites to detect and block the requests.

  • Risk of Blocks: In large-scale web scraping, sending poorly managed or unrotated headers can lead to frequent blocking by websites. Proper header rotation systems are essential to avoid detection.

  • Only Use When Necessary: Proxy APIs generally optimize headers for maximum efficiency. Custom headers should only be used when absolutely required for the task to prevent performance degradation or detection.

Setting custom headers is pretty easy. To do this, all we need is the headers parameter. We then pass our headers in as a JSON object.

ScrapeOwl reads these headers and forwards them to the target site as the actual request headers. The target site then interprets them and responds accordingly.

import requests
import json

# API details
scrapeowl_url = "https://api.scrapeowl.com/v1/scrape"

# Object of the request
object_of_data = {
    "api_key": "your-super-secret-api-key",
    "url": "https://httpbin.org/ip",
    "headers": {
        'Accept-Language': 'en-US'
    }
}

# Convert object to JSON
data = json.dumps(object_of_data)

# Set headers
headers = {
    "Content-Type": "application/json"
}

# Making HTTP POST request
response = requests.post(scrapeowl_url, data, headers=headers)

# Print the JSON response from API
print(response.json())

Their custom header documentation is available for you to view here.


Static Proxies

Static Proxies or Sticky Sessions are used to keep browsing sessions intact. This means all requests during that session are routed through the same IP, instead of using a different IP for each request (which happens with rotating proxies).

Static proxies (or sticky session proxies) are used to ensure continuity and consistency in network interactions by maintaining the same IP address over an extended session. This helps:

  • Maintain Session Stability: Prevents session resets by using the same IP for the entire interaction.
  • Avoid IP Rotation Detection: Keeps a stable IP, reducing the risk of being flagged for suspicious activity.
  • Support Multi-Account Management: Ensures consistency when managing multiple accounts, as each account interacts from a specific IP.
  • Enable Reliable Web Scraping: Essential for tasks that need session persistence, such as pagination or logged-in activities.
  • Gather Consistent Data: Useful for ad verification, ensuring browsing data remains uniform for accurate insights.

For instance, if you wish to log into a site and stay logged in throughout your session, you would use the session param to set up a sticky session.

Sessions require you to use a premium proxy, so they do come at a cost with ScrapeOwl. Each session request requires the premium_proxies parameter, so by default a request with session costs us a minimum of 10 API credits.

import requests
import json

# API details
scrapeowl_url = "https://api.scrapeowl.com/v1/scrape"

# Object of the request
object_of_data = {
    "api_key": "your-super-secret-api-key",
    "url": "https://httpbin.org/ip",
    "premium_proxies": True,
    "session": 1234,
    "json_response": True
}

# Convert object to JSON
data = json.dumps(object_of_data)

# Set headers
headers = {
    "Content-Type": "application/json"
}

# Making HTTP POST request
response = requests.post(scrapeowl_url, data, headers=headers)

# Print the JSON response from API
print(response.json())

You can reuse these sessions by passing the same session number again. Their sessions docs are available here.
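
As a quick illustration, here's a sketch of two back-to-back requests sharing the same session value. If the sticky session is working as advertised, httpbin.org/ip should report the same origin IP for both (this reuses the payload shape from the example above):

import requests
import json

scrapeowl_url = "https://api.scrapeowl.com/v1/scrape"
headers = {"Content-Type": "application/json"}

payload = {
    "api_key": "your-super-secret-api-key",
    "url": "https://httpbin.org/ip",
    "premium_proxies": True,
    "session": 1234,  # same session id on both requests
    "json_response": True
}

# Two requests in the same session should come back from the same proxy IP
for _ in range(2):
    response = requests.post(scrapeowl_url, json.dumps(payload), headers=headers)
    print(response.json()["html"])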


Screenshot Functionality

When you're scraping the web, screenshots can be a lifesaver. With a screenshot, you can verify the integrity of your data and you can also debug any problems that arise quickly.

When we take a screenshot using the ScrapeOwl API, our screenshot gets saved in the cloud and we get the url of our screenshot in our response.

import requests
import json

# API details
scrapeowl_url = "https://api.scrapeowl.com/v1/scrape"

# Object of the request
object_of_data = {
    "api_key": "YOUR-SUPER-SECRET-API-KEY",
    "url": "https://httpbin.org/ip",
    "json_response": True,
    "render_js": True,
    "screenshot": True
}

# Convert object to JSON
data = json.dumps(object_of_data)

# Set headers
headers = {
    "Content-Type": "application/json"
}

# Making HTTP POST request
response = requests.post(scrapeowl_url, data, headers=headers)

# Print the JSON response from API
print(response.json())

Here is an example response from using this code.

{
    'status': 200,
    'is_billed': True,
    'credits': {
        'available': 1000,
        'used': 5,
        'request_cost': 5
    },
    'resolved_url': 'https://httpbin.org/ip',
    'screenshot_url': 'https://fra1.digitaloceanspaces.com/scrapeowl/11209/cm1tp167z2rbo07ql97b94kzp/full-page.png',
    'headers': {},
    'cookies': [],
    'data': {},
    'html': '<html><head><meta name="color-scheme" content="light dark"><meta charset="utf-8"></head><body><pre>{\n "origin": "185.161.253.48"\n}\n</pre><div class="json-formatter-container"></div></body></html>'
}

As you can see in the snippet above, our screenshot gets saved in the cloud and shows up in our JSON as the value of screenshot_url.

  • Verifying Page Data: If you wish to verify the data from your scrape, a screenshot allows you to do that quickly and conveniently.

  • Debugging: If something goes wrong, all you need to do is look at the screenshot. This makes it a lot easier to diagnose problems.
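
Since the screenshot lives at a URL rather than in the response body, grabbing a local copy is just one more download. A minimal sketch, assuming response is the result of the screenshot request above:

import requests

# Pull the screenshot URL out of the JSON response
screenshot_url = response.json()["screenshot_url"]

# Download the image and save it locally
image = requests.get(screenshot_url)
with open("screenshot.png", "wb") as file:
    file.write(image.content)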

Their screenshot functionality is documented inside the table here.


Auto Parsing

While ScrapeOwl charges us for some interesting things like geolocation, they don't charge us for auto parsing. This is a rather interesting take.

Most other providers charge a much higher price for auto parsing features. Instead of having a default page layout or using AI to parse the page, ScrapeOwl allows us to use CSS selectors and XPath to scrape.

They actually don't even require us to open a browser. In the example below, we tell ScrapeOwl to find the h1 element from the page.

import requests
import json

# API details
scrapeowl_url = "https://api.scrapeowl.com/v1/scrape"

# Object of the request
object_of_data = {
    "api_key": "your-super-secret-api-key",
    "url": "https://quotes.toscrape.com",
    "elements": [
        {
            "type": "css",
            "selector": "h1"
        }
    ],
    "json_response": True
}

# Convert object to JSON
data = json.dumps(object_of_data)

# Set headers
headers = {
    "Content-Type": "application/json"
}

# Making HTTP POST request
response = requests.post(scrapeowl_url, data, headers=headers)

# Print the JSON response from API
print(response.json())

You can view a screenshot of the response below. The scraped data is highlighted for you to see. As you can see, ScrapeOwl extracted our h1 text, Quotes to Scrape.

Autoparsing results

The free autoparsing with ScrapeOwl is pretty good. The fact that it comes at no additional cost is an incredible value. You can view the documentation for it here.
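
If you prefer XPath over CSS selectors, the same elements parameter supports both. Here's a hypothetical variant of the request body from the example above; the "xpath" type value is our assumption, so double check the exact spelling against their docs:

object_of_data = {
    "api_key": "your-super-secret-api-key",
    "url": "https://quotes.toscrape.com",
    "elements": [
        {
            "type": "xpath",  # assumed value; verify against the ScrapeOwl docs
            "selector": "//h1"
        }
    ],
    "json_response": True
}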


Case Study: Using ScrapeOwl on IMDb Top 250 Movies

Now, we're going to perform a case study. In this section, we'll scrape IMDb's Top 250 movies using both ScrapeOwl and the ScrapeOps Proxy Aggregator.

Much of our code for each service will remain the same. The only major difference is our connection to the proxy.

  • With ScrapeOwl, we're making POST requests.
  • With ScrapeOps, we're making GET requests.

ScrapeOwl

The code we use to connect to ScrapeOwl is available for you to review below.

scrape_owl_url = "https://api.scrapeowl.com/v1/scrape"

payload = json.dumps({
    "api_key": API_KEY,
    "url": url,
    "json_response": True
})

headers = {
    "Content-Type": "application/json"
}

success = False
tries = 0

while not success and tries <= retries:
    response = requests.post(
        scrape_owl_url,
        payload,
        headers=headers
    )

Here is our full ScrapeOwl code. The snippet you looked at above is part of our scrape_movies() function.

import os
import requests
from bs4 import BeautifulSoup
import json
import csv
import logging
from urllib.parse import urlencode
import concurrent.futures

## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["scrapeowl_api_key"]


def scrape_movies(url, retries=3):
    scrape_owl_url = "https://api.scrapeowl.com/v1/scrape"

    payload = json.dumps({
        "api_key": API_KEY,
        "url": url,
        "json_response": True
    })

    headers = {
        "Content-Type": "application/json"
    }

    success = False
    tries = 0

    while not success and tries <= retries:
        response = requests.post(
            scrape_owl_url,
            payload,
            headers=headers
        )

        try:
            if response.status_code != 200:
                raise Exception(f"Failed response from server, status code: {response.status_code}")

            soup = BeautifulSoup(response.json()["html"], "html.parser")
            json_tag = soup.select_one("script[type='application/ld+json']")
            json_data = json.loads(json_tag.text)["itemListElement"]

            movie_list_length = 0

            movie_list = []

            for item in json_data:
                movie_list.append(item["item"])
                movie_list_length += len(json_data)

            print(f"Movie list length: {len(json_data)}")
            with open("scrapeowl-top-250.json", "w") as file:
                json.dump(movie_list, file, indent=4)
            success = True
        except Exception as e:
            logger.error(f"Failed to process page: {e}, retries left: {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Failed to scrape page, MAX RETRIES {retries} EXCEEDED!!!")



if __name__ == "__main__":

    MAX_RETRIES = 3

    logger.info("Starting IMDB scrape")

    url = "https://www.imdb.com/chart/top/"

    scrape_movies(url, retries=MAX_RETRIES)

    logger.info("Scrape complete")

With ScrapeOwl, we finished the run in 5.502 seconds. This is pretty decent.

Scrapeowl performance terminal

ScrapeOps Proxy Aggregator

Now, we'll do the same test using the ScrapeOps Proxy Aggregator. Most of the script is the same, the major difference is our connection to the API.

With ScrapeOps, we make a GET request instead of a POST request. We also write a function that wraps all of our parameters and creates a special proxied url.

def get_scrapeops_url(url):
    payload = {
        "api_key": API_KEY,
        "url": url,
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url

You can view our full ScrapeOps code below. Aside from our proxy connection, everything else is almost identical to the ScrapeOwl example.

import os
import requests
from bs4 import BeautifulSoup
import json
import csv
import logging
from urllib.parse import urlencode
import concurrent.futures

## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["scrapeops_api_key"]

def get_scrapeops_url(url):
    payload = {
        "api_key": API_KEY,
        "url": url,
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url


def scrape_movies(url, retries=3):
    success = False
    tries = 0

    while not success and tries <= retries:
        response = requests.get(get_scrapeops_url(url))

        try:
            if response.status_code != 200:
                raise Exception(f"Failed response from server, status code: {response.status_code}")

            soup = BeautifulSoup(response.text, "html.parser")
            json_tag = soup.select_one("script[type='application/ld+json']")
            json_data = json.loads(json_tag.text)["itemListElement"]

            movie_list_length = 0

            movie_list = []

            for item in json_data:
                movie_list.append(item["item"])
                movie_list_length += len(json_data)

            print(f"Movie list length: {len(json_data)}")
            with open("scrapeops-top-250.json", "w") as file:
                json.dump(movie_list, file, indent=4)
            success = True
        except Exception as e:
            logger.error(f"Failed to process page: {e}, retries left: {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Failed to scrape page, MAX RETRIES {retries} EXCEEDED!!!")



if __name__ == "__main__":

    MAX_RETRIES = 3

    logger.info("Starting IMDB scrape")

    url = "https://www.imdb.com/chart/top/"

    scrape_movies(url, retries=MAX_RETRIES)

    logger.info("Scrape complete")

Here are the results from the same scrape, but with the ScrapeOps Proxy Aggregator. This run was completed in 5.823 seconds.

Scrapeops performance terminal

Results

ScrapeOwl and ScrapeOps finished with very similar timing.

With ScrapeOwl, the run took 5.502 seconds and the ScrapeOps Proxy Aggregator took 5.823 seconds. All in all, this is a difference of 0.321 seconds... less than a third of a second!

Your results might vary based on your hardware and internet connection. When it's this close, sometimes ScrapeOwl will be faster and sometimes ScrapeOps will be; there's really no consistent, clear-cut winner (although on our initial test, ScrapeOwl beat us by roughly a third of a second).
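
If you'd like to reproduce the timing on your own machine, the simplest approach is to wrap the scrape in a timer. A minimal sketch (one way to measure, not necessarily how the numbers above were captured), reusing scrape_movies() from either script:

import time

start = time.time()
scrape_movies(url, retries=MAX_RETRIES)
elapsed = time.time() - start

logger.info(f"Scrape complete in {elapsed:.3f} seconds")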


Alternative: ScrapeOps Proxy API Aggregator

The ScrapeOps Proxy Aggregator gives us access to almost everything we get with ScrapeOwl. Aside from the free autoparsing feature, ScrapeOps offers a more cost effective approach every step of the way.

  • They have 3 plans, we have 8.
  • They charge for geotargeting, we don't.
  • They require a premium proxy for geotargeting, we don't.

Here are the plans we offer at ScrapeOps.

Scrapeops Pricing Plans

| API Credits | Monthly Cost | Price Per Basic Request | ScrapeOwl Equivalent |
| --- | --- | --- | --- |
| 25,000 | $9 | $0.00036 | None |
| 50,000 | $15 | $0.0003 | None |
| 100,000 | $19 | $0.00019 | None |
| 250,000 | $29 | $0.000116 | Bootstrap ($0.00016/request) |
| 500,000 | $54 | $0.000108 | None |
| 1,000,000 | $99 | $0.000099 | Startup ($0.000099/request) |
| 2,000,000 | $199 | $0.0000995 | None |
| 3,000,000 | $249 | $0.000083 | Business ($0.000083/request) |

At the higher price points, $99 and $249, ScrapeOwl does remain competitive with ScrapeOps pricing. However, we have a much larger variety of plans at the lower tiers.

Especially if you're just getting started with scraping at the hobbyist level, we've got a lot more to offer you. If you're running an enterprise level operation, you can take your pick between ScrapeOps and ScrapeOwl.

At that point, it's all personal taste!


Troubleshooting

Issue #1: Request Timeouts

A request timeout occurs when a client (such as a web browser or an application) sends a request to a server but does not receive a response within a specified period. This results in the request being aborted, often leading to error messages or failed operations.

When dealing with HTTP, we all run into timeouts. In this guide, we've been using Python Requests. If you run into timeout issues with Python Requests, you can set a default timeout argument. The code below does exactly that.

import requests

# 5 second timeout
response = requests.get("https://httpbin.org/get", timeout=5)

Issue #2: Handling CAPTCHAs

CAPTCHAs can be an absolute nightmare. CAPTCHAs are used to authenticate that the person on the end is human. Both ScrapeOwl and the ScrapeOps Proxy Aggregator are designed to get around CAPTCHAs. If you are getting CAPTCHAs, something is probably wrong. However, in the event that you do receive a CAPTCHA, try the following steps.

  1. Retry your request. It's possible that a fresh IP address can fix it.

  2. With ScrapeOps, you can add the bypass argument to bypass any anti-bots.

  3. Try setting "residential": True (ScrapeOps) or "premium_proxies": True (ScrapeOwl).

You can also use a 3rd party service like 2captcha. We have a great article on bypassing CAPTCHAs here as well.
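
To make steps 2 and 3 concrete, here's a sketch of what those request tweaks might look like. The bypass value is a placeholder and the exact supported levels live in the ScrapeOps docs; API_KEY and url are assumed to come from the earlier scripts:

import json

# ScrapeOps: ask the aggregator to run its anti-bot bypass and use residential IPs
scrapeops_params = {
    "api_key": API_KEY,
    "url": url,
    "bypass": "cloudflare_level_1",  # placeholder value; see the ScrapeOps docs
    "residential": True,
}

# ScrapeOwl: retry the same request through a premium (residential) proxy
scrapeowl_payload = json.dumps({
    "api_key": API_KEY,
    "url": url,
    "premium_proxies": True,
    "json_response": True,
})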

Issue #3: Invalid Response Data

When you receive an invalid response, it can be pretty troubling. It's important to lookup your status code and solve your problem accordingly.

  • If you're getting a 401, double check your API keys.
  • If you're getting a 404, double check your URL.

If you need to view status codes, take a look here.

Most importantly, check your status code and troubleshoot the error accordingly.
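
One low-effort way to make that easier is to map the status codes from the table above to short hints and log them whenever a request fails. A minimal sketch, assuming response and logger from the earlier scripts:

STATUS_HINTS = {
    400: "Bad Request - double check your params.",
    401: "Unauthorized - check your API key.",
    403: "Forbidden - you may be out of API credits.",
    404: "Page Not Found - double check your url.",
    429: "Too Many Requests - slow down and space requests apart.",
    500: "Internal Error - something failed on ScrapeOwl's end, try again.",
}

if response.status_code != 200:
    hint = STATUS_HINTS.get(response.status_code, "Unknown error - check the docs.")
    logger.error(f"Request failed ({response.status_code}): {hint}")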


Legal and Ethical Considerations

Here at ScrapeOps, we follow ethical web scraping practices. When practicing ethical scraping, only scrape public data. Respect any agreements you make with site owners. Most importantly, follow the law.

Here, we only scrape public data. Private data is subject to a whole slew of intellectual property and privacy laws. When you scrape private data, all sorts of legal issues can arise such as:

  • Financial Penalties: When you violate people's privacy or property, they can sue you for damages. When you break a Terms of Service Agreement, you can be sued for breach of contract.

  • Jail/Prison Time: Violating laws mentioned above can lead not only to financial penalties but also hacking charges and privacy violations. These sorts of crimes come with fines and even jail or prison time.

Ethical

When we scrape the web, we also need to follow certain ethical guidelines. Don't disseminate private data and try to respect your target site's robots.txt. Failure to follow a site's policies isn't always illegal, but it can result in other types of damage.

  • Reputational Damage: No company wants to be the next headline for using legal but unethical practices. You don't want your company to be either.

  • User Consent: Just because a social media platform is public, doesn't mean people want their data collected. While it might be legal to stalk people on X(formerly Twitter), that doesn't make it right. Think about how you would feel if somebody collected all of your data.


Conclusion

In conclusion, you now know how to integrate with ScrapeOwl and ScrapeOps effectively. You've got a solid understanding of how both of these products work, and you're well equipped to make an informed decision.

If you want cheap autoparsing, go with ScrapeOwl. If you want more flexibility, go with ScrapeOps.

You have a solid understanding of how to use Python Requests, and you also learned how to scrape real-life data from IMDb! Take this new knowledge, sign up for a free trial, and go build something!


More Web Scraping Guides

We love web scraping. If you're looking to buy services, learn to scrape, or just use a free trial, we have something for you here at ScrapeOps.

Check out our Python Web Scraping Playbook. If you'd like to learn more about scraping, check out the links below!