
ScraperAPI: Python Web Scraping Integration Guide

ScraperAPI is a robust web scraping tool designed to simplify the process of data extraction from websites. It manages proxy rotation, handles CAPTCHAs, and ensures successful requests by providing a reliable infrastructure for web scraping.

This comprehensive guide offers detailed instructions, code examples, and best practices to help you make the most of ScraperAPI's features, from basic setup and integration options through to advanced functionality and troubleshooting.

Need help scraping the web?

Then check out ScrapeOps, the complete toolkit for web scraping.


TLDR: Integrating ScraperAPI Efficiently

Integrating ScraperAPI into your web scraping projects can significantly streamline your data extraction process. Let's take a look at some quick tips and configurations to get you started efficiently:

  • First, sign up on the ScraperAPI website and obtain your unique API key. To integrate ScraperAPI, replace your current request URLs with ScraperAPI’s endpoint and append your API key (a minimal example of this swap follows this list). Take advantage of features like automatic proxy rotation to avoid IP blocking and CAPTCHA handling to bypass any encountered CAPTCHAs seamlessly.

  • For optimizing scraping efficiency, adjust the concurrency settings to balance the number of simultaneous requests and the rate limit of the target website. Use geolocation targeting to scrape region-specific data and set custom headers to mimic genuine browser requests, helping avoid detection and blocking. Implement robust error handling to manage common issues like timeouts and retries effectively.
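
As promised above, here is a minimal sketch of that endpoint swap (the target URL is a placeholder; everything else mirrors the basic request pattern used throughout this guide):

import requests

API_KEY = "YOUR_API_KEY"
target_url = "http://quotes.toscrape.com/"  # the page you would normally request directly

# Before: a direct request that exposes your own IP
# response = requests.get(target_url)

# After: the same request routed through ScraperAPI's endpoint with your API key appended
response = requests.get(
    "http://api.scraperapi.com/",
    params={"api_key": API_KEY, "url": target_url}
)
print(response.status_code)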

When scraping, it's crucial to follow ethical and responsible practices. Always check and respect the website's robots.txt file to understand its scraping policies and avoid overwhelming the target server by adhering to appropriate rate limits. Ensure compliance with data privacy laws and regulations, such as GDPR, when handling scraped data, and use the data for legitimate and ethical purposes.


What Is ScraperAPI?

Web scraping is the automated process of extracting data from websites. This technique is invaluable for gathering large volumes of data quickly and efficiently, enabling tasks such as market research, competitive analysis, and content aggregation. However, web scraping can be complex due to various obstacles like CAPTCHAs, IP blocking, and dynamic content loading.

ScraperAPI simplifies the web scraping process by acting as an intermediary that handles these challenges. It provides a robust infrastructure that manages proxy rotation, bypasses CAPTCHAs, and ensures successful data extraction from target websites. With ScraperAPI, developers can focus on what matters most: extracting and utilizing meaningful data.

ScraperAPI Homepage

Using ScraperAPI, you can scrape a wide range of data types, including:

  • Product information: Prices, descriptions, reviews from e-commerce sites.
  • Content: Articles, blog posts, comments from news sites or forums.
  • Market data: Stock prices, market trends, financial reports.
  • Social media data: User profiles, posts, comments from social networks.

Common challenges in web scraping include several obstacles that can hinder the data extraction process:

  • CAPTCHAs: These are automated tests used by websites to differentiate between human users and bots. CAPTCHAs can block automated bots from accessing content, making it difficult to scrape data. ScraperAPI can handle these CAPTCHAs for you, ensuring uninterrupted data extraction.

  • IP Blocking: Websites often monitor and restrict the number of requests from a single IP address to prevent scraping. This can result in temporary or permanent IP bans. ScraperAPI addresses this by rotating IP addresses, allowing you to make multiple requests without getting blocked.

  • Dynamic Content: Many websites use JavaScript to load content dynamically, which can be challenging to scrape using traditional methods. ScraperAPI offers features to render JavaScript, ensuring that you can access and extract data from dynamically loaded web pages.

ScraperAPI's advanced features help overcome these common challenges, providing a seamless and efficient web scraping experience. In the next section, let's take a look at how ScraperAPI actually works.

How Does ScraperAPI Work?

ScraperAPI functions as a proxy service, managing HTTP requests and responses on behalf of the user. This service allows you to make requests to target websites without dealing with the intricacies of handling IP addresses, CAPTCHAs, and other scraping hurdles.

Key components of ScraperAPI include:

Component  | Description
API Key    | Unique key for accessing ScraperAPI services.
Endpoints  | Specific URLs for different functionalities (e.g., rendering JavaScript).
Parameters | Customizable options for requests (e.g., headers, cookies).

Here is an example of a basic ScraperAPI request:

import requests

api_key = "YOUR_API_KEY"
url = "http://api.scraperapi.com"
params = {
    'api_key': api_key,
    'url': 'http://example.com',
    'render': 'true'
}

response = requests.get(url, params=params)
print(response.text)

  • This example demonstrates how to make a simple request to ScraperAPI, specifying the target URL and enabling JavaScript rendering. Make sure to acquire your API key from your ScraperAPI dashboard.
  • Building on this simple example and leveraging ScraperAPI, you can efficiently and effectively scrape data from websites, overcoming common challenges and optimizing your data extraction process.

Response Format

ScraperAPI can return responses in either JSON or HTML format, depending on the parameters specified in your request.

  • JSON Response: By default, ScraperAPI returns a JSON response that includes metadata such as the status code, headers, and the original HTML content.

  • HTML Response: You can request the raw HTML content by setting the render parameter to true and specifying the response format as html.

Example Code for JSON Response:

import requests

api_key = "YOUR_API_KEY"
url = "http://api.scraperapi.com"
params = {
    'api_key': api_key,
    'url': 'http://example.com',
    'render': 'false',
    'format': 'json'
}

response = requests.get(url, params=params)
print(response.json())

Example Code for HTML Response:

import requests

api_key = "YOUR_API_KEY"
url = "http://api.scraperapi.com"
params = {
    'api_key': api_key,
    'url': 'http://example.com',
    'render': 'true',
    'format': 'html'
}

response = requests.get(url, params=params)
print(response.text)

ScraperAPI Pricing

ScraperAPI offers several pricing plans based on the number of API credits you need. The service uses a monthly subscription model, and there is no pay-as-you-go option available. You are only charged for successful requests, which are defined as requests that receive a 2xx HTTP status code.

Plan         | Price per Month | API Credits          | Concurrent Threads | Geotargeting | JS Rendering | Other Features
Hobby        | $49             | 100,000              | 20                 | US & EU      | Yes          | Premium Proxies, JSON Auto Parsing, Smart Proxy Rotation, Custom Header Support, Unlimited Bandwidth, Automatic Retries, Desktop & Mobile User Agents, 99.9% Uptime Guarantee, Custom Session Support, CAPTCHA & Anti-Bot Detection, Professional Support
Startup      | $149            | 1,000,000            | 50                 | US & EU      | Yes          | All features of Hobby plan
Business     | $299            | 3,000,000            | 100                | All          | Yes          | All features of Startup plan
Professional | $999            | 14,000,000           | 400                | All          | Yes          | All features of Business plan
Enterprise   | Custom          | More than 14,000,000 | Custom             | All          | Yes          | All premium features, premium support, and an account manager
  • All plans are billed monthly and reset each month. There is no pay-as-you-go option.
  • You are only charged for successful requests, which are defined as requests that return a 2xx HTTP status code.

Response Status Codes

ScraperAPI returns various status codes to indicate the outcome of your requests. Here is a table of possible status codes and their meanings:

Status Code | Description
200         | OK - The request was successful, and the response contains the requested data.
400         | Bad Request - The request was invalid or cannot be processed by the server.
401         | Unauthorized - The API key is missing or invalid.
403         | Forbidden - The request is understood, but it has been refused or access is not allowed.
404         | Not Found - The requested resource could not be found.
429         | Too Many Requests - You have exceeded your API request limit.
500         | Internal Server Error - An error occurred on the server side.
503         | Service Unavailable - The server is currently unavailable (e.g., due to maintenance or overload).
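
In your own code you will typically branch on these codes before parsing the response. The wrapper below is a minimal sketch; how you react to each code (raise, retry, back off) is an assumption you should adapt to your project:

import requests

API_KEY = 'YOUR_API_KEY'

def fetch_through_scraperapi(url):
    params = {'api_key': API_KEY, 'url': url}
    response = requests.get('http://api.scraperapi.com/', params=params)

    if response.status_code == 200:
        return response.text  # successful request, credits are consumed
    elif response.status_code in (401, 403):
        raise RuntimeError('Check that your API key is valid and your plan allows this request')
    elif response.status_code == 429:
        raise RuntimeError('Request limit exceeded - reduce concurrency or wait before retrying')
    else:
        # 5xx responses are typically transient and generally safe to retry
        response.raise_for_status()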

For more detailed information on ScraperAPI's plans and features, you can visit their pricing page and their documentation.


Setting Up ScraperAPI

Setting up ScraperAPI is straightforward and involves a few key steps. Here, we will walk through creating an account, obtaining an API key, configuring basic settings, understanding request limits, and exploring integration options. Additionally, example code snippets will demonstrate the initial setup in Python.

  1. Sign Up for ScraperAPI: Visit the ScraperAPI website and sign up for an account.
  2. Navigate to the API Key Section: After logging in, go to the API Key section in your dashboard.
  3. Copy Your Unique API Key: This key is essential for authenticating your requests to ScraperAPI.

In the ScraperAPI dashboard, you can configure various settings to optimize your scraping tasks. These settings include:

  • Geolocation: Choose the desired geographic location for your IP address.
  • Rendering: Enable JavaScript rendering if you need to scrape dynamically loaded content.
  • Retries: Set the number of retries for failed requests.

Moreover, ScraperAPI offers different plans with varying request limits. Ensure you choose a plan that meets your needs:

  • Free Plan: Limited number of requests per month, suitable for testing and small-scale scraping.
  • Paid Plans: Higher request limits, priority support, and additional features such as concurrent requests and premium proxies.

Make sure to keep track of your request usage in the ScraperAPI dashboard to avoid exceeding your plan's limits.
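
If you prefer to monitor usage programmatically rather than in the dashboard, ScraperAPI also exposes an account status endpoint that returns usage information as JSON. The snippet below is a minimal sketch; treat the exact fields of the response as plan-dependent.

import requests

API_KEY = 'YOUR_API_KEY'

# Query the account endpoint to see how many requests/credits you have used so far.
response = requests.get('http://api.scraperapi.com/account', params={'api_key': API_KEY})
response.raise_for_status()
print(response.json())  # e.g. request counts and concurrency limits for your plan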

Integration Options

ScraperAPI provides multiple integration options to fit different use cases:

  • API Endpoint: You can send HTTP requests directly to the ScraperAPI endpoint.
  • Proxy Port: Configure your scraping tool or browser to use ScraperAPI as a proxy.
  • SDK: ScraperAPI offers SDKs in various programming languages to simplify integration.

Types of Requests Supported by the Proxy

ScraperAPI supports a wide range of HTTP request methods, including:

  • GET: Retrieve data from a specified resource.
  • POST: Submit data to be processed to a specified resource.
  • PUT: Update a specified resource with new data.
  • HEAD: Retrieve metadata from a specified resource without the response body.
  • DELETE: Remove a specified resource.

Here is an example of a basic ScraperAPI request in Python:

import requests

api_key = "YOUR_API_KEY"
url = "http://api.scraperapi.com"
params = {
    'api_key': api_key,
    'url': 'http://example.com',
    'render': 'true'
}

response = requests.get(url, params=params)
print(response.text)

This example demonstrates how to make a simple request to ScraperAPI, specifying the target URL and enabling JavaScript rendering. Make sure to acquire your API key by visiting ScraperAPI. By following these steps, you can efficiently set up ScraperAPI and start scraping data from websites while managing your request limits and optimizing your scraping tasks.

API Endpoint Integration

API endpoint integration with ScraperAPI allows you to send HTTP requests directly to ScraperAPI's servers, which then handle the complexities of web scraping for you. This integration method is ideal for scenarios where you need to scrape web data programmatically and want to bypass obstacles like CAPTCHAs, IP blocking, and geographic restrictions.

Why/When to Use API Endpoint Integration?

  • Simplicity: Integrating via API endpoints is straightforward and does not require complex setup or configuration.
  • Scalability: Ideal for large-scale scraping projects as it supports concurrent requests and automatic retries.
  • Flexibility: Supports various HTTP methods (GET, POST, PUT, DELETE, etc.), allowing you to perform different types of web requests.
  • Customization: Offers options to customize requests with headers, cookies, and other parameters to mimic different browsers or user agents.
  • Reliability: Ensures high success rates and handles IP rotation, geotargeting, and CAPTCHA solving automatically.

Let's take a look at an example that shows how to integrate ScraperAPI using an API endpoint with Python to perform a DELETE request:

import requests

api_key = "YOUR_API_KEY"
target_url = "http://example.com/resource/1"
api_url = "http://api.scraperapi.com"

params = {
    'api_key': api_key,
    'url': target_url
}

response = requests.delete(api_url, params=params)
print(response.status_code)
print(response.text)

In this example:

  • Replace YOUR_API_KEY with your actual ScraperAPI key.
  • Replace http://example.com/resource/1 with the URL of the resource you want to delete.

This example demonstrates how to use the DELETE method to remove a resource from the target URL. For more information, you can visit the ScraperAPI documentation.
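
The same endpoint pattern applies to requests that carry a body. The sketch below is a hedged example of forwarding a POST with form data; http://httpbin.org/post is just a stand-in target that echoes what it receives.

import requests

api_key = "YOUR_API_KEY"
api_url = "http://api.scraperapi.com"

params = {
    'api_key': api_key,
    'url': 'http://httpbin.org/post'  # stand-in target that accepts POST requests
}

# The form body is passed to requests.post and forwarded on to the target URL
payload = {'query': 'web scraping'}

response = requests.post(api_url, params=params, data=payload)
print(response.status_code)
print(response.text)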

Proxy Port Integration

Proxy port integration with ScraperAPI allows you to route your web scraping requests through ScraperAPI's proxy servers using a proxy configuration. This method is particularly useful for scenarios where you are transitioning from a traditional proxy solution or when using headless browsers that support proxy configurations.

Why/When to Use It?

  • Seamless Transition: Ideal for those transitioning from traditional proxy solutions, as it allows for a more straightforward integration without changing much of your existing code.
  • Headless Browser Integration: Better suited for headless browsers (like Puppeteer or Selenium) which can be configured to use proxies, making it easier to handle dynamic content and bypass scraping obstacles.

Here is an example of integrating ScraperAPI using a proxy port configuration in Python with the requests library:

import requests

api_key = "YOUR_API_KEY"
proxy = {
    "http": f"http://scraperapi:{api_key}@proxy-server.scraperapi.com:8001",
    "https": f"http://scraperapi:{api_key}@proxy-server.scraperapi.com:8001"
}

response = requests.get("http://example.com", proxies=proxy)
print(response.text)

The proxy configuration routes all HTTP and HTTPS requests through ScraperAPI's proxy server. For more information, you can visit the ScraperAPI documentation on Proxy Integration.

SDK Integration

SDK integration with ScraperAPI provides libraries in various programming languages that simplify the process of making API calls. These SDKs abstract the complexities of HTTP requests, making it easier for beginners to integrate ScraperAPI into their projects.

Why/When to Use It?

  • Ease of Integration: SDKs are designed to be beginner-friendly, reducing the amount of boilerplate code needed to make requests.
  • Language-Specific Support: Each SDK is tailored to a specific programming language, ensuring that it adheres to the best practices and conventions of that language.

Here is an example using the ScraperAPI SDK in Python:

from scraperapi import ScraperAPIClient

client = ScraperAPIClient("YOUR_API_KEY")
result = client.get("http://example.com", render=True)

print(result.text)

  • The render=True parameter enables JavaScript rendering.

The available SDKs are listed below:

  1. Python: scraperapi-python
  2. Node.js: scraperapi-sdk-nodejs
  3. PHP: scraperapi-php
  4. Ruby: scraperapi-ruby

By using these integration methods, you can choose the best approach for your scraping needs, whether it's through direct API endpoints, proxy configurations, or SDKs tailored to your preferred programming language.

Async Response Integration

Async Response Integration with ScraperAPI allows you to send asynchronous requests, meaning you don't have to wait for the response immediately. This method is particularly useful for handling large volumes of requests without overloading your server, as ScraperAPI manages concurrency, retries, and error handling on its end.

Why/When to Use It?

  • Concurrency Management: Asynchronous requests handle multiple requests simultaneously without blocking your application.
  • Reduce Server Workload: Offloading request handling to ScraperAPI reduces the processing burden on your server, leading to lower server costs.
  • Automatic Retries: ScraperAPI manages retries for failed requests, ensuring higher success rates without additional coding on your part.

Here is an example demonstrating how to use ScraperAPI's async response integration with Python's asyncio and aiohttp libraries:

import aiohttp
import asyncio

async def fetch(session, url):
    api_key = "YOUR_API_KEY"
    async with session.get(f"http://api.scraperapi.com?api_key={api_key}&url={url}") as response:
        return await response.text()

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = []
        urls = ["http://example.com/page1", "http://example.com/page2", "http://example.com/page3"]
        for url in urls:
            tasks.append(fetch(session, url))

        results = await asyncio.gather(*tasks)
        for result in results:
            print(result)

asyncio.run(main())

  • The script fetches multiple URLs concurrently using asynchronous requests. For more information, you can visit the ScraperAPI documentation on Async Integration.

By utilizing async response integration, you can efficiently manage large-scale scraping tasks with better performance and reliability.

Managing Concurrency

ScraperAPI can be integrated with popular scraping libraries like BeautifulSoup and Scrapy to automate and optimize web scraping tasks. Both libraries have unique features that make them suitable for different scraping scenarios.

  • BeautifulSoup is a Python library for parsing HTML and XML documents. It provides Pythonic idioms for iterating, searching, and modifying the parse tree. This library is particularly useful for smaller scraping tasks where the HTML structure is not overly complex. It is known for its simplicity and ease of use, allowing developers to quickly extract data from web pages.

  • Scrapy is a powerful and fast open-source web crawling framework written in Python. Scrapy is suitable for complex and large-scale scraping projects requiring high performance and scalability. It is designed for large-scale web scraping with features like the following (a minimal spider sketch appears after this list):

    • Built-in support for handling requests and responses: Scrapy automatically manages request retries, redirects, and handles cookies and sessions.
    • Selectors: Scrapy uses XPath and CSS selectors to extract data from web pages.
    • Item Pipelines: Scrapy provides mechanisms to clean, validate, and store scraped data.
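
To connect these Scrapy features with ScraperAPI, the usual approach is to wrap each target URL with the API endpoint before yielding the request. The spider below is a minimal, hedged sketch: the spider name, concurrency settings, and quotes.toscrape.com URLs are illustrative assumptions rather than ScraperAPI requirements.

import scrapy
from urllib.parse import urlencode

API_KEY = 'YOUR_API_KEY'

def scraperapi_url(url):
    # Route the target URL through the ScraperAPI endpoint
    return 'http://api.scraperapi.com/?' + urlencode({'api_key': API_KEY, 'url': url})

class QuotesSpider(scrapy.Spider):
    name = 'quotes_scraperapi'
    # Keep concurrency within your plan's concurrent thread limit
    custom_settings = {'CONCURRENT_REQUESTS': 5, 'RETRY_TIMES': 3}

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(scraperapi_url(url), callback=self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'quote': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }

You could run a file containing this spider with scrapy runspider and an output flag such as -o quotes.json to store the items.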

Here’s a detailed example of integrating ScraperAPI with BeautifulSoup to scrape data from Quotes to Scrape:

import requests
from bs4 import BeautifulSoup
import concurrent.futures
from urllib.parse import urlencode

API_KEY = 'YOUR_API_KEY'
NUM_RETRIES = 3
NUM_THREADS = 5

list_of_urls = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
]

scraped_quotes = []

def scrape_url(url):
    params = {'api_key': API_KEY, 'url': url}

    for _ in range(NUM_RETRIES):
        try:
            response = requests.get('http://api.scraperapi.com/', params=urlencode(params))
            if response.status_code in [200, 404]:
                break
        except requests.exceptions.ConnectionError:
            response = None

    if response and response.status_code == 200:
        soup = BeautifulSoup(response.text, "html.parser")
        quotes_sections = soup.find_all('div', class_="quote")

        for quote_block in quotes_sections:
            quote = quote_block.find('span', class_='text').text
            author = quote_block.find('small', class_='author').text

            scraped_quotes.append({
                'quote': quote,
                'author': author
            })

with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    executor.map(scrape_url, list_of_urls)

print(scraped_quotes)

  • scrape_url takes a URL as input, sends a request to ScraperAPI, and retries up to NUM_RETRIES times if necessary.
  • Concurrent threads allow the script to handle multiple scraping tasks simultaneously, significantly speeding up the data extraction process. Instead of waiting for one URL to be scraped before starting the next, multiple URLs are processed in parallel.
  • The concurrent.futures.ThreadPoolExecutor creates a pool of worker threads, specified by NUM_THREADS. Each thread executes the scrape_url function on different URLs from the list_of_urls.
  • By using concurrent threads, the script can maximize the usage of system resources and reduce the total time required for scraping multiple pages. This is particularly useful when dealing with a large number of URLs or when the target websites have slower response times.

This script demonstrates how to integrate ScraperAPI with BeautifulSoup for efficient web scraping with concurrent requests. Adjust the list of URLs, parsing logic, and other settings as needed for your specific use case.


Advanced Functionality

ScraperAPI offers advanced functionality that allows you to fine-tune your scraping tasks to overcome specific challenges and enhance your data extraction process. These features include custom headers, cookies, IP geolocation, CAPTCHA solving, and more.

To enable these advanced features, you can add specific query parameters to your ScraperAPI requests. Each parameter may consume additional API credits depending on the complexity of the task.

Here is an example of how to use some advanced functionalities in a Python request:

import requests

api_key = "YOUR_API_KEY"
target_url = "http://example.com"
api_url = "http://api.scraperapi.com"

params = {
    'api_key': api_key,
    'url': target_url,
    'country_code': 'us',  # Geolocation
    'render': 'true',  # JavaScript rendering
    'premium': 'true',  # Use premium proxies
    'custom_headers': '{"User-Agent": "Mozilla/5.0"}',  # Custom headers
}

response = requests.get(api_url, params=params)
print(response.text)

Here is a table that shows all the advanced functionality:

Parameter      | API Credits    | Description
render         | 5 per request  | Enables JavaScript rendering for dynamic content.
country_code   | 1 per request  | Specifies the IP geolocation for the request.
premium        | 1 per request  | Uses premium proxies for better success rates.
session_number | 1 per request  | Maintains a session for the request to handle cookies.
custom_headers | 1 per request  | Allows setting custom HTTP headers for the request.
captcha        | 10 per request | Enables CAPTCHA solving for the request.
residential    | 2 per request  | Uses residential IPs to mimic real users.

For more detailed information on each feature, you can visit the ScraperAPI functionality list.


JavaScript Rendering

JavaScript rendering allows ScraperAPI to fully render a webpage, executing JavaScript to load dynamic content. This is essential for scraping websites that heavily rely on client-side scripts to display data.

Why Use It?

  • Access Dynamic Content: Scrape content loaded dynamically via JavaScript, which a basic HTML request would miss.
  • Better Data Accuracy: Ensures capturing all the content on the page, including elements that load after the initial request.
  • Avoiding Scraping Detection: Mimics real user behavior, which helps bypass anti-scraping measures.

In terms of cost, enabling JavaScript rendering consumes 5 API credits per request.

import requests

api_key = "YOUR_API_KEY"
target_url = "http://example.com"
api_url = "http://api.scraperapi.com"

params = {
    'api_key': api_key,
    'url': target_url,
    'render': 'true'
}

response = requests.get(api_url, params=params)
print(response.text)

For more information on JavaScript rendering, you can visit the ScraperAPI documentation on JavaScript rendering.

Controlling The Browser

If the proxy provider supports the ability to insert JavaScript commands into the browser, such as scrolling, clicking, etc., this functionality can significantly enhance your web scraping tasks by allowing interaction with web elements. ScraperAPI supports such advanced features through specific parameters that enable these actions.

import requests

api_key = "YOUR_API_KEY"
target_url = "http://example.com"
api_url = "http://api.scraperapi.com"

params = {
    'api_key': api_key,
    'url': target_url,
    'render': 'true',
    'js_snippet': 'document.querySelector("button").click();'  # Click a button
}

response = requests.get(api_url, params=params)
print(response.text)

Here is the table that shows the functionality:

Parameter  | Description
js_snippet | Executes JavaScript code on the page.
scroll     | Scrolls the page to load dynamic content.
click      | Simulates a click on a specified element.

For more detailed information on how to control the browser using ScraperAPI, you can visit their documentation on customizing requests.


Country Geotargeting

Country geotargeting allows you to route your web scraping requests through IP addresses located in specific countries. This ensures that the request appears to come from the desired geographic location, which can be crucial for accessing region-specific content or avoiding geo-restrictions.

Why Use It?

  • Access Regional Content: Some websites display different content based on the visitor's location. Geotargeting allows you to scrape this localized content.
  • Bypass Geo-Restrictions: Access websites or services that are restricted to certain geographic regions.
  • Improve Anonymity: Mimic a real user's behavior from a specific country, reducing the likelihood of being blocked.

Using country geotargeting typically consumes additional API credits per request. For example, enabling geotargeting might cost 1 extra API credit per request, but this can vary by provider.

Let's take a look at a simple example that shows how you can use country geotargeting with ScraperAPI in Python:

import requests

API_KEY = 'YOUR_API_KEY'
target_url = 'http://example.com'
api_url = 'http://api.scraperapi.com'

params = {
    'api_key': API_KEY,
    'url': target_url,
    'country_code': 'us'  # Specify the country code for geotargeting
}

response = requests.get(api_url, params=params)
if response.status_code == 200:
    print(response.text)
else:
    print(f'Error: {response.status_code}')

For more details, you can visit the ScraperAPI documentation on Country Geotargeting.

Commonly used country codes include:

Country Code | Country
us           | United States
ca           | Canada
gb           | United Kingdom
fr           | France
de           | Germany
jp           | Japan
au           | Australia
in           | India
br           | Brazil
mx           | Mexico

Residential Proxies

Residential proxies, also known as premium proxies, route your requests through IP addresses assigned to real residential devices by ISPs. These proxies mimic genuine user traffic, making them highly effective for avoiding detection and blocking during web scraping. There are several reasons why you should use them:

  • Avoid Detection: Residential proxies are less likely to be identified and blocked compared to data center IPs.
  • Access Restricted Content: Bypass geo-restrictions and access region-specific content.
  • Higher Success Rates: Increase the success rates of scraping tasks by simulating real user behavior.

Usually, using residential proxies consumes 2 API credits per request.

Here is how you can use residential proxies with ScraperAPI in Python:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urlencode

API_KEY = 'INSERT_API_KEY_HERE'
NUM_RETRIES = 5

def scrape_with_scraperapi(url):
    params = {
        'api_key': API_KEY,
        'url': url,
        'residential': 'true'  # Enable Residential Proxies
    }

    for _ in range(NUM_RETRIES):
        try:
            response = requests.get('http://api.scraperapi.com/', params=urlencode(params))
            if response.status_code in [200, 404]:
                break
        except requests.exceptions.ConnectionError:
            response = None

    if response and response.status_code == 200:
        return response.text
    return None

# Example usage
url = 'http://quotes.toscrape.com/page/1/'
html_content = scrape_with_scraperapi(url)

if html_content:
    soup = BeautifulSoup(html_content, "html.parser")
    quotes_sections = soup.find_all('div', class_="quote")

    for quote_block in quotes_sections:
        quote = quote_block.find('span', class_='text').text
        author = quote_block.find('small', class_='author').text
        print(f'Quote: {quote}\nAuthor: {author}\n')
else:
    print('Failed to retrieve content')

For more details, visit the ScraperAPI documentation on Residential Proxies.


Custom Headers

Custom headers allow you to specify your own HTTP headers for requests. While ScraperAPI optimizes headers by default to achieve the best performance, it also lets you send your own headers if needed for specific tasks.

Why use it?

  • Specific Data Needs: Certain websites require specific headers to serve the desired data.
  • POST Requests: Often need custom headers to properly format the data being sent.
  • Bypass Anti-Bot Systems: Custom headers can help bypass some anti-bot measures.

Word of Caution

  • Performance Impact: Incorrectly used custom headers can reduce proxy performance and reveal automated requests.
  • Header Generation: For large-scale scraping, continuously generate clean headers to avoid getting blocked.
  • Use Judiciously: Only use custom headers when absolutely necessary.

Here is how you can use custom headers with ScraperAPI in Python:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urlencode

API_KEY = 'INSERT_API_KEY_HERE'
NUM_RETRIES = 5

# Create a session object
session = requests.Session()

session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
    'Referer': 'http://example.com'
})

# Function to scrape a URL with session management
def scrape_url_with_session(url):
    params = {'api_key': API_KEY, 'url': url}
    for _ in range(NUM_RETRIES):
        try:
            response = session.get('http://api.scraperapi.com/', params=urlencode(params))
            if response.status_code in [200, 404]:
                break
        except requests.exceptions.ConnectionError:
            response = None

    if response and response.status_code == 200:
        soup = BeautifulSoup(response.text, "html.parser")
        quotes_sections = soup.find_all('div', class_="quote")
        for quote_block in quotes_sections:
            quote = quote_block.find('span', class_='text').text
            author = quote_block.find('small', class_='author').text
            scraped_quotes.append({
                'quote': quote,
                'author': author
            })

# Example list of URLs to scrape
list_of_urls = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
]

scraped_quotes = []

# Scrape each URL in the list
for url in list_of_urls:
    scrape_url_with_session(url)

print(scraped_quotes)

  • The requests.Session() object is created to persist cookies and headers across requests.
  • The session.headers.update() method is used to set custom headers that mimic browser behavior.
  • scrape_url_with_session uses the session object to make requests, ensuring cookies and headers are maintained. Retries the request up to NUM_RETRIES times in case of connection errors.

For more details, visit the ScraperAPI documentation on Custom Headers.


Static Proxies

Static proxies, sometimes referred to as sticky sessions, allow you to use the same IP address for multiple requests over a period of time. This is useful for maintaining sessions or when accessing sites that monitor IP changes to detect and block scrapers.

Why Use It?

  • Maintain Sessions: Useful for maintaining logged-in sessions or stateful interactions.
  • Consistency: Reduces the likelihood of being blocked by frequently changing IPs.
  • Tracking: Helps in scenarios where you need to appear as the same user over multiple requests.

Here is how you can use static proxies with ScraperAPI in Python:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urlencode

API_KEY = 'INSERT_API_KEY_HERE'
SESSION_ID = 'your_session_id'
NUM_RETRIES = 5

def scrape_with_static_proxy(url):
    params = {
        'api_key': API_KEY,
        'url': url,
        'session_number': SESSION_ID
    }

    for _ in range(NUM_RETRIES):
        try:
            response = requests.get('http://api.scraperapi.com/', params=urlencode(params))
            if response.status_code in [200, 404]:
                break
        except requests.exceptions.ConnectionError:
            response = None

    if response and response.status_code == 200:
        return response.text
    return None

# Example usage
url = 'http://quotes.toscrape.com/page/1/'
html_content = scrape_with_static_proxy(url)

if html_content:
    soup = BeautifulSoup(html_content, "html.parser")
    quotes_sections = soup.find_all('div', class_="quote")

    for quote_block in quotes_sections:
        quote = quote_block.find('span', class_='text').text
        author = quote_block.find('small', class_='author').text
        print(f'Quote: {quote}\nAuthor: {author}\n')
else:
    print('Failed to retrieve content')


Screenshot Functionality

Screenshot functionality allows you to capture and save an image of a webpage as it is rendered in a browser. This can be useful for visual verification, archiving content, or debugging purposes.

Why Use It?

  • Visual Verification: Ensure the scraped content is as expected.
  • Archiving: Save snapshots of webpages for later reference.
  • Debugging: Helps in identifying issues with rendering or data extraction.

While ScraperAPI does not support this feature, here is an example using ScrapingBee, which does offer screenshot functionality:

from scrapingbee import ScrapingBeeClient

client = ScrapingBeeClient(api_key='YOUR-API-KEY')
response = client.get(
    'https://scrapingbee.com/blog',  # Example URL
    params={
        'screenshot': 'true',  # Capture screenshot
        'screenshot_full_page': 'true'  # Capture full page screenshot
    }
)
if response.ok:
    with open("./screenshot.png", "wb") as f:
        f.write(response.content)
else:
    print(response.content)



Auto Parsing

Auto parsing, sometimes called auto extract, refers to the automatic extraction and structuring of data from web pages. This functionality simplifies the process of turning unstructured web data into structured formats like JSON.

Why Use It?

  • Efficiency: Automates the parsing process, saving time and effort.
  • Structured Data: Directly retrieves structured data, which is easier to analyze and use.
  • Simplicity: Reduces the need for complex scraping logic and data processing.

Unfortunately, ScraperAPI does not support auto parsing functionality. For this feature, you can use other providers like Diffbot or ParseHub.
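
Because ScraperAPI returns raw HTML, a common workaround is to do the structuring yourself. The sketch below pairs ScraperAPI with BeautifulSoup to turn quotes.toscrape.com pages into a structured list; the target site and field names are illustrative assumptions, not an auto-parsing API.

import requests
from bs4 import BeautifulSoup

API_KEY = 'YOUR_API_KEY'
target_url = 'http://quotes.toscrape.com/page/1/'

response = requests.get('http://api.scraperapi.com/', params={'api_key': API_KEY, 'url': target_url})

# Turn the unstructured HTML into a structured list of dictionaries manually
soup = BeautifulSoup(response.text, 'html.parser')
structured = [
    {
        'quote': block.find('span', class_='text').text,
        'author': block.find('small', class_='author').text,
    }
    for block in soup.find_all('div', class_='quote')
]
print(structured)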


Case Study: Using ScraperAPI on IMDb Top 250 Movies

IMDb contains tons of data on movies, TV shows, and even video games. Not only is there a lot of data, but it's also extremely varied. For example, you can explore movie descriptions, cast, ratings, trivia, related movies, awards, and more. In addition to that, you’ll find user-generated data, such as reviews.

IMDB Reviews

In this case study, we will demonstrate how to scrape the IMDb Top 250 Movies chart using ScraperAPI. The steps include initializing ScraperAPI, sending a GET request, extracting data, and storing the extracted data in a JSON file.

First, you need to initialize ScraperAPI with your API key. This key will authenticate your requests and enable you to use ScraperAPI’s features. Then make sure to install the required packages: pip install requests beautifulsoup4

After receiving the HTML response, we will use BeautifulSoup to parse the HTML and extract relevant data, including movie titles, rankings, release years, and ratings. Finally, we will save the extracted data in a JSON file for further analysis or use.

Here’s the complete code example for scraping the IMDb Top 250 Movies chart:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urlencode
import json

SCRAPER_API_KEY = 'INSERT_YOUR_SCRAPER_API_KEY_HERE'
IMDB_URL = 'https://www.imdb.com/chart/top/'

def get_scraperapi_url(url):
    payload = {'api_key': SCRAPER_API_KEY, 'url': url}
    return 'http://api.scraperapi.com/?' + urlencode(payload)

def fetch_html(url):
    proxy_url = get_scraperapi_url(url)
    response = requests.get(proxy_url)
    if response.status_code == 200:
        return response.text
    response.raise_for_status()  # Raise an HTTPError for bad responses

def parse_imdb_top_250(html_content):
    soup = BeautifulSoup(html_content, "html.parser")
    movies = []
    for item in soup.select('li.ipc-metadata-list-summary-item'):
        title_tag = item.select_one('h3.ipc-title__text')
        year_tag = item.select_one('.sc-b189961a-8.kLaxqf.cli-title-metadata-item')
        rating_tag = item.select_one('.ipc-rating-star--rating')
        if title_tag and year_tag and rating_tag:
            movies.append({
                'title': title_tag.text.strip(),
                'year': year_tag.text.strip(),
                'rating': rating_tag.text.strip()
            })
    return movies

def save_to_json(data, filename):
    with open(filename, 'w') as f:
        json.dump(data, f, indent=4)

def main():
    try:
        html_content = fetch_html(IMDB_URL)
        movies_data = parse_imdb_top_250(html_content)
        save_to_json(movies_data, 'imdb_top_250_movies.json')
        print("Data successfully scraped and saved to imdb_top_250_movies.json")
    except requests.exceptions.RequestException as e:
        print(f"Failed to retrieve content: {e}")

if __name__ == "__main__":
    main()

Scraper Results

  • The get_scraperapi_url function constructs the ScraperAPI URL from the target URL and API key. The fetch_html function sends a GET request to the IMDb page through ScraperAPI.
  • Movie details such as titles, release years, and ratings are extracted by selecting specific HTML elements.
  • The extracted data is stored in a list of dictionaries, where each dictionary represents a movie.

By following these steps, you can efficiently scrape and store data from the IMDb Top 250 Movies chart using ScraperAPI.

While improving your scraping application, make sure to adhere to ethical guidelines to ensure responsible and respectful use of web resources:

  • Ensure that the scraping activities are performed at a reasonable rate to avoid placing excessive load on IMDb’s servers. This can be managed by implementing delays between requests and using ScraperAPI’s rate-limiting features (a minimal sketch of such a delay follows this list).

  • The extracted data should be used for legitimate and ethical purposes, such as research, education, or personal projects. Avoid using the data in ways that could harm the website or its users, such as for spamming or unauthorized redistribution.

  • Additionally, IMDb and many other modern websites use JavaScript to load content dynamically. This makes it difficult to scrape content that is not present in the initial HTML response.

  • ScraperAPI’s JavaScript rendering feature allows you to handle dynamic content by rendering the JavaScript on the page before the content is returned. By enabling the render parameter in your ScraperAPI requests, you can ensure that the full content, including dynamically loaded elements, is available for scraping.

  • Another common challenge is that websites often monitor and block IP addresses that make numerous requests in a short period, which can interrupt the scraping process and potentially blacklist your IP. ScraperAPI’s IP rotation feature addresses this issue by automatically rotating IP addresses with each request. This helps to distribute the load across multiple IPs, reducing the likelihood of any single IP being detected and blocked.
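
Here is a minimal sketch of the delay-based rate limiting mentioned in the first point; the two-second pause and the URL list are arbitrary assumptions that you should tune to the target site:

import time
import requests

API_KEY = 'YOUR_API_KEY'
urls = [
    'https://www.imdb.com/chart/top/',
    # ...any other pages you plan to fetch
]

for url in urls:
    params = {'api_key': API_KEY, 'url': url}
    response = requests.get('http://api.scraperapi.com/', params=params)
    print(url, response.status_code)
    time.sleep(2)  # pause between requests to avoid overloading the target server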

By incorporating these ethical considerations and leveraging ScraperAPI’s advanced features, you can effectively overcome common web scraping challenges. This approach ensures that your scraping activities are both efficient and respectful of the target website’s resources, enabling you to extract valuable data while maintaining good web scraping practices.


Alternative: ScrapeOps Proxy API Aggregator

ScrapeOps Proxy API Aggregator offers a unified solution for accessing multiple proxy providers through a single API, providing flexibility, reliability, and cost-effectiveness for your web scraping needs.

Why Use ScrapeOps Proxy API Aggregator?

  • Compare Pricing: ScrapeOps generally offers cheaper rates compared to individual proxy providers, making it a cost-effective solution for large-scale scraping projects.
  • More Flexible Plans: ScrapeOps provides a variety of plans, including smaller and more flexible options that can fit different needs and budgets.
  • More Reliable: With access to multiple proxy providers from a single proxy port, ScrapeOps ensures higher reliability and uptime, reducing the risk of IP bans and other disruptions.

Here is a simple example of how to use the ScrapeOps Proxy API Aggregator with Python Requests:

import requests
import json

API_KEY = 'YOUR_SCRAPEOPS_API_KEY'
target_url = 'http://quotes.toscrape.com/page/1/'
api_url = 'https://proxy.scrapeops.io/v1/'

headers = {
    'Accept': 'application/json',
    'Content-Type': 'application/json',
    'API-KEY': API_KEY
}

payload = {
    'url': target_url,
    'render': 'true',  # Enable JavaScript rendering if needed
    'headers': {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
}

response = requests.post(api_url, headers=headers, data=json.dumps(payload))

if response.status_code == 200:
    print(response.json())  # Assuming your response is in JSON format
else:
    print(f'Error: {response.status_code}')

  • In this example, the render parameter enables JavaScript rendering, and custom headers are included to mimic a real browser request.

Take advantage of the free trial offered by ScrapeOps, which includes 500MB of free bandwidth. This allows you to test the service and see the benefits for yourself before committing to a paid plan.

For more information and to start your free trial, visit the ScrapeOps Proxy API Aggregator.

Documentation: For detailed instructions and additional features, refer to the ScrapeOps Proxy API Aggregator Quickstart Guide.


Troubleshooting

Web scraping can sometimes encounter various issues that hinder data extraction. Here are common issues and their solutions to help you troubleshoot effectively.

Issue #1: Request Timeouts

  • Description: Requests to the target website take too long and eventually time out.

  • Possible Causes:

    • Network connectivity issues.
    • The target server is overloaded or slow.
    • Firewalls or security software blocking the request.
  • Solutions:

    • Increase Timeout: Adjust the timeout settings in your request library to allow more time for the server to respond.

      response = requests.get(url, timeout=30)
    • Retry Mechanism: Implement a retry mechanism with exponential backoff.

      import time

      for attempt in range(5):
          try:
              response = requests.get(url)
              break
          except requests.exceptions.Timeout:
              time.sleep(2 ** attempt)
    • Check Network: Ensure your network connection is stable and that there are no firewalls blocking the requests.

Issue #2: Handling CAPTCHAs

  • Description: Encountering CAPTCHAs that prevent automated access to the website.

  • Possible Causes: The website detects and blocks automated requests.

  • Solutions:

    • Use ScraperAPI: Utilize ScraperAPI’s built-in CAPTCHA solving feature.

      params = {'api_key': API_KEY, 'url': url, 'render': 'true'}
      response = requests.get('http://api.scraperapi.com/', params=params)
    • Human Intervention: For critical data, consider implementing a manual CAPTCHA-solving step where a human user solves the CAPTCHA. Keep in mind that proxy APIs don't solve CAPTCHAs embedded to permanently protect logins or specific data; they deal with CAPTCHAs shown by anti-bot systems when they suspect the requests are coming from a scraper.

Issue #3: Headless Browser Integrations

  • Description: Headless browsers, such as Puppeteer and Selenium, are commonly used for web scraping because they can render JavaScript and interact with web elements. However, integrating them with Proxy APIs can present several challenges.

  • Possible Causes:

    • Headless browsers may face compatibility issues when making background network requests, and headers and cookies might not be maintained consistently across requests.
    • Using a headless browser can be expensive since scraping a single page might involve 10-100+ requests, each potentially being charged by the proxy provider.
    • For using headless browsers effectively, the proxy port integration method is recommended to maintain session consistency and reduce overhead.
  • Solutions:

    • Use proxy port integration which ensures session consistency and maintains headers and cookies across requests.

      from selenium import webdriver
      from selenium.webdriver.chrome.options import Options

      chrome_options = Options()
      chrome_options.add_argument('--headless')
      chrome_options.add_argument('--proxy-server=http://proxy-server.scrapingbee.com:8001')
      chrome_options.add_argument('--disable-gpu')

      driver = webdriver.Chrome(options=chrome_options)
      driver.get('http://example.com')

      print(driver.page_source)
      driver.quit()

    • Consider using providers like ScrapingBee, ScrapingAnt, or ZenRows for better headless browser compatibility.

Issue #4: Invalid Response Data

  • Description: Sometimes, the data retrieved from a proxy API may be invalid or incomplete due to ban pages or failed requests being sent as successful responses.

  • Possible Causes:

    • Ban Pages: Proxy providers may not always detect ban pages, leading to them being passed through as valid responses.
    • Failed Requests: Network issues or server errors may cause incomplete or incorrect data to be returned.
  • Solutions:

    • Reach out to the proxy provider’s support team to report the issue. They can update their ban detection mechanisms to better identify and handle these cases.

    • Choose proxy providers that perform rigorous ban page validation.

      import requests

      API_KEY = 'YOUR_API_KEY'
      target_url = 'http://example.com'
      api_url = 'http://api.scrapingbee.com/v1/'

      params = {
          'api_key': API_KEY,
          'url': target_url,
      }

      response = requests.get(api_url, params=params)
      if response.status_code == 200:
          data = response.json()
          if 'ban' in data['status']:  # Custom logic to check for ban
              print('Ban detected, contacting support...')
              # Contact support logic
          else:
              print(data)
      else:
          print(f'Error: {response.status_code}')
    • Proxy Providers:

      • ScrapingBee: ScrapingBee - Known for better handling of invalid response data and effective customer support.
      • ScrapingAnt: ScrapingAnt - Provides comprehensive validation and support.
      • ZenRows: ZenRows - Offers reliable ban detection and data validation mechanisms.

Legal and Ethical Considerations

Web scraping offers powerful capabilities for data extraction, but it comes with significant legal and ethical considerations that must be respected. Scraping should always be conducted ethically and in compliance with legal standards to avoid potential repercussions:

  • Respect Terms of Service: Always review and adhere to the terms of service of the websites you scrape. Violating these terms can lead to legal action and bans from accessing the site.

  • Follow Privacy Policies: Respect the privacy policies of the websites you scrape. Do not scrape personal data without explicit permission, and comply with data protection regulations like GDPR.

Ignoring ethical guidelines and legal requirements can result in severe consequences:

  • Account Suspension: Websites may suspend or ban accounts that engage in unauthorized scraping activities.
  • Legal Penalties: Violating terms of service or data privacy laws can lead to legal actions, including fines and litigation.
  • Reputation Damage: Unethical scraping practices can harm your or your organization's reputation, leading to loss of trust and credibility.

By understanding and adhering to these legal and ethical guidelines, you can ensure that your web scraping activities are responsible, legal, and respectful of the target websites.


Conclusion

Integrating ScraperAPI into your web scraping projects enhances efficiency and reliability. We've explored setup, handling CAPTCHAs and IP blocking, and advanced techniques using libraries like BeautifulSoup, Scrapy, and Selenium. Remember to scrape responsibly by respecting website terms of service, implementing rate limiting, and using data ethically.

Stay updated with ScraperAPI features and continue to learn and explore for more effective and responsible web scraping.


More Python Web Scraping Guides

Want to take your scraping skills to the next level?

Check out Python Web Scraping Playbook or these additional guides: