Scrapingbee Integration Guide


In today's data-driven world, efficient web scraping is crucial for extracting valuable information from websites. Scrapingbee is a powerful web scraping tool designed to simplify this process, giving developers a clean, hassle-free way to retrieve web data without worrying about issues like JavaScript rendering or complex request handling. With Scrapingbee, you can access dynamic content, bypass common scraping blockers, and retrieve data in a reliable, scalable way. This comprehensive guide offers detailed instructions, code examples, and best practices to help you make the most of Scrapingbee's features.

Need help scraping the web?

Then check out ScrapeOps, the complete toolkit for web scraping.


TLDR: Web Scraping With Scrapingbee?

Integrating Scrapingbee into your web scraping projects can significantly streamline your data extraction process. For those looking to dive straight into the action, here are the key tips for setting up and using Scrapingbee:
  • Begin by signing up for a Scrapingbee account and generating your API key. Use the API to send requests by specifying the URL you want to scrape. Scrapingbee handles JavaScript rendering and proxies with ease, making it an efficient option for both static and dynamic pages.
To demonstrate how Scrapingbee works in practice, here’s a simple example using Python’s requests library:
import requests

api_key = 'your_api_key_here'
url = 'https://example.com'

# Making a GET request to Scrapingbee
response = requests.get(
    'https://app.scrapingbee.com/api/v1/',
    params={
        'api_key': api_key,
        'url': url,
        'render_js': 'true'  # This parameter enables JavaScript rendering
    }
)
print(response.text)
  • Leverage Scrapingbee's advanced configurations to optimize your scraping. Use the render_js parameter to scrape pages with JavaScript content, and adjust proxy settings to overcome site restrictions.
For even faster results, fine-tune the wait_for and block_resources options to avoid loading unnecessary elements like images and ads.
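As a rough sketch of such tuning (assuming the wait_for and block_resources parameters behave as described in Scrapingbee's documentation; the #content selector is only a placeholder), a tuned request might look like this:
import requests

api_key = 'your_api_key_here'
url = 'https://example.com'

response = requests.get(
    'https://app.scrapingbee.com/api/v1/',
    params={
        'api_key': api_key,
        'url': url,
        'render_js': 'true',
        'wait_for': '#content',      # placeholder CSS selector to wait for before returning HTML
        'block_resources': 'true',   # skip loading images/CSS to speed up rendering
    }
)
print(response.status_code)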

What Is Scrapingbee?

Scrapingbee is designed to simplify the web scraping process by handling many of the challenges that developers often face. It allows you to bypass technical barriers such as CAPTCHA protection, IP blocking, and complex JavaScript-rendered pages.
ScrapingBee Homepage
Scrapingbee does this by offering an easy-to-use API that handles the heavy lifting for you, making it ideal for both beginners and experienced developers. With Scrapingbee, you can scrape a variety of data types, including:
  • Text Content: Extract articles, blog posts, or descriptions from websites.
  • Images and Media: Download images, videos, or other multimedia files for analysis or archiving.
  • Structured Data: Retrieve product information, pricing, or user reviews in a structured format.
  • Dynamic Content: Scrape data from pages that rely on JavaScript, such as Single Page Applications (SPAs) or AJAX-loaded content.

Why Use Scrapingbee?

Scrapingbee helps you overcome some of the most common hurdles in web scraping:
  • CAPTCHA Handling: Automatically solves CAPTCHA challenges that block automated bots.
  • IP Blocking: Utilizes rotating proxies and different user agents to prevent detection and IP bans.
  • JavaScript Rendering: Supports scraping data from websites that require JavaScript execution, ensuring you can access all the content you need.
By leveraging Scrapingbee, you can focus on extracting and utilizing the data rather than dealing with technical obstacles, making it a reliable tool for your web scraping needs.

How Does Scrapingbee Work?

Scrapingbee simplifies the process of web scraping by acting as a proxy service that handles your HTTP requests and responses. Instead of building and maintaining complex scraping infrastructure, you can use Scrapingbee's API to fetch web data efficiently and reliably. Key components of Scrapingbee include:
| Component | Description |
| --- | --- |
| API Key | A unique key provided by Scrapingbee that grants access to its services. Each request must include your API key for authentication. |
| Endpoints | Scrapingbee offers specific URLs (endpoints) for various functionalities, such as rendering JavaScript, bypassing CAPTCHA, or using rotating proxies. These endpoints determine the behavior of your scraping requests. |
| Parameters | You can customize your requests by adding parameters such as headers, cookies, proxy settings, and JavaScript rendering options. This flexibility allows you to tailor the scraping process to your specific needs. |
To demonstrate how Scrapingbee works in practice, here’s a simple example using Python’s requests library:
import requests

api_key = 'your_api_key_here'
url = 'https://example.com'

# Making a GET request to Scrapingbee
response = requests.get(
    'https://app.scrapingbee.com/api/v1/',
    params={
        'api_key': api_key,
        'url': url,
        'render_js': 'true'  # This parameter enables JavaScript rendering
    }
)
print(response.text)
  • render_js instructs Scrapingbee to render JavaScript on the page before returning the HTML, allowing you to scrape dynamic content.
This basic request demonstrates how Scrapingbee handles the complexities of web scraping, letting you retrieve data from a webpage with minimal effort. By adjusting the parameters and endpoints, you can fine-tune your requests to meet specific scraping requirements.

Response Format

Scrapingbee allows you to choose the format of the response you receive, depending on your needs.
  • HTML Response: By default, Scrapingbee returns the raw HTML of the webpage you are scraping. This is useful when you need to extract and parse specific elements directly from the HTML content.
  • JSON Response: If your target webpage returns data in JSON format, or if you prefer to receive metadata along with the HTML, Scrapingbee can provide a JSON response that includes additional information about the request and the webpage.
Extra Data in JSON Response: When requesting a JSON response, Scrapingbee provides additional metadata, such as:
  • Final URL: The final URL after all redirects.
  • Status Code: The HTTP status code of the response.
  • Content-Type: The content type of the response.
  • Cookies: Any cookies set during the request.
To specify the response format in your Scrapingbee request, you can use the extract_rules parameter for JSON or leave it out for raw HTML:
import requests
api_key = 'your_api_key_here'
url = 'https://example.com'

response = requests.get(
    'https://app.scrapingbee.com/api/v1/',
    params={
        'api_key': api_key,
        'url': url
    }
)
print(response.text)
To receive a JSON response with extraction rules instead:
import requests
api_key = 'your_api_key_here'
url = 'https://example.com'

response = requests.get(
    'https://app.scrapingbee.com/api/v1/',
    params={
        'api_key': api_key,
        'url': url,
        'extract_rules': 'your_rules_here'  # Use specific rules or pass 'true' for default metadata
    }
)
print(response.json())

Scrapingbee Pricing

Scrapingbee offers several pricing plans tailored to different levels of web scraping needs. Here's an overview of the available plans:
  1. Monthly Plans: Scrapingbee provides monthly subscription plans that scale according to the number of API credits and concurrent requests you need. The plans are:
    • Freelance Plan: $29 per month, includes 250,000 API credits and 5 concurrent requests.
    • Startup Plan: $99 per month, includes 1,000,000 API credits and 10 concurrent requests.
    • Business Plan: $249 per month, includes 3,000,000 API credits and 50 concurrent requests.
    • Enterprise Plan: Custom pricing based on your specific needs, including dedicated account management.
  2. Pay-As-You-Go: Scrapingbee also allows for custom plans where pricing and features can be adjusted based on your requirements.
Charging for successful requests: You are typically charged only for successful requests, i.e. requests where the target data is actually retrieved. Keep in mind, however, that credits may still be deducted in some failure scenarios, such as attempts blocked by anti-scraping mechanisms.
| Plan | Monthly Price | API Credits | Concurrent Requests | Features |
| --- | --- | --- | --- | --- |
| Freelance | $29 | 250,000 | 5 | JavaScript rendering, rotating proxies |
| Startup | $99 | 1,000,000 | 10 | Geotargeting, priority email support |
| Business | $249 | 3,000,000 | 50 | API store, advanced features |
| Enterprise | Custom | Custom | Custom | Dedicated account manager, custom features |
API credit pricing by feature:
| Feature | Cost (Credits) |
| --- | --- |
| Standard HTML Request | 1 credit per request |
| JavaScript Rendering | 5 credits per request |
| CAPTCHA Solving | 10 credits per CAPTCHA |
| Proxy Rotation | Included in plan |
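To get a feel for how these costs add up, here is a small back-of-the-envelope estimator based on the table above (the per-feature costs are taken directly from the table; actual billing depends on your plan and on which features each request enables):
# Rough credit estimator based on the per-feature costs listed above
CREDIT_COSTS = {
    'standard_html': 1,         # per request
    'javascript_rendering': 5,  # per request
    'captcha_solving': 10,      # per CAPTCHA
}

def estimate_credits(num_requests, render_js=False, captchas_solved=0):
    # Pick the per-request cost based on whether JavaScript rendering is enabled
    per_request = CREDIT_COSTS['javascript_rendering'] if render_js else CREDIT_COSTS['standard_html']
    return num_requests * per_request + captchas_solved * CREDIT_COSTS['captcha_solving']

# Example: 10,000 JavaScript-rendered pages plus 50 solved CAPTCHAs
print(estimate_credits(10_000, render_js=True, captchas_solved=50))  # 50500 credits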

Response Status Codes

Scrapingbee follows standard HTTP status codes to indicate the success or failure of your requests. Here’s a table of the possible status codes and their meanings:
| Status Code | Description |
| --- | --- |
| 200 OK | The request was successful, and the response contains the expected data. |
| 400 Bad Request | The request was invalid or improperly formatted. |
| 401 Unauthorized | The API key is missing or invalid. |
| 403 Forbidden | Access to the requested resource is denied, often due to IP blocking or CAPTCHA. |
| 404 Not Found | The requested URL was not found on the server. |
| 429 Too Many Requests | The rate limit has been exceeded. |
| 500 Internal Server Error | A server-side error occurred; retry the request. |
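In practice you will want to branch on these codes. Here is a minimal sketch (reusing the request pattern from the earlier examples) that retries on 429 and 500 responses with exponential backoff:
import time
import requests

api_key = 'your_api_key_here'

def fetch_with_retries(target_url, max_retries=3):
    for attempt in range(max_retries):
        response = requests.get(
            'https://app.scrapingbee.com/api/v1/',
            params={'api_key': api_key, 'url': target_url},
        )
        if response.status_code == 200:
            return response.text
        if response.status_code in (429, 500):
            time.sleep(2 ** attempt)  # back off before retrying on rate limits or server errors
            continue
        response.raise_for_status()  # other 4xx errors are unlikely to succeed on retry
    raise RuntimeError(f"Giving up on {target_url} after {max_retries} attempts")

print(fetch_with_retries('https://example.com')[:200])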

Setting Up Scrapingbee

Setting up Scrapingbee is straightforward and involves a few key steps. Now, we will walk through creating an account, obtaining an API key, configuring basic settings, and understanding request limits. Additionally, example code snippets will demonstrate the initial setup in Python.
  1. Visit the Scrapingbee website and click on the Sign Up button.
  2. After logging into your account, navigate to the dashboard.
ScrapingBee Dashboard
  3. Your unique API key will be prominently displayed on the dashboard. Copy this key, as it will be required for making API requests.
  4. In the dashboard, you can configure basic settings such as API endpoints and request parameters, and manage your API keys. Ensure you understand the usage limits of your plan. Each plan has specific limits on the number of API credits, concurrent requests, and features available.
  5. Each plan has a defined number of API credits per month. For example, the Freelance plan offers 250,000 credits, while the Startup plan provides 1,000,000 credits. Credits are deducted based on the complexity and type of requests.
ScrapingBee Pricing
  6. The number of concurrent requests your plan supports is crucial for high-volume scraping tasks. Lower-tier plans support fewer concurrent requests (e.g., 5 for Freelance), while higher-tier plans allow more (e.g., 50 or more for Business).
  7. Scrapingbee’s API endpoint is the primary URL where all your requests are directed. Typically, it looks like this: https://app.scrapingbee.com/api/v1/
  8. Scrapingbee also allows you to route your requests through a proxy by specifying a proxy port in your requests. This is useful for geotargeting and avoiding IP bans.
  9. Scrapingbee supports several programming languages through SDKs, making it easier to integrate the service into your application. Supported SDKs include:
  • Python
  • Node.js
  • Java
  • Ruby
  • PHP
ScrapingBee Docs
  10. Scrapingbee’s proxy supports a variety of HTTP methods to cater to different needs (see the POST example after this list):
  • GET: Used for retrieving data from a specified resource.
  • POST: Used for submitting data to a specified resource.
  • PUT: Used for updating a specified resource.
  • DELETE: Used for deleting a specified resource.
  • HEAD: Similar to GET but does not return the body, only the headers.
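For example, a POST request can be forwarded by sending the POST to Scrapingbee's main endpoint with the target URL as a parameter. The sketch below assumes the request body is passed through to the target (httpbin.org/post is an echo service used purely for illustration); check the current documentation for the exact forwarding behavior:
import requests

api_key = 'your_api_key_here'
target_url = 'https://httpbin.org/post'  # echo service used for illustration

# Forward a POST request through Scrapingbee's main endpoint
response = requests.post(
    'https://app.scrapingbee.com/api/v1/',
    params={'api_key': api_key, 'url': target_url},
    data={'username': 'demo', 'password': 'secret'},
)
print(response.status_code)
print(response.text)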

API Endpoint Integration

Integrating with Scrapingbee’s API endpoint involves sending HTTP requests to Scrapingbee’s servers, which then handle the complex tasks of scraping data from websites. The service manages browser sessions, proxy rotations, and CAPTCHA bypassing, making it ideal for both simple and complex scraping tasks.
Why/When to Use It?
  • Why Use It? If your application requires data extraction from web pages, particularly those with complex structures or JavaScript, Scrapingbee’s API is essential. It handles the heavy lifting by executing JavaScript, rotating proxies, and avoiding blocks.
  • When to Use It? Use the API endpoint when you need reliable, automated access to web data without the need to build and maintain your own scraping infrastructure.
Here’s an example of a basic request to Scrapingbee’s API endpoint using Python’s requests library:
import requests
api_key = 'your_api_key_here'
url = 'https://example.com'

response = requests.get(
    'https://app.scrapingbee.com/api/v1/',
    params={
        'api_key': api_key,
        'url': url,
        'render_js': 'true'  # Enables JavaScript rendering
    }
)
print(response.text)
For more details and advanced configurations, refer to the official Scrapingbee documentation.

Proxy Port Integration

When integrating with Scrapingbee's API via a proxy port, the process involves routing your HTTP requests through Scrapingbee’s proxies. This setup is particularly useful when transitioning from a traditional proxy solution, as it allows for seamless integration with minimal changes to your existing codebase. Scrapingbee's proxy ports are optimized for headless browser operations, enabling more reliable data extraction from web pages with complex JavaScript and anti-scraping measures.
Why/When to Use It?
  • Why Use It? Using a proxy port is beneficial when you need to simulate requests from different geographical locations or IP addresses. It is also advantageous for integrating headless browsers more effectively, allowing you to bypass detection and access dynamic content that standard scraping tools might miss.
  • When to Use It? You should consider using proxy port integration when you’re dealing with websites that employ stringent anti-scraping measures, or when you need to ensure that your scraping activities are distributed across multiple IPs to avoid bans.
Here’s a basic example using Python to send a request through Scrapingbee’s proxy port:
import requests
api_key = 'your_api_key_here'
url = 'https://example.com'
proxy = 'http://proxy.scrapingbee.com:8080'  # Example proxy port

response = requests.get(
    url,
    proxies={"http": proxy, "https": proxy},
    headers={"X-API-KEY": api_key}
)
print(response.text)
For more detailed information on setting up and using proxy ports with Scrapingbee, refer to their documentation on proxy integration.

SDK Integration

Scrapingbee provides SDKs in several programming languages, allowing developers to integrate its API more easily into their applications. The SDKs abstract much of the complexity involved in making API calls, making it simpler for developers, especially beginners, to start using Scrapingbee’s features with minimal setup.
Why/When to Use It?
  • Why Use It? SDKs simplify the process of integration by providing pre-built functions and classes tailored to specific programming languages. This makes it easier for beginners or those unfamiliar with web scraping to implement Scrapingbee’s capabilities without needing to handle raw HTTP requests and responses manually.
  • When to Use It? You should use an SDK when you want a smoother, faster integration process, or if you prefer working within the environment of your chosen programming language without dealing directly with HTTP requests.
Here’s how you can use the Scrapingbee SDK for Python:
from scrapingbee import ScrapingBeeClient
client = ScrapingBeeClient(api_key='your_api_key_here')

response = client.get(
    url='https://example.com',
    params={'render_js': 'true'}
)
print(response.content)
You can see the list of available SDKs, along with more details and examples, in the Scrapingbee SDK documentation.

Async Response Integration

Async response integration in Scrapingbee allows you to handle API requests asynchronously. This is particularly useful when you need to manage multiple requests without blocking your application’s main thread. By using asynchronous requests, you can reduce server load on your end, potentially lower server costs, and avoid the complexity of managing retries or timeouts manually.
Why/When to Use It?
  • Why Use It? Asynchronous processing is essential when dealing with large-scale scraping tasks that require high concurrency. It helps in reducing server workload on the client side and ensures that your application remains responsive while waiting for multiple requests to complete.
  • When to Use It? Use async response integration when you’re performing high-volume scraping that involves multiple concurrent requests, or when you need to optimize performance by handling responses without blocking other processes.
Here’s an example of using Python’s asyncio library in combination with aiohttp for asynchronous requests to Scrapingbee:
import asyncio
import aiohttp

API_KEY = 'your_api_key_here'

async def fetch(session, url):
    # Route each request through the Scrapingbee API endpoint
    params = {'api_key': API_KEY, 'url': url}
    async with session.get('https://app.scrapingbee.com/api/v1/', params=params) as response:
        return await response.text()

async def main():
    urls = [
        'https://example1.com',
        'https://example2.com',
        'https://example3.com',
    ]
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        responses = await asyncio.gather(*tasks)
        for response in responses:
            print(response)

# Run the async loop
asyncio.run(main())

Managing Concurrency

Scrapingbee can be integrated with popular scraping libraries like BeautifulSoup and Scrapy to automate and optimize web scraping tasks. Both libraries have unique features that make them suitable for different scraping scenarios.
  • BeautifulSoup is a Python library for parsing HTML and XML documents. It provides Pythonic idioms for iterating, searching, and modifying the parse tree. This library is particularly useful for smaller scraping tasks where the HTML structure is not overly complex. It is known for its simplicity and ease of use, allowing developers to quickly extract data from web pages.
  • Scrapy is a powerful and fast open-source web crawling framework written in Python. Scrapy is suitable for complex and large-scale scraping projects requiring high performance and scalability. It is designed for large-scale web scraping with features like:
    • Built-in support for handling requests and responses: Scrapy automatically manages request retries, redirects, and handles cookies and sessions.
    • Selectors: Scrapy uses XPath and CSS selectors to extract data from web pages.
    • Item Pipelines: Scrapy provides mechanisms to clean, validate, and store scraped data.
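While the detailed example below uses BeautifulSoup, Scrapy can be pointed at Scrapingbee in much the same way by wrapping each target URL in an API call before yielding the request. Here is a minimal sketch (the spider and selectors target quotes.toscrape.com purely for illustration; Scrapingbee also publishes a dedicated scrapy-scrapingbee middleware, which is usually the more idiomatic route for larger projects):
import scrapy
from urllib.parse import urlencode

API_KEY = 'YOUR_API_KEY'

def scrapingbee_url(url):
    # Wrap the target URL in a Scrapingbee API call
    return 'https://app.scrapingbee.com/api/v1/?' + urlencode({'api_key': API_KEY, 'url': url})

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        for page in range(1, 4):
            target = f'http://quotes.toscrape.com/page/{page}/'
            yield scrapy.Request(scrapingbee_url(target), callback=self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }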
Here’s a detailed example of integrating Scrapingbee with BeautifulSoup to scrape data from Quotes to Scrape:
import requests
from bs4 import BeautifulSoup
import concurrent.futures
from urllib.parse import urlencode

API_KEY = 'YOUR_API_KEY'
NUM_THREADS = 5

def get_proxy_url(url):
    payload = {'api_key': API_KEY, 'url': url}
    proxy_url = 'https://app.scrapingbee.com/api/v1/?' + urlencode(payload)
    return proxy_url

list_of_urls = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
    'http://quotes.toscrape.com/page/3/',
]

output_data_list = []

def scrape_page(url):
    try:
        response = requests.get(get_proxy_url(url))
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, "html.parser")
            title = soup.find('h1').text
            output_data_list.append({
                'title': title,
            })
    except Exception as e:
        print('Error', e)

with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    executor.map(scrape_page, list_of_urls)

print(output_data_list)
  • scrape_page takes a URL as input, wraps it with get_proxy_url, and sends the request through Scrapingbee; the extracted title is appended to output_data_list.
  • Concurrent threads allow the script to handle multiple scraping tasks simultaneously, significantly speeding up the data extraction process. Instead of waiting for one URL to be scraped before starting the next, multiple URLs are processed in parallel.
  • The concurrent.futures.ThreadPoolExecutor creates a pool of worker threads, specified by NUM_THREADS. Each thread executes the scrape_page function on a different URL from list_of_urls. By using concurrent threads, the script can maximize the usage of system resources and reduce the total time required for scraping multiple pages. This is particularly useful when dealing with a large number of URLs or when the target websites have slower response times.
This script will allow you to scrape multiple pages concurrently while respecting Scrapingbee's API limits. You can adjust NUM_THREADS based on your subscription plan's concurrency limit.

Advanced Functionality

Scrapingbee offers several advanced features that enable more complex and tailored web scraping tasks. These functionalities can be activated through specific query parameters when making API requests. Enabling these features may consume additional API credits, depending on the complexity of the operation. To leverage advanced functionality, you typically include additional parameters in your API request. These parameters allow you to customize your scraping environment, such as enabling JavaScript rendering, rotating proxies, geotargeting, or handling CAPTCHAs. Here’s a sample code snippet demonstrating how to enable some of these advanced features using Scrapingbee's API:
import requests
from urllib.parse import urlencode

API_KEY = 'YOUR_API_KEY'
url = 'https://example.com'

params = {
    'api_key': API_KEY,
    'url': url,
    'render_js': 'true',   # Enables JavaScript rendering
    'country_code': 'us',  # Geotargeting to US
}
response = requests.get('https://app.scrapingbee.com/api/v1/?' + urlencode(params))
print(response.text)
Here’s a table outlining some of the advanced features offered by Scrapingbee, the API credits they consume, and a brief description:
| Parameter | API Credits | Description |
| --- | --- | --- |
| render_js | 5 credits per request | Enables JavaScript rendering to load dynamic content. |
| block_ads | 1 credit per request | Blocks ads from loading on the target webpage. |
| country_code | 3 credits per request | Geotargeting feature to route requests through specific countries. |
| premium_proxy | 10 credits per request | Uses a premium proxy to avoid detection and enhance anonymity. |
| captcha_solve | 15 credits per CAPTCHA | Automatically solves CAPTCHA challenges on the target webpage. |
| screenshot | 10 credits per request | Takes a screenshot of the webpage at the end of the request. |
For a comprehensive list of all advanced functionalities and their specific usage, please refer to the official Scrapingbee documentation.

Javascript Rendering

JavaScript rendering is a feature provided by Scrapingbee that allows the API to execute JavaScript on a webpage before returning the HTML content. This is particularly useful when scraping websites that rely on JavaScript to load or modify content dynamically, such as Single Page Applications (SPAs) or sites that load data via AJAX.
Why Use It?
  • Dynamic Content: Many modern websites rely on JavaScript to load content, meaning that the data you need might not be present in the initial HTML response. JavaScript rendering ensures that all dynamic content is fully loaded before the HTML is returned, giving you access to the complete data.
  • Bypass Anti-Scraping Measures: Some websites use JavaScript to detect and block bots. By rendering JavaScript, Scrapingbee can help bypass some of these measures.
  • Handle SPAs: Single Page Applications (SPAs) often load data dynamically as users interact with the page. JavaScript rendering allows you to capture this content as it would appear in a fully loaded browser.
Enabling JavaScript rendering with Scrapingbee costs 5 API credits per request. Here’s how you can use JavaScript rendering with Scrapingbee in Python:
import requests
from urllib.parse import urlencode

API_KEY = 'your_api_key_here'
url = 'https://example.com'

params = {
    'api_key': API_KEY,
    'url': url,
    'render_js': 'true',  # Enables JavaScript rendering
}
response = requests.get('https://app.scrapingbee.com/api/v1/?' + urlencode(params))
print(response.text)
Check the official documentation for more details.

Controlling The Browser

If Scrapingbee or a similar service supports inserting JavaScript commands into the browser during a scraping session, you can control aspects of the browser, such as scrolling, clicking, and other interactions that might be necessary to trigger certain elements on a page (e.g., loading additional content). Here’s a code snippet demonstrating how to inject and execute JavaScript commands using Scrapingbee:
import requests
from urllib.parse import urlencode

API_KEY = 'your_api_key_here'
url = 'https://example.com'

params = {
    'api_key': API_KEY,
    'url': url,
    'render_js': 'true',
    'js_snippet': 'window.scrollTo(0, document.body.scrollHeight);',  # Scrolls to the bottom of the page
}
response = requests.get('https://app.scrapingbee.com/api/v1/?' + urlencode(params))
print(response.text)
The table below summarizes this functionality:
| Parameter | Description |
| --- | --- |
| js_snippet | Injects and executes a custom JavaScript command in the browser. |
| scroll_to | Scrolls to a specific part of the page (e.g., 0, document.body.scrollHeight). |
| click_element | Simulates a click on a specified element. |
Refer to the documentation for a better understanding of browser control.

Country Geotargeting

Country Geotargeting is a feature offered by Scrapingbee that allows you to route your web scraping requests through servers located in specific countries. This ensures that the target website sees the request as originating from a particular location, which can be crucial for accessing region-specific content or bypassing geographical restrictions.
Why Use It?
  • Access Region-Specific Content: Some websites display different content based on the visitor's location. Country Geotargeting allows you to scrape content as if you were located in a specific country.
  • Bypass Geo-Restrictions: If a website restricts access to certain regions, using geotargeting can help you bypass these restrictions and access the desired content.
  • SEO and Market Research: For businesses conducting SEO analysis or market research, it's essential to view how a website behaves or ranks in different countries.
Enabling Country Geotargeting costs 3 API credits per request. Here’s how you can use it with Scrapingbee in Python:
import requests
from urllib.parse import urlencode

API_KEY = 'your_api_key_here'
url = 'https://example.com'

params = {
    'api_key': API_KEY,
    'url': url,
    'country_code': 'us',  # Targeting the United States
}
response = requests.get('https://app.scrapingbee.com/api/v1/?' + urlencode(params))
print(response.text)
Table of country codes:
| Country Code | Country |
| --- | --- |
| us | United States |
| fr | France |
| de | Germany |
| jp | Japan |
| uk | United Kingdom |
For more details, check the official documentation.

Residential Proxies

Residential proxies, sometimes referred to as premium proxies, are IP addresses provided by Internet Service Providers (ISPs) to homeowners. These proxies are considered more trustworthy by websites because they mimic the browsing behavior of real users, making them less likely to be detected and blocked by anti-scraping tools.
Why Use It?
  • Avoid Detection: Residential proxies are less likely to be flagged as suspicious by websites, making them ideal for scraping sensitive or heavily protected sites.
  • Higher Success Rates: Since residential proxies come from real ISPs, they offer higher success rates for scraping without getting blocked.
  • Bypass Advanced Anti-Scraping Measures: Some websites deploy sophisticated anti-scraping techniques that can detect and block data center proxies. Residential proxies are more effective in bypassing these measures.
Using residential proxies costs 10 API credits per request. Here’s how you can use residential proxies with Scrapingbee in Python:
import requests
from urllib.parse import urlencode

API_KEY = 'your_api_key_here'
url = 'https://example.com'

params = {
    'api_key': API_KEY,
    'url': url,
    'premium_proxy': 'true',  # Enables residential (premium) proxy
}
response = requests.get('https://app.scrapingbee.com/api/v1/?' + urlencode(params))
print(response.text)

Custom Headers

Custom header functionality allows you to define and send your own HTTP headers when making requests through Scrapingbee's API. By default, proxy APIs manage request headers to optimize performance and reduce the chances of detection. However, there are scenarios where you might need to send custom headers, such as when making POST requests, handling authentication, or bypassing specific anti-bot measures.
Why Use It?
  • Access Specific Data: Some websites require specific headers to return the desired content, such as authorization tokens or custom user-agent strings.
  • POST Requests: Custom headers are often necessary for POST requests where the server expects particular content types, authentication tokens, or other metadata.
  • Bypass Anti-Bot Systems: In some cases, sending specific headers can help bypass anti-bot systems that might block or throttle requests.
Word of Caution
  • Reduced Performance: Using custom headers can reduce the effectiveness of proxy APIs since static headers may flag your requests as automated. Proxy services often optimize headers to avoid detection, and overriding them might result in blocks or reduced performance.
  • Header Generation: For large-scale scraping, relying on static headers is risky. You'll need a system to generate and rotate headers dynamically to maintain the appearance of human activity.
  • Use Sparingly: Only use custom headers when absolutely necessary. Let the proxy service manage headers whenever possible to ensure optimal performance and avoid detection.
Here’s how you can use custom headers with Scrapingbee in Python:
import requests
from urllib.parse import urlencode

API_KEY = 'your_api_key_here'
url = 'https://example.com'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
    'Authorization': 'Bearer your_auth_token',
}

params = {
    'api_key': API_KEY,
    'url': url,
}
response = requests.get('https://app.scrapingbee.com/api/v1/?' + urlencode(params), headers=headers)
print(response.text)
Check the documentation for more details!

Static Proxies

Static proxy functionality, also known as sticky sessions, allows you to use the same IP address for multiple requests over a period of time. This is useful for maintaining session consistency on websites that use cookies or other tracking mechanisms to associate multiple requests with the same user.
Why Use It?
  • Session Persistence: When interacting with websites that maintain sessions (e.g., logging in and navigating through multiple pages), a static proxy ensures that all requests come from the same IP, maintaining the session.
  • Avoiding Detection: For certain tasks, using a static IP helps avoid detection by making the requests appear more like those of a typical user.
Here’s how you can use static proxies with Scrapingbee in Python:
import requests
from urllib.parse import urlencode

API_KEY = 'your_api_key_here'
url = 'https://example.com'

params = {
    'api_key': API_KEY,
    'url': url,
    'session': 'true',  # Enables static (sticky) session
}
response = requests.get('https://app.scrapingbee.com/api/v1/?' + urlencode(params))
print(response.text)
Check the official documentation for more info!

Screenshot Functionality

Scrapingbee offers screenshot functionality, which allows you to capture a screenshot of a webpage during scraping. You can customize the screenshot by specifying parameters such as full-page rendering or custom viewport sizes.
Why Use It?
  • Visual Verification: Screenshots help visually confirm that the content loaded correctly, especially for dynamic or JavaScript-heavy pages.
  • Documentation: Screenshots can serve as a visual record, useful for audits or presentations.
  • Handling Dynamic Content: For pages where data might be loaded dynamically, a screenshot can capture elements that are otherwise difficult to scrape.
from scrapingbee import ScrapingBeeClient
client = ScrapingBeeClient(api_key='your_api_key_here')
response = client.get(
    'https://example.com',
    params={
        'screenshot': True,
        'screenshot_full_page': True,  # Optional: Capture the full page
        'window_width': 1920  # Optional: Custom viewport size
    }
)

if response.ok:
    with open("screenshot.png", "wb") as file:
        file.write(response.content)
Check it out here for more examples!

Auto Parsing

Scrapingbee also supports auto parsing through its extraction rules feature. This allows you to define custom selectors to extract specific data elements directly from the HTML, without having to manually parse the response. You can extract content in different formats, such as JSON or arrays, by specifying how the data should be structured.
Why Use It?
  • Efficiency: Auto parsing eliminates the need for custom parsing logic in your code, reducing development time and potential parsing errors.
  • Structured Data: Scrapingbee can return data in structured formats like JSON or arrays, making it easier to work with large or complex datasets.
  • Simplified Extraction: You can target specific HTML elements (e.g., headings, tables) and extract only the necessary information, avoiding unnecessary data.
Here’s how you might use auto parsing:
import requests
from urllib.parse import urlencode

API_KEY = 'your_api_key_here'
url = 'https://example.com'

params = {
    'api_key': API_KEY,
    'url': url,
    'extract_rules': '{"title": {"selector": "h1", "output": "text"}}'
}
response = requests.get('https://app.scrapingbee.com/api/v1/?' + urlencode(params))
print(response.json())
Check it out here for more details!

Case Study: Using Scrapingbee on IMDb Top 250 Movies

IMDb contains tons of data on movies, TV shows, and even video games. Not only is there a lot of data, but it's also extremely varied. For example, you can explore movie descriptions, cast, ratings, trivia, related movies, awards, and more. In addition to that, you’ll find user-generated data, such as reviews.
IMDB HTML Inspection
In this case study, we will demonstrate how to scrape the IMDb Top 250 Movies chart using Scrapingbee. The steps include initializing Scrapingbee, sending a GET request, extracting data, and storing the extracted data in a JSON file.
  1. First, you need to initialize Scrapingbee with your API key. This key will authenticate your requests and enable you to use Scrapingbee's features. Then make sure to install:
pip install requests pandas
  2. After receiving the HTML response, we will use BeautifulSoup to parse the HTML and extract relevant data, including movie titles, rankings, release years, and ratings.
  3. Finally, we will save the extracted data in a JSON file for further analysis or use.
Here’s the complete code example for scraping the IMDb Top 250 Movies chart:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlencode
import json

SCRAPINGBEE_API_KEY = 'INSERT_YOUR_SCRAPINGBEE_API_KEY_HERE'
IMDB_URL = 'https://www.imdb.com/chart/top/'

def get_scrapingbee_url(url):
    payload = {'api_key': SCRAPINGBEE_API_KEY, 'url': url}
    return 'https://app.scrapingbee.com/api/v1/?' + urlencode(payload)

def fetch_html(url):
    proxy_url = get_scrapingbee_url(url)
    response = requests.get(proxy_url)
    if response.status_code == 200:
        return response.text
    response.raise_for_status()  # Raise an HTTPError for bad responses

def parse_imdb_top_250(html_content):
    soup = BeautifulSoup(html_content, "html.parser")
    movies = []
    for item in soup.select('li.ipc-metadata-list-summary-item'):
        title_tag = item.select_one('h3.ipc-title__text')
        year_tag = item.select_one('.sc-b189961a-8.kLaxqf.cli-title-metadata-item')
        rating_tag = item.select_one('.ipc-rating-star--rating')
        if title_tag and year_tag and rating_tag:
            movies.append({
                'title': title_tag.text.strip(),
                'year': year_tag.text.strip(),
                'rating': rating_tag.text.strip()
            })
    return movies

def save_to_json(data, filename):
    with open(filename, 'w') as f:
        json.dump(data, f, indent=4)

def main():
    try:
        html_content = fetch_html(IMDB_URL)
        movies_data = parse_imdb_top_250(html_content)
        save_to_json(movies_data, 'imdb_top_250_movies.json')
        print("Data successfully scraped and saved to imdb_top_250_movies.json")
    except requests.exceptions.RequestException as e:
        print(f"Failed to retrieve content: {e}")

if __name__ == "__main__":
    main()
Final JSON Output
  • The get_scrapingbee_url function constructs the Scrapingbee URL with the target URL and API key. The fetch_html function sends a GET request to the IMDb page through Scrapingbee.
  • Movie details such as titles, release years, and ratings are extracted by selecting specific HTML elements.
  • The extracted data is stored in a list of dictionaries, where each dictionary represents a movie.
By following these steps, you can efficiently scrape and store data from the IMDb Top 250 Movies chart using Scrapingbee.

Challenges and Improvements

While improving your scraping application, make sure to adhere to ethical guidelines to ensure responsible and respectful use of web resources:
  1. Respect Website Resources:
    • Perform scraping at a reasonable rate to avoid overloading IMDb’s servers.
    • Implement delays between requests to reduce server strain.
    • Use Scrapingbee’s rate-limiting features to control request frequency (see the delay sketch after this list).
  2. Use Data Ethically:
    • Extracted data should be used for legitimate and ethical purposes (e.g., research, education, personal projects).
    • Avoid using data for spamming, unauthorized redistribution, or harmful activities.
  3. Handling JavaScript-Loaded Content:
Many modern websites, including IMDb, use JavaScript to load content dynamically, which can complicate scraping.
  • Use Scrapingbee’s JavaScript rendering feature to load and extract dynamic content.
  • Enable the render_js parameter in Scrapingbee requests to ensure all elements are captured.
  4. Prevent IP Blocking:
Websites often track and block IPs making excessive requests in a short period.
  • Use Scrapingbee’s IP rotation feature to distribute requests across multiple IPs.
  • This reduces the chance of detection and prevents IP blacklisting.
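As a simple illustration of the first point, the sketch below spaces Scrapingbee requests out with a fixed delay (the URLs and the two-second pause are arbitrary; tune the delay to the target site and your plan's concurrency limit):
import time
import requests
from urllib.parse import urlencode

API_KEY = 'YOUR_API_KEY'
urls = [f'http://quotes.toscrape.com/page/{i}/' for i in range(1, 4)]

for target in urls:
    params = {'api_key': API_KEY, 'url': target}
    response = requests.get('https://app.scrapingbee.com/api/v1/?' + urlencode(params))
    print(target, response.status_code)
    time.sleep(2)  # pause between requests to reduce strain on the target server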
By incorporating these ethical considerations and leveraging Scrapingbee’s advanced features, you can effectively overcome common web scraping challenges. This approach ensures that your scraping activities are both efficient and respectful of the target website’s resources, enabling you to extract valuable data while maintaining good web scraping practices.

Alternative: ScrapeOps Proxy API Aggregator

ScrapeOps Proxy API Aggregator offers a unified solution for accessing multiple proxy providers through a single API, providing flexibility, reliability, and cost-effectiveness for your web scraping needs.
Why Use ScrapeOps Proxy API Aggregator?
  • Compare Pricing: ScrapeOps generally offers cheaper rates compared to individual proxy providers, making it a cost-effective solution for large-scale scraping projects.
  • More Flexible Plans: ScrapeOps provides a variety of plans, including smaller and more flexible options that can fit different needs and budgets.
  • More Reliable: With access to multiple proxy providers from a single proxy port, ScrapeOps ensures higher reliability and uptime, reducing the risk of IP bans and other disruptions.
Here is a simple example of how to use ScrapeOps Proxy API Aggregator with Python Requests:
import requests
import json

API_KEY = 'YOUR_SCRAPEOPS_API_KEY'
target_url = 'http://quotes.toscrape.com/page/1/'
api_url = 'https://proxy.scrapeops.io/v1/'

headers = {
    'Accept': 'application/json',
    'Content-Type': 'application/json',
    'API-KEY': API_KEY
}

payload = {
    'url': target_url,
    'render': 'true',
    'headers': {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
}

response = requests.post(api_url, headers=headers, data=json.dumps(payload))

if response.status_code == 200:
    print(response.json())
else:
    print(f'Error: {response.status_code}')
  • In this example, the render parameter enables JavaScript rendering, and custom headers are included to mimic a real browser request.
Take advantage of the free trial offered by ScrapeOps, which includes 500MB of free bandwidth. This allows you to test the service and see the benefits for yourself before committing to a paid plan. For more information and to start your free trial, visit the ScrapeOps Proxy API Aggregator. For detailed instructions and additional features, refer to the ScrapeOps Proxy API Aggregator Quickstart Guide.

Troubleshooting

Web scraping can sometimes encounter various issues that hinder data extraction. Here are common issues and their solutions to help you troubleshoot effectively.

Issue #1: Request Timeouts

  • Description: Requests to the target website take too long and eventually time out.
  • Possible Causes:
    • Network connectivity issues.
    • The target server is overloaded or slow.
    • Firewalls or security software blocking the request.
  • Solutions:
    • Increase Timeout: Adjust the timeout settings in your request library to allow more time for the server to respond.
    response = requests.get(url, timeout=30)
    • Retry Mechanism: Implement a retry mechanism with exponential backoff.
    import time

    for attempt in range(5):
        try:
            response = requests.get(url)
            break
        except requests.exceptions.Timeout:
            time.sleep(2 ** attempt)
    • Check Network: Ensure your network connection is stable and that there are no firewalls blocking the requests.

Issue #2: Incorrect Data Extraction

  • Description: The data extracted does not match the expected results or is incomplete.
  • Possible Causes:
    • Changes in the website’s HTML structure.
    • Incorrect CSS selectors or XPath expressions.
    • JavaScript loading delays.
  • Solutions:
    • Update Selectors: Verify and update your CSS selectors or XPath expressions to match the current structure of the website.
    soup.select('div.new-class-name')
    • Use JavaScript Rendering: Enable JavaScript rendering to ensure all dynamic content is loaded before extraction.
    params = {'api_key': API_KEY, 'url': url, 'render_js': 'true'}
    response = requests.get('https://app.scrapingbee.com/api/v1/?' + urlencode(params))
    • Debugging: Print out the HTML response and visually inspect it to understand the structure and find the correct selectors.

Issue #3: Handling CAPTCHAs

  • Description: Encountering CAPTCHAs that prevent automated access to the website.
  • Possible Causes: The website detects and blocks automated requests.
  • Solutions:
    • Use Scrapingbee: Utilize Scrapingbee’s built-in CAPTCHA solving feature.
    params = {'api_key': API_KEY, 'url': url, 'render_js': 'true'}
    response = requests.get('https://app.scrapingbee.com/api/v1/?' + urlencode(params))
    • Human Intervention: For critical data, consider implementing a manual CAPTCHA solving step where a human user solves the CAPTCHA.

Dynamic Content Issues

  • Description: Issues related to content that is loaded dynamically via JavaScript.
  • Solutions:
    • JavaScript Rendering: Use tools like Scrapingbee to render JavaScript and ensure all dynamic content is captured.
    params = {'api_key': API_KEY, 'url': url, 'render_js': 'true'}
    response = requests.get('https://app.scrapingbee.com/api/v1/?' + urlencode(params))
    • Selenium: Use Selenium for browser automation to handle complex interactions and dynamically loaded content.
    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get(url)
    content = driver.page_source

Handling Alerts and Pop-ups

  • Description: Managing browser alerts, confirmation dialogs, and pop-ups during scraping.
  • Solutions:
  • Selenium: Use Selenium to handle browser alerts and pop-ups.
from selenium import webdriver
from selenium.common.exceptions import NoAlertPresentException

driver = webdriver.Chrome()
driver.get(url)

try:
    alert = driver.switch_to.alert
    alert.accept()  # Accepts the alert
except NoAlertPresentException:
    pass

Browser Compatibility Issues

  • Description: Ensuring that the scraping script works across different browsers and versions.
  • Solutions:
    • WebDriver Manager: Use WebDriver Manager to handle browser driver compatibility.
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())
By understanding and implementing these troubleshooting techniques, you can resolve common issues encountered during web scraping and ensure smooth and efficient data extraction.
Web scraping offers powerful capabilities for data extraction, but it comes with significant legal and ethical considerations that must be respected. Web scraping should always be conducted ethically and in compliance with legal standards to avoid potential repercussions:
  • Respect Terms of Service: Always review and adhere to the terms of service of the websites you scrape. Violating these terms can lead to legal actions and bans from accessing the site.
  • Follow Privacy Policies: Ensure that you respect the privacy policies of the websites. Do not scrape personal data without explicit permission, and comply with data protection regulations like GDPR.
  • Ignoring ethical guidelines and legal requirements can result in severe consequences:
    • Account Suspension: Websites may suspend or ban accounts that engage in unauthorized scraping activities.
    • Legal Penalties: Violating terms of service or data privacy laws can lead to legal actions, including fines and litigation.
    • Reputation Damage: Unethical scraping practices can harm your or your organization's reputation, leading to loss of trust and credibility.
By understanding and adhering to these legal and ethical guidelines, you can ensure that your web scraping activities are responsible, legal, and respectful of the target websites.

Conclusion

Integrating Scrapingbee into your web scraping projects enhances efficiency and reliability. We've explored setup, handling CAPTCHAs and IP blocking, and advanced techniques using libraries like BeautifulSoup, Scrapy, and Selenium. Remember to scrape responsibly by respecting website terms of service, implementing rate limiting, and using data ethically. Stay updated with Scrapingbee features and continue to learn and explore for more effective and responsible web scraping.

More Python Web Scraping Guides

At ScrapeOps, our learning resources are seemingly endless. We wrote the playbook on web scraping in Python because we just love web scraping that much. You can view it here. To view more of our proxy integration guides, take a look at the articles below.