Bright Data Unlocker: Web Scraping Integration Guide

Web Unlocker from Bright Data is a popular tool for scraping the web. It manages a pool of proxies so you don't have to, and it uses a variety of features to get you access to some of the most difficult sites around. In this article, we'll go through the process of signing up for Web Unlocker from start to finish, explore its features in depth, and test out some of its more advanced functionality.

Need help scraping the web?

Then check out ScrapeOps, the complete toolkit for web scraping.


TLDR: Web Scraping With Web Unlocker?

Web Unlocker uses proxy port integration. The quickest way to get started is to create a new zone and then configure your proxy port to work with it.
  1. You'll need your username, password, zone, and the URL of your proxy port.
  2. Once you have those, you can save them to a config.json file and get started (an example layout is sketched just below this list).
  3. Make sure you've set up Bright Data's CA certificate inside your project folder so you don't experience any SSL errors.
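Here is a rough sketch of what that config.json might look like. The "brightdata" keys match what the code in this guide reads, the "scrapeops_api_key" entry is only needed for the ScrapeOps examples later on, and every value below is a placeholder.
{
    "brightdata": {
        "username": "YOUR_CUSTOMER_ID",
        "zone": "YOUR_ZONE_NAME",
        "password": "YOUR_ZONE_PASSWORD"
    },
    "scrapeops_api_key": "YOUR_SCRAPEOPS_API_KEY"
}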
You can use the code below to test your proxy connection.
import requests
import json

# Bright Data Access
brd_config = {}

with open("config.json") as file:
    json_config = json.load(file)
    brd_config = json_config["brightdata"]

brd_superproxy = 'brd.superproxy.io:22225'
brd_connectStr = f"http://brd-customer-{brd_config['username']}-zone-{brd_config['zone']}:{brd_config['password']}@{brd_superproxy}"

# Test URL
brd_test_url = 'https://geo.brdtest.com/welcome.txt'

# Path to CA certificate
ca_cert_path = 'ca.crt'  # Provide the correct certificate file path here

# Proxies dictionary
proxies = {
    'http': brd_connectStr,
    'https': brd_connectStr
}

# Make the request with proxy and custom CA cert
response = requests.get(brd_test_url, proxies=proxies, verify=ca_cert_path)

# Print the response
print(response.text)
You don't need to do much to optimize the connection. Web Unlocker manages all of that for you, so you can focus on your code instead of maintaining a proxy pool. Make sure that you follow ethical scraping practices:
  • Always ensure that the data you're collecting is publicly available and not behind paywalls or restricted access areas.
  • Check the website’s robots.txt file, which tells web crawlers which pages or sections of the site they are allowed or disallowed from accessing.
  • Review the website's Terms of Service (ToS) or Terms of Use to see if it explicitly forbids web scraping or imposes limitations. If scraping is disallowed, you should avoid doing it.
  • Scrape at a reasonable pace to prevent putting excessive load on the website’s servers.
  • If possible, configure your scraper to identify itself (e.g., set the User-Agent string) so the website knows your bot is crawling it (see the sketch just below this list).
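As a rough sketch of that last point, here is one way to identify your scraper with Python Requests when you're not routing through a service that overrides headers. The User-Agent string and contact URL are placeholders, not anything required by Bright Data.
import requests

# A descriptive User-Agent lets site owners see who is crawling them
headers = {
    "User-Agent": "MyResearchBot/1.0 (+https://example.com/bot-info)"
}

response = requests.get("https://quotes.toscrape.com", headers=headers)
print(response.status_code)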

What Is Bright Data's Web Unlocker?

Web Unlocker is an automated proxy manager. It maintains a pool of different proxies and always connects you to the best one, using a variety of features to get you access to some of the most difficult sites around. If a site requires JavaScript execution, Web Unlocker is designed to recognize this and automatically render the page within a browser, solving CAPTCHAs and completing any JavaScript challenges it receives. You can view some of its selling points below.
  • CAPTCHA Solving
  • IP Rotation
  • Request Retries
  • Automated Proxy Management
  • Automatic JavaScript Rendering
Bright Data Unlocker Homepage

How Does Web Unlocker Work?

Web Unlocker uses proxy port integration to act as a middleman between your scraper and the sites you want to scrape. When you configure your scraper to use Web Unlocker, you tell it which site you'd like to access; it gains access to that site, renders the page, and sends the rendered page back to you. Here's how the overall process works:
  1. Your scraper tells Web Unlocker which site you want to access.
  2. Web Unlocker gains access to the site and renders the page.
  3. Web Unlocker sends the rendered HTML page back to your scraper.
As mentioned previously, Web Unlocker is built to work specifically with proxy ports. Here, we'll tweak our test connection from the TLDR to extract data from Quotes To Scrape. In the example below, we're going to find the h1 element and print its text to the terminal.
import requests
import json
from bs4 import BeautifulSoup

# Bright Data Access
brd_config = {}

with open("config.json") as file:
    json_config = json.load(file)
    brd_config = json_config["brightdata"]

brd_superproxy = 'brd.superproxy.io:22225'
brd_connectStr = f"http://brd-customer-{brd_config['username']}-zone-{brd_config['zone']}:{brd_config['password']}@{brd_superproxy}"

# Target URL
brd_test_url = 'https://quotes.toscrape.com'

# Path to CA certificate
ca_cert_path = 'ca.crt'  # Provide the correct certificate file path here

# Proxies dictionary
proxies = {
    'http': brd_connectStr,
    'https': brd_connectStr
}

# Make the request with proxy and custom CA cert
response = requests.get(brd_test_url, proxies=proxies, verify=ca_cert_path)

# Parse the page and find the h1 element
soup = BeautifulSoup(response.text, "html.parser")
h1 = soup.find("h1")

# Print the h1 text
print(h1.text)

Response Format

By default, Web Unlocker returns whatever your target site returns. We cannot explicitly request a JSON response for every request, although that would surely be a nice feature. You can still receive JSON from sites that return JSON, though. The example below makes a call to an API that returns JSON.
import requests
import json
from bs4 import BeautifulSoup

# Bright Data Access
brd_config = {}

with open("config.json") as file:
    json_config = json.load(file)
    brd_config = json_config["brightdata"]

brd_superproxy = 'brd.superproxy.io:22225'
brd_connectStr = f"http://brd-customer-{brd_config['username']}-zone-{brd_config['zone']}:{brd_config['password']}@{brd_superproxy}"

# Path to CA certificate
ca_cert_path = 'ca.crt'  # Provide the correct certificate file path here

# Proxies dictionary
proxies = {
    'http': brd_connectStr,
    'https': brd_connectStr
}

# Make the request with proxy and custom CA cert
response = requests.get("https://lumtest.com/myip.json", proxies=proxies, verify=ca_cert_path)

# Print the response
print(json.dumps(response.json(), indent=4))
Our output looks like this:
{    "ip": "108.165.142.98",    "country": "US",    "asn": {        "asnum": 174,        "org_name": "COGENT-174"    },    "geo": {        "city": "",        "region": "",        "region_name": "",        "postal_code": "",        "latitude": 37.751,        "longitude": -97.822,        "tz": "America/Chicago"    }}

Web Unlocker Pricing

Web Unlocker gives us several different options when it comes to pricing plans. These plans range from $3 per thousand requests at the lowest tier down to $2.10 per thousand at the highest tier. The monthly costs vary quite broadly between tiers. You can view a full breakdown in the table below.
| Plan | Cost Per Thousand Requests | Monthly Cost |
| --- | --- | --- |
| Pay As You Go | $3 | Varies based on usage |
| Growth | $2.55 | $499 + Tax |
| Business | $2.25 | $999 + Tax |
| Premium | $2.10 | $1999 + Tax |
With Web Unlocker (like ScrapeOps Proxy Aggregator), we're only charged per successful request. If it's unable to access a site for you, you pay nothing. This is a pretty good model from a user's standpoint. There are many proxy services that will actually charge you even if you don't gain access to the site.

Response Status Codes

Status codes are essential in all of web development. While most of us know that 200 means everything worked, there are numerous other codes we need to be able to troubleshoot. The table below holds a breakdown of these codes.
| Status Code | Description |
| --- | --- |
| 200 | Success! |
| 401 | Bad request, usually a problem with headers or cookies. |
| 403 | You are forbidden from accessing this URL. |
| 404 | Site not found. |
| 407 | Incorrect credentials (username, password, or zone). |
| 411 | Bad request, usually a problem with headers or cookies. |
| 429 | You're being rate limited; slow down your requests. |
| 444 | Bad request, usually a problem with headers or cookies. |
| 502 | Check the header x-luminati-error-code. |
| 503 | Service unavailable, browser check failed. |
You can view their full section on status codes in their docs here.
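To illustrate how you might act on these codes in practice, here is a minimal sketch that retries transient failures and surfaces the 502 diagnostic header. The retry count and backoff times are arbitrary choices, not Bright Data recommendations.
import time
import requests

def fetch_with_status_handling(url, proxies, ca_cert_path, retries=3):
    for attempt in range(retries):
        response = requests.get(url, proxies=proxies, verify=ca_cert_path)
        if response.status_code == 200:
            return response
        if response.status_code == 502:
            # Bright Data puts extra diagnostic info in this header
            print("x-luminati-error-code:", response.headers.get("x-luminati-error-code"))
        if response.status_code in (429, 502, 503):
            # Transient problems: back off and try again
            time.sleep(2 ** attempt)
            continue
        # Anything else (401, 403, 407, etc.) is worth stopping and inspecting
        response.raise_for_status()
    raise Exception(f"Still failing after {retries} attempts, last status: {response.status_code}")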

Setting Up Bright Data Web Unlocker

Now, we'll walk through the process of getting set up with Web Unlocker.
To get started, go to their homepage and choose Start Free Trial or Start Free with Google.
Start Free Trial
Next, you'll be taken to the signup sheet. You can choose to continue with Google, GitHub, or your email address.
Signup
Under Proxies and Scraping, look through the available product options and find Web Unlocker. Click the button that reads Get Started.
BrightData Unlocker Dashboard
You should notice that you received some free credits for signing up. However, you can get even more free credits by adding a payment method.
Free Credits
Before we can use Web Unlocker, we need to create a zone, which is a specific instance of Web Unlocker. Once you've got all of your configurations set, click Add.
Create unlocker
You'll then get a popup with some shell code to test out your new zone. Copy and paste the code to check your proxy.
Unlocker Test
If your connection is working correctly, you should receive output similar to the image below.
Unlocker Test Results
You will also receive a prompt telling you to set up SSL. If you click the prompt, you'll get a popup giving you the option to download their SSL certificate. At the time of this writing, the link points to an expired certificate, but it may well be fixed by the time you read this.
Unlocker SSL Certificate
You can then follow their instructions to set up SSL, as shown in the image below.
Unlocker SSL
If you're still getting SSL errors, you can view their full instructions for setting up SSL here. That page also provides their updated SSL certificate.
While Web Unlocker is a rapidly growing product, the recommended way of connecting is through proxy port integration. Bright Data is working on a REST API, but it's not finished; according to the documentation it is still in beta, and there isn't much mention of it beyond that.

Proxy Port Integration

We've already done proxy port integration in the previous examples of this article. When we use proxy ports, we set up our initial proxy configuration once, and then we can pretty much just forget about it. This lets us focus on the rest of our code, such as writing our parser.
import requests
import json

# Bright Data Access
brd_config = {}

with open("config.json") as file:
    json_config = json.load(file)
    brd_config = json_config["brightdata"]

brd_superproxy = 'brd.superproxy.io:22225'
brd_connectStr = f"http://brd-customer-{brd_config['username']}-zone-{brd_config['zone']}:{brd_config['password']}@{brd_superproxy}"

# Test URL
brd_test_url = 'https://geo.brdtest.com/welcome.txt'

# Path to CA certificate
ca_cert_path = 'ca.crt'  # Provide the correct certificate file path here

# Proxies dictionary
proxies = {
    'http': brd_connectStr,
    'https': brd_connectStr
}

# Make the request with proxy and custom CA cert
response = requests.get(brd_test_url, proxies=proxies, verify=ca_cert_path)

# Print the response (welcome.txt returns plain text, not JSON)
print(response.text)

Managing Concurrency

Concurrency can be managed through ThreadPoolExecutor. ThreadPoolExecutor opens a new pool of threads, limited by our max_workers argument. Then, we use executor.map() to call a specific function on each of the available threads. This gives us the power to scrape multiple pages concurrently.
import requests
from bs4 import BeautifulSoup
import json
import concurrent.futures
from urllib.parse import urlencode

NUM_THREADS = 5

# Bright Data Access
brd_config = {}

with open("config.json") as file:
    json_config = json.load(file)
    brd_config = json_config["brightdata"]

brd_superproxy = 'brd.superproxy.io:22225'
brd_connectStr = f"http://brd-customer-{brd_config['username']}-zone-{brd_config['zone']}:{brd_config['password']}@{brd_superproxy}"

proxies = {
    'http': brd_connectStr,
    'https': brd_connectStr
}

ca_cert_path = 'ca.crt'

## Example list of urls to scrape
list_of_urls = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
    'http://quotes.toscrape.com/page/3/',
]

output_data_list = []

def scrape_page(url):
    try:
        response = requests.get(url, proxies=proxies, verify=ca_cert_path)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, "html.parser")
            title = soup.find('h1').text

            ## add scraped data to "output_data_list" list
            output_data_list.append({
                'title': title,
            })

    except Exception as e:
        print('Error', e)

with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    executor.map(scrape_page, list_of_urls)

print(output_data_list)
Pay attention here to executor.map():
  • scrape_page is the function we want to call on each available thread.
  • list_of_urls is the list of arguments to be passed into each instance of scrape_page.

Advanced Functionality

Web Unlocker comes prepackaged with a ton of advanced functionality. Most of it is automated, but we do get the power to manually control a decent portion of it, such as geolocation, JavaScript rendering, and disabling CAPTCHA solving. Below is a list of the features we can use to customize our requests. Bright Data does not charge anything extra to use these features. Instead of charging based on the features we use, Bright Data charges based on the difficulty of our target domain.
| Feature | Description | Additional Cost |
| --- | --- | --- |
| Geolocation | Use a specific location (country, state or city). | None |
| User-Agent | Set a mobile User-Agent for your request. | None |
| Disable CAPTCHA | Turn off the automatic CAPTCHA solver. | None |
| Render a Browser | Use a browser to render the page dynamically. | None |
You can view a screenshot of a zone that allows for premium domains below. Regular domains cost $3 per 1,000 requests. Premium domains cost $6 per 1,000. Depending on the tier of your plan, these costs do come down. At the top tier, you would pay $2.10 per 1,000 for default domains and $4.20 per 1,000 for premium domains.

JavaScript Rendering

To render JavaScript, we can add the -render flag to our proxy connection string. This tells Web Unlocker to open a browser and render the page no matter what. WhatIsMyIP.com uses JavaScript to check the IP address of your machine, so we're going to use the -render flag to check our IP address. In the code snippet below, we pass the render flag to render the content on the page. Rendering does take extra time, but if you run the code without -render, you'll receive an error.
import requests
import json
from bs4 import BeautifulSoup

# Bright Data Access
brd_config = {}

with open("config.json") as file:
    json_config = json.load(file)
    brd_config = json_config["brightdata"]

brd_superproxy = 'brd.superproxy.io:22225'
brd_connectStr = f"http://brd-customer-{brd_config['username']}-zone-{brd_config['zone']}-render:{brd_config['password']}@{brd_superproxy}"

url = "https://www.whatismyip.com"

# Path to CA certificate
ca_cert_path = 'ca.crt'  # Provide the correct certificate file path here

# Proxies dictionary
proxies = {
    'http': brd_connectStr,
    'https': brd_connectStr
}

# Make the request with proxy and custom CA cert
response = requests.get(url, proxies=proxies, verify=ca_cert_path)

# Find the IPv4 address on the rendered page
soup = BeautifulSoup(response.text, "html.parser")
ip = soup.select_one("a[id='ipv4']").get("title")

# Print the response
print(ip)
Here is the output when running without -render. As you can see, our IPv4 address has not yet loaded on the page.
Traceback (most recent call last):
  File "/home/nultinator/clients/ahmet/brightdata-unlocker/render.py", line 30, in <module>
    ip = soup.select_one("a[id='ipv4']").get("title")
AttributeError: 'NoneType' object has no attribute 'get'
Here is the output when we run using the -render flag.
Detailed Information about IP address 161.123.31.150
You can view the full documentation for this feature here.

Controlling The Browser

Web Unlocker does not allow us to control the browser directly. If you need to perform actions in the browser, you need to use a headless browser such as Puppeteer or Playwright. Selenium does not directly support authenticated proxy integration; with Selenium, you can use SeleniumWire, but SeleniumWire has since been deprecated, so using it is not recommended. You can view the articles below for proxy port integration with these browsers.
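To give a rough idea of what that looks like, here is a minimal Playwright sketch that routes a headless browser through the Web Unlocker proxy port. It reuses the config.json from the other examples; ignore_https_errors is used here as a shortcut instead of installing Bright Data's CA certificate in the browser, which you may not want in production.
import json
from playwright.sync_api import sync_playwright

# Load the same Bright Data credentials used in the other examples
with open("config.json") as file:
    brd_config = json.load(file)["brightdata"]

with sync_playwright() as p:
    # Authenticate the proxy at the browser level
    browser = p.chromium.launch(proxy={
        "server": "http://brd.superproxy.io:22225",
        "username": f"brd-customer-{brd_config['username']}-zone-{brd_config['zone']}",
        "password": brd_config["password"],
    })
    # Skipping cert validation instead of installing Bright Data's CA cert
    context = browser.new_context(ignore_https_errors=True)
    page = context.new_page()
    page.goto("https://quotes.toscrape.com")
    print(page.title())
    browser.close()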

Country Geotargeting

Much like rendering JavaScript, to use a specific geolocation we pass a different flag depending on the geotarget we want. We can choose a location with any of the following flags, each of which takes a location code.
  • country
  • state
  • city
Country codes are available here. Some city and state codes are available in their geotargeting docs. Here is our previous code example, but we also use the -country flag.
import requests
import json
from bs4 import BeautifulSoup

# Bright Data Access
brd_config = {}

with open("config.json") as file:
    json_config = json.load(file)
    brd_config = json_config["brightdata"]

brd_superproxy = 'brd.superproxy.io:22225'
brd_connectStr = f"http://brd-customer-{brd_config['username']}-zone-{brd_config['zone']}-render-country-us:{brd_config['password']}@{brd_superproxy}"

url = "https://www.whatismyip.com"

# Path to CA certificate
ca_cert_path = 'ca.crt'  # Provide the correct certificate file path here

# Proxies dictionary
proxies = {
    'http': brd_connectStr,
    'https': brd_connectStr
}

# Make the request with proxy and custom CA cert
response = requests.get(url, proxies=proxies, verify=ca_cert_path)

# Find the IPv4 address on the rendered page
soup = BeautifulSoup(response.text, "html.parser")
ip = soup.select_one("a[id='ipv4']").get("title")

# Print the response
print(ip)
Here is our output.
Detailed Information about IP address 45.149.149.254
We can manually check our geolocation data using Iplookup. As you can see, our location shows up inside the state of Virginia, US. Geotargeting with Web Unlocker is a breeze. Once again, you can view their full geotargeting documentation here.
IP Lookup
Here is a list of country codes. The list is non-exhaustive, but it should cover many of the locations you might choose to use with Web Unlocker.
| Country | Country Code |
| --- | --- |
| United Arab Emirates | AE |
| Australia | AU |
| Brazil | BR |
| Canada | CA |
| China | CN |
| Germany | DE |
| Estonia | EE |
| Spain | ES |
| France | FR |
| United Kingdom | GB |
| Hong Kong | HK |
| India | IN |
| Italy | IT |
| Russia | RU |
| United States | US |

Residential Proxies

We can't directly invoke residential proxies. However, we can use Web Unlocker's functionality to automatically set a mobile User-Agent for our request. This doesn't guarantee us a mobile or residential IP address, but it does make our traffic look more normal. Even without the mobile flag, if our request fails, Bright Data's Web Unlocker will automatically switch to a better IP address (likely mobile or residential) and retry the request. Here is an example using the -ua-mobile flag.
import requests
import json
from bs4 import BeautifulSoup

# Bright Data Access
brd_config = {}

with open("config.json") as file:
    json_config = json.load(file)
    brd_config = json_config["brightdata"]

brd_superproxy = 'brd.superproxy.io:22225'
brd_connectStr = f"http://brd-customer-{brd_config['username']}-zone-{brd_config['zone']}-render-ua-mobile:{brd_config['password']}@{brd_superproxy}"

url = "https://www.whatismybrowser.com"

# Path to CA certificate
ca_cert_path = 'ca.crt'  # Provide the correct certificate file path here

# Proxies dictionary
proxies = {
    'http': brd_connectStr,
    'https': brd_connectStr
}

# Make the request with proxy and custom CA cert
response = requests.get(url, proxies=proxies, verify=ca_cert_path)

# Find the detected browser string on the page
soup = BeautifulSoup(response.text, "html.parser")
browser = soup.select_one("div[aria-label='We detect that your web browser is']")

# Print the response
print(browser.text)
Here is the output from the scrape.
Safari 16.6 on iOS 16.6
You can view the full mobile documentation here. This feature, in combination with Web Unlocker's automatic proxy management will get you virtually the same access and appearance you might want from a residential or mobile proxy. Bright Data offers purely residential proxies as a separate product. If you're interested in using their strictly residential service, we've got an article on that here.

Custom Headers

Web Unlocker typically does not allow custom headers because they can interfere with how the product works. If you choose to send custom headers when using Web Unlocker, they will be ignored. If you do need to use custom headers with Web Unlocker, you can contact Bright Data by creating a ticket to set up special accommodations for your scraper. As per their website, they do not allow custom headers or cookies for login/authentication purposes. Even if your need for custom headers is approved, you will experience the following:
  • A drop in performance.
  • A decrease in success rate.
Their full section on custom headers and cookies is available here.

Static Proxies

Web Unlocker does not support using static proxies to maintain an authenticated session. Bright Data has a separate product for that called Scraping Browser, which is built specifically for configuring headless browsers with proxy ports and is designed for sticky sessions. If you need a static proxy, you can use the ScrapeOps Proxy Aggregator (a quick sketch follows below) or Bright Data's Scraping Browser.
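For reference, here is a rough sketch of what a sticky session might look like with the ScrapeOps Proxy Aggregator's session_number parameter (covered in the feature table later in this article). The session number itself is an arbitrary value that you choose and reuse.
import requests
from urllib.parse import urlencode

API_KEY = "YOUR_SCRAPEOPS_API_KEY"  # placeholder

# Requests that share a session_number are routed through the same underlying proxy
payload = {
    "api_key": API_KEY,
    "url": "https://quotes.toscrape.com",
    "session_number": 1001,  # arbitrary id you choose and reuse
}

response = requests.get("https://proxy.scrapeops.io/v1/?" + urlencode(payload))
print(response.status_code)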

Screenshot Functionality

Web Unlocker does not support screenshots. There are some other providers that do support screenshots such as ZenRows, Scrape.do, and ScrapingBee. Especially when debugging, screenshots are an incredibly useful tool. When you take a screenshot, you can visually review the page. Screenshots give us the power to:
  • Debug our errors in the event of a crash.
  • Analyze any site visually.
  • Verify the content we've scraped from any target site.
  • View the site through the user's eyes.
  • Visually monitor changes in the site and its layout.
You can view our screenshot documentation for these other services in the links below.

Auto Parsing

Web Unlocker does not have any auto parsing features. Web Unlocker is specifically targeted at proxy management so that you can perform the data extraction yourself. If you are interested in auto parsing, please consider any of the following services instead.

Case Study: Using Web Unlocker on IMDb Top 250 Movies

Now, let's perform a little experiment. We're going to scrape the top 250 movies from IMDb, once with Bright Data's Web Unlocker and once with the ScrapeOps Proxy Aggregator, to show how the two products stack up on a real world scraping job. Our code for both scrapers will be largely the same; the major difference is how we access the site. With Web Unlocker, we're going to use proxy port integration. To access the site with ScrapeOps, we're going to write a function, get_scrapeops_url(), which takes our API parameters and returns a ScrapeOps proxied URL. Here is our proxy port access with Bright Data's Web Unlocker.
config = {}
with open("config.json", "r") as config_file:
    config = json.load(config_file)["brightdata"]

ca_cert_path = 'ca.crt'

brd_superproxy = 'brd.superproxy.io:22225'
brd_connectStr = f"http://brd-customer-{config['username']}-zone-{config['zone']}:{config['password']}@{brd_superproxy}"

proxies = {
    'http': brd_connectStr,
    'https': brd_connectStr
}
During our scraping function, we then use these settings when calling the API.
response = requests.get(url, proxies=proxies, verify=ca_cert_path)
Our full code using Bright Data's Web Unlocker is available below.
import os
import requests
from bs4 import BeautifulSoup
import json
from base64 import b64decode
import logging
from urllib.parse import urlencode

## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

config = {}
with open("config.json", "r") as config_file:
    config = json.load(config_file)["brightdata"]

ca_cert_path = 'ca.crt'

brd_superproxy = 'brd.superproxy.io:22225'
brd_connectStr = f"http://brd-customer-{config['username']}-zone-{config['zone']}:{config['password']}@{brd_superproxy}"

proxies = {
    'http': brd_connectStr,
    'https': brd_connectStr
}


def scrape_movies(url, location="us", retries=3):
    success = False
    tries = 0

    while not success and tries <= retries:
        response = requests.get(url, proxies=proxies, verify=ca_cert_path)

        try:
            if response.status_code != 200:
                raise Exception(f"Failed response from server, status code: {response.status_code}")

            soup = BeautifulSoup(response.text, "html.parser")
            json_tag = soup.select_one("script[type='application/ld+json']")
            json_data = json.loads(json_tag.text)["itemListElement"]

            movie_list_length = 0

            movie_list = []

            for item in json_data:
                movie_list.append(item["item"])
            movie_list_length += len(json_data)

            print(f"Movie list length: {len(json_data)}")
            with open("unlocker-top-250.json", "w") as file:
                json.dump(movie_list, file, indent=4)
                success = True
        except Exception as e:
            logger.error(f"Failed to process page: {e}, retries left: {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Failed to scrape page, MAX RETRIES {retries} EXCEEDED!!!")


if __name__ == "__main__":
    MAX_RETRIES = 3
    logger.info("Starting IMDB scrape")
    url = "https://www.imdb.com/chart/top/"
    scrape_movies(url, retries=MAX_RETRIES)
    logger.info("Scrape complete")
Here is the output from the scrape using Web Unlocker. As you can see, the scrape took 9.427 seconds.
Brightdata Unlocker Test Results
When we use ScrapeOps, instead of using proxy port integration, we're going to write a function that creates a ScrapeOps proxied URL. Proxy port integration is technically possible, but a function like this makes our proxy code much easier to read and customize. It also eliminates the need for a custom SSL certificate. The snippet below holds our proxy function. It takes our API key and target URL, wraps them up with URL encoding, and gives us a custom proxied URL. We can pass this URL into requests.get() and continue to write our code like normal.
def get_scrapeops_url(url):
    payload = {
        "api_key": API_KEY,
        "url": url,
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url
  • "api_key": holds your ScrapeOps API key.
  • "url": holds the target url that we'd like to scrape.
  • This function takes the above information and creates a custom url that we can use to access the site.
Here is our full ScrapeOps code below.
import os
import requests
from bs4 import BeautifulSoup
import json
import csv
import logging
from urllib.parse import urlencode

## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["scrapeops_api_key"]

def get_scrapeops_url(url):
    payload = {
        "api_key": API_KEY,
        "url": url,
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url


def scrape_movies(url, location="us", retries=3):
    success = False
    tries = 0

    while not success and tries <= retries:
        response = requests.get(get_scrapeops_url(url))

        try:
            if response.status_code != 200:
                raise Exception(f"Failed response from server, status code: {response.status_code}")

            soup = BeautifulSoup(response.text, "html.parser")
            json_tag = soup.select_one("script[type='application/ld+json']")
            json_data = json.loads(json_tag.text)["itemListElement"]

            movie_list_length = 0

            movie_list = []

            for item in json_data:
                movie_list.append(item["item"])
            movie_list_length += len(json_data)

            print(f"Movie list length: {len(json_data)}")
            with open("scrapeops-top-250.json", "w") as file:
                json.dump(movie_list, file, indent=4)
                success = True
        except Exception as e:
            logger.error(f"Failed to process page: {e}, retries left: {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Failed to scrape page, MAX RETRIES {retries} EXCEEDED!!!")


if __name__ == "__main__":
    MAX_RETRIES = 3
    logger.info("Starting IMDB scrape")
    url = "https://www.imdb.com/chart/top/"
    scrape_movies(url, retries=MAX_RETRIES)
    logger.info("Scrape complete")
Here is the output from the run using ScrapeOps. The run using the ScrapeOps Proxy Aggregator took only 5.583 seconds, significantly faster than Bright Data.
Scrapeops test results
All in all, Bright Data's Web Unlocker took 9.427 seconds while the ScrapeOps Proxy Aggregator took 5.583 seconds: a difference of 9.427 - 5.583 = 3.844 seconds. The ScrapeOps Proxy Aggregator saved us almost 4 seconds. Depending on network conditions, that's enough time to fit in a second request!

Alternative: ScrapeOps Proxy API Aggregator

As you saw in the section above, even though ScrapeOps sometimes uses Bright Data as a provider, we were able to access and scrape the page significantly faster. Alongside that, the Proxy Aggregator comes with all sorts of custom features. Web Unlocker has a few of these features, but nowhere near all of them. The table below covers almost everything you might use in a scraping API, though it is still non-exhaustive, and a short usage sketch follows it. Of the 17 features available with ScrapeOps below, Bright Data's Web Unlocker supports 4 of them.
| Feature | Description | Web Unlocker Equivalent |
| --- | --- | --- |
| json_response | Return the response as a JSON object. | Not Available |
| bypass | Setting to bypass even the toughest of anti-bots. | Automatic |
| auto_extract | Automatically parse pages from Amazon and Google. | Not Available |
| render_js | Open a real browser and render dynamic content. | render |
| wait | Wait an arbitrary amount of time to render content. | Not Available |
| wait_for | Wait for a specific CSS selector to appear. | Not Available |
| scroll | Scroll the page by any number of pixels. | Not Available |
| screenshot | Screenshot with the ScrapeOps Headless Browser. | Not Available |
| js_scenario | Execute a list of JavaScript instructions on the page. | Not Available |
| premium | Use only premium (mobile and residential) proxies. | Not Available |
| residential | Use only residential IP addresses. | Not Available |
| mobile | Use only mobile IP addresses. | Not Available |
| country | Use a specific geolocation. | country |
| keep_headers | Keep any custom headers that we send to the API. | Not Available |
| device_type | Specify a specific user agent for our device type. | ua |
| session_number | Reuse a specific proxy with a specific session id. | Not Available |
| follow_redirects | Tell the API whether or not to follow redirects. | Automatic |
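Here is a rough sketch of how a couple of these parameters might be added to the get_scrapeops_url() function from the case study. The parameter names come from the table above; the exact values are illustrative, so check the ScrapeOps docs for the options your plan supports.
from urllib.parse import urlencode

API_KEY = "YOUR_SCRAPEOPS_API_KEY"  # placeholder

def get_scrapeops_url(url, render_js=False, country=None):
    # Start with the required parameters
    payload = {
        "api_key": API_KEY,
        "url": url,
    }
    # Optional features from the table above
    if render_js:
        payload["render_js"] = True
    if country:
        payload["country"] = country

    return "https://proxy.scrapeops.io/v1/?" + urlencode(payload)

# Example: render the page with a browser from a US-based IP
proxied_url = get_scrapeops_url("https://quotes.toscrape.com", render_js=True, country="us")
print(proxied_url)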
Another reason to use ScrapeOps would be our large selection of pricing plans. Our plans are far more affordable and actually give you access to a whole lot more.
  • With the Pay As You Go plan for Web Unlocker, you're paying $3 per thousand requests ($0.003 per request).
  • With our $9 plan, you gain access to 25,000 API credits (normal requests to the API). This comes out to $0.00036 per request.
The highest tier Web Unlocker plan comes in at $2.10 per thousand ($0.0021 per request). Even when you're buying in bulk and getting the biggest bang for your buck, a single request using Web Unlocker costs over 5 times what it would from ScrapeOps ($0.0021 / $0.00036 ≈ 5.8).
ScrapeOps Pricing
If you're not ready to commit, sign up for our free trial and get 1,000 free API credits for your next scraping job!

Troubleshooting

Issue #1: Request Timeouts

With Python Requests, every once in a while we run into timeout errors. The simplest way to troubleshoot a timeout is to retry your request. If that doesn't work, add a timeout setting to your request. This tells Requests how long to wait for a response before throwing a timeout error.
import requests

# 5 second timeout
response = requests.get("https://httpbin.org/get", timeout=5)
If you are still receiving timeouts, double check your target url to make sure that their server is running normally.
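If you want to combine the two suggestions above, here is a minimal retry-with-timeout sketch. The retry count and timeout values are arbitrary examples, not recommendations from Bright Data or ScrapeOps.
import requests

def get_with_retries(url, retries=3, timeout=5, **kwargs):
    last_error = None
    for attempt in range(retries):
        try:
            # Pass proxies/verify through **kwargs if you're using Web Unlocker
            return requests.get(url, timeout=timeout, **kwargs)
        except requests.exceptions.Timeout as e:
            last_error = e
            print(f"Timeout on attempt {attempt + 1}, retrying...")
    raise last_error

response = get_with_retries("https://httpbin.org/get")
print(response.status_code)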

Issue #2: Handling CAPTCHAs

CAPTCHAs can be an unending source of headaches when scraping the web. Bright Data's Web Unlocker solves these automatically for you. ScrapeOps does not use automated CAPTCHA solving. If you do receive a CAPTCHA, retry your request using the bypass parameter. If you try all levels of bypass and still receive CAPTCHAs, try an external service like 2captcha. If you'd like to know more about solving CAPTCHAs in depth, take a look at our article on that here.
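As a rough sketch, here is how the bypass parameter from the feature table might be added to a ScrapeOps request. The bypass value shown is only illustrative; check the ScrapeOps docs for the levels available on your plan.
import requests
from urllib.parse import urlencode

API_KEY = "YOUR_SCRAPEOPS_API_KEY"  # placeholder

payload = {
    "api_key": API_KEY,
    "url": "https://quotes.toscrape.com",
    # Illustrative anti-bot bypass level; see the ScrapeOps docs for valid values
    "bypass": "generic_level_1",
}

response = requests.get("https://proxy.scrapeops.io/v1/?" + urlencode(payload))
print(response.status_code)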

Issue #3: Invalid Response Data

Invalid response data is a very real issue in web scraping, and in all facets of web development for that matter. When you get invalid response data, check the status code of the response and look at the full response body for any error messages. Once you know the status code, it's as simple as looking it up. We have a table of Web Unlocker status codes here, and the ScrapeOps status codes are available here.

Legal and Ethical Considerations

When you're scraping a website, you always need to be mindful of what you're doing. Scraping public data (like we did in this article) is generally considered legal. Scraping private data (data behind a login or some other form of authentication) is a far more nuanced matter, and it can expose you to all sorts of legal consequences. Here are just some of the consequences that can result from scraping private data:
  • Terms of Service Violations: You could violate a site's terms and be subject to lawsuits and hefty fines.
  • Data Privacy Laws: Different states and countries around the world have different privacy laws. When you violate somebody else's privacy, that can be a serious legal offense. This can come with fines and even prison time.
  • Copyright Infringement: If you scrape and repurpose data without the proper licensing and permissions, you could definitely be violating a copyright. To avoid getting sued or receiving a cease and desist, don't do this.
  • Computer Fraud and Abuse: Many countries have laws against hacking (unauthorized access to a computer system) and these laws are generally treated pretty seriously. Violating these laws can also result in hefty fines and prison time depending on your locality.

Ethical Considerations When Violating a Site's Terms

When you create an account to access a site, you agree to their Terms and Conditions. While there are some outliers, these agreements are usually legally binding. If you choose to violate site policies that you've explicitly agreed to, you can be subject to account actions (suspension, banning, etc.) and even legal action!

Terms and Conditions Violations

  • Civil Liability: The site you violated might very likely want to sue you to make an example out of people who violate their terms.
  • Privacy Concerns: Depending on the nature of the violation, you might be disseminating private data. This can come with very stiff penalties (see above).
  • Account Suspension/Banning: The site owner or administrator might very well decide to suspend or even permanently ban you from their site. Could you imagine being permanently banned from Amazon or Google?

robots.txt Violations

  • Reputational Damage: Site owners might be far less likely to trust your business. This makes future business dealings difficult.
  • Public Perception: We see headlines each and every day about how some company was unethically but still legally collecting some kind of data. Some might think of it as free advertisement, while others could see this kind of story as permanently damaging to their public business perception.

Conclusion

You now know how to use both Bright Data's Web Unlocker and the ScrapeOps Proxy Aggregator. You've been well informed on both products and you're more than capable of making the choice for yourself. When you need to do your next scrape, you'll have all the tools you need: Python Requests, BeautifulSoup, JSON, and proxy integration. Go build something and continue learning!

More Web Scraping Guides

Here at ScrapeOps, we love web scraping so much, we wrote the playbook on it. If you ever need to learn something new about scraping, we're your one stop shop! If you'd like to learn about integrating other proxy services, check out the guides below.