Bright Data Unlocker: Web Scraping Integration Guide
Bright Data's Web Unlocker is a popular way to scrape the web. It manages a pool of proxies so you don't have to, and it uses a variety of features to get you access to some of the most difficult sites around.
In the rest of this article, we'll go through the process of signing up for Web Unlocker from start to finish, explore its features in depth, and even test out some of its more advanced functionality.
TL;DR: Web Scraping With Web Unlocker
Web Unlocker uses proxy port integration. The quickest way to get started is to create a new zone and then configure your proxy port to work with it.
- You'll need your `username`, `password`, `zone`, and the url of your proxy port.
- Once you have those, you can save them to a `config.json` file (a sample is shown below) and get started.
- Make sure you've set up Bright Data's CA Certificate inside your project folder so you don't experience any SSL errors.
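For reference, here's what a minimal `config.json` might look like. The keys match what the code below reads; the values are placeholders for your own credentials.

```json
{
    "brightdata": {
        "username": "YOUR_CUSTOMER_ID",
        "zone": "YOUR_ZONE_NAME",
        "password": "YOUR_ZONE_PASSWORD"
    }
}
```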
You can use the code below to test your proxy connection.
```python
import requests
import json

brd_config = {}
with open("config.json") as file:
    json_config = json.load(file)
    brd_config = json_config["brightdata"]

brd_superproxy = 'brd.superproxy.io:22225'
brd_connectStr = f"http://brd-customer-{brd_config['username']}-zone-{brd_config['zone']}:{brd_config['password']}@{brd_superproxy}"

brd_test_url = 'https://geo.brdtest.com/welcome.txt'
ca_cert_path = 'ca.crt'

proxies = {
    'http': brd_connectStr,
    'https': brd_connectStr
}

response = requests.get(brd_test_url, proxies=proxies, verify=ca_cert_path)
print(response.text)
```
You don't need to do much of anything to optimize the connection. Web Unlocker manages all of those things for you so you can focus on your code instead of maintaining a proxy pool.
Make sure that you follow ethical scraping practices:
- Always ensure that the data you're collecting is publicly available and not behind paywalls or restricted access areas.
- Check the website’s robots.txt file, which tells web crawlers which pages or sections of the site they are allowed or disallowed from accessing.
- Review the website's Terms of Service (ToS) or Terms of Use to see if it explicitly forbids web scraping or imposes limitations. If scraping is disallowed, you should avoid doing it.
- Scrape at a reasonable pace to prevent putting excessive load on the website’s servers.
- If possible, configure your scraper to identify itself (e.g., set the User-Agent string) so the website knows your bot is crawling it (see the sketch below).
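The sketch below covers the last two points: checking `robots.txt` with Python's standard library and sending an identifying User-Agent. The bot name and contact address are hypothetical placeholders.

```python
import requests
from urllib.robotparser import RobotFileParser

# Check the site's robots.txt before scraping.
robots = RobotFileParser("https://quotes.toscrape.com/robots.txt")
robots.read()

url = "https://quotes.toscrape.com/page/1/"
user_agent = "MyScraperBot/1.0 (contact@example.com)"  # hypothetical identifier

if robots.can_fetch(user_agent, url):
    # Identify ourselves with a User-Agent header.
    response = requests.get(url, headers={"User-Agent": user_agent})
    print(response.status_code)
else:
    print("robots.txt disallows this path; skipping.")
```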
What Is Bright Data's Web Unlocker?
Web Unlocker is an automated proxy manager. It maintains a pool of different proxies and always connects you to the best one.
Web Unlocker uses a variety of features to get you access to some of the most difficult sites around.
If a site requires JavaScript execution, Web Unlocker is designed to recognize this and automatically render the page within a browser to solve CAPTCHAs and complete any JavaScript challenges that are sent to it.
You can view some of their selling points below.
- CAPTCHA Solving
- IP Rotation
- Request Retries
- Automated Proxy Management
- Automatic JavaScript Rendering
How Does Web Unlocker Work?
Web Unlocker uses proxy port integration to act as a middleman between your scraper and the sites you want to scrape. When you configure your scraper to use Web Unlocker, you tell it which site you'd like to access, and Web Unlocker handles gaining access to that site.
After rendering the page, Web Unlocker sends the rendered page back to you.
Here's how the overall process works:
- Your scraper tells Web Unlocker which site you want to access.
- Web Unlocker gains access to the site and renders the page.
- Web Unlocker sends the rendered HTML page back to your scraper.
As mentioned previously, Web Unlocker is built to work specifically with proxy ports. Here, we'll tweak our test connection from the TLDR to extract data from Quotes To Scrape.
In the example below, we're going to find the `h1` element and print its text to the terminal.
```python
import requests
import json
from bs4 import BeautifulSoup

brd_config = {}
with open("config.json") as file:
    json_config = json.load(file)
    brd_config = json_config["brightdata"]

brd_superproxy = 'brd.superproxy.io:22225'
brd_connectStr = f"http://brd-customer-{brd_config['username']}-zone-{brd_config['zone']}:{brd_config['password']}@{brd_superproxy}"

brd_test_url = 'https://quotes.toscrape.com'
ca_cert_path = 'ca.crt'

proxies = {
    'http': brd_connectStr,
    'https': brd_connectStr
}

response = requests.get(brd_test_url, proxies=proxies, verify=ca_cert_path)

soup = BeautifulSoup(response.text, "html.parser")
h1 = soup.find("h1")
print(h1.text)
```
By default, Web Unlocker returns whatever your target site returns. There is no option to explicitly request JSON for each request (although that would surely be a nice feature), but sites that respond with JSON will still come back as JSON.
The example below makes a call to an API that returns JSON.
```python
import requests
import json

brd_config = {}
with open("config.json") as file:
    json_config = json.load(file)
    brd_config = json_config["brightdata"]

brd_superproxy = 'brd.superproxy.io:22225'
brd_connectStr = f"http://brd-customer-{brd_config['username']}-zone-{brd_config['zone']}:{brd_config['password']}@{brd_superproxy}"

ca_cert_path = 'ca.crt'

proxies = {
    'http': brd_connectStr,
    'https': brd_connectStr
}

response = requests.get("https://lumtest.com/myip.json", proxies=proxies, verify=ca_cert_path)
print(json.dumps(response.json(), indent=4))
```
Our output looks like this:
{ "ip": "108.165.142.98", "country": "US", "asn": { "asnum": 174, "org_name": "COGENT-174" }, "geo": { "city": "", "region": "", "region_name": "", "postal_code": "", "latitude": 37.751, "longitude": -97.822, "tz": "America/Chicago" }}
Web Unlocker Pricing
Web Unlocker gives us several different options when it comes to pricing plans. These plans range from $3 per thousand requests at the lowest tier down to $2.10 per thousand at the highest tier, and they vary widely in monthly cost.
You can view a full breakdown of it in the table below.
| Plan | Price Per 1,000 Requests | Monthly Cost |
|------|--------------------------|--------------|
| Pay As You Go | $3 | Varies based on usage |
| Growth | $2.55 | $499 + Tax |
| Business | $2.25 | $999 + Tax |
| Premium | $2.10 | $1999 + Tax |
With Web Unlocker (like ScrapeOps Proxy Aggregator), we're only charged per successful request. If it's unable to access a site for you, you pay nothing. This is a pretty good model from a user's standpoint.
There are many proxy services that will actually charge you even if you don't gain access to the site.
Response Status Codes
Status codes are essential in all of web development. While most of us know that 200 means everything worked, there are numerous other codes we need to be able to troubleshoot.
The table below holds a breakdown of these codes.
| Status Code | Description |
|-------------|-------------|
| 200 | Success! |
| 401 | Bad request, usually a problem with headers or cookies. |
| 403 | You are forbidden from accessing this url. |
| 404 | Site not found. |
| 407 | Incorrect credentials (username, password, or zone). |
| 411 | Bad request, usually a problem with headers or cookies. |
| 429 | You're being rate limited, slow down your requests. |
| 444 | Bad request, usually a problem with headers or cookies. |
| 502 | Check the `x-luminati-error-code` header. |
| 503 | Service unavailable, browser check failed. |
You can view their full section on status codes in their docs here.
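As a rough illustration, you could map these codes to troubleshooting hints in your own error handling. The helper below is our own sketch, not part of Bright Data's API; it only uses the codes and the `x-luminati-error-code` header from the table above.

```python
import requests

def check_status(response: requests.Response) -> None:
    # Map Web Unlocker status codes (see table above) to troubleshooting hints.
    if response.status_code == 200:
        print("Success!")
    elif response.status_code == 407:
        print("Check your username, password, and zone.")
    elif response.status_code == 429:
        print("Rate limited: slow down your requests.")
    elif response.status_code == 502:
        # Bright Data puts extra detail in this response header.
        print("Proxy error:", response.headers.get("x-luminati-error-code"))
    else:
        print(f"Got {response.status_code}: see the table above.")
```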
Setting Up Bright Data Web Unlocker
Now, we'll walk through the process of getting setup with Web Unlocker. To get started, go to their homepage and choose Start Free Trial or Start Free with Google.
Next, you'll be taken to the signup sheet. You can choose to continue with Google, Github, or your email address.
Under Proxies and Scraping, we can look through our available product options and find Web Unlocker. Click on the button that reads Get Started.
You should notice that you received some free credits for signing up. However, you can get even more free credits when adding a payment method.
Before we can use Web Unlocker we need to create a zone, which is a specific instance of Web Unlocker. Once you've got all of your configurations set, click Add.
You'll then get a popup with some shell code to test out your new zone. Copy and paste the code to check your proxy.
If your connection is working correctly, you should receive output similar to the image below.
You will also receive a prompt telling you to set up SSL. If you click the prompt, you'll get a popup giving you the option to download their SSL certificate. At the time of this writing, this link points to an expired certificate, though it may well be fixed by the time you read this.
You can then follow their instructions to setup SSL as you can see in the image below.
If you're still getting SSL errors you can view their full instructions for setting up SSL here. This link also holds access to their updated SSL certificate.
While Web Unlocker is a rapidly growing product, the recommended way to connect is through proxy port integration. Bright Data is also working on a REST API, but it's still in beta, and the documentation says little about it beyond that.
Proxy Port Integration
We've already used proxy port integration in the earlier examples of this article. With proxy ports, we set up our initial proxy configuration once, and then we can pretty much forget about it.

This lets us focus on the rest of our code, such as writing our parser.
```python
import requests
import json

brd_config = {}
with open("config.json") as file:
    json_config = json.load(file)
    brd_config = json_config["brightdata"]

brd_superproxy = 'brd.superproxy.io:22225'
brd_connectStr = f"http://brd-customer-{brd_config['username']}-zone-{brd_config['zone']}:{brd_config['password']}@{brd_superproxy}"

brd_test_url = 'https://geo.brdtest.com/welcome.txt'
ca_cert_path = 'ca.crt'

proxies = {
    'http': brd_connectStr,
    'https': brd_connectStr
}

response = requests.get(brd_test_url, proxies=proxies, verify=ca_cert_path)
# welcome.txt returns plain text, so print the text body
print(response.text)
```
Managing Concurrency
Concurrency can be managed through `ThreadPoolExecutor`. `ThreadPoolExecutor` opens a new pool of threads, capped by the `max_workers` argument.

Then, we use `executor.map()` to call a specific function on each of these available threads. This gives us the power to scrape multiple pages concurrently.
```python
import requests
from bs4 import BeautifulSoup
import json
import concurrent.futures

NUM_THREADS = 5

brd_config = {}
with open("config.json") as file:
    json_config = json.load(file)
    brd_config = json_config["brightdata"]

brd_superproxy = 'brd.superproxy.io:22225'
brd_connectStr = f"http://brd-customer-{brd_config['username']}-zone-{brd_config['zone']}:{brd_config['password']}@{brd_superproxy}"

proxies = {
    'http': brd_connectStr,
    'https': brd_connectStr
}
ca_cert_path = 'ca.crt'

list_of_urls = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
    'http://quotes.toscrape.com/page/3/',
]

output_data_list = []

def scrape_page(url):
    try:
        response = requests.get(url, proxies=proxies, verify=ca_cert_path)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, "html.parser")
            title = soup.find('h1').text
            output_data_list.append({
                'title': title,
            })
    except Exception as e:
        print('Error', e)

with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    executor.map(scrape_page, list_of_urls)

print(output_data_list)
```
Pay attention here to `executor.map()`:

- `scrape_page` is the function we want to call on each available thread.
- `list_of_urls` is the list of arguments passed, one at a time, into each instance of `scrape_page`.
Advanced Functionality
Web Unlocker comes prepacked with a ton of advanced functionality. Most of it is automated, but we do get manual control over a decent portion of it, such as geolocation, JavaScript rendering, and disabling CAPTCHA solving.
Below is a list of the features we can use to customize our requests. Bright Data does not charge us anything extra to use these features.
Instead of charging based on the features we use, Bright Data charges us based on the difficulty of our target domain.
| Feature | Description | Additional Cost |
|---------|-------------|-----------------|
| Geolocation | Use a specific location (`country`, `state`, or `city`). | None |
| User-Agent | Set a mobile User-Agent for your request. | None |
| Disable CAPTCHA | Turn off the automatic CAPTCHA solver. | None |
| Render a Browser | Use a browser to render the page dynamically. | None |
You can view a screenshot of a zone that allows for premium domains. Regular domains cost $3 per 1,000 requests. Premium domains cost $6 per 1,000.
Depending on the tier of your plan, these costs do come down. At the top tier, you would pay $2.10 per 1,000 for default domains and $4.20 per 1,000 for premium domains.
Javascript Rendering
To render JavaScript, we can pass the `-render` flag in with our url. This tells Web Unlocker to open a browser and render the page no matter what. WhatIsMyIP.com uses JavaScript to check the IP address of your machine, so we're going to use the `-render` flag to check our IP address.

In the code snippet below, we pass the render flag to render the content on the page. Rendering does take extra time, but if you run the code without `-render`, you'll receive an error.
```python
import requests
import json
from bs4 import BeautifulSoup

brd_config = {}
with open("config.json") as file:
    json_config = json.load(file)
    brd_config = json_config["brightdata"]

brd_superproxy = 'brd.superproxy.io:22225'
brd_connectStr = f"http://brd-customer-{brd_config['username']}-zone-{brd_config['zone']}-render:{brd_config['password']}@{brd_superproxy}"

url = "https://www.whatismyip.com"
ca_cert_path = 'ca.crt'

proxies = {
    'http': brd_connectStr,
    'https': brd_connectStr
}

response = requests.get(url, proxies=proxies, verify=ca_cert_path)

soup = BeautifulSoup(response.text, "html.parser")
ip = soup.select_one("a[id='ipv4']").get("title")
print(ip)
```
Here is the output when running without `-render`. As you can see, our IPv4 address has not yet loaded on the page.
```
Traceback (most recent call last):
  File "/home/nultinator/clients/ahmet/brightdata-unlocker/render.py", line 30, in <module>
    ip = soup.select_one("a[id='ipv4']").get("title")
AttributeError: 'NoneType' object has no attribute 'get'
```
Here is the output when we run with the `-render` flag.

```
Detailed Information about IP address 161.123.31.150
```
You can view the full documentation for this feature here.
Controlling The Browser
Web Unlocker does not allow us to control the browser directly. If you need to perform actions in the browser, you need to use a Headless Browser such as Puppeteer or Playwright.
Selenium does not directly support authenticated proxy integration. You can work around this with SeleniumWire, but SeleniumWire has since been deprecated, so using it is not recommended.
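For illustration, here's a minimal sketch of driving a real browser through the same superproxy using Playwright for Python. This is our own sketch, assuming the `config.json` layout used throughout this article; check Bright Data's docs for their officially supported browser integrations.

```python
import json
from playwright.sync_api import sync_playwright

with open("config.json") as file:
    brd_config = json.load(file)["brightdata"]

with sync_playwright() as p:
    # Pass the superproxy credentials straight to the browser.
    browser = p.chromium.launch(proxy={
        "server": "http://brd.superproxy.io:22225",
        "username": f"brd-customer-{brd_config['username']}-zone-{brd_config['zone']}",
        "password": brd_config["password"],
    })
    # ignore_https_errors sidesteps certificate warnings from the proxy.
    page = browser.new_page(ignore_https_errors=True)
    page.goto("https://geo.brdtest.com/welcome.txt")
    print(page.content())
    browser.close()
```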
You can view the articles below for proxy port integration with these browsers.
Country Geotargeting
Much like rendering JavaScript, using a specific geolocation is a matter of passing another flag with our connection string. Depending on the geotarget we want, we can pass a `-country`, `-state`, or `-city` flag, followed by a location code.
Country codes are available here. Some city and state codes are available in their geotargeting docs.
Here is our previous code example, but this time using the `-country` flag.
```python
import requests
import json
from bs4 import BeautifulSoup

brd_config = {}
with open("config.json") as file:
    json_config = json.load(file)
    brd_config = json_config["brightdata"]

brd_superproxy = 'brd.superproxy.io:22225'
brd_connectStr = f"http://brd-customer-{brd_config['username']}-zone-{brd_config['zone']}-render-country-us:{brd_config['password']}@{brd_superproxy}"

url = "https://www.whatismyip.com"
ca_cert_path = 'ca.crt'

proxies = {
    'http': brd_connectStr,
    'https': brd_connectStr
}

response = requests.get(url, proxies=proxies, verify=ca_cert_path)

soup = BeautifulSoup(response.text, "html.parser")
ip = soup.select_one("a[id='ipv4']").get("title")
print(ip)
```
Here is our output.

```
Detailed Information about IP address 45.149.149.254
```
We can manually check our geolocation data using Iplookup. As you can see, our location shows up inside the state of Virginia, US.
Geotargeting with Web Unlocker is a breeze. Once again, you can view their full geotargeting documentation here.
Here is a list of country codes. This list is non-exhaustive, but should cover many of the locations you might choose to use with Web Unlocker.
| Country | Code |
|---------|------|
| United Arab Emirates | AE |
| Australia | AU |
| Brazil | BR |
| Canada | CA |
| China | CN |
| Germany | DE |
| Estonia | EE |
| Spain | ES |
| France | FR |
| United Kingdom | GB |
| Hong Kong | HK |
| India | IN |
| Italy | IT |
| Russia | RU |
| United States | US |
Residential Proxies
We can't directly invoke residential proxies. However, we can have Web Unlocker automatically set a mobile user agent for our request. This doesn't guarantee a mobile or residential IP address, but it does make our traffic look more normal.
Even without the `-ua-mobile` flag, if our request fails, Bright Data's Web Unlocker will automatically switch to a better IP address (likely mobile or residential) and retry the request.

Here is an example using the `-ua-mobile` flag.
```python
import requests
import json
from bs4 import BeautifulSoup

brd_config = {}
with open("config.json") as file:
    json_config = json.load(file)
    brd_config = json_config["brightdata"]

brd_superproxy = 'brd.superproxy.io:22225'
brd_connectStr = f"http://brd-customer-{brd_config['username']}-zone-{brd_config['zone']}-render-ua-mobile:{brd_config['password']}@{brd_superproxy}"

url = "https://www.whatismybrowser.com"
ca_cert_path = 'ca.crt'

proxies = {
    'http': brd_connectStr,
    'https': brd_connectStr
}

response = requests.get(url, proxies=proxies, verify=ca_cert_path)

soup = BeautifulSoup(response.text, "html.parser")
browser = soup.select_one("div[aria-label='We detect that your web browser is']")
print(browser.text)
```
Here is the output from the scrape.
You can view the full mobile documentation here. This feature, in combination with Web Unlocker's automatic proxy management, will get you virtually the same access and appearance you might want from a residential or mobile proxy.
Bright Data offers purely residential proxies as a separate product. If you're interested in using their strictly residential service, we've got an article on that here.
Web Unlocker typically does not allow custom headers because they can interfere with how the product works. If you choose to send custom headers when using Web Unlocker, they will be ignored.

If you do need custom headers with Web Unlocker, you can contact Bright Data by creating a ticket to set up special accommodations for your scraper.
As per their website, they do not allow custom headers or cookies for login/authentication purposes. Even if your need for custom headers is approved, you will experience the following:
- A drop in performance.
- A decrease in success rate.
Their full section on custom headers and cookies is available here.
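If you need custom headers without going through that approval process, the ScrapeOps Proxy Aggregator's `keep_headers` parameter (covered in the feature table later in this article) forwards whatever headers you send. A minimal sketch, with a placeholder API key and a hypothetical header:

```python
import requests
from urllib.parse import urlencode

API_KEY = "YOUR_SCRAPEOPS_API_KEY"  # placeholder

payload = {
    "api_key": API_KEY,
    "url": "https://httpbin.org/headers",
    "keep_headers": "true",  # forward the custom headers we send below
}
headers = {"X-Example-Header": "my-custom-value"}  # hypothetical header

response = requests.get(
    "https://proxy.scrapeops.io/v1/?" + urlencode(payload),
    headers=headers,
)
print(response.text)
```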
Static Proxies
Web Unlocker does not support using static proxies for maintaining an authenticated session. They have a separate product for that called Scraping Browser.
Scraping Browser is built specifically for configuring headless browsers with proxy ports. This product is specifically designed for Sticky Sessions.
If you need a static proxy, you can use the ScrapeOps Proxy Aggregator or you can use Bright Data's Scraping Browser.
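With the ScrapeOps Proxy Aggregator, a sticky session is just a matter of adding the `session_number` parameter (listed in the feature table below) to your proxied url. A minimal sketch, with a placeholder API key:

```python
import requests
from urllib.parse import urlencode

API_KEY = "YOUR_SCRAPEOPS_API_KEY"  # placeholder

def get_scrapeops_url(url, session_number):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "session_number": session_number,  # reuse the same underlying proxy
    }
    return "https://proxy.scrapeops.io/v1/?" + urlencode(payload)

# Both requests below should exit through the same IP address.
for _ in range(2):
    response = requests.get(get_scrapeops_url("https://lumtest.com/myip.json", 1234))
    print(response.text)
```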
Screenshot Functionality
Web Unlocker does not support screenshots. There are some other providers that do support screenshots such as ZenRows, Scrape.do, and ScrapingBee.
Especially when debugging, screenshots are an incredibly useful tool. When you take a screenshot, you can visually review the page.
Screenshots give us the power to:
- Debug our errors in the event of a crash.
- Analyze any site visually.
- Verify the content we've scraped from any target site.
- View the site through the user's eyes.
- Visually monitor changes in the site and its layout.
You can view our screenshot documentation for these other services in the links below.
Auto Parsing
Web Unlocker does not have any auto parsing features. Web Unlocker is specifically targeted at proxy management so that you can perform the data extraction yourself. If you are interested in auto parsing, please consider any of the following services instead.
Case Study: Using Web Unlocker on IMDb Top 250 Movies
Now, let's perform a little experiment. We're going to scrape the top 250 movies from IMDB, once with Bright Data's Web Unlocker and once with the ScrapeOps Proxy Aggregator, to see how the two products stack up on a real-world scraping job.
Our code for both of these scrapers will be largely the same. The major difference will be how we access the site. With Web Unlocker, we're going to use proxy port integration.
To access the site with ScrapeOps, we're going to write a function, get_scrapeops_url()
. This function will take our API parameters and return a ScrapeOps Proxied url.
Here is our proxy port access with Bright Data's Web Unlocker.
```python
config = {}
with open("config.json", "r") as config_file:
    config = json.load(config_file)["brightdata"]

ca_cert_path = 'ca.crt'

brd_superproxy = 'brd.superproxy.io:22225'
brd_connectStr = f"http://brd-customer-{config['username']}-zone-{config['zone']}:{config['password']}@{brd_superproxy}"

proxies = {
    'http': brd_connectStr,
    'https': brd_connectStr
}
```
Inside our scraping function, we then use these settings when making the request.
```python
response = requests.get(url, proxies=proxies, verify=ca_cert_path)
```
Our full code using Bright Data's Web Unlocker is available below.
```python
import requests
from bs4 import BeautifulSoup
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

config = {}
with open("config.json", "r") as config_file:
    config = json.load(config_file)["brightdata"]

ca_cert_path = 'ca.crt'

brd_superproxy = 'brd.superproxy.io:22225'
brd_connectStr = f"http://brd-customer-{config['username']}-zone-{config['zone']}:{config['password']}@{brd_superproxy}"

proxies = {
    'http': brd_connectStr,
    'https': brd_connectStr
}

def scrape_movies(url, location="us", retries=3):
    success = False
    tries = 0

    while not success and tries <= retries:
        response = requests.get(url, proxies=proxies, verify=ca_cert_path)

        try:
            if response.status_code != 200:
                raise Exception(f"Failed response from server, status code: {response.status_code}")

            soup = BeautifulSoup(response.text, "html.parser")
            json_tag = soup.select_one("script[type='application/ld+json']")
            json_data = json.loads(json_tag.text)["itemListElement"]

            movie_list = []
            for item in json_data:
                movie_list.append(item["item"])

            print(f"Movie list length: {len(json_data)}")
            with open("unlocker-top-250.json", "w") as file:
                json.dump(movie_list, file, indent=4)
            success = True
        except Exception as e:
            logger.error(f"Failed to process page: {e}, retries left: {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Failed to scrape page, MAX RETRIES {retries} EXCEEDED!!!")

if __name__ == "__main__":
    MAX_RETRIES = 3

    logger.info("Starting IMDB scrape")
    url = "https://www.imdb.com/chart/top/"
    scrape_movies(url, retries=MAX_RETRIES)
    logger.info("Scrape complete")
```
Here is the output from the scrape using Web Unlocker. As you can see, the scrape took 9.427 seconds.
When we use ScrapeOps, instead of proxy port integration, we're going to write a function that creates a ScrapeOps proxied url. Proxy port integration is technically possible, but a function like this makes our proxy code much easier to read and customize.
This also eliminates the need for a custom SSL certificate. The snippet below holds our proxy function. It takes our API key and target url. Then, it wraps it all up with url encoding and gives us a custom proxied url.
We can pass this url into requests.get()
and continue to write our code like normal.
```python
def get_scrapeops_url(url):
    payload = {
        "api_key": API_KEY,
        "url": url,
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url
```
"api_key"
: holds your ScrapeOps API key.
"url"
: holds the target url that we'd like to scrape.
- This function takes the above information and creates a custom url that we can use to access the site.
Here is our full ScrapeOps code below.
```python
import requests
from bs4 import BeautifulSoup
import json
import logging
from urllib.parse import urlencode

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["scrapeops_api_key"]

def get_scrapeops_url(url):
    payload = {
        "api_key": API_KEY,
        "url": url,
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url

def scrape_movies(url, location="us", retries=3):
    success = False
    tries = 0

    while not success and tries <= retries:
        response = requests.get(get_scrapeops_url(url))

        try:
            if response.status_code != 200:
                raise Exception(f"Failed response from server, status code: {response.status_code}")

            soup = BeautifulSoup(response.text, "html.parser")
            json_tag = soup.select_one("script[type='application/ld+json']")
            json_data = json.loads(json_tag.text)["itemListElement"]

            movie_list = []
            for item in json_data:
                movie_list.append(item["item"])

            print(f"Movie list length: {len(json_data)}")
            with open("scrapeops-top-250.json", "w") as file:
                json.dump(movie_list, file, indent=4)
            success = True
        except Exception as e:
            logger.error(f"Failed to process page: {e}, retries left: {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Failed to scrape page, MAX RETRIES {retries} EXCEEDED!!!")

if __name__ == "__main__":
    MAX_RETRIES = 3

    logger.info("Starting IMDB scrape")
    url = "https://www.imdb.com/chart/top/"
    scrape_movies(url, retries=MAX_RETRIES)
    logger.info("Scrape complete")
```
Here is the output from the run using ScrapeOps. The run using the ScrapeOps Proxy Aggregator took only 5.583 seconds. This is significantly faster than Bright Data.
All in all, Bright Data's Web Unlocker took 9.427 seconds while the ScrapeOps Proxy Aggregator took 5.583 seconds. 9.427 - 5.583 = 3.844 seconds difference. The ScrapeOps Proxy Aggregator saved us almost 4 seconds. Depending on network conditions, that's enough time to even get a second request in!
Alternative: ScrapeOps Proxy API Aggregator
As you saw in the section above, even though ScrapeOps sometimes uses Bright Data as a provider, we were able to access and scrape the page significantly faster.
Alongside that, the Proxy Aggregator comes with all sorts of custom features! Web Unlocker has a few of these features, but nowhere near all of them.
With the ScrapeOps Proxy Aggregator you gain access to a bunch of cool stuff. This table covers almost everything you might use in a scraping API but this is still non-exhaustive. Of the 17 features available with ScrapeOps below, Bright Data's Web Unlocker supports 4 of them.
| ScrapeOps Feature | Description | Web Unlocker Equivalent |
|-------------------|-------------|-------------------------|
| `json_response` | Return the response as a JSON object. | Not Available |
| `bypass` | Setting to bypass even the toughest of anti-bots. | Automatic |
| `auto_extract` | Automatically parse pages from Amazon and Google. | Not Available |
| `render_js` | Open a real browser and render dynamic content. | `render` |
| `wait` | Wait an arbitrary amount of time to render content. | Not Available |
| `wait_for` | Wait for a specific CSS selector to appear. | Not Available |
| `scroll` | Scroll the page by any number of pixels. | Not Available |
| `screenshot` | Screenshot with the ScrapeOps Headless Browser. | Not Available |
| `js_scenario` | Execute a list of JavaScript instructions on page. | Not Available |
| `premium` | Use only premium (mobile and residential) proxies. | Not Available |
| `residential` | Use only residential IP addresses. | Not Available |
| `mobile` | Use only mobile IP addresses. | Not Available |
| `country` | Use a specific geolocation. | `country` |
| `keep_headers` | Keep any custom headers that we send to the API. | Not Available |
| `device_type` | Specify a specific user agent for our device type. | `ua` |
| `session_number` | Reuse a specific proxy with a specific session id. | Not Available |
| `follow_redirects` | Tell the API whether or not to follow redirects. | Automatic |
Another reason to use ScrapeOps would be our large selection of pricing plans. Our plans are far more affordable and actually give you access to a whole lot more.
- With the Pay As You Go plan for Web Unlocker, you're paying $3 per thousand requests ($0.003 per request).
- With our $9 plan, you gain access to 25,000 API credits (normal requests to the API). This comes out to $0.00036 per request.
The highest tier web unlocker plan comes at $2.10 per thousand ($0.0021 per request).
Even when you're buying in bulk and receiving the biggest bang for your buck, a single request using Web Unlocker costs over 5 times what it would from ScrapeOps!
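To make that comparison concrete, here's the arithmetic for 25,000 requests at each of the rates above:

```python
# Worked arithmetic for the figures above (USD, 25,000 requests).
requests_count = 25_000

unlocker_paygo = requests_count * 0.003      # $3.00 per 1,000 requests
unlocker_premium = requests_count * 0.0021   # $2.10 per 1,000 requests
scrapeops_plan = 9.00                        # $9 plan: 25,000 API credits

print(f"Web Unlocker (Pay As You Go): ${unlocker_paygo:.2f}")    # $75.00
print(f"Web Unlocker (Premium):       ${unlocker_premium:.2f}")  # $52.50
print(f"ScrapeOps ($9 plan):          ${scrapeops_plan:.2f}")    # $9.00
print(f"Premium vs ScrapeOps: {unlocker_premium / scrapeops_plan:.1f}x")  # ~5.8x
```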
If you're not ready to commit, sign up for our free trial and 1,000 free API credits for your next scraping job!
Troubleshooting
Issue #1: Request Timeouts
With Python Requests, every once in a while we run into `timeout` errors. The simplest way to troubleshoot a timeout is to retry your request.

If that doesn't work, add a `timeout` setting to your request. This tells Requests how long to wait for a response before throwing a timeout error.
```python
import requests

response = requests.get("https://httpbin.org/get", timeout=5)
```
If you are still receiving timeouts, double check your target url to make sure that their server is running normally.
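Putting both ideas together, a small retry wrapper might look like this (our own sketch, not a library function):

```python
import requests

def get_with_retries(url, retries=3, timeout=5):
    # Retry the request whenever it times out.
    for attempt in range(1, retries + 1):
        try:
            return requests.get(url, timeout=timeout)
        except requests.exceptions.Timeout:
            print(f"Timeout on attempt {attempt}, retrying...")
    raise Exception(f"Still timing out after {retries} attempts")

response = get_with_retries("https://httpbin.org/get")
print(response.status_code)
```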
Issue #2: Handling CAPTCHAs
CAPTCHAs can be an unending source of headaches when scraping the web. Bright Data's Web Unlocker solves these automatically for you. ScrapeOps does not use automated CAPTCHA solving.
If you do receive a CAPTCHA, retry your request using the `bypass` parameter. If you try all levels of `bypass` and still receive CAPTCHAs, try an external service like 2Captcha.
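Here's a minimal sketch of a ScrapeOps retry with `bypass` set. The level value is illustrative; check the ScrapeOps docs for the bypass levels available on your plan.

```python
import requests
from urllib.parse import urlencode

API_KEY = "YOUR_SCRAPEOPS_API_KEY"  # placeholder

payload = {
    "api_key": API_KEY,
    "url": "https://quotes.toscrape.com",
    "bypass": "cloudflare_level_1",  # illustrative value
}
response = requests.get("https://proxy.scrapeops.io/v1/?" + urlencode(payload))
print(response.status_code)
```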
If you'd like to know more about solving CAPTCHAs in depth, take a look at our article for that here.
Issue #3: Invalid Response Data
Invalid response data is a very real issue in web scraping, and in all facets of web development for that matter. When you get invalid response data, check the status code of the response and look through the full response body for any accompanying error messages.
Once you know the status code, it's as simple as looking it up. We have a table of Web Unlocker status codes here. The ScrapeOps status codes are available here.
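In practice, that check can be as simple as printing the status code and the start of the body whenever a request doesn't come back with a 200:

```python
import requests

response = requests.get("https://quotes.toscrape.com")
if response.status_code != 200:
    print("Status:", response.status_code)
    print("Body:", response.text[:500])  # error messages usually show up early in the body
```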
The Legal & Ethical Implications of Web Scraping
Legal Considerations
When you're scraping a website, you always need to be mindful of what you're doing. Scraping public data (like we did in this article) is generally considered legal everywhere.
Scraping private data (data behind a login or some other form of authentication) is a very nuanced area, and doing so can expose you to all sorts of legal consequences.
Here are just some of the consequences that can result from scraping private data:
- Terms of Service Violations: You could violate a site's terms and be subject to lawsuits and hefty fines.
- Data Privacy Laws: Different states and countries around the world have different privacy laws. When you violate somebody else's privacy, that can be a serious legal offense. This can come with fines and even prison time.
- Copyright Infringement: If you scrape and repurpose data without the proper licensing and permissions, you could definitely be violating a copyright. To avoid getting sued or receiving a cease and desist, don't do this.
- Computer Fraud and Abuse: Many countries have laws against hacking (unauthorized access to a computer system) and these laws are generally treated pretty seriously. Violating these laws can also result in hefty fines and prison time depending on your locality.
Ethical Considerations When Violating a Site's Terms
When you create an account to access a site, you agree to their Terms and Conditions. While there are some outliers, these agreements are usually legally binding. If you choose to violate site policies that you've explicitly agreed to, you can be subject to account actions (suspension, banning, etc.) and even legal action!
Terms and Conditions Violations
- Civil Liability: The site you violated might very well sue you to make an example of people who violate their terms.
- Privacy Concerns: Depending on the nature of the violation, you might be disseminating private data. This can come with very stiff penalties (see above).
- Account Suspension/Banning: The site owner or administrator might very well decide to suspend or even permanently ban you from their site. Could you imagine being permanently banned from Amazon or Google?
robots.txt Violations
- Reputational Damage: Site owners might be far less likely to trust your business. This makes future business dealings difficult.
- Public Perception: We see headlines each and every day about how some company was unethically but still legally collecting some kind of data. Some might think of it as free advertisement, while others could see this kind of story as permanently damaging to their public business perception.
Conclusion
You now know how to use both Bright Data's Web Unlocker and the ScrapeOps Proxy Aggregator. You've been well informed on both products and you're more than capable of making the choice for yourself.
When you need to run your next scrape, you'll have all the tools you need: Python Requests, BeautifulSoup, JSON, and proxy integration.
Go build something and continue learning!
More Web Scraping Guides
Here at ScrapeOps, we love web scraping so much, we wrote the playbook on it. If you ever need to learn something new about scraping, we're your one stop shop!
If you'd like to learn about integrating other proxy services, check out the guides below.