IPRoyal Datacenter Proxies: Web Scraping Guide
With datacenter proxies, we sometimes get blocked by more difficult sites, but we also get some great upsides that you simply don't get with residential and mobile proxies. Datacenter proxies are often more affordable and offer far better performance than residential proxies.
IPRoyal offers a variety of different proxy products. These offerings include Residential Proxies, Datacenter Proxies, Mobile Proxies, Enterprise Proxies, and ISP Proxies.
Today, we'll show you how to purchase these datacenter proxies from IPRoyal and how to use them.
TLDR: How to Integrate IPRoyal Datacenter Proxy?
It's pretty simple to get started with datacenter proxies from IPRoyal. Once you've got an account, simply download your proxy list and try out the code below. There is no need to deal with your `username` and `password` because they're already baked into the `proxy_list`. Make sure to place your proxy list in the same folder as your code.
In this code:

- We import `random` along with our typical proxy setup.
- We read `iproyal-proxies.txt` into an array.
- Next, we use `random.choice()` to select a random proxy from the list as our `proxy_url`.
- Then, like other proxy integrations, we assign our `proxy_url` to both the `http` and `https` protocols of our `proxies` dict.
```python
import requests
import random

proxy_list = []

with open("iproyal-proxies.txt") as file:
    proxy_list = file.read().splitlines()

proxy_url = f"http://{random.choice(proxy_list)}"

proxies = {
    "http": proxy_url,
    "https": proxy_url
}

result = requests.get("http://lumtest.com/myip.json", proxies=proxies)
print(result.text)
```
- First, we create an empty array.
- Next, we use Python's file functionality to read `iproyal-proxies.txt` and load each line of the file into our array.
- `proxy_url = f"http://{random.choice(proxy_list)}"` chooses a random proxy from the list to use as our `proxy_url`.
- We create a `dict` object to hold our `http` and `https` proxies. We then assign both of them to our `proxy_url`.
- We then print `result.text` to our terminal, revealing the location information of our proxy.
Understanding Datacenter Proxies
Datacenter proxies will sometimes get blocked by more stringent websites. Requests coming in from a datacenter IP tend to stick out. With the vast majority of sites on the internet, however, datacenter proxies will work just fine. Other than datacenter proxies, premium proxies are by far the next most popular choice.
Premium proxies are made up of actual mobile and residential IP addresses. They tend to cost quite a bit more, and they are often significantly slower than datacenter proxies, but they blend in better because they're using an actual device on an actual residential connection.
Datacenter vs Residential
Over the next few sections, we'll go over the key differences between residential/premium and datacenter proxies. Once you're done reading, you'll be able to choose the best tool for your scraping needs.
Datacenter
Pros
- Price: No matter which provider you choose, datacenter proxies tend to be very affordable. With residential proxies, it's not uncommon to pay up to $8/GB.
- Speed: This type of proxy is hosted inside an actual datacenter. Datacenters are often equipped with some of the best internet connections and hardware available, so datacenter proxies offer unparalleled performance.
- Availability: Datacenters are huge, and each machine in a datacenter normally gets its own IP address. This can give us a seemingly endless proxy pool to work with.
Cons
- Blocking: As touched on briefly, some websites block all datacenter IPs by default. This can make some websites far more difficult to scrape.
- Less Geotargeting Support: We usually do get geotargeting support with datacenter proxies. However, you still show up with a datacenter IP no matter which country you choose, and our location options are limited.
- Less Anonymity: With datacenter proxies, our location always shows up inside of a datacenter. In comparison to a residential proxy, this can really stick out.
Residential
Pros
- Anonymity: We get a real IP address assigned to a real residential device. When dealing with more difficult sites, this allows us to blend in more easily.
- Better Access: When using a residential IP address, it doesn't matter if your target site blocks datacenter IP addresses. Your traffic is coming from somebody's real device in a real home.
Cons
- Price: Residential proxies are expensive. It's not uncommon to pay between $5 and $8 per GB! Datacenter proxies, on the other hand, can be 10x cheaper.
- Speed: Datacenter proxies use top-of-the-line hardware and internet connections. With a residential proxy, you're far more likely to get a low-end device on a subpar internet connection. This can drastically reduce the performance of our scraper.
Residential proxies are ideal for SERP results, ad verification, social media monitoring/scraping and much more.
Why Use IPRoyal Datacenter Proxies?
When we use datacenter proxies from IPRoyal, we get a decent product at an affordable price. At their lowest tier, we can get 5 IP addresses for $9. After everything is said and done, we're paying $1.80 per IP. This might not sound like the greatest deal, but it allows us to test out the product at a very low price. For comparison, some companies charge between $30 and $50 for their lowest tier package!
Aside from cost benefits, IPRoyal gives us access to some basic geotargeting and our proxies are static by default. When we purchase our package, we get to choose our proxy location. However, this isn't quite as good as it seems. Each time we purchase a package, we can only select one country. If you want to use proxies in both the US and the UK, you'll need to purchase two separate plans.
- IPRoyal's Datacenter Proxies are quite affordable.
- When setting up our proxies, we get to choose their location.
IPRoyal Datacenter Proxy Pricing
With IPRoyal, we don't pay for bandwidth, we pay per IP address. As your package gets bigger, your price per IP address gets smaller. On the 30 day plan, they advertise $1.57 per IP. However, this isn't entirely true.
- On their highest price plan, you do actually buy proxies at $1.57 per IP.
- On their lowest tier, you're actually paying $1.80 per IP address.
Their pricing model isn't out of this world, but it sure isn't bad either. The table below outlines their basic pricing structure for a 30 day plan.
If you are interested in their longer-term plans, you can explore them here.
| Number of Proxies | Price | Discount | Price Per Proxy |
| --- | --- | --- | --- |
| 5 | $9 | 0% | $1.80 |
| 10 | $18 | 2% | $1.76 |
| 50 | $90 | 7% | $1.67 |
| 100 | $180 | 13% | $1.57 |
To compare IPRoyal to other providers, we built a tool that allows you to compare virtually every provider on the market. You can use it here.
This tool is built for anybody looking to shop for a proxy provider. We're not always the best price, and when we're not, we tell you.
Setting Up IPRoyal Datacenter Proxies
Time to go through the IPRoyal signup process from start to finish. By the end of this section, you'll have all the tools you need to create an account, purchase datacenter proxies and get started. Feel free to follow along as we go.
For starters, you need to create an account. IPRoyal gives us the ability to sign up using Google, LinkedIn, or to create an account manually.
Once you've created your account, you need to purchase a plan. If you click the tab titled Datacenter, you'll be brought to a page where you can set up your plan. There is one big thing you should notice in the screenshot below: no matter how many proxies we purchase, we are limited to choosing only one country.
Once you've got your plan set up, you need to check out. You'll be given the option to pay using cryptocurrency or to pay with a card using Stripe.
It takes a second to confirm your payment and create your plan. Once everything's complete, you can select your plan from the dashboard. Go to My orders, and click on your datacenter plan.
This will bring you to your actual plan. Near the top of the page, you'll see a list of your proxies.
Below this list, you'll see the option to download it. You can download it as either a CSV or a TXT file. For the rest of this tutorial, we'll be using the TXT file.
After you've downloaded your list, move it into the folder where you want to keep your scraper.
Authentication
IPRoyal does not offer whitelisting with their Datacenter Proxy products. You can verify this here.
All authentication with IPRoyal is done using our `username` and `password`. However, if you downloaded a proxy file earlier, you'll never actually need to code these into your scraper. With their downloadable proxy list, our URLs come preformatted and ready to go.
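For reference, each line of the downloaded TXT file is a complete proxy string with the credentials already in place. The values below are made-up placeholders (not real IPRoyal hosts or accounts), but the shape is `username:password@host:port`:

```
exampleuser:examplepass@203.0.113.10:12323
exampleuser:examplepass@203.0.113.11:12323
```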
Take a look at the code below; we don't need to worry about authentication at all!
```python
import requests
import random

proxy_list = []

with open("iproyal-proxies.txt") as file:
    proxy_list = file.read().splitlines()

proxy_url = f"http://{random.choice(proxy_list)}"

proxies = {
    "http": proxy_url,
    "https": proxy_url
}

result = requests.get("http://lumtest.com/myip.json", proxies=proxies)
print(result.text)
```
Key takeaways from the authentication process here:
- All of our variables (`username`, `password`, `hostname`, `port`) are already baked into the URL inside of our text file.
- Because of this, there is no need to write any code for authentication. IPRoyal already took care of that for us.
- We use a `dict` to hold both our `http` and `https` proxies. We set each one to our `proxy_url`.
- Finally, we make a request to the Lumtest API for our location information and print it to the terminal.
Basic Request Using IPRoyal Datacenter Proxies
For those of you who've been following along, you've already seen a basic request with IPRoyal's datacenter proxies. However, if you just skipped to this section because you need to know it quickly, here you go!
In the code below, we read our proxies from a file and choose a random one. Once we've made our random choice, we set it as our `proxy_url` variable. We then create a `dict` to represent our `http` and `https` proxies and set them both to our `proxy_url`. Next, we make a GET request to the Lumtest API so we can check our location information. Then, we print that information to the terminal.
```python
import requests
import random

proxy_list = []

with open("iproyal-proxies.txt") as file:
    proxy_list = file.read().splitlines()

proxy_url = f"http://{random.choice(proxy_list)}"

proxies = {
    "http": proxy_url,
    "https": proxy_url
}

result = requests.get("http://lumtest.com/myip.json", proxies=proxies)
print(result.text)
```
When we connect to our datacenter proxy, we're authenticated using our `username` and `password`. However, we don't need to save any of our config variables. IPRoyal already did that for us in our `iproyal-proxies.txt` file. We just need to choose a random proxy from our list and connect to it.
Country Geotargeting
With IPRoyal's Datacenter Proxies, we get limited support for geotargeting. There is actually no way to geotarget through the service itself. You need to select a country when choosing a plan, and you only get to select one country. If you remember from earlier, we selected the UK.
With ScrapeOps however, we get some pretty good geotargeting support already baked into the Proxy Aggregator.
In the code below, we'll perform a GET using our British proxies from IPRoyal, and we'll also perform a GET using the ScrapeOps API. Using these two products together, we'll be able to appear in two different countries.
```python
import requests
import random
from urllib.parse import urlencode

API_KEY = "your-scrapeops-api-key"
SCRAPEOPS_LOCATION = "us"

def get_scrapeops_url(url, location=SCRAPEOPS_LOCATION):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url

proxy_list = []

with open("iproyal-proxies.txt") as file:
    proxy_list = file.read().splitlines()

proxy_url = f"http://{random.choice(proxy_list)}"

proxies = {
    "http": proxy_url,
    "https": proxy_url
}

print("-------------------------UK-----------------------")
result = requests.get("http://lumtest.com/myip.json", proxies=proxies)
print(result.text)

print("-------------------------US------------------------")
result = requests.get(get_scrapeops_url("http://lumtest.com/myip.json"))
print(result.text)
```
The code above is built on top of our basic request. Since we only have support for one country with our IPRoyal proxies, we also use ScrapeOps for some additional geotargeting support. In this example, we add:
- Our ScrapeOps `API_KEY` and `SCRAPEOPS_LOCATION`. The `API_KEY` gets used for authentication with ScrapeOps. `SCRAPEOPS_LOCATION` is used to tell ScrapeOps where we'd like to appear.
- We write a function called `get_scrapeops_url()`. This function takes in our `API_KEY`, `location`, and target `url`. It then uses `urlencode` to wrap all of this information into a proxied URL that we can ping using Requests (or anything else, for that matter).
- We check our location information with IPRoyal just like we did before.
- After our IPRoyal check, we connect to the same API endpoint using our ScrapeOps proxy.
- We print the location information from each request to the console.
You can view the output of this code below.
```
-------------------------UK-----------------------
{"country":"GB","asn":{"asnum":211415,"org_name":"Karolio IT paslaugos, UAB"},"geo":{"city":"London","region":"ENG","region_name":"England","postal_code":"EC4R","latitude":51.5088,"longitude":-0.093,"tz":"Europe/London","lum_city":"london","lum_region":"eng"}}
-------------------------US------------------------
{"country":"US","asn":{"asnum":401104,"org_name":"CYBERPLANET"},"geo":{"city":"","region":"","region_name":"","postal_code":"","latitude":37.751,"longitude":-97.822,"tz":"America/Chicago"}}
```
- Using our British proxy from IPRoyal, our location shows up in GB with a timezone of Europe/London.
- With our ScrapeOps proxy in the US, our location shows up as US with a timezone of America/Chicago.
City Geotargeting
Typically, datacenter proxies don't support city level geotargeting. When you need this much control, it's often best to select a residential proxy plan.
IPRoyal's residential service does support some city level geotargeting. You may view their documentation for this here.
Their residential plan is available for purchase here.
City-level geotargeting gives us access to hyper-localized content. When you're dealing with local content, you can extract the following types of data at a local level. This allows you to collect and manage your data at a much more granular level.
- Local Ads
- Local Businesses
- Local Social Media
- Local Events
To use city geotargeting with IPRoyal, you'll need to purchase their residential proxy services.
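As a rough sketch of what that looks like, IPRoyal's residential proxies build targeting into the proxy credentials rather than a downloaded URL list. The hostname, port, and exact parameter syntax below are assumptions based on IPRoyal's credential-based targeting; verify them against their residential documentation before relying on this.

```python
import requests

# Hypothetical IPRoyal residential endpoint and credential format.
# The "_country-us_city-newyork" password suffix is an assumption;
# check IPRoyal's residential docs for the exact syntax your plan uses.
username = "your-username"
password = "your-password_country-us_city-newyork"
proxy_url = f"http://{username}:{password}@geo.iproyal.com:12321"

proxies = {"http": proxy_url, "https": proxy_url}

result = requests.get("http://lumtest.com/myip.json", proxies=proxies)
print(result.text)
```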
Error Codes
In web development, error codes are very important. Most of us already know that a status code of 200 indicates a successful request. When it comes to other codes (error codes in particular), things can be a bit trickier.
IPRoyal doesn't have any specific documentation on status codes; however, they do outline some of them in a blog post here.
For the sake of convenience, we've outlined them in a table below.
| Status Code | Meaning | Description |
| --- | --- | --- |
| 200 | Success | Everything works as expected. |
| 400 | Bad Request | The request was malformed; double check everything. |
| 403 | Forbidden | Your account is forbidden from accessing this information. |
| 404 | Not Found | The content wasn't found; double check your URL. |
| 407 | Proxy Authentication Required | Your credentials are wrong or missing; double check them. |
| 408 | Request Timeout | The request timed out; try again. |
| 502 | Bad Gateway | Invalid response from the target server. |
| 503 | Service Unavailable | The proxy server is down or overloaded. |
| 504 | Gateway Timeout | The proxy server timed out waiting for the upstream server. |
| 505 | HTTP Version Not Supported | The server doesn't support this version of HTTP. |
Status codes are essential. When you encounter an error, you need to look up the status code and troubleshoot accordingly.
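To put that table to work, it helps to check `response.status_code` and retry the transient errors (408, 502, 503, 504) while failing fast on the client-side codes (400, 403, 404, 407), which won't fix themselves. Here's a minimal sketch using the same proxy file as before; the retry counts and backoff are arbitrary choices, not IPRoyal recommendations.

```python
import random
import time

import requests

RETRYABLE = {408, 502, 503, 504}

with open("iproyal-proxies.txt") as file:
    proxy_list = file.read().splitlines()

def get_with_retries(url, max_retries=3):
    for attempt in range(max_retries):
        # Pick a fresh proxy on every attempt.
        proxy_url = f"http://{random.choice(proxy_list)}"
        proxies = {"http": proxy_url, "https": proxy_url}
        try:
            response = requests.get(url, proxies=proxies, timeout=30)
        except requests.RequestException:
            # Connection-level failure: try again with a new proxy.
            continue
        if response.status_code == 200:
            return response
        if response.status_code not in RETRYABLE:
            # 400/403/404/407 etc. indicate a bug in our request: fail fast.
            response.raise_for_status()
        time.sleep(2 ** attempt)  # simple exponential backoff
    raise RuntimeError(f"{url} failed after {max_retries} attempts")

print(get_with_retries("http://lumtest.com/myip.json").text)
```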
KYC Verification
For datacenter proxies, IPRoyal doesn't require an initial KYC (Know Your Customer) process. Instead, they use a rolling KYC process, meaning your usage is continuously monitored. If you begin to do shady things with their service, they reserve the right to ban you.
Some proxy services, like Bright Data, have a more involved KYC process. Bright Data even requires you to get on a Zoom call with them to confirm your identity and explain your intentions with their product. Most companies use the sort of ongoing KYC employed by IPRoyal.
Implementing IPRoyal Datacenter Proxies in Web Scraping
Now that we know how to use IPRoyal's datacenter proxies, we're going to look at implementing them with different frameworks. Pick your poison.
After this section, you'll be able to handle proxy implementation pretty much anywhere, regardless of framework. We'll go through a few popular Python frameworks and a couple of popular JavaScript ones as well.
Python Requests
We've been using Python Requests since the beginning of this article. Since you're already familiar with it, we'll start here. First, we create an empty list. Then, we read our proxy file into that list. From there, we use Python's `random.choice()` to select a random proxy from our `proxy_list`.
```python
import requests
import random

proxy_list = []

with open("iproyal-proxies.txt") as file:
    proxy_list = file.read().splitlines()

proxy_url = f"http://{random.choice(proxy_list)}"

proxies = {
    "http": proxy_url,
    "https": proxy_url
}

result = requests.get("http://lumtest.com/myip.json", proxies=proxies)
print(result.text)
```
- Our credentials are already inside each URL from `iproyal-proxies.txt`.
- We load `proxy_list` from our file and use `random.choice()` to select a random proxy.
- We then create a `dict` object that holds both our `http` and `https` proxies.
- When making our requests, we make sure to pass `proxies=proxies`. This tells Python Requests to use the `dict` object we created for our proxy settings.
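One thing worth noting: `random.choice()` runs only once here, so every request in the script exits through the same proxy. If you want a fresh IP per request, pick the proxy inside a small helper instead. A minimal sketch, using the same proxy file:

```python
import random

import requests

with open("iproyal-proxies.txt") as file:
    proxy_list = file.read().splitlines()

def proxied_get(url):
    # Choose a new random proxy for every call instead of once per script.
    proxy_url = f"http://{random.choice(proxy_list)}"
    return requests.get(url, proxies={"http": proxy_url, "https": proxy_url})

# Each request below may exit through a different IP in your pool.
for _ in range(3):
    print(proxied_get("http://lumtest.com/myip.json").text)
```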
Python Selenium
SeleniumWire has been an important staple for many Selenium users for years, because vanilla Selenium does not support authenticated proxies. Sad news for many of you: SeleniumWire has been deprecated! It is still technically possible to integrate IPRoyal Datacenter Proxies via SeleniumWire, but we strongly advise against it.
When you decide to use SeleniumWire, you are vulnerable to the following risks:
- Security: Browsers are updated with security patches regularly. Without these patches, your browser will have security holes that have since been fixed in up-to-date browsers and their drivers (ChromeDriver, GeckoDriver).
- Dependency Issues: SeleniumWire is no longer maintained. In time, it may not be able to keep up with its dependencies as they get updated. Broken dependencies can be a source of unending headaches for anyone in software development.
- Compatibility: As the web itself gets updated, SeleniumWire doesn't. Regular browsers are updated all the time. Since SeleniumWire no longer receives updates, you may experience broken functionality and unexpected behavior.
As time goes on, the probability of all these problems increases. If you understand the risks but still wish to use SeleniumWire, you can view a guide on that here.
Depending on when you're reading this, the code example below may or may not work. As mentioned above, we strongly recommend against using SeleniumWire because of its deprecation, but if you decide to use it anyway, here you go. We are not responsible for any damage this may cause to your machine or your privacy.
```python
from seleniumwire import webdriver
import random

proxy_list = []

with open("iproyal-proxies.txt") as file:
    proxy_list = file.read().splitlines()

proxy_url = f"http://{random.choice(proxy_list)}"

proxy_options = {
    "proxy": {
        "http": proxy_url,
        "https": proxy_url,
        "no_proxy": "localhost:127.0.0.1"
    }
}

driver = webdriver.Chrome(seleniumwire_options=proxy_options)
driver.get('https://httpbin.org/ip')
```
- We build our URL exactly how we did above with Python Requests: `http://{random.choice(proxy_list)}`. This takes a random choice from our `proxy_list`.
- We assign this URL to both the `http` and `https` protocols of our proxy settings.
- `driver = webdriver.Chrome(seleniumwire_options=proxy_options)` tells `webdriver` to open Chrome with our custom `seleniumwire_options`.
Python Scrapy
There are several different ways to integrate your new datacenter connection with Scrapy. In this example, we're going to build our proxy directly into our spider. You'll need to move or copy your proxy list into your spiders folder.
To start, we need to make a new Scrapy project.
```bash
scrapy startproject datacenter
```
Then, from within your new Scrapy project, create a new Python file inside the spiders folder with the following code.
```python
import scrapy
import os
import random

proxy_list = []

file_path = os.path.join(os.path.dirname(__file__), "iproyal-proxies.txt")

with open(file_path) as file:
    proxy_list = file.read().splitlines()

proxy_url = f"http://{random.choice(proxy_list)}"

class ExampleSpider(scrapy.Spider):
    name = "datacenter_proxy"

    def start_requests(self):
        request = scrapy.Request(url="https://httpbin.org/ip", callback=self.parse)
        request.meta['proxy'] = proxy_url
        yield request

    def parse(self, response):
        print(response.body)
```
You can run this spider with the following command.
```bash
scrapy crawl datacenter_proxy
```
- First, we use `os.path.join()` to ensure that our spider can find the path to the file.
- Once again, we create the same `proxy_url`: `http://{random.choice(proxy_list)}`. This allows us to select a random proxy from the list.
- From inside `start_requests()`, we assign our `proxy_url` to `request.meta['proxy']`. This tells Scrapy that all of this spider's requests are to be made through the `proxy_url` we created earlier.
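If you'd rather not touch each spider, another of those "several different ways" is a small downloader middleware that assigns a proxy to every outgoing request. This is a minimal sketch using Scrapy's standard middleware hooks; it assumes the proxy file sits next to your `middlewares.py`, and the module path in `settings.py` assumes the `datacenter` project name from above.

```python
# middlewares.py - minimal random-proxy middleware sketch
import os
import random

class RandomProxyMiddleware:
    def __init__(self):
        # Load the proxy list once, relative to this file.
        file_path = os.path.join(os.path.dirname(__file__), "iproyal-proxies.txt")
        with open(file_path) as file:
            self.proxy_list = file.read().splitlines()

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware reads request.meta['proxy'].
        request.meta["proxy"] = f"http://{random.choice(self.proxy_list)}"
```

Then enable it in `settings.py`:

```python
DOWNLOADER_MIDDLEWARES = {
    "datacenter.middlewares.RandomProxyMiddleware": 350,
}
```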
NodeJS Puppeteer
Getting started with Puppeteer is pretty straightforward. Once again, we'll start by creating a new project. Follow the steps below to get up and running in minutes.
Create a new folder.
```bash
mkdir puppeteer-datacenter
```
`cd` into the new folder and create a new JavaScript project.

```bash
cd puppeteer-datacenter
npm init --y
```
Next, we need to install Puppeteer.
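```bash
npm install puppeteer
```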
Next, from within your new JavaScript project, copy/paste the code below into a new `.js` file. Also, make sure to move or copy your proxy list into your new project folder.
```javascript
const puppeteer = require("puppeteer");
const fs = require("fs");

const proxyFileData = fs.readFileSync("iproyal-proxies.txt", "utf8");
const proxyArray = proxyFileData
    .split('\n')
    .map(line => line.trim())
    .filter(line => line.length > 0);
const proxyUrl = proxyArray[Math.floor(Math.random() * proxyArray.length)];
const proxyHost = proxyUrl.split("@");
const creds = proxyHost[0].split(":");

(async () => {
    const browser = await puppeteer.launch({
        args: [`--proxy-server=http://${proxyHost[proxyHost.length-1]}`]
    });

    const page = await browser.newPage();

    await page.authenticate({
        username: creds[0],
        password: creds[1]
    });

    await page.goto('http://lumtest.com/myip.json');
    await page.screenshot({path: 'puppeteer.png'});

    await browser.close();
})();
```
- First, we read our proxy file: `fs.readFileSync("iproyal-proxies.txt", "utf8")`.
- Next, we use string splitting to separate each line read from the file, trimming whitespace and dropping empty lines.
- `proxyArray[Math.floor(Math.random() * proxyArray.length)]` chooses a random proxy from our list.
- We then use some more string splitting to separate our hostname and credentials from the full proxy string: `http://${proxyHost[proxyHost.length-1]}` yields our server, and `proxyHost[0].split(":")` gives our proxy credentials.
- We set our username with `username: creds[0]` and our password with `password: creds[1]` inside `page.authenticate()`.
Puppeteer offers great support for proxies right out of the box, and Puppeteer's built-in `authenticate()` method gives us a special place to put both our `username` and `password`. However, with Puppeteer we need to actually deconstruct our `proxyUrl` in order to properly authenticate with IPRoyal. This isn't IPRoyal's fault, but the process could definitely be simpler when we're reading these URLs from a file.
The screenshot below came from running the Puppeteer code above.
NodeJS Playwright
If you paid attention during the Puppeteer integration above, Playwright is going to seem very similar. Puppeteer and Playwright actually share a common origin in Chrome's DevTools.
The steps below should look at least somewhat familiar; however, things do get slightly different near the end.
Create a new project folder.
```bash
mkdir playwright-datacenter
```
`cd` into the new folder and initialize a JavaScript project.

```bash
cd playwright-datacenter
npm init --y
```
Install Playwright.
```bash
npm install playwright
npx playwright install
```
Next, you can copy/paste the code below into a JavaScript file. Once again, make sure to add your proxy list to your project folder.
```javascript
const playwright = require("playwright");
const fs = require("fs");

const proxyFileData = fs.readFileSync("iproyal-proxies.txt", "utf8");
const proxyArray = proxyFileData
    .split('\n')
    .map(line => line.trim())
    .filter(line => line.length > 0);
const proxyUrl = proxyArray[Math.floor(Math.random() * proxyArray.length)];
const proxyHost = proxyUrl.split("@");
const creds = proxyHost[0].split(":");

const options = {
    proxy: {
        server: `http://${proxyHost[proxyHost.length-1]}`,
        username: creds[0],
        password: creds[1]
    }
};

(async () => {
    const browser = await playwright.chromium.launch(options);
    const page = await browser.newPage();

    await page.goto('http://lumtest.com/myip.json');
    await page.screenshot({ path: "playwright.png" });

    await browser.close();
})();
```
- Like our Puppeteer example, we first set up our configuration variables. This process requires a bit of extra code to split up our `proxyUrl`.
- We create a `proxy` object with the fields `server`, `username: creds[0]`, and `password: creds[1]`. The `server` field is built from `http://${proxyHost[proxyHost.length-1]}`.
Just like Puppeteer, Playwright gives us first-class support for authenticated proxies, but we need to do some extra work deconstructing our URL. You can view the screenshot from this code below.
Case Study: Scrape The Guardian
When you run into the stricter anti-bots on the web, they'll block your datacenter proxy. For most general sites, though, datacenter proxies tend to do just fine.
Datacenter proxies are also much cheaper and more efficient. When you're running a managed proxy like the ScrapeOps Proxy Aggregator, it will first try your request using a datacenter proxy.
If the datacenter proxy fails, it will retry the request using a residential proxy... with no additional charge!
Here, we'll scrape The Guardian. This case study is more about concepts than it is about data harvesting.
In the code below:
- We use our IPRoyal proxies to setup a connection in the UK.
- We use the ScrapeOps Proxy Aggregator to setup a connection based in the US.
- With each of these proxies, we make a GET request to the Lumtest API and then we make one to The Guardian.
- We print our results to the terminal for comparison.
Take a look at the code below.
```python
import requests
import random
from urllib.parse import urlencode
from bs4 import BeautifulSoup

API_KEY = "your-scrapeops-api-key"
LOCATION = "us"

def get_scrapeops_url(url, location=LOCATION):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url

proxy_list = []

with open("iproyal-proxies.txt") as file:
    proxy_list = file.read().splitlines()

proxy_url = f"http://{random.choice(proxy_list)}"

proxies = {
    "http": proxy_url,
    "https": proxy_url
}

print("----------------------US---------------------")
location_info = requests.get(get_scrapeops_url("http://lumtest.com/myip.json"))
print(location_info.text)

response = requests.get(get_scrapeops_url("https://www.theguardian.com/"))
soup = BeautifulSoup(response.text, "html.parser")
subnav = soup.select_one("div[data-testid='sub-nav']")
print(subnav.text)

print("----------------------UK---------------------")
location_info = requests.get("http://lumtest.com/myip.json", proxies=proxies)
print(location_info.text)

response = requests.get("https://www.theguardian.com/", proxies=proxies)
soup = BeautifulSoup(response.text, "html.parser")
subnav = soup.select_one("div[data-testid='sub-nav']")
print(subnav.text)
```
Here are some key things you should notice in this code:
- With each proxy connection, we make a request to the Lumtest API specifically to verify our proxy connection. It is not the end result of our code.
- On each run, we use `soup.select_one("div[data-testid='sub-nav']")` to find and return the navbar on the page.
Our proxy connection with ScrapeOps is pretty different.
- ScrapeOps gives us access to a REST API. This allows us to wrap all of our parameters into a proxied URL.
- Since we're using a REST API, we don't need to set up a proxy connection manually using Requests. Instead, we just need to wrap our URL properly.
If we run our code, we get output similar to what you see below.
```
----------------------US---------------------
{"country":"US","asn":{"asnum":20278,"org_name":"NEXEON"},"geo":{"city":"Jacksonville","region":"FL","region_name":"Florida","postal_code":"32255","latitude":30.3341,"longitude":-81.6544,"tz":"America/New_York","lum_city":"jacksonville","lum_region":"fl"}}
USUS elections 2024WorldEnvironmentUkraineSoccerBusinessTechScienceNewslettersWellness
----------------------UK---------------------
{"country":"GB","asn":{"asnum":212335,"org_name":"Simoresta UAB"},"geo":{"city":"London","region":"ENG","region_name":"England","postal_code":"EC4R","latitude":51.5088,"longitude":-0.093,"tz":"Europe/London","lum_city":"london","lum_region":"eng"}}
UKWorldClimate crisisUkraineFootballNewslettersBusinessEnvironmentUK politicsEducationSocietyScienceTechGlobal developmentObituaries
```
First, we'll look at our locations here. We cleaned up the important information from the JSON and made it a little easier to read. Our US proxy is located in the US and our UK proxy is located in the UK.
| Proxy | Country Code |
| --- | --- |
| ScrapeOps | US |
| IPRoyal (United Kingdom) | GB |
Now let's take a closer look at our navbar text from each run.
- `us`: USUS elections 2024WorldEnvironmentUkraineSoccerBusinessTechScienceNewslettersWellness
- `gb`: UKWorldClimate crisisUkraineFootballNewslettersBusinessEnvironmentUK politicsEducationSocietyScienceTechGlobal developmentObituaries

Let's make these a little easier to read.

- `us`: US | US elections 2024 | World | Environment | Ukraine | Soccer | Business | Tech | Science | Newsletters | Wellness
- `gb`: UK | World | Climate crisis | Ukraine | Football | Newsletters | Business | Environment | UK politics | Education | Society | Science | Tech | Global development | Obituaries
Between the two responses, the navbar layout is pretty different. On the US-based navbar, the first three sections read: US, US elections 2024, and World. The first three of our UK-based navbar read: UK, World, Climate crisis.
The Guardian switches these based on your location because it's a reasonable prediction of your interests. At the time of this writing, in the US, elections are a pretty hot topic.
In the UK, US elections might get some attention, but they don't seem to be the priority of most readers.
Many websites will prioritize your attention differently based on your location.
Alternative: ScrapeOps Proxy Aggregator
IPRoyal's datacenter proxies are a pretty good deal, but we've got some great deals too. We offer a different product with many more features at a really competitive price! If you followed along with our case study, you've actually already used it!
Take a look at our ScrapeOps Proxy Aggregator! When you use our Proxy Aggregator, we don't charge for bandwidth; instead, we charge per request. A basic request (like the ones we've been making here) costs at most $0.00036. Even better, you only pay for successful requests!
Proxy Aggregator is a managed proxy. This means that Proxy Aggregator always goes through and selects the best proxy for your needs.
Unless you tell it otherwise, Proxy Aggregator will try your request with a datacenter proxy. If that request fails, we'll then retry it using a premium (residential or mobile) proxy for you with no additional charge!
When you use our Proxy Aggregator, you get the stability and reliability you can count on.
The table below outlines our pricing.
| Monthly Price | API Credits | Price Per Request |
| --- | --- | --- |
| $9 | 9,000 | $0.00036 |
| $15 | 50,000 | $0.0003 |
| $19 | 100,000 | $0.00019 |
| $29 | 250,000 | $0.000116 |
| $54 | 500,000 | $0.000108 |
| $99 | 1,000,000 | $0.000099 |
| $199 | 2,000,000 | $0.0000995 |
| $254 | 3,000,000 | $0.000084667 |
All of these plans offer the following awesome features:
- JavaScript Rendering
- Screenshot Capability
- Country Geotargeting
- Residential and Mobile Proxies
- Anti-bot Bypass
- Custom Headers
- Sticky Sessions
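Most of these features are toggled with simple query parameters on the same proxied URL. Here's a sketch; the parameter names `render_js` and `residential` follow the ScrapeOps docs, but treat them as assumptions and confirm them against your dashboard before shipping anything.

```python
import requests
from urllib.parse import urlencode

API_KEY = "your-scrapeops-api-key"

def get_scrapeops_url(url, **extra_params):
    # Feature flags ride along as extra query parameters.
    payload = {"api_key": API_KEY, "url": url, **extra_params}
    return "https://proxy.scrapeops.io/v1/?" + urlencode(payload)

# Render JavaScript before returning the response
# (parameter name assumed from the ScrapeOps docs).
response = requests.get(get_scrapeops_url("http://lumtest.com/myip.json", render_js="true"))
print(response.status_code)

# Skip straight to the residential pool instead of the datacenter default.
response = requests.get(get_scrapeops_url("http://lumtest.com/myip.json", residential="true"))
print(response.status_code)
```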
As we mentioned earlier, IPRoyal is actually one of our providers! When you sign up for ScrapeOps, you get access to IPRoyal and many more providers.
Go ahead and start your free trial here.
Once you've got your free trial, you can copy and paste the code below to check your proxy connection.
```python
import requests
from urllib.parse import urlencode

API_KEY = "your-super-secret-api-key"
LOCATION = "us"

def get_scrapeops_url(url, location=LOCATION):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url

response = requests.get(get_scrapeops_url("http://lumtest.com/myip.json"))
print(response.text)
```
In the code above, we do the following:

- Create our configuration variables: `API_KEY` and `LOCATION`.
- Write a `get_scrapeops_url()` function. This function takes all of our parameters along with a target URL and wraps them into a ScrapeOps proxied URL. This is an incredibly easy way to scrape, and it makes our proxy code much more modular.
- Check our IP info with `response = requests.get(get_scrapeops_url("http://lumtest.com/myip.json"))`.
- Finally, we print it to the terminal. You should get an output similar to this.
{"country":"US","asn":{"asnum":31898,"org_name":"ORACLE-BMC-31898"},"geo":{"city":"San Jose","region":"CA","region_name":"California","postal_code":"95119","latitude":37.2379,"longitude":-121.7946,"tz":"America/Los_Angeles","lum_city":"sanjose","lum_region":"ca"}}
Take a look at the `org_name`, `ORACLE-BMC-31898`. This is a datacenter from Oracle. As we mentioned earlier, our Proxy Aggregator gives us access to datacenter proxies by default.
Ethical Considerations and Legal Guidelines
IPRoyal takes a pretty firm stance when it comes to ethical scraping. They pride themselves on using only ethically sourced proxies. The screenshot below comes straight from their residential proxies page. It reads "32M+ ethically-sourced unique IPs in 195 countries". As you can see, they take ethical sourcing very seriously.
When residential proxies are sourced, they come from real people using real devices on their real internet connections. Ethical sourcing of residential proxies means that everyone providing bandwidth knows they're providing bandwidth. When we use datacenter proxies, they come from a datacenter; there is no chance our proxy comes from a user unknowingly running software on their smartphone.
Legal
Breaking the law with a proxy provider is a terrible idea. Obviously, it's illegal, and, something you might not have considered, it harms everyone involved. It harms the proxy provider, and it eventually harms you too. If you break the law with a proxy, your action will first be traced to the proxy provider. Then, the action will be traced to your account through either your API key or your username and password.
This creates problems for both you and your proxy service.
- Don't use residential proxies to access illegal content: These actions can come with intense legal penalties, which can even include prison or jail time depending on severity.
- Don't scrape and disseminate other people's private data: Depending on your jurisdiction, this is also a highly illegal and dangerous practice. Doxxing private data can lead to heavy fines and possibly jail/prison time.
Ethical
When we scrape, we shouldn't only consider legalities; we also need to consider the ethics of what we're doing. Just because something is legal doesn't make it right. No one wants to be the next headline concerning unethical practices.
- Social Media Monitoring: Social media stalking can be a very destructive and disrespectful behavior. How would you feel if someone used data collection methods on your account?
- Respect Site Policies: Failure to respect a site's policies can get your account suspended or banned. It can even lead to legal trouble for those of you who sign and then violate a terms of service agreement.
Conclusion
Datacenter proxies are cheap, performant tools that allow us to scrape the web efficiently. By this point, you should understand that residential proxies aren't always needed.
You should also have a decent understanding of how to implement IPRoyal Datacenter proxies using Python Requests, Scrapy, NodeJS Puppeteer and NodeJS Playwright. You can view the full documentation for IPRoyal's Datacenter proxies here.
By this point, you should also understand how to set up a basic proxy connection with our very own Proxy Aggregator. We have tons of features and a long list of affordable plans.
Now, take your new skills and go build something with IPRoyal's Datacenter Proxies or the ScrapeOps Proxy Aggregator.
More Cool Articles
If you're in the mood to keep reading, we've got a ton of content that can scratch that itch. Whether you're a seasoned dev or brand new to web scraping, we've got something useful for you.
We love scraping so much that we wrote the Python Web Scraping Playbook. If you want to learn more, take a look at the guides below.