How to Bypass PerimeterX with Playwright
When trying to scrape websites, you will encounter a variety of services that try to prevent bot traffic or otherwise hinder scraping. Some of the most common ones are Cloudflare, DataDome, and PerimeterX. This article will be focusing on PerimeterX, a leading provider of security solutions on the web. They offer robust protection against threats like bots and scrapers.
In this article, we will explore how Playwright can be utilized to bypass PerimeterX security.
- TLDR: Bypassing PerimeterX with Playwright
- Understanding PerimeterX
- How PerimeterX Detects Web Scrapers and Prevents Automated Access
- Bypassing PerimeterX
- How to Bypass PerimeterX with Playwright
- Bypassing PerimeterX on Zillow
- Conclusion
- More Playwright Guides
TLDR: Bypassing PerimeterX with Playwright
There are several methods to bypass PerimeterX. The easiest is to use a service like ScrapeOps Proxy Aggregator. As an example, let's try to load Zillow with Playwright.
You'll likely see something like this, meaning you've been blocked and have to complete a bot check.
Instead though, we can use ScrapeOps Proxy Aggregator as shown in the code below.
const playwright = require("playwright");
const SCRAPE_OPS_KEY = "YOUR_SCRAPE_OPS_API_KEY";
function getScrapeOpsUrl(url) {
const payload = {
api_key: SCRAPE_OPS_KEY,
url: url,
};
const queryStr = new URLSearchParams(payload).toString();
return `https://proxy.scrapeops.io/v1/?${queryStr}`;
}
(async () => {
// Launch browser, context and page
const browser = await playwright.chromium.launch();
const context = await browser.newContext();
const page = await context.newPage();
// Navigate to the page and take a screenshot
const url = getScrapeOpsUrl("https://www.zillow.com/");
await page.goto(url);
await page.screenshot({ path: "zillow-proxied.png" });
// Close the browser
await browser.close();
})();
- We require playwright using require()
- We create a getScrapeOpsUrl function that will put together URLs for ScrapeOps
- We load the URL using Playwright and take a screenshot
Now you should see an actual page on Zillow.
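To make the URL wrapping concrete, here is a small sketch of what a helper like getScrapeOpsUrl returns for a Zillow URL. It assumes the api_key and url query parameters of the ScrapeOps proxy endpoint, and the key is a placeholder.

```javascript
// Placeholder key; replace with your own ScrapeOps API key.
const SCRAPE_OPS_KEY = "YOUR_SCRAPE_OPS_API_KEY";

// Wrap a target URL so the request is routed through the ScrapeOps proxy endpoint.
function getScrapeOpsUrl(url) {
  const payload = { api_key: SCRAPE_OPS_KEY, url: url };
  // URLSearchParams percent-encodes the target URL so it survives as one parameter
  const queryStr = new URLSearchParams(payload).toString();
  return `https://proxy.scrapeops.io/v1/?${queryStr}`;
}

console.log(getScrapeOpsUrl("https://www.zillow.com/"));
// https://proxy.scrapeops.io/v1/?api_key=YOUR_SCRAPE_OPS_API_KEY&url=https%3A%2F%2Fwww.zillow.com%2F
```

The target URL is passed as an encoded parameter, so Playwright only ever talks to proxy.scrapeops.io.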
Understanding PerimeterX
For our purposes of web scraping, PerimeterX is designed to recognize and block automated visitors (like bots and scrapers). We don't know exactly what goes on behind the scenes, but we can assume they use a number of common practices to identify and block bot traffic:
- Behavioral Analysis
- Fingerprinting
- CAPTCHA challenges
- IP Monitoring
- Header Analysis
- Session Tracking
- DNS Monitoring
Next, we'll look at what these methods are and how they work so we can get a better understanding of how to bypass them.
How PerimeterX Detects Web Scrapers and Prevents Automated Access
Services like PerimeterX utilize a variety of server and client side technology to detect and block bots. Let's discuss them in more detail.
Server Side
- Behavioral Analysis: On the server side, sites can track and analyze the behavior of their visitors. For example, if a visitor navigates to a large number of pages very quickly, they might be a bot. Similarly, if a visitor is clicking on buttons extremely quickly, they're also probably a bot.
- IP Monitoring: PerimeterX likely also monitors the IP addresses that traffic comes from. If one address is producing a lot of traffic, they may block or throttle that traffic to help defend against DDoS attacks and large-scale scraping.
- Header Analysis: When visiting a website, your browser sends a variety of headers. These headers include small pieces of information about the visitor (like language and referring website) and about the browser itself (like Chrome, Firefox, etc). Headers that are missing, or that don't match what a real browser would send, are a sign of automated traffic.
- DNS Monitoring: Sites also often monitor the domains that requests come from. If a domain is recognized as malicious or heavily automated, it will usually be blocked.
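As a rough illustration of the behavioral and IP checks above, a server could count requests per IP address in a sliding time window. This is a hypothetical sketch, not PerimeterX's actual logic; the class name, limit, and window size are all made up for the example.

```javascript
// Hypothetical sketch: flag an IP that sends more than `limit` requests within `windowMs`.
class RequestRateTracker {
  constructor(limit, windowMs) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.hits = new Map(); // ip -> array of request timestamps in ms
  }

  // Record a request and return true if the IP now exceeds the limit.
  record(ip, now = Date.now()) {
    const recent = (this.hits.get(ip) || []).filter((t) => now - t < this.windowMs);
    recent.push(now);
    this.hits.set(ip, recent);
    return recent.length > this.limit;
  }
}

// Four requests within 300 ms against a limit of 3 per second
const tracker = new RequestRateTracker(3, 1000);
const verdicts = [0, 100, 200, 300].map((t) => tracker.record("203.0.113.7", t));
console.log(verdicts); // [ false, false, false, true ]
```

From the scraper's side, this is why spreading requests out over time and across IP addresses matters.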
Client Side
- CAPTCHA: CAPTCHAs are likely the most common client side bot prevention. These are the "are you a robot?" popups. They are embedded in the website code and must be solved by the client. If your scraper is being blocked by CAPTCHAs, it is an immediate indication that you are being identified as a bot. There are ways to solve CAPTCHAs as a bot, but it is usually more worthwhile to avoid them altogether.
- Fingerprinting: To track user behavior, websites build a fingerprint for each visitor from characteristics of their browser and device. The fingerprint gives the site a stable identifier to attach metrics to, which feeds things like behavioral analysis to identify bot users.
- Session Tracking: Similar to fingerprinting, websites might place a cookie in your browser to identify you. The cookie can be used to track your actions around the site.
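To illustrate the fingerprinting idea, the sketch below hashes a handful of browser attributes into a short identifier. The attribute names and the hashing scheme are assumptions chosen for the example; real fingerprinting scripts collect far more signals and we don't know how PerimeterX combines them.

```javascript
const crypto = require("crypto");

// Hypothetical sketch: derive a stable identifier from a set of browser attributes.
function fingerprint(attrs) {
  // Sort the keys so the same attributes always produce the same string
  const canonical = Object.keys(attrs)
    .sort()
    .map((key) => `${key}=${attrs[key]}`)
    .join("|");
  // Hash the canonical string and keep a short prefix as the identifier
  return crypto.createHash("sha256").update(canonical).digest("hex").slice(0, 16);
}

const visitor = { userAgent: "Mozilla/5.0 ...", language: "en-US", screen: "1920x1080" };
console.log(fingerprint(visitor)); // same attributes always give the same id
```

Because the identifier depends only on the attributes, changing your IP address alone does not change it, which is why fingerprints and IPs are usually considered together.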
How to Bypass PerimeterX
Because PerimeterX tracks and analyzes IP addresses, it is important to use IPs that are identified as either residential or mobile. This is easy if your scraper is running on a local machine on a home network, but for production, most scrapers will be running on a server in the cloud.
When scraping from the cloud, if the scraper isn't configured correctly, the website will easily identify the traffic coming from a datacenter and block it instantly.
An ideal scraper will also send real browser headers, either by running a headed (GUI) browser or by setting realistic user agents.
It's worth noting: you're actually more likely to be detected when using a headless browser. The browser will often send that information in some way that can be detected by the website.
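One well-known example of such a leak is the user agent: headless Chromium has historically reported "HeadlessChrome" in its default user agent string. The sketch below shows how trivially that can be detected, and a simple way to normalize it; the function names are hypothetical, and the user agent is only one of several signals a site can check.

```javascript
// Detect the "HeadlessChrome" token that headless Chromium has historically sent.
function looksHeadless(userAgent) {
  return /HeadlessChrome/i.test(userAgent);
}

// Replace the headless token so the string reads like a regular Chrome user agent.
function normalizeUserAgent(userAgent) {
  return userAgent.replace("HeadlessChrome", "Chrome");
}

const headlessUa =
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/120.0.0.0 Safari/537.36";

console.log(looksHeadless(headlessUa)); // true
console.log(looksHeadless(normalizeUserAgent(headlessUa))); // false
```

The normalized string can be passed as the userAgent option of browser.newContext().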
Use Residential and Mobile IPs
As previously mentioned, scrapers should use residential and mobile proxies to avoid showing their real (datacenter) IP. The ScrapeOps Proxy Aggregator rotates through the best proxies in the industry to get you a good IP for every request.
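If you manage your own residential or mobile proxy instead, Playwright accepts proxy settings at launch. The following is a minimal sketch: the host, port, and credentials are placeholders for whatever your provider gives you, and the helper names are made up for this example.

```javascript
// Build Playwright launch options for an authenticated HTTP proxy.
function buildLaunchOptions(proxy) {
  return {
    proxy: {
      server: `http://${proxy.host}:${proxy.port}`,
      username: proxy.username,
      password: proxy.password,
    },
  };
}

// Open a URL through the proxy and return the page title.
async function openThroughProxy(url, proxy) {
  const { chromium } = require("playwright"); // loaded lazily so the helper above works standalone
  const browser = await chromium.launch(buildLaunchOptions(proxy));
  try {
    const page = await browser.newPage();
    await page.goto(url);
    return await page.title();
  } finally {
    await browser.close();
  }
}

// Placeholder proxy details (hypothetical host and credentials)
const exampleProxy = { host: "proxy.example.com", port: 8000, username: "user", password: "pass" };
console.log(buildLaunchOptions(exampleProxy).proxy.server); // http://proxy.example.com:8000
```

Calling openThroughProxy("https://www.zillow.com/", exampleProxy) would route every request from that browser through the given proxy.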
Rotate Real Browser Headers
If you can get away with running your browser in headless mode, you should still be using normal browser headers. Specifically, the User Agent should match the browser you are automating.
If your scraper is using a specific version of Chrome, your headers should match that same version of Chrome's headers.
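One way to keep the two in sync is to derive the user agent from the version of the browser Playwright actually launched. This is a sketch under assumptions: the Windows platform string is an arbitrary choice, and the helper names are hypothetical.

```javascript
// Build a desktop Chrome user agent for a given browser version string.
function chromeUserAgent(version) {
  return `Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/${version} Safari/537.36`;
}

// Create a context whose user agent matches the launched browser's version.
async function newMatchedContext(browser) {
  // browser.version() returns the version of the browser Playwright launched
  return browser.newContext({ userAgent: chromeUserAgent(browser.version()) });
}

console.log(chromeUserAgent("120.0.0.0"));
```

Note that the platform in the user agent should also be plausible for the machine you run on, since other signals can reveal the real operating system.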
Use a Headed Browser
If you don't want to deal with all the user agents, you can use a headed browser instead. This will replicate a normal browsing experience as closely as possible and make your traffic look much more legit.
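In Playwright, switching to a headed browser is a single launch option. The sketch below defaults to headed and falls back to headless when a HEADLESS=1 environment variable is set; that variable is a hypothetical convention for this example, not something Playwright reads itself.

```javascript
// Choose launch options: headed by default, headless only when explicitly requested.
function launchOptions(env = process.env) {
  return { headless: env.HEADLESS === "1" };
}

async function launchBrowser() {
  const { chromium } = require("playwright"); // loaded lazily so launchOptions works standalone
  return chromium.launch(launchOptions());
}

console.log(launchOptions({})); // { headless: false }
```

On a Linux server without a display, a headed browser needs a virtual display such as Xvfb.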
How to Bypass PerimeterX with Playwright
PerimeterX is very sophisticated but we can try a couple different ways to get past it. We'll spend the rest of this article trying to get past the screen we originally showed on Zillow.
We will try the following tactics:
- Fortified Scraper
- puppeteer-extra-plugin-stealth (with Playwright)
- ScrapeOps Proxy Aggregator
Option 1: Fortify the Browser
One of the primary ways websites will detect if you're a bot is by checking browser headers.
const playwright = require("playwright");
const userAgents = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36",
"Mozilla/5.0 (iPhone; CPU iPhone OS 14_4_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Mobile/15E148 Safari/604.1",
"Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1)",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/87.0.664.75",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18363",
];
(async () => {
// Launch browser, context and page
const browser = await playwright.chromium.launch();
const context = await browser.newContext({
userAgent: userAgents[Math.floor(Math.random() * userAgents.length)],
});
const page = await context.newPage();
// Navigate to the page and take a screenshot
await page.goto("https://www.whatismybrowser.com");
await page.screenshot({ path: "what-is-fortified.png" });
// Close the browser
await browser.close();
})();
The above code does the following:
- Launch Playwright (headless by default)
- Set the user agent of the context in the call to browser.newContext
- Navigate to the website using page.goto
- Screenshot the page with page.screenshot
Which produces the following image:
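One caveat with a mixed pool like the one above: it contains Safari, Internet Explorer, and Edge user agents, while the script launches Chromium, so a random pick can contradict the rest of the browser's fingerprint. A possible refinement, sketched here with hypothetical helper names, is to filter the pool down to desktop Chrome entries before picking one.

```javascript
// Keep only desktop Chrome user agents, since the script launches Chromium.
function chromeDesktopAgents(pool) {
  return pool.filter((ua) => /Chrome\/\d+/.test(ua) && !/Mobile|Edg/.test(ua));
}

// Pick a random item; the random source is injectable to keep the function testable.
function pickRandom(items, rng = Math.random) {
  return items[Math.floor(rng() * items.length)];
}

const pool = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36",
  "Mozilla/5.0 (iPhone; CPU iPhone OS 14_4_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Mobile/15E148 Safari/604.1",
  "Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1)",
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/87.0.664.75",
];

console.log(chromeDesktopAgents(pool).length); // 1
```

The result of pickRandom(chromeDesktopAgents(pool)) can be passed as the userAgent option in place of the unfiltered random pick.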
Option 2: Use puppeteer-extra-plugin-stealth
You may be familiar with puppeteer-extra-plugin-stealth, and you'd be right to assume it was made for Puppeteer. Fortunately, the playwright-extra package makes it compatible with Playwright. The plugin automates a lot of detection prevention.
To install the packages, use this command:
npm i playwright-extra puppeteer-extra-plugin-stealth
Now see this example using the plugin:
const { chromium } = require("playwright-extra");
const stealth = require("puppeteer-extra-plugin-stealth")();
(async () => {
// Enable the stealth plugin
chromium.use(stealth);
// Launch browser, context and page
const browser = await chromium.launch();
const context = await browser.newContext();
const page = await context.newPage();
// Navigate to the page and take a screenshot
await page.goto("https://www.whatismybrowser.com");
await page.screenshot({ path: "what-is-stealth.png" });
// Close the browser
await browser.close();
})();
Notice the following changes:
- We now use playwright-extra instead of playwright to import the browser.
- We import the puppeteer-extra-plugin-stealth plugin.
- We use chromium.use() to use the stealth plugin.
- We no longer handle our own user agents.
- We screenshot the same whatismybrowser page.
Here is our result with the plugin:
As you probably noticed in the screenshot, whatismybrowser was unable to detect any abnormalities with the browser. Stealth tends to do a far better job at covering up our browser leaks.
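To give a feel for the kind of leak being covered up, here is a hand-rolled sketch of a single evasion: masking navigator.webdriver with an init script. It is a deliberately simplistic illustration with hypothetical helper names, not how the stealth plugin is implemented, and the plugin handles many more signals than this.

```javascript
// Runs inside the page before any site script; `nav` defaults to the page's navigator.
function hideWebdriver(nav = navigator) {
  Object.defineProperty(nav, "webdriver", { get: () => undefined, configurable: true });
}

// Create a context that applies the patch to every page it opens.
async function newPatchedContext(browser) {
  const context = await browser.newContext();
  // addInitScript evaluates the function in each new page before the page's own scripts
  await context.addInitScript(hideWebdriver);
  return context;
}

// Demonstrate the patch on a plain object standing in for navigator
const fakeNavigator = { webdriver: true };
hideWebdriver(fakeNavigator);
console.log(fakeNavigator.webdriver); // undefined
```

A patch like this can itself be detected, which is one reason a maintained plugin tends to hold up better than a one-off script.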
Option 3: Use the ScrapeOps Proxy Aggregator
This option is by far the easiest and most reliable. The ScrapeOps Proxy Aggregator rotates through the best proxies so that each request goes out from a good address, and it talks to the website for you so that you don't have to worry about headers and similar details.
Here's an example using the Proxy Aggregator:
const playwright = require("playwright");
const SCRAPE_OPS_KEY = "YOUR_SCRAPE_OPS_API_KEY";
function getScrapeOpsUrl(url) {
const payload = {
api_key: SCRAPE_OPS_KEY,
url: url,
};
const queryStr = new URLSearchParams(payload).toString();
return `https://proxy.scrapeops.io/v1/?${queryStr}`;
}
(async () => {
// Launch browser, context and page
const browser = await playwright.chromium.launch();
const context = await browser.newContext();
const page = await context.newPage();
// Navigate to the page and take a screenshot
const url = getScrapeOpsUrl("https://www.whatismybrowser.com");
await page.goto(url);
await page.screenshot({ path: "whatis-proxied.png" });
// Close the browser
await browser.close();
})();
We have to write a bit more code but this method is actually a lot easier because we don't have to manage anything anymore.
- We use the default playwright package
- We make a getScrapeOpsUrl function to create URLs
- We launch the browser as usual with chromium.launch()
- We create a new page
- We navigate to the ScrapeOps URL using page.goto
- Finally, we take a screenshot and close the browser.
Here's the result of whatismybrowser:
Bypassing PerimeterX on Zillow
Now that we've learned all that, let's apply it. Zillow uses PerimeterX to stop bot traffic. We're going to try scraping Zillow with these methods to see what works.
Here's what Zillow looks like from my browser. We'll be trying to get a similar result from our code:
This is what it looks like when you get blocked, which is what we're trying to avoid:
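Rather than inspecting screenshots by hand, a scraper can check for the block page in code. The sketch below is a heuristic: the 403 status check and the marker strings are assumptions about what a PerimeterX block page tends to contain, so adjust them to what you actually observe. The function names are hypothetical.

```javascript
// Assumed markers of a PerimeterX block page; verify these against real responses.
const BLOCK_MARKERS = ["px-captcha", "Press & Hold", "Press &amp; Hold"];

// Decide whether a response looks like a block page.
function looksBlocked(status, html) {
  if (status === 403) return true;
  return BLOCK_MARKERS.some((marker) => html.includes(marker));
}

// Load a URL in an existing Playwright page and report whether it was blocked.
async function loadAndCheck(page, url) {
  const response = await page.goto(url); // may be null for some navigations
  const html = await page.content();
  return looksBlocked(response ? response.status() : 0, html);
}

console.log(looksBlocked(200, "<html><body>Homes for sale</body></html>")); // false
console.log(looksBlocked(200, '<div id="px-captcha"></div>')); // true
```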
Zillow with a Fortified Browser
We'll be mostly reusing the code we wrote for Option 1 earlier. Here's what it looks like:
const playwright = require("playwright");
const userAgents = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36",
"Mozilla/5.0 (iPhone; CPU iPhone OS 14_4_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Mobile/15E148 Safari/604.1",
"Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1)",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/87.0.664.75",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18363",
];
(async () => {
// Launch browser, context and page
const browser = await playwright.chromium.launch();
const context = await browser.newContext({
userAgent: userAgents[Math.floor(Math.random() * userAgents.length)],
});
const page = await context.newPage();
// Navigate to the page and take a screenshot
await page.goto("https://zillow.com");
await page.screenshot({ path: "zillow-fortified.png" });
// Close the browser
await browser.close();
})();
Here's the result:
Surprisingly, the simplest method is successful. Even so, it is only a matter of time before you are detected or this bypass becomes obsolete, so don't be surprised when this option fails.
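Since any single method can stop working, one way to prepare is to try the cheapest approach first and fall back to the next one when a block is detected. The following is a generic sketch; the strategy objects and the isBlocked callback are hypothetical and would wrap your own Playwright code.

```javascript
// Try each strategy in order and return the first result that is not blocked.
async function firstUnblocked(strategies, isBlocked) {
  for (const { name, run } of strategies) {
    const result = await run();
    if (!isBlocked(result)) return { name, result };
  }
  throw new Error("All strategies were blocked");
}

// Stand-in strategies; real ones would load the page and return its HTML.
const demoStrategies = [
  { name: "fortified", run: async () => "blocked" },
  { name: "proxy", run: async () => "<html>listing data</html>" },
];

firstUnblocked(demoStrategies, (html) => html === "blocked").then(({ name }) => {
  console.log(name); // proxy
});
```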
Zillow with puppeteer-extra-plugin-stealth
Once again, we're just copying the code we already wrote before but using Zillow. Here it is for reference:
const { chromium } = require("playwright-extra");
const stealth = require("puppeteer-extra-plugin-stealth")();
(async () => {
// Enable the stealth plugin
chromium.use(stealth);
// Launch browser, context and page
const browser = await chromium.launch();
const context = await browser.newContext();
const page = await context.newPage();
// Navigate to the page and take a screenshot
await page.goto("https://zillow.com");
await page.screenshot({ path: "zillow-stealth.png" });
// Close the browser
await browser.close();
})();
Here is the result:
As you can see, it worked. But that doesn't mean it always will. The plugin is open source, so it is a constant battle between its maintainers and the anti-bot services, each working to stay ahead of the other.
Zillow with ScrapeOps Proxy Aggregator
Finally, we'll try to load Zillow with the ScrapeOps Proxy Aggregator from Option 3.
Here's our code:
const playwright = require("playwright");
const SCRAPE_OPS_KEY = "YOUR_SCRAPE_OPS_API_KEY";
function getScrapeOpsUrl(url) {
const payload = {
api_key: SCRAPE_OPS_KEY,
url: url,
};
const queryStr = new URLSearchParams(payload).toString();
return `https://proxy.scrapeops.io/v1/?${queryStr}`;
}
(async () => {
// Launch browser, context and page
const browser = await playwright.chromium.launch();
const context = await browser.newContext();
const page = await context.newPage();
// Navigate to the page and take a screenshot
const url = getScrapeOpsUrl("https://zillow.com");
await page.goto(url);
await page.screenshot({ path: "zillow-proxied.png" });
// Close the browser
await browser.close();
})();
And here's the result:
Unsurprisingly, this one works too. It is also the most likely to keep working, since the ScrapeOps team maintains the API and keeps its detection avoidance tech up to date.
Conclusion
In summary, the most reliable way to bypass PerimeterX is to use a good proxy. All three methods we tried got past PerimeterX on Zillow in this case study, but the first two can stop working at any time, so using a proxy remains the best practice.
If you'd like to learn more about Playwright, check out the official Playwright Documentation.
More Playwright Guides
If you would like to master web scraping with Playwright, make sure to check out our Playwright Web Scraping Playbook.