How to Bypass PerimeterX with Playwright

When trying to scrape websites, you will encounter a variety of services that try to prevent bot traffic or otherwise hinder scraping. Some of the most common ones are Cloudflare, DataDome, and PerimeterX. This article will be focusing on PerimeterX, a leading provider of security solutions on the web. They offer robust protection against threats like bots and scrapers.

In this article, we will explore how Playwright can be utilized to bypass PerimeterX security.


TLDR: Bypassing PerimeterX with Playwright

There are several methods to bypass PerimeterX. The easiest is to use a service like ScrapeOps Proxy Aggregator. As an example, let's try to load Zillow with Playwright.

You'll likely see something like this, meaning you've been blocked and have to complete a bot check. PerimeterX Block

Instead, we can route the request through the ScrapeOps Proxy Aggregator, as shown in the code below.

const playwright = require("playwright");

const SCRAPE_OPS_KEY = "YOUR_SCRAPE_OPS_API_KEY";

function getScrapeOpsUrl(url) {
  const payload = {
    api_key: SCRAPE_OPS_KEY,
    url: url,
  };

  const queryStr = new URLSearchParams(payload).toString();
  return `https://proxy.scrapeops.io/v1/?${queryStr}`;
}

(async () => {
  // Launch browser, context and page
  const browser = await playwright.chromium.launch();
  const context = await browser.newContext();
  const page = await context.newPage();

  // Navigate to the page and take a screenshot
  const url = getScrapeOpsUrl("https://www.zillow.com/");
  await page.goto(url);
  await page.screenshot({ path: "zillow-proxied.png" });

  // Close the browser
  await browser.close();
})();

  • We import Playwright using require()
  • We create a getScrapeOpsUrl function that appends our API key to the target URL and returns a ScrapeOps proxy URL
  • We load that URL with Playwright and take a screenshot

Now you should see an actual page on Zillow: Zillow Homepage


Understanding PerimeterX

For our purposes of web scraping, PerimeterX is designed to recognize and block automated visitors (like bots and scrapers). We don't know exactly what goes on behind the scenes, but we can assume they use a number of common practices to identify and block bot traffic:

  • Behavioral Analysis
  • Fingerprinting
  • CAPTCHA challenges
  • IP Monitoring
  • Header Analysis
  • Session Tracking
  • DNS Monitoring

Next, we'll look at what these methods are and how they work so we can get a better understanding of how to bypass them.


How PerimeterX Detects Web Scrapers and Prevents Automated Access

Services like PerimeterX utilize a variety of server-side and client-side technologies to detect and block bots. Let's discuss them in more detail.

Server Side

  • Behavioral Analysis: On the server side, sites can track and analyze the behavior of their visitors. For example, if a visitor navigates to a large number of pages very quickly, they might be a bot. Similarly, if a visitor is clicking buttons extremely quickly, they're also probably a bot.
  • IP Monitoring: PerimeterX likely also monitors the IP addresses that traffic comes from. If one address is producing a lot of traffic, they may block or throttle that traffic to help defend against DDoS attacks and large-scale scraping.
  • Header Analysis: When visiting a website, your browser sends a variety of headers. These headers include information about the visitor, such as language and referring website, as well as information about the browser itself (Chrome, Firefox, etc.).
  • DNS Monitoring: Sites also often monitor the domains that requests come from. If a domain is recognized as malicious or heavily automated, it will usually be blocked.

Client Side

  • CAPTCHA: CAPTCHAs are likely the most common client-side bot prevention. These are the "are you a robot?" popups. They are embedded in the website code and must be solved by the client. If your scraper is being blocked by CAPTCHAs, it is an immediate indication that you are being identified as a bot. There are ways to solve CAPTCHAs as a bot, but it is usually more worthwhile to avoid them altogether.
  • Fingerprinting: To track user behavior, websites assign each visitor a fingerprint based on attributes of their browser and device. This feeds metrics into things like behavioral analysis to identify bot users.
  • Session Tracking: Similar to fingerprinting, websites might place a cookie in your browser to identify you. The cookie can be used to track your actions around the site.

How to Bypass PerimeterX

Because PerimeterX tracks and analyzes IP addresses, it is important to use IPs that are identified as either residential or mobile. This is easy if your scraper is running on a local machine on a home network, but for production, most scrapers will be running on a server in the cloud.

When scraping from the cloud, if the scraper isn't configured correctly, the website will easily identify the traffic coming from a datacenter and block it instantly.

An ideal scraper will also rotate through real browser headers, either by running a headed (GUI) browser or by setting realistic fake user agents.

It's worth noting: you're actually more likely to be detected when using a headless browser. Headless browsers often leak signals that websites can pick up on.
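
As a simplified illustration (not PerimeterX's actual detection code), one such signal is the navigator.webdriver flag, which is typically true in a default automated Chromium launch:

const playwright = require("playwright");

(async () => {
  // Launch Chromium with default (headless) settings
  const browser = await playwright.chromium.launch();
  const page = await browser.newPage();

  // Evaluate the flag in the page context, the same way a detection script could
  const isAutomated = await page.evaluate(() => navigator.webdriver);
  console.log("navigator.webdriver:", isAutomated); // typically true in a default launch

  await browser.close();
})();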

Use Residential and Mobile IPs

As previously mentioned, scrapers should use residential or mobile proxies to avoid exposing their real (datacenter) IP. The ScrapeOps Proxy Aggregator rotates through the best proxies in the industry to get you a good IP for every request.
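
If you manage your own proxies instead of using a proxy API, Playwright can route browser traffic through a proxy at launch time. Here's a minimal sketch; the proxy server, username, and password below are placeholders for whichever residential or mobile provider you use.

const playwright = require("playwright");

(async () => {
  // Route all browser traffic through a residential proxy
  // (the server, username and password below are placeholders)
  const browser = await playwright.chromium.launch({
    proxy: {
      server: "http://residential-proxy.example.com:8000",
      username: "YOUR_PROXY_USERNAME",
      password: "YOUR_PROXY_PASSWORD",
    },
  });
  const page = await browser.newPage();

  await page.goto("https://www.zillow.com");
  await browser.close();
})();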

Rotate Real Browser Headers

If you can get away with running your browser in headless mode, you should still be using normal browser headers. Specifically, the User Agent should match the browser you are automating.

If your scraper is using a specific version of Chrome, your headers should match that same version of Chrome's headers.
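
As a rough sketch of what that looks like in Playwright, you can set the user agent and matching companion headers on the browser context. The User-Agent string and Accept-Language value below are examples only; in practice they should match the real browser build you are automating.

const playwright = require("playwright");

(async () => {
  const browser = await playwright.chromium.launch();

  // Example values only: keep the User-Agent and companion headers
  // consistent with the browser build you are actually automating
  const context = await browser.newContext({
    userAgent:
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    extraHTTPHeaders: {
      "Accept-Language": "en-US,en;q=0.9",
    },
  });
  const page = await context.newPage();

  await page.goto("https://www.zillow.com");
  await browser.close();
})();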

Use a Headed Browser

If you don't want to deal with managing user agents, you can use a headed browser instead. This replicates a normal browsing experience as closely as possible and makes your traffic look much more legitimate.
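
Here's a minimal sketch of a headed launch; the only change from the earlier examples is passing headless: false to chromium.launch().

const playwright = require("playwright");

(async () => {
  // headless: false opens a visible browser window
  const browser = await playwright.chromium.launch({ headless: false });
  const page = await browser.newPage();

  await page.goto("https://www.zillow.com");
  await page.screenshot({ path: "zillow-headed.png" });

  await browser.close();
})();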


How to Bypass PerimeterX with Playwright

PerimeterX is very sophisticated, but we can try a few different ways to get past it. We'll spend the rest of this article trying to get past the block screen we showed earlier on Zillow.

We will try the following tactics:

  • Fortified Scraper
  • puppeteer-extra-plugin-stealth (with Playwright)
  • ScrapeOps Proxy Aggregator

Option 1: Fortify the Browser

One of the primary ways websites will detect if you're a bot is by checking browser headers.

const playwright = require("playwright");

const userAgents = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36",
  "Mozilla/5.0 (iPhone; CPU iPhone OS 14_4_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Mobile/15E148 Safari/604.1",
  "Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1)",
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/87.0.664.75",
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18363",
];

(async () => {
  // Launch browser, context and page
  const browser = await playwright.chromium.launch();
  const context = await browser.newContext({
    userAgent: userAgents[Math.floor(Math.random() * userAgents.length)],
  });
  const page = await context.newPage();

  // Navigate to the page and take a screenshot
  await page.goto("https://www.whatismybrowser.com");
  await page.screenshot({ path: "what-is-fortified.png" });

  // Close the browser
  await browser.close();
})();

The above code does the following:

  • Launch Playwright (headless by default)
  • Set the user agent of the context in the call to browser.newContext
  • Navigate to the website using page.goto
  • Screenshot the page with page.screenshot

Which produces the following image:

What Is My Browser with Fake User Agent

Option 2: Use puppeteer-extra-plugin-stealth

You may be familiar with puppeteer-extra-plugin-stealth, and you'd be right to assume it was made for Puppeteer. Fortunately, there is a drop-in package that makes it compatible with Playwright. The plugin automates a lot of detection prevention.

To install the plugins, use this command:

npm i playwright-extra puppeteer-extra-plugin-stealth

Now see this example using the plugin:

const { chromium } = require("playwright-extra");
const stealth = require("puppeteer-extra-plugin-stealth")();

(async () => {
  // Enable the stealth plugin
  chromium.use(stealth);

  // Launch browser, context and page
  const browser = await chromium.launch();
  const context = await browser.newContext();
  const page = await context.newPage();

  // Navigate to the page and take a screenshot
  await page.goto("https://www.whatismybrowser.com");
  await page.screenshot({ path: "what-is-stealth.png" });

  // Close the browser
  await browser.close();
})();

Notice the following changes:

  • We now use playwright-extra instead of playwright to import the browser.
  • We import the puppeteer-extra-plugin-stealth plugin
  • We use chromium.use() to use the stealth plugin.
  • We no longer handle our own user agents
  • We screenshot the same whatismybrowser page.

Here is our result with the plugin: whatismybrowser stealth plugin

As you probably noticed in the screenshot, whatismybrowser was unable to detect any abnormalities with the browser. Stealth tends to do a far better job at covering up our browser leaks.

Option 3: Use the ScrapeOps Proxy Aggregator

This option is by far the easiest and most reliable. The ScrapeOps Proxy Aggregator rotates through the best proxies to guarantee you are using a good address, and it makes the request to the website for you so you don't have to worry about headers, fingerprints, and so on.

Here's an example using the Proxy Aggregator:

const playwright = require("playwright");

const SCRAPE_OPS_KEY = "YOUR_SCRAPE_OPS_API_KEY";

function getScrapeOpsUrl(url) {
  const payload = {
    api_key: SCRAPE_OPS_KEY,
    url: url,
  };

  const queryStr = new URLSearchParams(payload).toString();
  return `https://proxy.scrapeops.io/v1/?${queryStr}`;
}

(async () => {
  // Launch browser, context and page
  const browser = await playwright.chromium.launch();
  const context = await browser.newContext();
  const page = await context.newPage();

  // Navigate to the page and take a screenshot
  const url = getScrapeOpsUrl("https://www.whatismybrowser.com");
  await page.goto(url);
  await page.screenshot({ path: "whatis-proxied.png" });

  // Close the browser
  await browser.close();
})();

We have to write a bit more code, but this method is actually a lot easier because we no longer have to manage proxies or headers ourselves.

  • We use the default playwright package
  • We make a getScrapeOpsUrl function to create URLs
  • We launch the browser as usual with chromium.launch()
  • We create a new page
  • We navigate to the ScrapeOps proxy URL using page.goto
  • Finally we take a screenshot and close the browser.

Here's the result on whatismybrowser: whatismybrowser using ScrapeOps Proxy Aggregator


Bypassing PerimeterX on Zillow

Now that we've learned all that, let's apply it. Zillow uses PerimeterX to stop bot traffic. We're going to try scraping Zillow with these methods to see what works.

Here's what Zillow looks like from my browser. We'll be trying to get a similar result from our code:

Zillow Homepage

This is what it looks like when you get blocked; this is what we're avoiding: Zillow Blocked

Zillow with a Fortified Browser

We'll be mostly reusing the code we wrote for Option 1 earlier. Here's what it looks like:

const playwright = require("playwright");

const userAgents = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36",
  "Mozilla/5.0 (iPhone; CPU iPhone OS 14_4_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Mobile/15E148 Safari/604.1",
  "Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1)",
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/87.0.664.75",
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18363",
];

(async () => {
  // Launch browser, context and page
  const browser = await playwright.chromium.launch();
  const context = await browser.newContext({
    userAgent: userAgents[Math.floor(Math.random() * userAgents.length)],
  });
  const page = await context.newPage();

  // Navigate to the page and take a screenshot
  await page.goto("https://zillow.com");
  await page.screenshot({ path: "zillow-fortified.png" });

  // Close the browser
  await browser.close();
})();

Here's the result: Zillow Fortified Browser Result

Surprisingly, this simple method is successful. Even so, it is only a matter of time before you are detected or this bypass becomes obsolete, so don't be surprised if this option eventually fails.

Zillow with puppeteer-extra-plugin-stealth

Once again, we're reusing the code we already wrote, but pointed at Zillow. Here it is for reference:

const { chromium } = require("playwright-extra");
const stealth = require("puppeteer-extra-plugin-stealth")();

(async () => {
  // Enable the stealth plugin
  chromium.use(stealth);

  // Launch browser, context and page
  const browser = await chromium.launch();
  const context = await browser.newContext();
  const page = await context.newPage();

  // Navigate to the page and take a screenshot
  await page.goto("https://www.zillow.com");
  await page.screenshot({ path: "zillow-stealth.png" });

  // Close the browser
  await browser.close();
})();

Here is the result: Zillow with puppeteer stealth

As you can see, it worked. But that doesn't mean it always will. The plugin is open source, so it's a constant battle between the maintainers and anti-bot services to stay ahead of each other!

Zillow with ScrapeOps Proxy Aggregator

Finally, we'll try to load Zillow with the ScrapeOps Proxy Aggregator from Option 3.

Here's our code:

const playwright = require("playwright");

const SCRAPE_OPS_KEY = "YOUR_SCRAPE_OPS_API_KEY";

function getScrapeOpsUrl(url) {
  const payload = {
    api_key: SCRAPE_OPS_KEY,
    url: url,
  };

  const queryStr = new URLSearchParams(payload).toString();
  return `https://proxy.scrapeops.io/v1/?${queryStr}`;
}

(async () => {
  // Launch browser, context and page
  const browser = await playwright.chromium.launch();
  const context = await browser.newContext();
  const page = await context.newPage();

  // Navigate to the page and take a screenshot
  const url = getScrapeOpsUrl("https://www.zillow.com/");
  await page.goto(url);
  await page.screenshot({ path: "zillow-proxied.png" });

  // Close the browser
  await browser.close();
})();

And here's the result: Zillow with ScrapeOps Proxy Aggregator

And of course, this one works, no surprise. It's also the most likely to keep working, since the ScrapeOps team maintains the API and ensures it uses the best detection-avoidance techniques!


Conclusion

In summary, the most effective way to bypass PerimeterX is to use a reliable proxy. While all of our methods succeeded in this case study, results like these aren't guaranteed, so using a proxy remains the best practice.

If you'd like to learn more about Playwright, check out the official Playwright Documentation.


More Playwright Guides

If you would like to master web scraping with Playwright, make sure to check out our Playwright Web Scraping Playbook.

Take a look at some of our other case studies here on ScrapeOps: