How To Bypass Cloudflare with Puppeteer
With nearly 20% of internet traffic flowing through Cloudflare, it stands out as one of the most reliable and effective methods for identifying and mitigating bot activity on websites. When it comes to web scraping and automation, bypassing Cloudflare presents a significant challenge.
In this article, we will see how to bypass Cloudflare Bot Management with Puppeteer.
- TLDR: How to Bypass Cloudflare with Puppeteer
- How Does Cloudflare Detect Bots?
- How to Bypass Cloudflare with Puppeteer
- Alternative Approaches
- Case Study: Bypassing Cloudflare on petsathome.com
- Conclusion
- More Web Scraping Guides
TLDR: How to Bypass Cloudflare with Puppeteer
Cloudflare employs various strategies to distinguish between HTTP requests from bots and those from genuine users.
Let's delve into what happens when we attempt to navigate to g2.com, a Cloudflare-protected site, and assess if Puppeteer gets detected.
To begin, set up Node.js and install Puppeteer using `npm`, adding `"type": "module"` to the `package.json` file to enable the use of ES6 import statements.
Create a new file, `index.js`, and write the following Puppeteer code to navigate to g2.com and capture a screenshot:
import puppeteer from 'puppeteer';
(async () => {
const browser = await puppeteer.launch({
headless: "new",
defaultViewport: { width: 1080, height: 720 }
});
const page = await browser.newPage();
await page.goto('https://www.g2.com/products/asana/reviews', {
waitUntil: 'networkidle2'
});
await page.screenshot({ path: 'cloudflare-protected-site.png' });
await browser.close();
})();
In the script above:
- We launch the browser in headless mode using headless: "new", navigate to g2.com using the goto() method, and capture a screenshot of the visible viewport using the screenshot() method.
- Upon execution, you'll find an image named "cloudflare-protected-site.png" in the current directory, displaying a "Sorry, you have been blocked" message.
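Rather than opening the screenshot by hand, a script can check the page text for Cloudflare's block or challenge wording. The following is a minimal sketch under stated assumptions: the helper name `looksBlockedByCloudflare` and the marker list are illustrative, built from the "Sorry, you have been blocked" message above and the "Just a moment..." title that Cloudflare's interstitial challenge page commonly shows. It is a heuristic, not an official detection method.

```javascript
// Hypothetical helper: the markers are illustrative, not an exhaustive or official list.
const BLOCK_MARKERS = [
  'sorry, you have been blocked', // message shown on the block page
  'just a moment',                // title commonly shown on the challenge page
];

// Returns true when the title or body text contains one of the markers.
function looksBlockedByCloudflare(title, bodyText) {
  const haystack = `${title}\n${bodyText}`.toLowerCase();
  return BLOCK_MARKERS.some((marker) => haystack.includes(marker));
}

// Usage inside a Puppeteer script (page is an open Puppeteer Page):
//   const blocked = looksBlockedByCloudflare(
//     await page.title(),
//     await page.evaluate(() => document.body.innerText)
//   );
```

Keeping the check as a pure function of two strings makes it easy to reuse after every navigation, regardless of which bypass method is in use.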
Let's explore a quick fix to our problem and attempt to bypass Cloudflare.
We'll use puppeteer-real-browser, a library that mimics the behavior of a real browser so that Puppeteer is less likely to be identified as a bot by services like Cloudflare, and that can pass Cloudflare's captchas automatically.
Install it via `npm`:
npm i puppeteer-real-browser
If you're using Linux, you'll also need to install xvfb for `puppeteer-real-browser` to function:
sudo apt-get install xvfb
Now, let's make some adjustments to our previous example and observe if using `puppeteer-real-browser` makes any difference:
import { connect } from 'puppeteer-real-browser';
(async () => {
const { page } = await connect({
headless: 'auto',
fingerprint: true,
turnstile: true,
tf: true,
});
await page.goto('https://www.g2.com/products/asana/reviews', {
waitUntil: "networkidle2"
});
await page.waitForTimeout(15000);
await page.screenshot({ path: 'puppeteer-real-browser.png' });
await page.close();
})();
- This script successfully bypasses the Cloudflare waiting room and automatically handles the captcha because we passed `turnstile: true` to the connect() method of `puppeteer-real-browser`.
- A screenshot named "puppeteer-real-browser.png" will appear in the current working directory, showing the G2 reviews page instead of the block message.
We have successfully bypassed Cloudflare protection using `puppeteer-real-browser`.
How Does Cloudflare Detect Bots?
Cloudflare's Bot Management is a suite of tools and techniques designed to protect websites and web applications from automated traffic, including both legitimate bots and malicious bots.
Bypassing Cloudflare is extremely difficult because:
- Cloudflare employs a range of security measures, including bot management, DDoS protection, and a web application firewall (WAF), making it challenging to bypass.
- It operates a vast global network of data centers, which helps distribute traffic geographically and mitigate DDoS attacks.
- It continuously monitors traffic patterns and behaviors to detect and mitigate threats in real-time.
- Cloudflare's bot management employs advanced techniques, such as machine learning algorithms and heuristics, to accurately identify and block malicious bots.
- It regularly updates its security measures and bot management algorithms to adapt to evolving threats and new attack techniques.
Here are several methods through which Cloudflare's bot management distinguishes between bots and genuine users:
- User-Agent Strings: These strings contain information about the client’s operating system, browser type, version, and other relevant details. Cloudflare analyzes them to identify the source of a request.
- JavaScript Challenge: Cloudflare sends JavaScript code to the client that is designed to be solvable only by a real browser.
- Rate Limiting: Cloudflare looks for abnormal request patterns (for example, more requests in a short period than a human could plausibly send) and blocks the suspicious clients.
- Browser Fingerprinting: Cloudflare analyzes different attributes of the device, such as screen size, browser type, and installed plugins to see if the request originates from a real browser.
- CAPTCHAs: Cloudflare uses CAPTCHAs that are designed to be solvable only by humans.
- Event Tracking: Cloudflare adds event listeners to webpages so that it can monitor user actions like mouse movements, clicks, and key presses.
- Proxy Quality: Cloudflare computes an IP address reputation score for the IP addresses you use to send requests.
- HTTP Browser Headers: Cloudflare analyzes the HTTP headers you send with your requests and compares them to a database of known browser header patterns to find inconsistencies.
- TLS & HTTP/2 Fingerprints: Every HTTP client generates a static TLS and HTTP/2 fingerprint that Cloudflare can use to determine if the request is coming from a real user or a bot.
- Browser-Specific APIs: If the `window.chrome` API doesn't exist in a browser that claims to be Chrome, it is a strong sign that you are using a headless browser.
- Environment APIs: If the `navigator.userAgent` and `navigator.platform` values are mismatched, Cloudflare will know that this request is automated.
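To make the last check concrete, here is a hedged sketch of how a User-Agent and platform mismatch could be detected. This is not Cloudflare's actual logic, which is not public; the function name `isPlatformConsistent` and the mapping table are assumptions made for illustration.

```javascript
// Hypothetical mapping: an OS hint found in the User-Agent, paired with the
// navigator.platform prefixes a real browser on that OS would normally report.
const UA_TO_PLATFORM = [
  { uaHint: 'Windows NT', platformPrefixes: ['Win'] },
  { uaHint: 'Macintosh', platformPrefixes: ['Mac'] },
  { uaHint: 'Android', platformPrefixes: ['Linux', 'Android'] }, // listed before Linux: Android UAs also contain "Linux"
  { uaHint: 'Linux', platformPrefixes: ['Linux'] },
];

// Returns false when the User-Agent claims one OS and platform reports another.
function isPlatformConsistent(userAgent, platform) {
  const rule = UA_TO_PLATFORM.find(({ uaHint }) => userAgent.includes(uaHint));
  if (!rule) return true; // unknown OS hint: nothing to compare against
  return rule.platformPrefixes.some((prefix) => platform.startsWith(prefix));
}

// Usage inside a Puppeteer script:
//   const { ua, platform } = await page.evaluate(() => ({
//     ua: navigator.userAgent,
//     platform: navigator.platform,
//   }));
//   console.log(isPlatformConsistent(ua, platform));
```

Running a check like this against your own bot before pointing it at a protected site helps catch the case where a spoofed Windows User-Agent is sent from a Linux server.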
How to Bypass Cloudflare with Puppeteer
While Cloudflare can indeed be bypassed, the challenge lies in the fact that any discovered workaround may become obsolete over time.
This is due to Cloudflare's proactive approach in continually seeking out and addressing new vulnerabilities within its security framework.
Nonetheless, such methods do exist, one of which was demonstrated earlier in the section where we utilized `puppeteer-real-browser`.
Now, let's explore various other techniques for bypassing Cloudflare as we design our Puppeteer bots.
Method 1: Use Stealth Plugin
As we’ve discussed previously, services like Cloudflare scrutinize browser fingerprint differences to distinguish between headless browsers (often used by bots) and real browsers.
Puppeteer's headless Chrome also exhibits differences from a regular Chrome installation. For instance, its user agent contains the string `"Headless"`, and it lacks common extensions and plugins (such as a PDF viewer).
Detecting these differences is straightforward, and a Puppeteer bot can easily be blocked.
Fortunately, there’s a solution: the puppeteer-extra-plugin-stealth.
This plugin patches the properties associated with these fingerprint differences, making Puppeteer appear more like a regular browser. For example, in headless mode, properties like `navigator.mimeTypes` and `navigator.plugins` are empty.
The stealth plugin emulates these properties with functional mocks to mimic a typical Chrome browser.
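To illustrate what patching one such difference looks like, the sketch below rewrites the headless token in a user agent string. It covers only the user agent, which is a small part of what the stealth plugin handles, and the function name `stripHeadlessToken` is an assumption made for this example.

```javascript
// Rewrites "HeadlessChrome/<version>" to "Chrome/<version>" so the string
// follows regular Chrome's format. Strings without the token are returned unchanged.
function stripHeadlessToken(userAgent) {
  return userAgent.replace('HeadlessChrome/', 'Chrome/');
}

// Usage inside a Puppeteer script:
//   const originalUa = await browser.userAgent();
//   await page.setUserAgent(stripHeadlessToken(originalUa));
```

Patching the user agent alone leaves the other differences in place, which is why a plugin that patches many properties together is usually the more practical choice.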
Let’s explore how to use `puppeteer-extra-plugin-stealth` to mitigate browser fingerprint differences.
First, install `puppeteer-extra`, which extends Puppeteer with plugin functionality. Then, add `puppeteer-extra-plugin-stealth`:
npm i puppeteer-extra puppeteer-extra-plugin-stealth
Using the plugin in your Puppeteer code is straightforward. Import it and register it using the `use()` method. Here’s an example:
import puppeteer from 'puppeteer-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';
puppeteer.use(StealthPlugin());
(async () => {
const browser = await puppeteer.launch({
headless: "new",
defaultViewport: { width: 1080, height: 720 }
});
const page = await browser.newPage();
await page.goto('https://www.g2.com/products/asana/reviews', {
waitUntil: 'networkidle2'
});
await page.waitForTimeout(15000);
await page.screenshot({ path: 'stealth-plugin.png' });
await browser.close();
})();
- In the first section, we wrote the same code and encountered a block from Cloudflare because we hadn’t used the stealth plugin at that time.
- However, now that we’ve integrated it into our script, you’ll notice that we’re no longer getting blocked, as evidenced by the screenshot below:
While we successfully tricked Cloudflare into thinking we were a real browser, we still couldn’t bypass it entirely.
g2.com is using Cloudflare’s Turnstile product, which adds CAPTCHAs to websites. These CAPTCHAs require users to click on a checkbox to complete them.
Unfortunately, the stealth plugin alone can’t handle this. For such scenarios, other approaches like using puppeteer-extra-plugin-recaptcha and puppeteer-real-browser come into play.
We’ll explore how to use the latter in the next section.
If you would like to learn more about this plugin, check our extensive Puppeteer-Extra-Stealth Guide: Bypass Anti-Bots With Ease.
Method 2: Use Puppeteer-Real-Browser
Puppeteer-Real-Browser is a plugin designed to enhance the capabilities of Puppeteer. The plugin aims to make Puppeteer undetectable to services like Cloudflare, also enabling it to bypass CAPTCHAs and act like a real browser.
Here’s how it works:
- Fingerprinting: It injects a unique fingerprint ID into the page every time the browser is launched, making detection more difficult.
- Turnstile: It can automatically click on CAPTCHAs if set to true, thanks to Cloudflare Turnstile integration.
- Target Filter: Uses a target filter to avoid detection and allows you to specify which targets to allow.
- Custom Configuration: Allows for additional flags and configurations, such as specifying the browser path with executablePath.
To use `Puppeteer-Real-Browser`, you need to install it via npm, and if you’re on Linux, you also need to install `xvfb`:
npm i puppeteer-real-browser
sudo apt-get install xvfb
Here’s a code example that demonstrates its usage:
import { connect } from 'puppeteer-real-browser';
(async () => {
const { page } = await connect({
headless: 'auto',
fingerprint: true,
turnstile: true,
tf: true,
});
await page.goto('https://www.g2.com/products/asana/reviews', {
waitUntil: "networkidle2"
});
await page.waitForTimeout(15000);
await page.waitForSelector('.ws-pw > p:nth-child(1)');
const textContent = await page.evaluate(() => {
const paragraph = document.querySelector('.ws-pw > p:nth-child(1)');
return paragraph.textContent.trim();
});
console.log(textContent);
await page.close();
})();
// Why is Asana highly ranked across multiple G2 Grids for
// categories including Project Management, Work Management,
// ... More
- This script will successfully bypass the Cloudflare waiting room and automatically click on the CAPTCHA because we passed `turnstile: true` to the connect() method of `puppeteer-real-browser`.
- The `headless: 'auto'` option ensures that the script runs in the most stable mode for the operating system in use.
- The `fingerprint: true` setting helps to prevent the script from being identified as a bot.
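The fixed 15-second `waitForTimeout()` in the script above can be too long on a fast run and too short on a slow one. As a hedged alternative, the sketch below polls the page title until it no longer looks like Cloudflare's interstitial. The helper name `waitForChallengeToClear` is hypothetical, and the code assumes the interstitial title starts with "Just a moment" and that the target page sets a non-empty title of its own.

```javascript
// Polls the page title until it stops looking like the challenge page,
// or until timeoutMs elapses. Resolves to true if the challenge cleared.
async function waitForChallengeToClear(page, timeoutMs = 30000, pollMs = 500) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    let title = '';
    try {
      title = await page.title();
    } catch {
      // The page may be mid-navigation; treat that as "still waiting"
      title = '';
    }
    // An empty title is treated as "not loaded yet" (assumption stated above)
    if (title && !title.toLowerCase().startsWith('just a moment')) return true;
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
  return false;
}

// Usage inside a Puppeteer script, in place of the fixed wait:
//   const cleared = await waitForChallengeToClear(page);
//   if (!cleared) throw new Error('Still on the challenge page after 30 seconds');
```

The function only needs an object with an async `title()` method, so it works with pages from plain Puppeteer, puppeteer-extra, or puppeteer-real-browser.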
Method 3: Use Residential or Mobile Proxies
Residential and mobile proxies provide IP addresses associated with real users, making them less likely to be blocked by websites and help in bypassing Cloudflare.
Here’s how to integrate them into Puppeteer:
- Choose a Proxy Provider:
- Select a reliable proxy provider that offers residential or mobile proxies. Popular providers include Bright Data (formerly Luminati) and Oxylabs.
- Configure Puppeteer with Proxies:
- In your Puppeteer script, create a new browser instance using puppeteer.launch().
- Set the proxy server address and port with the `--proxy-server` flag in the `args` option. If the proxy requires authentication, supply the credentials with page.authenticate(), since the flag itself takes only the address and port.
Take a look at this example code:
import puppeteer from 'puppeteer';
const proxyAddress = 'your_proxy_address';
const proxyPort = 12345;
const browser = await puppeteer.launch({
args: [`--proxy-server=${proxyAddress}:${proxyPort}`],
});
- Rotate Proxies:
- To avoid rate limits and IP-based restrictions, consider rotating proxies for each request.
- Switch between different proxy servers dynamically using a pool of proxies.
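The rotation step can be sketched as a small round-robin helper that produces launch options for the next proxy in a pool. The names `createProxyRotator` and `nextLaunchOptions`, as well as the `example.com` proxy hosts, are placeholders invented for this illustration. Because the proxy is fixed when the browser launches, this approach rotates per browser instance rather than per individual request.

```javascript
// Returns a function that cycles through the pool in round-robin order
// and builds Puppeteer launch options for the selected proxy.
function createProxyRotator(proxies) {
  if (proxies.length === 0) throw new Error('proxy pool is empty');
  let index = 0;
  return function nextLaunchOptions() {
    const { host, port } = proxies[index];
    index = (index + 1) % proxies.length;
    return { args: [`--proxy-server=${host}:${port}`] };
  };
}

// Placeholder pool: replace with addresses from your provider.
const nextLaunchOptions = createProxyRotator([
  { host: 'proxy-a.example.com', port: 8000 },
  { host: 'proxy-b.example.com', port: 8000 },
]);

// Usage inside a Puppeteer script:
//   const browser = await puppeteer.launch(nextLaunchOptions());
```

Many providers also offer a single rotating gateway address that changes the exit IP on their side, in which case no client-side pool is needed.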
See our guide, Using Proxies With NodeJS Puppeteer, to learn more about configuring proxies in Puppeteer.
Method 4: Use Hosted Version of Puppeteer
When conventional Puppeteer setups encounter roadblocks in circumventing anti-bot defenses, turning to hosted solutions like BrightData’s Scraping Browser offers a robust alternative.
Scraping Browser is one of BrightData's proxy-unlocking solutions and is designed to help you easily focus on your multi-step data collection from browsers while they take care of the full proxy and unblocking infrastructure for you, including CAPTCHA solving.
You can access and navigate target websites via browser automation libraries such as Puppeteer, Playwright, and Selenium.
Here's a detailed guide on integrating it into your workflow:
- Sign Up for BrightData’s Scraping Browser: First, begin by creating an account on BrightData’s official website. This process typically involves providing basic information and agreeing to their terms of service.
- Install puppeteer-core: Once registered, proceed to install the puppeteer-core package using npm.
npm install puppeteer-core
- Initialize Puppeteer with Scraping Browser: After installation, it’s time to integrate Puppeteer with BrightData’s Scraping Browser. Follow these steps:
import puppeteer from 'puppeteer-core';
const SBR_WS_ENDPOINT = 'wss://brd-customer-hl_4f9f6b32-zone-scraping_browser1:h4nom7n8n1i2@brd.superproxy.io:9222';
(async () => {
const browser = await puppeteer.connect({
browserWSEndpoint: SBR_WS_ENDPOINT,
});
const page = await browser.newPage();
await page.goto('https://example.com');
await page.screenshot({ path: './page.png', fullPage: true });
await page.close();
})()
In the provided code snippet, replace the value of SBR_WS_ENDPOINT with the WebSocket endpoint for your own BrightData account. The credentials embedded in the URL (the part before the @) authenticate you to the Scraping Browser service.
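To avoid committing credentials to source control, the endpoint can be assembled from an environment variable. This is a minimal sketch: the helper name `buildScrapingBrowserEndpoint` and the variable name `BRIGHTDATA_AUTH` are assumptions made for this example, while the host and port are the ones shown in the snippet above.

```javascript
// Builds the Scraping Browser WebSocket URL from credentials kept outside the code.
// `auth` is expected in "username:password" form.
function buildScrapingBrowserEndpoint(auth, host = 'brd.superproxy.io:9222') {
  if (!auth) {
    throw new Error('Missing credentials: set BRIGHTDATA_AUTH to "username:password"');
  }
  return `wss://${auth}@${host}`;
}

// Usage inside a Puppeteer script (BRIGHTDATA_AUTH is a hypothetical variable name):
//   const browser = await puppeteer.connect({
//     browserWSEndpoint: buildScrapingBrowserEndpoint(process.env.BRIGHTDATA_AUTH),
//   });
```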
Method 5: Fortify Puppeteer Yourself
If you prefer to build your own fortified Puppeteer browser, follow these steps:
- Understand Common Browser Fingerprint Leaks:
- Anti-bot systems detect automated browsers based on various factors, including user agent strings, canvas fingerprinting, and JavaScript behavior.
- Research common fingerprint leaks and understand how they can be used to identify bots.
- Implement Anti-Fingerprinting Techniques:
- Modify the user agent string to mimic real browsers.
import puppeteer from 'puppeteer';
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36');
// Rest of your code
})();
- Disable or modify JavaScript features that leak information (e.g., WebGL, WebRTC).
import puppeteer from 'puppeteer';
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.evaluateOnNewDocument(() => {
    // Hide the automation flag exposed through navigator.webdriver
    Object.defineProperty(navigator, 'webdriver', {
      get: () => false,
    });
    // Illustrative WebRTC placeholder: defining window.RTCConfiguration on its own
    // does not change how RTCPeerConnection behaves
    const patchedRTCConfig = {
      iceServers: [{ urls: 'stun:stun.example.org' }],
    };
    Object.defineProperty(window, 'RTCConfiguration', {
      writable: false,
      value: patchedRTCConfig,
    });
  });
  // Rest of your code
})();
- Randomize canvas fingerprints by manipulating canvas properties.
import puppeteer from 'puppeteer';
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Override `toDataURL` so pages cannot read the real canvas contents.
  // Returning an empty string is the simplest stub; it does not randomize anything.
  await page.evaluateOnNewDocument(() => {
    HTMLCanvasElement.prototype.toDataURL = function() {
      return '';
    };
  });
  // Rest of your code
})();
- Handle Captchas and Rate Limits:
- Implement logic to solve captchas automatically (e.g., using third-party services or machine learning models).
- For handling captchas, you might need to integrate with a third-party service like AntiCaptcha or 2Captcha. Here's a basic example using puppeteer-extra-plugin-recaptcha with 2Captcha, the provider the plugin supports out of the box:
import puppeteer from 'puppeteer-extra';
import RecaptchaPlugin from 'puppeteer-extra-plugin-recaptcha';
puppeteer.use(
RecaptchaPlugin({
provider: { id: '2captcha', token: 'YOUR_2CAPTCHA_API_KEY' },
visualFeedback: true,
})
);
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('YOUR_URL_WITH_RECAPTCHA');
// Solve captcha
await page.solveRecaptchas();
// Rest of your code
})();
- Monitor rate limits and adjust request frequency accordingly.
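One common way to adjust request frequency is exponential backoff with jitter: after each rate-limited response, wait roughly twice as long as before, with some randomness so retries do not line up. The sketch below is an assumption-based example rather than a prescribed policy; the function name `backoffDelay` and its default values are illustrative.

```javascript
// Returns a delay in milliseconds for the given retry attempt (0-based).
// The ceiling doubles per attempt up to maxMs; the result falls between
// half the ceiling and the full ceiling ("equal jitter").
// `random` is injectable so the function can be tested deterministically.
function backoffDelay(attempt, baseMs = 1000, maxMs = 60000, random = Math.random) {
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(ceiling / 2 + random() * (ceiling / 2));
}

// Usage inside a Puppeteer script:
//   const response = await page.goto(url);
//   if (response && response.status() === 429) {
//     await new Promise((resolve) => setTimeout(resolve, backoffDelay(attempt)));
//   }
```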
Method 6: Leverage ScrapeOps Proxy to Bypass Cloudflare
ScrapeOps Proxy Aggregator is a powerful tool that allows you to bypass Cloudflare without the hassle of constantly updating your scripts with the latest bypass techniques. With ScrapeOps Proxy, you don’t have to worry about the maintenance of your scraping scripts against Cloudflare’s defenses.
Here’s a code example using ScrapeOps Proxy with Puppeteer:
import puppeteer from 'puppeteer';
const API_KEY = "YOUR-SUPER-SECRET-API-KEY";
function getScrapeOpsUrl(url) {
let payload = {
"api_key": API_KEY,
"url": url
};
const queryString = new URLSearchParams(payload).toString();
const proxy_url = `https://proxy.scrapeops.io/v1/?${queryString}`;
return proxy_url;
}
(async () => {
const browser = await puppeteer.launch({headless: true});
const page = await browser.newPage();
await page.goto(getScrapeOpsUrl("https://www.example.com"));
await page.screenshot({path: "bypass-cloudflare.png"});
await browser.close();
})();
Remember to replace 'YOUR-SUPER-SECRET-API-KEY' with your actual ScrapeOps API key. This code will help you access content from Cloudflare-protected websites.
For more information, visit the [ScrapeOps Proxy Aggregator](/proxy-aggregator/) page.
Alternative Approaches
In addition to the commonly used methods like Puppeteer-Real-Browser, using proxies, or employing Puppeteer's Stealth Plugin, there are alternative approaches worth knowing.
While these methods typically suffice for many scenarios, there are occasions where a more intricate and resilient strategy is necessary.
Therefore, here are some alternative approaches you might consider:
Alternative Approach 1: Send Requests to the Origin Server
In certain scenarios, accessing the origin server directly can circumvent Cloudflare's security measures. The origin server refers to the server where the website's content is originally hosted before being distributed through Cloudflare's CDN (Content Delivery Network).
- SSL Certificates:
- Other Tools:
- Several online tools and services such as CrimeFlare, SecurityTrails, and similar platforms can be utilized to uncover information about a website's origin server.
- These tools aggregate data from various sources, including DNS records, WHOIS databases, and historical records, to provide insights into the infrastructure behind a website.
// Directly access the origin server using its IP address
await page.goto('http://88.211.26.45/');
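By navigating directly to the IP address of the origin server, Puppeteer's requests never pass through Cloudflare's network, so its bot checks are not applied. This only works when the origin accepts direct connections; many sites restrict their origin to Cloudflare's IP ranges.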
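A server that hosts several sites usually picks the site from the requested hostname, so a bare-IP URL may return the wrong site or a certificate error. One possible workaround, sketched here on the assumption that Chrome's `--host-resolver-rules` switch is available in your Chromium build, is to keep the real hostname in the URL and map it to the origin IP at launch. The helper name `originResolverArgs` and the `www.example.com` hostname are hypothetical.

```javascript
// Builds Chrome launch args that resolve `hostname` to `originIp`
// without changing the URL the page navigates to.
function originResolverArgs(hostname, originIp) {
  return [`--host-resolver-rules=MAP ${hostname} ${originIp}`];
}

// Usage inside a Puppeteer script (hostname is a placeholder):
//   const browser = await puppeteer.launch({
//     args: originResolverArgs('www.example.com', '88.211.26.45'),
//   });
//   const page = await browser.newPage();
//   await page.goto('https://www.example.com/');
```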
Alternative Approach 2: Use the Google Cache version of web pages
Many websites protected by Cloudflare allow search engine crawlers like Googlebot to index their content. Consequently, Google maintains cached versions of these web pages, which can be accessed even when direct access to the website is restricted by Cloudflare.
Instead of accessing the website directly, one can retrieve the cached version of the web page from Google's servers.
await page.goto('https://webcache.googleusercontent.com/search?q=cache:https://www.petsathome.com/');
By fetching the cached version from Google's servers, Puppeteer can circumvent Cloudflare's protections and retrieve the content as it appeared during the last indexing by Googlebot.
However, there are some limitations to scraping the Google Cache version. The cached versions are often not the most up-to-date versions of the website and may lack dynamic content generated by JavaScript or server-side processes.
Case Study: Bypassing Cloudflare on PetsAtHome.com
PetsAtHome.com is another website protected by Cloudflare. In this case study, we will apply each of the Cloudflare bypass methods covered so far to PetsAtHome.com, see which ones work and which don't, and consider the possible reasons.
It's important to note that Cloudflare offers multiple plans and products, which websites select based on their needs and budget.
Consequently, a method effective on one Cloudflare-protected site may not necessarily work on another. Let's start:
Vanilla Puppeteer
Initially, let's observe the outcome of attempting to scrape PetsAtHome.com solely with Puppeteer, without any additional measures.
Consider the following code:
import puppeteer from 'puppeteer';
(async () => {
const browser = await puppeteer.launch({
headless: "new",
defaultViewport: { width: 1080, height: 720 }
});
const page = await browser.newPage();
await page.goto('https://www.petsathome.com/', {
waitUntil: 'networkidle2'
});
await page.waitForTimeout(15000);
await page.screenshot({ path: 'petsathome-v0.png' });
await browser.close();
})();
In this code snippet:
- We simply visit PetsAtHome.com, wait until the network is nearly idle (networkidle2 resolves once there are no more than two open network connections for at least 500 ms), and then wait an additional 15 seconds.
- However, executing this script results in being stuck in the Cloudflare waiting room, failing to redirect to the website, regardless of the waiting duration.
- Therefore, attempting to access a Cloudflare-protected site using Puppeteer either results in being blocked or being stuck in the virtual waiting room.
Google Cache Version
Now, let's utilize the Google Cache version of PetsAtHome.com. Having understood the concept of Google Cache, we proceed to write code to scrape the names of all categories from PetsAtHome.com:
import puppeteer from 'puppeteer';
(async () => {
const browser = await puppeteer.launch({ headless: "new" });
const page = await browser.newPage();
await page.goto('https://webcache.googleusercontent.com/search?q=cache:https://www.petsathome.com', {
waitUntil: 'networkidle2'
});
const categories = await page.evaluate(() => {
const titles = Array.from(document.querySelectorAll('p.title'));
return titles.map(t => t.textContent.trim());
});
console.log(categories);
await browser.close();
})();
This code snippet resembles the previous one; however, the URL is changed from PetsAtHome.com to the Google Cache version.
Upon execution, the script successfully prints the names of all categories to the console:
[
'Dog',
'Cat',
'Puppy',
'Kitten',
... More
]
Although it may appear that we have successfully scraped PetsAtHome.com, in reality, we have not retrieved the latest or live version of the website.
Instead, we have scraped a version cached or downloaded by Google, possibly several hours earlier. However, scraping the cached version eliminates concerns about Cloudflare protection and saves time.
Thus, if you are certain that the target website's content will remain unchanged and you do not require dynamic content, opting for the cached version is viable.
Stealth Plugin
In the previous methods, we relied solely on Puppeteer to scrape PetsAtHome.com through its cached and live URLs; the latter didn't work. Now let's see whether puppeteer-extra's Stealth Plugin helps. You have probably already installed `puppeteer-extra` and `puppeteer-extra-plugin-stealth`. If not, run this command in the terminal of your working directory:
npm i puppeteer-extra puppeteer-extra-plugin-stealth
The script that we are going to write resembles the vanilla Puppeteer one. The only changes are importing `"puppeteer-extra"` instead of `"puppeteer"` and registering the Stealth Plugin using the `use()` method. Here is the final script:
import puppeteer from 'puppeteer-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';
puppeteer.use(StealthPlugin());
(async () => {
const browser = await puppeteer.launch({
headless: "new",
defaultViewport: { width: 1080, height: 720 }
});
const page = await browser.newPage();
await page.goto('https://www.petsathome.com/', {
waitUntil: 'networkidle2'
});
await page.waitForTimeout(15000);
await page.screenshot({ path: 'petsathome-v1.png' });
await browser.close();
})();
Upon running this script, you will see that the Stealth Plugin didn't help in bypassing Cloudflare here: the navigation remained stuck in Cloudflare's waiting room. This doesn't mean the Stealth Plugin is useless. It patches many fingerprint leaks, but Cloudflare tends to find and close the loopholes that open-source bypass libraries rely on before those libraries can adapt.
Puppeteer-Real-Browser
Now, let's utilize puppeteer-real-browser and attempt to capture a screenshot of PetsAtHome.com to ascertain if it successfully bypasses Cloudflare.
Here is the script:
import { connect } from 'puppeteer-real-browser';
(async () => {
const { page } = await connect({
headless: 'auto',
fingerprint: true,
turnstile: true,
tf: true,
});
await page.goto('https://petsathome.com', {
waitUntil: "networkidle2"
});
await page.waitForTimeout(15000);
await page.screenshot({path: 'petsathome-v2.png'});
await page.close();
})();
Executing this script demonstrates that Puppeteer-real-browser successfully bypasses Cloudflare, as evidenced by the screenshot:
It looks like Puppeteer-Real-Browser has become the new Stealth Plugin now, thanks to the open-source community.
BrightData's Scraping Browser
Now let's try BrightData's Scraping Browser with puppeteer-core. We have already seen how to register for it and get a free trial. Now let's write the code:
import puppeteer from 'puppeteer-core';
const AUTH = 'brd-customer-hl_4f9f6b32-zone-scraping_browser1:h4nom7n8n1i2';
const SBR_WS_ENDPOINT = `wss://${AUTH}@brd.superproxy.io:9222`;
(async () => {
try {
const browser = await puppeteer.connect({
browserWSEndpoint: SBR_WS_ENDPOINT,
});
const page = await browser.newPage();
await page.goto('https://petsathome.com');
const html = await page.content();
console.log(html);
await page.close();
await browser.disconnect(); // Close the browser connection
} catch (err) {
console.error('An error occurred:', err);
}
})();
// <html lang="en-GB" class="pah"><head><meta http-equiv="Content-Type"
// content="text/html; charset=UTF-8"><base href="https://www.petsathome.com/">
// ... More
In the above script, we connect to BrightData's Scraping Browser using the `AUTH` and `SBR_WS_ENDPOINT` values you saw earlier, and it successfully fetches the HTML of PetsAtHome.com. Is it better than `puppeteer-real-browser`? In terms of speed, no: `puppeteer-real-browser` runs locally on our machine, while BrightData's browser is accessed over the internet, which adds latency.
BrightData's Scraping Browser is a proxy solution. BrightData operates a vast network of proxies worldwide, and requests made through the Scraping Browser are routed through that network. In the above script, we defined the authentication credentials (AUTH) and the WebSocket endpoint (SBR_WS_ENDPOINT) for the remote browser; the endpoint URL also includes the credentials.
Fortifying Puppeteer
Fortifying Puppeteer means finding every leak in the Puppeteer headless browser that can give away its presence to Cloudflare. These leaks include missing plugins, inconsistencies between the User-Agent, operating system, and browser, limitations in client-side JavaScript evaluation, and mismatched canvas properties.
Fortifying Puppeteer Headless Browser requires a nerd-like interest in Computer Science and a lot of time. But thanks to the open-source community, we have access to several ideas, some of which are hinted at below:
import puppeteer from 'puppeteer-extra';
import RecaptchaPlugin from 'puppeteer-extra-plugin-recaptcha';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';
import randomUseragent from 'random-useragent';
puppeteer.use(StealthPlugin());
puppeteer.use(
  RecaptchaPlugin({
    provider: { id: '2captcha', token: 'YOUR_2CAPTCHA_API_KEY' }, // Use 2Captcha provider
    visualFeedback: true,
  })
);
(async () => {
  const browser = await puppeteer.launch({
    headless: true,
    // Chrome takes the proxy address at launch; credentials are sent per page below
    args: ['--proxy-server=YOUR_PROXY_URL'],
  });
  const page = await browser.newPage();
  await fortifyBrowser(page);
  await handleIPRotation(page, 'YOUR_PROXY_USERNAME', 'YOUR_PROXY_PASSWORD');
  await handleCaptchas(page, 'YOUR_URL_WITH_RECAPTCHA');
  await emulateHumanBehavior(page);
  // Monitor page changes
  monitorPageChanges(page);
  // Rest of your code
})();
async function fortifyBrowser(page) {
  const userAgent = randomUseragent.getRandom();
  await page.setUserAgent(userAgent);
  await page.evaluateOnNewDocument(() => {
    // Hide the automation flag exposed through navigator.webdriver
    Object.defineProperty(navigator, 'webdriver', { get: () => false });
    // Illustrative WebRTC placeholder; this alone does not change RTCPeerConnection behavior
    const patchedRTCConfig = { iceServers: [{ urls: 'stun:stun.example.org' }] };
    Object.defineProperty(window, 'RTCConfiguration', { writable: false, value: patchedRTCConfig });
    // Stub canvas export so the real canvas contents cannot be read
    HTMLCanvasElement.prototype.toDataURL = () => '';
    // Additional anti-fingerprinting measures
    // Modify screen resolution, timezone, language, etc.
    // Example:
    Object.defineProperty(window.screen, 'width', { value: 1920 });
    Object.defineProperty(window.screen, 'height', { value: 1080 });
    Object.defineProperty(Intl.DateTimeFormat.prototype, 'resolvedOptions', {
      value: function () {
        return { timeZone: 'America/New_York' };
      }
    });
    // Add more properties as needed
  });
}
async function handleCaptchas(page, url) {
  await page.goto(url);
  await page.solveRecaptchas();
}
async function handleIPRotation(page, username, password) {
  // Answer the proxy's authentication challenge for this page.
  // Rotating IPs means relaunching with a different --proxy-server value.
  await page.authenticate({ username, password });
}
async function emulateHumanBehavior(page) {
  const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
  // window is not defined in Node, so read the size from Puppeteer's viewport instead
  const { width, height } = page.viewport() ?? { width: 800, height: 600 };
  const randomMouseMove = async () => {
    const randomX = Math.floor(Math.random() * width);
    const randomY = Math.floor(Math.random() * height);
    await page.mouse.move(randomX, randomY, { steps: 10 }); // Smooth mouse movement
    await delay(Math.random() * 500 + 500); // Random delay between 500ms and 1000ms
  };
  const randomTyping = async () => {
    const randomText = Math.random().toString(36).substring(2); // Generate random text
    await page.keyboard.type(randomText, { delay: Math.random() * 50 + 50 }); // Simulate typing speed variation
    await delay(Math.random() * 500 + 500); // Random delay between 500ms and 1000ms
  };
  // Simulate random mouse movements and typing
  setInterval(randomMouseMove, 2000); // Every 2 seconds
  setInterval(randomTyping, 3000); // Every 3 seconds
  // Additional user behavior variation, e.g., scrolling, hovering, etc.
  setInterval(async () => {
    await page.evaluate(() => {
      window.scrollBy(0, Math.random() * 100); // Random scrolling
    });
  }, 5000); // Every 5 seconds
}
function monitorPageChanges(page) {
  page.on('framenavigated', async () => {
    // Add logic to handle page changes
  });
}
You can get more ideas by looking into the source code of Stealth Plugin and puppeteer-real-browser on GitHub. After that, you can start fortifying Puppeteer yourself and keep improving it. The work is a cycle: find a flaw, patch it, test it, then find the next flaw and repeat. That is tedious to maintain, which is why services like ScrapeOps exist.
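One way to make the testing step of that cycle concrete is a heuristic check for whether the loaded page is still a Cloudflare challenge or block page. The sketch below is an assumption-laden illustration: looksLikeCloudflareBlock is a hypothetical helper, and the marker strings are page texts that Cloudflare may change at any time, so treat a match as a hint rather than proof.

```javascript
// Heuristic markers (assumptions; Cloudflare can change these texts).
const CHALLENGE_MARKERS = [
  'just a moment',                // title of the interstitial challenge page
  'sorry, you have been blocked', // block message seen earlier in this article
  'attention required',           // title of the classic block page
];

// Hypothetical helper: returns true if the page title or HTML contains
// any of the markers above (case-insensitive).
function looksLikeCloudflareBlock(title, html = '') {
  const haystack = `${title}\n${html}`.toLowerCase();
  return CHALLENGE_MARKERS.some((marker) => haystack.includes(marker));
}
```

With a real page, this could be called as `looksLikeCloudflareBlock(await page.title(), await page.content())`, for instance inside the framenavigated handler of monitorPageChanges. A legitimate page that happens to contain one of these phrases would be a false positive.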
ScrapeOps Proxy Solution
Now let's see whether the ScrapeOps Proxy Aggregator helps bypass Cloudflare protection when navigating to petsathome.com with Puppeteer.
For detailed instructions on integrating the ScrapeOps proxy into your Puppeteer script, see this guide. The following script has ScrapeOps incorporated:
import puppeteer from 'puppeteer';
// ScrapeOps proxy configuration
// (declared with const: assigning to undeclared names throws in an ES module)
const PROXY_USERNAME = 'scrapeops.headless_browser_mode=true';
const PROXY_PASSWORD = 'YOUR_API_KEY'; // <-- enter your API key here
const PROXY_SERVER = 'proxy.scrapeops.io';
const PROXY_SERVER_PORT = '5353';
(async () => {
const browser = await puppeteer.launch({
ignoreHTTPSErrors: true,
args: [
`--proxy-server=http://${PROXY_SERVER}:${PROXY_SERVER_PORT}`
]
});
const page = await browser.newPage();
await page.authenticate({
username: PROXY_USERNAME,
password: PROXY_PASSWORD,
});
try {
await page.goto('https://petsathome.com', {timeout: 180000});
const categories = await page.evaluate(() => {
const titles = Array.from(document.querySelectorAll('p.title'));
return titles.map(t => t.textContent.trim());
});
console.log(categories);
} catch(err) {
console.log(err);
}
await browser.close();
})();
// Dog
// Cat
// Puppy
// Kitten
// ... More
Running this script prints the scraped category names, which shows that the request got past the Cloudflare protection.
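The script above hardcodes the API key, which is easy to leak when the file is committed or shared. As a hedged sketch, the hypothetical helper below builds the same launch options and credentials from an environment-like object. The variable name SCRAPEOPS_API_KEY is an assumption for illustration, not something ScrapeOps mandates; the proxy host, port, and username are the ones used in the script above.

```javascript
// Hypothetical helper: builds Puppeteer launch options and proxy credentials
// from an env-like object such as process.env.
// SCRAPEOPS_API_KEY is an assumed variable name.
function buildScrapeOpsConfig(env) {
  const apiKey = env.SCRAPEOPS_API_KEY;
  if (!apiKey) {
    throw new Error('SCRAPEOPS_API_KEY is not set');
  }
  return {
    launchOptions: {
      ignoreHTTPSErrors: true,
      args: ['--proxy-server=http://proxy.scrapeops.io:5353'],
    },
    credentials: {
      username: 'scrapeops.headless_browser_mode=true',
      password: apiKey,
    },
  };
}
```

In the script, this would be used as `const { launchOptions, credentials } = buildScrapeOpsConfig(process.env);`, followed by `puppeteer.launch(launchOptions)` and `page.authenticate(credentials)`.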
Now, let's see how all these bypassing strategies differ from each other in terms of efficiency, speed, and ease of use:
Method | Speed | Efficiency | Ease of Use |
---|---|---|---|
Vanilla Puppeteer | Slow (Gets stuck) | Ineffective (Gets stuck in Cloudflare waiting room) | Easy (Simple implementation) |
Google Cache Version | Moderate (Depends on Google Cache) | Limited (Provides outdated content) | Easy (Modification of URL required) |
Puppeteer-real-browser | Fast (Real-time) | High (Bypasses Cloudflare effectively) | Moderate (Setup required, may require extra resources) |
BrightData's Scraping Browser | Competitive (Real-time) | High (Bypasses Cloudflare effectively) | Moderate (Setup required, cost involved) |
ScrapeOps Proxy Solution | Competitive (Real-time) | High (Bypasses Cloudflare effectively) | Moderate (Setup required, additional cost) |
Fortifying Puppeteer | Varies | High (Depends on implemented measures) | Moderate to High (Comprehensive setup required) |
Residential Proxies | Competitive (Real-time) | High (Provides real user IP addresses) | Moderate to High (Configuration and rotation required) |
Stealth Plugin | Varies | Limited (Doesn't make Puppeteer fully undetectable) | Easy to Moderate (Integration with Puppeteer) |
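The ratings in this table are qualitative. To compare methods on your own target site, a small timing wrapper can record how long each attempt takes and whether it succeeded. The sketch below is illustrative only: timeAttempt is a hypothetical helper, and the `attempt` callback stands in for whichever scraping function you want to measure.

```javascript
// Hypothetical helper: runs an async scraping attempt and records
// its name, whether it succeeded, and how long it took in milliseconds.
async function timeAttempt(name, attempt) {
  const start = Date.now();
  try {
    const result = await attempt();
    return { name, ok: true, ms: Date.now() - start, result };
  } catch (err) {
    // A thrown error (timeout, block page, etc.) counts as a failed attempt.
    return { name, ok: false, ms: Date.now() - start, error: err.message };
  }
}
```

Running the same wrapper several times per method gives a rough picture of both speed and success rate; a single run says little, since Cloudflare's responses can vary between requests.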
Final Recommendation:
- Puppeteer-real-browser remains the top choice for its speed, efficiency, and reliability in bypassing Cloudflare protection, making it suitable for real-time scraping tasks.
- Residential proxies offer an effective alternative for maintaining anonymity and avoiding detection, especially in scenarios where puppeteer-real-browser may not be feasible.
- ScrapeOps Proxy Solution provides a convenient option for integrating proxy services directly into Puppeteer, offering reliable access to Cloudflare-protected sites with minimal setup.
- Fortifying Puppeteer with additional measures like User-Agent modification, CAPTCHA handling, and human-like behavior emulation can significantly enhance its effectiveness in bypassing Cloudflare protection.
- Stealth Plugin can further improve Puppeteer's stealth capabilities, making it more adept at evading anti-bot measures and mimicking human behavior. Integrating the Stealth Plugin may require additional customization and testing to optimize its effectiveness.
Conclusion
While we've explored various techniques to bypass Cloudflare using Puppeteer, it's essential to recognize the formidable challenge posed by Cloudflare's extensive resources and dedicated teams. Continual research and experimentation are necessary due to Cloudflare's evolving defenses.
Fortunately, services like ScrapeOps simplify this process, albeit at a small cost, saving significant time and effort in discovering and maintaining bypass methods.
For more information, see the official Cloudflare and Puppeteer documentation.
More Web Scraping Guides
For further insights into Puppeteer, check out our extensive NodeJS Puppeteer Web Scraping Playbook.
You can also check these related articles:
- Using Proxies With NodeJS Puppeteer
- Bypass CAPTCHAs With Puppeteer
- How to Bypass DataDome with Puppeteer
- How to Bypass PerimeterX with Puppeteer