Puppeteer-Extra-Stealth Guide: Bypass Anti-Bots With Ease
Headless browsers are excellent for web scraping, especially on dynamic websites. However, they can also be misused, for example to artificially boost view counts and ad impressions with bots or automated scripts. Websites employ anti-bot measures to deter such misuse, which also hinders legitimate web scraping.
There are various methods to bypass these anti-bot detections. One such method is puppeteer-extra-plugin-stealth. This plugin's primary objective is to hide Puppeteer's headless state, making it appear as a regular browser by eliminating fingerprint differences between Chromium and standard Chrome.
In this guide, we'll walk you through:
- The Rise of Anti-Bot Technologies
- Common Methods to Bypass Bot Detection
- Anti-Bot Issues with Puppeteer Core
- Countermeasures for Enhanced Stealth
- What is Puppeteer-Extra-Stealth
- Puppeteer-Extra-Stealth Bypass Performance Vs Puppeteer
- Limitations of Stealth Plugin
- Alternative Bypassing Options
- Conclusion
TLDR: Using Puppeteer-Extra-Stealth
Integrating Puppeteer-Extra-Stealth into your scraper is straightforward.
Simply install puppeteer-extra and puppeteer-extra-plugin-stealth using NPM:
npm install puppeteer-extra puppeteer-extra-plugin-stealth
Then enable puppeteer-extra-plugin-stealth in your Puppeteer script:
const puppeteer = require('puppeteer-extra')
const StealthPlugin = require('puppeteer-extra-plugin-stealth')()
puppeteer.use(StealthPlugin)
// Your regular Puppeteer code goes here
The Rise of Anti-Bot Technologies
According to the Imperva Bad Bot Report 2023, roughly 30% of all internet traffic is attributed to bad bots. These bots are programmed to imitate human behavior for various ulterior motives and are commonly employed to artificially boost social media followers, views, and likes.
Website operators are well aware of this issue, leading them to deploy diverse bot detection methods. These defenses are designed to identify and counteract bot-related activities.
Let's explore some of the most effective anti-bot techniques:
- User Agent Detection: The User Agent (UA) contains information such as the browser's name, version, operating system, and device info, helping the server identify the client and tailor its responses. It is sent as an HTTP header with each request and, by default, reveals whether the request originates from a headless or headful browser. Websites use this data to distinguish between standard user requests and automated bot requests.
- IP Blocking: Bots often display unique traffic patterns, characterized by rapid and repetitive requests, which can be identified by monitoring request rates. In response, websites may block suspicious IPs from accessing them.
- CAPTCHAs: CAPTCHAs present challenges that are designed to be easy for humans but difficult for automated scripts or bots to solve. This helps differentiate between regular users and bots.
- Browser Fingerprinting: Browser fingerprinting involves collecting extensive data, including the user's device model, operating system, browser version, time zone, preferred language settings, ad blocker usage, screen resolution, and detailed technical specifications of their CPU, graphics card, and more. Using this information, websites can distinguish between bots and regular users.
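To make the IP-blocking idea concrete, here is a minimal sketch of how a server might flag an IP based on its request rate. It is an illustration only: the function name, the 10-second window, and the 20-request threshold are assumptions made for this example, not values used by any particular anti-bot product.

```javascript
// Hypothetical thresholds, chosen only for illustration.
const WINDOW_MS = 10000; // length of the sliding window in milliseconds
const MAX_REQUESTS = 20; // requests allowed per IP inside one window

// Maps an IP address to the timestamps (in ms) of its recent requests.
const recentRequests = new Map();

// Records one request and returns true when the IP exceeds the allowed rate.
// `now` is a parameter so the behavior can be checked with fixed timestamps.
function isSuspicious(ip, now = Date.now()) {
  const cutoff = now - WINDOW_MS;
  // Keep only the timestamps that still fall inside the window.
  const timestamps = (recentRequests.get(ip) || []).filter((t) => t > cutoff);
  timestamps.push(now);
  recentRequests.set(ip, timestamps);
  return timestamps.length > MAX_REQUESTS;
}
```

A scraper that fires dozens of requests per second from one address trips a check like this almost immediately, which is why request pacing and proxies matter.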
Common Methods to Bypass Bot Detection
There are various ways to bypass these types of bot detection methods. Here are some examples:
- Mimicking Human Behavior: By simulating human-like browsing patterns, including mouse movements, keystrokes, and scrolling behavior, it's possible to evade detection by anti-bot mechanisms that rely on detecting unnatural activity.
- Browser Fingerprinting Spoofing: Manipulating browser fingerprinting data, such as user-agent strings, plugins, and other identifiable information, helps present the bot's activity as that of a legitimate user, thereby eluding fingerprinting-based anti-bot measures.
- Custom User Agent: You can opt for a custom User Agent or keep a list of User Agents and pick one randomly each time. This way, websites will think that you're a legitimate user accessing them with a real browser. A legitimate Chrome UA looks like: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36".
- Proxies: Proxies serve as a middleman between the client requesting a resource and the server delivering it. They offer IP addresses from different geographic locations, making it more challenging for websites to detect unusual access patterns originating from a single location. Another technique is proxy chaining, which entails forwarding traffic from one proxy server to another.
- CAPTCHA Solvers: You can find services like 2captcha that automatically solve CAPTCHAs. These services employ both human operators for manual CAPTCHA solving and Optical Character Recognition (OCR) technology for automated CAPTCHA solving.
- Headless Browser Customization: Tailoring the configuration of headless browsers, for example with Puppeteer-Extra-Stealth, the main subject of this article, to mimic human-like browser attributes and interactions enhances the bot's ability to navigate through websites undetected.
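The custom User Agent approach can be sketched as follows. This is a minimal example for plain Puppeteer: the list of User Agent strings and the helper names are made up for illustration, and a real list would need to be kept up to date.

```javascript
// Illustrative desktop Chrome User Agent strings (assumed values).
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36',
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36',
];

// Picks one User Agent at random.
// `random` is injectable so the choice can be made deterministic when testing.
function pickUserAgent(agents = USER_AGENTS, random = Math.random) {
  return agents[Math.floor(random() * agents.length)];
}

// Opens a new page in an already launched Puppeteer browser and
// applies a randomly chosen User Agent before navigating.
async function openWithRandomUserAgent(browser, url) {
  const page = await browser.newPage();
  await page.setUserAgent(pickUserAgent());
  await page.goto(url);
  return page;
}
```

Note that this applies to plain Puppeteer. When the Stealth plugin's "user-agent-override" module is active, the guide's note about avoiding setUserAgent() applies.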
Anti-Bot Issues With Puppeteer Core
Despite its robust automation capabilities, Puppeteer runs into various challenges with anti-bot mechanisms when used for automated browsing.
The headless Chromium browser differs significantly in its implementation compared to standard Chrome. Puppeteer, bundled with Chromium, inherits all its limitations by default, such as:
- The Chromium build bundled with Puppeteer doesn't support proprietary audio/video formats like "AAC" and "H.264", leading to unexpected behavior when scraping pages that include audio and video elements.
- It lacks many native features found in standard browsers, such as extensions, add-ons, bookmarks, history, and password managers. The absence of these features can be a sign for detecting Puppeteer-based bots.
- In the case of a headless browser, the User Agent string typically contains something like "HeadlessChrome/61.0.3153.0". The window.navigator.userAgent property can be used to detect if the request originates from a headless browser, thereby revealing the presence of Puppeteer.
- The Canvas and WebGL APIs render images with subtle differences, which are a result of variations in image format, graphics processing engines, compression levels, and pixel-level settings across different operating systems and browsers. These minute differences can lead to the detection of bots.
- Bots may attempt to randomize mouse movements and clicking rhythms, but they struggle to imitate human behavior convincingly. Websites can detect bot usage by analyzing mouse dynamics and movements (whether non-linear, randomized, or patterned) and by identifying distinctive clicking rhythms.
- Regular browsers come with built-in plugins such as PDF viewers, which are absent in the headless browsers used for creating bots. The window.navigator.plugins property can reveal the absence of these plugins, potentially indicating the presence of bots.
- Bots often use fixed screen and viewport dimensions, whereas actual users have different screen sizes and resolutions. This difference can also expose the presence of automation.
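A few of the signals above can be checked with very little code. The following is a simplified sketch of the kind of check a website's script might run. The function name and signal labels are invented for this example, and real detection systems combine far more signals.

```javascript
// Inspects a navigator-like object and lists the automation hints it exposes.
// On a real page this would be called with `window.navigator`.
function detectHeadlessSignals(nav) {
  const signals = [];
  // Headless Chrome announces itself in the default User Agent string.
  if (/HeadlessChrome/.test(nav.userAgent || '')) signals.push('headless-user-agent');
  // Automated browsers expose navigator.webdriver === true.
  if (nav.webdriver === true) signals.push('webdriver-flag');
  // Regular desktop browsers usually report at least one plugin.
  if (!nav.plugins || nav.plugins.length === 0) signals.push('no-plugins');
  // An empty language list is unusual for a real user.
  if (!nav.languages || nav.languages.length === 0) signals.push('no-languages');
  return signals;
}
```

Passing an object that mimics a default headless session returns several labels, while an object that mimics a regular Chrome session returns an empty array.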
Countermeasures for Enhanced Stealth
Certain countermeasures can hide Puppeteer's presence to some extent. Some of these countermeasures include:
- Randomizing Intervals: Introduce variations in the time intervals between different interactions with the web application. For example, actions like clicking links, submitting forms, or navigating through pages can have a randomized delay between them to reduce automation detection.
- Randomizing Viewports: Bot detection can be reduced by using a wide range of screen and viewport sizes and resolutions, simulating the diversity of devices used by genuine users.
- Headful Browser Mode: Bots usually work with a headless browser, meaning they operate without a graphical user interface (GUI).
  - To make interactions with your web application more human-like, you can use a full, non-headless browser like Chrome.
  - Puppeteer can be configured to operate in headful mode by passing the option headless: false and setting executablePath to the Chrome installed on your operating system in the puppeteer.launch() method.
  - Puppeteer also provides a setUserAgent() method to use a custom User Agent that resembles a real browser.
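The countermeasures above can be combined in a few small helpers. This is a rough sketch rather than a recommended configuration: the delay range, the viewport list, and the helper names are assumptions, and the Chrome path passed to launchHeadful depends on your system.

```javascript
// Returns a random integer in [min, max].
// `random` is injectable so results can be made deterministic when testing.
function randomInt(min, max, random = Math.random) {
  return Math.floor(random() * (max - min + 1)) + min;
}

// Resolves after a random pause between minMs and maxMs (assumed defaults).
function randomDelay(minMs = 500, maxMs = 2500) {
  return new Promise((resolve) => setTimeout(resolve, randomInt(minMs, maxMs)));
}

// A few common desktop resolutions (illustrative list).
const VIEWPORTS = [
  { width: 1366, height: 768 },
  { width: 1536, height: 864 },
  { width: 1920, height: 1080 },
];

// Picks one of the viewports at random.
function randomViewport(random = Math.random) {
  return VIEWPORTS[randomInt(0, VIEWPORTS.length - 1, random)];
}

// Launches an installed Chrome in headful mode with a random viewport.
// `executablePath` must point to Chrome on your machine.
async function launchHeadful(executablePath) {
  // Loaded lazily so the helpers above can be used without Puppeteer installed.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch({ headless: false, executablePath });
  const page = await browser.newPage();
  await page.setViewport(randomViewport());
  return { browser, page };
}
```

Between interactions, a call such as await randomDelay() after page.click() spaces out the actions so they don't fire at machine speed.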
What is Puppeteer-Extra-Stealth
Puppeteer-Extra is a lightweight wrapper around Puppeteer that augments its functionality with plugins. It has a variety of plugins that can be installed individually for different purposes. For example, puppeteer-extra-plugin-recaptcha solves reCAPTCHAs and hCaptchas, while puppeteer-extra-plugin-adblocker blocks ads and trackers on websites.
The plugin that we are going to discuss here is the Puppeteer-Extra-Stealth plugin. As we discussed earlier, Puppeteer can be easily detected due to its subtle fingerprint differences from real Chrome browsers.
The puppeteer-extra-plugin-stealth plugin reduces these differences using various evasion modules to hide Puppeteer's presence. It is actively maintained and enhanced with new evasion modules by its open-source community as they encounter and address new bot detection challenges.
Installing Puppeteer-Extra-Stealth
Let's quickly set up everything you need to begin. Ensure you have a recent Node.js version installed. To install Puppeteer along with Puppeteer-Extra and Puppeteer-Extra-Plugin-Stealth, run this command:
npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
Once you've installed all the necessary dependencies, you're ready to proceed. Puppeteer-Extra provides the use() method to incorporate plugins into your Puppeteer script. Here's some sample code to kickstart your project:
const puppeteer = require('puppeteer-extra')
const StealthPlugin = require('puppeteer-extra-plugin-stealth')()
puppeteer.use(StealthPlugin)
// Your regular Puppeteer code goes here
Note that we used puppeteer-extra instead of puppeteer because it integrates with Puppeteer in the background, eliminating the need for an explicit require.
Evasion Techniques Employed by Puppeteer-Extra-Stealth
The Extra Stealth plugin includes a variety of built-in evasion modules, each designed to address specific types of bot detection.
Let's delve into the bot detection challenges that Puppeteer users face and see how the Stealth plugin overcomes them.
- User-Agent Override: The "user-agent-override" module fixes User Agent info, including the UA string, Accept-Language, platform, and UA hints. This is necessary because, by default, Puppeteer doesn't set the Accept-Language header in headless mode. This adjustment helps your User Agent appear as if it comes from a real browser. NOTE: Avoid using setUserAgent(), as it will reset the language and platform values that you've configured using this module.
- Languages Module: The "navigator.languages" module ensures that the navigator.languages property delivers custom languages when specified, or defaults to ['en-US', 'en'], to ensure consistent behavior and avoid revealing automation.
- Permissions Module: The "navigator.permissions" module customizes the behavior of navigator.permissions.query for the notifications permission to ensure that permission requests consistently mimic real browser behavior, regardless of the actual permissions state.
- WebGL Module: By default, Puppeteer exposes "Google Inc" as the WebGL vendor and "Google SwiftShader" as the renderer, which can potentially signal automation. The "webgl.vendor" module alters the WebGL vendor to "Intel Inc" and the renderer to "Intel(R) Iris(TM) Graphics 6100", eluding bot detection.
- Media Codecs Module: Puppeteer operates in headless mode with Chromium, which lacks support for proprietary media codecs. The "media.codecs" module adjusts specific variables to prevent leaving behind traces that might reveal the use of Puppeteer.
- Plugins Module: Web browsers typically include a range of media types (mimeTypes) and built-in plugins such as PDF viewers. However, Puppeteer, as a headless browser, lacks these features. The "navigator.plugins" module generates mimeTypes and plugins from scratch and incorporates them into Puppeteer, giving it the appearance of a genuine browser.
- SourceURL Module: SourceURL is a special comment added at the end of JavaScript code to indicate the URL or source file of the code's origin for debugging purposes. Puppeteer suffixes its sourceURL with __puppeteer_evaluation_script__, which can be detected by examining the call stack. The "sourceurl" module hides Puppeteer's presence by removing this suffix.
- Chrome Runtime Module: The "chrome.runtime" module emulates the chrome.runtime object, which is not available in Puppeteer. It provides a set of mock methods for chrome.runtime.connect and chrome.runtime.sendMessage, handling potential edge cases and ensuring that these methods behave in a way that resembles a real Chrome environment.
- Iframe Module: The "iframe.contentWindow" module tackles issues related to the detection of iframes in Puppeteer, mainly the iframe.contentWindow property. The code intercepts iframe creation events and augments the srcdoc property to ensure that calls to iframe.contentWindow behave correctly.
- OuterDimensions Module: The "window.outerDimensions" module resolves the absence of window.outerWidth and window.outerHeight in Puppeteer. It accomplishes this by configuring the viewport to match the window size, unless the user specifies otherwise.
The Stealth plugin incorporates several other subtle tweaks, such as deleting navigator.webdriver and mocking the chrome.csi, chrome.app, and chrome.loadTimes objects when they are unavailable.
For a comprehensive list of all the modules, you can visit Evasion Modules.
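To give a feel for what an evasion module does, here is a heavily simplified sketch of property spoofing. It is not the plugin's actual implementation: the function names are invented for this example, and the real modules handle many more edge cases, such as hiding the fact that a property was overridden.

```javascript
// Overrides two properties on a navigator-like object so that simple checks
// see values resembling a regular browser. Simplified illustration only.
function applyBasicEvasions(nav, languages = ['en-US', 'en']) {
  // Report a plausible language list.
  Object.defineProperty(nav, 'languages', {
    get: () => languages,
    configurable: true,
  });
  // Hide the automation flag.
  Object.defineProperty(nav, 'webdriver', {
    get: () => undefined,
    configurable: true,
  });
  return nav;
}

// Builds a script string that applies the overrides to the page's navigator.
function buildInitScript() {
  return `(${applyBasicEvasions.toString()})(navigator);`;
}

// Registers the script so it runs before any of the page's own scripts.
async function injectBasicEvasions(page) {
  await page.evaluateOnNewDocument(buildInitScript());
}
```

The key idea is timing: page.evaluateOnNewDocument() runs the script before the site's own JavaScript, so detection code only ever sees the patched values.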
Configuring Puppeteer-Extra-Stealth Evasion Modules
You can retrieve a list of available evasion modules using the availableEvasions property.
const puppeteer = require('puppeteer-extra')
const StealthPlugin = require('puppeteer-extra-plugin-stealth')()
puppeteer.use(StealthPlugin)
console.log(StealthPlugin.availableEvasions)
// Set(16) {
// 'chrome.app',
// 'chrome.csi',
// 'chrome.loadTimes',
// ... More
By default, all the evasion modules become active when you use StealthPlugin. However, you can selectively enable a subset of these modules by using the enabledEvasions.delete() method to remove the ones you don't require. Let's see how to do this:
const puppeteer = require('puppeteer-extra')
const StealthPlugin = require('puppeteer-extra-plugin-stealth')()
puppeteer.use(StealthPlugin)
StealthPlugin.enabledEvasions.delete('sourceurl');
StealthPlugin.enabledEvasions.delete('webgl.vendor');
console.log(StealthPlugin.enabledEvasions)
// Set(14) {
// 'chrome.app',
// 'chrome.csi',
// 'chrome.loadTimes',
// ... More
You can craft your own custom evasion modules using this template provided by the Puppeteer-Extra-Stealth developers. User-created modules should be placed in the "node_modules/puppeteer-extra-plugin-stealth/evasions" directory within your project to become available.
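One practical caveat when disabling modules: enabledEvasions is a Set, and Set.prototype.delete() silently ignores names it doesn't contain, so a typo such as 'webgl-vendor' would leave the module enabled without any warning. The small helper below guards against that. The helper names are hypothetical and are not part of the plugin's API.

```javascript
// Returns the names in `requested` that are not present in `available`.
function findUnknownEvasions(available, requested) {
  return requested.filter((name) => !available.has(name));
}

// Disables the given evasion modules on a stealth plugin instance,
// throwing if any name is not a known evasion.
function disableEvasions(stealth, names) {
  const unknown = findUnknownEvasions(stealth.availableEvasions, names);
  if (unknown.length > 0) {
    throw new Error(`Unknown evasion module(s): ${unknown.join(', ')}`);
  }
  for (const name of names) {
    stealth.enabledEvasions.delete(name);
  }
  return stealth;
}

// Usage (requires the packages to be installed):
// const puppeteer = require('puppeteer-extra')
// const stealth = require('puppeteer-extra-plugin-stealth')()
// puppeteer.use(disableEvasions(stealth, ['sourceurl', 'webgl.vendor']))
```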
Puppeteer-Extra-Stealth Bypass Performance Vs Puppeteer
SannySoft serves as a testing ground for assessing the efficiency of the Stealth Plugin in bypassing bot detection. When you take a screenshot of this website using Puppeteer without the Stealth Plugin, it highlights various detected fingerprints in red, clearly showing how Puppeteer-reliant bots can be easily identified. However, with the integration of the Stealth Plugin, nearly all the fingerprint tests pass successfully.
On the left side of the comparison, you'll observe a lot of red lines, marking areas where detection is prominent. However, on the right side, the Stealth Plugin, with the anti-detection techniques we previously discussed, prevails over these challenges.
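A comparison like this can be reproduced with a short script. The sketch below assumes the SannySoft test page lives at https://bot.sannysoft.com. The function names and output file names are made up for this example.

```javascript
// Assumed URL of the SannySoft bot-detection test page.
const TEST_URL = 'https://bot.sannysoft.com';

// Returns the output file name for each variant.
function screenshotPath(useStealth) {
  return useStealth ? 'with-stealth.png' : 'without-stealth.png';
}

// Takes a full-page screenshot of the test page, with or without the Stealth plugin.
async function captureTestPage(useStealth) {
  // Packages are loaded lazily, so only the variant being run needs to be installed.
  const puppeteer = useStealth ? require('puppeteer-extra') : require('puppeteer');
  if (useStealth) {
    puppeteer.use(require('puppeteer-extra-plugin-stealth')());
  }
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(TEST_URL, { waitUntil: 'networkidle2' });
    await page.screenshot({ path: screenshotPath(useStealth), fullPage: true });
  } finally {
    // Always close the browser, even if navigation fails.
    await browser.close();
  }
  return screenshotPath(useStealth);
}
```

Calling captureTestPage(false) and then captureTestPage(true) writes two images that can be compared side by side.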
Limitations of Stealth Plugin
The Stealth Plugin excels at staying hidden in various situations, but it's not entirely immune to detection. Advanced browser analysis, fingerprinting, and strict IP reputation standards can still uncover it.
Furthermore, websites increase their security through third-party services such as [Cloudflare](https://www.cloudflare.com/). Most open-source Cloudflare bypass techniques tend to work for only a few months before they become ineffective.
Cloudflare developers regularly update their systems to detect and counter new bypass methods as they arise.
Alternative Bypassing Options
There are paid approaches to bypass anti-bot systems such as Cloudflare, DataDome, and PerimeterX.
One highly effective option is to utilize the ScrapeOps Proxy Aggregator, which seamlessly combines more than 20 proxy providers into a single proxy API.
It helps you discover the best and most cost-effective proxy provider for your specific target domains.
Activating the ScrapeOps Cloudflare Bypass is straightforward: simply add bypass=cloudflare_level_1 to your API request, like this:
const axios = require('axios');

// ScrapeOps proxy API endpoint
const url = 'https://proxy.scrapeops.io/v1/';

axios
  .get(url, {
    params: {
      api_key: 'YOUR_API_KEY',
      url: 'http://example.com', // Cloudflare-protected website
      bypass: 'cloudflare_level_1',
    },
  })
  .then((response) => {
    console.log(response.data);
  })
  .catch((error) => {
    console.error(error);
  });
Cloudflare is the most common anti-bot system being used by websites today, and bypassing it depends on which security settings the website has enabled.
To combat this, we offer 3 different Cloudflare bypasses designed to solve the Cloudflare challenges at each security level.
Security Level | Bypass | API Credits | Description |
---|---|---|---|
Low | cloudflare_level_1 | 10 | Use to bypass Cloudflare protected sites with low security settings enabled. |
Medium | cloudflare_level_2 | 35 | Use to bypass Cloudflare protected sites with medium security settings enabled. On large plans the credit multiple will be increased to maintain a flat rate of $3.50 per thousand requests. |
High | cloudflare_level_3 | 50 | Use to bypass Cloudflare protected sites with high security settings enabled. On large plans the credit multiple will be increased to maintain a flat rate of $4 per thousand requests. |
Conclusion
Puppeteer serves as a valuable tool for controlling headless Chromium browsers in the realm of web scraping. However, it exhibits various implementation differences compared to the standard Chrome browser.
The Stealth Plugin offers several solutions to reduce these differences, employing numerous evasion modules that are frequently updated to counter new bot detections. However, it may fail to counter advanced bot-detection systems like Cloudflare.
In such cases, paid services like the ScrapeOps Proxy Aggregator come to the rescue by offering solutions that employ proxies to evade detection, helping your requests get past Cloudflare's bot detection mechanisms.
More Web Scraping Tutorials
If you would like to learn more about Web Scraping with Puppeteer, then be sure to check out The Puppeteer Web Scraping Playbook.