What is Puppeteer Extra - A Web Scrapers Guide

If you're serious about web scraping, privacy, or automation, Puppeteer Extra is worth learning. Its plugins can help you avoid bot detection, route traffic through proxies to hide your IP address and location, and speed up scraping by blocking resources you don't need.

This guide covers what Puppeteer Extra is, how to integrate it, the best plugins for web scraping and debugging, and a few advanced integrations.


What Is Puppeteer Extra?

Puppeteer Extra is an open-source framework that extends the functionality of Puppeteer by offering a rich ecosystem of plugins and custom features that empower developers to overcome various challenges encountered during web scraping and automation tasks.

These plugins provide solutions for common issues like bypassing anti-scraping measures, handling CAPTCHAs, and interacting with websites in a more human-like manner. For example, the puppeteer-extra-plugin-stealth plugin helps you avoid bot detection when crawling your target website.
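To make the plugin idea concrete, here is a toy model of a use()-style plugin registry. It is a sketch for illustration only, not puppeteer-extra's actual implementation: createExtra, fakeLauncher, and tagPlugin are hypothetical names, and the only thing it is meant to show is the general shape (register plugins first, then let each one hook into every page that gets created).

```javascript
// Toy model of a plugin registry (NOT puppeteer-extra's real code).
// createExtra, fakeLauncher and tagPlugin are hypothetical names.
function createExtra(launcher) {
  const plugins = [];
  return {
    // Register a plugin and return the wrapper so calls can be chained.
    use(plugin) {
      plugins.push(plugin);
      return this;
    },
    // Launch through the wrapped launcher, then let every plugin see each new page.
    async launch(options = {}) {
      const browser = await launcher.launch(options);
      const originalNewPage = browser.newPage.bind(browser);
      browser.newPage = async () => {
        const page = await originalNewPage();
        for (const plugin of plugins) {
          if (plugin.onPageCreated) await plugin.onPageCreated(page);
        }
        return page;
      };
      return browser;
    },
  };
}

// A stand-in launcher so the sketch runs without a real browser.
const fakeLauncher = {
  async launch() {
    return {
      async newPage() {
        return { tags: [] };
      },
    };
  },
};

// A hypothetical plugin that marks every page it sees.
const tagPlugin = (name) => ({
  async onPageCreated(page) {
    page.tags.push(name);
  },
});
```

Calling createExtra(fakeLauncher).use(tagPlugin('a')).use(tagPlugin('b')) and then opening a page would leave both tags on that page, in registration order.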

Puppeteer Extra, as an advanced web scraping framework, offers several advantages and disadvantages, which are important to consider when evaluating its suitability for specific projects.

Advantages of Puppeteer Extra

  1. Enhanced Functionality: Puppeteer Extra extends the capabilities of Puppeteer, providing a rich collection of plugins that facilitate tasks such as bypassing anti-scraping measures, handling captchas, and simulating human interactions, thereby enabling more comprehensive and sophisticated scraping processes.

  2. Flexibility: The framework's modular design allows users to customize their scraping workflows by integrating specific plugins according to their requirements, making it adaptable to a wide range of scraping scenarios and target websites.

  3. Community Support: Puppeteer Extra benefits from an active community of developers who contribute to the development of plugins, share insights, and provide assistance, creating a robust support network for users encountering challenges or seeking guidance.

  4. Integration with Puppeteer: Puppeteer Extra builds upon the foundation of Puppeteer, leveraging its core functionalities such as page manipulation, DOM interaction, and network monitoring, ensuring a seamless transition for users already familiar with Puppeteer.

Disadvantages of Puppeteer Extra

  1. Learning Curve: While Puppeteer Extra offers advanced features, its utilization may require a certain level of expertise in web scraping and JavaScript, making it less accessible for beginners or those unfamiliar with the intricacies of web automation.

  2. Maintenance and Updates: Given its evolving nature and reliance on external plugins, Puppeteer Extra may necessitate frequent updates and maintenance to ensure compatibility with changes in target websites and to address potential issues arising from plugin dependencies.

  3. Performance Overhead: The use of additional plugins and features in Puppeteer Extra can potentially introduce performance overhead, impacting the speed and efficiency of web scraping tasks, particularly when dealing with large-scale or time-sensitive data extraction operations.


Integrating puppeteer-extra

To use Puppeteer Extra, you first need to install it with npm (Node Package Manager). puppeteer-extra wraps Puppeteer rather than bundling it, so install both packages. Run the following command in your preferred terminal or command prompt:

npm install puppeteer puppeteer-extra

This command installs Puppeteer itself plus the Puppeteer Extra wrapper that adds plugin support.

Before we jump into exploring different Puppeteer Extra plugins, let's do a simple demonstration of how to use Puppeteer to automate browser tasks, such as navigation and taking screenshots.

// Import the Puppeteer library
const puppeteer = require("puppeteer");

// Asynchronously launch a Puppeteer browser in non-headless mode
(async () => {
  const browser = await puppeteer.launch({
    headless: false,
  });

  // Create a new Puppeteer page.
  const page = await browser.newPage();

  // Navigate the Puppeteer page to the website
  await page.goto('https://quotes.toscrape.com/');

  // Take a screenshot of the current page and save it
  await page.screenshot({
    path: 'screenshot.png',
  });

  // Close the Puppeteer browser
  await browser.close();
})();

  • The script above navigates to the Quotes to Scrape website and captures a screenshot of the current page using the page.screenshot method.

  • This sample uses plain Puppeteer, without Puppeteer Extra.
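
Note that the script above leaves the browser running if goto or screenshot throws. One common way to guard against that is a small wrapper that always closes the browser in a finally block. This is a sketch, not part of Puppeteer: withBrowser is a hypothetical helper, and it assumes only that the launcher exposes launch() and that the browser exposes close(), as in the script above.

```javascript
// Hypothetical helper: run `task` with a browser and always close it afterwards.
// `launcher` is anything with a launch() method, for example the puppeteer module.
async function withBrowser(launcher, launchOptions, task) {
  const browser = await launcher.launch(launchOptions);
  try {
    return await task(browser);
  } finally {
    // Runs on success and on error, so no browser process is left behind.
    await browser.close();
  }
}

// Possible usage with the script above (requires puppeteer to be installed):
// withBrowser(require('puppeteer'), { headless: false }, async (browser) => {
//   const page = await browser.newPage();
//   await page.goto('https://quotes.toscrape.com/');
//   await page.screenshot({ path: 'screenshot.png' });
// });
```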

Now, let's integrate Puppeteer Extra with the Stealth plugin to avoid bot detection on websites.

If you need specific plugins for Puppeteer Extra, you can install them separately using npm. Since we’re also using one of the plugins of puppeteer-extra (i.e. puppeteer-extra-plugin-stealth), install it as well.

npm install puppeteer-extra-plugin-stealth

Here's a script with Puppeteer Extra functionality similar to the above.

// Import the Puppeteer and Puppeteer Extra Stealth Plugin libraries
const puppeteer = require('puppeteer-extra');
const stealthPlugin = require('puppeteer-extra-plugin-stealth');

// Use the Puppeteer Extra Stealth Plugin
puppeteer.use(stealthPlugin());

(async () => {
  const browser = await puppeteer.launch({
    headless: false,
    args: ['--no-sandbox'],
  });

  // Create a new Puppeteer page
  const page = await browser.newPage();

  // Navigate to the website.
  await page.goto('https://quotes.toscrape.com/');

  // Take a screenshot of the current page and save it
  await page.screenshot({
    path: 'screenshot.png'
  });

  // Close the Puppeteer browser
  await browser.close();
})();

This script does the same thing as the previous one, but it launches the browser through Puppeteer Extra with the Stealth plugin enabled. The plugin applies stealth measures that make the automated browser harder for websites to detect and block.

To refresh your Puppeteer fundamentals, check out The NodeJS Puppeteer Guide.


Best Puppeteer Extra Plugins for Web Scraping

Below are a few Puppeteer Extra plugins commonly used for web scraping. Choosing the best plugin will depend on your specific requirements and the website you wish to target.

puppeteer-extra-plugin-stealth

puppeteer-extra-plugin-stealth is a plugin for Puppeteer Extra to prevent detection by anti-bots and other systems designed to detect web scrapers.

This plugin applies various techniques to make Puppeteer harder to detect. Without them, a target website can easily tell that a browser is driven by Puppeteer and flag its requests as coming from a bot.

The puppeteer-extra-plugin-stealth is particularly beneficial when dealing with websites that actively employ anti-scraping measures or those that are sensitive to automated data extraction, enabling smoother and more efficient scraping operations while minimizing the risk of detection and interference.

If this is your first plugin, install puppeteer-extra and the puppeteer-extra-plugin-stealth plugin using the following command:

npm install puppeteer-extra puppeteer-extra-plugin-stealth

In the following example, learn how to use Puppeteer Extra with the Stealth plugin.

// Import Puppeteer and the Puppeteer Extra Stealth plugin
const puppeteer = require('puppeteer-extra');
const stealthPlugin = require('puppeteer-extra-plugin-stealth');

// Enable the Stealth plugin with all evasions
puppeteer.use(stealthPlugin());

(async () => {
  // Launch the browser in headless mode
  const browser = await puppeteer.launch({
    args: ['--no-sandbox'],
    headless: true
  });
  const page = await browser.newPage();

  // Navigate to the page
  const testUrl = 'https://quotes.toscrape.com/';
  await page.goto(testUrl);

  // Save a screenshot
  const screenshotPath = 'screenshot.png';
  await page.screenshot({
    path: screenshotPath
  });

  console.log('Screenshot saved.');

  // Close the browser.
  await browser.close();
})();

  • Here we start by importing Puppeteer Extra and the Stealth plugin.
  • Then, we register the Stealth plugin in its default mode, which ensures that our script uses all evasion modules.
  • Next, we launch the browser in headless mode through Puppeteer Extra.
  • Finally, as in our basic Puppeteer script, we create a new page, navigate to the target website, and take a screenshot.
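
To get a feel for what bot detection looks at, the sketch below checks two well-known signals: the navigator.webdriver flag, which browsers set to true under automation, and the HeadlessChrome token that headless Chrome puts in its User-Agent. It is illustrative only: automationSignals is a hypothetical helper, real anti-bot systems check many more things, and this is not a list of what the Stealth plugin patches.

```javascript
// Hypothetical helper: report simple automation signals from a navigator-like object.
function automationSignals(nav) {
  const signals = [];
  if (nav.webdriver === true) {
    signals.push('navigator.webdriver is true');
  }
  if (/HeadlessChrome/.test(nav.userAgent || '')) {
    signals.push('User-Agent contains HeadlessChrome');
  }
  return signals;
}

// Plain objects stand in for the browser's navigator here.
console.log(automationSignals({ webdriver: true, userAgent: 'Mozilla/5.0 HeadlessChrome/109.0.0.0' }));
console.log(automationSignals({ webdriver: false, userAgent: 'Mozilla/5.0 Chrome/109.0.0.0' }));
```

Inside a Puppeteer script, the same values could be read from the page with page.evaluate(() => navigator.webdriver) and compared with and without the Stealth plugin.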

puppeteer-extra-plugin-proxy

Some websites impose rate limits on the number of requests a single IP address can make within a certain timeframe.

puppeteer-extra-plugin-proxy is a plugin for puppeteer-extra that adds proxy support. Routing requests through proxies helps you avoid rate limiting and adds a layer of anonymity and flexibility to web scraping and automation tasks.

This plugin allows you to specify proxy settings, including IP addresses and ports, and integrate proxy functionality into your Puppeteer script.

Install the plugin using the following command:

npm install puppeteer-extra-plugin-proxy

In the following code, we use puppeteer-extra-plugin-proxy to route the browser through a proxy server when launching Puppeteer. We pass a sample proxy IP address and port; replace them with a proxy you have access to. Then, we launch the browser and go to the httpbin website.

After that, we read the text content of the page body, which contains the IP address httpbin saw for the request. If it shows the proxy's address rather than your own, the traffic is going through the proxy.

// Import Puppeteer and the Puppeteer Extra Proxy plugin
const puppeteer = require('puppeteer-extra');
const pluginProxy = require('puppeteer-extra-plugin-proxy');

// Use the Proxy plugin with the specified proxy address and port
puppeteer.use(pluginProxy({
  address: '35.236.207.242',
  port: 33333
}));

// Launch Puppeteer in non-headless mode
puppeteer
  .launch({
    headless: false
  })
  .then(async browser => {
    const page = await browser.newPage();

    // Navigate to the httpbin website
    await page.goto('https://httpbin.org/ip');

    // Wait for the body element to load.
    const body = await page.waitForSelector('body');

    // Get the IP address from the body element
    const ip = await body.getProperty('textContent');

    // Log the IP address to the console
    console.log(await ip.jsonValue());

    // Close the browser
    await browser.close();
  });
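
The script above only prints the response. To turn that into a pass/fail check, the body text can be parsed and compared with the proxy address. A minimal sketch, assuming httpbin answers /ip with JSON of the form {"origin": "..."}; isUsingProxy is a hypothetical helper, and the addresses in the example are documentation placeholders.

```javascript
// Hypothetical helper: decide whether httpbin's /ip response shows the proxy's address.
// `bodyText` is the page body text, expected to look like {"origin": "203.0.113.7"}.
function isUsingProxy(bodyText, proxyAddress) {
  const { origin } = JSON.parse(bodyText);
  // origin may list several comma-separated addresses when a request is forwarded.
  return String(origin)
    .split(',')
    .map((ip) => ip.trim())
    .includes(proxyAddress);
}

console.log(isUsingProxy('{"origin": "203.0.113.7"}', '203.0.113.7'));  // true
console.log(isUsingProxy('{"origin": "198.51.100.20"}', '203.0.113.7')); // false
```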

Overall, puppeteer-extra-plugin-proxy makes it easy to route a Puppeteer session through a proxy, which is useful for spreading requests across IP addresses, preserving privacy, and reaching content that is otherwise inaccessible from your location.

puppeteer-extra-plugin-anonymize-ua

User agents (UAs) are strings that a user's browser sends to the server. The UA contains information such as the browser type and version, as well as the operating system. Anonymizing the User-Agent string can help you avoid detection by websites.

A UA string looks like this:


"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"

The puppeteer-extra-plugin-anonymize-ua plugin anonymizes the user-agent header sent by Puppeteer.

You can visit useragentstring.com to see the UA for your web browsing environment.

[Image: the User-Agent string reported for the current browser]

To use the puppeteer-extra-plugin-anonymize-ua plugin, first, install it using the following command:

npm install puppeteer-extra-plugin-anonymize-ua

Once you have used the anonymizeUaPlugin() method, all requests made by Puppeteer will have their User-Agent (UA) strings anonymized.

// Import Puppeteer and the Puppeteer Extra Anonymize UA plugin
const puppeteer = require('puppeteer-extra');
const anonymizeUaPlugin = require('puppeteer-extra-plugin-anonymize-ua');

// Use the Anonymize UA plugin
puppeteer.use(anonymizeUaPlugin());

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Navigate to the target website
  await page.goto('https://quotes.toscrape.com/');

  // Take a screenshot of the website
  await page.screenshot({
    path: 'screenshot.png'
  });

  // Close the browser
  await browser.close();
})();

This code launches Puppeteer with the anonymize-ua plugin enabled, so every page it opens sends an anonymized User-Agent (UA) string.

That makes the session look less like an automated headless browser and helps you collect data or run tests without interference from target websites. Keep in mind that the User-Agent is only one of several signals a website can check.
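
To illustrate what "anonymizing" a User-Agent can mean, the sketch below rewrites a headless Chrome UA so that it drops the Headless marker and reports a common Windows platform. Based on the plugin's README, this is roughly what its default options aim for, but treat the exact behavior as an assumption: anonymizeUa is a hypothetical stand-alone function, not the plugin's code. The README also appears to document a customFn option for supplying your own rewrite.

```javascript
// Hypothetical stand-alone rewrite (NOT the plugin's implementation):
// drop the "Headless" marker and report a common Windows platform.
function anonymizeUa(ua) {
  return ua
    .replace('HeadlessChrome/', 'Chrome/')
    // Replace the first parenthesised group, which holds the platform details.
    .replace(/\(([^)]+)\)/, '(Windows NT 10.0; Win64; x64)');
}

const headlessUa =
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/109.0.0.0 Safari/537.36';

console.log(anonymizeUa(headlessUa));
```

For the sample input, the result matches the example UA string shown earlier in this section.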

puppeteer-extra-plugin-block-resources

When scraping websites, you may not be interested in loading or downloading all types of resources, such as images, scripts, or certain stylesheets.

The puppeteer-extra-plugin-block-resources plugin blocks specific resources (such as media, scripts, CSS, etc.) to make Puppeteer faster and reduce unnecessary data usage.

This plugin serves as a valuable tool for optimizing web scraping operations, enhancing performance, and streamlining the extraction of essential data from target websites, enabling users to tailor the scraping process according to their specific requirements and preferences.

Install the plugin with the following command:

npm install puppeteer-extra-plugin-block-resources

Below, we're integrating the block resources plugin with the Puppeteer script.

// Import Puppeteer and the Puppeteer Extra Block Resources plugin
const puppeteer = require('puppeteer-extra');
const blockResourcesPlugin = require('puppeteer-extra-plugin-block-resources')();

// Use the Block Resources plugin
puppeteer.use(blockResourcesPlugin);

async function withPlugIn() {
  // Launch Puppeteer in non-headless mode
  const browser = await puppeteer.launch({
    headless: false
  });

  const page = await browser.newPage();

  // Add the 'media' and 'script' resource types to the list of blocked resources
  blockResourcesPlugin.blockedTypes.add('media');
  blockResourcesPlugin.blockedTypes.add('script');

  // Navigate to the target website and wait for the DOM to load
  await page.goto('http://www.youtube.com', {
    waitUntil: 'domcontentloaded'
  });

  // Close the browser
  await browser.close();
}

withPlugIn();

We call the blockedTypes.add() method with the resource type we want to block. In the above code, we're blocking media and scripts.

You can dynamically add or remove the resource types you want to block. blockedTypes is a JavaScript Set, so entries are removed with delete().

blockResourcesPlugin.blockedTypes.add('media')
blockResourcesPlugin.blockedTypes.delete('stylesheet')

In the above code, we're blocking media on the page. Additionally, we're removing stylesheets (CSS) from the blocked set so that they load again.
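
To show the idea behind blocking by resource type, here is a minimal sketch of a request handler that aborts blocked types and lets everything else through. It is not the plugin's code: makeBlocker is a hypothetical helper, and it assumes only that each request offers resourceType(), abort(), and continue(), which Puppeteer's intercepted requests provide.

```javascript
// Hypothetical handler factory: abort requests whose type is in `blockedTypes`.
function makeBlocker(blockedTypes) {
  return async function handleRequest(request) {
    if (blockedTypes.has(request.resourceType())) {
      await request.abort();
      return 'blocked';
    }
    await request.continue();
    return 'allowed';
  };
}

// The blocked set can change at any time; the handler reads it on every request.
const blockedTypes = new Set(['media', 'script']);
blockedTypes.delete('script'); // scripts are allowed again
```

With plain Puppeteer, such a handler could be attached by calling await page.setRequestInterception(true) and then page.on('request', makeBlocker(blockedTypes)).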

puppeteer-extra-plugin-recaptcha

CAPTCHAs are an obstacle designed to keep scrapers and bots out. However, Puppeteer can help you overcome this issue.

To solve CAPTCHAs with Puppeteer, you can use the puppeteer-extra-plugin-recaptcha plugin, which can solve reCAPTCHAs and hCaptchas automatically. We'll be using 2Captcha, an API-based CAPTCHA-solving service.

First, install the plugin using the following command:

npm install puppeteer-extra-plugin-recaptcha

In the following, we will demonstrate how to use Puppeteer along with the Puppeteer Extra Recaptcha Plugin to automate the login process on a webpage that includes a reCAPTCHA verification.

// Import necessary modules
const puppeteer = require('puppeteer-extra');
const RecaptchaPlugin = require('puppeteer-extra-plugin-recaptcha');

// Configure the reCAPTCHA solving provider
puppeteer.use(
  RecaptchaPlugin({
    provider: {
      id: '2captcha',
      token: 'API Key', // Replace with your own 2Captcha API key
    },
    visualFeedback: true, // Colorize reCAPTCHAs (violet = detected, green = solved)
  })
);

// Define the login function
const login = async () => {
  // Launch a headful browser instance
  const browser = await puppeteer.launch({
    headless: false,
  });

  // Create a new page
  const page = await browser.newPage();

  // Navigate to the login page
  await page.goto('https://app.scrapingbee.com/account/login');

  // Fill in the email and password fields, typing with a small delay per key
  await page.waitForSelector('#email');
  await page.type('#email', 'Your Email', {
    delay: 100
  });
  await page.waitForSelector('#password');
  await page.type('#password', 'Your Password', {
    delay: 100
  });

  // Solve reCAPTCHAs on the page
  await page.solveRecaptchas();

  // Click the login button and wait for the resulting navigation
  await Promise.all([
    page.waitForNavigation(),
    page.click('[type="submit"]'),
  ]);

  // Close the browser
  await browser.close();
};

// Call the login function
login();

  • The script first fills in the login credentials.
  • Then, it uses the page.solveRecaptchas() method to automatically solve any reCAPTCHA challenges present on the page.
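
Rather than hard-coding the API key in the script, you may prefer to read it from the environment. A minimal sketch: captchaProviderFromEnv is a hypothetical helper and TWOCAPTCHA_TOKEN is an assumed variable name, not something the plugin defines. The object it returns has the same shape as the provider option used above.

```javascript
// Hypothetical helper: build the provider config from an environment variable.
// TWOCAPTCHA_TOKEN is an assumed name; pick whatever fits your setup.
function captchaProviderFromEnv(env = process.env) {
  const token = env.TWOCAPTCHA_TOKEN;
  if (!token) {
    // Fail early instead of sending requests with a missing key.
    throw new Error('Set TWOCAPTCHA_TOKEN before running the script');
  }
  return { id: '2captcha', token };
}

// Possible usage: RecaptchaPlugin({ provider: captchaProviderFromEnv() })
```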

Best Puppeteer Extra Plugins for Debugging

Puppeteer Extra provides plugins to debug your scripts, especially those that are running on remote machines or in headless environments. Here, we'll discuss two plugins for debugging:

  1. puppeteer-extra-plugin-portal
  2. puppeteer-extra-plugin-devtools

puppeteer-extra-plugin-portal

puppeteer-extra-plugin-portal is a Puppeteer Extra plugin that opens a portal to a page, allowing you to remotely view and interact with Puppeteer sessions.

This is useful when a script running headless or on a remote machine needs a human to step in, for example to inspect the page or complete a step manually.

To use puppeteer-extra-plugin-portal, install it using the following command:

npm install puppeteer-extra-plugin-portal

Once you have installed the plugin, you can use it to create a new Puppeteer session with a portal, which allows you to remotely view and interact with the session.

const puppeteer = require('puppeteer-extra');
const PortalPlugin = require('puppeteer-extra-plugin-portal');

puppeteer.use(
  PortalPlugin({
    // Serve the portal locally on port 5500. When hosting behind a secured
    // reverse proxy, baseUrl would be the public URL instead.
    webPortalConfig: {
      listenOpts: {
        port: 5500,
      },
      baseUrl: 'http://localhost:5500/',
    },
  })
);

// puppeteer usage as normal
puppeteer.launch({
  headless: true
}).then(async browser => {
  const page = await browser.newPage();
  await page.goto('https://app.scrapingbee.com/account/login');

  // Open a portal to get a link to it.
  const portalUrl = await page.openPortal();
  console.log('Portal URL:', portalUrl);

  // Wait a long time for the success condition to be met
  const successDiv = await page.waitForSelector('.recaptcha-success', {
    timeout: 86400 * 1000, // 24 hours
  });

  // await page.closePortal(); // You can manually close a portal with
  // OR
  // await page.close(); // Closing the page will automatically close its portal.
  // OR
  // await browser.close(); // Closing the browser will automatically close the portals opened on it.
  // When all portals are closed, the web server will automatically shut down
});

When you run the above code, you'll get a sample link that looks like this: http://localhost:5500/?targetId=7F595050561EA7D7164206C8763ACE86.

To view the remote session, open this link in a web browser. The portal must still be open at that point: closing the portal, the page, or the browser ends the session, which is why those calls are commented out in the code above.

Here's the result when you open https://app.scrapingbee.com/account/login.

[Screenshot: the login page opened directly, without the portal]

Here's the result when you open the remote view of the session (http://localhost:5500/?targetId=7F595050561EA7D7164206C8763ACE86).

[Screenshot: the same page in the portal's remote view]

puppeteer-extra-plugin-devtools

The puppeteer-extra-plugin-devtools plugin makes it possible to debug a Puppeteer browser remotely.

It gives you access to the browser's Chrome DevTools by creating a secure tunnel through which the DevTools frontend (including screencasts) can be reached from the public internet.

To use puppeteer-extra-plugin-devtools, you'll need to install it:

npm install puppeteer-extra-plugin-devtools

Once you have added the plugin, you can launch a Puppeteer browser with remote debugging enabled.

const puppeteer = require('puppeteer-extra')
const devtools = require('puppeteer-extra-plugin-devtools')()

puppeteer.use(devtools)

puppeteer
  .launch({
    headless: true,
    defaultViewport: null
  })
  .then(async browser => {
    console.log('Start')

    // Create the tunnel and print the URL of the remote DevTools frontend
    const tunnel = await devtools.createTunnel(browser)
    console.log(tunnel.url)

    const page = await browser.newPage()
    await page.goto('https://www.nytimes.com/international/')
    console.log('All setup.')
  })

Other Valuable Puppeteer Extra Plugins

Apart from the plugins previously mentioned, Puppeteer Extra offers a variety of additional plugins that can further enhance the capabilities of Puppeteer for web scraping and automation.

Some of these valuable plugins include:

  1. puppeteer-extra-plugin-adblocker: This plugin blocks ads and trackers, which reduces data consumption and speeds up loading times. It is designed to be efficient, with low memory use and a small footprint (only 64KB minified and gzipped).
  2. puppeteer-extra-plugin-repl: This plugin adds a Read-Eval-Print Loop (REPL) to Puppeteer. You can interrupt your code at any point to start an interactive REPL in your console, where you can inspect arbitrary objects and instances.
  3. puppeteer-extra-plugin-flash: This plugin enables Flash on all sites without user interaction. However, the Flash plugin does not work in headless mode. Flash is a deprecated technology, but it is still used by some websites.
  4. puppeteer-extra-plugin-user-preferences: This plugin launches Puppeteer with arbitrary user preferences. This lets you control the browser environment by setting custom preferences, such as enabling geolocation. The user-defined preferences will be merged with preferences set by other plugins. You can use this to enable or disable certain features or customize the browser's appearance and behavior.

Advanced Puppeteer Extra Integrations

You can use various other advanced integrations with Puppeteer Extra. We'll be discussing three advanced integrations of Puppeteer Extra.


Using TypeScript with Puppeteer Extra Plugin

Using TypeScript with Puppeteer Extra improves code readability and productivity. TypeScript is a superset of JavaScript that adds type safety and other features, making your code more robust and easier to maintain. To enable TypeScript, follow these steps:

Step 1: Install TypeScript to add it to your project.

npm install typescript

Step 2: To initialize TypeScript, run the following command. This will create a tsconfig.json file in your project root, which contains TypeScript configurations.

npx tsc --init

Step 3: To enable TypeScript in Puppeteer Extra, rename your script from .js to .ts and update the imports accordingly. Replace all require() statements with import statements. For example, replace const puppeteer = require('puppeteer-extra') with import puppeteer from 'puppeteer-extra'.


Using Multiple Puppeteers with Different Plugins

Using multiple Puppeteer instances with different plugins is a powerful way to organize large-scale scraping operations. For example, you can use one Puppeteer instance to scrape pages that require high stealth, and another Puppeteer instance to scrape pages that contain ads.

To do this, use the addExtra() function from Puppeteer Extra to create separate Puppeteer instances, each representing a distinct browser environment.

Then, add the required plugins to each instance by calling that instance's use() method.

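const { addExtra } = require("puppeteer-extra");
const vanillaPuppeteer = require("puppeteer");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");
const AdblockerPlugin = require("puppeteer-extra-plugin-adblocker");

(async () => {
  // First instance: Stealth only, for pages that require high stealth
  const stealthPuppeteer = addExtra(vanillaPuppeteer);
  stealthPuppeteer.use(StealthPlugin());

  // Second instance: ad blocking only, for pages that contain ads
  const adblockPuppeteer = addExtra(vanillaPuppeteer);
  adblockPuppeteer.use(AdblockerPlugin());

  // Launch a separate browser from each instance
  const stealthBrowser = await stealthPuppeteer.launch();
  const adblockBrowser = await adblockPuppeteer.launch();

  // Do stuff

  await stealthBrowser.close();
  await adblockBrowser.close();
})();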

Using Puppeteer Extra with Puppeteer Cluster

puppeteer-cluster lets you create a cluster of Puppeteer workers so that multiple tasks run concurrently. Used together with Puppeteer Extra, every worker in the cluster gets your plugins.

To achieve this, use the addExtra() function to create a custom Puppeteer instance. addExtra() takes a vanilla Puppeteer instance as its argument and returns a wrapped instance, on which you enable plugins with use().

Then, initialize the cluster with the custom Puppeteer instance, define the task handler using the cluster.task() function, and queue the tasks using cluster.queue().

The following code launches a cluster of up to two concurrent workers, each of which takes a screenshot of a URL from the queue. Once the queue is empty and all tasks have finished, the cluster is closed and the script exits.

const { addExtra } = require("puppeteer-extra");
const vanillaPuppeteer = require("puppeteer");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");
const RecaptchaPlugin = require("puppeteer-extra-plugin-recaptcha");
const { Cluster } = require("puppeteer-cluster");

(async () => {
  // Wrap vanilla Puppeteer and register the plugins
  const puppeteer = addExtra(vanillaPuppeteer);
  puppeteer.use(StealthPlugin());
  puppeteer.use(RecaptchaPlugin());

  // Launch a cluster of up to two workers that uses the wrapped instance
  const cluster = await Cluster.launch({
    puppeteer,
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 2,
  });

  // Counter used to give each screenshot a unique file name
  let i = 0;

  // Define what each worker does with a queued URL
  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
    await page.screenshot({
      path: `screenshot${i++}.png`,
      fullPage: true,
    });
  });

  // Queue the URLs, wait for all tasks to finish, then shut down
  cluster.queue("http://www.google.com/");
  cluster.queue("http://www.wikipedia.org/");
  await cluster.idle();
  await cluster.close();
  console.log("Program is finished!");
})();
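
In the script above, file names come from a counter shared by both workers, so which URL ends up in which numbered file depends on the order in which pages finish loading. One alternative is to derive the file name from the URL itself. A minimal sketch; screenshotPathFor is a hypothetical helper, not part of puppeteer-cluster.

```javascript
// Hypothetical helper: build a stable screenshot file name from the URL itself.
function screenshotPathFor(url) {
  const { hostname, pathname } = new URL(url);
  const slug = (hostname + pathname)
    .replace(/[^a-zA-Z0-9]+/g, '-') // collapse dots, slashes, etc. into dashes
    .replace(/^-+|-+$/g, '');       // trim leading and trailing dashes
  return `${slug}.png`;
}

console.log(screenshotPathFor('http://www.google.com/'));    // www-google-com.png
console.log(screenshotPathFor('http://www.wikipedia.org/')); // www-wikipedia-org.png
```

Inside the task handler, the path option could then be set to screenshotPathFor(url) instead of using the counter.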

More Web Scraping Tutorials

In this guide, you learned about Puppeteer Extra and its plugins, including the best plugins for web scraping, debugging, and other valuable purposes, as well as advanced integrations.

If you would like to learn more about Web Scraping with Puppeteer, then be sure to check out The Puppeteer Web Scraping Playbook.