Puppeteer Real Browser Guide

Puppeteer Real Browser provides a seamless way to mimic real browser behavior, helping your scraping activities avoid detection.

In this guide, we will walk you through using this powerful tool designed to help you overcome bot detection and CAPTCHA challenges in web scraping.


TL;DR - Puppeteer Real Browser

If you're short on time and need a quick solution for bypassing bot detection, here's a brief summary. The script below demonstrates how to set up and use Puppeteer Real Browser to scrape a page undetected:

import { connect } from 'puppeteer-real-browser';

connect({
  headless: 'auto',
  turnstile: true
})
  .then(async response => {
    const { browser, page } = response;
    await page.goto('https://books.toscrape.com/');
    await page.screenshot({ path: 'books.png' });
    await browser.close();
  })
  .catch(error => {
    console.log(error.message);
  });

The script leverages the puppeteer-real-browser package, which extends Puppeteer with additional features.

This script launches a real browser, navigates to a URL, and takes a screenshot, all while avoiding detection. By using a real browser instance, it closely mimics human behavior, reducing the risk of being flagged as a bot.


Understanding Puppeteer Real Browser

Puppeteer Real Browser is designed to prevent detection by mimicking real browser behavior. It uses a real browser instance, which allows it to:

  • Accurately render web pages: By using a real browser instance, Puppeteer Real Browser ensures that web pages are rendered accurately, just as they would be in a standard user’s browser. This is crucial for tasks that depend on the correct rendering of HTML, CSS, and JavaScript. Accurate rendering is particularly important when scraping data from dynamic web pages that rely heavily on client-side JavaScript for content generation.

  • Handle dynamic content: Many modern web applications load content dynamically using JavaScript. Puppeteer Real Browser can handle such content because it runs a full browser instance, capable of executing all JavaScript on the page and loading content as it appears. This makes it suitable for scraping single-page applications (SPAs) and websites that use AJAX to update content without refreshing the entire page.

  • Test web applications in a realistic environment: Running in a real browser environment means that you can test web applications under conditions that closely mimic those experienced by end-users. This includes rendering, performance, and interaction testing. This is beneficial for web developers who need to ensure their applications work correctly across different devices and environments.

Features of Puppeteer Real Browser

Puppeteer Real Browser offers several powerful features:

  • Device Emulation: Simulate different devices to test responsive designs. This allows you to ensure your web application looks and works correctly on a variety of devices, from desktops to smartphones.
  • Asynchronous Operation Handling: Efficiently manage asynchronous operations. This is crucial for handling tasks such as loading data, waiting for user interactions, and managing timeouts.
  • Debugging Tools: Built-in tools to help debug your scripts. These tools can be invaluable when developing complex scraping scripts, as they allow you to inspect the browser's state, log messages, and step through your code.

Why Use Puppeteer Real Browser vs. Puppeteer Stealth

When it comes to web automation and web scraping, the choice between using Puppeteer Real Browser and Puppeteer Stealth largely depends on the specific requirements and challenges of the task at hand.

Both tools aim to minimize detection and maximize the effectiveness of automated interactions with web pages, but they do so in different ways.

Puppeteer Real Browser offers several advantages over Puppeteer Stealth:

  • Realism: Uses a real browser instance, making it harder for websites to detect. Websites often employ sophisticated techniques to detect non-human behavior, such as analyzing mouse movements, scroll patterns, and network requests. By using a real browser, Puppeteer Real Browser can more effectively mimic human behavior.
  • Accuracy: Provides accurate rendering and interaction with web pages. This is particularly important for tasks that require pixel-perfect accuracy, such as visual testing and capturing screenshots.
  • Device Emulation: It can emulate various devices, including mobile phones and tablets, to test how web pages behave on different devices. This helps ensure that your web scraping or automation scripts are compatible with a wide range of devices.
  • Debugging Tools: Offers extensive debugging capabilities with support for running in headless and non-headless modes, allowing visual inspection of browser interactions. This makes it easier to diagnose and fix issues in your scripts.
  • Turnstile/CAPTCHA Handling: Designed to handle CAPTCHAs and other bot-detection mechanisms more effectively by mimicking human-like interactions. This is particularly useful for scraping websites that use CAPTCHAs to block automated access.

Tradeoffs

When choosing between Puppeteer Real Browser and Puppeteer Stealth, it's essential to consider the tradeoffs involved. Here are the key tradeoffs between the two:

  • Performance: Real browser instances can be more resource-intensive. Running a full browser consumes more CPU and memory compared to running in headless mode with stealth plugins.
  • Setup: May require additional setup, especially on Linux systems. For example, installing additional dependencies like xvfb for running headless browsers in environments without a graphical interface.

Use Puppeteer Real Browser if:

  • You need high detection avoidance for complex, dynamic websites.
  • You require accurate device emulation and realistic browser behavior.
  • Handling CAPTCHAs is a significant part of your automation task.

Use Puppeteer Stealth if:

  • Performance and resource efficiency are critical.
  • You need a simple integration with existing Puppeteer scripts.
  • The websites you are automating are not overly complex or heavily protected.

Installation and Setup

To install Puppeteer Real Browser using npm, run the following command:

npm install puppeteer-real-browser

CommonJS and Module Import Methods

Depending on your project setup, you can import Puppeteer Real Browser's connect function using either CommonJS or ES Module syntax. Note that the package provides a named connect export rather than a default export.

CommonJS:

const { connect } = require('puppeteer-real-browser');

ES Module:

import { connect } from 'puppeteer-real-browser';

Additional Setup for Linux

For Linux users, you might need to install xvfb:

sudo apt-get install xvfb

xvfb (X Virtual Framebuffer) allows you to run graphical applications without a physical display. This matters here because Puppeteer Real Browser runs a real (headful) browser, which normally requires a display server; xvfb provides a virtual one on servers without a graphical interface.


Launching a Real Browser

Launching a real browser using Puppeteer Real Browser is straightforward. Depending on the options you pass, the browser can run in headless mode, meaning it runs in the background without a graphical interface.

However, for development and debugging purposes, you might want to see the browser in action. You can do this by setting the headless option to false.

import { connect } from 'puppeteer-real-browser';

connect({
  headless: false
})
  .then(async response => {
    const { browser } = response;
    console.log('Browser launched');
    await browser.close();
  })
  .catch(error => {
    console.log(error.message);
  });


Running the browser in non-headless mode allows you to see exactly what the script is doing, making it easier to debug issues with navigation, element interaction, and more.


Creating a New Page

Creating a new page within the launched browser is a fundamental task in Puppeteer Real Browser. This step is essential for navigating to different URLs, interacting with web elements, and performing various scraping operations.

import { connect } from 'puppeteer-real-browser';

connect({})
  .then(async response => {
    const { browser, page } = response;
    console.log('New page created');
    await browser.close();
  })
  .catch(error => {
    console.log(error.message);
  });

Once you have a new page, the next step is to navigate to a specific URL. This is done using the goto method, which directs the browser to load the desired webpage. Navigating to a URL is a common action required for scraping, testing, or any other browser automation task.

import { connect } from 'puppeteer-real-browser';

connect({})
  .then(async response => {
    const { page } = response;
    await page.goto('https://books.toscrape.com/');
    console.log('Navigated to URL');
    await response.browser.close();
  })
  .catch(error => {
    console.log(error.message);
  });



Interacting with Elements (e.g. Clicking, Typing in a Field, etc.)

Interacting with elements on a webpage is crucial for tasks like filling out forms, clicking buttons, or extracting data. Puppeteer Real Browser provides methods to perform these actions seamlessly.

import { connect } from 'puppeteer-real-browser';

connect({})
  .then(async response => {
    const { page } = response;
    await page.goto('https://books.toscrape.com/');
    await page.click('div[class="image_container"]'); // clicks the first book's image container
    console.log('Interacted with elements');
    await response.browser.close();
  })
  .catch(error => {
    console.log(error.message);
  });


In this example, the script navigates to a URL and clicks on an element. Replace the selector with one relevant to the element you want to interact with on your target webpage.
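Typing into form fields works the same way. As a sketch, here is a hypothetical helper (not part of puppeteer-real-browser) that types one character at a time with a random pause between keystrokes, which looks less scripted than inserting a whole string at once. The selector and login URL in the usage comment are assumptions for illustration:

```javascript
// Hypothetical helper: types text one character at a time with a random
// pause between keystrokes, to mimic human typing speed.
async function typeLikeHuman(page, selector, text, { minMs = 50, maxMs = 150 } = {}) {
  await page.click(selector); // focus the field first
  for (const char of text) {
    await page.keyboard.type(char);
    const pause = minMs + Math.random() * (maxMs - minMs);
    await new Promise(resolve => setTimeout(resolve, pause));
  }
}

// Usage sketch inside a connect(...).then(...) block, assuming a page
// with a '#username' field (e.g. https://quotes.toscrape.com/login):
//   await typeLikeHuman(page, '#username', 'my-user');
```

For simple cases, Puppeteer's built-in page.type(selector, text, { delay }) achieves a similar effect with a fixed per-key delay.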


Waiting for Page Load

Ensuring that a page has fully loaded before interacting with it is vital for the reliability of your scripts. Puppeteer Real Browser allows you to wait for specific conditions, such as network idleness, before proceeding with further actions.

import { connect } from 'puppeteer-real-browser';

connect({})
  .then(async response => {
    const { page } = response;
    await page.goto('https://books.toscrape.com/', { waitUntil: 'networkidle2' });
    console.log('Page loaded');
    await response.browser.close();
  })
  .catch(error => {
    console.log(error.message);
  });

In this example, the goto method includes an option to wait until there are no more than two network connections for at least 500 ms (networkidle2). This ensures that the page has finished loading all necessary resources, making subsequent interactions more reliable.


Capturing Screenshots

Capturing screenshots is a useful feature for debugging, monitoring, and documentation purposes. Puppeteer Real Browser makes it easy to take screenshots of web pages at any point during your script’s execution.

import { connect } from 'puppeteer-real-browser';

connect({})
  .then(async response => {
    const { page } = response;
    await page.goto('https://books.toscrape.com/');
    await page.screenshot({ path: 'books.png' });
    console.log('Screenshot captured');
    await response.browser.close();
  })
  .catch(error => {
    console.log(error.message);
  });

The screenshot method captures a screenshot of the current state of the page and saves it to the specified path. This can be particularly helpful for verifying that your script is interacting with the page correctly.


Generating PDFs

Generating PDFs from web pages is another powerful feature of Puppeteer Real Browser. This can be particularly useful for saving invoices, reports, or any content that needs to be preserved in a portable format. Note that Chrome typically supports PDF generation only when running in headless mode.

import { connect } from 'puppeteer-real-browser';

connect({})
  .then(async response => {
    const { page } = response;
    await page.goto('https://books.toscrape.com/');
    await page.pdf({ path: 'books.pdf', format: 'A4' });
    console.log('PDF generated');
    await response.browser.close();
  })
  .catch(error => {
    console.log(error.message);
  });

Emulating Devices

Emulating different devices is essential for testing how a webpage behaves on various screen sizes and resolutions.

import { connect } from 'puppeteer-real-browser';
import { KnownDevices } from 'puppeteer';

connect({})
  .then(async response => {
    const { page } = response;
    await page.emulate(KnownDevices['iPhone 6']);
    await page.goto('https://books.toscrape.com/');
    await response.browser.close();
  })
  .catch(error => {
    console.log(error.message);
  });



Using Proxies

Using proxies is crucial for tasks that require anonymity or access to geo-restricted content. Puppeteer Real Browser supports proxy usage to help you manage these requirements effectively.

For seamless integration, we recommend using the ScrapeOps Proxy Aggregator. ScrapeOps Proxy Aggregator provides access to the best performing proxies via a single endpoint.

Here’s how you can integrate ScrapeOps proxies with Puppeteer Real Browser:

import { connect } from 'puppeteer-real-browser';

// ScrapeOps proxy configuration
const PROXY_HOST = 'proxy.scrapeops.io';
const PROXY_HOST_PORT = '5353';
const PROXY_USERNAME = 'scrapeops.headless_browser_mode=true';
const PROXY_PASSWORD = 'YOUR_API_KEY'; // <-- enter your API key here

connect({
  proxy: {
    host: PROXY_HOST,
    port: PROXY_HOST_PORT,
    username: PROXY_USERNAME,
    password: PROXY_PASSWORD
  }
})
  .then(async response => {
    const { page } = response;
    await page.goto('https://books.toscrape.com/');
    console.log('Proxy used');
    await response.browser.close();
  })
  .catch(error => {
    console.log(error.message);
  });

In this example, the connect method includes a proxy configuration with the proxy server's host, port, username, and password. Using proxies helps you bypass geo-restrictions and distribute your requests to avoid detection.

By using ScrapeOps Proxy Aggregator, you benefit from optimized proxy performance tailored for headless browsers like Puppeteer, ensuring your scraping tasks run smoothly and efficiently.

ScrapeOps will take care of the proxy selection and rotation for you so you just need to send us the URL you want to scrape.


Advanced Functionality

Handling SkipTarget Feature

The skipTarget feature is useful for navigating past detection hurdles that target specific browser behaviors. By skipping certain targets, you can avoid common traps set by websites to identify bots.

import { connect } from 'puppeteer-real-browser';

connect({
  skipTarget: ['https://books.toscrape.com/skip']
})
  .then(async response => {
    const { page } = response;
    await page.goto('https://books.toscrape.com/');
    console.log('SkipTarget used');
    await response.browser.close();
  })
  .catch(error => {
    console.log(error.message);
  });

Utilizing ConnectOption

The connectOption parameter allows for additional configurations when connecting to a browser instance. This is particularly useful for custom setups or when connecting to remote browsers.

import { connect } from 'puppeteer-real-browser';

connect({
  connectOption: { browserWSEndpoint: 'ws://localhost:3000' }
})
  .then(async response => {
    const { page } = response;
    await page.goto('https://books.toscrape.com/');
    console.log('ConnectOption used');
    await response.browser.close();
  })
  .catch(error => {
    console.log(error.message);
  });

In this example, the connect method connects to a browser instance using a WebSocket endpoint specified by browserWSEndpoint. This allows for advanced configurations and remote browser management.

Opening Multiple Pages Simultaneously

Opening multiple pages simultaneously can significantly enhance the efficiency of your scraping tasks. Puppeteer Real Browser supports handling multiple pages concurrently, allowing for parallel data extraction.

import { connect } from 'puppeteer-real-browser';

connect({
  turnstile: true
})
  .then(async response => {
    const { page, browser, setTarget } = response;

    await page.goto('https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html', {
      waitUntil: 'domcontentloaded'
    });

    setTarget({ status: false });

    const page2 = await browser.newPage();

    setTarget({ status: true });

    await page2.goto('https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html');

    console.log('Multiple pages opened');
    await browser.close();
  })
  .catch(error => {
    console.log(error.message);
  });


In this script, a second page is opened alongside the first and navigated to a different URL, with setTarget toggled around the newPage call so the new target is handled correctly. Managing multiple pages allows you to perform complex scraping operations more efficiently by processing several URLs in parallel.
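When scraping many URLs, it helps to cap how many pages are open at once so you don't exhaust browser resources. The following is a sketch of a hypothetical concurrency helper (not part of puppeteer-real-browser); the URL list and per-page callback in the usage comment are assumptions for illustration:

```javascript
// Hypothetical helper: run an async function over a list of items,
// with at most `limit` invocations in flight at any one time.
async function mapWithConcurrency(items, limit, fn) {
  const results = new Array(items.length);
  let next = 0;
  async function worker() {
    while (next < items.length) {
      const index = next++; // claim the next item (safe: JS is single-threaded)
      results[index] = await fn(items[index], index);
    }
  }
  // Start `limit` workers that pull items until the list is drained.
  const workers = Array.from({ length: Math.min(limit, items.length) }, worker);
  await Promise.all(workers);
  return results;
}

// Usage sketch with a connected browser (assumed URL list):
//   const urls = ['https://books.toscrape.com/', /* ... */];
//   const titles = await mapWithConcurrency(urls, 3, async url => {
//     const p = await browser.newPage();
//     await p.goto(url);
//     const title = await p.title();
//     await p.close();
//     return title;
//   });
```

Results come back in the original order regardless of which page finishes first, which keeps downstream processing simple.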


Best Practices and Recommendations

When working with Puppeteer Real Browser, there are several best practices and recommendations to keep in mind to ensure smooth operation, maintainability, and security of your automation scripts. Here are some key points to consider:

Balancing Between Headless and Non-Headless Modes

  • Unless you specifically require interaction with the browser UI, run Puppeteer Real Browser in headless mode to improve performance and reduce resource consumption. Headless mode eliminates the need to render the browser’s graphical interface, making scripts run faster and use fewer resources.
  • Use non-headless mode for more accurate scraping and debugging. Seeing the browser in action can help identify issues with navigation, element interaction, and timing.

Optimizing Code for Performance and Reliability

  • Avoid unnecessary page reloads: Reloading pages can be time-consuming and resource-intensive. Optimize your scripts to interact with existing page elements without refreshing the entire page.
  • Use efficient selectors: Use precise and efficient CSS or XPath selectors to interact with elements. This reduces the likelihood of errors and speeds up element selection.
  • Handle exceptions gracefully: Use try-catch blocks to handle exceptions and ensure that your scripts can recover from errors without crashing. Log errors for debugging purposes and implement retries for critical actions.
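The retry advice above can be sketched as a small wrapper. This is a hypothetical helper (the name and defaults are assumptions, not a puppeteer-real-browser API) that retries a flaky async action with exponential backoff:

```javascript
// Hypothetical helper: retry an async action with exponential backoff.
async function withRetries(action, { retries = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await action(attempt);
    } catch (error) {
      lastError = error;
      console.log(`Attempt ${attempt + 1} failed: ${error.message}`);
      if (attempt < retries) {
        // wait 0.5s, 1s, 2s, ... before the next try
        await new Promise(resolve => setTimeout(resolve, baseDelayMs * 2 ** attempt));
      }
    }
  }
  throw lastError; // all attempts failed; surface the last error
}

// Usage sketch: retry a flaky navigation up to 3 times.
//   await withRetries(() => page.goto('https://books.toscrape.com/'));
```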

Handling CAPTCHAs and Bot Detection Responsibly

  • Use real browser interactions: Mimic human-like interactions, such as mouse movements and keyboard inputs, to reduce the likelihood of detection.
  • Avoid overloading servers with requests: Implement rate limiting and random delays between requests to avoid overloading target servers and triggering anti-bot mechanisms.
  • Respect website terms of service: Ensure that your scraping activities comply with the terms of service of the websites you are targeting. Unauthorized scraping can lead to legal issues and IP bans.
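The random-delay suggestion above can be implemented in a few lines. This is a minimal sketch (the function name and the 1-3 second defaults are assumptions chosen for illustration):

```javascript
// Sketch: pause for a random interval between requests, so the scraper
// doesn't hit the target server with a fixed, bot-like cadence.
function randomDelay(minMs = 1000, maxMs = 3000) {
  const ms = minMs + Math.random() * (maxMs - minMs);
  return new Promise(resolve => setTimeout(resolve, ms));
}

// Usage sketch (assumed list of URLs):
//   for (const url of urls) {
//     await page.goto(url);
//     await randomDelay(); // wait 1-3 seconds before the next request
//   }
```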

Conclusion

In this guide, we explored the Puppeteer Real Browser, a powerful tool for bypassing bot detection and CAPTCHA challenges in web scraping. We covered installation, setup, and various functionalities such as navigating to URLs, interacting with elements, and more.

Puppeteer Real Browser’s ability to mimic real browser behavior makes it a valuable tool for web scraping and automation tasks that require high detection avoidance and accurate rendering.

For further details, refer to the official repository.


More Web Scraping Guides

Looking to advance your scraping skills? Take a look at our Puppeteer Web Scraping Playbook.

You can also check out some of our other in-depth guides: