Puppeteer Guide: Run Using Jupyter Notebook

Jupyter Notebook allows you to run code in an interactive environment that combines code execution, rich text, equations, visualizations, and more. This makes it a useful tool for rapidly testing and iterating on Puppeteer scripts.

This guide will walk through how to set up and utilize Puppeteer within a Jupyter Notebook.


TLDR - How to Run Puppeteer Using Jupyter Notebooks

To run Puppeteer code in a Jupyter Notebook, you can use the ijavascript-await kernel. Once Jupyter Notebook and ijavascript-await are installed, you can begin writing Node.js code in a notebook.

After creating a notebook, add a code cell with the following code to launch the Puppeteer browser and create a page:

const puppeteer = require("puppeteer");

const browser = await puppeteer.launch();
const page = await browser.newPage();

Then create another code cell that uses the previously created page to navigate to a website and log its title:

await page.goto("https://example.com");

// Get page title
console.log(await page.title());

That is all! You are now running Node.js code with Puppeteer in a Jupyter Notebook. Puppeteer's methods and techniques are unchanged; they simply run in this more convenient format.


Why Run Puppeteer Using Jupyter Notebooks

Some of the major benefits of using Jupyter Notebook for Puppeteer include:

  • Rapid prototyping: Test snippets of Puppeteer code quickly.
    • Notebooks provide an interactive environment where code is executed cell by cell.
    • Developers can quickly write and test snippets of Puppeteer code without creating and running separate scripts.
  • Visualizations: Inspect and plot scraped data.
    • Notebooks make it easy to inspect and visualize data scraped with Puppeteer.
    • These visualizations can reveal structure and patterns in the data, helping developers understand and analyze it.
  • Mixing content: Combine code, markdown notes, outputs, and more.
    • Code, markdown notes, and outputs sit together in a single document.
    • This mixed-content format documents the code alongside its results, which helps collaboration and knowledge sharing among team members.
  • Sharing: Easily share executable notebooks with others.
    • Developers can share their Puppeteer scripts, along with any accompanying documentation and visualizations, in a single self-contained document.

Using Puppeteer in a Jupyter Notebook

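To use Puppeteer in a Jupyter Notebook, you first need to install the required packages and start the notebook server.

Setting Up the Environment

  • Install Jupyter Notebook if it is not already installed.
  • Install the kernel globally: `npm install -g ijavascript-await`
  • Install Puppeteer where Node can resolve it (for example, in the directory you start the notebook from): `npm install puppeteer`
  • Run the environment using the `ijsnotebook` command.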

Creating Your Jupyter Notebook

With the environment setup complete, you can now work on the Jupyter Notebook using either the web interface or your IDE, if it supports notebooks. This guide uses the web interface for simplicity.

  1. Open the interface: the ijsnotebook command should have opened the page for you; if not, it is accessible at localhost:8888.
  2. Create a new notebook: use the File > New > Notebook menu.
  3. Use the Node.js kernel: you should be presented with a kernel selection drop-down; choose "JavaScript (Node.js)" from the "Start Other Kernel" section.

Begin Writing Code

Now that you have created a notebook, you can begin adding cells to write code or markdown in. Use the "+" icon in the controls menu and then select Code to create your first code cell.

Put the following code in your first code cell:

const puppeteer = require("puppeteer");

const browser = await puppeteer.launch();
const page = await browser.newPage();

You are now ready to use Puppeteer in any following code cells. The page and browser objects will be available throughout the notebook. To test this, add the following code in a new cell:

await page.goto("https://example.com");

// Get page title
console.log(await page.title());

You can then run each cell using the "Run" button in the toolbar. Make sure to run them in order, because the second depends on the first. Both should execute successfully, and the second cell should print the page title (for https://example.com, "Example Domain").

[Image: Notebook Running Puppeteer Cells]


Visual Feedback and Debugging

While working on your Puppeteer notebook, you can use conventional techniques to confirm that your code is running properly and to troubleshoot it when it is not.

Disabling Headless Client

As usual, you can disable headless browsing when testing your code so that you can see and interact with the browser performing the work. This is even more beneficial in Jupyter Notebook because the browser remains open, allowing you to rerun cells as you change and test them. To do this, change the launch call in your first cell to configure the browser as shown below:

const browser = await puppeteer.launch({ headless: false });
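When switching between headless and visible runs often, it can help to build the launch options in one place. The following is a minimal sketch under that assumption; `buildLaunchOptions` and its `debug` and `slowMoMs` parameters are hypothetical names, while `headless` and `slowMo` are standard Puppeteer launch options.

```javascript
// Hypothetical helper: builds Puppeteer launch options for a debugging session.
function buildLaunchOptions({ debug = false, slowMoMs = 100 } = {}) {
  if (!debug) {
    return { headless: true };
  }
  return {
    headless: false, // show the browser window
    slowMo: slowMoMs, // slow each Puppeteer operation by this many milliseconds
  };
}

// In a notebook cell you might then write:
//   const browser = await puppeteer.launch(buildLaunchOptions({ debug: true }));
console.log(buildLaunchOptions({ debug: true }));
```

Slowing operations down with `slowMo` makes it easier to follow what the visible browser is doing while a cell runs.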

Screenshots

Another popular way to investigate your Puppeteer code is with screenshots. Screenshots work as usual, with the added benefit that Jupyter Notebook can embed them directly in the notebook.

First, create a code cell to generate the screenshot:

await page.screenshot({ path: "test.png" });

Then create a markdown cell to show the screenshot:

![Test Image](test.png)

After running both cells, the screenshot appears beneath the code.

[Image: Screenshots in Notebooks]
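As an alternative to writing a file and referencing it from a markdown cell, the screenshot can be kept in memory. The sketch below assumes only Puppeteer's `encoding: "base64"` screenshot option; `screenshotAsImgTag` is a hypothetical helper name.

```javascript
// Hypothetical helper: captures the page and returns an HTML <img> tag with
// the PNG embedded as a base64 data URI, so no file is written to disk.
async function screenshotAsImgTag(page, alt = "Screenshot") {
  // With encoding: "base64", page.screenshot() resolves to a base64 string.
  const base64 = await page.screenshot({ type: "png", encoding: "base64" });
  return `<img alt="${alt}" src="data:image/png;base64,${base64}" />`;
}

// In a notebook cell you might then write:
//   const html = await screenshotAsImgTag(page, "Example page");
```

IJavascript-based kernels typically offer a display helper such as `$$.html(html)` for rendering a string like this in the cell output. Whether ijavascript-await exposes the same helper is an assumption worth checking in its documentation.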

Debugging

The Notebook web interface does not directly support conventional debugging features such as breakpoints. Instead, use JupyterLab or an IDE of your choice (such as VS Code) that supports debugging in notebook cells.


Troubleshooting and Best Practices

There are some common challenges and issues that may arise while running Puppeteer in a Jupyter Notebook. Most of them relate to the long-running sessions typical of notebook development.

Remember, in a normal script Puppeteer is usually opened and closed quickly. With that in mind, some common issues and best practices are discussed below:

Common Issues and Mitigation

  • Session timeouts: Lower timeout thresholds, retry critical steps.
    • Session timeouts occur when a Puppeteer script takes too long to execute, resulting in the session being terminated by the server or encountering an idle timeout.
    • Adjust the timeout settings in Puppeteer to reduce the maximum time allowed for various operations such as page loading, navigation, or waiting for elements.
    • Implement retry logic for critical steps in the script to handle transient network issues or server timeouts. This allows the script to retry failed operations automatically.

  • Memory usage: Close browser frequently, limit concurrent tabs.
    • High memory usage can occur when Puppeteer instances consume excessive memory, leading to performance degradation or out-of-memory errors.
    • Close the browser instance periodically, especially after completing resource-intensive tasks or when memory usage reaches a certain threshold. This releases memory and resources associated with the browser.
    • Limit the number of open tabs or pages to reduce memory consumption. Consider closing inactive tabs or recycling existing pages instead of opening new ones.

  • Frozen execution: Restart kernel if code hangs or errors.
    • Frozen execution occurs when the Puppeteer script hangs or becomes unresponsive due to an error or blocking operation.
    • Restart the kernel to terminate the current execution and start fresh. This can help resolve issues caused by code errors, infinite loops, or unhandled exceptions.
    • Review the script for potential errors or blocking operations that may cause the execution to freeze. Ensure that asynchronous operations are properly awaited and error handling is implemented to prevent hanging.

  • Unhandled promises: Ensure all promises are handled properly.
    • Unhandled promises occur when asynchronous operations in the Puppeteer script are not properly handled, leading to uncaught exceptions or unexpected behavior.
    • Implement error handling for all asynchronous operations to catch and handle any exceptions that may occur. Use try-catch blocks to handle errors gracefully.
    • When executing multiple asynchronous operations concurrently, use Promise.allSettled() to await all promises and handle their results or errors collectively.
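Two of the mitigations above, retrying critical steps and collecting results with Promise.allSettled(), can be sketched as small helpers. This is a minimal illustration rather than a definitive implementation; `withRetry`, `settleAll`, and their option names are hypothetical.

```javascript
// Hypothetical helper: runs an async action, retrying when it throws.
async function withRetry(action, { retries = 3, delayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      return await action(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < retries) {
        // Wait briefly before the next attempt.
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  // Every attempt failed: surface the last error to the caller.
  throw lastError;
}

// Hypothetical helper: waits for every promise to finish and separates
// the successful values from the errors.
async function settleAll(promises) {
  const results = await Promise.allSettled(promises);
  return {
    values: results.filter((r) => r.status === "fulfilled").map((r) => r.value),
    errors: results.filter((r) => r.status === "rejected").map((r) => r.reason),
  };
}
```

In a notebook cell, a navigation with an explicit timeout could then be wrapped as `await withRetry(() => page.goto("https://example.com", { timeout: 10000 }))`, where `timeout` is Puppeteer's navigation timeout in milliseconds.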

Handling Credentials Safely

When using credentials in notebooks:

  • Avoid hardcoding credentials: Use variables or prompt for input.
    • Hardcoding credentials directly into the Puppeteer script poses a security risk, as it exposes sensitive information such as usernames, passwords, or API keys in plain text.
    • Store sensitive information in variables or configuration files external to the script. This allows credentials to be easily updated or changed without modifying the code.
    • Prompt users to input credentials interactively when running the script. This ensures that sensitive information is not hard-coded and is only provided at runtime.

  • Restrict notebook access: Use access controls if hosting on services like Colab.
    • Notebooks hosted on platforms like Google Colab may be accessible to others, potentially exposing sensitive information or credentials stored within the notebook.
    • Configure access controls and permissions settings to restrict access to the notebook. For example, limit access to specific users or collaborators who require access to the notebook.
    • Encrypt sensitive notebooks or sections of code to prevent unauthorized access. Use encryption tools or features provided by the hosting platform to secure sensitive information.

  • Clear credentials after use: Delete variables or restart the kernel after use.
    • Storing credentials or sensitive information in variables or memory within the notebook session may pose a security risk if not properly cleared after use.
    • Explicitly delete variables or objects containing sensitive information from memory after they are no longer needed.
    • Restart the notebook kernel or runtime environment after executing code that contains sensitive information.
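One common way to keep credentials out of the notebook itself is to read them from environment variables set before Jupyter is started. The sketch below assumes that approach; `readSecret` and the variable names `SCRAPER_USERNAME` and `SCRAPER_PASSWORD` are hypothetical.

```javascript
// Hypothetical helper: reads a required secret from environment variables.
function readSecret(name, env = process.env) {
  const value = env[name];
  if (value === undefined || value === "") {
    throw new Error(`Missing environment variable: ${name}`);
  }
  return value;
}

// Stand-in environment so the sketch runs anywhere. In a real notebook, omit
// the second argument so the values come from process.env.
const demoEnv = { SCRAPER_USERNAME: "demo-user", SCRAPER_PASSWORD: "demo-pass" };

let credentials = {
  username: readSecret("SCRAPER_USERNAME", demoEnv),
  password: readSecret("SCRAPER_PASSWORD", demoEnv),
};
console.log("Loaded credentials for:", credentials.username);

// Drop the reference once the login step is finished.
credentials = undefined;
```

Dropping the reference only makes the value unreachable from later cells; restarting the kernel remains the more thorough way to clear it.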

Ensuring Secure WebDriver Configurations

Some tips for securely configuring the WebDriver:

  • Default ports and paths for endpoints (e.g., Puppeteer's WebSocket endpoint) can be predictable and targeted by attackers, increasing the risk of unauthorized access or exploitation. Use non-default ports and paths for endpoints.
  • Automatic downloads and extensions in Puppeteer can pose security risks, such as inadvertently downloading malicious files or installing unauthorized browser extensions. Disable automatic downloads and extensions.
  • Exposing WebDriver endpoints to the public internet without proper access controls can leave them vulnerable to unauthorized access or exploitation. Limit WebDriver access via firewall rules and network segmentation.
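As an illustration of the download tip, the browser can be asked to refuse downloads through the Chrome DevTools Protocol. This is a sketch under assumptions: `denyDownloads` is a hypothetical helper, `page.createCDPSession()` is available in recent Puppeteer versions (older versions use `page.target().createCDPSession()`), and `Browser.setDownloadBehavior` is an experimental protocol method whose behavior may differ between browser versions.

```javascript
// Hypothetical helper: asks the browser to deny file downloads.
async function denyDownloads(page) {
  // Open a Chrome DevTools Protocol session attached to this page.
  const client = await page.createCDPSession();
  // Assumption: the browser supports this experimental CDP method.
  await client.send("Browser.setDownloadBehavior", { behavior: "deny" });
  return client;
}

// In a notebook cell you might then write:
//   await denyDownloads(page);
```

It is worth confirming the effect against a harmless test download before relying on it.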

Example Use Cases

Jupyter Notebook is useful for tasks like:

  • Web scraping experiments - Try different page actions and selectors
  • Data analysis - Clean and process scraped data
  • Automation testing - Validate UI elements and interactions
  • Ad-hoc scripts - Build one-off tools for data tasks
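For the selector experiments mentioned in the list above, a small helper can compare several candidate selectors in one cell. The following is a minimal sketch; `countMatches` is a hypothetical name, and it relies only on Puppeteer's `page.$$eval()`.

```javascript
// Hypothetical helper: counts how many elements each candidate selector matches.
async function countMatches(page, selectors) {
  const counts = {};
  for (const selector of selectors) {
    // page.$$eval() runs the callback in the page on all matching elements.
    counts[selector] = await page.$$eval(selector, (elements) => elements.length);
  }
  return counts;
}

// In a notebook cell you might then write:
//   console.log(await countMatches(page, ["h1", "p", "a"]));
```

Because the page stays open between cells, the list of selectors can be edited and the cell rerun without reloading the site.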

Limitations and Challenges

Limitations to note when using Jupyter Notebook:

  • Statefulness: Persistent sessions can be challenging to maintain in Jupyter Notebooks, especially with long-running tasks or iterative development.

  • Memory: Notebook kernels keep their state in memory, so they can consume significant resources, especially with large datasets or memory-intensive operations.

  • Debugging: Jupyter Notebooks lack the comprehensive debugging tools found in integrated development environments (IDEs) or dedicated debuggers.

  • Errors: Jupyter Notebooks are susceptible to errors, timeouts, or interruptions during execution, which may freeze or disrupt the notebook.
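The memory limitation is often easiest to manage by pruning tabs between experiments. The sketch below shows one way to do that with Puppeteer's `browser.pages()` and `page.close()`; `closePagesExcept` is a hypothetical helper name.

```javascript
// Hypothetical helper: closes every open tab except the ones listed in keepPages.
async function closePagesExcept(browser, keepPages = []) {
  const openPages = await browser.pages();
  const toClose = openPages.filter((p) => !keepPages.includes(p));
  await Promise.all(toClose.map((p) => p.close()));
  // Report how many tabs were closed.
  return toClose.length;
}

// In a notebook cell you might then write:
//   await closePagesExcept(browser, [page]);
```

When the session is finished, calling `await browser.close()` releases the browser process entirely.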


Conclusion

Jupyter Notebook provides an interactive environment to quickly build and test Puppeteer scripts. It combines code execution, documentation, and visualization in a shareable format. While limitations exist, it can boost productivity for certain use cases.

Check out the Jupyter Documentation and the official Puppeteer Documentation to get more information.

More Web Scraping Guides

If you would like to learn more about Web Scraping with Puppeteer, then be sure to check out The Puppeteer Web Scraping Playbook.
