The Python Pyppeteer Guide: Using Puppeteer With Python

Pyppeteer is an unofficial Python wrapper for Puppeteer, the hugely popular Javascript Chrome/Chromium browser automation library.

Using a headless browser like Pyppeteer gives Python developers a real alternative to older browser automation libraries like Selenium.

Python Pyppeteer is one of the best headless browser options you can use for browser automation and web scraping, so in this guide we will walk through how to install it and use its core features.


How To Install Python Pyppeteer

Installing and setting up pyppeteer is very straightforward.

First, you need to install pyppeteer itself:


pip install pyppeteer

Or install the latest version from this Github repository:


pip install -U git+https://github.com/pyppeteer/pyppeteer@dev


How To Use Pyppeteer

Now, let's create our first script using pyppeteer to open a page in a browser.

demo.py

import asyncio
from pyppeteer import launch

async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('https://quotes.toscrape.com/')
    await page.screenshot({'path': 'screenshot.png'})
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())

Pyppeteer has async support by default, which means our script handles browser automation steps asynchronously; this can significantly increase performance when automating more than one page at a time.

Now when we run our script demo.py:


python demo.py

Our Pyppeteer script will open a browser, take a screenshot of the quotes.toscrape.com homepage, and save it as screenshot.png.

Downloading Chromium Browser

When you run pyppeteer for the first time, it will download the latest version of Chromium (~150MB) if it is not found on your system. This may delay the running of your script.

If you prefer to download the latest version of Chromium before running your script, you can do so using the following command:


pyppeteer-install


How To Scrape Pages With Pyppeteer

A common use case for Pyppeteer and other browser automation libraries is scraping websites.

You can easily scrape websites using Pyppeteer's implementation of JavaScript's Document API.


import asyncio
from pyppeteer import launch

async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('https://quotes.toscrape.com/')

    # Get Title (getProperty returns a JSHandle, so we call
    # jsonValue() to extract the actual string)
    title_element = await page.querySelector('h1')
    title = await (await title_element.getProperty('textContent')).jsonValue()
    print('title', title)

    await browser.close()

asyncio.get_event_loop().run_until_complete(main())


Alternatively, we can just retrieve the HTML content from the response and use a library like BeautifulSoup to parse the data we need.


import asyncio
from pyppeteer import launch
from bs4 import BeautifulSoup

async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('https://quotes.toscrape.com/')

    # Get HTML
    html = await page.content()
    await browser.close()
    return html

html_response = asyncio.get_event_loop().run_until_complete(main())

# Load HTML Response Into BeautifulSoup
soup = BeautifulSoup(html_response, "html.parser")
title = soup.find('h1').text
print('title', title)



How To Wait For The Page To Load

A common requirement when using headless browsers is making sure all the content has loaded prior to moving onto the next step.

With Pyppeteer we can do this in two ways:

  1. Wait a specific amount of time
  2. Wait for a page element to appear

Wait Specific Amount of Time

To wait for a specific amount of time before carrying out the next steps in our script, we can simply add a page.waitFor() call to our script and pass it a time in milliseconds:


import asyncio
from pyppeteer import launch

async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('https://quotes.toscrape.com/')
    await page.waitFor(5000)

    # Next Steps

    await browser.close()

asyncio.get_event_loop().run_until_complete(main())


Wait For Page Element To Appear

The other approach is to wait for a page element to appear on the page before moving on.

We can do this using page.waitForSelector():


import asyncio
from pyppeteer import launch

async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('https://quotes.toscrape.com/')

    await page.waitForSelector('h1', {'visible': True})

    # Next Steps

    await browser.close()

asyncio.get_event_loop().run_until_complete(main())


How To Click On Buttons With Pyppeteer

Clicking a button or other page element with Pyppeteer is pretty simple.

We just need to find it with a selector and then tell Pyppeteer to click on it:


import asyncio
from pyppeteer import launch

async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('https://quotes.toscrape.com/')

    # Click Button
    link = await page.querySelector("h1")
    await link.click()

    await browser.close()

asyncio.get_event_loop().run_until_complete(main())


How To Scroll The Page With Pyppeteer

A lot of modern websites use infinite scroll to load more results onto the page, requiring you to scroll the page to scrape all the data you need.

We can scroll to the bottom of the page using the page.evaluate() method:


import asyncio
from pyppeteer import launch

async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('https://quotes.toscrape.com/scroll')

    # Scroll To Bottom
    await page.evaluate('window.scrollBy(0, document.body.scrollHeight)')

    await browser.close()

asyncio.get_event_loop().run_until_complete(main())


How To Take Screenshots With Pyppeteer

Another common use case for an automated browser is taking screenshots, which Pyppeteer makes very easy.

To take a screenshot with Pyppeteer we just need to use page.screenshot() and define the path to save the file.


import asyncio
from pyppeteer import launch

async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('https://quotes.toscrape.com/')
    await page.screenshot({'path': 'screenshot.png'})
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())

You can also change the page size prior to taking the screenshot using page.setViewport():


await page.setViewport({"width": 1600, "height": 900})
await page.screenshot({'path': 'screenshot.png'})


How to Use A Proxy With Pyppeteer

If you are scraping then you will likely want to use a proxy.

With Pyppeteer you can set a proxy when you launch the browser:


import asyncio
from pyppeteer import launch

async def main():
    browser = await launch({'args': ['--proxy-server=ip:port'], 'headless': False})
    page = await browser.newPage()
    await page.goto('https://quotes.toscrape.com/')
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())

If you need to authenticate the proxy then you can do so like this:


import asyncio
from pyppeteer import launch

async def main():
    browser = await launch({'args': ['--proxy-server=ip:port'], 'headless': False})
    page = await browser.newPage()
    await page.authenticate({'username': 'user', 'password': 'passw'})
    await page.goto('https://quotes.toscrape.com/')
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())


More Pyppeteer Functionality

Python Pyppeteer has a huge amount of functionality and is highly customisable, so much so that it is hard to cover everything properly in a single guide.

So if you would like to learn more about Python Pyppeteer, then check out the official documentation.

It covers everything from setting user-agents:


await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36')

To running the browser in headless mode:


browser = await launch({"headless": True})


More Web Scraping Tutorials

In this guide we've introduced you to the fundamental functionality of Python Pyppeteer and how to use it in your own projects.

If you would like to learn more about different JavaScript rendering options for Python, or other Python libraries like Scrapy, then be sure to check out our other guides.

If you would like to learn more about Scrapy in general, then be sure to check out The Python Web Scraping Playbook.