Python Selenium vs. Python Pyppeteer Compared

Selenium vs Python Pyppeteer for Web Scraping

Python Selenium and Python Pyppeteer are both powerful tools for web scraping, but they have different characteristics and use cases.

Exploring the web scraping landscape, we compare Selenium's comprehensive automation features with those of Python Pyppeteer.

In this tutorial, we'll walk you through what each tool offers, how to set them up, when to choose one over the other, and a side-by-side case study.


TLDR: Selenium vs Python Pyppeteer for Web Scraping

In the arena of web scraping, Selenium is a robust toolset, well-suited for comprehensive browser automation and testing, and particularly valuable for projects requiring extensive multi-browser compatibility and detailed logging capabilities.

Conversely, Python Pyppeteer offers a more streamlined approach, excelling in headless browsing and swift script execution, albeit with more limited browser support.

Selenium is the preferred solution for complex testing scenarios across various environments, whereas Pyppeteer is tailored for developers who prioritize speed and efficiency in handling JavaScript-intensive tasks.

| Feature | Selenium | Python Pyppeteer |
| --- | --- | --- |
| Browser support | Extensive, including legacy browsers | Limited to Chrome/Chromium |
| Concurrency | Parallel testing with Grid | Async/await for tasks |
| Chrome DevTools | Accessible, but less direct | Direct access via DevTools Protocol |
| Testing framework integration | Integrates with testing frameworks | Standalone, no native integration |

What is Python Pyppeteer?

Python Pyppeteer is a Python port of the Puppeteer library, which is a Node library that provides a high-level API to control headless Chrome or Chromium browsers over the DevTools Protocol.

In simpler terms, Pyppeteer allows you to automate browser tasks in Python, making it an invaluable tool for web scraping, especially when dealing with JavaScript-heavy websites.

Advantages of Using Python Pyppeteer For Web Scraping

Delving into the strengths of Pyppeteer, we uncover the unique advantages that make it a compelling choice for modern web scraping endeavors.

  • Headless Browsing: Pyppeteer is designed primarily for headless browsing with Chromium, making it more efficient for scraping tasks without the need for a visible browser interface.
  • Direct Access to Chrome DevTools Protocol: Pyppeteer provides direct access to the Chrome DevTools Protocol, allowing for more granular control over the browser's actions, such as intercepting network requests or injecting scripts.
  • Lightweight: Without the need for additional drivers (like Selenium's WebDriver), Pyppeteer can be a more lightweight solution, requiring only the Puppeteer library and a Chromium instance.
  • Stealth Mode: Helps evade detection mechanisms on websites, making scraping activities less distinguishable from genuine user interactions.
  • Support for Asynchronous Operations: Pyppeteer stands out for its native support of asynchronous operations, streamlining web scraping tasks with efficient, non-blocking execution (a short sketch follows this list).
  • JavaScript Rendering: Leveraging its Chromium base, Pyppeteer excels in rendering JavaScript-heavy websites, ensuring dynamic content is fully processed and accessible for scraping.
  • Rich API and Flexibility: Pyppeteer's rich API offers a high degree of flexibility, allowing developers to craft custom scraping solutions with ease and precision.
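
To make the asynchronous advantage concrete, here is a minimal sketch that fetches several page titles concurrently with asyncio.gather. The URLs are placeholders; swap in whatever pages you actually need.

import asyncio
from pyppeteer import launch

URLS = [
    'https://quotes.toscrape.com/page/1/',
    'https://quotes.toscrape.com/page/2/',
    'https://quotes.toscrape.com/page/3/',
]

async def fetch_title(browser, url):
    # Each task gets its own page (tab) inside the shared browser
    page = await browser.newPage()
    try:
        await page.goto(url)
        return await page.title()
    finally:
        await page.close()

async def main():
    browser = await launch()
    try:
        # gather() runs the coroutines concurrently instead of one after another
        titles = await asyncio.gather(*(fetch_title(browser, url) for url in URLS))
        print(titles)
    finally:
        await browser.close()

asyncio.run(main())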

Disadvantages of Using Python Pyppeteer for Web Scraping

Despite its strengths, Python Pyppeteer also presents certain limitations that users must consider when selecting it for web scraping projects.

  • Limited Browser Support: Pyppeteer is primarily designed for Chromium, which means it lacks the multi-browser support that Selenium offers.
  • Less Mature: Selenium has been around for a longer time and has a more mature ecosystem. This maturity brings a wealth of resources, community support, plugins, and integrations that might not be as extensive for Pyppeteer.
  • No Official Maintenance: Pyppeteer is no longer officially maintained, which can pose risks in terms of security updates, bug fixes, and compatibility with newer web technologies.
  • Community and Resources: Given Selenium's longer presence in the market, it has a larger community, which means more tutorials, forums, and third-party tools are available for Selenium compared to Pyppeteer.
  • Resource Intensive: Pyppeteer's resource-intensive nature may demand more from your system, potentially leading to higher computational overhead.
  • Slower Performance: The architecture of Pyppeteer can contribute to slower performance compared to some lightweight scraping tools.
  • Detection and Anti-Scraping Measures: Advanced detection mechanisms and anti-scraping measures on websites can more easily identify and block Pyppeteer-driven scrapers.
  • Complexity and Learning Curve: Despite its powerful capabilities, Pyppeteer presents a steep learning curve and complexity that can be daunting for beginners.

When Should You Use Pyppeteer over Selenium?

When considering Pyppeteer over Selenium for web scraping, you should lean towards Pyppeteer in the following scenarios:

  • Rapid Script Development: If your project requires quick turnaround times for script development, Pyppeteer's concise API is beneficial for rapid prototyping and development.
  • Complex JavaScript Execution with Chrome: For web pages that are heavily reliant on JavaScript, Pyppeteer's ability to execute JavaScript within the context of the browser can be more direct and less cumbersome than Selenium's WebDriver.
  • Headless Execution Needs: Pyppeteer is designed to operate headless by default, which is ideal for server environments or situations where you do not need a graphical browser interface and wish to conserve system resources.
  • Direct Browser Control: When you need fine-grained control over the browser, including direct access to the Chrome DevTools Protocol, Pyppeteer provides this level of control, allowing for more advanced browser interactions and monitoring (see the short sketch after this list).
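
As a quick illustration of that fine-grained control, the sketch below uses request interception to skip image downloads; the same technique appears again in the case study later in this article. Treat it as a minimal example rather than a production scraper.

import asyncio
from pyppeteer import launch

async def main():
    browser = await launch()
    page = await browser.newPage()

    # Enable interception so every outgoing request passes through our handler
    await page.setRequestInterception(True)

    async def handle_request(request):
        # Skip image downloads to save bandwidth; let everything else through
        if request.resourceType == 'image':
            await request.abort()
        else:
            await request.continue_()

    page.on('request', lambda req: asyncio.ensure_future(handle_request(req)))

    await page.goto('https://quotes.toscrape.com/')
    print(await page.title())
    await browser.close()

asyncio.run(main())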

Setting Up Python Pyppeteer

Before installing Pyppeteer, make sure you have:

  • Python 3.6 or higher. You can verify your Python version by running python --version in your terminal.
  • pip, the Python package installer.

Run the following command in your terminal:

pip install pyppeteer

One of the perks of Pyppeteer is that it automatically downloads a compatible version of Chromium (a lightweight version of Chrome) the first time you launch it. However, if you wish to manually download it, you can use:

import pyppeteer
pyppeteer.chromium_downloader.download_chromium()
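
Alternatively, the package installs a small command-line helper that performs the same download:

pyppeteer-install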

Basic Python Pyppeteer Example

Once installed, you can start using Pyppeteer right away. Here's a simple example to launch a browser:

import asyncio
from pyppeteer import launch

async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('https://quotes.toscrape.com/')
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())

This script launches a browser, navigates to 'https://quotes.toscrape.com/', and then closes the browser.
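
Building on that, here is a slightly extended sketch that also pulls the quote text from the page. The .quote .text selector matches the site's markup at the time of writing, so treat it as an assumption that may need updating.

import asyncio
from pyppeteer import launch

async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('https://quotes.toscrape.com/')

    # Run JavaScript inside the page to collect the text of every quote
    quotes = await page.evaluate(
        '''() => Array.from(document.querySelectorAll('.quote .text'))
                      .map(el => el.innerText)'''
    )
    for quote in quotes:
        print(quote)

    await browser.close()

asyncio.get_event_loop().run_until_complete(main())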


What is Selenium?

Selenium is an open-source automation framework primarily used for automating web applications for testing purposes.

It supports multiple programming languages, browsers, and operating systems, making it a versatile tool for web developers and testers.

Beyond testing, Selenium is also employed for automating repetitive web-based administration tasks and web scraping.

Advantages of Using Selenium For Web Scraping

Selenium, a powerhouse in the automation world, offers a suite of advantages for web scraping, making it a go-to choice for developers looking to extract data with precision and efficiency.

  • Wide Browser Support: Selenium's wide browser support encompasses industry leaders like Chrome, Firefox, Safari, and Edge, providing unparalleled versatility for web scraping across different platforms.
  • Mature Ecosystem: Selenium's mature ecosystem is a testament to its longevity and widespread adoption, offering a wealth of resources including comprehensive documentation, a robust community for support, and a plethora of plugins and integrations.
  • Parallel Execution: Selenium Grid facilitates parallel execution, allowing simultaneous tests or web scraping tasks across a variety of browsers and operating systems, significantly boosting productivity and time efficiency.
  • Explicit Waits: Selenium's explicit waits offer a robust mechanism for scripts to pause until certain conditions are met, such as the presence of an element on the page, making it ideal for scraping sites with content that loads dynamically (see the sketch after this list).
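
For example, here is a minimal explicit-wait sketch that pauses until a specific element appears before reading it; the URL and selector are placeholders, chosen because that page renders its content with JavaScript.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://quotes.toscrape.com/js/')  # quotes on this page are rendered client-side

# Block for up to 10 seconds until at least one quote element is present
quote = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '.quote .text'))
)
print(quote.text)

driver.quit()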

Disadvantages of Using Selenium for Web Scraping:

While Selenium is a powerful tool for web scraping, it comes with certain limitations that can affect its performance and suitability for some scraping tasks.

  • Less Native Async Support: Selenium's architecture, primarily synchronous, can lead to slower handling of tasks when compared to tools designed with native asynchronous capabilities.
  • Complex Installation Process: The installation process for Selenium can be quite involved, requiring the setup of multiple components such as language-specific bindings and browser drivers.
  • Performance Management Capabilities: Selenium's capabilities in managing and monitoring the performance of web scraping tasks are somewhat limited, often necessitating additional setup or external utilities to measure page load times.
  • Steep Learning Curve: Selenium's comprehensive support for various browsers, platforms, and programming languages contributes to a steeper learning curve, potentially posing a challenge for newcomers to web automation and scraping.
  • Slower Performance: Because each command runs synchronously through the WebDriver protocol, Selenium generally completes large scraping jobs more slowly than lighter, asynchronous tools such as Pyppeteer.

When Should You Use Selenium over Pyppeteer?

When selecting a tool for web scraping and browser automation, certain scenarios call for the robust features and extensive compatibility of Selenium, distinguishing it as the preferred choice over Pyppeteer.

  • Integration with Testing Frameworks: Selenium excels in scenarios where the seamless integration of web scraping scripts with automated testing frameworks is paramount, offering a unified environment that enhances both data extraction and application testing.
  • Large-Scale, Distributed Web Scraping: Selenium stands out when scalability and integration with cloud platforms are essential, catering to extensive web scraping needs across thousands of pages with ease (a Selenium Grid sketch follows this list).
  • Multi-Browser and Legacy Browser Support: Selenium excels in scenarios requiring data scraping across diverse browsers, including older and legacy versions, ensuring comprehensive compatibility and insights.
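
For the distributed scenario, the usual pattern is to point the script at a Selenium Grid hub with a Remote driver. Here is a minimal sketch, assuming a hub is already running on localhost:4444:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')

# Connect to a Selenium Grid hub instead of a local browser;
# the hub forwards the session to whichever node has a free Chrome slot
driver = webdriver.Remote(
    command_executor='http://localhost:4444/wd/hub',
    options=options,
)

driver.get('https://quotes.toscrape.com/')
print(driver.title)
driver.quit()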

Setting Up Selenium

Setting up Selenium is a straightforward process that involves a few key steps to get started with automating web browsers.

First, you need to install the Selenium WebDriver, which acts as a bridge between your code and the web browser.

For instance, in Python, you would run:

pip install selenium

Once Selenium is installed, you'll need to download the appropriate driver for the browser you intend to automate, such as ChromeDriver for Google Chrome.

These drivers need to be accessible in your system's PATH or specified within your code.

from selenium import webdriver

# Initialize the driver for Chrome browser
driver = webdriver.Chrome()

# Open a webpage
driver.get('https://www.example.com')

# Close the browser
driver.quit()

This snippet sets up the driver, opens a web page, and then closes the browser. It's the foundation upon which more complex automation tasks are built, such as interacting with web page elements and extracting data.
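
If the driver is not on your PATH, you can point Selenium at it explicitly through a Service object; the path below is a placeholder. Recent Selenium releases (4.6+) can also fetch a matching driver for you automatically via Selenium Manager, in which case neither step is needed.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Explicit driver location; replace the path with wherever chromedriver lives on your machine
service = Service('/path/to/chromedriver')
driver = webdriver.Chrome(service=service)

driver.get('https://www.example.com')
driver.quit()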

Basic Selenium Example

Here's a basic example using Selenium WebDriver with Python to demonstrate how you can open a web page, perform a search, and print the title of the resulting page.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

# Initialize the Chrome WebDriver
driver = webdriver.Chrome()

# Navigate to the Google homepage
driver.get('http://www.google.com')

# Wait for the page to load
driver.implicitly_wait(5) # seconds

# Locate the search box using its name attribute value
search_box = driver.find_element(By.NAME, 'q')

# Clear the search box in case there's any pre-filled text
search_box.clear()

# Type the search query into the search box
search_box.send_keys('Selenium WebDriver')

# Submit the search request
search_box.send_keys(Keys.RETURN)

# Wait for the search results page to load
driver.implicitly_wait(5) # seconds

# Print the title of the search results page
print(driver.title)

# Close the browser
driver.quit()

This example serves as a stepping stone to more advanced browser automation tasks.


Detailed Comparison of Python Pyppeteer vs Selenium Features

Let's dive into a detailed comparison of Python Pyppeteer and Selenium, focusing on various features and aspects that are relevant for web scraping and automation.

| Feature | Python Pyppeteer | Selenium |
| --- | --- | --- |
| Browser Support | Primarily supports Chromium | Supports Chrome, Firefox, Safari, Edge |
| Execution Speed | Generally faster due to being lighter | Can be slower due to being more robust |
| Asynchronous Programming | Native support for async/await | Does not natively support async/await |
| JavaScript Execution | Direct execution within the browser | Executes via the WebDriver interface |
| Installation Complexity | Simple, often a single package install | More complex, with drivers for each browser |
| Community and Support | Smaller community, less support | Large community, extensive support |
| Documentation | Limited, as the project is not maintained | Extensive and well-maintained |
| Integration with Testing | Not designed for testing | Deep integration with testing frameworks |
| Performance Management | Limited performance tools | Requires third-party tools for advanced performance management |
| Learning Curve | Steeper due to less documentation | Steeper due to complexity and features |
| Cloud Services Integration | Limited out-of-the-box support | Supports integration with cloud services |
| Legacy Browser Support | Not a focus | Supports legacy browsers |
| DevTools Protocol | Direct access | Accessible, but not as direct |

Ideal Python Pyppeteer & Selenium Use Cases

In the landscape of web scraping and browser automation, Python Pyppeteer and Selenium serve as powerful tools, each with scenarios where they excel.

  • Python Pyppeteer is adept at handling dynamic, JavaScript-heavy single-page applications, offering a streamlined approach for developers to scrape and interact with content that is rendered client-side.

  • Selenium, with its robust framework, is the tool of choice for complex, multi-browser interactions and is particularly valuable in enterprise environments. It supports a wide range of browsers, including legacy versions, making it indispensable for testing web applications across different user environments.

Together, these tools provide a comprehensive suite for tackling various web scraping challenges, from quick data extraction to thorough testing and automation across diverse web platforms.


Case Study: A Side-by-Side Python Pyppeteer vs Selenium Comparison

To illustrate the differences between the two technologies, let's scrape Amazon products with Selenium and Pyppeteer and discuss the key points.

To circumvent potential blocks from Amazon, it's crucial for our scrapers to utilize proxies effectively. ScrapeOps offers a robust proxy solution tailored for web scraping, which we'll integrate to ensure uninterrupted data extraction.

Web Page Structure in Examples

Please note that the specific HTML structure, element locators, and class or ID attributes used in the code samples are based on the current state of the web page as of November 2023.

Due to possible updates or modifications to the webpage's design and structure, the element locators and CSS selectors mentioned in the examples may no longer be valid or effective.

Please leave a comment on the article if you encounter any discrepancies, changes, or issues with the provided code samples.

Selenium Example

from seleniumwire import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from urllib.parse import urlencode
import json
import time

# Set up proxy
SCRAPEOPS_API_KEY = ''  # Fill out with your API key

proxy_options = {
    'proxy': {
        'http': f'http://scrapeops.headless_browser_mode=true:{SCRAPEOPS_API_KEY}@proxy.scrapeops.io:5353',
        'https': f'http://scrapeops.headless_browser_mode=true:{SCRAPEOPS_API_KEY}@proxy.scrapeops.io:5353',
        'no_proxy': 'localhost:127.0.0.1'
    }
}

def extract_product_details(driver, product_url, original_window):
    '''
    In this function we are going to open a product page and scrape product details
    '''

    # Navigate to the product URL
    # Open a new window using JavaScript
    driver.execute_script("window.open('');")

    # Switch to the new window and open a URL
    new_window = [window for window in driver.window_handles if window != original_window][0]
    driver.switch_to.window(new_window)
    driver.get(product_url)

    # In order to make sure that we search for the product details only after the page is loaded,
    # let's implement a wait for the product title
    try:
        WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.ID, "productTitle")))
    except:
        pass  # If a timeout occurs, proceed

    # Extract product name and price. If those are not available, return a "not available" string.
    try:
        product_name = driver.find_element(By.ID, "productTitle").text
    except:
        product_name = "not available"

    try:
        product_price = driver.find_element(By.CSS_SELECTOR, "span.a-price.reinventPricePriceToPayMargin span.a-offscreen").text
    except:
        product_price = "not available"

    # Close the product window
    driver.close()

    # Switch back to the original window
    driver.switch_to.window(original_window)

    return {'name': product_name, 'price': product_price}

def scrape_amazon_products():
    '''
    This is the main function. Here we are going to:
    - set up and initialize a browser
    - create a while loop to iterate over the Amazon search results
    - on each of the search results we are going to call extract_product_details()
    - save the product details in a .json file
    '''

    # Set up the Selenium WebDriver. We should initialize it as headless and
    # pass it the seleniumwire_options argument in order for it to work with the proxy
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--blink-settings=imagesEnabled=false')
    driver = webdriver.Chrome(options=chrome_options, seleniumwire_options=proxy_options)
    products = []
    page_num = 1

    # Set up a while loop to iterate over Amazon search pages
    while True:
        url = f"https://www.amazon.com/ebike/s?k=ebike&page={page_num}"

        # In order to make the crawling a bit faster, instead of clicking every
        # product link and then returning to the original search results page,
        # we are going to open each new product in a new tab.
        # We'll start with saving the search results page first
        original_window = driver.current_window_handle
        driver.get(url)

        # Get product links
        product_links = driver.find_elements(By.CSS_SELECTOR, 'h2 a.a-link-normal')
        product_urls = [link.get_attribute('href') for link in product_links]

        # Scrape product details
        for product_url in product_urls:
            product_details = extract_product_details(driver, product_url, original_window)
            products.append(product_details)

        # Check for the next pagination page
        next_page = driver.find_elements(By.CSS_SELECTOR, 'li.a-last a')
        if not next_page:
            break

        page_num += 1

    driver.quit()

    # Save the data as JSON
    with open('amazon_ebike_products.json', 'w') as f:
        json.dump(products, f)

scrape_amazon_products()

Comment

  • This script does not implement concurrency, as Selenium does not natively support asynchronous operations like Pyppeteer. If you need to handle multiple pages concurrently, you would typically use Selenium Grid or run multiple instances of the script in parallel processes (a minimal sketch of that approach follows below).
  • To integrate our proxy with your Selenium scraper we recommend that you use the Selenium Wire extension which makes it very easy to use proxies with Selenium. First, you need to install Selenium Wire using pip:
pip install selenium-wire

Then update your scraper to use seleniumwire's webdriver instead of the default selenium webdriver.
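
To give one possible shape to the parallel-process approach mentioned above, the sketch below splits a list of URLs across worker processes, each running its own headless driver. It is a simplified illustration rather than a parallel version of the Amazon scraper.

from concurrent.futures import ProcessPoolExecutor
from selenium import webdriver

URLS = [
    'https://quotes.toscrape.com/page/1/',
    'https://quotes.toscrape.com/page/2/',
    'https://quotes.toscrape.com/page/3/',
    'https://quotes.toscrape.com/page/4/',
]

def fetch_title(url):
    # Each worker process gets its own headless browser instance
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.title
    finally:
        driver.quit()

if __name__ == '__main__':
    # Run up to two browsers side by side in separate processes
    with ProcessPoolExecutor(max_workers=2) as executor:
        for title in executor.map(fetch_title, URLS):
            print(title)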

Pyppeteer Example

import asyncio
import os
import shutil
import stat
import tempfile
from pyppeteer import launch
import pyppeteer.errors
import json

# Proxy setup
PROXY_SERVER = 'proxy.scrapeops.io'
PROXY_SERVER_PORT = '5353'
SCRAPEOPS_API_KEY = ''  # Replace with your API key

# Search setup
SEARCH_WORD = "jeans"


def remove_readonly(func, path, _):
    """Clear the readonly bit and reattempt the removal.
    Once the script finishes working, this function helps us remove the temporary user data created by the script.
    """
    os.chmod(path, stat.S_IWRITE)
    func(path)


async def handle_request(req):
    custom_headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-US,en;q=0.9",
        "Cache-Control": "max-age=0",
        "Connection": "keep-alive",
        "DNT": "1",  # Do Not Track request header
        "Referer": "https://www.google.com/",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36"
    }

    # Add custom headers, overriding the existing ones
    headers = {**req.headers, **custom_headers}

    # Add error handling to the handle_request function
    # to catch any exceptions that might occur during request processing
    try:
        if req.resourceType == 'image':
            await req.abort()
        else:
            await req.continue_({'headers': headers})
    except Exception as e:
        print(f"Error in request handling: {e}")


async def extract_product_details(browser, product_url, semaphore):
    # Use the semaphore to handle concurrency limits
    async with semaphore:
        page = None
        # In case product details are not found, their values default to "not available"
        product_name = "not available"
        product_price = "not available"
        # We have to wrap our code in try...except blocks
        # to make sure that we close browser pages gracefully in case any exceptions arise
        try:
            page = await browser.newPage()
            # Set up request interception to abort image requests and to set up custom headers
            await page.setRequestInterception(True)
            page.on('request', lambda req: asyncio.create_task(handle_request(req)))
            try:
                await page.goto(product_url, waitUntil='domcontentloaded', timeout=120000)
            except pyppeteer.errors.TimeoutError:
                print(f'TimeoutError while loading {product_url}')
                return {'name': "not available", 'price': "not available"}

            # We should make sure that we don't perform any actions
            # on pages that might have been closed due to any exceptions
            if not page.isClosed():
                try:
                    # Set up a custom wait for the product name and parse it
                    await page.waitForSelector("span#productTitle", timeout=60000)
                    product_name = await page.evaluate('() => document.querySelector("span#productTitle").innerText')
                except Exception as e:
                    print(f'Error fetching product name for {product_url}: {e}')

                try:
                    product_price = (await page.evaluate('() => document.querySelector(".a-price-whole").innerText')).replace('\n', '')
                except Exception as e:
                    print(f'Error fetching product price for {product_url}: {e}')

        except Exception as e:
            print(f'Error in extract_product_details for {product_url}: {e}')
        finally:
            # Make sure that the page is closed if it was not already
            if page and not page.isClosed():
                await page.close()

        return {'name': product_name, 'price': product_price}


async def scrape_amazon_products():
    browser = None
    # We have to wrap our code in try...except blocks
    # to make sure that we close the browser gracefully in case any exceptions arise
    temp_dir = tempfile.mkdtemp()
    try:
        browser = await launch(
            ignoreHTTPSErrors=True,
            headless=True,
            defaultViewport={'width': 1900, 'height': 1080},
            args=[
                '--start-fullscreen',
                '--no-sandbox',
                f'--proxy-server=http://scrapeops.headless_browser_mode=true:{SCRAPEOPS_API_KEY}@{PROXY_SERVER}:{PROXY_SERVER_PORT}',
            ],
            userDataDir=temp_dir
        )

        page = await browser.newPage()
        # Set up request interception to abort image requests and to set up custom headers
        await page.setRequestInterception(True)
        page.on('request', lambda req: asyncio.create_task(handle_request(req)))
        await page.setUserAgent(
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3')

        all_product_urls = []

        # Sequentially scrape the first two search results pages for illustration purposes
        for page_num in range(1, 3):
            url = f"https://www.amazon.com/{SEARCH_WORD}/s?k={SEARCH_WORD}&page={page_num}"
            await page.goto(url)
            product_links = await page.querySelectorAll('h2 a.a-link-normal')
            product_urls = [await page.evaluate('(link) => link.href', link) for link in product_links]
            all_product_urls.extend(product_urls)

        await page.close()

        # Set up a semaphore to limit concurrency, otherwise we might overwhelm the server and get blocked
        semaphore = asyncio.Semaphore(3)

        # Concurrently scrape product details for each URL
        tasks = [extract_product_details(browser, product_url, semaphore) for product_url in all_product_urls]
        products = await asyncio.gather(*tasks)

        # Save the data as JSON
        with open('output_data_pyppeteer.json', 'w') as f:
            json.dump(products, f)

    except Exception as e:
        print(f'An error occurred: {e}')
    finally:
        if browser:
            await browser.close()
            # Introduce a small delay to ensure that all asynchronous operations have completed and that
            # all resources, especially those related to the browser and its pages, have been properly released.
            await asyncio.sleep(2)
        try:
            shutil.rmtree(temp_dir, onerror=remove_readonly)
        except Exception as e:
            print(f"Error removing temporary directory: {e}")


asyncio.run(scrape_amazon_products())

Now, with full working code for both approaches, we can compare the two crawlers on the key points:

Storing the Data

Both the Selenium and Pyppeteer scripts store data in the same format, which is JSON. The method of storing data is similar in both scripts, using Python's json.dump to serialize the product details into a file.

Performance Stats

  • Pyppeteer can potentially offer better performance stats due to its asynchronous nature, which allows for concurrent page processing.
  • Selenium, on the other hand, operates synchronously.

Navigating/Crawling Pages

  • Pyppeteer provides more straightforward methods to wait for elements and pages, which can be beneficial when dealing with dynamic content that loads asynchronously.
  • Selenium requires explicit waits and conditions to ensure that elements are present or that pages have loaded, which can add complexity to the script.

Speed

  • The speed of Pyppeteer can be faster due to its asynchronous execution.
  • Selenium is generally slower in comparison because it waits for each task to complete before moving on to the next one.

Integrating Proxies

Both scripts integrate proxies in a similar manner by constructing a URL that routes through the proxy service. However, the setup and management differ slightly.

  • In Pyppeteer, the proxy integration is handled within the asynchronous function, and it's straightforward to apply the proxy to each new page instance.
  • In Selenium, the proxy must be configured for the WebDriver instance. If multiple instances are used for parallelism, each must be configured individually, which can be more cumbersome.

In summary, while both tools can achieve the same end result, Pyppeteer's asynchronous capabilities give it an edge in performance, especially when scraping a large number of pages.

Selenium, being synchronous, is more straightforward and predictable but may require additional considerations for performance optimization, such as parallel execution strategies.


Additional Python Pyppeteer & Selenium Resources

For those delving into web scraping with Python's Pyppeteer and Selenium, a wealth of resources, from official documentation to community forums, is available to guide and enhance your development journey.

https://scrapeops.io/proxy-aggregator/


More Web Scraping Guides

Now, you have a good understanding of Python Selenium and Python Pyppeteer.

Each tool offers distinct advantages tailored to specific needs, but ultimately, the choice depends on the specific requirements of your scraping project.

If you would like to get more information about web scraping, check out our other tutorials: