
The Python Selenium Guide - Web Scraping With Selenium

Originally designed for automated testing of web applications, over the years Selenium became the go-to headless browser option for Python developers looking to scrape JS-heavy websites.

Selenium gives you the ability to scrape websites that need to be rendered or interacted with before they show all of their data.

For years, Selenium was the most popular headless browser for web scraping. However, since the launch of Puppeteer and Playwright, Selenium has begun to fall out of favour.

Python Selenium is still one of the best headless browser options for Python developers who have browser automation and web scraping use cases. So in this guide we will go through how to install Selenium, manage its web drivers, and use it to scrape, interact with and screenshot web pages.

Need help scraping the web?

Then check out ScrapeOps, the complete toolkit for web scraping.


How To Install Python Selenium

Installing and setting up Python Selenium for browser automation is a straightforward process.

First, you need to install the Selenium library using pip:


pip install selenium

Next, you'll need to download the appropriate web driver for the browser you want to automate. Selenium requires a separate web driver to interact with each browser. For example, if you want to use Chrome, you can download the ChromeDriver from here. Make sure you download the correct version that matches your installed Chrome browser.

After downloading the web driver, extract the executable file and make sure it is added to your system's PATH environment variable.
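
If you are unsure whether the driver is actually discoverable on your PATH, a quick way to check from Python is with shutil.which (a minimal sanity check, not part of Selenium itself):

import shutil

## Prints the resolved path to chromedriver, or None if it isn't on your PATH
print('chromedriver found at:', shutil.which('chromedriver'))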

Managing WebDrivers

Sometimes managing web drivers can be a bit of a pain, as you might have issues getting them set up on your PATH environment variable, and you need to remember to download a new driver every time a new browser version (for example, a new Chromium release) becomes available.

An easier option is to use Webdriver Manager, which we show how to use below.


How To Use Selenium

Now, let's create our first script using Selenium to open a page in a browser.

## demo.py

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Set the path to your Chrome webdriver executable.
# Download the appropriate version for your Chrome browser from https://sites.google.com/a/chromium.org/chromedriver/downloads
driver_path = '/path/to/chromedriver'

# Initialize the browser driver (for example, Chrome)
driver = webdriver.Chrome(service=Service(driver_path))

# Open a webpage
driver.get('https://quotes.toscrape.com/')

# Take a screenshot
driver.save_screenshot('screenshot.png')

# Close the browser
driver.quit()


Now when we run our script demo.py:


python demo.py

Python Selenium will open a browser, navigate to the given URL (https://quotes.toscrape.com/), take a screenshot, and save it as screenshot.png.

Selenium's Python bindings are synchronous, but you can still handle multiple browser automation tasks concurrently by driving a separate browser instance from each worker thread (or from asyncio via an executor), improving performance when dealing with numerous tasks.
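
As a rough sketch of that pattern (assuming each task gets its own browser instance, and that Selenium can locate a driver via your PATH or Selenium Manager):

from concurrent.futures import ThreadPoolExecutor
from selenium import webdriver

def screenshot_page(url, filename):
    ## Each thread drives its own independent browser instance
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        driver.save_screenshot(filename)
    finally:
        driver.quit()

tasks = [
    ('https://quotes.toscrape.com/page/1/', 'page1.png'),
    ('https://quotes.toscrape.com/page/2/', 'page2.png'),
]

## Run the tasks concurrently across a small pool of threads
with ThreadPoolExecutor(max_workers=2) as executor:
    for url, filename in tasks:
        executor.submit(screenshot_page, url, filename)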


Managing Selenium Web Drivers

In the above example, we showed you how to run Selenium when you have already downloaded and installed a web driver on your machine and added it to your PATH environment variable.

This method works, but it can cause issues for some users if they have trouble getting the driver set up on their PATH, and it requires them to regularly update their web drivers to stay in sync with the latest Chrome, Firefox, Edge, etc. releases.

An easier option is to use Webdriver Manager for Python, which automatically downloads and sets up the latest web driver version for your browser when you run your Selenium scraper script.

The first step is to install webdriver-manager on your machine or virtual environment.


pip install webdriver-manager

The next step is simply to configure your Selenium scraper to use the web driver downloaded by webdriver_manager. In this case, we use the Chrome driver.

## demo.py

from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager

# Initialize the browser driver (for example, Chrome) using ChromeDriverManager
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))

# Open a webpage
driver.get('https://quotes.toscrape.com/')

# Take a screenshot
driver.save_screenshot('screenshot.png')

# Close the browser
driver.quit()

For more configuration options for the Web Driver Manager package, check out the official docs here.
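
The same pattern works for the other supported browsers too. For example, here is a minimal sketch for Firefox (assuming you have Firefox installed):

from selenium import webdriver
from selenium.webdriver.firefox.service import Service as FirefoxService
from webdriver_manager.firefox import GeckoDriverManager

## webdriver_manager downloads the matching geckodriver automatically
driver = webdriver.Firefox(service=FirefoxService(GeckoDriverManager().install()))
driver.get('https://quotes.toscrape.com/')
driver.quit()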



How To Scrape Pages With Selenium

A common use case for Selenium and other browser automation libraries is scraping websites.

You can easily scrape websites using Selenium's element location API, which lets you query the rendered DOM much like JavaScript's Document API.


from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

def main():
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # Optional: Run the browser in headless mode (no GUI).

    # Set the path to your Chrome webdriver executable.
    # Download the appropriate version for your Chrome browser from https://sites.google.com/a/chromium.org/chromedriver/downloads
    driver_path = '/path/to/chromedriver'

    # Launch the Chrome browser.
    driver = webdriver.Chrome(service=Service(driver_path), options=options)

    try:
        # Navigate to the target website.
        driver.get('https://quotes.toscrape.com/')

        # Get Title
        title_element = driver.find_element(By.TAG_NAME, 'h1')
        title = title_element.text
        print('title:', title)

    finally:
        # Close the browser after finishing the scraping.
        driver.quit()

if __name__ == "__main__":
    main()
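
Often you will want more than a single element. Here is a minimal sketch using find_elements, which returns a list of every matching element (the .quote and .text class names reflect the markup on quotes.toscrape.com):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://quotes.toscrape.com/')

## find_elements returns a list of all matches (an empty list if there are none)
quote_elements = driver.find_elements(By.CSS_SELECTOR, '.quote .text')
for quote in quote_elements:
    print(quote.text)

driver.quit()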



Alternatively, we can just retrieve the rendered HTML with driver.page_source and use a library like BeautifulSoup to parse out the data we need.


from selenium import webdriver
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
driver.get('https://quotes.toscrape.com/')

## Get HTML
html_response = driver.page_source
driver.quit()

## Load HTML Response Into BeautifulSoup
soup = BeautifulSoup(html_response, "html.parser")
title = soup.find('h1').text
print('title', title)



How To Wait For The Page To Load

A common requirement when using headless browsers is making sure all the content has loaded prior to moving onto the next step.

With Selenium we can do this in two ways:

  1. Wait a specific amount of time
  2. Wait for a page element to appear

Wait Specific Amount of Time

To wait for a specific amount of time before carrying out the next steps in our script, we can simply call time.sleep() and define a time in seconds:


import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://quotes.toscrape.com/')

## Wait 5 seconds before moving on
time.sleep(5)

## Next Steps

driver.quit()


Wait For Page Element To Appear

The other approach is to wait for a page element to appear on the page before moving on.

We can do this using WebDriverWait together with Selenium's expected conditions:


from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://quotes.toscrape.com/')

## Wait up to 10 seconds for the h1 element to become visible
WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located((By.TAG_NAME, 'h1'))
)

## Next Steps

driver.quit()
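
Selenium also supports implicit waits, which tell the driver to keep polling for up to a given number of seconds every time it looks up an element. It is a global setting, so be careful when mixing it with explicit waits:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()

## Every find_element call will now retry for up to 10 seconds
## before raising NoSuchElementException
driver.implicitly_wait(10)

driver.get('https://quotes.toscrape.com/')
title = driver.find_element(By.TAG_NAME, 'h1').text
print('title:', title)

driver.quit()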


How To Click On Buttons With Selenium

Clicking a button or other page element with Selenium is pretty simple.

We just need to find it with a selector and then tell Selenium to click on it:


from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://quotes.toscrape.com/')

## Click Element
link = driver.find_element(By.TAG_NAME, 'h1')
link.click()

driver.quit()
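
On dynamic pages an element can be present but not yet clickable, so it is often safer to combine the click with an explicit wait. A sketch using the page's Login link:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://quotes.toscrape.com/')

## Wait until the element is actually clickable before clicking it
button = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.LINK_TEXT, 'Login'))
)
button.click()

driver.quit()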


How To Scroll The Page With Selenium

A lot of modern websites now use infinite scroll to load more results onto the page, requiring you to scroll the page to scrape all the data you need.

We can scroll to the bottom of the page using the execute_script method:


from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://quotes.toscrape.com/scroll')

## Scroll To Bottom
driver.execute_script("window.scrollBy(0, document.body.scrollHeight);")

driver.quit()
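
A single scroll is rarely enough for a true infinite scroll page. A common pattern (a sketch, with the 2 second wait being a tunable assumption) is to keep scrolling until the page height stops growing:

import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://quotes.toscrape.com/scroll')

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    ## Scroll to the bottom and give new content time to load
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)

    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  ## No new content loaded, so we've reached the end
    last_height = new_height

driver.quit()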


How To Take Screenshots With Selenium

Another common use case for using an automated browser is taking screenshots, which Selenium makes very easy.

To take a screenshot with Selenium we just need to call driver.save_screenshot() and define the path to save the file.


from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://quotes.toscrape.com/')
driver.save_screenshot('screenshot.png')
driver.quit()

You can also change the browser window size prior to taking the screenshot using driver.set_window_size():


driver.set_window_size(1600, 900)
driver.save_screenshot('screenshot.png')
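
If you need the full page rather than just the visible viewport, Selenium 4's Firefox driver exposes a full-page screenshot method (Chrome has no direct equivalent without going through the DevTools protocol):

from selenium import webdriver

driver = webdriver.Firefox()
driver.get('https://quotes.toscrape.com/')

## Firefox-only in Selenium 4: captures the entire page, not just the viewport
driver.save_full_page_screenshot('full_page.png')
driver.quit()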


How To Use A Proxy With Selenium

If you are scraping then you will likely want to use a proxy.

With Selenium you can set a proxy when you launch the browser by adding the --proxy-server argument to your Chrome options:


from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--proxy-server=ip:port')

driver = webdriver.Chrome(options=options)
driver.get('https://quotes.toscrape.com/')
driver.quit()

If you need to authenticate the proxy then things are a bit trickier, as Selenium has no built-in way to submit proxy credentials for Chrome. A common workaround is the selenium-wire package, which accepts authenticated proxy URLs directly:


## Requires: pip install selenium-wire
from seleniumwire import webdriver  # drop-in replacement for selenium's webdriver

options = {
    'proxy': {
        'http': 'http://user:passw@ip:port',
        'https': 'https://user:passw@ip:port',
    }
}

driver = webdriver.Chrome(seleniumwire_options=options)
driver.get('https://quotes.toscrape.com/')
driver.quit()


More Selenium Functionality

Python Selenium has a huge amount of functionality and is highly customisable, so much so that it is hard to cover everything properly in a single guide.

So if you would like to learn more about Python Selenium then check out the official documentation here.

It covers everything from setting user-agents:


options = webdriver.ChromeOptions()
options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36')
driver = webdriver.Chrome(options=options)

To running the browser in headless mode:


options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)


More Web Scraping Tutorials

In this guide we've introduced you to the fundamental functionality of Python Selenium and how to use it in your own projects.

If you would like to learn more about different Javascript rendering options for Python, or about other Python libraries like Scrapy, then be sure to check out our other guides.

If you would like to learn more about web scraping in general, then be sure to check out The Python Web Scraping Playbook.