The Python Selenium Guide - Web Scraping With Selenium
There are all sorts of reasons to automate processes on the web. Search engines use web crawlers to constantly crawl and scrape every site they can, then use the content they find to rank search results. People often need content scraped as well, such as product listings and other page information.
Unlike many other scraping tools, Selenium can be used to simulate human use of a webpage. Selenium makes it a breeze to accomplish things that would be nearly impossible with another scraping package. So in this guide we will go through:
- What is Selenium
- Installing Selenium
- Testing To Make Sure it Works
- Basic Actions in Selenium
- Filling Form Info
- Taking Screenshots
- Dealing With Dynamic Elements
- Infinite Scrolling
- Dealing with Pop Ups
- Different Configurations
- Get Around Anti-Scraping Protections with Selenium
- How to Use A Proxy With Selenium
- Conclusion
What is Selenium?
Originally designed for automated testing of web applications, Selenium has over the years become the go-to headless browser option for Python developers looking to scrape JS-heavy websites.
Selenium gives you the ability to scrape websites that need to be rendered or interacted with before they show all of their data.
What is Selenium used for?
Python Selenium is one of the best headless browser options for Python developers who have browser automation and web scraping use cases.
Let's take a look at the good and not-so-good aspects of using Selenium. Understanding both its strengths and weaknesses can give us a better idea of what it can and cannot do.
Pros
Selenium has many strengths that make it a top choice for automated testing of web applications and for browser automation.
- Cross-Browser Compatibility: Selenium supports multiple browsers, including Chrome, Firefox, Internet Explorer, and Safari, enabling comprehensive cross-browser testing.
- Language Support: It offers support for various programming languages, allowing users to write automation scripts in their preferred language.
- Open-Source: Being an open-source tool, Selenium has a large and active community that provides support, updates, and additional functionalities.
- Flexibility: Selenium is highly flexible and can be integrated with other tools and frameworks to enhance its capabilities.
- Parallel Test Execution: It allows for the execution of tests in parallel, significantly reducing the overall testing time for large-scale applications.
Cons
There are also some aspects of Selenium that can pose challenges and limitations in certain scenarios.
- Complexity in Setup: Setting up Selenium for the first time can be challenging, especially for those new to test automation and unfamiliar with its configuration.
- Lack of Built-In Reporting: Selenium does not have built-in reporting capabilities, requiring users to integrate additional tools or frameworks for comprehensive test result analysis and reporting.
- Limited Support for Non-Web-Based Applications: Selenium is primarily focused on web applications and lacks robust support for testing non-web-based applications and systems.
- Flakiness in Tests: Automated tests built with Selenium can sometimes be flaky due to factors such as dynamic web elements, asynchronous behavior, or timing issues.
- Maintenance Overhead: Test scripts may require regular maintenance to ensure they remain functional and compatible with any changes made to the application under test.
Installing Selenium
To get started with Selenium in Python, first you need Python and pip installed on your system. This article assumes that you've already installed Python and that you are ready to get started with Selenium.
To install Selenium with pip, use the following command.
pip install selenium
Next, we need to make sure that we have our webdriver installed. Selenium supports numerous browsers such as Chrome, Firefox, Edge, and Safari. You may view the full list of supported browsers here.
If you wish to use Selenium with Chrome, you would use chromedriver. Make sure you download the correct version that matches your installed Chrome browser.
If you would prefer another browser, like Firefox, you would go ahead and use geckodriver.
Remember that drivers are OS-specific: a driver built for Windows will not work on macOS or Linux, and the same goes for Linux and macOS drivers on the other operating systems.
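If you would rather not keep the driver on your PATH or in your working directory, recent versions of Selenium also let you point to it explicitly with a Service object. Here is a minimal sketch, assuming Chrome; the driver path is just a placeholder for wherever you saved yours:
#import our webdriver and the Service class
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

#point Selenium at the driver explicitly (placeholder path)
service = Service("/path/to/chromedriver")
driver = webdriver.Chrome(service=service)
driver.get("https://scrapeops.io/")
driver.quit()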
Testing to Make Sure it Works
We can start by placing our driver in the same directory as our scraping script. First, we create a folder, then we place our necessary components inside it.
mkdir selenium_tutorial
Either copy or move your webdriver to this folder.
cp path/to/chromedriver selenium_tutorial/chromedriver
Now we need to create a Python script. Open this new directory with your code editor of choice and create a new file, selenium_tutorial.py.
Please note that the specific HTML structure, element locators, and class or ID attributes used in the code samples are based on the current state of the web page as of 17/10/2023.
Due to possible updates or modifications to the webpage's design and structure, the element locators and CSS selectors mentioned in the examples may no longer be valid or effective.
Please leave a comment on the article if you encounter any discrepancies, changes, or issues with the provided code samples.
Now, let's create our first script using Selenium to open a page in a browser:
#import our webdriver
from selenium import webdriver
#open the Chrome browser
driver = webdriver.Chrome()
#navigate to the scrapeops website
driver.get("https://scrapeops.io/")
#close the browser
driver.quit()
If everything is working correctly, this script above will open Chrome, find our page (https://scrapeops.io/), and then close Chrome.
Basic Actions in Selenium
Now let's make Selenium actually do stuff. We'll start with some simple scrolling and clicking. Let's add a while loop and some imports so Selenium knows what to do and when to shut off.
Selenium can find elements in several different ways:
- NAME: Locates elements using the "name" attribute.
- ID: Locates elements using the "id" attribute, which is expected to be unique on the page.
- TAG_NAME: Locates elements based on their HTML tag name.
- CLASS_NAME: Locates elements using the value of the "class" attribute.
- CSS_SELECTOR: Locates elements using CSS selectors, which offer a powerful way to select elements based on their attributes.
- XPATH: Locates elements using XPath expressions, which provide a way to navigate through the structure of an HTML or XML document.
We'll use the methods listed above throughout this tutorial.
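To make these locators concrete, here is a short sketch that tries a few of them against quotes.toscrape.com, a site we also use later in this guide. The selectors for CLASS_NAME, TAG_NAME, CSS_SELECTOR, and XPATH match elements that exist on that page; NAME and ID are only mentioned in a comment because the homepage does not use those attributes:
#import our webdriver and By
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://quotes.toscrape.com/")

#find every quote card by its class attribute
quotes = driver.find_elements(By.CLASS_NAME, "quote")
#find the first link on the page by its tag name
first_link = driver.find_element(By.TAG_NAME, "a")
#find the first quote's text with a CSS selector
first_text = driver.find_element(By.CSS_SELECTOR, "span.text")
#find the first author with an XPath expression
first_author = driver.find_element(By.XPATH, "//small[@class='author']")

#NAME and ID work the same way on pages that use those attributes, e.g.
#driver.find_element(By.ID, "username") on the site's /login form

print(len(quotes), first_text.text, first_author.text)
driver.quit()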
First, we'll add the import statements that set up the dependencies our Selenium script needs to interact with a web browser, perform various actions, and handle delays during the automation process.
#import our webdriver
from selenium import webdriver
#import ActionChains
from selenium.webdriver import ActionChains
#import By
from selenium.webdriver.common.by import By
#import ScrollOrigin
from selenium.webdriver.common.actions.wheel_input import ScrollOrigin
#import the ability to sleep
from time import sleep
In the code above:
- We import several things from Selenium, and we also import sleep from time to introduce delays in the script.
- We import ActionChains in order to create a chain of actions, such as scrolling and clicking.
- We import By so we can find objects by a certain property, such as CLASS_NAME or any of the others from the list above.
- We also import ScrollOrigin from wheel_input so we have the ability to scroll up and down the page.
Below the imports, we can add the following code:
#create a bool to know when we're running
running = True
#open the Chrome browser
driver = webdriver.Chrome()
while running:
    #navigate to the scrapeops website
    driver.get("https://scrapeops.io/")
    #wait a couple seconds
    sleep(2)
    #save the page's footer as a variable
    footer = driver.find_element(By.TAG_NAME, "footer")
    #scroll to the footer
    ActionChains(driver)\
        .scroll_to_element(footer)\
        .perform()
    #wait a couple more seconds
    sleep(2)
    #exit the loop
    running = False
#close the browser
driver.quit()
The code above will automatically scroll to the bottom of our page. As you can see, we find the footer of the page by using its TAG_NAME (in HTML this would be <footer>) and we use ActionChains to create a new chain of actions.
After finding our footer, we use our ActionChains object to scroll down to it with the scroll_to_element method.
Now let's get set up to add a button click.
###previous code goes above this line. Remove driver.quit() before adding the code together.###
#after scrolling, we find all the links on the page
links = driver.find_elements(By.TAG_NAME, "a")
#iterate through the links
for link in links:
    print(link.text)
    #Find "Get Free Account"
    if link.text == "Get Free Account":
        #save this link as our target
        target = link
        #break the loop
        break
#scroll to the link we want
ActionChains(driver)\
    .scroll_to_element(target)\
    .perform()
sleep(2)
target.click()
#exit the loop
running = False
#close the browser
driver.quit()
- First, we tell Selenium to find all of the links available on the page.
- Once we have our list of links, we iterate through them and look for the "Get Free Account" button.
- Once we've found the link we want, we scroll up to it.
- Finally, we click the link with the .click() method, as shown at the end of the code block above.
To navigate to the previous page, we can use the .back() method, and .forward() to go forward again. Let's move back and forth between pages.
###our previous code is up here. Remove driver.quit() before adding the code together.###
target.click()
driver.back()
sleep(2)
driver.forward()
#close the browser
driver.quit()
Next, let's find our form on the new page and fill it out!
This time we'll find the form elements with different methods, including XPATH, ID, NAME, and CLASS_NAME.
###previous code is up here. Remove driver.quit() before adding the code together. ###
driver.forward()
#find by XPATH
name_box = driver.find_element(By.XPATH,
"""//*[@id="mat-input-0"]""")
#Find by ID
email_box = driver.find_element(By.ID, "mat-input-2")
#Find by NAME
password_box = driver.find_element(By.NAME, "password")
#Find by NAME
confirm_password_box = driver.find_element(By.NAME, "confirmedPassword")
#Find by CLASS_NAME
captcha = driver.find_element(By.CLASS_NAME, "h-captcha")
running = False
#close the browser
driver.quit()
In the code above:
- We find the name_box with the XPATH method.
- We find the email_box by ID.
- We use NAME to find password_box and confirm_password_box.
- Finally, we find our captcha checkbox by using CLASS_NAME.
Filling Form Info
Filling forms is a fundamental aspect of many web applications, often used for user registration, data submission, and other interactive tasks.
Now let's use Selenium to fill out the form. We'll use the following methods:
- send_keys()
- select_by_value()
- click()
Now to add these into our code.
###previous code is above this line. Remove driver.quit() before adding the code together.###
#Fill in the name box
name_box.send_keys("Scraper Guy")
#Fill in the email box
email_box.send_keys("my_special_email@gmail.com")
#fill in the password box
password_box.send_keys("mysupersecretpassword")
#confirm the password
confirm_password_box.send_keys("mysupersecretpassword")
#Check the captcha box
captcha.click()
sleep(90)
running = False
#close the browser
driver.quit()
If we run the code above, it fills out the form for us, but we have a slight problem: this captcha requires us to actually identify objects in pictures. Since this part requires real human interaction, let's allow 90 seconds for it. Whoever runs the scraper gets 90 seconds to complete the captcha, and then the script continues to run.
We don't strictly need to, since a timezone is already selected by default, but let's also tell this script to choose a timezone using the dropdown.
We would normally do this by creating a Select object (imported from selenium.webdriver.support.ui), finding the dropdown element, and using select_by_value() or select_by_index() to choose something.
Normally this portion of our code would look like this:
from selenium.webdriver.support.ui import Select

dropdown = Select(driver.find_element(By.TAG_NAME, "select"))
dropdown.select_by_index(2)
Or:
dropdown = Select(driver.find_element(By.TAG_NAME, "select"))
dropdown.select_by_value("America/Dominica (-4:00)")
The site we're scraping actually doesn't use the select tag; we're dealing with a custom-built selector. To get around this, we can simply use the .click() method that we used earlier.
To find the XPATH of an object on a webpage, you can simply right-click and inspect it. Then in the "copy" options, you can copy the XPATH of the item.
###previous code goes above this line###
dropdown = driver.find_element(By.ID, "timezone-picker")
dropdown.click()
option = driver.find_element(By.XPATH, "/html/body/app-root/div[2]/app-register/div/form/mat-card/mat-card-content/div/ng-moment-timezone-picker/div/ng-select/ng-dropdown-panel/div/div[2]/div[9]/span")
option.click()
captcha.click()
sleep(90)
running = False
Taking Screenshots
Quite often, you will want to save a copy of the page you are scraping for further analysis or to verify the information that you've scraped.
We can accomplish this with the save_screenshot() method. Let's adjust our code so it takes a screenshot of the empty form and a screenshot of the filled form.
Add the following line of code before you fill in the form:
driver.save_screenshot("before.png")
Add the following line of code after you fill in the form:
driver.save_screenshot("after.png")
Your code block should be placed like this to appropriately take screenshots before and after you fill in the form:
# Take a screenshot of the empty form
driver.save_screenshot("before.png")
#Fill in the name box
name_box.send_keys("Scraper Guy")
#Fill in the email box
email_box.send_keys("my_special_email@gmail.com")
#fill in the password box
password_box.send_keys("mysupersecretpassword")
#confirm the password
confirm_password_box.send_keys("mysupersecretpassword")
# Take a screenshot of the filled form
driver.save_screenshot("after.png")
Dealing With Dynamic Elements
JavaScript often loads dynamic content, handles asynchronous operations, and renders complex web page layouts. By managing JavaScript effectively, scraping tools can retrieve data accurately, while automated tests can simulate user interactions more reliably.
In many cases, we have to either wait for JavaScript content to load or just flat out disable it.
Let's use Selenium to wait for a new page to load.
First we'll add WebDriverWait and expected_conditions to our imports, then we tell Selenium to wait until the expected elements have loaded.
#import our webdriver
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
#open the Chrome browser
driver = webdriver.Chrome()
#navigate to the quotes.toscrape.com website
driver.get("https://quotes.toscrape.com/")
try:
    #wait up to 5 seconds for an element with the "quote" class to appear
    element = WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.CLASS_NAME, "quote"))
    )
finally:
    driver.save_screenshot("loaded.png")
Infinite Scrolling
Some websites allow you to scroll infinitely. Infinite scrolling allows for a seamless browsing experience and is commonly used on social media platforms, news websites, and various other types of web applications.
However, from a web scraping perspective, infinite scrolling can pose challenges for data extraction. Since the additional content loads dynamically as the user scrolls down, web scrapers that rely on simple HTTP requests may not capture the complete data set.
With a simple while loop, we can scroll infinite sites.
from time import sleep
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.common.actions.wheel_input import ScrollOrigin
#open the Chrome browser
driver = webdriver.Chrome()
driver.get("https://shopping.google.com/m/bestthings/")
scroll_origin = ScrollOrigin.from_viewport(0, 10)
while True:
    ActionChains(driver)\
        .scroll_from_origin(scroll_origin, 0, 2000)\
        .perform()
    sleep(2)
driver.quit()
- First, we get the site with driver.get().
- We then tell Selenium to scroll down 10 from the viewport by passing it as an argument to ScrollOrigin.from_viewport(), which takes two arguments: an x (horizontal) value and a y (vertical) value.
- In Python, while True: automatically creates an infinite loop.
- So during our infinite loop, we scroll down by 2000, wait two seconds, and scroll again.
- Since we are inside an infinite loop, this process repeats forever (well, until we kill the process by shutting off Python). A sketch of a version that stops on its own follows this list.
Dealing with Pop Ups
Pop-ups are all over the web, and in many cases they are simply disabled by the browser. While Selenium has built-in handling for JavaScript alerts (driver.switch_to.alert, with its dismiss() and accept() methods), more often than not we're going to run into custom-built pop-ups called modals.
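For reference, handling a genuine JavaScript alert looks roughly like the sketch below. The URL is a placeholder, since the pages in this guide use modals rather than native alerts; the pattern of waiting for the alert and then accepting or dismissing it is the part that matters:
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
#placeholder URL for a page that triggers a native JavaScript alert
driver.get("https://example.com/page-with-an-alert")

#wait until an alert is actually present, then switch to it
WebDriverWait(driver, 10).until(EC.alert_is_present())
alert = driver.switch_to.alert

#read the alert's message, then accept it (dismiss() would click "Cancel")
print(alert.text)
alert.accept()

driver.quit()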
In the code below, we head to the website and sleep for a few seconds while we wait for the pop-ups to appear. After the cookie notice appears, we find the x (close) button using its XPATH and click() it in order to close the pop-up. The code then sleeps for a little while longer so that you can watch the pop-up close.
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
#open the Chrome browser
driver = webdriver.Chrome()
driver.get("https://www.mcdonalds.com/us/en-us.html")
sleep(10)
close_cookie_xpath = "/html/body/div[2]/div[2]/div/div[2]/button"
close_button = driver.find_element(By.XPATH, close_cookie_xpath)
close_button.click()
sleep(10)
driver.quit()
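Since the script above already imports WebDriverWait and expected_conditions, you could also swap the first fixed sleep for an explicit wait, so the script clicks the cookie notice as soon as it becomes clickable. A sketch of that variation, reusing the same XPATH:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.mcdonalds.com/us/en-us.html")

close_cookie_xpath = "/html/body/div[2]/div[2]/div/div[2]/button"
#wait up to 15 seconds for the close button to be clickable, then click it
close_button = WebDriverWait(driver, 15).until(
    EC.element_to_be_clickable((By.XPATH, close_cookie_xpath))
)
close_button.click()

driver.quit()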
Different Configurations
Depending on what you're doing, you may want to disable JavaScript or launch in headless mode.
The code below will run with JavaScript disabled.
from time import sleep
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_experimental_option(
    "prefs",
    {
        #a value of 2 blocks JavaScript; 1 would allow it
        'profile.managed_default_content_settings.javascript': 2
    }
)
#open the Chrome browser
driver = webdriver.Chrome(options=options)
driver.get("https://www.whatismybrowser.com/detect/is-javascript-enabled/")
sleep(10)
driver.quit()
Now let's run in headless mode. The code below will run without displaying a browser window.
from time import sleep
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--headless")
#open the Chrome browser
driver = webdriver.Chrome(options=options)
driver.get("https://www.whatismybrowser.com/detect/is-javascript-enabled/")
sleep(10)
driver.quit()
Get Around Anti-Scraping Protections with Selenium
When using Selenium in Python for web scraping, there are several strategies to avoid anti-scraping protections and prevent detection. These include:
- Adding Real Headers:
  - Mimic the behavior of a regular browser by setting legitimate user-agent headers that imitate various browsers, operating systems, and devices. This can help prevent detection by the website's anti-scraping mechanisms.
  - Emulate typical header information, such as the 'User-Agent' header, to make the scraper appear more like a genuine user (a short sketch follows this list).
- Handling Hidden Elements:
  - Identify and interact with hidden elements, if necessary, to bypass any attempts by the website to prevent automated scraping.
  - Use Selenium's capabilities to simulate user behavior, such as scrolling, clicking, or filling in forms, to access and retrieve data from hidden or dynamically loaded content.
- Using Proxies:
  - Rotate IP addresses using proxies to avoid being blocked by websites that have IP-based anti-scraping measures.
  - Distribute scraping requests across multiple IP addresses to reduce the risk of triggering rate limits or IP bans.
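As a simple example of the first point, you can tell Chrome to present a specific User-Agent string through its options. The string below is only an example, and the detection page is on the same site used earlier in this guide:
from time import sleep
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
#present a specific User-Agent instead of the default one (example string)
options.add_argument(
    "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36"
)

driver = webdriver.Chrome(options=options)
driver.get("https://www.whatismybrowser.com/detect/what-is-my-user-agent/")
sleep(10)
driver.quit()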
For more information, check out our guide to learn how to set up and use Selenium's Undetected ChromeDriver to bypass some of the most sophisticated anti-bot mechanisms on the market today, like DataDome, PerimeterX, and Cloudflare.
How to Use A Proxy With Selenium
There are many reasons you might want to use a proxy on the web, and with Selenium specifically.
When determining your location, servers use your IP address. Proxies allow us not only to conceal our IP address, but also to rotate it. This lets us get around things such as rate limiting and geo-blocking.
The ScrapeOps Proxy API Aggregator allows us to easily set and use proxies. Let's make an example that uses the ScrapeOps Proxy and undetected-chromedriver.
First we need to install selenium-wire and undetected-chromedriver.
pip install selenium-wire
pip install undetected-chromedriver
We are now ready to set up a web scraper in Selenium with proxy support using the seleniumwire library.
from time import sleep
import seleniumwire.undetected_chromedriver as uc

email = "some_address@emailprovider.com"
API_KEY = "my_super_special_api_key"

proxy_url = f"http://{email}.headless_browser_mode=true:{API_KEY}@proxy.scrapeops.io:5353"

chrome_options = uc.ChromeOptions()
chrome_options.headless = False

proxy_options = {
    "proxy": {
        "http": proxy_url,
        "https": proxy_url,
        "no_proxy": "127.0.0.1"
    }
}
driver = uc.Chrome(options=chrome_options, seleniumwire_options=proxy_options)
driver.get("http://quotes.toscrape.com/")
sleep(10)
driver.quit()
While the code sample above is small, there's quite a bit going on.
- proxy_url is a formatted string that inserts our email and API_KEY into the proxy server URL.
- We then construct an undetected_chromedriver object. Think of this as an alternative driver to the chromedriver that we used throughout most of this tutorial.
- uc.ChromeOptions() is an alternative to the Options object used earlier.
- We set headless to False so we can see what is going on inside the browser.
- After we have everything set up, we construct our own driver object from all that information with uc.Chrome(options=chrome_options, seleniumwire_options=proxy_options).
- When you run the code, your browser will navigate to quotes.toscrape.com and then close after a few seconds.
More Web Scraping Tutorials
Now you know the basics of Selenium and how to use it in your own projects.
If you would like to learn more about Selenium or other Python libraries, then be sure to check out our other guides: