Python Selenium Guide: Downloading a File
When working with Selenium, we occasionally need to download files, and to do that we first need to identify the steps required. More often than not, this task involves either clicking a button or working through a combination of pop-ups and buttons.
Selenium provides a robust framework for interacting with web elements, but when it comes to file downloads, additional considerations and techniques are required.
This guide will walk you through the process of downloading a file using Selenium.
- TLDR: How to Download a File
- Downloading a File With Selenium
- Downloading Best Practices
- Advanced Techniques
- Conclusion
- More Cool Articles
TLDR: How to Download a File
Here's a basic example in Python to download a file using Selenium:
This script automates the process of navigating to the Python downloads page, clicking on the download button, observing pending downloads, and then closing the browser after a brief pause.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from time import sleep
#declare our url
url = "https://www.python.org/downloads/"
#start chrome
driver = webdriver.Chrome()
#navigate to the website
driver.get(url)
#xpath to the download button
xpath = "/html/body/div/header/div/div[2]/div/div[2]/p/a"
#use the xpath to find the download button
button = driver.find_element(By.XPATH, xpath)
#create an ActionChains instance
action = ActionChains(driver)
#move to the button, and then click it
action.move_to_element(button)\
.click()\
.perform()
#look at pending downloads
driver.get("chrome://downloads")
#sleep for a few seconds so we can see the download
sleep(10)
driver.quit()
In the script above, we:
- Imported the required modules from Selenium (webdriver, By, ActionChains) and sleep from Python's time module.
- Set the URL for the Python downloads page.
- Initialized a Chrome WebDriver instance.
- Opened the specified URL in the Chrome browser.
- Found the download link on the web page using its XPath.
- Created an instance of ActionChains to perform a series of actions:
  - Used move_to_element to move the mouse cursor to the download button.
  - Chained the click action and executed the actions with perform.
- Opened the Chrome downloads page (chrome://downloads) to observe any pending downloads.
- Paused the script execution for 10 seconds using sleep to allow time for the download to start.
- Quit the WebDriver to close the Chrome browser.
Downloading a File With Selenium
As you probably already know, files downloaded in a browser can be of pretty much any type, and download times vary in length.
While Selenium does have limited support for downloading files via click(), the official recommendation from the Selenium project is actually not to download files using Selenium.
The experimental support for file downloads is quite limited, so it is extremely important that we verify our downloads. To view them, we can head over to chrome://downloads or check our local filesystem, just like we would in a normal browsing session.
Locating the Download Link or Button
Selenium gives us first-class support for finding page elements with the find_element() and find_elements() methods.
Both of these methods work the same basic way: find_element() returns the first element that matches a given criterion, and find_elements() returns every matching element as a list.
In the example earlier, we found our element using its XPath.
Here is a code example that finds elements using other criteria:
from selenium import webdriver
from selenium.webdriver.common.by import By
#save our url
url = "https://quotes.toscrape.com"
driver = webdriver.Chrome()
#go to the url
driver.get(url)
#find the header by tag name
header = driver.find_element(By.TAG_NAME, "h1")
print(f"header: {header.text}")
#find all quote tags by their class name
tags = driver.find_elements(By.CLASS_NAME, "tags")
for tag in tags:
    print(f"Tag: {tag.text}")
The example above searches for elements by TAG_NAME and by CLASS_NAME. To find elements with other location methods, change the arguments that we pass into find_element().
Here are examples for other locators:
- By XPath:
driver.find_element(By.XPATH, xpath_to_element)
- By ID:
driver.find_element(By.ID, id_of_the_element_to_find)
Clicking the Download Link or Button
Once we've identified our element with find_element(), it's time to click() on it to begin the download.
If you recall from the TLDR example, our link is not always immediately clickable. ActionChains is often the best workaround for this.
Here are the actions performed in the TLDR example:
#create an ActionChains instance
action = ActionChains(driver)
#move to the button, and then click it
action.move_to_element(button)\
.click()\
.perform()
In the TLDR example, we:
- Create an ActionChains object with action = ActionChains(driver)
- Specify the actions we'd like to perform with action.move_to_element(button).click()
- Call perform() to run the chain of actions we specified
We use move_to_element() to put the cursor over the button. Afterward, we use click() to click on the element. Then, perform() tells Selenium to execute the chain of actions that we just set up.
Clicking the download link is the easy part. The trickiest part is verifying the completion of the download.
Waiting For The Download To Complete
Waiting for a download to complete can be a crucial step in Selenium automation. The challenge lies in determining when the download is finished. As previously mentioned, Selenium has limited experimental support for downloading files. However, you can use a combination of approaches to handle this situation.
In the TLDR example, we check Chrome's downloads manually by opening chrome://downloads with driver.get(), and the page tells us when the download is complete.
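Another option is to watch the download folder on disk instead of the Downloads page. The sketch below is a minimal illustration rather than a Selenium feature: wait_for_download is a hypothetical helper name, and it assumes Chrome marks unfinished downloads with a .crdownload extension inside the same folder as the finished file.

```python
import time
from pathlib import Path


def wait_for_download(file_path, timeout=30.0, poll_interval=0.5):
    """Return True once file_path exists and no partial downloads remain
    in its folder; return False if the timeout expires first."""
    target = Path(file_path)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # Assumption: Chrome keeps a "*.crdownload" file around while
        # a download is still being written
        in_progress = any(target.parent.glob("*.crdownload"))
        if target.exists() and not in_progress:
            return True
        time.sleep(poll_interval)
    return False
```

A script could call wait_for_download() with the expected path of the file right after clicking the link, and only move on (or quit the browser) once it returns True.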
Handling File Download Dialog
Sometimes we have to handle pop-ups, dialogs, or some other special scenario in order to start our download. These scenarios are also often best handled with click().
When downloading a package from PyPI, we need to click the right button in order for the download link to come up.
from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep
#save the url
url = "https://pypi.org/project/selenium/"
#open Chrome
driver = webdriver.Chrome()
#go to the website
driver.get(url)
#find the button
button = driver.find_element(By.ID, "files-tab")
#click the button
button.click()
#xpath of the download link
xpath = "/html/body/main/div[3]/div/div/div[2]/div[4]/div[1]/div[2]/a[1]"
#find the link
link = driver.find_element(By.XPATH, xpath)
#click the link
link.click()
#sleep so we can watch the download
sleep(10)
driver.quit()
As you can see in the code above, we:
- Open Chrome and navigate to the site with webdriver.Chrome() and driver.get()
- Locate the "Download files" button using its ID, files-tab
- Once we've located and clicked this button, we're then able to find and click() the download link
Verifying File Downloaded
Verifying the download is the final and most important step to downloading files. If the download has been corrupted or is somehow otherwise incomplete, all of our effort up to this point has been wasted!
Earlier, we used Chrome's built-in downloads page to check our download. Now, let's check the download using Python's os module.
from selenium import webdriver
from selenium.webdriver.common.by import By
import os
from time import sleep
#path to the downloads folder, relative to the root directory (change this to match your machine)
path = "home/nultinator/Downloads"
#save the url
url = "https://pypi.org/project/selenium/"
#open Chrome
driver = webdriver.Chrome()
#go to the website
driver.get(url)
#find the button
button = driver.find_element(By.ID, "files-tab")
#click the button
button.click()
#xpath of the download link
xpath = "/html/body/main/div[3]/div/div/div[2]/div[4]/div[1]/div[2]/a[1]"
#find the link
link = driver.find_element(By.XPATH, xpath)
#save the filename
filename = link.text
#click the link
link.click()
#sleep so we can watch the download
sleep(10)
#move to root directory
os.chdir("/")
#move to downloads directory
os.chdir(path)
#list the files
files = os.listdir()
#iterate through them
for file in files:
    #use the builtin string method, "count" to find the filename
    if file.count(filename) == 0:
        #if the count is 0, keep going
        continue
    else:
        #if the count is greater than 0, we found the file
        print("Found the file!")
        break
#close Chrome
driver.quit()
The code above is largely the same as the previous example with a few key differences:
- We import the os module so Python can see our filesystem
- We save our filename as a variable
- We save the path to our Downloads folder as a variable
- Once we've downloaded the file, we use os.chdir("/") to move into our root directory
- Once inside our root folder, we use os.chdir(path) to move into our Downloads folder
- os.listdir() returns a list of all the files in the folder, which we save as files
- We iterate through this list looking for strings that contain our filename variable with the count() method
- If count() returns zero occurrences of the name, we keep going
- If count() returns anything other than zero, we've found the file and exit the loop
Downloading Best Practices
When downloading files using Selenium, it's essential to follow best practices to ensure robust and reliable automation. Here are some best practices for handling file downloads with Selenium:
Dealing With Dynamic Content
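In the examples above, we used a manual sleep() to give the page and the download time to finish. This time, let's try using Selenium's built-in wait support. Note that these waits apply to page elements, such as the download link, and not to the download itself.
We can use an implicit wait by simply setting it when we start Chrome: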
from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep
#save the url
url = "https://pypi.org/project/selenium/"
#open Chrome
driver = webdriver.Chrome()
#set implicit wait
driver.implicitly_wait(10)
#go to the website
driver.get(url)
#find the button
button = driver.find_element(By.ID, "files-tab")
#click the button
button.click()
#xpath of the download link
xpath = "/html/body/main/div[3]/div/div/div[2]/div[4]/div[1]/div[2]/a[1]"
#find the link
link = driver.find_element(By.XPATH, xpath)
#click the link
link.click()
#go to downloads page
driver.get("chrome://downloads")
#sleep so we can watch the download
sleep(5)
driver.quit()
Here is an example of the same code using an explicit wait:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from time import sleep
#save the url
url = "https://pypi.org/project/selenium/"
#open Chrome
driver = webdriver.Chrome()
#set the maximum time to wait
wait = WebDriverWait(driver, 10)
#go to the website
driver.get(url)
#find the button
button = driver.find_element(By.ID, "files-tab")
#click the button
button.click()
#xpath of the download link
xpath = "/html/body/main/div[3]/div/div/div[2]/div[4]/div[1]/div[2]/a[1]"
#find the link
link = wait.until(EC.presence_of_element_located(
(By.XPATH, xpath)))
#click the link
link.click()
#go to the downloads page
driver.get("chrome://downloads")
#sleep so we can watch the download
sleep(5)
driver.quit()
Basic Error Handling
When running scripts for long hours in production, it's just about impossible to catch failures and exceptions at the time they occur. We can use try, except, and finally statements to handle our errors, and we can use logging to write them to a file while the program keeps moving onward.
The code below tries to download Selenium without clicking the "Download files" button first. Normally, it would throw an exception and our scraper would crash.
from selenium import webdriver
from selenium.webdriver.common.by import By
import logging
logging.basicConfig(filename="errors.log")
#save the url
url = "https://pypi.org/project/selenium/"
#open Chrome
driver = webdriver.Chrome()
#go to the website
driver.get(url)
#xpath of the download link
xpath = "/html/body/main/div[3]/div/div/div[2]/div[4]/div[1]/div[2]/a[1]"
#find the link, this will throw an exception
try:
    link = driver.find_element(By.XPATH, xpath)
    #click the link
    link.click()
except Exception as e:
    logging.exception(e)
finally:
    driver.quit()
In the code above, we:
- Navigate to the page
- try to look for (and fail to find) the download link with find_element()
- Log the exception inside the except block
- finally, after all errors have been handled and logged, close the browser and exit the script
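Logging a failure is often paired with trying again, since many download errors are temporary. The following is a minimal sketch of that idea; with_retries, its parameters, and the "downloader" logger name are all hypothetical and chosen only for illustration.

```python
import logging
import time

logger = logging.getLogger("downloader")


def with_retries(action, attempts=3, delay=1.0):
    """Call action() until it succeeds or the attempts run out.
    Every failure is logged; the last one is re-raised."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            # logger.exception records the message plus the traceback
            logger.exception("attempt %d of %d failed", attempt, attempts)
            if attempt == attempts:
                raise
            time.sleep(delay)
```

In a Selenium script, the action could be a small function that finds and clicks the download link, so a slow page gets a few chances before the script gives up.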
Dealing with Asynchronous Downloads
When dealing with many asynchronous downloads at once, we can set a manual wait time, which is perhaps what we'd do for a one-time task.
If we absolutely need to wait for a large number of long downloads, we can use Selenium to repeatedly check the Downloads page until it shows no downloads in progress.
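Polling the download folder on disk is one alternative to reading the Downloads page. The sketch below swaps in that technique and is only an illustration: wait_for_new_downloads is a hypothetical helper, and it assumes Chrome gives unfinished downloads a .crdownload extension.

```python
import time
from pathlib import Path


def wait_for_new_downloads(directory, already_there, expected_count,
                           timeout=60.0, poll_interval=1.0):
    """Poll directory until expected_count new, finished files appear.
    Returns their names sorted; raises TimeoutError otherwise."""
    folder = Path(directory)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        names = {p.name for p in folder.iterdir() if p.is_file()}
        # Assumption: in-progress Chrome downloads end in ".crdownload"
        partial = {n for n in names if n.endswith(".crdownload")}
        finished = names - partial - set(already_there)
        if not partial and len(finished) >= expected_count:
            return sorted(finished)
        time.sleep(poll_interval)
    raise TimeoutError(f"downloads in {directory} did not finish in time")
```

To use it, take a snapshot of the folder with set(os.listdir(directory)) before starting the downloads and pass that snapshot as already_there.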
Handling Different File Types
When handling PDF files and images, we need to use the experimental options available in Selenium. The code below downloads a PDF file:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
import os
from time import sleep
#get the current directory
current_folder = os.getcwd()
# Set up Chrome options to download files automatically
chrome_options = Options()
#add our experimental options
chrome_options.add_experimental_option("prefs", {
"download.default_directory": current_folder,
"download.prompt_for_download": False,
"download.directory_upgrade": True,
"plugins.always_open_pdf_externally": True,
"plugins.plugins_list": [{"enabled": False, "name": "Chrome PDF Viewer"}]
})
#open Chrome with our options
driver = webdriver.Chrome(options=chrome_options)
#navigate to the site (this is a download link)
driver.get("https://nakamotoinstitute.org/static/docs/untraceable-electronic-mail.pdf")
#sleep for a moment so the download can complete
sleep(5)
#close the browser
driver.quit()
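In the code above, pay attention to the following things:
- Options() allows us to set custom options for Chrome
- add_experimental_option() takes the name of the option ("prefs") and a dictionary holding all of our preferences
- "plugins.always_open_pdf_externally": True tells Chrome to download PDF files instead of opening them in its built-in viewer
- "plugins.plugins_list": [{"enabled": False, "name": "Chrome PDF Viewer"}] is meant to turn off Chrome's PDF viewer, which needs to be out of the way in order to download PDF files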
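Once a file lands on disk, it can be worth confirming that it really is the type we expected, since a failed download sometimes saves an HTML error page under a .pdf name. The sketch below checks the leading "magic" bytes of a few common formats; sniff_file_type and the SIGNATURES table are hypothetical names, and the list of formats is far from complete.

```python
from pathlib import Path

# Leading bytes that identify a few common file formats
SIGNATURES = {
    "pdf": b"%PDF-",
    "png": b"\x89PNG\r\n\x1a\n",
    "jpeg": b"\xff\xd8\xff",
    "zip": b"PK\x03\x04",
}


def sniff_file_type(file_path):
    """Return the format name whose signature starts the file,
    or None if no known signature matches."""
    with Path(file_path).open("rb") as handle:
        header = handle.read(16)
    for name, signature in SIGNATURES.items():
        if header.startswith(signature):
            return name
    return None
```

For the PDF downloaded above, a result other than "pdf" would be a hint that something other than the document was saved.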
Advanced Techniques
Parallel File Downloads
In this section, we'll learn how to download multiple files at the same time. It is pretty similar to our previous example, but we take it to a whole new extreme. Let's attempt to download every PDF from the site.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
import os
from time import sleep
#get the current directory for downloads
current_folder = os.getcwd()
chrome_options = Options()
chrome_options.add_experimental_option("prefs", {
"download.default_directory": current_folder,
"download.prompt_for_download": False,
"download.directory_upgrade": True,
"plugins.always_open_pdf_externally": True,
"plugins.plugins_list": [{"enabled": False, "name": "Chrome PDF Viewer"}]
})
driver = webdriver.Chrome(options=chrome_options)
#base url
url = "https://nakamotoinstitute.org/literature/"
#navigate to the site
driver.get(url)
#create an empty list
list_of_links = []
links = driver.find_elements(By.XPATH, "//a[contains(@href, '.pdf')]")
#push all our links into list_of_links for longer lifetime
for link in links:
    href = link.get_attribute("href")
    list_of_links.append(href)
#attempt to hit each url
for href in list_of_links:
    try:
        driver.get(href)
    except Exception:
        print("dead link")
#switch to the chrome downloads page
driver.get("chrome://downloads")
#sleep so we can view the page
sleep(10)
#close the browser
driver.quit()
In the example above, we:
- Set our options and navigate to the site
- Create an empty list in the variable list_of_links
- Use find_elements() to find all PDF links on the site using their href attribute
- Because page elements can go stale, push all of these href items into the longer-lived variable list_of_links
- Iterate through list_of_links and try to hit each url in order to download the PDF
- If the link is dead, handle the exception using except and print "dead link" to the terminal
- Navigate to the downloads page to watch the downloads complete
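If the collected links are plain, publicly reachable URLs, the files can also be fetched concurrently outside the browser. The sketch below swaps in Python's standard library (urllib and a thread pool) for Selenium, so it will not carry over cookies or a logged-in session. download_all and fetch_bytes are hypothetical helper names, and files that share a name would overwrite each other.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse
from urllib.request import urlopen


def fetch_bytes(url, timeout=30):
    """Download one URL and return its body as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, dest_dir, fetch=fetch_bytes, max_workers=4):
    """Fetch urls concurrently and save them into dest_dir.
    Returns {url: saved path, or None if that download failed}."""
    os.makedirs(dest_dir, exist_ok=True)

    def download_one(url):
        # Use the last part of the URL path as the filename
        name = os.path.basename(urlparse(url).path) or "download"
        try:
            data = fetch(url)
        except Exception:
            return url, None  # dead link or network error
        path = os.path.join(dest_dir, name)
        with open(path, "wb") as handle:
            handle.write(data)
        return url, path

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(download_one, urls))
```

With the list built by Selenium, a call such as download_all(list_of_links, current_folder) would return a dictionary showing which links were saved and which were dead.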
Security Considerations
When downloading any file, check whether it contains sensitive information. If it does, it is imperative that you handle and store the file properly.
The best thing to do with any such file is to encrypt it and to manage your encryption keys carefully so they don't fall into the wrong hands. This way, even if the wrong person gains access to the file, they won't be able to view it or use it for anything.
Conclusion
You've reached the end of this article. You should now have a solid grasp of not only downloading files with Selenium, but also how to download many of them quickly.
If you'd like to learn more about Selenium in general, take a look at the Selenium Documentation.
More Cool Articles
Get your feet wet with ScrapeOps Guides and Tutorials. We're your one stop shop for all things related to web scraping!
Take a look at the following guides: