Downloading a File Using Selenium

Python Selenium Guide: Downloading a File

When working with Selenium, we occasionally need to download files. To download a file, we first need to identify the steps required to do so. More often than not, this involves either clicking a button or working through a combination of pop-ups and buttons.

Selenium provides a robust framework for interacting with web elements, but when it comes to file downloads, additional considerations and techniques are required.

This guide will walk you through the process of downloading a file using Selenium.

TLDR: How to Download a File
Downloading a File With Selenium
Downloading Best Practices
Advanced Techniques
Conclusion

TLDR: How to Download a File

Here's a basic example in Python to download a file using Selenium:

This script automates the process of navigating to the Python downloads page, clicking on the download button, observing pending downloads, and then closing the browser after a brief pause.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from time import sleep
#declare our url
url = "https://www.python.org/downloads/"
#start chrome
driver = webdriver.Chrome()
#navigate to the website
driver.get(url)
#xpath to the download button
xpath = "/html/body/div/header/div/div[2]/div/div[2]/p/a"
#use the xpath to find the download button
button = driver.find_element(By.XPATH, xpath)
#create an ActionChains instance
action = ActionChains(driver)
#move to the button, and then click it
action.move_to_element(button)\
    .click()\
    .perform()
#look at pending downloads
driver.get("chrome://downloads")
#sleep for a few seconds so we can see the download
sleep(10)
driver.quit()

In the script above, we:

Import the required modules from Selenium (webdriver, By, ActionChains) and sleep from Python's time module.
Set the URL for the Python downloads page.
Initialize a Chrome WebDriver instance.
Open the specified URL in the Chrome browser.
Find the download link on the web page using its XPath.
Create an instance of ActionChains to perform a series of actions.
- Use move_to_element to move the mouse cursor to the download button.
- Chain the click action and execute the actions with perform.
Open the Chrome downloads page (chrome://downloads) to observe any pending downloads.
Pause the script for 10 seconds using sleep so the download has time to run.
Quit the WebDriver to close the Chrome browser.

Downloading a File With Selenium

As you probably already know, when you download a file in your browser, download times vary in length and we can download files of pretty much any type.

While Selenium does have limited support for downloading files via click(), the official recommendation from their site is actually not to download files using Selenium.

Selenium's support for file downloads is limited and experimental, so it is extremely important that we verify our downloads. To view them, we can head over to chrome://downloads or check our local filesystem, just like we would in a normal browsing session.

Locating the Download Link or Button

Selenium gives us first-class support for finding page elements with the find_element() and find_elements() methods.

Both of these methods work the same basic way. find_element() finds and returns the first element that matches a given criterion, and find_elements() finds all matching elements and returns them as a list.

In the example earlier, we found our element using its XPath.

Here is a code example to find other elements using different criteria:

from selenium import webdriver
from selenium.webdriver.common.by import By
#save our url
url = "https://quotes.toscrape.com"
driver = webdriver.Chrome()
#go to the url
driver.get(url)
#find the header by tag name
header = driver.find_element(By.TAG_NAME, "h1")
print(f"header: {header.text}")
#find all quote tags by their class name
tags = driver.find_elements(By.CLASS_NAME, "tags")
for tag in tags:
    print(f"Tag: {tag.text}")

The example above searches for elements by TAG_NAME and by CLASS_NAME. If we want to find elements with other location methods, change the arguments that we pass into find_element(). Here are examples for other locators:

By XPath: driver.find_element(By.XPATH, xpath_to_element)
By ID: driver.find_element(By.ID, id_of_the_element_to_find)

Clicking the Download Link or Button

Once we've identified our element with find_element(), it's time to click() on it to begin the download.

If you recall from the TLDR example, sometimes our link is not immediately clickable. ActionChains is often the best workaround for this.

Here are the actions performed in the TLDR example:

#create an ActionChains instance
action = ActionChains(driver)
#move to the button, and then click it
action.move_to_element(button)\
    .click()\
    .perform()

In the TLDR example, we:

Create an ActionChains object with action = ActionChains(driver)
Specify the actions we'd like to perform with action.move_to_element(button).click()
Call perform() to run the chain of actions we specified

We use move_to_element() to put the cursor over the button. Afterward, we use click() to click on the element. Then, perform() tells Selenium to perform the chain of actions that we just set up.

When we need to click our download link, we use the click method. The trickiest part is verifying the completion of the download.

Waiting For The Download To Complete

Waiting for a download to complete can be a crucial step in Selenium automation. The challenge lies in determining when the download is finished. As previously mentioned, Selenium has limited experimental support for downloading files. However, you can use a combination of approaches to handle this situation.

In the TLDR example, we checked Chrome's downloads manually by navigating to chrome://downloads with driver.get(). The page tells us when each download is complete.
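
Another option is to watch the download folder itself. The sketch below is one possible approach, not a Selenium feature: it assumes Chrome gives in-progress downloads a .crdownload suffix, and it compares the folder against a snapshot taken before the click. The names list_names and wait_for_download are hypothetical helpers.

```python
import time
from pathlib import Path

PARTIAL_SUFFIX = ".crdownload"  # assumed suffix for Chrome's in-progress downloads


def list_names(directory):
    """Return the names of the regular files currently in `directory`."""
    return {p.name for p in Path(directory).iterdir() if p.is_file()}


def wait_for_download(directory, known, timeout=30.0, poll=0.5):
    """Block until a new, finished file appears in `directory`.

    `known` is the set of names captured before the click. Returns the
    set of new finished names, or raises TimeoutError at the deadline.
    """
    deadline = time.monotonic() + timeout
    while True:
        new = list_names(directory) - set(known)
        partial = {n for n in new if n.endswith(PARTIAL_SUFFIX)}
        if new and not partial:
            return new
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no finished download after {timeout} s")
        time.sleep(poll)


# Usage with a live driver (not executed here):
#   known = list_names(download_dir)
#   link.click()
#   finished = wait_for_download(download_dir, known)
```

Taking the snapshot before clicking means files that were already in the folder are not mistaken for the new download.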

Handling File Download Dialog

Sometimes we have to handle a pop-up, a dialog, or some other special scenario in order to start our download. These scenarios are also often best handled with click().

When downloading a package from PyPI, we need to click the right button in order for the download link to come up.

from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep
#save the url
url = "https://pypi.org/project/selenium/"
#open Chrome
driver = webdriver.Chrome()
#go to the website
driver.get(url)
#find the button
button = driver.find_element(By.ID, "files-tab")
#click the button
button.click()
#xpath of the download link
xpath = "/html/body/main/div[3]/div/div/div[2]/div[4]/div[1]/div[2]/a[1]"
#find the link
link = driver.find_element(By.XPATH, xpath)
#click the link
link.click()
#sleep so we can watch the download
sleep(10)
driver.quit()

As you can see in the code above, we:

Open Chrome and navigate to the site with webdriver.Chrome() and driver.get()
Locate the "Download Files" button using its ID, files-tab
Once we've located and clicked this button, find and click() the download link using its XPath

Verifying File Downloaded

Verifying the download is the final and most important step to downloading files. If the download has been corrupted or is somehow otherwise incomplete, all of our effort up to this point has been wasted!

Earlier, we used Chrome's built-in downloads page to check our download. Now, let's check the download using Python's os module.

from selenium import webdriver
from selenium.webdriver.common.by import By
import os
from time import sleep
#path to the downloads folder
path = "home/nultinator/Downloads"
#save the url
url = "https://pypi.org/project/selenium/"
#open Chrome
driver = webdriver.Chrome()
#go to the website
driver.get(url)
#find the button
button = driver.find_element(By.ID, "files-tab")
#click the button
button.click()
#xpath of the download link
xpath = "/html/body/main/div[3]/div/div/div[2]/div[4]/div[1]/div[2]/a[1]"
#find the link
link = driver.find_element(By.XPATH, xpath)
#save the filename
filename = link.text
#click the link
link.click()
#sleep so we can watch the download
sleep(10)
#move to root directory
os.chdir("/")
#move to downloads directory
os.chdir(path)
#list the files
files = os.listdir()
#iterate through them
for file in files:
    #use the builtin string method, "count" to find the filename
    if file.count(filename) == 0:
        #if the count is 0, keep going
        continue
    else:
        #if the count is greater than 0, we found the file
        print("Found the file!")
        break
#close Chrome
driver.quit()

The code above is largely the same as the previous example with a few key differences:

We import the os module so Python can see our filesystem
We save our filename as a variable
We save the path to our Downloads folder as a variable
Once we've downloaded the file, we use os.chdir("/") to move into our root directory
Once inside our root folder, we use os.chdir(path) to move into our Downloads folder
os.listdir() returns a list of all the files in the folder, which we save as files
We iterate through this list, using the count() method to look for names that contain our filename variable
If count() returns zero occurrences of the name, we keep going
If count() returns anything other than zero, we've found the file and exit the loop
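
Finding the name on disk does not prove the file is intact. If the site publishes a SHA-256 digest for the file, comparing it against the downloaded copy is a stronger check. The sketch below uses only the standard library. The names find_download and sha256_of are hypothetical helpers, and the expected digest would have to come from the site itself.

```python
import hashlib
from pathlib import Path


def find_download(directory, name_fragment):
    """Return the first non-empty file whose name contains `name_fragment`, else None."""
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and name_fragment in path.name and path.stat().st_size > 0:
            return path
    return None


def sha256_of(path, chunk_size=65536):
    """Hash the file in chunks so large downloads are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage (not executed here; expected_digest is a value copied from the site):
#   path = find_download(Path.home() / "Downloads", filename)
#   ok = path is not None and sha256_of(path) == expected_digest
```

Using Path.home() in place of a hard-coded path also keeps the script portable between machines.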

Downloading Best Practices

When downloading files using Selenium, it's essential to follow best practices to ensure robust and reliable automation. Here are some best practices for handling file downloads with Selenium:

Dealing With Dynamic Content

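In the examples above, we used a manual sleep() to wait. This time, let's try using Selenium's built-in wait support. Note that these waits apply to page elements, such as a download link that only appears after a click. They do not wait for the download itself to finish, so the examples below still end with a short sleep().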

We can use an implicit wait by simply setting it when we start Chrome:

from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep
#save the url
url = "https://pypi.org/project/selenium/"
#open Chrome
driver = webdriver.Chrome()
#set implicit wait
driver.implicitly_wait(10)
#go to the website
driver.get(url)
#find the button
button = driver.find_element(By.ID, "files-tab")
#click the button
button.click()
#xpath of the download link
xpath = "/html/body/main/div[3]/div/div/div[2]/div[4]/div[1]/div[2]/a[1]"
#find the link
link = driver.find_element(By.XPATH, xpath)
#click the link
link.click()
#go to downloads page
driver.get("chrome://downloads")
#sleep so we can watch the download
sleep(5)
driver.quit()

Here is an example of the same code using an explicit wait:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from time import sleep
#save the url
url = "https://pypi.org/project/selenium/"
#open Chrome
driver = webdriver.Chrome()
#set the maximum time to wait
wait = WebDriverWait(driver, 10)
#go to the website
driver.get(url)
#find the button
button = driver.find_element(By.ID, "files-tab")
#click the button
button.click()
#xpath of the download link
xpath = "/html/body/main/div[3]/div/div/div[2]/div[4]/div[1]/div[2]/a[1]"
#find the link
link = wait.until(EC.presence_of_element_located(
    (By.XPATH, xpath)))
#click the link
link.click()
#go to the downloads page
driver.get("chrome://downloads")
#sleep so we can watch the download
sleep(5)
driver.quit()

Basic Error Handling

When running scripts for long hours in production, it's just about impossible to watch for failures and exceptions at the time they occur. We can use try, except, and finally statements to handle our errors, and we can use logging to write them to a file while the program keeps moving onward.

The code below tries to download Selenium without clicking the "Download Files" button first. Normally this would throw an exception and our scraper would crash.

from selenium import webdriver
from selenium.webdriver.common.by import By
import logging
logging.basicConfig(filename="errors.log")
#save the url
url = "https://pypi.org/project/selenium/"
#open Chrome
driver = webdriver.Chrome()
#go to the website
driver.get(url)
#xpath of the download link
xpath = "/html/body/main/div[3]/div/div/div[2]/div[4]/div[1]/div[2]/a[1]"
#find the link, this will throw an exception
try:
    link = driver.find_element(By.XPATH, xpath)
    #click the link
    link.click()
except Exception as e:
    logging.exception(e)

finally:
    driver.quit()

In the code above, we:

Navigate to the page
try to look for (and fail to find) the download link with find_element()
Log the exception inside the except block
finally, after all errors have been handled and logged, close the browser and exit the script
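
For longer scripts, it can help to give log entries a timestamp and to wrap each risky step so that one failure does not stop the whole run. The following is a minimal sketch using only the standard logging module. The names make_error_logger, run_step, and the "downloader" logger name are hypothetical choices for illustration.

```python
import logging


def make_error_logger(log_path):
    """Create a logger that appends timestamped errors to `log_path`."""
    logger = logging.getLogger("downloader")  # hypothetical logger name
    logger.setLevel(logging.ERROR)
    logger.handlers.clear()  # avoid duplicate handlers if called twice
    handler = logging.FileHandler(log_path)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def run_step(logger, description, step):
    """Run `step()`; on failure, log the traceback and return None instead of crashing."""
    try:
        return step()
    except Exception:
        logger.exception("step failed: %s", description)
        return None


# Usage with a live driver (not executed here):
#   logger = make_error_logger("errors.log")
#   run_step(logger, "click download link",
#            lambda: driver.find_element(By.XPATH, xpath).click())
```

Because run_step() returns None on failure, the caller can decide whether to skip the step, retry it, or stop.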

Dealing with Asynchronous Downloads

When dealing with many asynchronous downloads at once, we have two options. For a one-time task, we can simply set a manual wait time that is long enough for everything to finish.

If we absolutely need to wait for a large number of long downloads, we can use Selenium to repeatedly check the Downloads page until no downloads are still in progress.

Handling Different File Types

When handling PDF files and images, we need to use the experimental options available in Selenium. The code below downloads a PDF file:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
import os
from time import sleep
#get the current directory
current_folder = os.getcwd()
# Set up Chrome options to download files automatically
chrome_options = Options()
#add our experimental options
chrome_options.add_experimental_option("prefs", {
    "download.default_directory": current_folder,
    "download.prompt_for_download": False,
    "download.directory_upgrade": True,
    "plugins.always_open_pdf_externally": True,
    "plugins.plugins_list": [{"enabled": False, "name": "Chrome PDF Viewer"}]
})
#open Chrome with our options
driver = webdriver.Chrome(options=chrome_options)
#navigate to the site (this is a download link)
driver.get("https://nakamotoinstitute.org/static/docs/untraceable-electronic-mail.pdf")
#sleep for a moment so the download can complete
sleep(5)
#close the browser
driver.quit()

In the code above, pay attention to the following things:

Options() allows us to set custom options for Chrome
add_experimental_option() takes the name of an option ("prefs") and its value, here a dictionary of Chrome preferences
"plugins.always_open_pdf_externally": True and "plugins.plugins_list": [{"enabled": False, "name": "Chrome PDF Viewer"}] keep Chrome from opening PDFs in its built-in viewer, which needs to be turned off in order to download PDF files

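If several scripts need the same settings, the preferences can be built in one place. The sketch below is an illustration under a few assumptions: build_download_prefs is a hypothetical helper, the download folder is resolved to an absolute path on the assumption that this is what Chrome expects, and the plugins.plugins_list entry is left out on the assumption that plugins.always_open_pdf_externally is enough in current Chrome versions.

```python
from pathlib import Path


def build_download_prefs(directory):
    """Return a Chrome `prefs` dict that saves downloads into `directory`.

    The directory is resolved to an absolute path and created if missing.
    """
    target = Path(directory).expanduser().resolve()
    target.mkdir(parents=True, exist_ok=True)
    return {
        "download.default_directory": str(target),
        "download.prompt_for_download": False,
        "download.directory_upgrade": True,
        "plugins.always_open_pdf_externally": True,
    }


# Usage with Selenium (not executed here):
#   chrome_options = Options()
#   chrome_options.add_experimental_option("prefs", build_download_prefs("downloads"))
#   driver = webdriver.Chrome(options=chrome_options)
```

Pointing each run at its own folder also makes it easier to tell new downloads apart from old ones.
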
Advanced Techniques

Parallel File Downloads

In this section, we'll learn how to download multiple files at the same time. It is pretty similar to our previous example, but we take it to a whole new extreme. Let's attempt to download every PDF from the site.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
import os
from time import sleep
#get the current directory for downloads
current_folder = os.getcwd()
chrome_options = Options()
chrome_options.add_experimental_option("prefs", {
    "download.default_directory": current_folder,
    "download.prompt_for_download": False,
    "download.directory_upgrade": True,
    "plugins.always_open_pdf_externally": True,
    "plugins.plugins_list": [{"enabled": False, "name": "Chrome PDF Viewer"}]
})
driver = webdriver.Chrome(options=chrome_options)
#base url
url = "https://nakamotoinstitute.org/literature/"
#navigate to the site
driver.get(url)
#create an empty list
list_of_links = []
links = driver.find_elements(By.XPATH, "//a[contains(@href, '.pdf')]")
#push all our links into list_of_links for longer lifetime
for link in links:
    href = link.get_attribute("href")
    list_of_links.append(href)
#attempt to hit each url
for href in list_of_links:
    try:
        driver.get(href)
    except Exception:
        print("dead link")
#switch to the chrome downloads page
driver.get("chrome://downloads")
#sleep so we can view the page
sleep(10)
#close the browser
driver.quit()

In the example above, we:

Set our options and navigate to the site
Create an empty list, list_of_links
Use find_elements() to find all PDF links on the page by matching .pdf in their href attribute
Because page elements can go stale, we push all of these href items into the longer lived variable: list_of_links
We then iterate through list_of_links and try to hit each url in order to download the PDF
If the link is dead, we handle the exception using except and we print "dead link" to the terminal
We then navigate to the downloads page to watch the downloads complete

Security Considerations

When downloading any file, it is important to check whether it contains sensitive information. If it does, it is imperative that you handle and store the file properly.

When dealing with sensitive information, the best thing to do with any such file is to encrypt it and handle your encryption keys properly so they don't fall into the wrong hands. This way, even if the wrong person is able to gain access to the file, they won't be able to use it for anything or even view it for that matter.

Conclusion

You've reached the end of this article. You should now have a solid grasp of not only downloading files with Selenium, but also how to download many files at once.

If you'd like to learn more about Selenium in general, take a look at the Selenium Documentation.
