Selenium Guide: How To Use Selenium Stealth For Web Scraping
Selenium Stealth is a powerful tool designed to augment Selenium's capabilities for web scraping by adding an extra layer of anonymity.
In this comprehensive guide, we will explore the intricacies of using Selenium Stealth to enhance your web scraping endeavors. From the basics of installation to advanced techniques for maintaining stealth, this guide covers it all.
- TLDR: How to Use Selenium Stealth for Web Scraping
- Understanding Selenium Stealth
- Benefits of Using Selenium Stealth for Web Scraping
- Getting Started with Selenium Stealth
- Basic Usage
- Configuring Selenium WebDriver Options
- Customizing Selenium-Stealth Args
- Rotating User-Agents With Selenium-Stealth
- Using Proxies With Selenium-Stealth
- Selenium-Stealth Performance
- Alternatives to Selenium-Stealth
- More Selenium Web Scraping Guides
TLDR: How to Use Selenium Stealth for Web Scraping
Here's a brief overview and some sample code to jump into Selenium Stealth:
from selenium import webdriver
from selenium_stealth import stealth
# Set up Chrome Options
chrome_options = webdriver.ChromeOptions()
# Set up additional Chrome options for headless mode, window maximization, etc.
# Set up the Selenium WebDriver for a specific browser
driver = webdriver.Chrome(options=chrome_options)
# Use Selenium-Stealth to make this browser instance stealthy
stealth(
driver,
languages=["en-US", "en"], # Specify the languages supported by the browser
vendor="Google Inc.", # Set the vendor of the browser
platform="Win32", # Specify the platform on which the browser is running
webgl_vendor="Intel Inc.", # Spoof the WebGL rendering engine vendor
renderer="Intel Iris OpenGL Engine", # Spoof the WebGL rendering engine renderer
fix_hairline=True # Enable fixing a specific issue related to headless browsing
)
# Now use driver to navigate and interact with web pages
driver.get("https://www.example.com")
# ... your web automation tasks ...
driver.quit()
ChromeOptions lets you customize and configure settings for the Chrome WebDriver: setting preferences, enabling or disabling features, and controlling how the browser behaves during automation.
For example, chrome_options.add_argument("--headless") runs the script in headless mode, that is, without a visible browser window.
Let's dive into the ins and outs of Selenium Stealth.
Understanding Selenium Stealth
Selenium Stealth changes how a Selenium-driven browser presents itself to websites. It patches the browser properties that anti-bot systems commonly inspect, which makes automated sessions harder to tell apart from regular ones. Obstacles such as IP blocking and CAPTCHAs still exist, but a less detectable browser tends to trigger them less often.
Selenium Stealth does not guarantee that scraping goes unnoticed. It lowers the chance of detection, so automated tasks are less likely to be interrupted.
How Selenium Stealth Addresses Common Challenges
Selenium Stealth addresses several key challenges faced by traditional Selenium automation, enhancing its capabilities for web scraping.
Here's a breakdown of what Selenium Stealth changes about vanilla Selenium:
- Enhanced Anonymity:
- Challenge: Automated browsers are often easy to detect because they expose predictable fingerprints, which risks IP bans.
- Selenium Stealth Solution: Makes the browser's fingerprint resemble a regular user's browser, reducing the risk of detection. It does not change how your script clicks, scrolls, or paces its requests.
- Avoiding IP Blocks:
- Challenge: Websites employ IP blocking as a defense mechanism against bots, hindering scraping efforts.
- Selenium Stealth Solution: Does not rotate IP addresses itself. It lowers the chance that a session is flagged in the first place, and it is typically combined with proxies (covered later in this guide) when IP rotation is needed.
- CAPTCHA Handling:
- Challenge: CAPTCHAs serve as barriers, interrupting automated processes and requiring manual intervention.
- Selenium Stealth Solution: Does not solve CAPTCHAs. Because many CAPTCHAs are triggered by bot-detection checks, a stealthier browser may encounter fewer of them.
- Stealthy Browser Characteristics:
- Challenge: Automated browsers often exhibit detectable patterns that mark them as non-human.
- Selenium Stealth Solution: Modifies various browser properties, such as vendor, platform, WebGL rendering engine details, and more, to resemble a regular user's browser, making detection more challenging.
- Fixing Headless Browsing Issues:
- Challenge: Headless browsers may exhibit subtle signs that give away their automated nature.
- Selenium Stealth Solution: Introduces the fix_hairline option to address a specific issue related to headless browsing, making the session harder to identify as automated.
Benefits of Using Selenium Stealth for Web Scraping
For web scraping, the main advantage of Selenium Stealth is that it helps an automated browser avoid detection by anti-scraping mechanisms.
Web scraping encounters hurdles like IP blocking, CAPTCHAs, and anti-bot measures. Selenium Stealth helps with these in the following ways:
- Enhanced Anonymity:
- Selenium Stealth makes your automated browser look more like a regular user's browser, reducing the chances of being detected during web scraping.
- Avoiding IP Blocks:
- A session that is not flagged as a bot is less likely to have its IP address blocked. Selenium Stealth does not rotate IP addresses on its own, so pair it with proxies when you need rotation.
- CAPTCHA Handling:
- Selenium Stealth does not solve CAPTCHAs, but a browser that passes bot-detection checks is challenged with them less often, which keeps scraping runs from being interrupted as frequently.
Getting Started with Selenium Stealth
Here is a step-by-step guide to start enjoying the above benefits of Selenium Stealth.
Installation and Setup
To begin using Selenium Stealth, follow these straightforward steps:
- Install Selenium Stealth:
Use your preferred package manager, like pip, to install Selenium Stealth:
pip install selenium-stealth
- Configuration Options:
Explore and customize various options within Selenium Stealth for optimal stealth. Common configurations include languages, vendors, platforms, and more.
Basic Usage
Integrating Selenium Stealth into your scraping workflow is easy. Below are simple code examples illustrating its usage:
from selenium import webdriver
from selenium_stealth import stealth
# Set up the Selenium WebDriver for a specific browser
driver = webdriver.Chrome()
# Use Selenium Stealth to make this browser instance stealthy
stealth(driver,
languages=["en-US", "en"],
vendor="Google Inc.",
platform="Win32",
webgl_vendor="Intel Inc.",
renderer="Intel Iris OpenGL Engine",
fix_hairline=True)
# Now use the driver to navigate and interact with web pages
driver.get("https://www.example.com")
# ... your web automation tasks ...
# Don't forget to quit the driver when you're done
driver.quit()
This basic example illustrates how to set up and apply Selenium Stealth to your Selenium WebDriver instance. Familiarize yourself with these essential functions and methods for a seamless start to your web scraping endeavors.
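One way to check that the patch took effect is to read back the properties that stealth() was asked to set. The sketch below is not part of selenium-stealth: read_fingerprint and find_mismatches are hypothetical helper names, and the only Selenium call it relies on is driver.execute_script.

```python
# Hypothetical helpers (not part of selenium-stealth) for checking what a page sees.

FINGERPRINT_JS = """
return {
    webdriver: navigator.webdriver,
    vendor: navigator.vendor,
    platform: navigator.platform,
    languages: navigator.languages
};
"""


def read_fingerprint(driver):
    """Return the navigator properties a page would see, as a dict."""
    return driver.execute_script(FINGERPRINT_JS)


def find_mismatches(fingerprint, expected):
    """Return {key: (expected, actual)} for every value that differs."""
    return {
        key: (want, fingerprint.get(key))
        for key, want in expected.items()
        if fingerprint.get(key) != want
    }
```

A typical use would be to call find_mismatches(read_fingerprint(driver), {"vendor": "Google Inc.", "platform": "Win32"}) after driver.get(...) has loaded a page. An empty dict means the page sees the values you configured.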
Configuring Selenium WebDriver Options
When using Selenium-Stealth, it's still a good practice to configure the WebDriver options as you would typically do when using Selenium. Selenium-Stealth enhances the stealthiness of your browser automation, but it doesn't replace the need for setting up WebDriver options.
The options are critical for defining the behavior and characteristics of the browser session.
Here are some common WebDriver options you might set, and why they're still important even when using Selenium-Stealth:
- Headless Mode:
- Running the browser in headless mode (without a GUI) is common in automation and scraping. It reduces resource usage and can speed up tasks.
- Selenium-Stealth can work with both headless and non-headless modes, but the headless mode is a common giveaway for automated browsers.
- If you use headless mode, other stealth measures become even more important.
- User-Agent:
- While Selenium-Stealth can spoof the user-agent, explicitly setting a user-agent via WebDriver options can provide an additional layer of customization, ensuring that your browser session sends requests with the desired user-agent string.
- Window Size:
- Setting the size of the browser window can help in mimicking the behavior of a regular user. Some websites may check the window size as a metric to detect automation, especially if it's unusually small or large.
- Disabling WebDriver Attributes:
- Traditionally, scripts often modify certain JavaScript properties to hide that the browser is being controlled by WebDriver. However, Selenium-Stealth automatically handles many such detections.
- Yet, it's important to understand what Selenium-Stealth covers and whether additional measures are needed.
- Custom Preferences and Capabilities:
- Depending on your task, you might need to set custom preferences (like downloading behavior, handling of pop-ups, etc.) or capabilities specific to the browser being automated.
- Proxy Settings:
- If you're using proxies for your automation tasks, these need to be configured in the WebDriver options.
- Incognito/Private Mode:
- Running the browser in incognito or private mode can sometimes help reduce the chances of being detected.
You can use some of the options as follows:
from selenium import webdriver
from selenium_stealth import stealth
# Set up Chrome Options
chrome_options = webdriver.ChromeOptions()
# Run in headless mode for automated tasks without a visible browser window
chrome_options.add_argument("--headless")
# Maximize the Chrome window upon startup for an optimized viewport
chrome_options.add_argument("start-maximized")
# Disable Chrome extensions to ensure a clean automation environment
chrome_options.add_argument("--disable-extensions")
# Disable sandbox mode, which can be necessary in certain environments
chrome_options.add_argument('--no-sandbox')
# Disable the use of the /dev/shm shared memory space, addressing potential memory-related issues
chrome_options.add_argument('--disable-dev-shm-usage')
# Set a custom user agent to simulate different browsers or devices for enhanced stealth during automation
chrome_options.add_argument('user-agent=YOUR_CUSTOM_USER_AGENT')
# Set up the Selenium WebDriver for a specific browser
driver = webdriver.Chrome(options=chrome_options)
# Use Selenium-Stealth to make this browser instance stealthy
stealth(driver,
languages=["en-US", "en"],
vendor="Google Inc.",
platform="Win32",
webgl_vendor="Intel Inc.",
renderer="Intel Iris OpenGL Engine",
fix_hairline=True)
# Now use driver to navigate and interact with web pages
driver.get("https://www.example.com")
# ... your web automation tasks ...
driver.quit()
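If several scripts share the same options, it can help to build the list of switches in one place. The following is a minimal sketch under that assumption; build_chrome_args and make_options are hypothetical names, and the switches are the same ones used in the example above plus Chrome's --window-size switch.

```python
def build_chrome_args(headless=True, user_agent=None, window_size=(1920, 1080)):
    """Return a list of Chrome command-line switches for a scraping session."""
    args = ["--disable-extensions", "--disable-dev-shm-usage"]
    if headless:
        args.append("--headless")
    if window_size is not None:
        width, height = window_size
        args.append(f"--window-size={width},{height}")
    if user_agent:
        args.append(f"--user-agent={user_agent}")
    return args


def make_options(**kwargs):
    """Apply the switches to a ChromeOptions object (needs selenium installed)."""
    # Imported lazily so build_chrome_args can be used and tested without a browser.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in build_chrome_args(**kwargs):
        options.add_argument(arg)
    return options
```

With this in place, driver = webdriver.Chrome(options=make_options(headless=False)) gives a visible browser with the same remaining settings, which is convenient when comparing headed and headless behavior.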
Customizing Selenium-Stealth Args
Selenium-Stealth equips you with various arguments to disguise your Selenium-driven browser sessions, mirroring those of a typical user. Understanding these arguments is vital for using Selenium-Stealth effectively.
Below is a summary table of common arguments to pass to the stealth function:
Argument | Description |
---|---|
languages | Specifies the languages supported by the browser, mimicking a regular user's language preferences (e.g., ["en-US", "en"]). |
vendor | Sets the browser's vendor (e.g., "Google Inc."), simulating the navigator.vendor property. |
platform | Specifies the platform on which the browser is running (e.g., "Win32", "Linux", or "MacIntel"), helping spoof the navigator.platform property. |
webgl_vendor | Used to spoof WebGL rendering engine properties (e.g., "Intel Inc."), aiding in bypassing checks detecting automated browsers. |
renderer | Similar to webgl_vendor , sets the renderer property of the WebGL rendering engine (e.g., "Intel Iris OpenGL Engine"). |
fix_hairline | When True, attempts to fix a specific issue related to thin lines appearing in headless browsing, a potential indicator of automation. |
user_agent | Specifies the user agent to be used by the browser, including details like browser version, operating system, and device type. |
accept_languages | Sets the Accept-Language HTTP header used by the browser to specify preferred languages for web content, akin to user language preferences. |
plugins | Determines the installed plugins in the browser, configurable to mimic common plugins found in a regular user's browser. |
custom_resolution | Allows setting a custom screen resolution, enhancing the appearance of the automated browser session to resemble that of a standard device. |
do_not_track | Sets the "Do Not Track" setting of the browser, enabling or disabling it based on preferences. |
hardware_concurrency | Mimics the reported number of CPU cores by the browser, potentially used to identify automated browsers. |
navigator_permissions | Configures permissions-related properties of the navigator object, aligning them with a typical user's settings. |
navigator_plugins | Adjusts the plugins array in the navigator object, creating an appearance similar to a regular browser's plugins. |
media_codecs | Specifies the supported media codecs by the browser, contributing to a more authentic browser profile. |
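Each of these settings changes a property that websites can inspect to tell an automated browser from a regular one.
Not every row is necessarily a named parameter of stealth() in the version you have installed. At the time of writing, the library's README documents user_agent, languages, vendor, platform, webgl_vendor, renderer, fix_hairline, and run_on_insecure_origins; treat the remaining rows as fingerprint surfaces to be aware of, and check the function signature before passing them.
Configure the arguments so that they agree with one another and with a real browser profile (for example, a Windows user agent together with platform "Win32"), since inconsistent values are themselves a detection signal.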
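Because a function that accepts **kwargs will silently ignore names it does not know, it can be worth checking which keyword arguments a function actually declares before relying on them. The sketch below uses only the standard inspect module; unnamed_kwargs is a hypothetical helper, and fake_stealth is a stand-in written for illustration, not the real library function.

```python
import inspect


def unnamed_kwargs(func, kwargs):
    """Return the keys of kwargs that func does not declare as named parameters."""
    params = inspect.signature(func).parameters
    named = {
        name
        for name, p in params.items()
        if p.kind in (p.POSITIONAL_OR_KEYWORD, p.KEYWORD_ONLY)
    }
    return sorted(set(kwargs) - named)


# Stand-in with a stealth-like shape, used only for illustration.
def fake_stealth(driver, user_agent=None, languages=None, platform=None, **kwargs):
    pass


print(unnamed_kwargs(fake_stealth, {"platform": "Win32", "media_codecs": True}))
# → ['media_codecs']
```

To run the same check against your installed version, pass the imported stealth function instead of fake_stealth.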
Rotating User-Agents With Selenium-Stealth
A "User-Agent" is a string of information sent by a web browser or client software to identify itself to a web server. It includes details about the browser type, version, operating system, and sometimes device characteristics.
Why User-Agent Management Matters
Effectively managing user-agents remains critical even with the implementation of Selenium Stealth to prevent detection during web scraping. As websites frequently inspect user-agent strings, the practice of regularly changing them becomes crucial.
This rotation helps simulate diverse browsing behaviors, significantly reducing the risk of being flagged as a bot.
Setting User-Agents with Selenium Stealth
To set user-agents with Selenium Stealth, follow these steps:
- Define Chrome Options:
from selenium import webdriver
from selenium_stealth import stealth
chrome_options = webdriver.ChromeOptions()
- Set User-Agent:
chrome_options.add_argument('user-agent=YOUR_CUSTOM_USER_AGENT')
- Apply Stealth:
driver = webdriver.Chrome(options=chrome_options)
stealth(driver,
languages=["en-US", "en"],
vendor="Google Inc.",
platform="Win32",
fix_hairline=True
)
This example demonstrates setting a specific user-agent for the Chrome browser to enhance stealth during automation.
Selecting Random User-Agent
To use a random user-agent with Selenium Stealth, consider the following code:
from selenium import webdriver
from selenium_stealth import stealth
from selenium.webdriver.support.ui import WebDriverWait
import random
# List of user-agents
user_agents = [
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
]
# Select a random user-agent
random_user_agent = random.choice(user_agents)
# Set up Chrome Options
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument(f'user-agent={random_user_agent}')
# Set up the Selenium WebDriver with Selenium-Stealth
driver = webdriver.Chrome(options=chrome_options)
stealth(
driver,
languages=["en-US", "en"],
vendor="Google Inc.",
platform="Win32",
fix_hairline=True
)
# Navigate to the specified URL
driver.get('https://www.deezer.com/en/channels/explore/')
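# Wait (up to 10 seconds) until the document reports that it has finished loading
WebDriverWait(driver, 10).until(
    lambda d: d.execute_script("return document.readyState") == "complete"
)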
# Take screenshot of the page
driver.save_screenshot("screenshot.png")
# Don't forget to quit the driver when you're done
driver.quit()
This code snippet selects a random user-agent from the list and sets it in the Chrome Options, providing variability in user-agent strings for increased stealth during web scraping.
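When the user-agent is rotated, the platform passed to stealth() should rotate with it; a Linux user-agent paired with platform="Win32" is an inconsistency a site could notice. The sketch below is one way to keep the two aligned. platform_for and pick_profile are hypothetical helpers, and the substring mapping is a simplifying assumption that covers desktop user-agents only.

```python
import random

USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
]


def platform_for(user_agent):
    """Guess a navigator.platform value consistent with a desktop user-agent."""
    if "Windows NT" in user_agent:
        return "Win32"
    if "Macintosh" in user_agent:
        return "MacIntel"
    if "Linux" in user_agent:
        return "Linux x86_64"
    raise ValueError(f"no platform mapping for user-agent: {user_agent!r}")


def pick_profile(rng=random):
    """Pick a random user-agent and the platform value that matches it."""
    user_agent = rng.choice(USER_AGENTS)
    return {"user_agent": user_agent, "platform": platform_for(user_agent)}
```

The returned dict can then feed both places: the user-agent goes into chrome_options.add_argument(f"user-agent={profile['user_agent']}") and the platform into stealth(driver, platform=profile["platform"], ...).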
Using Proxies With Selenium-Stealth
Proxies are intermediary servers that sit between a user and the internet. They act as middlemen, receiving requests from users and forwarding them to the destination servers.
Why Proxies Are Essential with Selenium Stealth
Even with Selenium Stealth in play, incorporating proxies is crucial for several reasons. Proxies provide an additional layer of anonymity by routing your web requests through different IP addresses, making it challenging for websites to trace and identify your scraping activities.
This helps mitigate the risk of IP blocking and enhances the overall stealthiness of your web scraping endeavors.
Using Proxies with Selenium Stealth - Code Example
To integrate proxies with Selenium Stealth, follow these steps using Python and the Selenium library:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
from selenium_stealth import stealth
import time
# Set up Chrome Options
chrome_options = webdriver.ChromeOptions()
# Configure Proxy Settings
proxy = "187.95.229.112:8080"
chrome_options.add_argument(f"--proxy-server={proxy}")
# Set up the Selenium WebDriver with Selenium-Stealth
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)
stealth(
driver,
languages=["en-US", "en"],
vendor="Google Inc.",
platform="Win32",
fix_hairline=True
)
# Navigate to the specified URL
driver.get('https://www.deezer.com/en/channels/theholidays')
# Maximize the browser window for better visibility
driver.maximize_window()
# Wait for the privacy 'Accept' pop-up button to be visible
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.ID, "gdpr-btn-accept-all")))
# Find and click the 'Accept' button on the privacy pop-up
agree_button = driver.find_element(By.XPATH, "//button[text()='Accept']")
agree_button.click()
time.sleep(10)
xpath = '//*[@id="page_content"]/div/section[4]/div/div[2]/div/div/ul'
songs_ul = driver.find_element(By.XPATH, xpath)
songs = songs_ul.find_elements(By.TAG_NAME, 'li')
print(len(songs))
# Don't forget to quit the driver when done
driver.quit()
We set up Chrome options and configure the proxy settings. We then initialize the Selenium WebDriver using ChromeDriverManager from the webdriver-manager package.
In the script, we also apply Selenium-Stealth to make the browser less detectable. Then we navigate to the specified URL, dismiss the privacy pop-up, locate the li elements inside the song list using XPath, print the number of songs found, and finally quit the WebDriver.
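A single hard-coded proxy will eventually be blocked, so in practice the proxy is usually drawn from a pool. The sketch below shows one simple round-robin approach using only the standard library. The helper names are hypothetical, and the addresses are placeholders from the reserved documentation range 203.0.113.0/24, not working proxies.

```python
from itertools import cycle

# Placeholder addresses from the documentation range; replace with real proxies.
PROXIES = ["203.0.113.10:8080", "203.0.113.11:3128"]


def proxy_argument(proxy):
    """Validate a 'host:port' string and return the matching Chrome switch."""
    host, sep, port = proxy.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {proxy!r}")
    return f"--proxy-server={proxy}"


def rotating_proxy_arguments(proxies):
    """Yield --proxy-server switches forever, cycling through the pool in order."""
    for proxy in cycle(proxies):
        yield proxy_argument(proxy)
```

Chrome reads --proxy-server when it starts, so each rotation means creating a new driver: take next(...) from the generator, pass it to chrome_options.add_argument, and start a fresh webdriver.Chrome.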
Selenium-Stealth Performance
Testing the performance of Selenium-Stealth involves measuring several factors, including execution time, memory usage, and the number of requests that go undetected.
For instance, this GitHub performance test examines the number of public bot tests passed with and without Selenium Stealth.
Without Selenium Stealth, requests are detected as originating from a bot (Selenium) in both headed and headless browser environments. With Selenium Stealth, the detection rate is lower.
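For the execution-time and memory side of such a comparison, a small harness is enough to get rough numbers. The sketch below is one possible approach using only the standard library; measure is a hypothetical helper, and note that tracemalloc only sees allocations made by the Python process, not the memory used by Chrome itself.

```python
import statistics
import time
import tracemalloc


def measure(task, repeats=3):
    """Run task() several times; report median seconds and peak Python memory."""
    durations = []
    tracemalloc.start()
    try:
        for _ in range(repeats):
            started = time.perf_counter()
            task()
            durations.append(time.perf_counter() - started)
        # Peak bytes allocated by Python code since tracemalloc.start()
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return {
        "median_seconds": statistics.median(durations),
        "peak_bytes": peak_bytes,
    }
```

To compare the two setups, wrap each scraping run in a function (one that calls stealth() and one that does not) and pass each to measure with the same number of repeats.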
Alternatives to Selenium-Stealth
While Selenium-Stealth is a powerful tool for web scraping, it's worth exploring alternatives like Selenium Undetected Chrome Driver and ScrapeOps Proxy Aggregator.
Selenium Undetected Chrome Driver
Selenium Undetected Chrome Driver is a modified version of the Chrome WebDriver designed to operate stealthily, minimizing the chances of being detected during automated tasks.
Unlike Selenium-Stealth, Selenium Undetected Chrome Driver is a standalone solution that aims to make the underlying Chrome WebDriver undetectable. It provides an alternative approach to achieving stealthiness in web scraping.
Here is a code example on how to use it:
# install first `pip install undetected-chromedriver` before importing
from undetected_chromedriver import ChromeOptions, Chrome
options = ChromeOptions()
# Customize options as needed
options.add_argument("--headless")
driver = Chrome(options=options)
# Now use the driver for web automation tasks
driver.get("https://www.example.com")
# ... your web automation tasks ...
# Don't forget to quit the driver when done
driver.quit()
ScrapeOps Proxy Aggregator
ScrapeOps Proxy Aggregator is a service that provides a reliable pool of rotating proxies for web scraping. It aggregates proxies from various sources, ensuring a diverse and efficient proxy pool.
While Selenium-Stealth focuses on browser characteristics and behaviors, ScrapeOps Proxy Aggregator emphasizes the use of rotating proxies to enhance anonymity. It complements Selenium-Stealth by providing an extensive proxy solution.
Here is a code example on how to use ScrapeOps Proxy Aggregator:
from seleniumwire import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
SCRAPEOPS_API_KEY = 'APIKEY'
## Define ScrapeOps Proxy Port Endpoint
proxy_options = {
    'proxy': {
        'http': f'http://scrapeops:{SCRAPEOPS_API_KEY}@proxy.scrapeops.io:5353',
        'https': f'http://scrapeops:{SCRAPEOPS_API_KEY}@proxy.scrapeops.io:5353',
        # Comma-separated list of hosts that should bypass the proxy
        'no_proxy': 'localhost,127.0.0.1'
    }
}
## Set Up Selenium Chrome driver (Selenium 4 takes the driver path through a Service object)
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), seleniumwire_options=proxy_options)
# Now use the driver for web automation tasks
driver.get("https://www.example.com")
# ... your web automation tasks ...
# Don't forget to quit the driver when done
driver.quit()
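To keep the API key out of source control, the proxy options can be built from an environment variable. This is a minimal sketch: scrapeops_proxy_options is a hypothetical helper and the variable name SCRAPEOPS_API_KEY is an assumption, while the endpoint is the one used in the example above.

```python
import os


def scrapeops_proxy_options(api_key=None):
    """Build the seleniumwire_options dict for the ScrapeOps proxy port."""
    # Fall back to an environment variable (name assumed for this sketch).
    api_key = api_key or os.environ.get("SCRAPEOPS_API_KEY")
    if not api_key:
        raise RuntimeError("set SCRAPEOPS_API_KEY or pass api_key")
    endpoint = f"http://scrapeops:{api_key}@proxy.scrapeops.io:5353"
    return {
        "proxy": {
            "http": endpoint,
            "https": endpoint,
            "no_proxy": "localhost,127.0.0.1",
        }
    }
```

The result is passed as seleniumwire_options=scrapeops_proxy_options() when creating the driver.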
Conclusion
This guide explored Selenium Stealth for web scraping, from installation and basic usage to customizing its arguments. Selenium Stealth makes an automated browser harder to fingerprint, which in turn reduces how often you run into blocks and CAPTCHAs.
The guide also discussed configuring Selenium WebDriver options, rotating user-agents, and using proxies. It mentioned performance testing and alternatives like Selenium Undetected Chrome Driver and ScrapeOps Proxy Aggregator. Overall, the guide equips you to scrape the web efficiently and anonymously.
Explore additional resources and guides related to web scraping with Selenium, including:
- Selenium Documentation: The official documentation for Selenium WebDriver.
- Selenium GitHub Repository: Access the Selenium project's GitHub repository for updates and contributions.
More Selenium Web Scraping Guides
Dive into the world of web scraping with Selenium, discover new strategies, and stay updated on the latest trends using our guides below: