Python Scrapy VS Python Selenium Compared
Python, with its extensive range of libraries and frameworks, offers two powerful tools—Scrapy and Selenium—that have garnered significant attention for their distinct approaches to web scraping.
In this tutorial, we will use two different scraping tools, Selenium and Scrapy. Each has its own pros and cons, and when scraping it is important to choose the correct tool for the job. Throughout the article, we'll dive into:
- TLDR Python Selenium vs Python Scrapy
- What is Selenium
- What is Scrapy
- Detailed Comparison
- Case Study: Scraping Amazon with Selenium and Scrapy
- Additional Selenium and Scrapy Resources
TLDR Python Selenium vs Python Scrapy
Python Selenium and Python Scrapy are two great libraries for web scraping. However, they each have their own pros and cons, and are each ideally suited to particular types of web scraping.
- Python Selenium: A browser automation library that lets you scrape websites with a full browser. It renders the page and allows you to interact with the website and extract the data you need.
- Python Scrapy: A powerful web crawling framework for efficient data extraction. It streamlines the process of systematically crawling multiple pages and following links to extract structured data, making it an ideal choice for large-scale web scraping tasks.
Here are the situations when you should consider using each library:
Selenium | Scrapy |
---|---|
When you need to render dynamic pages with a browser. | When you are scraping at a large scale and need asynchronous operations. |
When you need to interact a lot with the website (click, scroll, etc.) to access the data you need. | When you need to systematically crawl multiple pages and follow links to extract data. |
When you need to make automated bots that work behind logins. | When you need to respect the rules set by robots.txt for ethical web scraping. |
When you need to take screenshots of pages. | When you need to customize and extend the scraping process with custom pipelines. |
When you need to scrape heavily protected websites. | - |
What is Selenium?
Python Selenium is a popular open-source automated testing framework primarily used for web applications. Additionally, it's commonly used for web scraping tasks that involve interacting with dynamic content rendered by JavaScript.
Selenium is more suited to emulate an end user. You are able to scroll, click on items, and interact with dynamic content.
Advantages of Using Selenium
Selenium is an excellent tool for doing things real users would do. With time.sleep() we can pause for a fixed interval, and with WebDriverWait() we can wait for certain elements to appear on the screen and react accordingly.
We can click, scroll, and navigate forward and backward from the current page. Selenium can even take screenshots of pages!
These features are useful in any scenario that calls for the following:
- JavaScript handling
- Browser simulation
- High level interaction
- Authentication (passwords and form filling)
- Screenshots/Image based extraction
Selenium handles JavaScript-rendered content by using WebDriverWait() to wait for elements to appear on screen. Form filling with send_keys() makes it easy to automate processes such as logging in, and click() can be used to submit forms.
Disadvantages of Using Selenium
All of those incredible features of Selenium are also direct causes of its downsides. Because Selenium has such a rich toolset, it eats up a lot of resources. You have to run a browser just to use Selenium!
When scraping a list of pages, Selenium is much slower than other frameworks because that sort of task is simply not what it was designed for.
While Selenium has a very rich set of features and documentation, it does have something of a unique learning curve when it comes to using things such as ActionChains effectively.
When using ActionChains in Selenium, the developer needs to think about every little thing a user would do and code that into the chain. Difficulties arise here because users are often unpredictable and act in ways we can't always define.
In short, when using Selenium for scraping, keep the following in mind:
- Can be resource intensive.
- Slower than other frameworks.
- Programmatically simulating a person can be difficult.
- You will be dependent on browser/driver updates.
- Still launches a full browser and driver, even in headless mode.
- Not the best choice for large scale scraping.
When Should You Use Selenium Over Scrapy?
Selenium is perfect for sites where you need to interact with the page, whether you need to wait for the page to load content or you'd like to click buttons and submit information to make new information appear on the page.
The following are all great reasons to use Selenium:
- Dynamic sites
- Authentication
- Real user simulation
- Screenshots/Visual data
- Pop-up handling
- Form filling (especially with JS generated forms)
Let's look at each of these use cases in more detail:
1. Dynamic sites
Selenium is an excellent choice for handling dynamic sites as it allows for the automation of interactions with JavaScript-based elements.
It can effectively navigate and interact with web pages that rely heavily on dynamic content generated through client-side scripts.
Its ability to wait for elements to appear and dynamically interact with them enables seamless automation of tasks on dynamic web pages.
2. Authentication
When scraping websites that require authentication, the scraping tool needs to be able to simulate the login process to access the data that is only available to authenticated users.
Selenium can handle authentication processes that require user login and session management. It can automate the login process by filling in credentials and submitting forms, making it a suitable tool for testing applications that involve authentication mechanisms.
3. Real user simulation
Emulating user behaviors can be necessary for scenarios where the data of interest is only accessible through specific user actions.
Selenium's ability to simulate real user interactions, including mouse clicks, keyboard inputs, and scrolling, makes it an ideal tool for mimicking human browsing behavior. This capability is particularly useful for testing user interfaces, workflows, and complex interactions on web applications.
4. Screenshots/Visual data
Capturing visual data might be necessary for tasks that involve extracting images or extracting data embedded within visual elements on the webpage.
Selenium can capture screenshots and visual data during the automated testing process, allowing for visual validation of web elements, layouts, and designs.
5. Pop-up handling
Pop-ups can appear at any point while browsing a site, and an effective web scraping tool should be able to handle them without interrupting the scraping process.
Selenium can handle various types of pop-ups, including alerts, prompts, and confirmations, that appear during the browsing experience. Its capability to switch between different windows and handle pop-up dialogs enables seamless automation of tasks that involve interacting with pop-up windows.
6. Form filling (especially with JS generated forms)
In web scraping, this capability is crucial when dealing with websites that use dynamic forms to submit data. The scraping tool needs to be able to fill out and submit these forms to access and extract the data behind them.
Selenium excels at filling out forms, including those generated dynamically using JavaScript. Its ability to interact with form elements, select options, and input data into fields, even when the form elements are dynamically generated, makes it a powerful tool for automating tasks that involve complex form submissions and interactions.
Installing Selenium
To use Selenium, you need the proper webdriver for your browser. The most commonly used one is chromedriver. Make sure that your version of chromedriver matches your version of Chrome (Selenium 4.6 and later also ship with Selenium Manager, which can download a matching driver automatically). To see which version of Chrome is installed, you can run the following command:
google-chrome --version
After installing your webdriver, use pip to install Selenium. You can use the command below:
pip install selenium
To create a new Selenium project, we can simply create a new Python file. We can call it selenium_quotes.py.
Here is a simple example of a Selenium based scraper:
from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep
#create an instance of chrome
driver = webdriver.Chrome()
#navigate to the page
driver.get("https://quotes.toscrape.com")
#create a list of all items with the "a" tag
links = driver.find_elements(By.TAG_NAME, "a")
#iterate through the list until we find the one we want
target = None
for link in links:
    if link.text == "Login":
        target = link
        break
#click the target link (if we found it) and go to the next page
if target is not None:
    target.click()
#sleep for a few seconds so we can look at the page
sleep(10)
driver.quit()
In the code above, we do the following:
- open an instance of Chrome
- navigate to a webpage with driver.get()
- find all links on the page with driver.find_elements(By.TAG_NAME, "a") and return them as a list
- iterate through the list elements until we find the one titled Login, then click on it with click()
What is Scrapy?
Python Scrapy is an open-source web crawling framework built for Python. It provides a set of tools for extracting data from websites in a structured and efficient way.
It is more scalable than Selenium and is built to crawl multiple pages with a single command, making it an ideal tool for tasks such as data mining, monitoring, and automated testing.
Coming from Selenium, you will probably notice several differences when starting a project:
- Scrapy has its own project scaffolding and command-line tool.
- You can scrape more information with less code.
Advantages of Scrapy
Scrapy is great for high performance scraping. It is not only scalable, but it can also scrape multiple pages at once.
Because of the way it is set up, Scrapy allows us to configure which information to scrape and how we would like to parse and store it. Scrapy also has very rich documentation.
In short, Scrapy has many advantages such as:
- high performance
- scalable/parallel execution
- structured approach
- features (request handling/data parsing/data storage)
- custom pipelines/middleware
- rich community and resources
All in all, you can scrape many websites quickly and then quickly format and save the results as well.
Disadvantages of Scrapy
There are definitely some cases in which Scrapy is not the best tool for the job. In particular, when dealing with JavaScript, Selenium is much better equipped.
Scrapy is not built to wait for items to appear on screen, nor does it simulate an actual browser. It can submit simple HTML forms as plain HTTP requests, but it is not built for interactive form filling and things like that. It does not simulate a user!
Scrapy is built for static pages. Another issue with Scrapy is the learning curve. While Selenium has its own challenges with things such as ActionChains, Scrapy requires a mild understanding of OOP (object oriented programming) due to the fact that we're defining custom classes and methods.
In short, Scrapy's disadvantages include:
- limited JavaScript handling (possible with headless browser integrations)
- authentication challenges (not made to simulate real users)
- object oriented learning curve
- resource intensive (more efficient than Selenium, but still requires a lot of resources and the Python runtime)
When Should You Use Scrapy Over Selenium?
Scrapy is a great choice when you need to scrape multiple pages, then extract and transform the data. This is ideal for things such as news sites and other websites with large amounts of static content.
Scrapy is a great choice for any of the following tasks:
- Large scale crawling
- Data mining and aggregation
- Scheduled data extraction
- Custom pipelines to scrape, clean, format, and store data
Let's look at each of these use cases in more detail:
1. Large scale crawling
This type of web scraping typically involves systematically accessing and extracting data from multiple web pages or websites on a significant scale.
Scrapy is well-suited for large scale crawling due to its asynchronous operations, efficient memory usage, and robust scheduling system.
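To give a feel for how concurrency and politeness are tuned in a large crawl, here is a small settings sketch. The setting names are real Scrapy settings, but the dictionary name POLITE_CRAWL_SETTINGS and the specific values are illustrative assumptions, not recommendations from this article.

```python
# Hypothetical per-spider settings for a large but polite crawl.
# The values below are illustrative assumptions only.
POLITE_CRAWL_SETTINGS = {
    # Respect the rules published in each site's robots.txt
    "ROBOTSTXT_OBEY": True,
    # Maximum number of requests Scrapy keeps in flight at once
    "CONCURRENT_REQUESTS": 16,
    # Seconds to wait between requests to the same site
    "DOWNLOAD_DELAY": 0.5,
    # Let Scrapy adapt the delay to how quickly the server responds
    "AUTOTHROTTLE_ENABLED": True,
}

# Example usage inside a spider class:
#   class QuotesSpider(scrapy.Spider):
#       name = "quotes"
#       custom_settings = POLITE_CRAWL_SETTINGS
```

The same keys can also be set project-wide in settings.py; using custom_settings keeps the tuning local to one spider.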
2. Data mining and aggregation
Scrapy is a great fit for data mining and aggregation due to its robust web crawling capabilities and its ability to extract structured data from websites.
Its built-in mechanisms for data extraction and its support for XPath and CSS selectors allow developers to easily extract relevant information from web pages, and it can handle this task more efficiently than Selenium.
3. Scheduled data extraction
Scrapy's ability to handle large-scale crawling tasks with a single command makes it an excellent choice for scheduled data extraction. Because a spider runs from the command line, it can be paired with a job scheduler such as cron to set up recurring scraping tasks at specific intervals, ensuring that data is regularly updated and remains relevant.
4. Custom pipelines to scrape, clean, format, and store data
Scrapy's customizable pipelines offer a versatile framework for scraping, cleaning, formatting, and storing data. Its extensible architecture allows developers to implement custom data processing logic, including data cleaning and transformation operations.
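A pipeline is an ordinary Python class with a process_item method, which Scrapy calls for every item a spider yields. The sketch below is a hypothetical cleaning step: the class name CleanQuotePipeline and the "text" and "author" fields are assumptions made for illustration, and items are assumed to behave like dictionaries.

```python
class CleanQuotePipeline:
    """Hypothetical item pipeline that tidies up scraped quote items."""

    def process_item(self, item, spider):
        # Strip surrounding whitespace and curly quotation marks from the text
        text = item.get("text") or ""
        item["text"] = text.strip().strip("\u201c\u201d")
        # Collapse repeated whitespace in the author name
        author = item.get("author") or ""
        item["author"] = " ".join(author.split())
        # Returning the item passes it on to the next pipeline stage
        return item
```

To activate a pipeline like this, it would be listed in the project's settings.py under ITEM_PIPELINES, for example {"quote_scraper.pipelines.CleanQuotePipeline": 300}, where the number controls the order in which pipelines run.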
Installing Scrapy
Installing Scrapy is quite a bit simpler. It is strongly recommended to use Scrapy inside of a virtual environment. Assuming you already have Python and pip available on your system, just run the following command:
pip install Scrapy
First, let's initialize a new Scrapy project. The command below automatically builds a new project named quote_scraper and sets us up with all our necessary files and folders. This feature is very convenient and will feel familiar to those of you who have ever used cargo with Rust or create-react-app with Node.js.
scrapy startproject quote_scraper
Once we've created the new project, we can look inside to see how it is structured. As you can see, our quote_scraper project has an inner folder, also named quote_scraper, and inside of that folder, we have one called spiders.
quote_scraper
├── quote_scraper
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg
Once you've located the spiders folder, open it up and add a new file. We can call this one quote_spider.py.
Add the following code to quote_spider.py:
from pathlib import Path
import scrapy
#create a class for the information we wish to scrape
class QuotesSpider(scrapy.Spider):
    #give it a name, quotes
    #our class needs a name so we can easily run the project later on
    name = "quotes"
    #add a start_requests method
    def start_requests(self):
        #create a list of URLs to scrape
        urls = [
            "https://quotes.toscrape.com/page/1/",
            "https://quotes.toscrape.com/page/2/",
            "https://quotes.toscrape.com/page/3/",
            "https://quotes.toscrape.com/page/4/",
            "https://quotes.toscrape.com/page/5/"
        ]
        #iterate through the URLs and request each one
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
    #how we parse a page
    def parse(self, response):
        #split the url at the "/" character to get the page number
        page = response.url.split("/")[-2]
        #create a filename ending in .html
        filename = f"quotes-{page}.html"
        #write the page info to the html file
        Path(filename).write_bytes(response.body)
        #log that we've saved the file
        self.log(f"saved file: {filename}")
Now let's take a look at this code and see what it does.
- First, we create a class for our spider; this one is called QuotesSpider.
- Next, we add our start_requests method. This method contains a list of URLs to request and parse.
- Our parse method takes in a response and writes the response body to a local html file.
Now we can run the code with the crawl command:
scrapy crawl quotes
The command above does the following:
- the word scrapy invokes Scrapy
- crawl tells it to run the crawler
- quotes tells Scrapy that we would like to run the spider named quotes, which we declared when making our QuotesSpider class.
Detailed Comparison of Selenium and Scrapy
Below is a comprehensive comparison between Selenium and Scrapy, outlining their key features and suitability for various web scraping and automation tasks.
Aspect | Scrapy | Selenium |
---|---|---|
Ease of Use | Simple setup | Requires chromedriver and a browser |
Speed | Fast, especially for large-scale web scraping | Slower due to browser automation |
Headless Browsing | Headless by default | Runs browser driver even in headless mode |
JavaScript | Limited support | Full Support |
Web Page Rendering | No rendering | Renders pages and fully supports dynamic content |
User Interaction | Not designed for user interaction | Ideal for simulating user interactions |
Automation Tasks | Perfect for crawling multiple pages and data mining | Not efficient for large scale tasks |
Browser-Based Tasks | No browser | Perfect for accomplishing tasks in the browser |
Dependencies | Scrapy | browser, browser driver, and Selenium |
Maintenance | Low maintenance | Requires frequent updates to browser and drivers |
Community Support | Active and well-documented | Well documented but much of the content in circulation is outdated |
Case Study: Scraping Amazon with Selenium and Scrapy
In the next section, we will create a couple of scrapers for Amazon products.
Amazon uses anti-scraping mechanisms. When first attempting this project, we tried to scrape without the proxy; however, we were immediately recognized and blocked.
We will be using the ScrapeOps proxy for both examples in order to bypass these mechanisms and efficiently retrieve the data.
Scraping Laptops with Selenium
Let's create a Selenium project to scrape Amazon products. Create a new file called selenium_product_scraper.py and add the following code:
from selenium import webdriver
from urllib.parse import urlencode
#create an instance of chrome
driver = webdriver.Chrome()
API_KEY = "YOUR-SCRAPEOPS-API-KEY"
#create a function to handle our urls
def get_scrapeops_url(url):
    payload = {
        "api_key": API_KEY,
        "url": url
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url
#Base url that we wish to scrape
BASE_URL = "https://www.amazon.com/s?k=laptop&page="
#navigate to each page and save its html
for i in range(1,6):
    driver.get(get_scrapeops_url(f"{BASE_URL}{i}"))
    #write as utf-8 so non-ASCII characters don't raise errors on some platforms
    with open(f"laptops-{i}.html", "w", encoding="utf-8") as file:
        file.write(driver.page_source)
driver.quit()
In the code above, we do the following:
- create a function to allow us to connect to websites through the ScrapeOps proxy
- open Chrome with webdriver.Chrome()
- navigate to each page through the proxy connection with driver.get()
- save each page as an html file
After running it, you should notice that it takes a while and that you spend most of your time waiting for Chrome to load the page. All in all, this process took 86 seconds on a Lenovo Ideapad 1i.
If you'd like to learn more about using proxies with Selenium, take a look at the ScrapeOps guide to Selenium Proxy.
Scraping Laptops with Scrapy
Now, we can create a new project to scrape Amazon products with Scrapy.
First, enter the following command to create the project:
scrapy startproject amazon_scraper
Next, add a new file inside the spiders folder, laptops.py, with the following code:
from pathlib import Path
from urllib.parse import urlencode
import scrapy
API_KEY = "YOUR-SCRAPEOPS-API-KEY"
#build a proxy url because Amazon blocks scrapers
def get_scrapeops_url(url):
    payload = {
        "api_key": API_KEY,
        "url": url
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url
#create a class for the information we wish to scrape
class LaptopSpider(scrapy.Spider):
    name = "laptops"
    #add a start_requests method
    def start_requests(self):
        #create a list of URLs to scrape
        BASE_URL = "https://www.amazon.com/s?k=laptop&page="
        urls = [
            f"{BASE_URL}1",
            f"{BASE_URL}2",
            f"{BASE_URL}3",
            f"{BASE_URL}4",
            f"{BASE_URL}5"
        ]
        #iterate through the URLs and request each one through the proxy
        for url in urls:
            print(url)
            yield scrapy.Request(url=get_scrapeops_url(url), callback=self.parse)
    #how we parse a page
    def parse(self, response):
        #the last character of the url is the page number (single-digit pages only)
        page = response.url[-1]
        #create a filename ending in .html
        filename = f"laptops-{page}.html"
        #write the page info to the html file
        Path(filename).write_bytes(response.body)
        #log that we've saved the file
        self.log(f"saved file: {filename}")
In the code above we do the following:
- create a helper that builds ScrapeOps proxy urls
- create a new url for each page using the get_scrapeops_url() function
- scrape each url and save the page as an html file
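One detail worth noting: reading the page number from the last character of the url only works while page numbers are a single digit. A more robust alternative, sketched below with the standard library, is to decode the page parameter from the proxied url. The helper name page_number_from_proxy_url is hypothetical, and the sketch assumes response.url is still the proxy url built by get_scrapeops_url() (that is, no redirect changed it).

```python
from urllib.parse import parse_qs, urlencode, urlsplit

API_KEY = "YOUR-SCRAPEOPS-API-KEY"


def get_scrapeops_url(url):
    """Same helper as in the spider: wrap a target url in the proxy url."""
    payload = {"api_key": API_KEY, "url": url}
    return "https://proxy.scrapeops.io/v1/?" + urlencode(payload)


def page_number_from_proxy_url(proxy_url):
    """Recover the Amazon page number from a proxied request url."""
    # Decode the proxy's own query string to get the original target url
    target_url = parse_qs(urlsplit(proxy_url).query)["url"][0]
    # Then read the "page" parameter from the target url's query string
    return int(parse_qs(urlsplit(target_url).query)["page"][0])


proxied = get_scrapeops_url("https://www.amazon.com/s?k=laptop&page=12")
print(proxied[-1])                          # last character only: "2"
print(page_number_from_proxy_url(proxied))  # full page number: 12
```

Inside the spider, the parse method could then call page_number_from_proxy_url(response.url) when crawling more than nine pages.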
You can run the scraper using the following command from inside your amazon_scraper folder:
scrapy crawl laptops
Even when using a proxy with these listings, the entire process takes about 10 seconds, almost nine times as fast as Selenium on the test machine (Lenovo Ideapad 1i).
If you'd like to learn more about adding proxies to your Scrapy project, take a look at Scrapy Proxy.
Which one is the winner?
All in all, Selenium is far less efficient and far more time consuming than Scrapy for this task. Scrapy is our clear winner when it comes to scraping larger loads of data.
- Scrapy handles all of these pages concurrently and in an asynchronous fashion, while Selenium works through them one page at a time.
- Both of the frameworks do a relatively quick job of writing the html code to a new file.
- Both frameworks also integrate with the ScrapeOps proxy seamlessly. All you need is our get_scrapeops_url() function, which you may view again below.
def get_scrapeops_url(url):
    payload = {
        "api_key": API_KEY,
        "url": url
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url
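To see why handling pages concurrently makes such a difference, the toy simulation below compares fetching five pages one after another with fetching them all at once. It is only an analogy: Scrapy is built on Twisted rather than asyncio, no real requests are made, and the 0.05 second delay is an arbitrary stand-in for network latency.

```python
import asyncio
import time


async def fake_fetch(page, delay=0.05):
    """Stand-in for a network request; the sleep simulates waiting on the server."""
    await asyncio.sleep(delay)
    return f"laptops-{page}.html"


async def fetch_sequentially(pages):
    # One page at a time, like a single browser loading page after page
    return [await fake_fetch(page) for page in pages]


async def fetch_concurrently(pages):
    # All requests in flight at once, similar in spirit to how Scrapy works
    return await asyncio.gather(*(fake_fetch(page) for page in pages))


def timed(coroutine):
    """Run a coroutine and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coroutine)
    return result, time.perf_counter() - start


sequential_files, sequential_time = timed(fetch_sequentially(range(1, 6)))
concurrent_files, concurrent_time = timed(fetch_concurrently(range(1, 6)))
print(f"sequential: {sequential_time:.2f}s, concurrent: {concurrent_time:.2f}s")
```

In the sequential case the waiting times add up, while in the concurrent case they overlap, so the total is close to the time of the slowest single request.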
Additional Scrapy and Selenium Resources
If you'd like to learn more about Scrapy or Selenium, please take a look at the ScrapeOps guides below.
More Web Scraping Tutorials
Now you have a basic understanding of Selenium and Scrapy. You should also understand appropriate use cases for each tool.
You can use Selenium to imitate a person and you can use Scrapy to crawl large volumes of information. You also have a basic understanding of how to set up a proxy connection with each tool.
Looking to advance your scraping skillset? Take a look at the following guides: