Python Scrapy vs Requests with Beautiful Soup Compared
Python, with its plethora of libraries, provides developers with powerful tools for web scraping, each catering to distinct needs and preferences. Among these, two stalwarts stand out—Scrapy and the dynamic duo of Requests with Beautiful Soup.
In this article, we're going to explore both options and review their strengths, weaknesses, and use cases.
- TLDR: Scrapy vs Requests with Beautiful Soup
- What is Scrapy?
- What is Requests with Beautiful Soup?
- Detailed Comparison
- Case Study
- Additional Resources
Need help scraping the web?
Then check out ScrapeOps, the complete toolkit for web scraping.
TLDR: Scrapy vs Requests with Beautiful Soup
- Scrapy: a full-featured web crawling framework and toolchain for scraping projects. It boasts lightning fast speed and full async support.
- Requests with BeautifulSoup: a combination of two Python libraries that fit together seamlessly. Requests gives full support for standard web requests (GET, POST, PUT, DELETE), and BeautifulSoup is a library for parsing HTML. In short, you get a page with Requests, then you parse it with BeautifulSoup.
Feature | Scrapy | Requests with BeautifulSoup |
---|---|---|
Use Case | Full-featured toolchain for large scale scraping | Combination of two libraries for web requests and HTML parsing |
Speed | Blazing fast with async support | Medium for small tasks, gets bogged down with larger workloads |
Ease of Use | Requires a learning curve | Very straightforward and easy to implement |
Scrapy, a comprehensive web crawling framework, boasts speed, concurrent page scraping, and efficient result parsing through its asynchronous architecture. However, its setup involves creating a project folder, and its reliance on object-oriented design may pose a learning curve for some.
On the other hand, Requests with Beautiful Soup offers simplicity and versatility, making it easy to get started with scraping. It provides a more straightforward approach to element selection compared to Scrapy but lacks some advanced features.
The choice between them depends on the project's scale, complexity, and the developer's familiarity with object-oriented programming.
What is Scrapy?
Scrapy is a scraping framework built to crawl long lists of pages asynchronously. In short, you feed Scrapy a list of pages, and it fetches them all. It then parses the results for us quickly and efficiently.
Advantages of Scrapy
Scrapy is blazing fast to use. It can easily handle scraping at scale, and it processes the results very efficiently. It also feels very similar to `cargo` or `create-react-app`: with Scrapy, you get an entire build system.
Scrapy's advantages include:
- Speed:
- Scrapy utilizes asynchronous programming to perform non-blocking operations, allowing it to send multiple requests simultaneously.
- The framework's built-in concurrency support lets it keep many requests in flight at once, making the most efficient use of available resources and significantly speeding up data retrieval (a settings sketch follows this list).
- Scraping multiple pages at once:
- Scrapy's architecture is designed to support the concurrent scraping of multiple pages through the use of spiders.
- This feature, combined with Scrapy's scalability, makes it an ideal choice for projects dealing with diverse sources or those requiring the extraction of extensive datasets distributed across various pages.
- Quickly parsing the results:
- Scrapy comes equipped with its own built-in parsing mechanism, which facilitates the extraction of data from HTML and XML documents.
- Scrapy's pipeline system further enhances result parsing by offering a modular approach to processing and storing scraped data.
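To make the concurrency and pipeline points concrete, here is a minimal sketch of the kind of tuning Scrapy exposes in a project's settings.py, together with a bare-bones item pipeline. The numbers and the `ExamplePipeline` name are illustrative assumptions, not recommendations.

```python
# settings.py (sketch) -- built-in knobs that control Scrapy's concurrency
CONCURRENT_REQUESTS = 32              # total requests Scrapy keeps in flight at once
CONCURRENT_REQUESTS_PER_DOMAIN = 16   # cap per target site
DOWNLOAD_DELAY = 0.25                 # polite delay (seconds) between requests to one site

# Register pipelines so every scraped item flows through them (lower number runs first)
ITEM_PIPELINES = {
    "yourprojectname.pipelines.ExamplePipeline": 300,   # hypothetical project/pipeline name
}

# pipelines.py (sketch) -- each yielded item is handed to process_item()
class ExamplePipeline:
    def process_item(self, item, spider):
        # clean, validate, or store the item here, then pass it along
        return item
```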
Getting started scraping any site with Scrapy is as simple as running `scrapy startproject yourprojectname` and then creating a spider class built around what you'd like to scrape.
Disadvantages of Scrapy
As amazing as it can be, Scrapy definitely has its own set of disadvantages as well. Its challenges are largely side effects of the way it is built.
Scrapy's disadvantages include:
- Unique setup:
- When creating a new project, as opposed to simply making a new Python script, you create an entire project folder with `scrapy startproject`.
- Complexity:
- When using Scrapy, we define our data as objects, and those unfamiliar with OOP (Object-Oriented Programming) may have a difficult time adjusting to it.
- Low-level element selection:
- When selecting page elements with Scrapy, the process is rather low level and primitive compared to other frameworks.
While it's a great framework, Scrapy can be somewhat difficult to manage if you're not familiar with toolsets for larger projects. Scrapy's object-oriented learning curve and lower-level element selection are other reasons you might not choose it for your project; the short comparison below shows the difference in selector style.
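To give a flavor of what "low-level" means here, the sketch below contrasts the two selector styles on the quotes site scraped later in this article. The BeautifulSoup lines run as-is; the Scrapy line is shown as a comment because it only runs inside a spider, where the `response` object is provided for you.

```python
import requests
from bs4 import BeautifulSoup

# Fetch the page once so both styles have something to select from
html = requests.get("https://quotes.toscrape.com/").content

# BeautifulSoup: higher-level, Pythonic element objects
soup = BeautifulSoup(html, "html.parser")
quotes = [div.text for div in soup.find_all("div", class_="quote")]

# Scrapy (inside a spider's parse() method): raw CSS selector strings returning text nodes
#   quotes = response.css(".quote > span::text").getall()
```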
When to Choose Scrapy over Requests with BeautifulSoup
When you choose Scrapy for your project, you get an extremely fast and efficient toolset to work with. Scrapy is best suited for large webcrawling projects and scraping at scale.
If you have a heavy workload with lots of information to extract, Scrapy is the perfect choice. Scrapy will accomplish your task fast and efficiently.
Installing Scrapy
To install Scrapy, simply run the following command:
pip install scrapy
After installing, we can create a new project with the `startproject` command.
scrapy startproject quotes_tutorial
Inside the `spiders` folder of the new project, create a new file. We can call it `quotes_spider.py`. Inside the file, place the following code:
```python
from pathlib import Path
import scrapy


class QuotesSpider(scrapy.Spider):
    # Give the class a name
    name = "quotes"

    # Define which requests to make
    def start_requests(self):
        # URLs to scrape
        urls = [
            "https://quotes.toscrape.com/",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # Define how to parse our information
    def parse(self, response):
        # Find all elements of the quote class and return their text as a list
        quotes = response.css(".quote > span::text").getall()
        self.log("-------quotes---------")
        for quote in quotes:
            # If the quote is not "by", log it to the terminal
            if quote.replace(" ", "") != "by":
                self.log(quote)
```
In the code above, we do the following:
- Create a class called `QuotesSpider`
- Define which URLs we'd like to scrape in `start_requests()`
- Define what to do with our information in `parse()`
- When parsing a page we:
- Find all elements of the class `quote` with `response.css(".quote > span::text")` and return them as a list
- For each quote in our list, if it is not the word "by", log it so it appears in the crawl output
Our project should now have a structure that looks like this:

```
├── quotes_tutorial
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── quotes_spider.py
└── scrapy.cfg

2 directories, 8 files
```
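With the spider saved, run it from inside the project folder with Scrapy's `crawl` command, passing the `name` we gave the class. The quotes will show up in the crawl's log output.

scrapy crawl quotes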
What is Requests with BeautifulSoup?
Requests is a Python library for making basic requests on the web, such as GET, POST, PUT, and DELETE. It is used all over the place. If you are using any sort of client-side software in Python, it is most likely using Requests somewhere under the hood.
BeautifulSoup is a library for parsing HTML pages. Pretty much everything displayed on a webpage is delivered as HTML: when you go to a page in your browser, the server sends an HTML file, and your browser reads that HTML and displays the page to you.
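As a minimal sketch of that fetch-then-parse flow (using the same quotes site the later examples scrape), the snippet below fetches a page with Requests and pokes at it with BeautifulSoup; the specific elements inspected are just illustrative.

```python
import requests
from bs4 import BeautifulSoup

# Requests fetches the raw HTML...
response = requests.get("https://quotes.toscrape.com/")

# ...and BeautifulSoup turns it into a searchable tree of elements
soup = BeautifulSoup(response.content, "html.parser")
print(soup.title.text)            # the page's <title> text
print(len(soup.find_all("a")))    # how many links the page contains
```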
Advantages of Requests with BeautifulSoup
Requests and BeautifulSoup make a very powerful all-around stack. With Requests, you get the minimalism and great flexibility needed for any web-related project. With BeautifulSoup, you get first-class support for parsing HTML.
Advantages of the Requests and BeautifulSoup combination include:
- Ease of use:
- Strong support for standard web requests (GET, POST, PUT, DELETE) makes it very easy to fetch pages.
- First-class selector support:
- BeautifulSoup gives you first-class selectors with a high-level syntax for parsing HTML.
- Minimalistic project design:
- With this combination, you can accomplish a lot without having to learn a bulkier framework such as Scrapy or Selenium.
When using Requests with BeautifulSoup, as long as you're familiar with the basics of development, you can be productive immediately and scrape pages relatively quickly.
Disadvantages of Requests with BeautifulSoup
Despite the minimalism and simplicity of this toolset, it does have its own set of disadvantages.
The disadvantages of this combo include:
- No built-in async support:
- The Requests module operates in a synchronous fashion, so pages are fetched one at a time (see the timing sketch after this list).
- Speed:
- Slower than other frameworks.
- Efficiency:
- Not the most efficient when handling larger workloads.
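As a rough illustration of that synchronous behavior, this sketch times five back-to-back fetches of the quotes site's paginated URLs. The page range is an assumption for illustration; each `requests.get()` call blocks until the previous one has returned.

```python
import time
import requests

# Five paginated URLs from the quotes site used elsewhere in this article
urls = [f"https://quotes.toscrape.com/page/{n}/" for n in range(1, 6)]

start = time.time()
for url in urls:
    requests.get(url)   # blocks: the next request only starts once this one returns
print(f"Fetched {len(urls)} pages sequentially in {time.time() - start:.1f}s")
```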
When using Requests with BeautifulSoup, the minimalism makes it very easy to code, but that same simplicity can become a bottleneck on larger workloads.
When to Choose Requests with BeautifulSoup over Scrapy
You should choose to use Requests with BeautifulSoup when you have a lighter workload or if you are just coming to scraping from another side of web development.
The minimal setup of this stack allows for quick prototypes that can scrape pages at a decent enough speed. Choose this stack when you need to build a prototype quickly or if you have a relatively small list of pages to scrape.
Installing Requests with Beautiful Soup
Installing Beautiful Soup and Requests is very easy. They can both be set up and installed via `pip`.
To install Requests run:
pip install requests
We do the same with BeautifulSoup:
pip install beautifulsoup4
After installing, let's create a new folder. You can name yours anything you'd like; this one will be called `req_soup`. Inside this new folder, create a file (once again, you can name it whatever you'd like); this one will be called `soup_tutorial.py`.
Add the following code to our file:
```python
import requests
from bs4 import BeautifulSoup

response = requests.get("https://quotes.toscrape.com/")
soup = BeautifulSoup(response.content, 'html.parser')

quotes = soup.find_all("div", class_="quote")

for quote in quotes:
    print(quote.text)
```
In the code above, we:
- Fetch our page with `requests.get()`
- Create an instance of BeautifulSoup's HTML parser with `BeautifulSoup()`
- Find all `div` elements with the `class` name of `quote` and return them as a list, `quotes`
- Iterate through our `quotes` list with a `for` loop and print the text of each quote with `print(quote.text)`
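Since this is a plain Python script, there's no project scaffolding to set up; run it directly and the quote text should print line by line (the exact output depends on the live page).

python soup_tutorial.py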
Detailed Comparison
Take a look at the table below to see how these tech stacks match up.
Feature | Scrapy | Requests with BeautifulSoup |
---|---|---|
Use Case | Large workloads and scraping at scale | Quick prototyping and simple scraping jobs |
Speed | Blazing fast with async support | Medium for small tasks, gets bogged down with larger workloads |
Ease of Use | Requires a learning curve | Very straightforward and easy to implement |
Scraping Multiple Pages | Can efficiently handle multiple pages concurrently | Fetches pages sequentially |
Parsing Results | Quick and efficient parsing, but with difficult syntax | First-class easy to use HTML parsing support |
Setup | More difficult | Simple script setup |
Learning Curve | Object-Oriented Programming (OOP) knowledge needed | Suitable for beginners or those from other web development backgrounds |
Element Selection | Low-level and primitive selection process | Simpler, high-level selection with BeautifulSoup |
Async Support | Full async support | Fetches pages sequentially |
Project Complexity | Project requires a complete toolchain and customizability | Simple and minimalistic project requirements |
Case Study: Scraping Ebay with Scrapy VS Requests with BeautifulSoup
Let's pit these two setups head to head and see how they stack up. In this case study, we'll be scraping GPU listings from eBay and writing the results to a file.
Scraping GPUs with Scrapy
Let's build a new Scrapy project and call it `ebay_scraper`.
scrapy startproject ebay_scraper
Inside the `spiders` folder, create a new file, `gpu_spider.py`. Add the following code to this file:
```python
from pathlib import Path
import scrapy
from urllib.parse import urlencode

API_KEY = "YOUR-SUPER-SECRET-API-KEY"

# function to convert urls to proxied urls
def get_scrapeops_url(url):
    payload = {
        "api_key": API_KEY,
        "url": url
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url


class GpuSpider(scrapy.Spider):
    # give the class a name
    name = "gpus"

    # which requests to make
    def start_requests(self):
        gpu_url = "https://www.ebay.com/b/Computer-Graphics-Cards/27386/bn_661796?_pgn="
        # urls to scrape
        urls = [
            f"{gpu_url}1",
            f"{gpu_url}2",
            f"{gpu_url}3",
            f"{gpu_url}4",
            f"{gpu_url}5",
            f"{gpu_url}6",
            f"{gpu_url}7",
            f"{gpu_url}8",
            f"{gpu_url}9",
            f"{gpu_url}10"
        ]
        for url in urls:
            yield scrapy.Request(url=get_scrapeops_url(url), callback=self.parse)

    # how to parse our information
    def parse(self, response):
        # check to see if the results file exists
        path = Path("gpu_results.txt")
        # if it doesn't exist, create it
        if path.is_file() == False:
            outfile = open("gpu_results.txt", "w")
            outfile.close()
        # open the results file in "append mode"
        outfile = open("gpu_results.txt", "a")
        # find all h3 elements, listings are displayed as h3
        listings = response.css("h3::text").getall()
        # append each listing to the results file
        for listing in listings:
            outfile.write(f"{listing}\n")
        # close the file
        outfile.close()
```
In the code above, we do the following:
- Create a function to convert regular urls into proxied urls, `get_scrapeops_url()`
- Create our `GpuSpider` class
- Add our urls to the `urls` list
- For each page in our `urls` list, find all text from `h3` header elements with `response.css("h3::text")`
- For each `h3` element, write it to our results file
You can run the scraper with:
scrapy crawl gpus
In our test, Scrapy was lightning fast. In total, Scrapy fetched, parsed, and wrote everything to the results file in approximately 11 seconds. This test was run using a Lenovo Ideapad 1i.
Scraping GPUs with Requests and BeautifulSoup
Inside our `req_soup` folder (or whatever you named yours), create a new file, `soup_ebay.py`. Add the following code to this file:
```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlencode
from time import sleep

API_KEY = "YOUR-SUPER-SECRET-API-KEY"

# function to convert urls to proxied urls
def get_scrapeops_url(url):
    payload = {
        "api_key": API_KEY,
        "url": url
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url

# base url for a gpu page
gpu_url = "https://www.ebay.com/b/Computer-Graphics-Cards/27386/bn_661796?_pgn="

# create a file to write our results
outfile = open("gpu_results.txt", "w")
# we created the file, now close it, we will append to it later
outfile.close()

# for pages 1 through 10, do the following
for page in range(1, 11):
    # open the file in append mode
    outfile = open("gpu_results.txt", "a")
    # get the webpage
    response = requests.get(get_scrapeops_url(f"{gpu_url}{page}"))
    # create an instance of the html parser
    soup = BeautifulSoup(response.content, 'html.parser')
    # write the page number to our results file
    outfile.write(f"Page {page}\n")
    # find all h3 elements (listings use an h3 header)
    listings = soup.find_all("h3")
    # iterate through each listing
    for listing in listings:
        # append the file with the listing text
        outfile.write(f"{listing.text}\n")
    # close the file
    outfile.close()
```
In the code above, we:
- Create a `get_scrapeops_url()` function to convert our urls to proxied urls
- Save the base url of the GPU results
- Create a file to save our results
- For pages one through ten, we:
- Get the url with our page number added to the base url
- Find all `h3` header elements with `soup.find_all()`
- Append our `outfile` with the text of each listing
Requests and BeautifulSoup did a pretty decent job at this. In our test run, the results were all scraped and written to `gpu_results.txt` in approximately 107 seconds. This test was run on a Lenovo Ideapad 1i.
Which One is the Winner?
Scrapy blew Requests with BeautifulSoup out of the water on this test. Scrapy was roughly 90% faster.
While it is much simpler to use Requests with BeautifulSoup, in terms of speed and workload, it's simply no match for Scrapy's speed, asynchronous execution and efficiency.
Conclusion
Now that you've finished this tutorial, you should have a decent understanding of how to get started with both Scrapy and Requests with BeautifulSoup.
- You now know that Scrapy can handle enormous workloads with blazing fast speed.
- You also know how simple it is to get started when using Requests with BeautifulSoup.
More Scraping Tutorials
If you'd like to learn more about scraping but don't know where to start, try one of these: