
Python Scrapy vs Requests with Beautiful Soup Compared


Python, with its plethora of libraries, provides developers with powerful tools for web scraping, each catering to distinct needs and preferences. Among these, two stalwarts stand out—Scrapy and the dynamic duo of Requests with Beautiful Soup.

In this article, we're going to explore two popular Python libraries—Scrapy and Requests with Beautiful Soup—and review their strengths, weaknesses, and use cases.

TLDR: Scrapy vs Requests with Beautiful Soup

  • Scrapy: a full-featured scraping framework with its own build toolchain for scraping projects. It boasts lightning-fast speed and full async support.
  • Requests with BeautifulSoup: a combination of two Python libraries that fit together seamlessly. Requests gives full support for standard web requests (GET, POST, PUT, DELETE), and BeautifulSoup is a library for parsing HTML. In short, you fetch a page with Requests, then you parse it with BeautifulSoup.
Feature     | Scrapy                                            | Requests with BeautifulSoup
Use Case    | Full-featured toolchain for large-scale scraping  | Combination of two libraries for web requests and HTML parsing
Speed       | Blazing fast with async support                   | Medium for small tasks; gets bogged down with larger workloads
Ease of Use | Requires a learning curve                         | Very straightforward and easy to implement

Scrapy, a comprehensive web crawling framework, boasts speed, concurrent page scraping, and efficient result parsing through its asynchronous architecture. However, its setup involves creating a project folder, and its reliance on object-oriented data may pose a learning curve for some.

On the other hand, Requests with Beautiful Soup offers simplicity and versatility, making it easy to get started with scraping. It provides a more straightforward approach to element selection compared to Scrapy but lacks some advanced features.

The choice between them depends on the project's scale, complexity, and the developer's familiarity with object-oriented programming.


What is Scrapy?

Scrapy is a scraping framework built to crawl long lists of pages asynchronously. In short, you feed Scrapy a list of pages, and it fetches them all. It then parses the results quickly and efficiently.

Advantages of Scrapy

Scrapy is blazing fast. It can easily handle scraping at scale, and it processes results very efficiently. It also feels similar to tools like cargo or create-react-app: with Scrapy, you get an entire build system.

Scrapy's advantages include:

  • Speed:
    • Scrapy utilizes asynchronous programming to perform non-blocking operations, allowing it to send multiple requests simultaneously.
    • The framework's built-in support for concurrency further enhances its ability to handle numerous requests concurrently, allowing developers to make the most efficient use of available resources and significantly speeding up the data retrieval process.
  • Scraping multiple pages at once:
    • Scrapy's architecture is designed to support the concurrent scraping of multiple pages through the use of spiders.
    • This feature, combined with Scrapy's scalability, makes it an ideal choice for projects dealing with diverse sources or those requiring the extraction of extensive datasets distributed across various pages.
  • Quickly parsing the results:
    • Scrapy comes equipped with its own built-in parsing mechanism, which facilitates the extraction of data from HTML and XML documents.
    • Scrapy's pipeline system further enhances result parsing by offering a modular approach to processing and storing scraped data.

Getting started scraping any site with Scrapy is as simple as running scrapy startproject yourprojectname and then creating a class built around what you'd like to scrape.
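Once the project exists, most of Scrapy's speed knobs live in the generated settings.py. Here is a minimal sketch with illustrative values only (not recommendations); CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN, and AUTOTHROTTLE_ENABLED are standard Scrapy settings you may want to tune.

# yourprojectname/settings.py - illustrative values, adjust for your target site

# How many requests Scrapy may have in flight at the same time
CONCURRENT_REQUESTS = 16

# Cap concurrency per domain so a single site isn't hammered
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# Let Scrapy adapt its request rate to how quickly the server responds
AUTOTHROTTLE_ENABLED = True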

Disadvantages of Scrapy

As amazing as it can be, Scrapy definitely has its own set of disadvantages. Most of them are side effects of the way it is built.

Scrapy's disadvantages include:

  • Unique setup:
    • When creating a new project, as opposed to simply making a new Python script, you create an entire project folder with scrapy startproject.
  • Complexity:
    • When using Scrapy, we define our data as objects, and those unfamiliar with OOP (Object-Oriented Programming) may have a difficult time adjusting to it.
  • Low-level element selection:
    • When selecting page elements with Scrapy, the process is rather low-level and primitive compared to other frameworks.

While it's a great framework to use, Scrapy can be somewhat difficult to manage if you're not familiar with toolsets for larger projects.

Scrapy's object-oriented learning curve and low-level element selection are also reasons you might not choose Scrapy for your project.
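To give a feel for that lower-level selection style, here is roughly what pulling the quote text out of a response looks like inside a spider's parse() method. This is just a sketch using the same quotes site covered later in this article; response.css() and response.xpath() are Scrapy's built-in selector APIs.

# Inside a spider's parse() method, response is the fetched page.
# CSS expression: the text of <span> elements nested directly under .quote elements
quotes_css = response.css(".quote > span::text").getall()

# The equivalent XPath expression is even more verbose
quotes_xpath = response.xpath('//div[@class="quote"]/span/text()').getall()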

When to Choose Scrapy over Requests with BeautifulSoup

When you choose Scrapy for your project, you get an extremely fast and efficient toolset to work with. Scrapy is best suited for large webcrawling projects and scraping at scale.

If you have a heavy workload with lots of information to extract, Scrapy is the perfect choice. Scrapy will accomplish your task fast and efficiently.

Installing Scrapy

To install Scrapy, simply run the following command:

pip install scrapy

After installing, we can create a new project with the startproject command.

scrapy startproject quotes_tutorial

Inside the spiders folder of the new project, create a new file. We can call it quotes_spider.py. Inside the file, place the following code:

from pathlib import Path
import scrapy


class QuotesSpider(scrapy.Spider):
    # Give the class a name
    name = "quotes"

    # Define which requests to make
    def start_requests(self):
        # URLs to scrape
        urls = [
            "https://quotes.toscrape.com/",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # Define how to parse our information
    def parse(self, response):
        # Find all elements of the quote class and return their text as a list
        quotes = response.css(".quote > span::text").getall()
        self.log("-------quotes---------")
        for quote in quotes:
            # If the quote is not "by", log it to the terminal
            if quote.replace(" ", "") != "by":
                self.log(quote)

In the code above, we do the following:

  • Create a class called QuotesSpider
  • Define which urls we'd like to scrape in start_requests()
  • Define what to do with our information in parse()
  • When parsing a page we:
    • Find the text of the span elements inside each element with the class quote using response.css(".quote > span::text") and return it as a list
    • For each quote in our list, if it is not the word "by", we log it to the terminal during the crawl

Our project should now have a structure that looks like this:

├── quotes_tutorial
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── quotes_spider.py
└── scrapy.cfg

2 directories, 8 files
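With the project structure in place, the spider can be run from the project's root folder using the name we gave it ("quotes"):

scrapy crawl quotes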

What is Requests with BeautifulSoup?

Requests is a Python library for making basic requests on the web, such as GET, POST, PUT, and DELETE. It is used all over the place. If you are using any sort of client-side software in Python, it is most likely using Requests somewhere under the hood.

BeautifulSoup is a library for parsing HTML pages. Pretty much everything you see on a webpage arrives as HTML: when you go to a page in your browser, the server sends an HTML document, and your browser reads that HTML and renders the page for you.
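To make that concrete, here is a tiny sketch of what "parsing" looks like: BeautifulSoup takes raw HTML (a hard-coded string here, purely for illustration) and turns it into a searchable tree.

from bs4 import BeautifulSoup

# A hard-coded HTML snippet, just for illustration
html = "<html><body><p class='greeting'>Hello, world!</p></body></html>"

soup = BeautifulSoup(html, "html.parser")

# Look up the paragraph by its class and print its text
print(soup.find("p", class_="greeting").text)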

Advantages of Requests with BeautifulSoup

Requests and BeautifulSoup make a very powerful all-around stack. With Requests, you get the minimalism and great flexibility needed for any web-related project. With BeautifulSoup, you get first-class support for HTML.

Advantages of the Requests and BeautifulSoup combination include:

  • Ease of use:
    • Strong support for standard web requests (GET, POST, PUT, DELETE) makes it very easy to fetch pages.
  • First-class selector support:
    • First-class selector support with high-level syntax when parsing HTML with BeautifulSoup.
  • Minimalistic project design:
    • With this combination, you can accomplish a lot without having to learn a bulkier framework such as Scrapy or Selenium.

When using Requests with BeautifulSoup, as long as you're familiar with the basics of development, you can be productive immediately and scrape pages relatively quickly.
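As a quick illustration of that productivity, BeautifulSoup also accepts CSS selectors directly through its select() method, so element selection stays high level. A short sketch using the same quotes site scraped later in this article and the same ".quote > span" selector the Scrapy example uses:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://quotes.toscrape.com/")
soup = BeautifulSoup(response.content, "html.parser")

# select() takes a CSS selector, much like querySelectorAll in a browser,
# and prints the text of every span nested under a quote element
for span in soup.select(".quote > span"):
    print(span.text)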

Disadvantages of Requests with BeautifulSoup

Despite the minimalism and simplicity of this toolset, it does have its own set of disadvantages.

The disadvantages of this combo include:

  • No built-in async support:
    • The Requests module operates in a synchronous fashion, so pages are fetched one at a time.
  • Speed:
    • Slower than other frameworks.
  • Efficiency:
    • Not the most efficient when handling larger workloads.

When using Requests with BeautifulSoup, the minimalism makes it very easy to code, but that same minimalism can lead to bottlenecks on larger workloads.

When to Choose Requests with BeautifulSoup over Scrapy

You should choose to use Requests with BeautifulSoup when you have a lighter workload or if you are just coming to scraping from another side of web development.

The minimal setup of this stack allows for quick prototypes that can scrape pages at a decent enough speed. Choose this stack when you need to build a prototype quickly or if you have a relatively small list of pages to scrape.

Installing Requests with Beautiful Soup

Installing Beautiful Soup and Requests is very easy. They can both be set up and installed via pip.

To install Requests run:

pip install requests

We do the same with BeautifulSoup:

pip install beautifulsoup4

After installing, let's create a new folder. You can name yours anything you'd like; this one will be called req_soup.

Inside this new folder, create a file (once again, you can name it whatever you'd like); this one will be called soup_tutorial.py.

Add the following code to our file:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://quotes.toscrape.com/")
soup = BeautifulSoup(response.content, 'html.parser')

quotes = soup.find_all("div", class_="quote")

for quote in quotes:
    print(quote.text)

In the code above, we:

  • Fetch our page with requests.get()
  • Create an instance of BeautifulSoup's html parser with BeautifulSoup()
  • Find all div elements with the class name of quote and return them as a list, quotes
  • Iterate through our quotes list with a for loop and print the text of each quote with print(quote.text)
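The script can then be run from inside the req_soup folder, and it prints the text of each quote block to the terminal:

python soup_tutorial.py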

Detailed Comparison

Take a look at the table below to see how these tech stacks match up.

Feature                 | Scrapy                                                       | Requests with BeautifulSoup
Use Case                | Large workloads and scraping at scale                        | Quick prototyping and simple scraping jobs
Speed                   | Blazing fast with async support                              | Medium for small tasks; gets bogged down with larger workloads
Ease of Use             | Requires a learning curve                                    | Very straightforward and easy to implement
Scraping Multiple Pages | Can efficiently handle multiple pages concurrently           | Fetches pages sequentially
Parsing Results         | Quick and efficient parsing, but with difficult syntax       | First-class, easy-to-use HTML parsing support
Setup                   | More difficult                                               | Simple script setup
Learning Curve          | Object-Oriented Programming (OOP) knowledge needed           | Suitable for beginners or those from other web development backgrounds
Element Selection       | Low-level and primitive selection process                    | Simpler, high-level selection with BeautifulSoup
Async Support           | Full async support                                           | Fetches pages sequentially
Project Complexity      | Projects requiring a complete toolchain and customizability  | Simple and minimalistic project requirements

Case Study: Scraping eBay with Scrapy vs Requests with BeautifulSoup

Let's pit these two setups head to head and see how they stack up. In this case study, we'll be scraping GPU listings from eBay and writing the results to a file.

Scraping GPUs with Scrapy

Let's build a new Scrapy project and call it ebay_scraper.

scrapy startproject ebay_scraper

Inside the spiders folder, create a new file, gpu_spider.py. Add the following code to this file:

from pathlib import Path
import scrapy
from urllib.parse import urlencode

API_KEY = "YOUR-SUPER-SECRET-API-KEY"

# function to convert urls to proxied urls
def get_scrapeops_url(url):
    payload = {
        "api_key": API_KEY,
        "url": url
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url


class GpuSpider(scrapy.Spider):
    # give the class a name
    name = "gpus"

    # which requests to make
    def start_requests(self):
        gpu_url = "https://www.ebay.com/b/Computer-Graphics-Cards/27386/bn_661796?_pgn="
        # urls to scrape
        urls = [
            f"{gpu_url}1",
            f"{gpu_url}2",
            f"{gpu_url}3",
            f"{gpu_url}4",
            f"{gpu_url}5",
            f"{gpu_url}6",
            f"{gpu_url}7",
            f"{gpu_url}8",
            f"{gpu_url}9",
            f"{gpu_url}10"
        ]
        for url in urls:
            yield scrapy.Request(url=get_scrapeops_url(url), callback=self.parse)

    # how to parse our information
    def parse(self, response):
        # check to see if the result file exists
        path = Path("gpu_results.txt")
        # if it doesn't exist, create it
        if not path.is_file():
            outfile = open("gpu_results.txt", "w")
            outfile.close()
        # open the results file in "append mode"
        outfile = open("gpu_results.txt", "a")
        # find all h3 elements, listings are displayed as h3
        listings = response.css("h3::text").getall()
        # append each listing to the results file
        for listing in listings:
            outfile.write(f"{listing}\n")
        # close the file once the page has been written
        outfile.close()

In the code above, we do the following:

  • Create a function to convert regular urls into proxied urls, get_scrapeops_url()
  • Create our GpuSpider class
  • Add our urls to the urls list
  • For each page in our urls list, find all text from h3 header elements with response.css("h3::text")
  • For each h3 element, write it to our results file

You can run the scraper with:

scrapy crawl gpus

In our test, Scrapy was lightning fast. In total, Scrapy fetched, parsed, and wrote everything to the results file in approximately 11 seconds. This test was run using a Lenovo Ideapad 1i.

Scraping GPUs with Requests and BeautifulSoup

Inside of our req_soup (or whatever you named yours) folder, we can create a new file, soup_ebay.py. Add the following code to this file:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urlencode

API_KEY = "YOUR-SUPER-SECRET-API-KEY"

# function to convert urls to proxied urls
def get_scrapeops_url(url):
    payload = {
        "api_key": API_KEY,
        "url": url
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url

# base url for a gpu page
gpu_url = "https://www.ebay.com/b/Computer-Graphics-Cards/27386/bn_661796?_pgn="

# create a file to write our results
outfile = open("gpu_results.txt", "w")
# we created the file, now close it, we will append to it later
outfile.close()

# for pages 1 through 10, do the following
for page in range(1, 11):
    # open the file in append mode
    outfile = open("gpu_results.txt", "a")
    # get the webpage
    response = requests.get(get_scrapeops_url(f"{gpu_url}{page}"))
    # create an instance of the html parser
    soup = BeautifulSoup(response.content, 'html.parser')
    # write the page number to our results file
    outfile.write(f"Page {page}\n")
    # find all h3 elements (listings use an h3 header)
    listings = soup.find_all("h3")
    # iterate through each listing
    for listing in listings:
        # append the file with the listing text
        outfile.write(f"{listing.text}\n")
    # close the file
    outfile.close()

In the code above, we:

  • Create a get_scrapeops_url() function to convert our urls to proxied urls
  • Save the base url of GPU results
  • Create a file to save our results
  • For pages one through ten, we:
    • Get the webpage, with the page number added to the base url
    • Find all h3 header elements with soup.find_all()
    • Append our outfile with the text of each listing
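As before, the script runs as a plain Python file from inside the req_soup folder:

python soup_ebay.py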

Requests and BeautifulSoup did a pretty decent job at this. In our test run, these results were all scraped and written to gpu_results.txt in approximately 107 seconds. This test was run on a Lenovo Ideapad 1i.

Which One is the Winner?

Scrapy blew Requests with BeautifulSoup out of the water on this test, finishing in roughly a tenth of the time (about 11 seconds versus 107).

While Requests with BeautifulSoup is much simpler to use, it's simply no match for Scrapy's speed, asynchronous execution, and efficiency when the workload gets heavy.


Conclusion

Now that you've finished this tutorial, you should have a decent understanding of how to get started with both Scrapy and Requests with BeautifulSoup.

  • You now know that Scrapy can handle enormous workloads with blazing fast speed.

  • You also know how simple it is to get started when using Requests with BeautifulSoup.


More Scraping Tutorials

If you'd like to learn more about scraping but don't know where to start, try one of these: