BeautifulSoup Guide: Scraping HTML Pages With Python
In this guide for The Python Web Scraping Playbook, we will look at how to use Python's popular BeautifulSoup library to build our first web scraper.
We will walk you through the most powerful features and functionality of BeautifulSoup so you can extract data from any web page.
- What Is BeautifulSoup?
- Installing BeautifulSoup
- Getting HTML Data From Website
- Getting HTML Data From File
- Querying The DOM Tree
- Querying With Python Object Attributes
- Querying With BeautifulSoup Methods
- Querying With CSS Selectors
First, let's get a quick overview of what BeautifulSoup is.
What Is BeautifulSoup?
BeautifulSoup is a Python library for extracting data from HTML and XML files.
You simply load an HTML response (or file) into a BeautifulSoup instance, and you can extract any value you want from the HTML page:
```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The first paragraph</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

## Get Title Tag
print(soup.find('title'))
# --> <title>The Dormouse's story</title>

## Get Title Tag Inner Text
print(soup.find('title').get_text())
# --> "The Dormouse's story"

## Get First Paragraph
print(soup.find('p'))
# --> <p class="title"><b>The first paragraph</b></p>

## Get First <a> Tag
print(soup.find('a'))
# --> <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

## Get All <a> Tags
print(soup.find_all('a'))
# --> [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#      <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#      <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
```
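Attribute access works the same way: each matched `Tag` exposes its HTML attributes with dictionary-style access. As a small self-contained sketch, using a trimmed version of the snippet above:

```python
from bs4 import BeautifulSoup

html_doc = """<html><body>
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</body></html>"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Each Tag exposes its attributes with dictionary-style access
links = [a['href'] for a in soup.find_all('a')]
print(links)
# --> ['http://example.com/elsie', 'http://example.com/lacie']
```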
BeautifulSoup transforms a complex HTML document into a tree of Python objects. From here you can search the HTML response for data based on its:

- Tag Type: e.g. `<p>`, `<a>`, `<span>`, etc.
- Id: e.g. `<span id="data">Target Data</span>`
- Classes: e.g. `<span class="data">Target Data</span>`
- CSS Selectors: e.g. `td:nth-child(2) > span:nth-child(1)`
We go into detail on how to query the DOM tree with BeautifulSoup later in this guide. In short, BeautifulSoup provides a number of ways in which we can query this DOM tree:

- Via Python object attributes
- BeautifulSoup methods `.find()` and `.find_all()`
- CSS Selectors via `.select()`

Each of these has its own pros and cons, which we will walk through.
Installing BeautifulSoup
Setting up and installing BeautifulSoup in your Python project is very simple.
You just need to install the latest version of BeautifulSoup:

```
pip install beautifulsoup4
```

Then import it into your Python script and initialize a BeautifulSoup instance by passing it some HTML data (`YOUR_HTML_DATA`):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(YOUR_HTML_DATA, 'html.parser')
```

From here, BeautifulSoup will parse the HTML response and allow you to query the HTML for the data you need.
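A note on the second argument: `'html.parser'` is Python's built-in parser, so it needs no extra install. BeautifulSoup can also use the third-party `lxml` and `html5lib` parsers if those packages are installed; switching is just a matter of changing the parser name:

```python
from bs4 import BeautifulSoup

html = "<p>Hello</p>"

# 'html.parser' ships with Python -- no extra install needed
soup = BeautifulSoup(html, 'html.parser')
print(soup.p.get_text())
# --> Hello

# With the optional lxml or html5lib packages installed, only the
# parser name changes:
#   soup = BeautifulSoup(html, 'lxml')
#   soup = BeautifulSoup(html, 'html5lib')
```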
Getting HTML Data From Website
BeautifulSoup is an HTML & XML parsing library that allows you to extract data from an HTML file.
However, it doesn't provide any functionality to actually get HTML data from a website.
To get the HTML data you need, you have to use a Python HTTP client library like Python Requests or Python HTTPX which allow you to send HTTP requests to websites to get the HTML response.
```python
import requests
from bs4 import BeautifulSoup

response = requests.get('https://quotes.toscrape.com/')
soup = BeautifulSoup(response.content, 'html.parser')
```
Above we use Python Requests to get the HTML response from QuotesToScrape.com, then we pass the HTML response into BeautifulSoup.
We use `response.content` instead of `response.text`, as using `response.text` can sometimes lead to character encoding issues. The `.content` attribute holds the raw bytes, which BeautifulSoup can decode more reliably than the text representation we receive with the `.text` attribute.
From here, we can use BeautifulSoup to extract the data we need from the HTML response.
Python has numerous HTTP client libraries to choose from, with Python Requests and Python HTTPX being the most popular.
If you would like to learn more about Python's HTTP clients and how they differ from one another, then check out our Python HTTP Client Comparison here.
Getting HTML Data From File
If you already have the HTML page stored as a file on your local machine or in a storage bucket, then you can also load that into BeautifulSoup.
```python
from bs4 import BeautifulSoup

with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')
```
Just open the file containing the HTML and load it into your BeautifulSoup instance.
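To sketch the full round trip, here is a small self-contained example that writes some HTML to disk and then loads it back into BeautifulSoup (the file name and HTML are just placeholders):

```python
from bs4 import BeautifulSoup

# Write some HTML to disk (a stand-in for a previously scraped page)
html = "<html><body><h1>Archived Page</h1></body></html>"
with open("index.html", "w", encoding="utf-8") as fp:
    fp.write(html)

# Later, load the saved file straight into BeautifulSoup
with open("index.html", encoding="utf-8") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

print(soup.h1.get_text())
# --> Archived Page
```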
Querying The DOM Tree
As we mentioned previously, when an HTML page is loaded into a BeautifulSoup instance, BeautifulSoup transforms the HTML document into a tree of Python objects.
BeautifulSoup provides a number of ways in which we can query this DOM tree:

- Via Python object attributes
- BeautifulSoup methods `.find()` and `.find_all()`
- CSS Selectors via `.select()`

Each of these has its own pros and cons, which we will walk through.
Querying With Python Object Attributes
As BeautifulSoup converts the HTML file into a tree of Python objects, we can select values from within that DOM tree using dot notation, like attributes on any other Python object.
For example, here are some examples of querying the DOM tree of QuotesToScrape.com with object attributes:
```python
import requests
from bs4 import BeautifulSoup

response = requests.get('https://quotes.toscrape.com/')
soup = BeautifulSoup(response.content, 'html.parser')

## H1 Element
print(soup.h1)
## --> <h1><a href="/" style="text-decoration: none">Quotes to Scrape</a></h1>

## H1 Text
print(soup.h1.a.string)
## --> 'Quotes to Scrape'

## H1 href
print(soup.h1.a['href'])
## --> '/'
```
This method works, but it isn't the best as:

- It will only return the first value it finds that matches your criteria.
- You can't create complex queries, like searching for all `div` tags where `class='quotes'`.

As a result, it is recommended to use BeautifulSoup's `.find()` and `.find_all()` methods, or use CSS Selectors via `.select()`.
Querying With BeautifulSoup Methods
The recommended way to search the DOM tree for the data you want to extract is using BeautifulSoup's `.find()` and `.find_all()` methods:

- `.find()` returns the first element that matches your query.
- `.find_all()` returns all elements that match your query, as a list.

BeautifulSoup .find() Method
You should use the `.find()` method when you know there is only one element on the page that matches your query, as it only returns the first match.
Let's rewrite the prior example using the `.find()` method:
```python
import requests
from bs4 import BeautifulSoup

response = requests.get('https://quotes.toscrape.com/')
soup = BeautifulSoup(response.content, 'html.parser')

## H1 Element
print(soup.find('h1'))
## --> <h1><a href="/" style="text-decoration: none">Quotes to Scrape</a></h1>

## H1 Text
print(soup.find('h1').get_text())
## --> 'Quotes to Scrape'

## H1 href
print(soup.find('h1').find('a').get('href'))
## --> '/'
```
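One behavior worth knowing: `.find()` returns `None` when nothing matches, so chaining a call like `.get_text()` onto a missed selector raises an `AttributeError`. A minimal defensive sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h1>Quotes to Scrape</h1>", 'html.parser')

# No <h2> on the page, so .find() returns None
subtitle = soup.find('h2')
print(subtitle)
# --> None

# Guard before calling methods on a possibly-missing element
heading = soup.find('h1')
title_text = heading.get_text() if heading else 'N/A'
print(title_text)
# --> Quotes to Scrape
```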
That all looks pretty similar to querying with object attributes; however, the `.find()` method gives us the ability to use more complex queries, like searching by class, id, and other element attributes.

Using `.find()` you can create queries where two or more conditions must be satisfied:
```python
## <p> Tag + Class Name
soup.find('p', class_='class_name')

## <p> Tag + Id
soup.find('p', id='id_name')

## <span> Tag + Any Attribute
soup.find('span', attrs={"aria-hidden": "true"})

## <p> Tag + Class Name & Id
soup.find('p', attrs={"class": "class_name", "id": "id_name"})
```
You can also pass a function into the `.find()` method when you want to make even more complex queries:

```python
def custom_selector(tag):
    # Return "span" tags with a class name of "target_span"
    return tag.name == "span" and tag.has_attr("class") and "target_span" in tag.get("class")

soup.find(custom_selector)
```
For a more detailed explanation of the `.find()` method, check out our How To Use BeautifulSoup's find() Method guide.
BeautifulSoup .find_all() Method
You should use the `.find_all()` method when there are multiple instances of the element on the page that match your query. `.find_all()` returns a list of elements that you can then parse individually.
Let's find all the quotes on the QuotesToScrape.com page:
```python
import requests
from bs4 import BeautifulSoup

response = requests.get('https://quotes.toscrape.com/')
soup = BeautifulSoup(response.content, 'html.parser')

## Find All Quotes
print(soup.find_all('span', class_='text'))

"""
Output:
[
    <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>,
    <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>,
    <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>,
    ...
]
"""
```
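Since `.find_all()` returns a list, in practice you usually loop over it and extract the text from each element. A small self-contained sketch, with stand-in quotes rather than the live page:

```python
from bs4 import BeautifulSoup

html = """
<span class="text">“Quote one.”</span>
<span class="text">“Quote two.”</span>
"""
soup = BeautifulSoup(html, 'html.parser')

# Pull the inner text out of each matched element
quotes = [span.get_text() for span in soup.find_all('span', class_='text')]
print(quotes)
# --> ['“Quote one.”', '“Quote two.”']
```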
Like the `.find()` method, with the `.find_all()` method you can create queries where two or more conditions must be satisfied:
```python
## <p> Tag + Class Name
soup.find_all('p', class_='class_name')

## <p> Tag + Id
soup.find_all('p', id='id_name')

## <span> Tag + Any Attribute
soup.find_all('span', attrs={"aria-hidden": "true"})

## <p> Tag + Class Name & Id
soup.find_all('p', attrs={"class": "class_name", "id": "id_name"})
```
Like `.find()`, the `.find_all()` method also accepts a function when you want to make even more complex queries:

```python
def custom_selector(tag):
    # Return "span" tags with a class name of "target_span"
    return tag.name == "span" and tag.has_attr("class") and "target_span" in tag.get("class")

soup.find_all(custom_selector)
```
For a more detailed explanation of the `.find_all()` method, check out our How To Use BeautifulSoup's find_all() Method guide.
Querying With CSS Selectors
BeautifulSoup provides a `.select()` method, which uses the SoupSieve package to run a CSS selector against a parsed document and return all the matching elements.
The SoupSieve documentation lists all the currently supported CSS selectors; however, here are some of the most commonly used:
- `.classes`
- `#ids`
- `[attributes=value]`
- `parent child`
- `parent > child`
- `sibling ~ sibling`
- `sibling + sibling`
- `:not(element.class, element2.class)`
- `:is(element.class, element2.class)`
- `parent:has(> child)`
Looking at our QuotesToScrape.com example again here is how you would use CSS Selectors:
```python
import requests
from bs4 import BeautifulSoup

response = requests.get('https://quotes.toscrape.com/')
soup = BeautifulSoup(response.content, 'html.parser')

## Find H1 Text
print(soup.select('h1 a')[0].get_text())
## --> 'Quotes to Scrape'

## Find All Quotes
print(soup.select('span.text'))

"""
Output:
[
    <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>,
    <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>,
    <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>,
    ...
]
"""
```
.select() Returns a List
The `.select()` method returns a list of elements, so when you are only looking for one element you need to take the first element (`[0]`) from the list.
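Alternatively, BeautifulSoup also provides a `.select_one()` method, which returns the first matching element directly (or `None` if there is no match), so you can skip the `[0]` indexing:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<h1><a href="/">Quotes to Scrape</a></h1>', 'html.parser')

# .select() returns a list; .select_one() returns a single Tag or None
print(soup.select_one('h1 a').get_text())
# --> Quotes to Scrape
print(soup.select_one('h2'))
# --> None
```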
Using CSS selectors over BeautifulSoup's built-in methods has a number of benefits:

- Easy Testing: With CSS selectors you can quickly develop and test your selectors in your browser's developer tools prior to implementing them in your code.
- Transferable: If you decide to use a different parsing library at a future point, and you are using CSS selectors, it should be a quick process to transfer them into your new code base.
- Easy to Maintain: Scrapers that use CSS selectors tend to be easier to maintain, as you can set each CSS selector as a variable and reuse that variable in multiple places in your code.
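The maintainability point can be as simple as hoisting selector strings into named constants, so a site redesign means a one-line change. A minimal sketch (the selectors and HTML here are stand-ins, not taken from a real page):

```python
from bs4 import BeautifulSoup

# Keep selectors in one place so a layout change is a one-line fix
QUOTE_SELECTOR = 'span.text'
AUTHOR_SELECTOR = 'small.author'

html = '<span class="text">“A quote.”</span><small class="author">Someone</small>'
soup = BeautifulSoup(html, 'html.parser')

quotes = [el.get_text() for el in soup.select(QUOTE_SELECTOR)]
authors = [el.get_text() for el in soup.select(AUTHOR_SELECTOR)]
print(quotes, authors)
# --> ['“A quote.”'] ['Someone']
```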
More Web Scraping Tutorials
So that's an introduction to Python BeautifulSoup.
If you would like to learn more about how to use BeautifulSoup then check out our other BeautifulSoup guides:
- How To Install BeautifulSoup
- Fix BeautifulSoup Returns Empty List or Value
- How To Use BeautifulSoup's find() Method
- How To Use BeautifulSoup's find_all() Method
If you would like to learn more about Web Scraping, then be sure to check out The Python Web Scraping Playbook.
Or check out one of our more in-depth guides: