Python BeautifulSoup - Scraping HTML Pages With Python

BeautifulSoup Guide: Scraping HTML Pages With Python

In this guide for The Python Web Scraping Playbook, we will look at how to use Python's popular BeautifulSoup library to build our first web scraper.

We will walk you through the most powerful features and functionality of BeautifulSoup so you can extract data from any web page.

What Is BeautifulSoup?
Installing BeautifulSoup
Getting HTML Data From Website
Getting HTML Data From File
Querying The DOM Tree
Querying With Python Object Attributes
Querying With BeautifulSoup Methods
Querying With CSS Selectors

First, let's get a quick overview of what BeautifulSoup is.

What Is BeautifulSoup?

BeautifulSoup is a Python library for extracting data from HTML and XML files.

You simply load an HTML response (or file) into a BeautifulSoup instance, and you can extract any value you want from the HTML page:

from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The first paragraph</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

## Get Title Tag
print(soup.find('title'))
# --> <title>The Dormouse's story</title>

## Get Title Tag Inner Text
print(soup.find('title').get_text())
# -->  "The Dormouse's story"

## Get First Paragraph
print(soup.find('p'))
# --> <p class="title"><b>The first paragraph</b></p>

## Get First <a> Tag
print(soup.find('a'))
# --> <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

## Get All <a> Tags
print(soup.find_all('a'))
# --> [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

BeautifulSoup transforms a complex HTML document into a complex tree of Python objects. From here you can search the HTML response for data based on its:

Tag Type: e.g. <p>, <a>, etc.
Id: e.g. id="link1"
Classes: e.g. class="sister"
CSS Selectors: e.g. td:nth-child(2) > span:nth-child(1)
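
As a rough illustration of these four lookup styles, here is a minimal sketch. The markup, the `price` id and the `amount` class are made up for this example and do not come from a real site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented purely for illustration
html = """
<table>
  <tr>
    <td>Widget</td>
    <td><span id="price" class="amount">9.99</span></td>
  </tr>
</table>
<a href="/next">Next page</a>
"""

soup = BeautifulSoup(html, "html.parser")

by_tag = soup.find("a")                # search by tag type
by_id = soup.find(id="price")          # search by id
by_class = soup.find(class_="amount")  # search by class
by_css = soup.select_one("td:nth-child(2) > span:nth-child(1)")  # search by CSS selector

print(by_tag.get_text())
# --> Next page
print(by_id.get_text())
# --> 9.99
print(by_class["id"])
# --> price
print(by_css["id"])
# --> price
```

The last three lookups all land on the same `<span>`, which shows that the choice between them is mostly a matter of convenience.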

We go into detail on how to query the DOM tree with BeautifulSoup later in this guide. In short, BeautifulSoup provides a number of ways in which we can query this DOM tree:

Via Python object attributes
BeautifulSoup methods .find() and .find_all()
CSS Selectors .select()

Each has its own pros and cons, which we will walk through.

Installing BeautifulSoup

Setting up and installing BeautifulSoup in your Python project is very simple.

You just need to install the latest version of BeautifulSoup:

pip install beautifulsoup4

Then import it into your Python script and initialize a BeautifulSoup instance by passing it some HTML data (YOUR_HTML_DATA):

from bs4 import BeautifulSoup

soup = BeautifulSoup(YOUR_HTML_DATA, 'html.parser')

From here BeautifulSoup will parse the HTML response and allow you to query the HTML for the data you need.

Getting HTML Data From Website

BeautifulSoup is an HTML and XML parsing library that allows you to extract data from an HTML file.

However, it doesn't provide any functionality to actually get HTML data from a website.

To get the HTML data you need, you have to use a Python HTTP client library like Python Requests or Python HTTPX, which allows you to send HTTP requests to websites and get the HTML response.

import requests
from bs4 import BeautifulSoup

response = requests.get('https://quotes.toscrape.com/')
soup = BeautifulSoup(response.content, 'html.parser')

Above we use Python Requests to get the HTML response from QuotesToScrape.com, then we pass the HTML response into BeautifulSoup.
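
In a real scraper you would usually also want a timeout and a status check around the request. The following is a minimal sketch of one way to do that; the helper name `fetch_soup` and the 10 second default timeout are assumptions made for this example, not part of BeautifulSoup or Requests.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Fetch a page and return it parsed as a BeautifulSoup object.

    Hypothetical helper: the name and the default timeout are illustrative.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")
```

Calling `fetch_soup('https://quotes.toscrape.com/')` would then return a `soup` object like the one above, or raise an exception if the request fails.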

Character Encoding

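We use response.content instead of response.text because response.text can sometimes lead to character encoding issues.

The .content attribute holds the raw bytes of the response, so BeautifulSoup can detect the encoding itself (for example from a charset declared in the HTML), whereas .text has already been decoded by Requests using the encoding it inferred, primarily from the HTTP headers.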
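
The difference can be sketched without any network access. The example below uses a made-up page encoded as ISO-8859-1 and simulates a client that wrongly assumed UTF-8 when producing text; it is an illustration of the failure mode, not a claim about how any particular site behaves.

```python
from bs4 import BeautifulSoup

# Hypothetical page served as ISO-8859-1 bytes, with the encoding declared in a <meta> tag
raw_bytes = (
    '<html><head><meta charset="iso-8859-1"></head>'
    "<body><p>café</p></body></html>"
).encode("iso-8859-1")

# Passing bytes lets BeautifulSoup work out the encoding itself
soup_from_bytes = BeautifulSoup(raw_bytes, "html.parser")
print(soup_from_bytes.p.get_text())
# --> café
print(soup_from_bytes.original_encoding)  # the encoding BeautifulSoup settled on

# Simulate text that was decoded with the wrong encoding before parsing
wrongly_decoded = raw_bytes.decode("utf-8", errors="replace")
soup_from_text = BeautifulSoup(wrongly_decoded, "html.parser")

# The accented character was lost and replaced by U+FFFD during decoding
print("\ufffd" in soup_from_text.p.get_text())
# --> True
```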

From here, we can use BeautifulSoup to extract the data we need from the HTML response.

Python HTTP Clients

Python has numerous HTTP client libraries to choose from, with Python Requests and Python HTTPX being the most popular.

Getting HTML Data From File

If you already have the HTML page stored as a file on your local machine or storage bucket then you can also load that into BeautifulSoup.

from bs4 import BeautifulSoup

with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

Just open the file containing the HTML and load it into your BeautifulSoup instance.
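
To try this without an existing file, here is a self-contained sketch that first writes a small made-up page to a temporary directory and then loads it back, once in text mode with an explicit encoding and once in binary mode. The file contents are invented for the example.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Hypothetical page contents, written to disk so the example is self-contained
page = (
    '<html><head><meta charset="utf-8"><title>Saved page</title></head>'
    "<body><p>naïve café</p></body></html>"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "index.html"
    path.write_text(page, encoding="utf-8")

    # Text mode: state the encoding explicitly rather than relying on the platform default
    with open(path, encoding="utf-8") as fp:
        soup_from_text = BeautifulSoup(fp, "html.parser")

    # Binary mode: BeautifulSoup reads the bytes and detects the encoding itself
    with open(path, "rb") as fp:
        soup_from_bytes = BeautifulSoup(fp, "html.parser")

print(soup_from_text.title.get_text())
# --> Saved page
print(soup_from_bytes.p.get_text())
# --> naïve café
```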

Querying The DOM Tree

As we mentioned previously, when an HTML page is loaded into a BeautifulSoup instance, BeautifulSoup transforms the HTML document into a complex tree of Python objects.

BeautifulSoup provides a number of ways in which we can query this DOM tree:

Via Python object attributes
BeautifulSoup methods .find() and .find_all()
CSS Selectors .select()

Each has its own pros and cons, which we will walk through.

Querying With Python Object Attributes

As BeautifulSoup converts the HTML file into a complex tree of Python objects, we can select values from within that DOM tree by accessing tag names as attributes, much like we would with nested Python objects (tag attributes such as href are read with dictionary-style access).

For example, here is how to query the DOM tree of QuotesToScrape.com with object attributes:

import requests
from bs4 import BeautifulSoup

response = requests.get('https://quotes.toscrape.com/')
soup = BeautifulSoup(response.content, 'html.parser')

## H1 Element
print(soup.h1)
## --> <h1><a href="/" style="text-decoration: none">Quotes to Scrape</a></h1>

## H1 Text
print(soup.h1.a.string)
## --> 'Quotes to Scrape'

## H1 href
print(soup.h1.a['href'])
## --> '/'

This method works but it isn't the best as:

It will only return the first value it finds that matches your criteria.
You can't create complex queries like searching for all div tags where class='quotes'

As a result, it is recommended to use BeautifulSoup's .find() and .find_all() methods, or use CSS Selectors via .select().
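
Both limitations are easy to see on a small made-up snippet. The sketch below also shows that attribute access returns None for a tag that does not exist, so chaining further calls onto it needs a guard. The markup is invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two matching <span> tags and no <h1>
html = """
<div class="quote"><span class="text">First quote</span></div>
<div class="quote"><span class="text">Second quote</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute access only ever returns the first matching tag
print(soup.span.string)
# --> First quote

# A tag that is not in the document comes back as None
print(soup.h1)
# --> None

# Guard before chaining, otherwise soup.h1.get_text() raises AttributeError
heading = soup.h1
heading_text = heading.get_text() if heading is not None else None
print(heading_text)
# --> None
```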

Querying With BeautifulSoup Methods

The recommended way to search the DOM tree for the data you want to extract is using BeautifulSoup's .find() and .find_all() methods.

.find() returns the first element that matches your query.
.find_all() returns a list of all the elements that match your query.

BeautifulSoup `.find()` Method

You should use the .find() method when you know there is only one element on the page that matches your query as it only returns the first element that matches your query.

Let's rewrite the prior example using the .find() method:

import requests
from bs4 import BeautifulSoup

response = requests.get('https://quotes.toscrape.com/')
soup = BeautifulSoup(response.content, 'html.parser')

## H1 Element
print(soup.find('h1'))
## --> <h1><a href="/" style="text-decoration: none">Quotes to Scrape</a></h1>

## H1 Text
print(soup.find('h1').get_text())
## --> 'Quotes to Scrape'

## H1 href
print(soup.find('h1').find('a').get('href'))
## --> '/'

That all looks pretty similar to querying with object attributes. However, the .find() method gives us the ability to use more complex queries like searching by class, id, and other element attributes.

Using .find() you can create queries where two or more conditions must be satisfied:

## <p> Tag + Class Name
soup.find('p', class_='class_name')

## <p> Tag + Id
soup.find('p', id='id_name')

## <span> Tag + Any Attribute
soup.find('span', attrs={"aria-hidden": "true"})

## <p> Tag + Class Name & Id
soup.find('p', attrs={"class": "class_name", "id": "id_name"})
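
To see these query shapes return something, here is a small runnable sketch. The markup, class names and ids are placeholders invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration
html = """
<p class="intro" id="lead">Lead paragraph</p>
<p class="intro">Second intro</p>
<span aria-hidden="true">decorative</span>
<p class="body" id="main">Main text</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Tag + class name (returns the first <p> with that class)
print(soup.find("p", class_="intro").get_text())
# --> Lead paragraph

# Tag + id
print(soup.find("p", id="main").get_text())
# --> Main text

# Tag + any attribute
print(soup.find("span", attrs={"aria-hidden": "true"}).get_text())
# --> decorative

# Tag + class name and id together
print(soup.find("p", attrs={"class": "intro", "id": "lead"}).get_text())
# --> Lead paragraph

# No match returns None rather than raising an error
print(soup.find("p", class_="missing"))
# --> None
```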

You can also pass functions into the .find() method when you want to make even more complex queries:

def custom_selector(tag):
    # Return "span" tags with a class name of "target_span"
    return tag.name == "span" and "target_span" in tag.get("class", [])

soup.find(custom_selector)
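
As another sketch of the same idea, the function below matches links whose href starts with "http", a condition that is awkward to express with plain keyword arguments. The function name `is_external_link` and the markup are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration
html = """
<a href="/about">About</a>
<a href="https://example.com/docs">Docs</a>
<a>No href</a>
"""
soup = BeautifulSoup(html, "html.parser")


def is_external_link(tag):
    # Match <a> tags whose href starts with "http"; tags without href are skipped
    return tag.name == "a" and tag.get("href", "").startswith("http")


print(soup.find(is_external_link).get_text())
# --> Docs

external_hrefs = [a["href"] for a in soup.find_all(is_external_link)]
print(external_hrefs)
# --> ['https://example.com/docs']
```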

For a more detailed explanation of the .find() method, check out our How To Use BeautifulSoup's find() Method guide.

BeautifulSoup `.find_all()` Method

You should use the .find_all() method when there are multiple elements on the page that match your query. .find_all() returns a list of elements that you can then parse individually.

Let's find all the quotes on the QuotesToScrape.com page:

import requests
from bs4 import BeautifulSoup

response = requests.get('https://quotes.toscrape.com/')
soup = BeautifulSoup(response.content, 'html.parser')

## Find All Quotes
print(soup.find_all('span', class_='text'))

"""
Output:

[
    <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>,
    <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>,
    <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>,
    ...
]

"""

Like the .find() method, with the .find_all() method you can create queries where two or more conditions must be satisfied:

## <p> Tag + Class Name
soup.find_all('p', class_='class_name')

## <p> Tag + Id
soup.find_all('p', id='id_name')

## <span> Tag + Any Attribute
soup.find_all('span', attrs={"aria-hidden": "true"})

## <p> Tag + Class Name & Id
soup.find_all('p', attrs={"class": "class_name", "id": "id_name"})

As with .find(), you can also pass functions into the .find_all() method when you want to make even more complex queries:

def custom_selector(tag):
    # Return "span" tags with a class name of "target_span"
    return tag.name == "span" and "target_span" in tag.get("class", [])

soup.find_all(custom_selector)

For a more detailed explanation of the .find_all() method, check out our How To Use BeautifulSoup's find_all() Method guide.
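
Because .find_all() returns a list, a common pattern is to loop over the results and run .find() on each element to pull out its fields. The sketch below uses a simplified stand-in for a quotes page; the markup, class names and values are invented for the example and are not copied from the live site.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical stand-in for a page listing quotes
html = """
<div class="quote">
  <span class="text">“Quote one.”</span>
  <small class="author">Author A</small>
</div>
<div class="quote">
  <span class="text">“Quote two.”</span>
  <small class="author">Author B</small>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

quotes = []
for quote_div in soup.find_all("div", class_="quote"):
    # Search inside each quote block rather than the whole page
    text = quote_div.find("span", class_="text").get_text(strip=True)
    author = quote_div.find("small", class_="author").get_text(strip=True)
    quotes.append({"text": text, "author": author})

print(len(quotes))
# --> 2
print(quotes[0]["author"])
# --> Author A
```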

Querying With CSS Selectors

BeautifulSoup provides a .select() method which uses the SoupSieve package to run a CSS selector against a parsed document and return all the matching elements.

The SoupSieve documentation lists all the currently supported CSS selectors. Here are some of the most commonly used:

.classes
#ids
[attributes=value]
parent child
parent > child
sibling ~ sibling
sibling + sibling
:not(element.class, element2.class)
:is(element.class, element2.class)
parent:has(> child)
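
The sketch below runs several of these selector forms against a small made-up document so you can see what each one matches. The markup and the helper name `texts` are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration
html = """
<div id="main">
  <h2 class="title">Heading</h2>
  <p class="intro">Intro</p>
  <p>Body</p>
  <ul><li data-kind="fruit">Apple</li><li data-kind="veg">Leek</li></ul>
</div>
<div id="footer"><p class="intro">Footer intro</p></div>
"""
soup = BeautifulSoup(html, "html.parser")


def texts(selector):
    # Hypothetical helper: text of every element matching the selector
    return [element.get_text() for element in soup.select(selector)]


print(texts(".intro"))               # class
# --> ['Intro', 'Footer intro']
print(texts("#main > p"))            # id + direct child
# --> ['Intro', 'Body']
print(texts('[data-kind="fruit"]'))  # attribute value
# --> ['Apple']
print(texts("h2 + p"))               # immediately following sibling
# --> ['Intro']
print(texts("h2 ~ p"))               # any following sibling
# --> ['Intro', 'Body']
print(texts("p:not(.intro)"))        # negation
# --> ['Body']
print([div["id"] for div in soup.select("div:has(> ul)")])  # parent with a given child
# --> ['main']
```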

Looking at our QuotesToScrape.com example again, here is how you would use CSS Selectors:

import requests
from bs4 import BeautifulSoup

response = requests.get('https://quotes.toscrape.com/')
soup = BeautifulSoup(response.content, 'html.parser')

## Find H1 Text
print(soup.select('h1 a')[0].get_text())
## --> 'Quotes to Scrape'

## Find All Quotes
print(soup.select('span.text'))

"""
Output:

[
    <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>,
    <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>,
    <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>,
    ...
]

"""

.select() Returns List

The .select() method returns a list of elements, so when you are only looking for one element you need to take the first element ([0]) from the list.
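
BeautifulSoup also offers .select_one(), which returns the first match directly, or None when nothing matches. That avoids the IndexError you would get from indexing an empty list. A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration
html = """
<div class="quote"><span class="text">First quote</span></div>
<div class="quote"><span class="text">Second quote</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Equivalent ways to get the first match
print(soup.select("span.text")[0].get_text())
# --> First quote
print(soup.select_one("span.text").get_text())
# --> First quote

# With no match, .select() gives an empty list and .select_one() gives None
print(len(soup.select("span.missing")))
# --> 0
print(soup.select_one("span.missing"))
# --> None
```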

Using CSS selectors over BeautifulSoup's built-in methods has a number of benefits:

Easy Testing: With CSS selectors you can quickly develop and test your selectors in your browser's developer tools prior to implementing them in your code.
Transferable: If you decide to use a different parsing library at a future point, CSS selectors should be quick to transfer into your new code base.
Easy to Maintain: Scrapers that use CSS selectors tend to be easier to maintain, as you can set the CSS selectors as variables and reuse those variables in multiple places in your code.
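
The maintainability point can be sketched as follows: the selectors live in named constants at the top of the module, so a change to the page layout means editing one line. The constant names, the `parse_quotes` function and the sample markup are all hypothetical.

```python
from bs4 import BeautifulSoup

# Selectors kept in one place (names and values are illustrative)
QUOTE_SELECTOR = "div.quote"
TEXT_SELECTOR = "span.text"
AUTHOR_SELECTOR = "small.author"


def parse_quotes(html):
    """Return a list of {'text', 'author'} dicts found in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for quote in soup.select(QUOTE_SELECTOR):
        text_el = quote.select_one(TEXT_SELECTOR)
        author_el = quote.select_one(AUTHOR_SELECTOR)
        results.append({
            # Fall back to None when a field is missing from a quote block
            "text": text_el.get_text(strip=True) if text_el else None,
            "author": author_el.get_text(strip=True) if author_el else None,
        })
    return results


# Hypothetical sample input
sample_html = """
<div class="quote"><span class="text">Quote one.</span><small class="author">Author A</small></div>
<div class="quote"><span class="text">Quote two.</span></div>
"""
parsed = parse_quotes(sample_html)
print(parsed[0])
# --> {'text': 'Quote one.', 'author': 'Author A'}
print(parsed[1])
# --> {'text': 'Quote two.', 'author': None}
```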
