Python BeautifulSoup - Eliminate Span & Other HTML Tags With BeautifulSoup

How To Eliminate Span & Other HTML Tags With BeautifulSoup

When you are parsing HTML responses while web scraping, or simply manipulating HTML files, you will sometimes want to remove certain elements from the document, such as <span> and <script> tags.

Using BeautifulSoup there are a number of ways to accomplish this depending on what you would like to achieve.

In this guide, we will look at the various ways you can use BeautifulSoup to remove or manipulate HTML tags to get the output you want:

Unwrap Tag Contents With unwrap() Method
Delete Tag With decompose() Method
Replace Tag With replace_with() Method
Extract Inner Text With get_text() Method

Let's get started.

Unwrap Tag Contents With unwrap() Method

In the first scenario, we will look at how to use the unwrap() method to unwrap the contents of an HTML element and insert them back into the HTML document without the outer tags.

For example, say we have the following HTML document:

<html>
    <body>
        <h1>Hello, BeautifulSoup!</h1>
        <ul>
            <li><a href="http://example.com">Example.com <span>(Link)</span></a></li>
            <li><a href="http://scrapy.org">Scrapy.com <span>(Link)</span></a></li>
        </ul>
    </body>
</html>

If we want to remove the <span> tags but keep their inner text, we can use BeautifulSoup's unwrap() method as follows:

from bs4 import BeautifulSoup

html_doc = """
<html>
    <body>
        <h1>Hello, BeautifulSoup!</h1>
        <ul>
            <li><a href="http://example.com">Example.com <span>(Link)</span></a></li>
            <li><a href="http://scrapy.org">Scrapy.com <span>(Link)</span></a></li>
        </ul>
    </body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

li_tags = soup.find_all('li')
for li in li_tags:
    li.span.unwrap()

print(soup)

The output would be as follows:

<html>
    <body>
        <h1>Hello, BeautifulSoup!</h1>
        <ul>
            <li><a href="http://example.com">Example.com (Link)</a></li>
            <li><a href="http://scrapy.org">Scrapy.com (Link)</a></li>
        </ul>
    </body>
</html>

As you can see, unwrap() has replaced the <span> tags with just their inner contents.
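
Note that the loop above assumes every <li> contains a <span>. If one does not, li.span is None and calling unwrap() on it raises an AttributeError. As a minimal sketch, assuming a trimmed-down, hypothetical version of the document in which the second link has no <span>, you could search for the <span> tags directly instead:

```python
from bs4 import BeautifulSoup

# Hypothetical variant of the example document: the second <li> has no <span>.
html_doc = """
<ul>
    <li><a href="http://example.com">Example.com <span>(Link)</span></a></li>
    <li><a href="http://scrapy.org">Scrapy.com</a></li>
</ul>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# find_all() returns only the <span> tags that actually exist,
# so list items without a <span> are simply never visited.
for span in soup.find_all('span'):
    span.unwrap()

print(soup)
```

This approach also unwraps every <span> in the document, not only the first one inside each <li>, which may or may not be what you want.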

Delete Tag With decompose() Method

Another scenario you might encounter is the need to delete a tag from the HTML document.

We can do this using BeautifulSoup's decompose() method which will delete the specified tag from the HTML document.

This time, in the example below, we will delete the <span> tags from the <li> elements entirely:

from bs4 import BeautifulSoup

html_doc = """
<html>
    <body>
        <h1>Hello, BeautifulSoup!</h1>
        <ul>
            <li><a href="http://example.com">Example.com <span>(Link)</span></a></li>
            <li><a href="http://scrapy.org">Scrapy.com <span>(Link)</span></a></li>
        </ul>
    </body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

li_tags = soup.find_all('li')
for li in li_tags:
    li.span.decompose()

print(soup)

The output would be as follows:

<html>
    <body>
        <h1>Hello, BeautifulSoup!</h1>
        <ul>
            <li><a href="http://example.com">Example.com </a></li>
            <li><a href="http://scrapy.org">Scrapy.com </a></li>
        </ul>
    </body>
</html>

As you can see, decompose() has removed the <span> tags, along with their contents, from the <li> elements completely.
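
The same pattern should work for the <script> tags mentioned at the start of this guide. The following is a rough sketch, assuming a small, hypothetical document that contains a <script> and a <style> block you want to drop before processing the rest of the page:

```python
from bs4 import BeautifulSoup

# Hypothetical document with non-content tags in the <head>.
html_doc = """
<html>
    <head>
        <style>h1 { color: red; }</style>
        <script>console.log("tracking");</script>
    </head>
    <body>
        <h1>Hello, BeautifulSoup!</h1>
    </body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Passing a list to find_all() matches any of the listed tag names.
for tag in soup.find_all(['script', 'style']):
    tag.decompose()

print(soup)
```

Because decompose() destroys the tag and everything inside it, the JavaScript and CSS source is removed along with the tags themselves.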

Replace Tag With replace_with() Method

In certain situations, you might not want to delete the HTML element entirely, but instead replace it with another element or some text.

We can do this using BeautifulSoup's replace_with() method which will replace the selected tag with a new tag.

This time, in the example below, we will replace the <span> tags with <b> tags:

from bs4 import BeautifulSoup

html_doc = """
<html>
    <body>
        <h1>Hello, BeautifulSoup!</h1>
        <ul>
            <li><a href="http://example.com">Example.com <span>(Link)</span></a></li>
            <li><a href="http://scrapy.org">Scrapy.com <span>(Link)</span></a></li>
        </ul>
    </body>
</html>
"""

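soup = BeautifulSoup(html_doc, 'html.parser')

li_tags = soup.find_all('li')
for li in li_tags:
    # Create a fresh <b> tag for each <span>. Reusing one tag object would
    # move it on every call, leaving it only in the last <li>.
    new_tag = soup.new_tag("b")
    new_tag.string = "[Click Here]"
    li.span.replace_with(new_tag)

print(soup)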

The output would be as follows:

<html>
    <body>
        <h1>Hello, BeautifulSoup!</h1>
        <ul>
            <li><a href="http://example.com">Example.com <b>[Click Here]</b></a></li>
            <li><a href="http://scrapy.org">Scrapy.com <b>[Click Here]</b></a></li>
        </ul>
    </body>
</html>

As you can see, replace_with() has replaced the <span>(Link)</span> tags with <b>[Click Here]</b> elements.
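
If you want to swap the tag for plain text rather than another element, replace_with() also accepts a string. Here is a minimal sketch, assuming a single-line, hypothetical snippet and an arbitrary replacement label:

```python
from bs4 import BeautifulSoup

# Hypothetical one-line snippet based on the example document.
html_doc = '<li><a href="http://example.com">Example.com <span>(Link)</span></a></li>'

soup = BeautifulSoup(html_doc, 'html.parser')

# A plain string passed to replace_with() is inserted as text, not as a tag.
soup.span.replace_with('[external]')

print(soup)
```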

Extract Inner Text With get_text() Method

Finally, you might not want to change the HTML document at all, and instead just extract all the text from within an element, even if it contains inner elements.

In this case we can simply use BeautifulSoup's get_text() method which will extract all the text contained within a selected tag.

This time, in the example below, we will extract all the text from within the <li> tags (including the text inside the nested <span> tags):

from bs4 import BeautifulSoup

html_doc = """
<html>
    <body>
        <h1>Hello, BeautifulSoup!</h1>
        <ul>
            <li><a href="http://example.com">Example.com <span>(Link)</span></a></li>
            <li><a href="http://scrapy.org">Scrapy.com <span>(Link)</span></a></li>
        </ul>
    </body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

li_tags = soup.find_all('li')
for li in li_tags:
    print(li.get_text())

The output would be as follows:

Example.com (Link)
Scrapy.com (Link)
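
The text comes out nicely spaced here only because the source HTML has a space before each <span>. When it does not, get_text() joins the pieces with nothing between them. As a rough sketch, assuming a hypothetical variant of the markup without that space, the separator and strip arguments of get_text() can be used to control the result:

```python
from bs4 import BeautifulSoup

# Hypothetical variant: no space between "Example.com" and the <span>.
html_doc = """
<ul>
    <li><a href="http://example.com">Example.com<span>(Link)</span></a></li>
</ul>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
li = soup.find('li')

# Default behaviour: text pieces are concatenated as-is.
print(li.get_text())  # Example.com(Link)

# separator is placed between pieces; strip=True trims whitespace from each piece.
print(li.get_text(separator=' ', strip=True))  # Example.com (Link)
```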
