How To Eliminate Span & Other HTML Tags With BeautifulSoup

When parsing HTML responses while web scraping, or simply manipulating HTML files, you will sometimes want to remove certain elements from the document, such as <span> and <script> tags.

BeautifulSoup offers several ways to accomplish this, depending on what you would like to achieve.

In this guide, we will look at four ways to remove or manipulate HTML elements with BeautifulSoup: unwrapping a tag's contents with unwrap(), deleting a tag with decompose(), replacing a tag with replace_with(), and extracting the inner text with get_text().

Let's get started.


Unwrap Tag Contents With unwrap() Method

In the first scenario, we will look at how to use the unwrap() method to remove an HTML element's outer tags while leaving its contents in place in the document.

For example, say we have the following HTML document:


<html>
<body>
<h1>Hello, BeautifulSoup!</h1>
<ul>
<li><a href="http://example.com">Example.com <span>(Link)</span></a></li>
<li><a href="http://scrapy.org">Scrapy.com <span>(Link)</span></a></li>
</ul>
</body>
</html>

If we want to remove the <span> tags but keep their inner text, we can use BeautifulSoup's unwrap() method as follows:


from bs4 import BeautifulSoup

html_doc = """
<html>
<body>
<h1>Hello, BeautifulSoup!</h1>
<ul>
<li><a href="http://example.com">Example.com <span>(Link)</span></a></li>
<li><a href="http://scrapy.org">Scrapy.com <span>(Link)</span></a></li>
</ul>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

li_tags = soup.find_all('li')
for li in li_tags:
    li.span.unwrap()

print(soup)

The output would be as follows:


<html>
<body>
<h1>Hello, BeautifulSoup!</h1>
<ul>
<li><a href="http://example.com">Example.com (Link)</a></li>
<li><a href="http://scrapy.org">Scrapy.com (Link)</a></li>
</ul>
</body>
</html>

As you can see, unwrap() has replaced the <span> tags with just the inner contents.
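
One caveat with the example above: li.span returns None when an <li> has no <span>, so calling li.span.unwrap() would raise an AttributeError on such an element. A minimal sketch of a more defensive variant, assuming you want to unwrap every <span> in the document, is to loop over the <span> tags directly. The shortened html_doc and the unwrapped_html variable are illustrative, not part of the original example:

```python
from bs4 import BeautifulSoup

# Illustrative document: the second <li> deliberately has no <span>.
html_doc = """
<ul>
<li><a href="http://example.com">Example.com <span>(Link)</span></a></li>
<li><a href="http://scrapy.org">Scrapy.com</a></li>
</ul>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Iterate over the <span> tags themselves, so <li> elements
# without a <span> are simply never visited.
for span in soup.find_all("span"):
    span.unwrap()

unwrapped_html = str(soup)
print(unwrapped_html)
```

Because find_all() returns an empty list when nothing matches, this loop is also safe on documents that contain no <span> tags at all.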


Delete Tag With decompose() Method

Another scenario you might encounter is the need to delete a tag from the HTML document.

We can do this using BeautifulSoup's decompose() method, which removes the specified tag and its contents from the HTML document.

In the example below, we delete the <span> tags from the <li> elements entirely:


from bs4 import BeautifulSoup

html_doc = """
<html>
<body>
<h1>Hello, BeautifulSoup!</h1>
<ul>
<li><a href="http://example.com">Example.com <span>(Link)</span></a></li>
<li><a href="http://scrapy.org">Scrapy.com <span>(Link)</span></a></li>
</ul>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

li_tags = soup.find_all('li')
for li in li_tags:
    li.span.decompose()

print(soup)

The output would be as follows:


<html>
<body>
<h1>Hello, BeautifulSoup!</h1>
<ul>
<li><a href="http://example.com">Example.com </a></li>
<li><a href="http://scrapy.org">Scrapy.com </a></li>
</ul>
</body>
</html>

As you can see, decompose() has removed the <span> tags from the <li> elements completely.
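
The same approach works for other tags you typically do not want in scraped output, such as <script> and <style>. The following is a small sketch rather than a definitive recipe; the sample document and the cleaned_html variable are made up for illustration:

```python
from bs4 import BeautifulSoup

# Illustrative document containing tags we want to drop.
html_doc = """
<html>
<head><style>body { color: red; }</style></head>
<body>
<h1>Hello, BeautifulSoup!</h1>
<script>console.log("tracking");</script>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find_all() accepts a list of tag names, so both tag types
# can be removed in a single pass.
for tag in soup.find_all(["script", "style"]):
    tag.decompose()

cleaned_html = str(soup)
print(cleaned_html)
```

If you need to keep the removed element around for later use, BeautifulSoup's extract() method removes a tag from the tree and returns it instead of destroying it.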


Replace Tag With replace_with() Method

In certain situations, you might not want to delete the HTML element entirely, but rather replace it with another element or piece of text.

We can do this using BeautifulSoup's replace_with() method, which replaces the selected tag with a new tag or string.

In the example below, we replace the <span> tags with <b> tags:


from bs4 import BeautifulSoup

html_doc = """
<html>
<body>
<h1>Hello, BeautifulSoup!</h1>
<ul>
<li><a href="http://example.com">Example.com <span>(Link)</span></a></li>
<li><a href="http://scrapy.org">Scrapy.com <span>(Link)</span></a></li>
</ul>
</body>
</html>
"""

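soup = BeautifulSoup(html_doc, 'html.parser')

li_tags = soup.find_all('li')
for li in li_tags:
    # Create a fresh <b> tag for each replacement: a tag object can only
    # live in one place in the tree, so reusing one would move it.
    new_tag = soup.new_tag("b")
    new_tag.string = "[Click Here]"
    li.span.replace_with(new_tag)

print(soup)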

The output would be as follows:


<html>
<body>
<h1>Hello, BeautifulSoup!</h1>
<ul>
<li><a href="http://example.com">Example.com <b>[Click Here]</b></a></li>
<li><a href="http://scrapy.org">Scrapy.com <b>[Click Here]</b></a></li>
</ul>
</body>
</html>

As you can see, replace_with() has replaced the <span>(Link)</span> elements with <b>[Click Here]</b> elements.
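
replace_with() also accepts a plain string, which is handy when you want to swap an element for text rather than another tag. A minimal sketch, where the "[external]" label and the replaced_html variable are arbitrary choices for illustration:

```python
from bs4 import BeautifulSoup

html_doc = '<li><a href="http://example.com">Example.com <span>(Link)</span></a></li>'

soup = BeautifulSoup(html_doc, "html.parser")

# Passing a string replaces each <span> element with plain text.
for span in soup.find_all("span"):
    span.replace_with("[external]")

replaced_html = str(soup)
print(replaced_html)
```

Since each call receives its own string, there is no risk of a single replacement object being moved from one location to the next.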


Extract Inner Text With get_text() Method

Finally, you might not want to change the HTML document at all, and instead just extract all the text from within an element, even if it contains nested elements.

In this case, we can simply use BeautifulSoup's get_text() method, which extracts all the text contained within a selected tag.

In the example below, we extract all the text from within the <li> tags (including the text inside the nested <span> tags):


from bs4 import BeautifulSoup

html_doc = """
<html>
<body>
<h1>Hello, BeautifulSoup!</h1>
<ul>
<li><a href="http://example.com">Example.com <span>(Link)</span></a></li>
<li><a href="http://scrapy.org">Scrapy.com <span>(Link)</span></a></li>
</ul>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

li_tags = soup.find_all('li')
for li in li_tags:
    print(li.get_text())

The output would be as follows:


Example.com (Link)
Scrapy.com (Link)
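
get_text() also takes optional separator and strip arguments, which help when the text of nested elements runs together or carries stray whitespace. A short sketch, where the " | " separator and the link_texts variable are arbitrary choices for illustration:

```python
from bs4 import BeautifulSoup

html_doc = """
<ul>
<li><a href="http://example.com">Example.com <span>(Link)</span></a></li>
<li><a href="http://scrapy.org">Scrapy.com <span>(Link)</span></a></li>
</ul>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# strip=True trims whitespace from each text fragment, and
# separator is inserted between the fragments when joining them.
link_texts = [
    li.get_text(separator=" | ", strip=True)
    for li in soup.find_all("li")
]
print(link_texts)
# ['Example.com | (Link)', 'Scrapy.com | (Link)']
```

Here the document itself is left unchanged; only the extracted strings are cleaned up.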


More Web Scraping Tutorials

So that's how to remove certain HTML elements from the HTML document like <span> and <script> tags using BeautifulSoup.

If you would like to learn more about BeautifulSoup or web scraping in general, be sure to check out our other BeautifulSoup guides and The Python Web Scraping Playbook.