How To Eliminate Span & Other HTML Tags With BeautifulSoup
Sometimes, when parsing HTML responses while web scraping or simply manipulating HTML files, you will want to remove certain HTML elements, such as <span> and <script> tags, from the document.
BeautifulSoup offers several ways to do this, depending on whether you want to keep, delete, or replace the contents of the tag.
In this guide, we will look at the ways you can use BeautifulSoup to remove or manipulate HTML elements to get the output you want:
- Unwrap Tag Contents With unwrap() Method
- Delete Tag With decompose() Method
- Replace Tag With replace_with() Method
- Extract Inner Text With get_text() Method
Let's get started.
Unwrap Tag Contents With unwrap() Method
In the first scenario, we will use the unwrap() method to remove an HTML element's outer tags while leaving its contents in place in the HTML document.
For example, say we have the following HTML document:
<html>
<body>
<h1>Hello, BeautifulSoup!</h1>
<ul>
<li><a href="http://example.com">Example.com <span>(Link)</span></a></li>
<li><a href="http://scrapy.org">Scrapy.com <span>(Link)</span></a></li>
</ul>
</body>
</html>
If we want to remove the <span> tags but keep their inner text, we can use BeautifulSoup's unwrap() method as follows:
from bs4 import BeautifulSoup
html_doc = """
<html>
<body>
<h1>Hello, BeautifulSoup!</h1>
<ul>
<li><a href="http://example.com">Example.com <span>(Link)</span></a></li>
<li><a href="http://scrapy.org">Scrapy.com <span>(Link)</span></a></li>
</ul>
</body>
</html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
li_tags = soup.find_all('li')
for li in li_tags:
    # Replace the first <span> inside each <li> with its own contents
    li.span.unwrap()
print(soup)
The output would be as follows:
<html>
<body>
<h1>Hello, BeautifulSoup!</h1>
<ul>
<li><a href="http://example.com">Example.com (Link)</a></li>
<li><a href="http://scrapy.org">Scrapy.com (Link)</a></li>
</ul>
</body>
</html>
As you can see, unwrap() has replaced the <span> tags with just their inner contents.
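Note that li.span returns None when an <li> contains no <span>, in which case li.span.unwrap() raises an AttributeError. A minimal sketch of one way to avoid this, assuming you want to unwrap every <span> in the document rather than only the first one in each <li> (the sample HTML here is made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the example above
html_doc = '<p>Price: <span class="cur">$</span><span>10</span> <em>only</em></p>'
soup = BeautifulSoup(html_doc, "html.parser")

# find_all() returns only the <span> tags that actually exist,
# so there is no None to guard against
for span in soup.find_all("span"):
    span.unwrap()

result = str(soup)
print(result)
```

Tags other than <span>, such as the <em> here, are left untouched.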
Delete Tag With decompose() Method
Another scenario you might encounter is the need to delete a tag from the HTML document.
We can do this using BeautifulSoup's decompose() method which will delete the specified tag from the HTML document.
In the example below, we delete the <span> tags from the <li> elements entirely, along with their contents:
from bs4 import BeautifulSoup
html_doc = """
<html>
<body>
<h1>Hello, BeautifulSoup!</h1>
<ul>
<li><a href="http://example.com">Example.com <span>(Link)</span></a></li>
<li><a href="http://scrapy.org">Scrapy.com <span>(Link)</span></a></li>
</ul>
</body>
</html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
li_tags = soup.find_all('li')
for li in li_tags:
    # Remove the first <span> inside each <li>, including its contents
    li.span.decompose()
print(soup)
The output would be as follows:
<html>
<body>
<h1>Hello, BeautifulSoup!</h1>
<ul>
<li><a href="http://example.com">Example.com </a></li>
<li><a href="http://scrapy.org">Scrapy.com </a></li>
</ul>
</body>
</html>
As you can see, decompose() has removed the <span> tags, together with their contents, from the <li> elements.
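The same approach should work for <script> tags, which were mentioned at the start of this guide. The sketch below assumes you want to strip every <script> and <style> element from a document; the sample HTML is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup containing scripts and styles
html_doc = (
    "<html><head><script>var x = 1;</script>"
    "<style>p {color: red;}</style></head>"
    '<body><p>Hello</p><script src="a.js"></script></body></html>'
)
soup = BeautifulSoup(html_doc, "html.parser")

# Passing a list to find_all() matches any of the listed tag names
for tag in soup.find_all(["script", "style"]):
    tag.decompose()

cleaned = str(soup)
print(cleaned)
```

If you need to keep the removed element for later use, BeautifulSoup's extract() method removes a tag from the tree and returns it instead of destroying it.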
Replace Tag With replace_with() Method
In certain situations, you might not want to delete the HTML element entirely, but rather replace it with another element or some text.
We can do this using BeautifulSoup's replace_with() method, which replaces the selected tag with a new tag or string.
In the example below, we replace the <span> tags with <b> tags:
from bs4 import BeautifulSoup
html_doc = """
<html>
<body>
<h1>Hello, BeautifulSoup!</h1>
<ul>
<li><a href="http://example.com">Example.com <span>(Link)</span></a></li>
<li><a href="http://scrapy.org">Scrapy.com <span>(Link)</span></a></li>
</ul>
</body>
</html>
"""
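soup = BeautifulSoup(html_doc, 'html.parser')
li_tags = soup.find_all('li')
for li in li_tags:
    # Create a fresh <b> tag for each <span>; reusing a single tag object
    # would move it from the first <li> to the second instead of copying it
    new_tag = soup.new_tag("b")
    new_tag.string = "[Click Here]"
    li.span.replace_with(new_tag)
print(soup)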
The output would be as follows:
<html>
<body>
<h1>Hello, BeautifulSoup!</h1>
<ul>
<li><a href="http://example.com">Example.com <b>[Click Here]</b></a></li>
<li><a href="http://scrapy.org">Scrapy.com <b>[Click Here]</b></a></li>
</ul>
</body>
</html>
As you can see, replace_with() has replaced the <span>(Link)</span> elements with <b>[Click Here]</b> elements.
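replace_with() also accepts a plain string, which is useful when you want to swap a tag for text rather than for another element. A minimal sketch, where the replacement text "[external]" is an arbitrary placeholder chosen for illustration:

```python
from bs4 import BeautifulSoup

html_doc = '<li><a href="http://example.com">Example.com <span>(Link)</span></a></li>'
soup = BeautifulSoup(html_doc, "html.parser")

# Replace each <span> (and its contents) with a plain text string
for span in soup.find_all("span"):
    span.replace_with("[external]")

result = str(soup)
print(result)
```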
Extract Inner Text With get_text() Method
Finally, you might not want to change the HTML document at all, and instead just extract all the text from within an element, even if it contains inner elements.
In this case, we can use BeautifulSoup's get_text() method, which extracts all the text contained within a selected tag and its descendants.
In the example below, we extract all the text from within the <li> tags (including the text inside the <span> tags):
from bs4 import BeautifulSoup
html_doc = """
<html>
<body>
<h1>Hello, BeautifulSoup!</h1>
<ul>
<li><a href="http://example.com">Example.com <span>(Link)</span></a></li>
<li><a href="http://scrapy.org">Scrapy.com <span>(Link)</span></a></li>
</ul>
</body>
</html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
li_tags = soup.find_all('li')
for li in li_tags:
    # Print the combined text of the <li> and all of its descendants
    print(li.get_text())
The output would be as follows:
Example.com (Link)
Scrapy.com (Link)
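get_text() joins the text pieces exactly as they appear in the markup, so missing or extra whitespace carries through to the result. As a rough sketch, its separator and strip arguments can be used to tidy the output; the sample HTML below is hypothetical and deliberately has no space before the <span>:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: no space before <span>, trailing newline inside <li>
html_doc = "<li><a href='#'>Example.com<span>(Link)</span></a>\n</li>"
soup = BeautifulSoup(html_doc, "html.parser")
li = soup.find("li")

# Default behaviour: text pieces are concatenated as-is
plain = li.get_text()

# Join the pieces with a space and strip whitespace from each piece
spaced = li.get_text(" ", strip=True)

print(repr(plain))
print(repr(spaced))
```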
More Web Scraping Tutorials
So that's how to remove certain HTML elements, such as <span> and <script> tags, from an HTML document using BeautifulSoup.
If you would like to learn more about how to use BeautifulSoup then check out our other BeautifulSoup guides:
- BeautifulSoup Guide: Scraping HTML Pages With Python
- How To Install BeautifulSoup
- Fix BeautifulSoup Returns Empty List or Value
- How To Use BeautifulSoup's find_all() Method
Or if you would like to learn more about Web Scraping, then be sure to check out The Python Web Scraping Playbook.