Parsel Guide: Scraping HTML Pages With Python
In this guide for The Python Web Scraping Playbook, we will look at how to use Python's popular Parsel library to build our first web scraper.
We will walk you through the most powerful features and functionality of Parsel so you can extract data from any web page.
- What Is Parsel?
- Installing Parsel
- Getting HTML Data From Website
- Getting HTML Data From File
- Using Parsel CSS Selectors
- Using Parsel XPath Selectors
First, let's get a quick overview of what Parsel is.
What Is Parsel?
Parsel is a Python library for extracting data from HTML and XML files using XPath and CSS Selectors.
You simply load an HTML response (or file) into a Parsel Selector instance, and you can extract any value you want from the HTML page:
from parsel import Selector
html_doc = """
<html>
<body>
<h1>Hello, Parsel!</h1>
<ul>
<li><a href="http://example.com">Link 1</a></li>
<li><a href="http://scrapy.org">Link 2</a></li>
</ul>
</body>
</html>
"""
selector = Selector(text=html_doc)
## CSS Selector
selector.css('h1::text').get()
## --> 'Hello, Parsel!'
selector.css('a::text').getall()
## --> ['Link 1', 'Link 2']
## XPath Selector
selector.xpath('//h1/text()').get()
## --> 'Hello, Parsel!'
selector.xpath('.//a/text()').getall()
## --> ['Link 1', 'Link 2']
Parsel parses a complex HTML document and makes its data queryable via XPath and CSS Selectors, which can also be optionally combined with regular expressions.
You can use the:
- `.get()` method to get the first element that matches your query.
- `.getall()` method to get all elements that match your query.
Check out Parsel's online demo here.
Installing Parsel
Setting up and installing Parsel in your Python project is very simple.
You just need to install the latest version of Parsel:
pip install parsel
Then import Parsel into your Python script and initialize a Parsel Selector instance by passing it some HTML data:
from parsel import Selector
html_doc = """
<html>
<body>
<h1>Hello, Parsel!</h1>
<ul>
<li><a href="http://example.com">Link 1</a></li>
<li><a href="http://scrapy.org">Link 2</a></li>
</ul>
</body>
</html>
"""
selector = Selector(text=html_doc)
From here Parsel will parse the HTML response and allow you to query the HTML for the data you need.
Getting HTML Data From Website
Parsel is an HTML and XML parsing library that allows you to extract data from an HTML file.
However, it doesn't provide any functionality to actually get HTML data from a website.
To get the HTML data you need, you have to use a Python HTTP client library like Python Requests or Python HTTPX which allow you to send HTTP requests to websites to get the HTML response.
import requests
from parsel import Selector
response = requests.get('https://quotes.toscrape.com/')
selector = Selector(text=response.text)
Above we use Python Requests to get the HTML response from QuotesToScrape.com, then we pass the HTML response into Parsel.
From here, we can use Parsel to extract the data we need from the HTML response.
Python has numerous HTTP client libraries to choose from, with Python Requests and Python HTTPX being the most popular.
Getting HTML Data From File
If you already have the HTML page stored as a file on your local machine or storage bucket then you can also load that into Parsel.
from parsel import Selector
with open("index.html") as fp:
    selector = Selector(text=fp.read())
Just open the file containing the HTML and load it into your Parsel Selector instance.
Using Parsel CSS Selectors
Parsel enables you to use CSS selectors using the `.css()` method, which will run a CSS selector against the parsed document and return all the matching elements.
Parsel supports all the commonly used CSS selectors:
.classes
#ids
[attributes=value]
parent child
parent > child
sibling ~ sibling
sibling + sibling
:not(element.class, element2.class)
:is(element.class, element2.class)
parent:has(> child)
For example, here is how to use Parsel CSS selectors to scrape data from QuotesToScrape.com:
import requests
from parsel import Selector
response = requests.get('https://quotes.toscrape.com/')
selector = Selector(text=response.text)
## Headline
print(selector.css('h1 a::text').get())
## --> 'Quotes to Scrape'
## All Quotes (Quote Elements)
quotes = selector.css("div.quote")
print(quotes.getall())
"""
Output:
[
'<div class="quote" ...>...</div>',
'<div class="quote" ...>...</div>',
...
]
"""
## Individual Quotes
for quote in quotes:
    print({
        'text': quote.css("span.text::text").get(),
        'author': quote.css("small.author::text").get(),
        'tags': quote.css("div.tags > a.tag::text").get()
    })
"""
Output:
[
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': 'change'}
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': 'abilities'}
{'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'author': 'Albert Einstein', 'tags': 'inspirational'}
...
]
"""
Here we use CSS selectors to get the:
- Text in the `<h1>` tag.
- All quote blocks (which are `<div class="quote">...</div>`) using the `.getall()` method.
- The data from each quote block using the `.get()` method.
Using Parsel XPath Selectors
Parsel enables you to use XPath selectors using the `.xpath()` method, which will run an XPath selector against the parsed document and return all the matching elements.
Parsel supports all the commonly used XPath expressions, as it is built on top of lxml, which implements the full XPath 1.0 specification.
For example, here is how to use Parsel XPath selectors to scrape data from QuotesToScrape.com:
import requests
from parsel import Selector
response = requests.get('https://quotes.toscrape.com/')
selector = Selector(text=response.text)
## Headline
print(selector.xpath('//h1/a/text()').get())
## --> 'Quotes to Scrape'
## All Quotes (Quote Elements)
quotes = selector.xpath('//div[@class="quote"]')
print(quotes.getall())
"""
Output:
[
'<div class="quote" ...>...</div>',
'<div class="quote" ...>...</div>',
...
]
"""
## Individual Quotes
for quote in quotes:
    print({
        'text': quote.xpath('./span[@class="text"]/text()').get(),
        'author': quote.xpath('.//small[@class="author"]/text()').get(),
        'tags': quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').get()
    })
"""
Output:
[
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': 'change'}
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': 'abilities'}
{'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'author': 'Albert Einstein', 'tags': 'inspirational'}
...
]
"""
Here we use XPath selectors to get the:
- Text in the `<h1>` tag.
- All quote blocks (which are `<div class="quote">...</div>`) using the `.getall()` method.
- The data from each quote block using the `.get()` method.
More Web Scraping Tutorials
So that's an introduction to Python Parsel.
If you would like to learn more about Web Scraping, then be sure to check out The Python Web Scraping Playbook.
Or check out one of our more in-depth guides: