FeedParser Guide: Parse RSS, Atom & RDF Feeds With Python
FeedParser is a very useful Python library for downloading and parsing syndicated feeds, including RSS, Atom & RDF feeds.
FeedParser can be used as a way to monitor syndicated feeds or as a way to find new articles to feed into your scrapers.
In this guide, we're going to walk through:
- Installing FeedParser
- Loading Syndicated Feeds
- Parsing RSS Feeds
- Parsing Atom Feeds
- Advanced Functionality
- HTTP Requests
You can find the official FeedParser documentation here.
Need help scraping the web?
Then check out ScrapeOps, the complete toolkit for web scraping.
Installing FeedParser
The first step to using FeedParser is to install it using pip:
pip install feedparser
Once installed, we can integrate FeedParser into our scripts and scrapers.
Loading Syndicated Feeds
There are three ways to load an RSS, Atom or RDF feed into FeedParser for parsing.
Method 1: Load From URL
The first method is to have feedparser retrieve the feed from the website by passing it a URL and having it parse the response.
import feedparser
d = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml')
d['feed']['title']
## --> u'Sample Feed'
Method 2: Load From Local File
The second method is to have feedparser load the feed from a local file by passing it the path to the XML file.
import feedparser
d = feedparser.parse(r'C:\incoming\atom10.xml')
d['feed']['title']
## --> u'Sample Feed'
Method 3: Load From String
The third method is to have feedparser load the feed from a string.
import feedparser
rawdata = """<rss version="2.0">
<channel>
<title>Sample Feed</title>
</channel>
</rss>"""
d = feedparser.parse(rawdata)
d['feed']['title']
## --> u'Sample Feed'
Parsing RSS Feeds
FeedParser makes it very easy to parse RSS Feeds which we will see below.
In the following examples we will parse data from the following example RSS feed.
RSS Channel Elements
With FeedParser you can easily parse the most commonly used elements in RSS feeds (regardless of version): title, link, description, publication date, and entry ID.
The channel elements are available via the `feed` attribute.
import feedparser
d = feedparser.parse('http://feedparser.org/docs/examples/rss20.xml')
d.feed.title
## --> u'Sample Feed'
d.feed.link
## --> u'http://example.org/'
d.feed.description
## --> u'For documentation <em>only</em>'
d.feed.published
## --> u'Sat, 07 Sep 2002 00:00:01 GMT'
d.feed.published_parsed
## --> (2002, 9, 7, 0, 0, 1, 5, 250, 0)
d.feed.image
## --> {'title': u'Example banner', 'href': u'http://example.org/banner.png', 'width': 80, 'height': 15, 'link': u'http://example.org/'}
d.feed.categories
## --> [(u'Syndic8', u'1024'), (u'dmoz', 'Top/Society/People/Personal_Homepages/P/')]
RSS Item Elements
To access the items in the RSS feed you can use the `entries` attribute, which returns an ordered list of the items as they appear in the original feed. So the first item is available in `d.entries[0]`.
import feedparser
d = feedparser.parse('http://feedparser.org/docs/examples/rss20.xml')
d.entries[0].title
## --> u'First item title'
d.entries[0].link
## --> u'http://example.org/item/1'
d.entries[0].description
## --> u'Watch out for <span>nasty tricks</span>'
d.entries[0].published
## --> u'Thu, 05 Sep 2002 00:00:01 GMT'
d.entries[0].published_parsed
## --> (2002, 9, 5, 0, 0, 1, 3, 248, 0)
d.entries[0].id
## --> u'http://example.org/guid/1'
Parsing Atom Feeds
Atom feeds generally contain more information than RSS feeds, however, FeedParser still makes it very easy to parse Atom Feeds.
In the following examples we will parse data from the following example Atom feed.
Atom Channel Elements
With FeedParser you can easily parse the most commonly used elements in Atom feeds (regardless of version): title, link, subtitle/description, various dates, and ID.
The channel elements are available via the `feed` attribute.
import feedparser
d = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml')
d.feed.title
## --> u'Sample Feed'
d.feed.link
## --> u'http://example.org/'
d.feed.subtitle
## --> u'For documentation <em>only</em>'
d.feed.updated
## --> u'2005-11-09T11:56:34Z'
d.feed.updated_parsed
## --> (2005, 11, 9, 11, 56, 34, 2, 313, 0)
d.feed.id
## --> u'tag:feedparser.org,2005-11-09:/docs/examples/atom10.xml'
Atom Item Elements
To access the items in the Atom feed you can use the `entries` attribute, which returns an ordered list of the items as they appear in the original feed. So the first item is available in `d.entries[0]`.
import feedparser
d = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml')
d.entries[0].title
## --> u'First entry title'
d.entries[0].link
## --> u'http://example.org/entry/1'
d.entries[0].id
## --> u'tag:feedparser.org,2005-11-09:/docs/examples/atom10.xml:3'
d.entries[0].published
## --> u'2005-11-09T00:23:47Z'
d.entries[0].published_parsed
## --> (2005, 11, 9, 0, 23, 47, 2, 313, 0)
d.entries[0].updated
## --> u'2005-11-09T11:56:34Z'
d.entries[0].updated_parsed
## --> (2005, 11, 9, 11, 56, 34, 2, 313, 0)
d.entries[0].summary
## --> u'Watch out for nasty tricks'
d.entries[0].content
## --> [{'type': u'application/xhtml+xml', 'base': u'http://example.org/entry/3', 'language': u'en-US', 'value': u'<div>Watch out for <span>nasty tricks</span></div>'}]
d.entries[0].contributors[0]
## --> {'name': u'Joe', 'href': u'http://example.org/joe/', 'email': u'joe@example.org'}
d.entries[0].links[0]
## --> {'rel': u'alternate', 'type': u'text/html', 'href': u'http://example.org/entry/3'}
Because Atom entries can have more than one content element, `d.entries[0].content` is a list of dictionaries. Each dictionary contains metadata about a single content element. The two most important values in the dictionary are the content type, in `d.entries[0].content[0].type`, and the actual content value, in `d.entries[0].content[0].value`.
The parsed summary and content are not the same as they appear in the original feed. The original elements contained dangerous HTML markup which was sanitized. See Sanitization for details.
Advanced Functionality
FeedParser also has a whole host of advanced functionality that you can use as well. You can check out the full list of advanced functionality here; however, we're going to explore two of the most useful features: Content Normalization and Feed Type/Version Detection.
Content Normalization
The Universal Feed Parser that FeedParser uses can parse many different types of feeds: Atom, CDF, and nine different versions of RSS.
However, to simplify its use, FeedParser allows you to use RSS feed terminology when parsing Atom feeds, for example (and vice versa), so that you are not forced to learn the differences between these formats.
The Universal Feed Parser does its best to normalize the parsed feeds regardless of type or version so that you use the same terminology when parsing any feed type.
The following is an example of parsing an Atom feed as an RSS feed:
import feedparser
d = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml')
d['channel']['title']
## --> u'Sample Feed'
d['channel']['link']
## --> u'http://example.org/'
d['channel']['description']
## --> u'For documentation <em>only</em>'
len(d['items'])
## --> 1
e = d['items'][0]
e['title']
## --> u'First entry title'
e['link']
## --> u'http://example.org/entry/3'
e['description']
## --> u'Watch out for nasty tricks'
e['author']
## --> u'Mark Pilgrim (mark@example.org)'
Feed Type/Version Detection
Another handy bit of functionality FeedParser provides is the ability to detect the type and version of a feed.
d = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml')
d.version
## --> 'atom10'
d = feedparser.parse('http://feedparser.org/docs/examples/atom03.xml')
d.version
## --> 'atom03'
d = feedparser.parse('http://feedparser.org/docs/examples/rss20.xml')
d.version
## --> 'rss20'
d = feedparser.parse('http://feedparser.org/docs/examples/rss20dc.xml')
d.version
## --> 'rss20'
d = feedparser.parse('http://feedparser.org/docs/examples/rss10.rdf')
d.version
## --> 'rss10'
If the feed type is completely unknown, `version` will be an empty string.
HTTP Requests
FeedParser has built-in functionality that allows it to retrieve the feed data from the target feed. However, by default this functionality is pretty limited and is susceptible to getting blocked by the target website.
That is why FeedParser allows you to control how HTTP requests are made at a lower level, including setting user-agents, headers and dealing with authentication.
Setting User-Agents With FeedParser
By default the user-agent FeedParser uses when requesting a feed's data is:
UniversalFeedParser/5.0.1 +http://feedparser.org/
When using FeedParser in production you should change this user-agent to either:
- A user-agent that clearly identifies your app and gives the feed owner a way to contact you.
- A user-agent that hides your identity and makes the request look like it is coming from another app, so they won't block you.
Which option you choose depends on your relationship with the feed owner and their policies with regard to external apps interacting with their feeds.
Changing the default user-agent is pretty simple. Just add a custom user-agent to the HTTP request using the `agent` parameter of the `parse` method.
import feedparser
user_agent = 'MyApp/1.0 +http://example.com/'
d = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml', agent=user_agent)
You can also permanently change the user-agent in your script by setting the `USER_AGENT` variable on the `feedparser` module.
import feedparser
feedparser.USER_AGENT = "MyApp/1.0 +http://example.com/"
d = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml')
Accessing Password-Protected Feeds With FeedParser
Another common issue developers run into is how to configure FeedParser to access restricted feeds.
FeedParser provides a number of ways to access restricted feeds which you can see here; however, the easiest is to simply add the `username` and `password` to the feed URL.
For example to access this restricted feed:
'http://feedparser.org/docs/examples/basic_auth.xml'
We add the username and password to the URL. In this example, the username is `test` and the password is `basic`.
'http://test:basic@feedparser.org/docs/examples/basic_auth.xml'
From that point FeedParser will handle the basic authentication for you.
Using Proxies With FeedParser
When accessing RSS/Atom feeds at scale, you will likely run into issues with your requests getting blocked, so you will need to use a proxy solution to bypass the site's anti-scraping measures.
You can pass in proxies using a `ProxyHandler`:
import feedparser
from urllib.request import ProxyHandler

# Hypothetical proxy address -- replace with your own proxy details
proxies = {'http': 'http://my-proxy.example.com:8080', 'https': 'http://my-proxy.example.com:8080'}
proxy_handler = ProxyHandler(proxies)
print(feedparser.parse("rss_url", handlers=[proxy_handler]))
Or you could use a smart proxy solution like the ScrapeOps Proxy Aggregator that will find the best performing and cheapest proxies for your target RSS feed.
import feedparser
from urllib.parse import urlencode
url = 'http://feedparser.org/docs/examples/atom10.xml'
payload = {'api_key': 'YOUR_API_KEY', 'url': url}
proxy_url = 'https://proxy.scrapeops.io/v1/?' + urlencode(payload)
d = feedparser.parse(proxy_url)
Here you simply send the URL you want to scrape to the ScrapeOps Proxy Aggregator endpoint in the `url` query parameter, along with your API key in the `api_key` query parameter, and ScrapeOps will deal with finding the best proxy for that domain and return the HTML response to you.
You can get your own free API key with 1,000 free requests by signing up here.
More Web Scraping Tutorials
So that's how you can use FeedParser to retrieve and parse RSS, Atom and RDF syndication feeds.
If you would like to learn more about Web Scraping, then be sure to check out The Web Scraping Playbook.
Or check out one of our more in-depth guides: