

FeedParser Guide: Parse RSS, Atom & RDF Feeds With Python

FeedParser is a very useful Python library for downloading and parsing syndicated feeds, including RSS, Atom & RDF feeds.

FeedParser can be used as a way to monitor syndicated feeds or as a way to find new articles to feed into your scrapers.
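For example, here is a minimal sketch of a simple feed monitor that polls a feed URL (the sample feed from the FeedParser docs is used purely for illustration) and prints the titles of entries it has not seen before:


import time
import feedparser

# Sample feed URL used purely for illustration
FEED_URL = 'http://feedparser.org/docs/examples/rss20.xml'

seen_ids = set()

while True:
    d = feedparser.parse(FEED_URL)
    for entry in d.entries:
        # Fall back to the entry link if the feed does not provide GUIDs
        entry_id = entry.get('id', entry.get('link'))
        if entry_id and entry_id not in seen_ids:
            seen_ids.add(entry_id)
            print('New article:', entry.get('title'))
    time.sleep(300)  # poll again every 5 minutes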

In this guide, we're going to walk through installing FeedParser, loading and parsing RSS and Atom feeds, using its more advanced functionality, and controlling how it makes HTTP requests.

You can find the official FeedParser documentation here.


Installing FeedParser

The first step to using FeedParser is to install it using pip:


pip install feedparser

Once installed, we can integrate FeedParser into our scripts and scrapers.
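As a quick sanity check, you can import the library and print the installed version (a minimal sketch; the exact version string depends on what pip installed):


import feedparser

# Confirm the install worked and check which version is running
print(feedparser.__version__)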


Loading Syndicated Feeds

There are 3 ways to load RSS, Atom & RDF feeds into FeedParser for parsing.

Method 1: Load From URL

The first method is to have feedparser retrieve the feed from the website by passing it a URL and having it parse the response.


import feedparser
d = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml')
d['feed']['title']
## --> u'Sample Feed'

Method 2: Load From Local File

The second method is to have feedparser load the feed from a local file by passing it the path to the XML file.


import feedparser
d = feedparser.parse(r'C:\incoming\atom10.xml')
d['feed']['title']
## --> u'Sample Feed'

Method 3: Load From String

The third method is to have feedparser load the feed from a string.


import feedparser

rawdata = """<rss version="2.0">
<channel>
<title>Sample Feed</title>
</channel>
</rss>"""

d = feedparser.parse(rawdata)
d['feed']['title']
## --> u'Sample Feed'


Parsing RSS Feeds

FeedParser makes it very easy to parse RSS Feeds which we will see below.

In the following examples we will parse data from this example RSS feed.

RSS Channel Elements

With FeedParser you can easily parse the most commonly used elements in RSS feeds (regardless of version): title, link, description, publication date, and entry ID.

The channel elements are available via the feed attribute.


import feedparser
d = feedparser.parse('http://feedparser.org/docs/examples/rss20.xml')

d.feed.title
## --> u'Sample Feed'

d.feed.link
## --> u'http://example.org/'

d.feed.description
## --> u'For documentation <em>only</em>'

d.feed.published
## --> u'Sat, 07 Sep 2002 00:00:01 GMT'

d.feed.published_parsed
## --> (2002, 9, 7, 0, 0, 1, 5, 250, 0)

d.feed.image
## --> {'title': u'Example banner', 'href': u'http://example.org/banner.png', 'width': 80, 'height': 15, 'link': u'http://example.org/'}

d.feed.categories
## --> [(u'Syndic8', u'1024'), (u'dmoz', 'Top/Society/People/Personal_Homepages/P/')]


RSS Item Elements

To access the items in the RSS feed you can use the entries attribute, which is an ordered list of the items as they appear in the original feed. So the first item is available in d.entries[0].


import feedparser
d = feedparser.parse('http://feedparser.org/docs/examples/rss20.xml')

d.entries[0].title
## --> u'First item title'

d.entries[0].link
## --> u'http://example.org/item/1'

d.entries[0].description
## --> u'Watch out for <span>nasty tricks</span>'

d.entries[0].published
## --> u'Thu, 05 Sep 2002 00:00:01 GMT'

d.entries[0].published_parsed
## --> (2002, 9, 5, 0, 0, 1, 3, 248, 0)

d.entries[0].id
## --> u'http://example.org/guid/1'


Parsing Atom Feeds

Atom feeds generally contain more information than RSS feeds, however, FeedParser still makes it very easy to parse Atom Feeds.

In the following examples we will parse data from this example Atom feed.

Atom Channel Elements

With FeedParser you can easily parse the most commonly used elements in Atom feeds (regardless of version): title, link, subtitle/description, various dates, and ID.

The channel elements are available via the feed attribute.


import feedparser
d = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml')

d.feed.title
## --> u'Sample Feed'

d.feed.link
## --> u'http://example.org/'

d.feed.subtitle
## --> u'For documentation <em>only</em>'

d.feed.updated
## --> u'2005-11-09T11:56:34Z'

d.feed.updated_parsed
## --> (2005, 11, 9, 11, 56, 34, 2, 313, 0)

d.feed.id
## --> u'tag:feedparser.org,2005-11-09:/docs/examples/atom10.xml'


Atom Item Elements

To access the items in the Atom feed you can use the entries attribute, which is an ordered list of the items as they appear in the original feed. So the first item is available in d.entries[0].


import feedparser
d = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml')

d.entries[0].title
## --> u'First entry title'

d.entries[0].link
## --> u'http://example.org/entry/3'

d.entries[0].id
## --> u'tag:feedparser.org,2005-11-09:/docs/examples/atom10.xml:3'

d.entries[0].published
## --> u'2005-11-09T00:23:47Z'

d.entries[0].published_parsed
## --> (2005, 11, 9, 0, 23, 47, 2, 313, 0)

d.entries[0].updated
## --> u'2005-11-09T11:56:34Z'

d.entries[0].updated_parsed
## --> (2005, 11, 9, 11, 56, 34, 2, 313, 0)

d.entries[0].summary
## --> u'Watch out for nasty tricks'

d.entries[0].content
## --> [{'type': u'application/xhtml+xml', 'base': u'http://example.org/entry/3', 'language': u'en-US', 'value': u'<div>Watch out for <span>nasty tricks</span></div>'}]

d.entries[0].contributors[0]
## --> {'name': u'Joe', 'href': u'http://example.org/joe/', 'email': u'joe@example.org'}

d.entries[0].links[0]
## --> {'rel': u'alternate', 'type': u'text/html', 'href': u'http://example.org/entry/3'}

Because Atom entries can have more than one content element, d.entries[0].content is a list of dictionaries. Each dictionary contains metadata about a single content element. The two most important values in the dictionary are the content type, in d.entries[0].content[0].type, and the actual content value, in d.entries[0].content[0].value.
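For example, here is a minimal sketch of walking over every content element of the first entry and pulling out its type and value:


import feedparser

d = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml')

# An entry's content is a list, so loop over every content element
for content in d.entries[0].content:
    print(content.type)   # e.g. 'application/xhtml+xml'
    print(content.value)  # the (sanitized) content itself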

Sanitized Content

The parsed summary and content are not the same as they appear in the original feed. The original elements contained dangerous HTML markup which was sanitized. See Sanitization for details.


Advanced Functionality

FeedParser also has a whole host of advanced functionality. You can check out the full list of advanced functionality here; however, we're going to explore two of the most useful features: Content Normalization and Feed Type/Version Detection.

Content Normalization

The Universal Feed Parser that FeedParser uses can parse many different types of feeds: Atom, CDF, and nine different versions of RSS.

However, to simplify its use, FeedParser allows you to use RSS terminology when parsing Atom feeds (and vice versa), so that you are not forced to learn the differences between these formats.

The Universal Feed Parser does its best to normalize the parsed feeds regardless of type or version so that you use the same terminology when parsing any feed type.

The following is an example of parsing an Atom feed as if it were an RSS feed:


import feedparser
d = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml')

d['channel']['title']
## --> u'Sample Feed'

d['channel']['link']
## --> u'http://example.org/'

d['channel']['description']
## --> u'For documentation <em>only</em>'

len(d['items'])
## --> 1

e = d['items'][0]

e['title']
## --> u'First entry title'

e['link']
## --> u'http://example.org/entry/3'

e['description']
## --> u'Watch out for nasty tricks'

e['author']
## --> u'Mark Pilgrim (mark@example.org)'


Feed Type/Version Detection

Another handy bit of functionality FeedParser provides is the ability to detect the type and version of a feed.


d = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml')
d.version
## --> 'atom10'

d = feedparser.parse('http://feedparser.org/docs/examples/atom03.xml')
d.version
## --> 'atom03'

d = feedparser.parse('http://feedparser.org/docs/examples/rss20.xml')
d.version
## --> 'rss20'

d = feedparser.parse('http://feedparser.org/docs/examples/rss20dc.xml')
d.version
## --> 'rss20'

d = feedparser.parse('http://feedparser.org/docs/examples/rss10.rdf')
d.version
## --> 'rss10'

If the feed type is completely unknown, version will be an empty string.
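Here is a minimal sketch of branching on the detected format using the version strings shown above:


import feedparser

d = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml')

# d.version is an empty string when the feed type could not be detected
if not d.version:
    print('Unknown feed type')
elif d.version.startswith('atom'):
    print('Atom feed, version string:', d.version)
elif d.version.startswith('rss'):
    print('RSS feed, version string:', d.version)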


HTTP Requests

FeedParser has built-in functionality that allows it to retrieve the feed data from the target website; however, by default this functionality is pretty limited and is susceptible to getting blocked.

That is why FeedParser allows you to control how HTTP requests are made at a lower level, including setting user-agents and headers and dealing with authentication.
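For example, here is a minimal sketch of passing extra HTTP headers along with the request, assuming your installed version of FeedParser supports the request_headers argument of parse (it is listed in the FeedParser 6 API):


import feedparser

# Extra headers to send with the request (assumes parse() accepts request_headers)
headers = {'Accept': 'application/rss+xml, application/atom+xml'}
d = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml', request_headers=headers)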

Setting User-Agents With FeedParser

By default the user-agent FeedParser uses when requesting a feed's data is:


UniversalFeedParser/5.0.1 +http://feedparser.org/

When using FeedParser in production you should change this user-agent to either:

  1. A user-agent that clearly identifies your app and gives a way for the feed owner to contact you.
  2. A user-agent that hides your identity and makes the request look like it is coming from another app so the website won't block you.

Which option you choose depends on your relationship with the feed owner and their policies regarding external apps interacting with their feeds.

Changing the default user-agent is pretty simple: just add a custom user-agent to the HTTP request using the agent parameter of the parse method.


import feedparser

user_agent = 'MyApp/1.0 +http://example.com/'
d = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml', agent=user_agent)

You can also permanently change the user-agent in your script by setting feedparser's USER_AGENT variable.


import feedparser

feedparser.USER_AGENT = "MyApp/1.0 +http://example.com/"
d = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml')


Accessing Password-Protected Feeds With FeedParser

Another common issue developers run into is how to configure FeedParser to access restricted feeds.

FeedParser provides a number of ways to access restricted feeds, which you can see here; however, the easiest is to simply add the username and password to the feed URL.

For example to access this restricted feed:


'http://feedparser.org/docs/examples/basic_auth.xml'

We add the username and password to the URL. In this example, the username is test and the password is basic.


'http://test:basic@feedparser.org/docs/examples/basic_auth.xml'

FeedParser will then handle the basic authentication for you.
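Putting it together, the parse call looks like this (using the example credentials above):


import feedparser

# The username 'test' and password 'basic' are embedded directly in the feed URL
d = feedparser.parse('http://test:basic@feedparser.org/docs/examples/basic_auth.xml')
print(d.feed.title)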


Using Proxies With FeedParser

When accessing RSS/Atom feeds at scale you will likely run into issues with your requests getting blocked, so you will need to use a proxy solution to bypass the website's anti-scraping measures.

You can pass in proxies using a ProxyHandler:


import feedparser
from urllib.request import ProxyHandler

# Replace with the address of your own HTTP/HTTPS proxy
proxies = {'http': 'http://PROXY_HOST:PROXY_PORT', 'https': 'http://PROXY_HOST:PROXY_PORT'}
proxy_handler = ProxyHandler(proxies)
print(feedparser.parse("rss_url", handlers=[proxy_handler]))

Or you could use a smart proxy solution like the ScrapeOps Proxy Aggregator that will find the best performing and cheapest proxies for your target RSS feed.


import feedparser
from urllib.parse import urlencode

url = 'http://feedparser.org/docs/examples/atom10.xml'
payload = {'api_key': 'YOUR_API_KEY', 'url': url}
proxy_url = 'https://proxy.scrapeops.io/v1/?' + urlencode(payload)

d = feedparser.parse(proxy_url)

Here you simply send the URL you want to scrape to the ScrapeOps Proxy API endpoint in the url query parameter, along with your API key in the api_key query parameter, and ScrapeOps will deal with finding the best proxy for that domain and return the HTML response to you.

You can get your own free API key with 1,000 free requests by signing up here.


More Web Scraping Tutorials

So that's how you can use FeedParser to retrieve and parse RSS, Atom and RDF syndication feeds.

If you would like to learn more about Web Scraping, then be sure to check out The Web Scraping Playbook.

Or check out one of our other in-depth guides.