Newspaper3k Guide: Scrape Articles Using AI
Newspaper3k is a powerful Python library that allows you to scrape newspaper and article websites without having to write dedicated parsers for every website you want to scrape.
So in this guide we're going to walk through:
- What Is Newspaper3k?
- Installing Newspaper3k
- Using Newspaper3k To Parse Articles
- Using Newspaper3k's Advanced NLP Methods
- Specifying Article Language
- News Article URL Detection
- Using Proxies With Newspaper3k
Need help scraping the web?
Then check out ScrapeOps, the complete toolkit for web scraping.
What Is Newspaper3k?
Newspaper3k is an article downloading and parsing library for Python that enables you to scrape newspaper and article websites without having to write custom parsers for every website you want to scrape.
Newspaper3k uses intelligent parsers and NLP techniques to extract the most critical data from newspaper and article pages, including the article's:
- Title
- Author
- Published date
- Text
- Featured Image
- Embedded videos
- Main keywords
- Summary
It currently supports parsing in 38 languages, news article URL detection and multi-threaded article downloading.
Installing Newspaper3k
The first step to using Newspaper3k is to install it using pip:
pip install newspaper3k
Once installed, we can integrate Newspaper3k into our scrapers.
Using Newspaper3k To Parse Articles
Using Newspaper3k to parse articles is actually super easy and comprises two steps:
Step 1: Loading HTML Content Into Newspaper3k
First, we need to download the article HTML we want to parse by passing the URL we want to scrape into an Article instance and then downloading it:
from newspaper import Article
url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url)
article.download()
Behind the scenes, Newspaper3k uses Python Requests to download the HTML page from the URL you want to scrape.
Alternatively, you can feed raw HTML into the Article instance yourself using the input_html argument of the download() method:
import requests
from newspaper import Article
url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url)
## Download HTML yourself and insert into Newspaper3k
response = requests.get(url)
article.download(input_html=response.text)
This approach is better when you have either pre-scraped all the HTML content you want to parse, or you need lower-level control over how the requests are made to bypass anti-bot systems, etc.
Step 2: Parsing The Article Content
Now that we have downloaded the HTML content we want to parse, we need to tell Newspaper3k to parse the HTML and extract the data we need.
To do this we just need to call the parse() method, and then access the data we want:
article.parse()
article.html
## --> '<!DOCTYPE HTML><html itemscope itemtype="http://...'
article.title
## --> 'New Year, new laws: Obamacare, pot, guns and drones'
article.authors
## --> ['Leigh Ann Caldwell', 'John Honway']
article.publish_date
## --> datetime.datetime(2013, 12, 30, 0, 0)
article.text
## --> 'Washington (CNN) -- Not everyone subscribes to a New Year's resolution...'
article.top_image
## --> 'http://someCDN.com/blah/blah/blah/file.png'
article.images
## --> ['url_to_img_1', 'url_to_img_2', 'url_to_img_3', ...]
article.movies
## --> ['http://youtube.com/path/to/link.com', ...]
The parse() method extracts the following data from the HTML page:
- .html returns the full HTML page.
- .title extracts the article title.
- .authors extracts the article's authors.
- .publish_date extracts the article's published date.
- .text extracts the article's text from the HTML.
- .top_image extracts the featured image of the article (if one exists).
- .images extracts all image URLs present in the article (if any exist).
- .movies extracts any videos present in the article (if any exist).
You must have called download() on an article before calling parse().
Using Newspaper3k's Advanced NLP Methods
Newspaper3k also has in-built Natural Language Processing (NLP) functionality that allows you to easily process the article and extract the main keywords and an article summary.
To use the NLP functionality we just need to call the .nlp() method on our Article instance:
article.nlp()
article.keywords
# --> ['New Years', 'resolution', ...]
article.summary
# --> 'The study shows that 93% of people ...'
Now when we access article.keywords, Newspaper3k returns the main keywords in the article, and when we access article.summary, Newspaper3k summarizes the article for us.
You must have called both download() and parse() on the article before calling the nlp() method.
Using the .nlp() method is computationally expensive, so you should only use it when scraping at scale if you really need the keyword and summary data.
As of the current build, the .nlp() feature only works on western languages.
Specifying Article Language
Currently Newspaper3k supports parsing in 38 languages.
By default, if you don't specify a language when creating a new Article instance, Newspaper3k will try to auto-detect the language of the article and apply it during parsing.
However, you can also set the language manually:
url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url, language='en')
article.download()
Here is the complete list of supported languages:
>>> import newspaper
>>> newspaper.languages()
Your available languages are:
input code full name
ar Arabic
be Belarusian
bg Bulgarian
da Danish
de German
el Greek
en English
es Spanish
et Estonian
fa Persian
fi Finnish
fr French
he Hebrew
hi Hindi
hr Croatian
hu Hungarian
id Indonesian
it Italian
ja Japanese
ko Korean
lt Lithuanian
mk Macedonian
nb Norwegian (Bokmål)
nl Dutch
no Norwegian
pl Polish
pt Portuguese
ro Romanian
ru Russian
sl Slovenian
sr Serbian
sv Swedish
sw Swahili
th Thai
tr Turkish
uk Ukrainian
vi Vietnamese
zh Chinese
News Article URL Detection
Newspaper3k also has some other advanced functionality, like its ability to find article URLs on a target website.
Here, using the build() method, Newspaper3k will extract all the article URLs it finds on a given website, which you can then access via the articles attribute:
import newspaper
cnn_paper = newspaper.build('http://cnn.com')
for article in cnn_paper.articles:
    print(article.url)
# --> 'http://www.cnn.com/2013/11/27/justice/tucson-arizona-captive-girls/'
# --> 'http://www.cnn.com/2013/12/11/us/texas-teen-dwi-wreck/index.html'
It can also find the RSS feeds on a target website for you.
import newspaper
cnn_paper = newspaper.build('http://cnn.com')
for feed_url in cnn_paper.feed_urls():
    print(feed_url)
# --> 'http://rss.cnn.com/rss/cnn_crime.rss'
# --> 'http://rss.cnn.com/rss/cnn_tech.rss'
This is especially useful when you want to build a scraper that monitors the news content on a particular website.
Using Proxies With Newspaper3k
One of the drawbacks of Newspaper3k is that its download functionality falls short when you want to scrape websites at scale and those websites use sophisticated anti-bot technologies to prevent scraping.
In cases like these, you will have to optimize your headers and use proxies to retrieve the raw HTML from the website. However, Newspaper3k's download functionality doesn't have in-built support for this.
As a result, the better option is to retrieve the HTML using an HTTP client like Python Requests and then parse the HTML using the Newspaper3k library.
For more details on how to use proxies and headers with Python Requests then check out this guide.
However, as an integration example, we will use the ScrapeOps Proxy Aggregator as a proxy solution and pass the HTML content into Newspaper3k for parsing.
import requests
from urllib.parse import urlencode
from newspaper import Article
url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url)
## Download HTML using ScrapeOps Proxy Aggregator
payload = {'api_key': 'YOUR_API_KEY', 'url': url}
response = requests.get('https://proxy.scrapeops.io/v1/', params=urlencode(payload))
## Insert HTML into Newspaper3k
article.download(input_html=response.text)
Here you simply send the URL you want to scrape to the ScrapeOps Proxy Aggregator endpoint in the url query parameter, along with your API key in the api_key query parameter, and ScrapeOps will deal with finding the best proxy for that domain and return the HTML response to you.
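To make the query-parameter mechanics concrete, here is how that request URL gets composed, using the same example article URL (YOUR_API_KEY is a placeholder). The target URL is percent-encoded so it can travel safely inside the query string:

```python
from urllib.parse import urlencode

payload = {
    'api_key': 'YOUR_API_KEY',
    'url': 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/',
}

# urlencode percent-encodes the target URL into the query string
query_string = urlencode(payload)
full_url = 'https://proxy.scrapeops.io/v1/?' + query_string
print(full_url)
```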
You can get your own free API key with 1,000 free requests by signing up here.
More Web Scraping Tutorials
So that's how you can use Newspaper3k to scrape newspaper and article websites without having to write dedicated parsers for every website you want to scrape.
If you would like to learn more about Web Scraping, then be sure to check out The Web Scraping Playbook.
Or check out one of our more in-depth guides: