
Newspaper3k Guide: Scrape Articles Using AI

Newspaper3k is a powerful Python library that allows you to scrape newspaper and article websites without having to design dedicated parsers for every website you want to scrape.

So in this guide we're going to walk through what Newspaper3k is, how to parse articles with it, its NLP and language features, its article URL detection, and how to use it with proxies.


What Is Newspaper3k?

Newspaper3k is an article downloading and parsing library for Python that enables you to scrape newspaper and article websites without having to write custom parsers for every website you want to scrape.

Newspaper3k uses intelligent parsers and NLP techniques to extract the most critical data from newspaper and article pages, including the article's:

  • Title
  • Author
  • Published date
  • Text
  • Featured Image
  • Embedded videos
  • Main keywords
  • Summary

It currently supports parsing in 38 languages, news article URL detection and multi-threaded article downloading.


Installing Newspaper3k

The first step to using Newspaper3k is to install it using pip:


pip install newspaper3k

Once installed, we can integrate Newspaper3k into our scrapers.
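
To quickly check the install worked, you can try importing the library (just a sanity check):


## Quick sanity check that Newspaper3k imported correctly
from newspaper import Article
print('Newspaper3k imported successfully')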


Using Newspaper3k To Parse Articles

Using Newspaper3k to parse articles is actually super easy and comprises two steps:

Step 1: Loading HTML Content Into Newspaper3k

First, we need to download the HTML of the article we want to parse by passing the target URL into an Article instance and then downloading it:


from newspaper import Article

url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url)
article.download()

Behind the scenes, Newspaper3k uses Python Requests to download the HTML page from the URL you want to scrape.
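
If you need more control over how those requests are made, you can pass a Config object into the Article. As a rough sketch, here is how you might set a custom user agent string and request timeout (the values shown are placeholders, not recommendations):


from newspaper import Article, Config

config = Config()
config.browser_user_agent = 'Mozilla/5.0 (compatible; MyScraper/1.0)'  ## placeholder user agent
config.request_timeout = 10  ## seconds to wait for the download

url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url, config=config)
article.download()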

Alternatively, you can feed raw HTML into the Article instance yourself using the input_html argument of the download() method:


import requests
from newspaper import Article

url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url)

## Download HTML yourself and insert into Newspaper3k
response = requests.get(url)
article.download(input_html=response.text)

This approach is better when you have either pre-scraped all the HTML content you want to parse, or you need lower-level control over how the requests are made to bypass anti-bot systems, etc.

Step 2: Parsing The Article Content

Now that we have downloaded the HTML content we want to parse, we next need to tell Newspaper3k to parse the HTML and extract the data we need.

To do this we just need to use the parse() method, and then have it output the data we want:


article.parse()

article.html
## --> '<!DOCTYPE HTML><html itemscope itemtype="http://...'

article.title
## --> 'New Year, new laws: Obamacare, pot, guns and drones'

article.authors
## --> ['Leigh Ann Caldwell', 'John Honway']

article.publish_date
## --> datetime.datetime(2013, 12, 30, 0, 0)

article.text
## --> 'Washington (CNN) -- Not everyone subscribes to a New Year's resolution...'

article.top_image
## --> 'http://someCDN.com/blah/blah/blah/file.png'

article.images
## --> ['url_to_img_1', 'url_to_img_2', 'url_to_img_3', ...]

article.movies
## --> ['http://youtube.com/path/to/link.com', ...]



The parse method extracts the following data from the HTML page:

  • .html returns the full HTML page.
  • .title extracts the article title.
  • .authors extracts the article's authors.
  • .publish_date extracts the article's published date.
  • .text extracts the article's text from the HTML.
  • .top_image extracts the featured image of the article (if one exists).
  • .images extracts all image URLs present in the article (if any exist).
  • .movies extracts any videos present in the article (if any exist).
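
If you want to store this data somewhere, one simple option is to collect the parsed attributes into a plain dictionary. This is just a sketch (the field names are our own choice, not part of the library):


## Collect the parsed fields into a dictionary for storage
article_data = {
    'title': article.title,
    'authors': article.authors,
    'published': article.publish_date.isoformat() if article.publish_date else None,
    'text': article.text,
    'top_image': article.top_image,
}
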
Correct Order

You must have called download() on an article before calling parse().
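
If you get the order wrong, Newspaper3k raises an ArticleException. When processing many URLs it can be worth catching this explicitly; here is a minimal sketch:


from newspaper import Article
from newspaper.article import ArticleException

url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url)

try:
    article.download()
    ## parse() raises ArticleException if the download failed or never happened
    article.parse()
except ArticleException as e:
    print(f'Failed to process {url}: {e}')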


Using Newspaper3k's Advanced NLP Methods

Newspaper3k also has built-in Natural Language Processing (NLP) functionality that allows you to easily process an article and extract its main keywords and a summary.

To use the NLP functionality we just need to use the .nlp() method on our Article instance:


article.nlp()

article.keywords
# --> ['New Years', 'resolution', ...]

article.summary
# --> 'The study shows that 93% of people ...'

Now when we call article.keywords, Newspaper3k returns the main keywords in the article, and when we call article.summary, it returns a summary of the article.

You must have called both download() and parse() on the article before calling the nlp() method.

Notes

Using the .nlp() method is computationally expensive, so you should only use this method when scraping at scale if you really need the keyword and summary data.

As of the current build, the .nlp() feature only works on Western languages.
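
Note that .nlp() relies on NLTK under the hood, so if you hit a LookupError about missing NLTK data the first time you call it, downloading the punkt tokenizer usually resolves it:


import nltk
nltk.download('punkt')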


Specifying Article Language

Currently Newspaper3k supports parsing in 38 languages.

By default, if you don't specify a language when creating a new Article instance, Newspaper3k will try to auto-detect the language of the article and apply it during parsing.

However, you can also set the language manually:


from newspaper import Article

url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url, language='en')
article.download()

Here is the complete list of supported languages:

>>> import newspaper
>>> newspaper.languages()

Your available languages are:
input code full name

ar Arabic
be Belarusian
bg Bulgarian
da Danish
de German
el Greek
en English
es Spanish
et Estonian
fa Persian
fi Finnish
fr French
he Hebrew
hi Hindi
hr Croatian
hu Hungarian
id Indonesian
it Italian
ja Japanese
ko Korean
lt Lithuanian
mk Macedonian
nb Norwegian (Bokmål)
nl Dutch
no Norwegian
pl Polish
pt Portuguese
ro Romanian
ru Russian
sl Slovenian
sr Serbian
sv Swedish
sw Swahili
th Thai
tr Turkish
uk Ukrainian
vi Vietnamese
zh Chinese


News Article URL Detection

Newspaper3k also has some other advanced functionality like its ability to find article URLs on a target website.

Here, using the build() method, Newspaper3k will extract all the article URLs it finds on a given site and expose them via the articles attribute:


import newspaper
cnn_paper = newspaper.build('http://cnn.com')
for article in cnn_paper.articles:
    print(article.url)


# --> 'http://www.cnn.com/2013/11/27/justice/tucson-arizona-captive-girls/'
# --> 'http://www.cnn.com/2013/12/11/us/texas-teen-dwi-wreck/index.html'
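
Note that the Article objects returned by build() haven't been downloaded yet, so you still need to call download() and parse() on each one before reading its data. For example, to process just the first few articles found:


import newspaper

cnn_paper = newspaper.build('http://cnn.com')

## Download and parse the first 5 articles found on the site
for article in cnn_paper.articles[:5]:
    article.download()
    article.parse()
    print(article.title)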

It can also find the RSS feeds on a target website for you.


import newspaper
cnn_paper = newspaper.build('http://cnn.com')
for feed_url in cnn_paper.feed_urls():
    print(feed_url)

# --> u'http://rss.cnn.com/rss/cnn_crime.rss'
# --> u'http://rss.cnn.com/rss/cnn_tech.rss'

This is especially useful when you want to build a scraper that monitors the news content on a particular website.
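
If you are monitoring several sites at once, Newspaper3k also ships a news_pool helper that downloads articles from multiple sources concurrently using a thread pool. A minimal sketch (the thread count here is arbitrary):


import newspaper
from newspaper import news_pool

papers = [newspaper.build(url) for url in ['http://cnn.com', 'http://fox13now.com']]

## Download the articles from each source concurrently
news_pool.set(papers, threads_per_source=2)
news_pool.join()

## The HTML is now downloaded, but each article still needs parsing
for paper in papers:
    for article in paper.articles:
        article.parse()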


Using Proxies With Newspaper3k

One of the drawbacks of Newspaper3k is that it doesn't work well when you want to scrape at scale from websites that use sophisticated anti-bot technologies to prevent scraping.

In cases like these, you will have to optimize your headers and use proxies to retrieve the raw HTML from the website. However, Newspaper3k's download functionality doesn't have built-in support for this.

As a result, the better option is to retrieve the HTML using an HTTP client like Python Requests and then parse that HTML with the Newspaper3k library.

For more details on how to use proxies and headers with Python Requests, check out this guide.

However, as an integration example we will use the ScrapeOps Proxy Manager as a proxy solution and pass the HTML content into Newspaper3k for parsing.


import requests
from urllib.parse import urlencode
from newspaper import Article

url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url)

## Download HTML using ScrapeOps Proxy Aggregator
payload = {'api_key': 'YOUR_API_KEY', 'url': url}
response = requests.get('https://proxy.scrapeops.io/v1/', params=urlencode(payload))

## Insert HTML into Newspaper3k
article.download(input_html=response.text)

Here you simply send the URL you want to scrape to the ScrapeOps Proxy API endpoint in the url query parameter, along with your API key in the api_key query parameter. ScrapeOps will then find the best proxy for that domain and return the HTML response to you.
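
From here, parsing works exactly the same as before, since Newspaper3k doesn't care how the HTML was retrieved:


article.parse()
print(article.title)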

You can get your own free API key with 1,000 free requests by signing up here.


More Web Scraping Tutorials

So that's how you can use Newspaper3k to scrape newspaper and article websites without having to design dedicated parsers for every website you want to scrape.

If you would like to learn more about Web Scraping, then be sure to check out The Web Scraping Playbook.

Or check out one of our other in-depth guides.