CheerioJS Guide: Scraping HTML Pages With NodeJs

Web scraping involves fetching web pages from the internet and extracting useful data from them. This data can be stored in various formats, such as Excel, CSV, or JSON, for future use. In the era of machine learning, many companies and research centers use web scraping tools to gather large amounts of data for their machine learning algorithms.

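Cheerio provides a way to perform web scraping and extract data from HTML, primarily in a Node.js environment. Developed as a server-side library, Cheerio allows you to load HTML and then interact with it using familiar jQuery syntax. In this guide, we'll walk through what Cheerio is, how to install it, how to load, query, traverse, and manipulate HTML with it, and how to build a first scraper, followed by Cheerio's limitations and alternatives.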


What is Cheerio

Cheerio is a web scraping tool specifically designed for the Node.js environment. It is used to parse HTML and XML, providing a user-friendly API for locating and modifying DOM elements using CSS-style selectors.

The Document Object Model (DOM) represents all the HTML elements within a web page in the browser.

Cheerio is commonly employed in Node.js applications for tasks like web scraping, data mining, and server-side rendering. It is particularly useful when there is a requirement to extract data from HTML content and perform various operations programmatically.

Its lightweight nature and similarity to jQuery make it a popular choice for developers working with Node.js for tasks related to HTML manipulation and data extraction.


Why Use Cheerio

Cheerio is commonly used for several reasons, particularly in Node.js environments.

  • jQuery-Like Syntax: Cheerio uses a jQuery-like syntax, which is familiar to many developers, making it easy to work with.
  • Server Side: It's a server-side tool that doesn't require a web browser, making it suitable for backend tasks.
  • Fast and Lightweight: Because it doesn't rely on a web browser, Cheerio doesn't have to worry about DOM inconsistencies and browser quirks, making it fast and lightweight.
  • Modularity: It has a modular design that allows developers to extend its functionality with custom JavaScript.
  • Open Source: Cheerio is open source which means you can freely fork and tailor it for your specific project needs.

Overall, developers often choose Cheerio for its simplicity, speed, and flexibility, making it a powerful tool for web scraping, data extraction, and server-side HTML manipulation in Node.js applications.


How to Install Cheerio

To install Cheerio, first download the NodeJS installer that is compatible with your operating system from its official website. NodeJS comes bundled with a package manager called npm, which we are going to use for installing and managing third-party packages like Axios and Cheerio itself.

After installation, verify that node and npm are running correctly with these commands:

node --version
npm --version

Create a new folder and open it with an IDE like VSCode or IntelliJ.

mkdir web-scraping
cd web-scraping

Now that we're inside our web-scraping folder, let's create a package.json file. This file will help us manage our packages and their versions.

npm init -y

The -y flag will generate a package.json file with default values.

Open the package.json file and add "type": "module". This allows us to use import/export statements in our JavaScript code. After that, your package.json file will look like this:

{
  "name": "web-scraping",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "type": "module",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "",
  "license": "ISC"
}

We are going to use Axios to fetch HTML from websites. Let's install axios alongside cheerio.

npm install cheerio
npm install axios

Create an index.js file which will be the starting point for writing our code.


Loading HTML with Cheerio

Cheerio has a load() method that takes an HTML string and parses it into a complex tree of JavaScript objects. The return value of load() is actually a Function with several methods.

This function $() is used to select elements with CSS selectors and its methods are subsequently used to retrieve or change data within the selected elements. Here is a simple example:

import * as cheerio from 'cheerio';
const html = '<p>Hello World</p>';
const $ = cheerio.load(html);

In the code above, we wrote HTML directly for learning purpose.

But in real scenarios, you mostly fetch HTML from either:

  1. websites or
  2. local files

1. Getting HTML Data From Website

We're going to use Axios to fetch HTML content from websites. Axios has a get() method, which accepts a URL and returns a promise that resolves to a response object. The HTML content is stored under the data key of that response object.

Let's try to fetch HTML from QuotesToScrape

import * as cheerio from 'cheerio';
import axios from 'axios';

const url = 'https://quotes.toscrape.com';
const response = await axios.get(url);
const html = response.data;

const $ = cheerio.load(html);
console.log(html);

// <!DOCTYPE html>
// <html lang="en">
// <head>
// <meta charset="UTF-8">
// <title>Quotes to Scrape</title>
// ... More HTML

get(url) is an asynchronous function, and it might take some time to fetch the HTML. That's why we put the await keyword before it, which pauses execution of our script until the data has been fetched.

Handling Errors in Cheerio

Sometimes the data is not fetched properly due to network or server-side errors. We can wrap our code in a try/catch block to print an error message instead of crashing our program. Here is an example:

import * as cheerio from 'cheerio';
import axios from 'axios';
const url = 'https://quotes.toscrape.com';

try {
  const response = await axios.get(url);
  const html = response.data;
  const $ = cheerio.load(html);
} catch (error) {
  console.error('An error occurred while fetching the page', error.message);
}
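
Axios rejects the promise both when the server answers with an error status and when no answer arrives at all. The two cases can be told apart by checking whether the error carries a response object. The sketch below assumes that shape (error.response.status for HTTP errors); describeFetchError is a hypothetical helper name, and the two error objects at the bottom are hand-made stand-ins rather than real Axios errors.

```javascript
// Hypothetical helper: turns an error thrown by axios.get() into a readable message.
// Assumes Axios attaches a `response` object when the server replied with an error status.
function describeFetchError(error) {
  if (error.response) {
    // The server answered, but with a non-2xx status code (404, 500, ...).
    return `Server responded with status ${error.response.status}`;
  }
  // No response at all: DNS failure, timeout, connection refused, ...
  return `Request failed: ${error.message}`;
}

// Hand-made error objects standing in for real Axios errors.
const httpError = { message: 'Request failed with status code 404', response: { status: 404 } };
const networkError = { message: 'getaddrinfo ENOTFOUND quotes.toscrape.com' };

console.log(describeFetchError(httpError));
console.log(describeFetchError(networkError));
```

Inside the catch block, console.error(describeFetchError(error)) could then replace the generic message.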

2. Getting HTML Data From File

If you have some HTML files on your local disk, you can use the standard NodeJS module called fs to read those files. To begin, create a new file named content.html and place the following HTML content inside it:

<!DOCTYPE html>
<html>
<head>
<title>Cheerio Guide</title>
</head>
<body>
<h1>Introduction</h1>
<p>Cheerio is used to parse static HTML and XML</p>
</body>
</html>

The fs/promises module has a readFile() method that takes the file path and the encoding and returns a promise that resolves to the file's contents. In our case the file sits in the same directory as index.js, so providing just the file name works as long as the script is run from that directory. The most commonly used encoding for HTML files is utf8.

import * as cheerio from 'cheerio';
import fs from 'fs/promises';

try {
  const html = await fs.readFile('content.html', 'utf8');
  const $ = cheerio.load(html);
  console.log(html);
} catch (error) {
  console.error('An error occurred while reading the file', error.message);
}

// <!DOCTYPE html>
// <html>
// <head>
// <title>Cheerio Guide</title>
// .. More HTML

Querying The DOM Tree

Cheerio relies on CSS selectors to find DOM elements. The function $() returned by load() method accepts a CSS selector and returns an object containing all the selected elements.

We are going to select the title element on the first page of QuotesToScrape.

import * as cheerio from 'cheerio';
import axios from 'axios';

const response = await axios.get('https://quotes.toscrape.com');
const html = response.data;

const $ = cheerio.load(html);
const $title = $('title');
console.log($title.text());

// Quotes to Scrape

text() returns the inner text of an element. Using the $ sign is merely a convention in the Cheerio community; you can use any name you prefer.

Looping Through Elements

You can iterate through arrays using the forEach() method, but a Cheerio selection is an array-like object rather than a true array. Cheerio provides its own method, each(callback), for this specific purpose.

A Cheerio object stores its DOM elements under numbered keys that act like indexes, and each() calls the callback with the index and the element at that position. Now, let's see that in action by selecting all the links (a elements) on the first page of our website.

import * as cheerio from 'cheerio';
import axios from 'axios';

const response = await axios.get('https://quotes.toscrape.com');
const html = response.data;

const $ = cheerio.load(html);
const $links = $('a');

$links.each((index, link) => {
console.log($(link).text());
});

// Quotes to Scrape
// Login
// (about)
// change
// ... More Links

Finding Elements with CSS Selectors

If you have already worked with CSS for styling your websites, you can skip this portion. We have already seen how to select elements by their tag name in the previous example. Now let's see other ways we can select DOM elements.

1. Selecting Elements with Class and Id Names

Elements can be selected with their class and id names. Class names are prefixed with "." and id names with a "#" symbol. Let's see an example for both:

import * as cheerio from 'cheerio';
const html = `
<ul>
<li class="red">Red</li>
<li id="blue">Blue</li>
</ul>
`;
const $ = cheerio.load(html);

const $red = $('.red');
const $blue = $('#blue');

console.log($red.text());
console.log($blue.text());

// Red
// Blue

2. Selecting Elements with Attribute Names

You can select elements with their attribute names. The attribute names must be wrapped inside square brackets [] to distinguish them from Tag names. Here is an example of a link with an href attribute.

import * as cheerio from 'cheerio';
const html = `
<span>
<a href="https://quotes.toscrape.com">QuotesToScrape</a>
</span>
`;
const $ = cheerio.load(html);

const $link = $('[href]');
console.log($link.text());

// QuotesToScrape

We can also select an element whose attribute has a specific value. Let's select an image with a specific width:

import * as cheerio from 'cheerio';
const html = `
<div>
<img src="cat.png" width=100 />
<img src="dog.png" width=200 />
</div>
`;
const $ = cheerio.load(html);

const $image = $('[width=100]');
console.log($image.attr('src'));

// cat.png

In the code above, we used the attr() method to get the value of the src attribute. We will learn about it later.

3. Selecting Elements by Combinator Selector

Combinator selectors are used to select elements based on their relationship with other elements. For example, an element can be selected as a child nested inside a parent element, among other sibling elements.

  • Selecting elements that are descendants of some parent element with A B, where "A" is the parent element and "B" is the descendant element.
import * as cheerio from 'cheerio';
const html = `
<ul>
<li></li>
<li>
<ol>
<li></li>
</ol>
</li>
<li></li>
</ul>
`;
const $ = cheerio.load(html);
const $descendents = $('ul li');

console.log($descendents.length);

// 4
  • Selecting elements that are direct children of some parent element with A > B, where "A" is the parent element and "B" is the child element.
import * as cheerio from 'cheerio';
const html = `
<ul>
<li></li>
<li>
<ol>
<li></li>
</ol>
</li>
<li></li>
</ul>
`;
const $ = cheerio.load(html);
const $children = $('ul > li');

console.log($children.length);

// 3

Other Selectors

There are several other types of selectors. Some of them are mentioned here:

  • :first selects the first element from a list of elements
  • :last selects the last element from a list of elements
  • :nth(index) selects the nth element from a list of elements

A detailed list of the other CSS and Cheerio-specific selectors is available in the Cheerio documentation.


Traversing DOM tree

Once you have selected an element, you can navigate to its parent, child or sibling elements. You might think that we can use CSS selectors for this goal, like in previous examples. But traversing offers a more flexible and dynamic approach.

1. Finding Descendents and Children

Cheerio provides the find() method to find all the descendants and the children() method to find only the direct children of a selected element.

find():

The find() function searches all the descendants of the currently selected elements and returns those that match the provided selector.

import * as cheerio from 'cheerio';
const html = `
<div class="parent">
<p>Red</p>
<div>
<p>Blue</p>
<p>Green</p>
</div>
</div>
`;
const $ = cheerio.load(html);
const $descendents = $('.parent').find('p');

$descendents.each((index, element) => {
console.log($(element).text());
});

// Red
// Blue
// Green

The code above extracts and logs the text content of every descendant p element inside the element with the class parent.

children()

The children() function from the Cheerio library is used to select all direct children that match the specified selector.

import * as cheerio from 'cheerio';
const html = `
<div class="parent">
<p>Red</p>
<div>
<p>Blue</p>
<p>Green</p>
</div>
</div>
`;
const $ = cheerio.load(html);
const $children = $('.parent').children('p');

$children.each((index, element) => {
console.log($(element).text());
});

// Red

2. Finding Parents

To find the direct parent or all the ancestors of an element, you can use the parent() and parents() methods provided by Cheerio.

parent():

The parent() function from the Cheerio library is used to select the direct parent of the matched element.

import * as cheerio from 'cheerio';
const html = `
<div class="parent">
<p class="child">Red</p>
<div>
<p>Blue</p>
<p>Green</p>
</div>
</div>
`;
const $ = cheerio.load(html);
const $parent = $('.child').parent();

console.log($parent.attr('class'));

// parent

parents():

The parents() function from the Cheerio library is used to select all the ancestor elements of the matched element.

import * as cheerio from 'cheerio';
const html = `
<div>
<p>Red</p>
<div>
<p class="child">Blue</p>
<p>Green</p>
</div>
</div>
`;
const $ = cheerio.load(html);
const $parents = $('.child').parents();

$parents.each((index, element) => {
console.log($(element).prop('tagName'))
});

// DIV
// DIV
// BODY
// HTML

prop() is similar to attr(), but it can also read properties of the selected element, such as its tag name. We'll cover this in more detail later. In the code above, you can see that parents() has returned all the ancestors up to the root element html.

If you want to restrict the number of ancestors, you can utilize the parentsUntil() method and specify the name of the parent element where you want the search to stop. Here is an example:

import * as cheerio from 'cheerio';
const html = `
<div>
<p>Red</p>
<div>
<p class="child">Blue</p>
<p>Green</p>
</div>
</div>
`;
const $ = cheerio.load(html);
const $parents = $('.child').parentsUntil('body');

$parents.each((index, element) => {
console.log($(element).prop('tagName'))
});

// DIV
// DIV

The element specified in parentsUntil() is not included in the selected elements. That's why we didn't get the body element in the above code.

3. Finding Siblings

Siblings are elements that share the same parent element. To locate siblings preceding the selected element, you can use the prev() and prevAll() methods. For siblings following the selected element, the next() and nextAll() methods are used.

  • prev() returns only the sibling that comes immediately before the selected element.
  • prevAll() returns all the siblings that come before the selected element.
import * as cheerio from 'cheerio';
const html = `
<div>
<p>Red</p>
<p class="child">Blue</p>
<p>Green</p>
</div>
`;
const $ = cheerio.load(html);
const $prevSibling = $('.child').prev();

console.log($prevSibling.text());

// Red
  • next() returns only the sibling that comes immediately after the selected element.
  • nextAll() returns all the siblings that come after the selected element.
import * as cheerio from 'cheerio';
const html = `
<div>
<p class="child">Red</p>
<p>Blue</p>
<p>Green</p>
</div>
`;
const $ = cheerio.load(html);
const $nextSiblings = $('.child').nextAll();

$nextSiblings.each((index, element) => {
console.log($(element).text());
});

// Blue
// Green

Additionally, two other methods, prevUntil() and nextUntil(), allow you to constrain the search to a specified element. These methods function similarly to parentsUntil().


Extracting Data With Regex

Regular expressions (regex) are commonly used to find patterns in text. They prove useful in web scraping tasks like extracting emails or phone numbers from a website. We won't delve into the specifics of regex syntax here.

JavaScript comes with a built-in regex engine and offers a match() method to retrieve the text matching a regex pattern. Let's apply this to pick out the list items in our HTML that look like emails.

import * as cheerio from 'cheerio';
const html = `
<ul>
<li>john.doe@example.com</li>
<li>Hello World</li>
<li>alice.smith@example.com</li>
<li>Invalid Email</li>
<li>bob(at)example.com</li>
</ul>
`;
const $ = cheerio.load(html);
const $li = $('li');

$li.each((index, element) => {
  const regex = /@/;
  if ($(element).text().match(regex)) {
    console.log($(element).text());
  }
});

// john.doe@example.com
// alice.smith@example.com

In the above example, we simply matched strings containing the @ symbol. A realistic pattern for extracting emails is somewhat more complex: \b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b. For more complex regex patterns, you can use tools like regex101 to test them.
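
To show the fuller pattern at work, here is a minimal sketch that runs it over a few plain strings. The strings stand in for the text of each li element (one of them is altered to embed an email inside a sentence), so the example needs no HTML parsing at all.

```javascript
// Strings standing in for the text of each <li> element.
const items = [
  'john.doe@example.com',
  'Hello World',
  'Contact alice.smith@example.com for details',
  'Invalid Email',
  'bob(at)example.com',
];

// The fuller email pattern, with the g flag so match() returns every hit.
const emailPattern = /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/g;

// match() returns null when nothing matches, so fall back to an empty array.
const emails = items.flatMap((text) => text.match(emailPattern) ?? []);

console.log(emails);
```

Unlike the /@/ check, this returns only the email itself even when it is surrounded by other text.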


Filtering the Selected Elements

You can narrow a list of selected elements down to the ones you want using Cheerio's filter() and not() methods.

import * as cheerio from 'cheerio';
const html = `
<ul>
<li class="color">Red</li>
<li>One</li>
<li class="color">Blue</li>
<li>Two</li>
<li class="color">Green</li>
</ul>
`;
const $ = cheerio.load(html);
const $colors = $('li').filter('.color');

$colors.each((index, element) => {
console.log($(element).text());
});

// Red
// Blue
// Green

not() is the reverse of filter(): it selects only the elements that do not match the given CSS selector.


Manipulating DOM

We've learned how to select and traverse around elements in the DOM tree using CSS selectors and Cheerio methods.

Now, let's explore how we can make changes to these elements, such as adding or removing classes, updating text, altering the HTML content, and more.

1. Changing Attributes and Properties

Cheerio provides the .attr(key, value) and .prop(key, value) methods for getting and setting attributes and properties of an element. To retrieve an attribute or property, you only need to provide its name (key).

To change it, you include a second argument with the new value.

In short, attributes are the name-value pairs written in the HTML markup (such as href, src, target, or class), while properties are values exposed by the parsed element itself (such as tagName or outerHTML).

In our example, we'll work with a link <a> element, where both href and target are attributes. prop() works for target as well: when a name isn't a property of the element, Cheerio reads and writes the attribute instead. Now, let's put these methods to use and see how they work in action.

import * as cheerio from 'cheerio';
const html = '<a href="https://cheerio.js.org/" target="_blank">CheerioJS</a>';
const $ = cheerio.load(html);

const $link = $('a');
$link.attr('href', "https://nodejs.org/en");
$link.prop('target', "_self");

console.log($link.prop('outerHTML'));

// <a href="https://nodejs.org/en" target="_self">CheerioJS</a>

You can see we have successfully changed the href and target values of our link.

prop() also understands a few special property names, like the outerHTML we used in our code to get the HTML of the selected element itself. Others are innerHTML for the HTML content nested within the element, and innerText for the text within it.

2. Adding and Removing Classes

The addClass() and removeClass() methods allow you to add or remove specific classes from an element. You can add or remove multiple classes by separating their names with spaces. If you attempt to add a class that's already there or remove one that isn't, Cheerio won't make any changes; it simply leaves things as they are.

import * as cheerio from 'cheerio';
const html = '<p class="A B">Hello World</p>';
const $ = cheerio.load(html);

const $p = $('p');
$p.removeClass('A');
$p.addClass('C D');

console.log($p.prop('class'));

// B C D

There is also a .toggleClass() method that adds a class if it doesn't exist and removes it if it does.

3. Changing the Text Content

Let's change the inner text of a p element from Hello World to Welcome.

import * as cheerio from 'cheerio';
const html = '<p>Hello World</p>';
const $ = cheerio.load(html);

const $p = $('p');
$p.text('Welcome');

console.log($p.text());

// Welcome

4. Changing HTML Content

In this example, we will add HTML for a p element inside a div element.

import * as cheerio from 'cheerio';
const html = '<div></div>';
const $ = cheerio.load(html);

const $div = $('div');
$div.html('<p>Hello World</p>');

console.log($div.prop('outerHTML'));

// <div><p>Hello World</p></div>

5. Adding Elements

Cheerio provides append(), prepend(), after(), and before() methods to add new elements at specific locations. Let's explore them with examples:

  • append() inserts the new element as the last child of each selected element.
  • prepend() inserts the new element as the first child of each selected element.
import * as cheerio from 'cheerio';
const html = `
<ul>
<li> A </li>
<li> B </li>
<li> C </li>
</ul>
`;
const $ = cheerio.load(html);
const $ul = $('ul');

$ul.prepend('<li> Prepend </li>');
$ul.append('<li> Append </li>');

const $li = $('li');
$li.each((index, element) => {
console.log($(element).text());
});

// Prepend
// A
// B
// C
// Append
  • before() inserts the new element as a sibling immediately before each selected element.
  • after() inserts the new element as a sibling immediately after each selected element.
import * as cheerio from 'cheerio';
const html = `
<ul>
<li> A </li>
<li class="middle"> B </li>
<li> C </li>
</ul>
`;
const $ = cheerio.load(html);
const $middle = $('li.middle');

$middle.before('<li> Before </li>');
$middle.after('<li> After </li>');

const $li = $('li');

$li.each((index, element) => {
console.log($(element).text());
});

// A
// Before
// B
// After
// C

6. Removing Elements

To remove a specific element, you can use remove() method. This will also remove all the children of that element. Here is an example:

import * as cheerio from 'cheerio';
const html = `
<div>
<p>Hello World</p>
<p class="remove">Welcome</p>
</div>
`;
const $ = cheerio.load(html);
$('.remove').remove();

const $div = $('div');
console.log($div.prop('outerHTML'));


// <div>
// <p>Hello World</p>

// </div>

If you want to remove all the children of an element while keeping the selected element itself, you can use the empty() method instead.


Example Project: Build Your First Web Scraper Using Cheerio in NodeJS

Now that you have learned how to select, query and manipulate DOM elements with Cheerio, let's scrape QuotesToScrape to extract all the quotes along with their author and tags in JSON format.

To begin, open QuotesToScrape in your browser. Then, right-click and select "inspect" to view the HTML structure of the page. Take a closer look at the HTML layout and observe the classes assigned to various elements.

For example, this is one of the div components that we copied from the "inspect" tab. It contains a quote, author and related tags.

<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world">

<a class="tag" href="/tag/change/page/1/">change</a>

<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>

<a class="tag" href="/tag/thinking/page/1/">thinking</a>

<a class="tag" href="/tag/world/page/1/">world</a>

</div>
</div>

Let's write our code.

import * as cheerio from 'cheerio';
import axios from 'axios';

const url = 'https://quotes.toscrape.com';
const response = await axios.get(url);
const html = response.data;

const $ = cheerio.load(html);
const $quoteBlock = $('div.quote');

const quotes = [];

$quoteBlock.each((index, block) => {
  // Get Tags
  const tags = [];
  $(block).find('a.tag').each((i, tag) => {
    tags.push($(tag).text());
  });

  quotes.push({
    No: index + 1,
    Quote: $(block).find('span.text').text(),
    Author: $(block).find('small.author').text(),
    tags: tags
  });
});

console.log(quotes);

/* [
{
No: 1,
Quote: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
Author: 'Albert Einstein',
tags: [ 'change', 'deep-thoughts', 'thinking', 'world' ]
},
{
No: 2,
Quote: '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
Author: 'J.K. Rowling',
tags: [ 'abilities', 'choices' ]
},
{
No: 3,
Quote: '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
Author: 'Albert Einstein',
tags: [ 'inspirational', 'life', 'live', 'miracle', 'miracles' ]
},
... More Quotes
] */

Congratulations! You've just created your first web scraper using Cheerio and NodeJS.
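
To keep the results rather than just print them, the array can be written to a JSON file. The sketch below is one way to do that with the fs/promises module. It uses a single hard-coded quote in the same shape as the scraper's output, and quotes.json is a hypothetical file name.

```javascript
import { readFile, writeFile } from 'node:fs/promises';

// Sample data in the same shape as the quotes array built by the scraper.
const quotes = [
  {
    No: 1,
    Quote: '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
    Author: 'J.K. Rowling',
    tags: ['abilities', 'choices'],
  },
];

const outputFile = 'quotes.json'; // hypothetical output path

// Pretty-print with two-space indentation so the file is easy to read.
await writeFile(outputFile, JSON.stringify(quotes, null, 2), 'utf8');

// Read the file back to confirm the round trip.
const saved = JSON.parse(await readFile(outputFile, 'utf8'));
console.log(`Saved ${saved.length} quote(s) to ${outputFile}`);
```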


Challenges of Web Scraping With Cheerio

As a static HTML parser that doesn't run inside a browser environment, Cheerio does have some limitations. It may not be an ideal choice for complex web scraping tasks. Here are some major considerations when working with Cheerio.

Challenge 1: Dealing With Dynamic Pages

Dynamic websites use JavaScript to load or update their content after the initial page load. Cheerio, however, can't run JavaScript code, so it only sees the HTML that the server sends. It also can't perform user interactions like clicking buttons or submitting forms, or scrape data that keeps changing on dynamic sites.

Challenge 2: Memory Inefficiency

As we have seen in previous examples, Cheerio loads the entire HTML into memory when parsing it. This means that if you're dealing with web pages that have a very large HTML file, it can consume a significant amount of RAM.

Challenge 3: Bypassing Anti-Scraping Measures

A lot of websites use methods to prevent others from easily scraping their data. These include User-Agent detection, IP address blocking, and CAPTCHAs. There are countermeasures, such as sending a browser-like User-Agent header so that the request looks like it comes from a regular browser, or changing your IP address regularly. Unfortunately, Cheerio doesn't come with built-in support for any of these, because it only parses HTML and doesn't make the requests itself.
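
Since the request is made by Axios rather than Cheerio, a custom User-Agent belongs in the Axios request config. The sketch below only builds that config. buildRequestConfig is a hypothetical helper, and the User-Agent strings are illustrative placeholders that should be replaced with current browser strings.

```javascript
// Example User-Agent strings (placeholders; substitute current real browser strings).
const userAgents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
];

// Hypothetical helper: builds a request config with a randomly chosen User-Agent.
function buildRequestConfig(agents) {
  const userAgent = agents[Math.floor(Math.random() * agents.length)];
  return { headers: { 'User-Agent': userAgent } };
}

const config = buildRequestConfig(userAgents);
console.log(config.headers['User-Agent']);

// With Axios, the config is passed as the second argument:
// const response = await axios.get('https://quotes.toscrape.com', config);
```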


Alternatives to Cheerio

There are several alternatives to cheerio for scraping HTML with NodeJS. Here are a few popular options:

  • Puppeteer: Puppeteer works with Chrome or Chromium browsers, offering high-level APIs for scraping dynamic web pages and supporting user interactions. It's a robust choice for more interactive scraping tasks.

  • Jsdom: Jsdom doesn't run within a browser, but it simulates a DOM environment. It can execute JavaScript code and is well suited to scraping dynamic web pages.

  • Parse5: If you're looking for something lower-level than Cheerio, Parse5 is a good option. It is a standards-compliant HTML parser, and the one Cheerio itself uses by default to parse HTML. It offers a straightforward API for parsing and serializing documents, although it lacks the jQuery-like syntax found in Cheerio.

  • Htmlparser2: Htmlparser2 is a versatile library for parsing and manipulating both HTML and XML documents. It's particularly handy for straightforward scraping tasks.


More Cheerio Functionality

Up to this point, we've covered a lot about Cheerio, including how to parse HTML and use that parsed tree of elements for finding and changing elements.

We've also discussed the limitations of Cheerio, particularly its challenges with dynamic web pages. However, there are several more features that Cheerio offers that we haven't talked about here. For instance, it can help you extract structured data in JSON format using the extract() method, and you can also extend Cheerio to add extra functionality that suits your needs. The official Cheerio documentation covers both topics in more detail.


More Web Scraping Tutorials

So that's how you can parse HTML files using Cheerio in a NodeJS environment.

If you would like to learn more about Web Scraping, then be sure to check out The Web Scraping Playbook.
