
Part 1 - Building Your First Scraper

NodeJS Puppeteer Beginners Series Part 1: How To Build Your First Puppeteer Scraper

This guide is your comprehensive, step-by-step journey to building a production-ready web scraper with Node.js and Puppeteer.

While many tutorials cover only the basics, this six-part series goes further, leading you through the creation of a well-structured scraper using object-oriented programming (OOP) principles.

This 6-part Node.js Puppeteer Beginner Series will walk you through building a web scraping project from scratch, covering everything from creating the scraper to deployment and scheduling.

You'll learn not just how to scrape data but also how to store and clean it, handle errors and retries, and optimize performance with Node.js concurrency modules. By the end of this guide, you'll be equipped to create a robust, efficient, and scalable web scraper.

Node.js Puppeteer 6-Part Beginner Series

  • Part 1: Basic Node.js Puppeteer Scraper - We'll learn the fundamentals of web scraping with Node.js and build your first scraper using Puppeteer. (This article)

  • Part 2: Cleaning Unruly Data & Handling Edge Cases - Web data can be messy and unpredictable. In this part, we'll create a robust scraper using data structures and cleaning techniques to handle these challenges. (Part 2)

  • Part 3: Storing Scraped Data in AWS S3, MySQL & Postgres DBs - Explore various options for storing your scraped data, including databases like MySQL or Postgres, cloud storage like AWS S3, and file formats like CSV and JSON. We'll discuss their pros, cons, and suitable use cases. (Part 3)

  • Part 4: Managing Retries & Concurrency - Enhance your scraper's reliability and scalability by handling failed requests and utilizing concurrency. (Part 4)

  • Part 5: Faking User-Agents & Browser Headers - Learn how to create a production-ready scraper by simulating real users through user-agent and browser header manipulation. (Part 5)

  • Part 6: Using Proxies To Avoid Getting Blocked - Discover how to use proxies to bypass anti-bot systems by disguising your real IP address and location. (Part 6)

For this beginner series, we'll focus on a simple scraping structure. We'll build a single scraper that takes a starting URL, fetches the website, parses and cleans data from the HTML response, and stores the extracted information - all within the same process.

This approach is ideal for personal projects and small-scale scraping tasks. However, larger-scale scraping, especially for business-critical data, may require more complex architectures.

Need help scraping the web?

Then check out ScrapeOps, the complete toolkit for web scraping.


Part 1: Basic Node.js Scraper

In Part 1, we'll start by building a basic web scraper that extracts data from webpages using CSS selectors and saves it in CSV format.

In the following sections, we'll expand on this foundation, adding more features and functionality.

For this series, we will be scraping products from an e-commerce website, chocolate.co.uk, using Puppeteer for its ability to handle JavaScript-heavy pages. Let's get started!

chocolate.co.uk Products Page


Our Puppeteer Web Scraping Stack

When it comes to web scraping stacks, two key components are necessary:

  1. HTTP Client: Sends a request to the website to retrieve the HTML/JSON response.
  2. Browser Automation Tool: Used to navigate and interact with web pages.

For our purposes, we will use Puppeteer, a Node.js library that provides a high-level API to control headless Chrome or Chromium. Puppeteer is particularly useful for scraping dynamic content that requires JavaScript to render.

Using Puppeteer, you can simulate a real user navigating through a website. This includes clicking on buttons, filling out forms, and waiting for dynamic content to load.

This makes it a powerful tool for web scraping, especially for modern websites that rely heavily on JavaScript.
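For example, a minimal sketch of these interactions might look like the following (the .load-more and input[name="q"] selectors are purely hypothetical placeholders, not selectors from the site we'll scrape):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.goto('https://example.com', { waitUntil: 'networkidle2' });

  // Click a hypothetical "load more" button, then wait for the new content to render
  // await page.click('.load-more');
  // await page.waitForSelector('.new-content');

  // Type into a hypothetical search box
  // await page.type('input[name="q"]', 'dark chocolate');

  await browser.close();
})();

The interaction lines are commented out because example.com has no such elements; the point is simply that page.click(), page.type(), and page.waitForSelector() cover most of the actions you would otherwise perform by hand in the browser.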


How to Set Up Our Node.js Environment

Let's start by setting up our Node.js environment.

Step 1 - Set Up Your Node.js Environment

Ensure you have Node.js installed on your machine. You can download it from nodejs.org.

Once installed, set up a new project and initialize a package.json file:

$ mkdir puppeteer_scraper
$ cd puppeteer_scraper
$ npm init -y

This creates a new directory for our project and initializes it with a default package.json file.

Step 2 - Install Puppeteer

Install Puppeteer using npm:

$ npm install puppeteer

Puppeteer will download a recent version of Chromium by default, which ensures that your scraper works out of the box with a known good version of the browser.
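If you want to confirm the install worked before writing any scraping code, a quick sanity check is to launch the bundled browser and print its version (the file name check_puppeteer.js is just an example):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  console.log(await browser.version()); // prints something like "HeadlessChrome/..."
  await browser.close();
})();

$ node check_puppeteer.js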


Creating Our Scraper Project

Now that we have our environment set up, we can start building our Puppeteer scraper. First, create a new file called chocolate_scraper.js in our project folder:

puppeteer_scraper
└── chocolate_scraper.js

This chocolate_scraper.js file will contain all the code we use to scrape the e-commerce website.


Laying Out Our Puppeteer Scraper

First, let's lay out the basic structure of our scraper.

const puppeteer = require('puppeteer');

const urls = [
  'https://www.chocolate.co.uk/collections/all'
];

let scrapedData = [];

const startScrape = async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  for (const url of urls) {
    await page.goto(url);

    // Parse Data

    // Add to Data Output
  }

  await browser.close();
  console.log(scrapedData);
};

startScrape();
  • We imported Puppeteer, which provides an API to control a browser programmatically.
  • Then, we defined a list of URLs (urls) to scrape.
  • Next, we initialized an empty array (scrapedData) to store the data that will be scraped from the website.
  • Finally, we set up a basic function to start our scraping process.

If we run this script now, startScrape() should log an empty array as output.
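To try it, save the file and run it with Node from the project folder:

$ node chocolate_scraper.js
[]

The empty array is expected at this stage - we haven't parsed anything out of the pages yet.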


Retrieving The HTML From The Website

The first step every web scraper must do is retrieve the HTML/JSON response from the target website so that it can extract the data from the response.

Let's update our scraper to navigate to the target URLs and retrieve the HTML content:

const puppeteer = require('puppeteer');

const urls = [
  'https://www.chocolate.co.uk/collections/all'
];

let scrapedData = [];

const startScrape = async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  for (const url of urls) {
    await page.goto(url, { waitUntil: 'networkidle2' });

    const html = await page.content();
    console.log(html); // print the raw HTML for debugging
  }

  await browser.close();
};

startScrape();
  • Here, we navigate to each URL in our list using page.goto(url, { waitUntil: 'networkidle2' }).
    • The networkidle2 option ensures that Puppeteer waits until there are no more than two network connections for at least 500 ms.
    • This is particularly useful for pages that load additional content dynamically.
  • We then retrieve the HTML content of the page with page.content() and print it out for debugging purposes.
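If networkidle2 ever proves too slow or too permissive for a particular page, a more targeted alternative (a sketch based on our own assumption about the page, not a requirement of the site) is to wait for the specific element you intend to scrape:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.goto('https://www.chocolate.co.uk/collections/all');
  // Wait for the product cards themselves rather than general network quiet.
  // '.product-item' is the selector we identify in the next section.
  await page.waitForSelector('.product-item', { timeout: 30000 });

  const html = await page.content();
  console.log(html.length); // quick check that we received a full page

  await browser.close();
})();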

Extracting Data From HTML

Now that our scraper can retrieve HTML content, we need to extract the data we want.

This will be done using Puppeteer's page.evaluate() function, which allows us to execute JavaScript in the context of the page.

Find Product CSS Selectors

To identify the correct CSS selectors for parsing product details, start by opening the website in your browser.

Then, right-click anywhere on the page and select "Inspect" to open the developer tools console.

Product CSS Selectors

Using the element inspector, hover over an item and look at the IDs and classes on the individual products.

In this case, we can see that each product card has its own component with the class product-item.

We can use this class to reference our products (see the image above).
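Before wiring these selectors into the scraper, it's worth sanity-checking them directly in the browser's developer tools console on the products page (these lines run in DevTools, not in Node):

// Run in the DevTools console on the products page
document.querySelectorAll('.product-item').length;             // should be greater than 0
document.querySelector('.product-item-meta__title').innerText; // first product's name
document.querySelector('.price').innerText;                    // first product's price

If the first line returns 0, the class names have likely changed and the selectors below will need updating.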

const puppeteer = require('puppeteer');

const urls = [
  'https://www.chocolate.co.uk/collections/all'
];

let scrapedData = [];

const startScrape = async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  for (const url of urls) {
    await page.goto(url, { waitUntil: 'networkidle2' });

    const products = await page.evaluate(() => {
      const items = document.querySelectorAll('.product-item');
      return Array.from(items).map(item => ({
        name: item.querySelector('.product-item-meta__title').innerText,
        price: item.querySelector('.price').innerText.replace('\nSale price', '').trim(),
        url: item.querySelector('.product-item-meta a').href
      }));
    });

    scrapedData.push(...products);
  }

  await browser.close();
  console.log(scrapedData);
};

startScrape();
  • In this code, we use page.evaluate() to execute a function in the context of the page.
  • This function selects all elements with the class product-item and maps them to an array of objects, each containing the product's name, price, and URL.
  • We then append this array to our scrapedData array.
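One thing to be aware of: innerText will throw if a selector matches nothing. If you want to guard against the occasional product card that is missing a field (we'll handle messy data properly in Part 2), a defensive drop-in replacement for the page.evaluate() call above could look like this (a sketch, not a required change):

const products = await page.evaluate(() => {
  const items = document.querySelectorAll('.product-item');
  return Array.from(items).map(item => {
    // Optional chaining returns undefined instead of throwing when an element is missing,
    // so absent fields become empty strings rather than crashing the scrape.
    const name = item.querySelector('.product-item-meta__title')?.innerText ?? '';
    const priceEl = item.querySelector('.price');
    const price = priceEl ? priceEl.innerText.replace('\nSale price', '').trim() : '';
    const url = item.querySelector('.product-item-meta a')?.href ?? '';
    return { name, price, url };
  });
});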

Saving Data to CSV

In Part 3 of this beginner series, we go through in much more detail how to save data to various file formats and databases.

However, as a simple example for Part 1 of this series, we're going to save the data we've scraped and stored in scrapedData to a CSV file once the scrape has completed.

To do this, we need to install the csv-writer package:

$ npm install csv-writer

Now, update our scraper to include the CSV writing functionality:

const puppeteer = require('puppeteer');
const createCsvWriter = require('csv-writer').createObjectCsvWriter;

const urls = [
  'https://www.chocolate.co.uk/collections/all'
];

let scrapedData = [];

const startScrape = async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  for (const url of urls) {
    await page.goto(url, { waitUntil: 'networkidle2' });

    const products = await page.evaluate(() => {
      const items = document.querySelectorAll('.product-item');
      return Array.from(items).map(item => ({
        name: item.querySelector('.product-item-meta__title').innerText,
        price: item.querySelector('.price').innerText.replace('\nSale price', '').trim(),
        url: item.querySelector('.product-item-meta a').href
      }));
    });

    scrapedData.push(...products);
  }

  await browser.close();
  saveToCSV(scrapedData);
};

const saveToCSV = (data) => {
  const csvWriter = createCsvWriter({
    path: 'scraped_data.csv',
    header: [
      { id: 'name', title: 'Name' },
      { id: 'price', title: 'Price' },
      { id: 'url', title: 'URL' }
    ]
  });

  csvWriter.writeRecords(data).then(() => {
    console.log('CSV file was written successfully');
  });
};

startScrape();
  • In this code, we define a saveToCSV function that takes an array of data and writes it to a CSV file using csv-writer.
  • The csvWriter object is configured with the path to the output file and the headers for the CSV columns.
  • After scraping the data, we call saveToCSV(scrapedData) to save the data to a file.
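Since writeRecords() returns a promise, you can also write saveToCSV with async/await and basic error handling - a purely stylistic variation, not a required change:

const saveToCSV = async (data) => {
  const csvWriter = createCsvWriter({
    path: 'scraped_data.csv',
    header: [
      { id: 'name', title: 'Name' },
      { id: 'price', title: 'Price' },
      { id: 'url', title: 'URL' }
    ]
  });

  try {
    await csvWriter.writeRecords(data);
    console.log('CSV file was written successfully');
  } catch (err) {
    console.error('Failed to write CSV file:', err);
  }
};

If you use this version, call it with await saveToCSV(scrapedData) inside startScrape so any write errors surface before the function finishes.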

Navigating to the Next Page

So far, the code is working well, but we're only getting the products from the first page of the site - the URL we defined in the urls list.

The next logical step is to go to the next page, if there is one, and scrape the item data from that too. Here's how we do that.

To handle pagination, we need to find the CSS selector for the "next page" button and scrape each page iteratively until there are no more pages.

document.querySelector('a[rel="next"]')

Now, we just need to update our scraper to extract this next page URL and keep scraping until there are no more pages.

const puppeteer = require('puppeteer');
const createCsvWriter = require('csv-writer').createObjectCsvWriter;

const baseURL = 'https://www.chocolate.co.uk/collections/all';
let scrapedData = [];

const startScrape = async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  let nextPageExists = true;
  let currentPage = baseURL;

  while (nextPageExists) {
    await page.goto(currentPage, { waitUntil: 'networkidle2' });

    const products = await page.evaluate(() => {
      const items = document.querySelectorAll('.product-item');
      return Array.from(items).map(item => ({
        name: item.querySelector('.product-item-meta__title').innerText,
        price: item.querySelector('.price').innerText.replace('\nSale price', '').trim(),
        url: item.querySelector('.product-item-meta a').href
      }));
    });

    scrapedData.push(...products);

    nextPageExists = await page.evaluate(() => {
      const nextPage = document.querySelector('a[rel="next"]');
      return nextPage ? nextPage.href : null;
    });

    if (nextPageExists) {
      currentPage = nextPageExists;
    }
  }

  await browser.close();
  saveToCSV(scrapedData);
};

const saveToCSV = (data) => {
  const csvWriter = createCsvWriter({
    path: 'scraped_data.csv',
    header: [
      { id: 'name', title: 'Name' },
      { id: 'price', title: 'Price' },
      { id: 'url', title: 'URL' }
    ]
  });

  csvWriter.writeRecords(data).then(() => {
    console.log('CSV file was written successfully');
  });
};

startScrape();
  • In this updated code, we use a while loop to keep scraping until there are no more pages.
  • We check for the presence of a "next page" link using page.evaluate(), and if it exists, we set currentPage to the URL of the next page.
  • This process repeats until nextPageExists is null, indicating there are no more pages to scrape.
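One small safeguard worth adding (our own precaution, not something the site requires) is an upper bound on the number of pages, so a changed selector or an unexpected redirect can never send the loop crawling indefinitely. Sketched against the code above:

const MAX_PAGES = 50; // hypothetical cap - adjust to the size of the site
let pagesScraped = 0;

while (nextPageExists && pagesScraped < MAX_PAGES) {
  // ... same scraping and next-page logic as above ...
  pagesScraped++;
}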

Next Steps

With this guide, you should be able to set up a basic web scraper with Puppeteer that can handle dynamic content.

In Part 2 of the series, we will explore handling data cleaning and dealing with edge cases to make our scraper more robust. Stay tuned!