

freeCodeCamp Scrapy Beginners Course Part 6: Items & Item Pipelines

In Part 6 of the Scrapy Beginner Course, we go through how to use Scrapy Items & Item Pipelines to structure and clean your scraped data.

Scraped data can be very messy and unstructured. For example, scraped data can:

  • Be in the wrong format (text instead of a number)
  • Contain additional unnecessary data
  • Use the wrong encoding

We will walk through:

  • What Scrapy Items are and why you should use them
  • Using Scrapy Items to structure our data
  • What Scrapy Item Pipelines are
  • Cleaning our scraped data with an Item Pipeline
  • Activating our Item Pipeline in settings.py

The code for this part of the course is available on Github here!

If you prefer video tutorials, then check out the video version of this course on the freeCodeCamp channel here.


Recap of Part 5

In Part 5, we created a more advanced Scrapy spider that will crawl the entire BooksToScrape.com website and scrape the data from each individual book page.

Here is the final code:

import scrapy

class BookspiderSpider(scrapy.Spider):
    name = 'bookspider'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com/']

    def parse(self, response):
        books = response.css('article.product_pod')
        for book in books:
            relative_url = book.css('h3 a').attrib['href']
            if 'catalogue/' in relative_url:
                book_url = 'https://books.toscrape.com/' + relative_url
            else:
                book_url = 'https://books.toscrape.com/catalogue/' + relative_url
            yield scrapy.Request(book_url, callback=self.parse_book_page)

        ## Next Page
        next_page = response.css('li.next a ::attr(href)').get()
        if next_page is not None:
            if 'catalogue/' in next_page:
                next_page_url = 'https://books.toscrape.com/' + next_page
            else:
                next_page_url = 'https://books.toscrape.com/catalogue/' + next_page
            yield response.follow(next_page_url, callback=self.parse)

    def parse_book_page(self, response):
        book = response.css("div.product_main")[0]
        table_rows = response.css("table tr")
        yield {
            'url': response.url,
            'title': book.css("h1 ::text").get(),
            'upc': table_rows[0].css("td ::text").get(),
            'product_type': table_rows[1].css("td ::text").get(),
            'price_excl_tax': table_rows[2].css("td ::text").get(),
            'price_incl_tax': table_rows[3].css("td ::text").get(),
            'tax': table_rows[4].css("td ::text").get(),
            'availability': table_rows[5].css("td ::text").get(),
            'num_reviews': table_rows[6].css("td ::text").get(),
            'stars': book.css("p.star-rating").attrib['class'],
            'category': book.xpath("//ul[@class='breadcrumb']/li[@class='active']/preceding-sibling::li[1]/a/text()").get(),
            'description': book.xpath("//div[@id='product_description']/following-sibling::p/text()").get(),
            'price': book.css('p.price_color ::text').get(),
        }

When we inspected the scraped data we saw there were some issues that we needed to fix:

  • Prices aren't numbers
  • The stock availability isn't a number
  • Some text contains trailing & leading white spaces

In Part 6, we will look at how to use Items and Item Pipelines to better structure and clean our data before saving it into a database.


What Are Scrapy Items?

Scrapy Items are a predefined data structure that holds your data.

Instead of yielding your scraped data in the form of a dictionary, for example, you define an Item schema beforehand in your items.py file and use this schema when scraping data.

This enables you to quickly and easily check what structured data you are collecting in your project, and it will raise exceptions if you try to create incorrect data with your Item.
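For example, here is a quick illustration (not from the course code, and using a hypothetical DemoItem) of how an Item catches mistakes. Assigning a field that isn't declared on the Item raises a KeyError:

import scrapy

class DemoItem(scrapy.Item):
    title = scrapy.Field()

item = DemoItem()
item['title'] = 'A Light in the Attic'  ## fine, 'title' is a declared field
item['titel'] = 'oops'                  ## raises KeyError: DemoItem does not support field: titel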

Because of this, using Scrapy Items has a number of advantages:

  • Structures your data and gives it a clear schema.
  • Enables you to easily clean and process your scraped data.
  • Enables you to validate, deduplicate and monitor your data feeds.
  • Enables you to easily store and export your data with Scrapy Feed Exports.
  • Makes it easier to use Scrapy Item Pipelines & Item Loaders.

Scrapy supports several item types out of the box, any of which can be yielded from your spiders: dictionaries, Item objects, dataclass objects, and attrs objects.

However, defining your own Item object in your items.py file is normally the best option.
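For illustration, here is a minimal sketch (not part of the course project, with a hypothetical class name) of what the same idea could look like as a dataclass instead of a scrapy.Item. Recent versions of Scrapy accept dataclass objects as items, but we will stick with scrapy.Item for the rest of the course:

from dataclasses import dataclass
from typing import Optional

@dataclass
class BookItemSketch:
    ## A hypothetical dataclass-based item showing a few of the fields
    ## we define on BookItem below
    url: Optional[str] = None
    title: Optional[str] = None
    price: Optional[str] = None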


Using Scrapy Items To Structure Our Data

Up until now we've been yielding our data in a dictionary. However, the preferred way of yielding data in Scrapy is using its Item functionality.

So the next step is to switch our bookspider over to using Scrapy Items.

Creating an Item is very easy. Simply create an Item schema in your items.py file.

This file is usually auto-generated when you create a new project with Scrapy and lives in the same folder as your project's settings.py file.

# items.py

import scrapy

class BookItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    upc = scrapy.Field()
    product_type = scrapy.Field()
    price_excl_tax = scrapy.Field()
    price_incl_tax = scrapy.Field()
    tax = scrapy.Field()
    availability = scrapy.Field()
    num_reviews = scrapy.Field()
    stars = scrapy.Field()
    category = scrapy.Field()
    description = scrapy.Field()
    price = scrapy.Field()

Then, in our bookspider.py file, we import the Item schema, update our spider to store the scraped data in the Item, and yield the book_item once the data has been scraped.


import scrapy
from bookscraper.items import BookItem

class BookspiderSpider(scrapy.Spider):
    name = 'bookspider'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com/']

    def parse(self, response):
        books = response.css('article.product_pod')
        for book in books:
            relative_url = book.css('h3 a').attrib['href']
            if 'catalogue/' in relative_url:
                book_url = 'https://books.toscrape.com/' + relative_url
            else:
                book_url = 'https://books.toscrape.com/catalogue/' + relative_url
            yield scrapy.Request(book_url, callback=self.parse_book_page)

        ## Next Page
        next_page = response.css('li.next a ::attr(href)').get()
        if next_page is not None:
            if 'catalogue/' in next_page:
                next_page_url = 'https://books.toscrape.com/' + next_page
            else:
                next_page_url = 'https://books.toscrape.com/catalogue/' + next_page
            yield response.follow(next_page_url, callback=self.parse)

    def parse_book_page(self, response):
        book = response.css("div.product_main")[0]
        table_rows = response.css("table tr")

        book_item = BookItem()
        book_item['url'] = response.url
        book_item['title'] = book.css("h1 ::text").get()
        book_item['upc'] = table_rows[0].css("td ::text").get()
        book_item['product_type'] = table_rows[1].css("td ::text").get()
        book_item['price_excl_tax'] = table_rows[2].css("td ::text").get()
        book_item['price_incl_tax'] = table_rows[3].css("td ::text").get()
        book_item['tax'] = table_rows[4].css("td ::text").get()
        book_item['availability'] = table_rows[5].css("td ::text").get()
        book_item['num_reviews'] = table_rows[6].css("td ::text").get()
        book_item['stars'] = book.css("p.star-rating").attrib['class']
        book_item['category'] = book.xpath("//ul[@class='breadcrumb']/li[@class='active']/preceding-sibling::li[1]/a/text()").get()
        book_item['description'] = book.xpath("//div[@id='product_description']/following-sibling::p/text()").get()
        book_item['price'] = book.css('p.price_color ::text').get()

        yield book_item

This gives our data more structure and allows us to more easily clean it in data pipelines.


What Are Scrapy Pipelines?

Item Pipelines are Scrapy's data processors: every scraped Item passes through them, and within them we can clean, process, validate, and store our data.

Using Scrapy Pipelines we can:

  • Clean our data (ex. remove currency signs from prices)
  • Format our data (ex. convert strings to ints)
  • Enrich our data (ex. convert relative links to absolute links)
  • Validate our data (ex. make sure the price scraped is a viable price)
  • Store our data in databases, queues, files or object storage buckets.

In our Scrapy project, we will use Item Pipelines to clean and process our data before storing it in a database.


Cleaning Our Scraped Data With Item Pipelines

As we mentioned previously, there are some data quality issues with the data we are scraping:

  • Prices aren't numbers
  • The stock availability isn't a number
  • Some text contains trailing & leading white spaces

So we will create an Item Pipeline to clean and modify our scraped data before saving it.

First we will create an empty pipeline in our pipelines.py file:


from itemadapter import ItemAdapter

class BookscraperPipeline:

    def process_item(self, item, spider):
        return item

Next, we will add data cleaning and processing steps to the pipeline to get the data into the format we want.


Strip Whitespaces From Strings

Some of the text we've scraped might have leading or trailing whitespaces or newlines that we don't want, so we will add a step to the pipeline to remove these from every field except the description field:


from itemadapter import ItemAdapter

class BookscraperPipeline:

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        ## Strip all whitespaces from strings
        field_names = adapter.field_names()
        for field_name in field_names:
            if field_name != 'description':
                value = adapter.get(field_name)
                adapter[field_name] = value.strip()

        return item
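If ItemAdapter is new to you: it is a small wrapper from the itemadapter package (installed alongside Scrapy) that gives a common dict-like interface over dictionaries, scrapy.Item objects, dataclasses and attrs objects, so the same pipeline code works whichever item type your spider yields. Here is a quick illustration using the BookItem we defined earlier (the values are made up):

from itemadapter import ItemAdapter
from bookscraper.items import BookItem

item = BookItem(title='  A Light in the Attic  ')
adapter = ItemAdapter(item)

print(adapter.get('title'))         ## '  A Light in the Attic  '
adapter['title'] = adapter.get('title').strip()
print(item['title'])                ## 'A Light in the Attic' (writes go back to the underlying item)
print(list(adapter.field_names()))  ## every field declared on BookItem, whether set or not

Note that field_names() returns every declared field, so adapter.get() returns None for any field the spider didn't populate. With BooksToScrape every field is always populated, so the .strip() call in the pipeline above is safe.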


Convert Category & Product Type To Lowercase

Next, we will convert the category and product_type fields to lower case instead of title case.


from itemadapter import ItemAdapter

class BookscraperPipeline:

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        ## ...PREVIOUS STEPS

        ## Category & Product Type --> switch to lowercase
        lowercase_keys = ['category', 'product_type']
        for lowercase_key in lowercase_keys:
            value = adapter.get(lowercase_key)
            adapter[lowercase_key] = value.lower()

        return item


Clean Price Data

Currently, the price, price_excl_tax, price_incl_tax and tax fields are strings and contain a £ sign at the start. We want to convert these prices into floats and remove the £ sign.


from itemadapter import ItemAdapter

class BookscraperPipeline:

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        ## ...PREVIOUS STEPS

        ## Price --> convert to float
        price_keys = ['price', 'price_excl_tax', 'price_incl_tax', 'tax']
        for price_key in price_keys:
            value = adapter.get(price_key)
            value = value.replace('£', '')
            adapter[price_key] = float(value)

        return item


Extract Availability From Text

Currently, the availability value is a sentence like In stock (19 available). We want to extract the number and save it as an integer.


from itemadapter import ItemAdapter

class BookscraperPipeline:

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        ## ...PREVIOUS STEPS

        ## Availability --> extract number of books in stock
        availability_string = adapter.get('availability')
        split_string_array = availability_string.split('(')
        if len(split_string_array) < 2:
            adapter['availability'] = 0
        else:
            availability_array = split_string_array[1].split(' ')
            adapter['availability'] = int(availability_array[0])

        return item
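To make the string handling easier to follow, here is the same logic run by hand on a sample value (the sample is shaped like the availability strings BooksToScrape returns):

availability_string = 'In stock (22 available)'
split_string_array = availability_string.split('(')    ## ['In stock ', '22 available)']
availability_array = split_string_array[1].split(' ')  ## ['22', 'available)']
print(int(availability_array[0]))                      ## 22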


Convert Reviews To Integer

Currently, the num_reviews value is a string; however, we would like to save it as an integer.


from itemadapter import ItemAdapter

class BookscraperPipeline:

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        ## ...PREVIOUS STEPS

        ## Reviews --> convert string to number
        num_reviews_string = adapter.get('num_reviews')
        adapter['num_reviews'] = int(num_reviews_string)

        return item


Convert Stars To Number

Finally, the stars value is a string like star-rating Five. We want to extract the text number and convert it into an integer.


from itemadapter import ItemAdapter

class BookscraperPipeline:

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        ## ...PREVIOUS STEPS

        ## Stars --> convert text to number
        stars_string = adapter.get('stars')
        split_stars_array = stars_string.split(' ')
        stars_text_value = split_stars_array[1].lower()
        if stars_text_value == "zero":
            adapter['stars'] = 0
        elif stars_text_value == "one":
            adapter['stars'] = 1
        elif stars_text_value == "two":
            adapter['stars'] = 2
        elif stars_text_value == "three":
            adapter['stars'] = 3
        elif stars_text_value == "four":
            adapter['stars'] = 4
        elif stars_text_value == "five":
            adapter['stars'] = 5

        return item
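The if/elif chain above works fine, but as a design note, a dictionary lookup is a more compact way to express the same mapping. Here is a sketch of an equivalent version (not the course code; note that it falls back to 0 if the text doesn't match, rather than leaving the original string in place):

## Stars --> convert text to number (dictionary-lookup sketch)
stars_map = {'zero': 0, 'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5}
stars_text_value = adapter.get('stars').split(' ')[1].lower()
adapter['stars'] = stars_map.get(stars_text_value, 0)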


Full Item Pipeline

So here is the complete pipeline we will use to clean our scraped data from BooksToScrape.


from itemadapter import ItemAdapter

class BookscraperPipeline:

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        ## Strip all whitespaces from strings
        field_names = adapter.field_names()
        for field_name in field_names:
            if field_name != 'description':
                value = adapter.get(field_name)
                adapter[field_name] = value.strip()

        ## Category & Product Type --> switch to lowercase
        lowercase_keys = ['category', 'product_type']
        for lowercase_key in lowercase_keys:
            value = adapter.get(lowercase_key)
            adapter[lowercase_key] = value.lower()

        ## Price --> convert to float
        price_keys = ['price', 'price_excl_tax', 'price_incl_tax', 'tax']
        for price_key in price_keys:
            value = adapter.get(price_key)
            value = value.replace('£', '')
            adapter[price_key] = float(value)

        ## Availability --> extract number of books in stock
        availability_string = adapter.get('availability')
        split_string_array = availability_string.split('(')
        if len(split_string_array) < 2:
            adapter['availability'] = 0
        else:
            availability_array = split_string_array[1].split(' ')
            adapter['availability'] = int(availability_array[0])

        ## Reviews --> convert string to number
        num_reviews_string = adapter.get('num_reviews')
        adapter['num_reviews'] = int(num_reviews_string)

        ## Stars --> convert text to number
        stars_string = adapter.get('stars')
        split_stars_array = stars_string.split(' ')
        stars_text_value = split_stars_array[1].lower()
        if stars_text_value == "zero":
            adapter['stars'] = 0
        elif stars_text_value == "one":
            adapter['stars'] = 1
        elif stars_text_value == "two":
            adapter['stars'] = 2
        elif stars_text_value == "three":
            adapter['stars'] = 3
        elif stars_text_value == "four":
            adapter['stars'] = 4
        elif stars_text_value == "five":
            adapter['stars'] = 5

        return item
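Before wiring the pipeline into Scrapy, you can sanity check it on a hand-built item. Here is a minimal sketch (not part of the course code) that assumes BookItem and BookscraperPipeline are importable from the project, and uses values shaped like the ones BooksToScrape returns:

from bookscraper.items import BookItem
from bookscraper.pipelines import BookscraperPipeline

item = BookItem(
    url='https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
    title='A Light in the Attic',
    upc='a897fe39b1053632',
    product_type='Books',
    price_excl_tax='£51.77',
    price_incl_tax='£51.77',
    tax='£0.00',
    availability='In stock (22 available)',
    num_reviews='0',
    stars='star-rating Three',
    category='Poetry',
    description='A description of the book...',
    price='£51.77',
)

cleaned = BookscraperPipeline().process_item(item, spider=None)
print(cleaned['price'], cleaned['availability'], cleaned['stars'])  ## 51.77 22 3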


Activating Item Pipeline

To activate our Item Pipeline we just need to add the following code to our settings.py file:

## settings.py

ITEM_PIPELINES = {
    'bookscraper.pipelines.BookscraperPipeline': 300,
}

Now, when we run our bookspider, all the scraped data will pass through this Item Pipeline and be cleaned as we want. The integer value (300 here) sets the order in which pipelines run when more than one is active; lower-numbered pipelines run first, and values are conventionally kept in the 0-1000 range.

Here is an example of the data:


{
    "availability": 22,
    "category": "poetry",
    "description": "It's hard to imagine a world without A Light in the Attic. "
                   "This now-classic collection of poetry and drawings from Shel "
                   "Silverstein celebrates its 20th anniversary with this special "
                   "edition. Silverstein's humorous and creative verse can amuse "
                   "the dowdiest of readers. Lemon-faced adults and fidgety kids "
                   "sit still and read these rhythmic words and laugh and smile "
                   "and love th It's hard to imagine a world without A Light in "
                   "the Attic. This now-classic collection of poetry and drawings "
                   "from Shel Silverstein celebrates its 20th anniversary with "
                   "this special edition. Silverstein's humorous and creative "
                   "verse can amuse the dowdiest of readers. Lemon-faced adults "
                   "and fidgety kids sit still and read these rhythmic words and "
                   "laugh and smile and love that Silverstein. Need proof of his "
                   "genius? RockabyeRockabye baby, in the treetop Don't you know a "
                   "treetopIs no safe place to rock?And who put you up there,And "
                   "your cradle, too?Baby, I think someone down here's Got it in "
                   "for you. Shel, you never sounded so good. ...more",
    "num_reviews": 0,
    "price": 51.77,
    "price_excl_tax": 51.77,
    "price_incl_tax": 51.77,
    "product_type": "books",
    "stars": 3,
    "tax": 0.0,
    "title": "A Light in the Attic",
    "upc": "a897fe39b1053632",
    "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
}


Next Steps

We've just looked at how to use Items to structure our data and Item Pipelines to clean the data.

In Part 7, we will look at how you can save this data into various types of file formats like CSVs, JSON, and databases like MySQL and Postgres.
