freeCodeCamp Scrapy Beginners Course Part 6: Items & Item Pipelines
In Part 6 of the Scrapy Beginner Course, we go through how to use Scrapy Items & Item Pipelines to structure and clean your scraped data.
Scraped data can be very messy and unstructured. For example, scraped data often:
- Is in the wrong format (text instead of a number)
- Contains additional unnecessary data
- Uses the wrong encoding
We will walk through:
- Recap of Part 5
- What Are Scrapy Items?
- Using Scrapy Items To Structure Our Data
- What Are Scrapy Pipelines?
- Cleaning Our Scraped Data With Item Pipelines
The code for this part of the course is available on Github here!
If you prefer video tutorials, then check out the video version of this course on the freeCodeCamp channel here.
This guide is part of the 12 Part freeCodeCamp Scrapy Beginner Course where we will build a Scrapy project end-to-end, from building the scrapers to deploying them on a server and running them every day.
If you would like to skip to another section then use one of the links below:
- Part 1: Course & Scrapy Overview
- Part 2: Setting Up Environment & Scrapy
- Part 3: Creating Scrapy Project
- Part 4: First Scrapy Spider
- Part 5: Crawling With Scrapy
- Part 6: Cleaning Data With Item Pipelines
- Part 7: Storing Data In CSVs & Databases
- Part 8: Faking Scrapy Headers & User-Agents
- Part 9: Using Proxies With Scrapy Spiders
- Part 10: Deploying & Scheduling Spiders With Scrapyd
- Part 11: Deploying & Scheduling Spiders With ScrapeOps
- Part 12: Deploying & Scheduling Spiders With Scrapy Cloud
The code for this project is available on Github here!
Need help scraping the web?
Then check out ScrapeOps, the complete toolkit for web scraping.
Recap of Part 5
In Part 5, we created a more advanced Scrapy spider that will crawl the entire BooksToScrape.com website and scrape the data from each individual book page.
Here is the final code:
import scrapy


class BookspiderSpider(scrapy.Spider):
    name = 'bookspider'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com/']

    def parse(self, response):
        books = response.css('article.product_pod')

        for book in books:
            relative_url = book.css('h3 a').attrib['href']

            if 'catalogue/' in relative_url:
                book_url = 'https://books.toscrape.com/' + relative_url
            else:
                book_url = 'https://books.toscrape.com/catalogue/' + relative_url

            yield scrapy.Request(book_url, callback=self.parse_book_page)

        ## Next Page
        next_page = response.css('li.next a ::attr(href)').get()

        if next_page is not None:
            if 'catalogue/' in next_page:
                next_page_url = 'https://books.toscrape.com/' + next_page
            else:
                next_page_url = 'https://books.toscrape.com/catalogue/' + next_page

            yield response.follow(next_page_url, callback=self.parse)

    def parse_book_page(self, response):
        book = response.css("div.product_main")[0]
        table_rows = response.css("table tr")

        yield {
            'url': response.url,
            'title': book.css("h1 ::text").get(),
            'upc': table_rows[0].css("td ::text").get(),
            'product_type': table_rows[1].css("td ::text").get(),
            'price_excl_tax': table_rows[2].css("td ::text").get(),
            'price_incl_tax': table_rows[3].css("td ::text").get(),
            'tax': table_rows[4].css("td ::text").get(),
            'availability': table_rows[5].css("td ::text").get(),
            'num_reviews': table_rows[6].css("td ::text").get(),
            'stars': book.css("p.star-rating").attrib['class'],
            'category': book.xpath("//ul[@class='breadcrumb']/li[@class='active']/preceding-sibling::li[1]/a/text()").get(),
            'description': book.xpath("//div[@id='product_description']/following-sibling::p/text()").get(),
            'price': book.css('p.price_color ::text').get(),
        }
When we inspected the scraped data we saw there were some issues that we needed to fix:
- Prices aren't numbers
- The stock availability isn't a number
- Some text contains trailing & leading white spaces
In Part 6, we will look at how to use Items and Item Pipelines to better structure and clean our data before saving it into a database.
What Are Scrapy Items?
Scrapy Items are a predefined data structure that holds your data.
Instead of yielding your scraped data in the form of a dictionary, for example, you define an Item schema beforehand in your items.py file and use this schema when scraping data.
This enables you to quickly and easily check what structured data you are collecting in your project, and Scrapy will raise an exception if you try to create incorrect data with your Item.
Because of this, using Scrapy Items has a number of advantages:
- Structures your data and gives it a clear schema.
- Enables you to easily clean and process your scraped data.
- Enables you to validate, deduplicate and monitor your data feeds.
- Enables you to easily store and export your data with Scrapy Feed Exports.
- Makes using Scrapy Item Pipelines & Item Loaders easier.
Scrapy supports several item types out of the box (dictionaries, dataclass objects, attrs objects, and Item objects), all of which it can handle automatically when yielded.
However, defining your own Item object in your items.py file is normally the best option.
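To see what this schema enforcement looks like in practice, here is a minimal standalone sketch (the QuoteItem class and its fields are purely illustrative and not part of our project). Assigning to a field that isn't declared on the Item raises a KeyError:

import scrapy

class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()

item = QuoteItem(text='To be, or not to be', author='Shakespeare')
item['author'] = 'William Shakespeare'   # updating a declared field is fine

try:
    item['tags'] = ['inspirational']     # 'tags' is not declared in the schema
except KeyError as error:
    print(error)                         # "QuoteItem does not support field: tags"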
Using Scrapy Items To Structure Our Data
Up until now we've been yielding our data in a dictionary. However, the preferred way of yielding data in Scrapy is using its Item functionality.
So the next step is to switch to using Scrapy Items in our bookspider.
Creating an Item is very easy. Simply create an Item schema in your items.py file.
This file is usually auto-generated when you create a new Scrapy project and lives at the same folder level as the settings.py file of your Scrapy project.
# items.py

import scrapy


class BookItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    upc = scrapy.Field()
    product_type = scrapy.Field()
    price_excl_tax = scrapy.Field()
    price_incl_tax = scrapy.Field()
    tax = scrapy.Field()
    availability = scrapy.Field()
    num_reviews = scrapy.Field()
    stars = scrapy.Field()
    category = scrapy.Field()
    description = scrapy.Field()
    price = scrapy.Field()
Then in our bookspider.py file, import the Item schema and update our spider to store the data in the Item and yield the book_item once the data has been scraped.
import scrapy
from bookscraper.items import BookItem


class BookspiderSpider(scrapy.Spider):
    name = 'bookspider'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com/']

    def parse(self, response):
        books = response.css('article.product_pod')

        for book in books:
            relative_url = book.css('h3 a').attrib['href']

            if 'catalogue/' in relative_url:
                book_url = 'https://books.toscrape.com/' + relative_url
            else:
                book_url = 'https://books.toscrape.com/catalogue/' + relative_url

            yield scrapy.Request(book_url, callback=self.parse_book_page)

        ## Next Page
        next_page = response.css('li.next a ::attr(href)').get()

        if next_page is not None:
            if 'catalogue/' in next_page:
                next_page_url = 'https://books.toscrape.com/' + next_page
            else:
                next_page_url = 'https://books.toscrape.com/catalogue/' + next_page

            yield response.follow(next_page_url, callback=self.parse)

    def parse_book_page(self, response):
        book = response.css("div.product_main")[0]
        table_rows = response.css("table tr")

        book_item = BookItem()

        book_item['url'] = response.url
        book_item['title'] = book.css("h1 ::text").get()
        book_item['upc'] = table_rows[0].css("td ::text").get()
        book_item['product_type'] = table_rows[1].css("td ::text").get()
        book_item['price_excl_tax'] = table_rows[2].css("td ::text").get()
        book_item['price_incl_tax'] = table_rows[3].css("td ::text").get()
        book_item['tax'] = table_rows[4].css("td ::text").get()
        book_item['availability'] = table_rows[5].css("td ::text").get()
        book_item['num_reviews'] = table_rows[6].css("td ::text").get()
        book_item['stars'] = book.css("p.star-rating").attrib['class']
        book_item['category'] = book.xpath("//ul[@class='breadcrumb']/li[@class='active']/preceding-sibling::li[1]/a/text()").get()
        book_item['description'] = book.xpath("//div[@id='product_description']/following-sibling::p/text()").get()
        book_item['price'] = book.css('p.price_color ::text').get()

        yield book_item
This gives our data more structure and allows us to more easily clean it in data pipelines.
What Are Scrapy Pipelines?
Item Pipelines are the data processors of Scrapy, which all our scraped Items will pass through and from where we can clean, process, validate, and store our data.
Using Scrapy Pipelines we can:
- Clean our data (ex. remove currency signs from prices)
- Format our data (ex. convert strings to ints)
- Enrich our data (ex. convert relative links to absolute links)
- Validate our data (ex. make sure the price scraped is a viable price)
- Store our data in databases, queues, files or object storage buckets.
In our Scrapy project, we will use Item Pipelines to clean and process our data before storing it in a database.
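To make this concrete, here is a minimal sketch of what a validation pipeline could look like. The PriceValidationPipeline name and the price check are illustrative only (they are not part of the pipeline we build below); the sketch just shows the process_item() hook every pipeline implements and how raising DropItem discards a bad item:

from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem


class PriceValidationPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        ## Drop any item that arrives without a price
        if adapter.get('price') is None:
            raise DropItem(f"Missing price in {item}")

        ## Otherwise pass the item on to the next pipeline unchanged
        return item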
Cleaning Our Scraped Data With Item Pipelines
As we mentioned previously, there are some data quality issues with the data we are scraping:
- Prices aren't numbers
- The stock availability isn't a number
- Some text contains trailing & leading white spaces
So we will create an Item Pipeline to clean and modify our scraped data before saving it.
First, we will create an empty pipeline in our pipelines.py file:
from itemadapter import ItemAdapter


class BookscraperPipeline:
    def process_item(self, item, spider):
        return item
Next, we will add data cleaning and processing steps to the pipeline to get the data into the format we want.
Strip Whitespaces From Strings
Some of the text we've scraped might have leading or trailing whitespaces or newlines that we don't want, so we will add a step to the pipeline to remove these from every field except the description field:
from itemadapter import ItemAdapter


class BookscraperPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        ## Strip all whitespaces from strings
        field_names = adapter.field_names()
        for field_name in field_names:
            if field_name != 'description':
                value = adapter.get(field_name)
                adapter[field_name] = value.strip()

        return item
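In our spider every field is scraped as a string, so calling .strip() directly is fine. If you adapt this pipeline to a spider where some fields might be missing or non-string, a slightly more defensive version of the same loop (an optional sketch, not required for this project) would check the value first:

## Defensive variant (optional sketch): only strip values that are actually strings
for field_name in adapter.field_names():
    value = adapter.get(field_name)
    if field_name != 'description' and isinstance(value, str):
        adapter[field_name] = value.strip()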
Convert Category & Product Type To Lowercase
Next, we will convert the category and product_type fields to lowercase instead of title case.
from itemadapter import ItemAdapter


class BookscraperPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        ## ...PREVIOUS STEPS

        ## Category & Product Type --> switch to lowercase
        lowercase_keys = ['category', 'product_type']
        for lowercase_key in lowercase_keys:
            value = adapter.get(lowercase_key)
            adapter[lowercase_key] = value.lower()

        return item
Clean Price Data
Currently, the price, price_excl_tax, price_incl_tax and tax fields are strings and contain a £ sign at the start. We want to convert these prices into floats and remove the £ sign.
from itemadapter import ItemAdapter


class BookscraperPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        ## ...PREVIOUS STEPS

        ## Price --> convert to float
        price_keys = ['price', 'price_excl_tax', 'price_incl_tax', 'tax']
        for price_key in price_keys:
            value = adapter.get(price_key)
            value = value.replace('£', '')
            adapter[price_key] = float(value)

        return item
Extract Availability From Text
Currently, the availability value is a sentence like In stock (19 available). We want to extract the number and save it as an integer.
from itemadapter import ItemAdapter


class BookscraperPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        ## ...PREVIOUS STEPS

        ## Availability --> extract number of books in stock
        availability_string = adapter.get('availability')
        split_string_array = availability_string.split('(')
        if len(split_string_array) < 2:
            adapter['availability'] = 0
        else:
            availability_array = split_string_array[1].split(' ')
            adapter['availability'] = int(availability_array[0])

        return item
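An alternative way to do the same extraction (just a sketch; the course code above is what we keep using) is to pull the number out with a regular expression instead of splitting the string:

import re

## Alternative sketch: extract the stock count with a regex
availability_string = adapter.get('availability')               # e.g. "In stock (19 available)"
match = re.search(r'\((\d+) available\)', availability_string)
adapter['availability'] = int(match.group(1)) if match else 0   # default to 0 if no number is found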
Convert Reviews To Integer
Currently, the num_reviews value is a string, but we would like to save it as an integer.
from itemadapter import ItemAdapter


class BookscraperPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        ## ...PREVIOUS STEPS

        ## Reviews --> convert string to number
        num_reviews_string = adapter.get('num_reviews')
        adapter['num_reviews'] = int(num_reviews_string)

        return item
Convert Stars To Number
Finally, the stars value is a string like star-rating Five. We want to extract the text number and convert it into an integer.
from itemadapter import ItemAdapter


class BookscraperPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        ## ...PREVIOUS STEPS

        ## Stars --> convert text to number
        stars_string = adapter.get('stars')
        split_stars_array = stars_string.split(' ')
        stars_text_value = split_stars_array[1].lower()
        if stars_text_value == "zero":
            adapter['stars'] = 0
        elif stars_text_value == "one":
            adapter['stars'] = 1
        elif stars_text_value == "two":
            adapter['stars'] = 2
        elif stars_text_value == "three":
            adapter['stars'] = 3
        elif stars_text_value == "four":
            adapter['stars'] = 4
        elif stars_text_value == "five":
            adapter['stars'] = 5

        return item
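The if/elif chain above works, but an equivalent and more compact alternative (just a sketch, functionally identical for these six values) is to look the word up in a dictionary:

## Alternative sketch: map the star-rating word to a number with a dict lookup
stars_map = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
stars_text_value = adapter.get('stars').split(' ')[1].lower()   # e.g. "star-rating Three" --> "three"
adapter['stars'] = stars_map.get(stars_text_value, 0)           # default to 0 if the word is unexpected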
Full Item Pipeline
So here is the complete pipeline we will use to clean our scraped data from BooksToScrape.
from itemadapter import ItemAdapter


class BookscraperPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        ## Strip all whitespaces from strings
        field_names = adapter.field_names()
        for field_name in field_names:
            if field_name != 'description':
                value = adapter.get(field_name)
                adapter[field_name] = value.strip()

        ## Category & Product Type --> switch to lowercase
        lowercase_keys = ['category', 'product_type']
        for lowercase_key in lowercase_keys:
            value = adapter.get(lowercase_key)
            adapter[lowercase_key] = value.lower()

        ## Price --> convert to float
        price_keys = ['price', 'price_excl_tax', 'price_incl_tax', 'tax']
        for price_key in price_keys:
            value = adapter.get(price_key)
            value = value.replace('£', '')
            adapter[price_key] = float(value)

        ## Availability --> extract number of books in stock
        availability_string = adapter.get('availability')
        split_string_array = availability_string.split('(')
        if len(split_string_array) < 2:
            adapter['availability'] = 0
        else:
            availability_array = split_string_array[1].split(' ')
            adapter['availability'] = int(availability_array[0])

        ## Reviews --> convert string to number
        num_reviews_string = adapter.get('num_reviews')
        adapter['num_reviews'] = int(num_reviews_string)

        ## Stars --> convert text to number
        stars_string = adapter.get('stars')
        split_stars_array = stars_string.split(' ')
        stars_text_value = split_stars_array[1].lower()
        if stars_text_value == "zero":
            adapter['stars'] = 0
        elif stars_text_value == "one":
            adapter['stars'] = 1
        elif stars_text_value == "two":
            adapter['stars'] = 2
        elif stars_text_value == "three":
            adapter['stars'] = 3
        elif stars_text_value == "four":
            adapter['stars'] = 4
        elif stars_text_value == "five":
            adapter['stars'] = 5

        return item
Activating Item Pipeline
To activate our Item Pipeline we just need to add the following code to our settings.py file:
## settings.py

ITEM_PIPELINES = {
    'bookscraper.pipelines.BookscraperPipeline': 300,
}
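The number (300 here) is the pipeline's priority: the convention is to use values in the 0-1000 range, and when several pipelines are enabled, items pass through them in ascending order. So if we later added a separate storage pipeline (the SaveToDatabasePipeline name below is hypothetical, just to illustrate ordering), we would give it a higher number so it runs after the cleaning pipeline:

## settings.py (sketch with a hypothetical second pipeline)
ITEM_PIPELINES = {
    'bookscraper.pipelines.BookscraperPipeline': 300,       # cleaning runs first (lower number)
    'bookscraper.pipelines.SaveToDatabasePipeline': 400,    # storage runs after cleaning
}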
Now, when we run our bookspider, all the scraped data will pass through this Item Pipeline and be cleaned as we would like it to be.
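For example, from the project folder you can run the spider and save the cleaned output to a JSON file using Scrapy's built-in feed exports (the -O flag overwrites the output file on each run; the cleaned_books.json filename is just an example):

scrapy crawl bookspider -O cleaned_books.json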
Here is an example of the data:
{
    "availability": 22,
    "category": "poetry",
    "description": "It's hard to imagine a world without A Light in the Attic. "
                   "This now-classic collection of poetry and drawings from Shel "
                   "Silverstein celebrates its 20th anniversary with this special "
                   "edition. Silverstein's humorous and creative verse can amuse "
                   "the dowdiest of readers. Lemon-faced adults and fidgety kids "
                   "sit still and read these rhythmic words and laugh and smile "
                   "and love th It's hard to imagine a world without A Light in "
                   "the Attic. This now-classic collection of poetry and drawings "
                   "from Shel Silverstein celebrates its 20th anniversary with "
                   "this special edition. Silverstein's humorous and creative "
                   "verse can amuse the dowdiest of readers. Lemon-faced adults "
                   "and fidgety kids sit still and read these rhythmic words and "
                   "laugh and smile and love that Silverstein. Need proof of his "
                   "genius? RockabyeRockabye baby, in the treetop Don't you know a "
                   "treetopIs no safe place to rock?And who put you up there,And "
                   "your cradle, too?Baby, I think someone down here's Got it in "
                   "for you. Shel, you never sounded so good. ...more",
    "num_reviews": 0,
    "price": 51.77,
    "price_excl_tax": 51.77,
    "price_incl_tax": 51.77,
    "product_type": "books",
    "stars": 3,
    "tax": 0.0,
    "title": "A Light in the Attic",
    "upc": "a897fe39b1053632",
    "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
}
Next Steps
We've just looked at how to use Items to structure our data and Item Pipelines to clean the data.
In Part 7, we will look at how you can save this data into various types of file formats like CSVs, JSON, and databases like MySQL and Postgres.
All parts of the 12 Part freeCodeCamp Scrapy Beginner Course are as follows:
- Part 1: Course & Scrapy Overview
- Part 2: Setting Up Environment & Scrapy
- Part 3: Creating Scrapy Project
- Part 4: First Scrapy Spider
- Part 5: Crawling With Scrapy
- Part 6: Cleaning Data With Item Pipelines
- Part 7: Storing Data In CSVs & Databases
- Part 8: Faking Scrapy Headers & User-Agents
- Part 9: Using Proxies With Scrapy Spiders
- Part 10: Deploying & Scheduling Spiders With Scrapyd
- Part 11: Deploying & Scheduling Spiders With ScrapeOps
- Part 12: Deploying & Scheduling Spiders With Scrapy Cloud