Scrapy Items: The Better Way To Process Your Data
The goal of web scraping is to turn unstructured HTML data into clean and structured data that can be used in our applications and data pipelines.
Developers who are new to Scrapy often just yield their scraped data as plain dictionaries. However, there is a better way: Scrapy Items.
In this guide, we're going to walk through:
- What Are Scrapy Items & Why Should We Use Them?
- How To Integrate Items Into Your Spiders
- Processing Items with Scrapy Item Loaders
- Processing Items with Scrapy Item Pipelines
What Are Scrapy Items & Why Should We Use Them?
Scrapy Items are a predefined data structure that holds your data.
Instead of yielding your scraped data as a dictionary, for example, you define an Item schema beforehand in your items.py file and use this schema when scraping data.
This lets you see at a glance what structured data your project collects, and the Item will raise an exception if you try to set a field that its schema does not declare.
Because of this, using Scrapy Items has a number of advantages:
- Structures your data and gives it a clear schema.
- Enables you to easily clean and process your scraped data.
- Enables you to validate, deduplicate and monitor your data feeds.
- Enables you to easily store and export your data with Scrapy Feed Exports.
- Makes it easier to use Scrapy Item Pipelines & Item Loaders.
Scrapy supports several item types, all of which it handles through a common interface when they are yielded: dictionaries, Item objects, dataclass objects and attrs objects.
However, defining your own Item object in your items.py file is normally the best option.
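For comparison, a dataclass is one of the alternative item types that Scrapy accepts. The sketch below uses only the standard library; the class name QuoteData and the sample values are hypothetical and not part of the original project.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class QuoteData:
    # Same three fields as the QuoteItem used in this guide
    text: Optional[str] = None
    author: Optional[str] = None
    tags: List[str] = field(default_factory=list)


quote = QuoteData(text='An example quote', author='Jane Doe', tags=['life'])
print(asdict(quote))

unknown_field_rejected = False
try:
    # Like an Item, a dataclass rejects fields that were never declared
    QuoteData(autor='Jane Doe')
except TypeError:
    unknown_field_rejected = True
```

A dataclass gives you type hints and default values, while a scrapy Item gives you per-field metadata through Field(); which one fits better depends on the project.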
How To Integrate Items Into Your Spiders
Creating an Item in Scrapy is very straightforward. Simply open your items.py file and define the data you would like to scrape by inheriting from the Scrapy Item class.
# items.py
from scrapy.item import Item, Field


class QuoteItem(Item):
    # One Field per piece of data we want to collect for each quote
    text = Field()
    tags = Field()
    author = Field()
Then, inside your spider, instead of yielding a dictionary you create a new Item with the scraped data before yielding it.
# spiders/quotes.py (example file name)
import scrapy
from items_demo.items import QuoteItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = 'https://quotes.toscrape.com/'
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            # Create a fresh Item per quote so every yielded item is independent
            quote_item = QuoteItem()
            quote_item['text'] = quote.css('span.text::text').get()
            quote_item['author'] = quote.css('small.author::text').get()
            quote_item['tags'] = quote.css('div.tags a.tag::text').getall()
            yield quote_item
Now, all your scraped data will be contained in the structured QuoteItem we created, which can then be sent through Item Pipelines to clean, validate and store your data.
Processing Items with Scrapy Item Loaders
To take things a step further, we can also use Scrapy Item Loaders to process the data as our spider inserts it into our Item.
Scrapy Item Loaders provide a much more convenient mechanism for populating Items during the scraping process, by automating some common tasks like parsing the raw extracted data before assigning it.
# spiders/quotes.py (example file name)
import scrapy
from items_demo.items import QuoteItem
from scrapy.loader import ItemLoader


class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = 'https://quotes.toscrape.com/'
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            # Bind a loader to a fresh Item and to this quote's selector
            loader = ItemLoader(item=QuoteItem(), selector=quote)
            # With the default processors, each field is collected as a list
            loader.add_css('text', 'span.text::text')
            loader.add_css('author', 'small.author::text')
            loader.add_css('tags', 'div.tags a.tag::text')
            yield loader.load_item()