
Saving Scraped Data To JSON With Scrapy Feed Exporters

You've built a spider that scrapes data from a website, and now you want to save it somewhere. One of the easiest ways to save scraped data is to write it to a JSON file.

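In this guide, we will go through:

  • What Scrapy Feed Exporters are
  • Saving in JSON & JSON lines format
  • Saving JSON files via the command line
  • Saving JSON files with the FEEDS setting
  • Saving data to multiple JSON file batches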

First, let's go over what Scrapy Feed Exporters are.


What Are Scrapy Feed Exporters?

The need to save scraped data to a file is a very common requirement for developers, so to make our lives easier the developers behind Scrapy have implemented Feed Exporters.

Feed Exporters are a ready-made toolbox of methods we can use to easily save/export our scraped data into:

  • JSON & JSON lines file format
  • CSV file format
  • XML file format
  • Python's pickle format

And save them to:

  • The local machine Scrapy is running on
  • A remote machine using FTP (File Transfer Protocol)
  • Amazon S3 Storage
  • Google Cloud Storage
  • Standard output

In this guide, we will walk you through the different ways you can save JSON files from Scrapy.


Saving In JSON & JSON Lines Format

When saving in JSON format, we have two options:

  • JSON
  • JSON lines

Storing data in JSON format is fine for small amounts of data, but it doesn't scale well to large ones. Incremental (stream-mode) parsing of JSON is poorly supported, if at all, so a consumer typically has to load the entire dataset into memory at once, which can exhaust memory on large crawls.

A JSON feed is one array, and every scraped item becomes an element of it:


[
{"name": "Color TV", "price": "1200"},
{"name": "DVD player", "price": "200"}
]

As a result, it is advised to use JSON lines format if you want to save data in JSON.


{"name": "Color TV", "price": "1200"}
{"name": "DVD player", "price": "200"}

Using JSON lines allows new data to be appended to a file incrementally, and the file can be split into numerous chunks.
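
To make the difference concrete, here is a minimal sketch that uses only Python's standard library, not Scrapy itself. The file name items.jsonl and the helper names append_item and iter_items are made up for illustration. It appends items one at a time and then reads them back one line at a time, so only a single item needs to be in memory at any point.

```python
import json
import os
import tempfile


def append_item(path, item):
    """Append one item to a JSON lines file as a single line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(item) + "\n")


def iter_items(path):
    """Yield items one by one without loading the whole file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                yield json.loads(line)


# Hypothetical output location used only for this demo.
path = os.path.join(tempfile.mkdtemp(), "items.jsonl")

append_item(path, {"name": "Color TV", "price": "1200"})
append_item(path, {"name": "DVD player", "price": "200"})

names = [item["name"] for item in iter_items(path)]
print(names)
```

A regular JSON file offers no equivalent shortcut with the standard library: json.load has to read the whole array before returning the first item.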


Saving JSON Files Via The Command Line

The first and simplest way to create a JSON file of the data you have scraped is to define an output path when starting your spider from the command line.

To save to a JSON file, add the -o flag to the scrapy crawl command along with the path you want to save the file to.

You can set a relative path like below:


scrapy crawl bookspider -o bookspider_data.json

To save in JSON lines format, simply change the file format:


scrapy crawl bookspider -o bookspider_data.jsonl

Or you can set an absolute path like this:


scrapy crawl bookspider -o file:///path/to/my/project/bookspider_data.json

You have two options when using this command: a lowercase -o or a capital -O.

Flag    Description
-o      Appends new data to an existing file.
-O      Overwrites any existing file with the same name with the current data.
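
One caveat with -o is worth illustrating: appending a second run to an existing .json file leaves two arrays back to back, which is not valid JSON, whereas appended JSON lines output stays parseable. The sketch below simulates two runs with plain strings rather than calling Scrapy, and the sample items are made up.

```python
import json

run_1 = [{"name": "Color TV", "price": "1200"}]
run_2 = [{"name": "DVD player", "price": "200"}]

# Appending a second JSON array after the first one gives "[...][...]".
appended_json = json.dumps(run_1) + json.dumps(run_2)
try:
    json.loads(appended_json)
    json_is_valid = True
except json.JSONDecodeError:
    json_is_valid = False

# Appending JSON lines output: one object per line, across both runs.
appended_jsonl = "".join(json.dumps(item) + "\n" for item in run_1 + run_2)
jsonl_items = [json.loads(line) for line in appended_jsonl.splitlines()]

print(json_is_valid)
print(len(jsonl_items))
```

So when appending with -o, a .jsonl target is the safer choice. Use -O if you want a .json file regenerated on each run.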

Telling Scrapy to save the data to a JSON file via the command line is okay, but it can get a little messy. The other option is to set it in your code, which Scrapy makes very easy.


Saving JSON Files With Feeds Setting

Often the cleanest option is to tell Scrapy to save the data to a JSON file via the FEEDS setting.

We can configure it in our settings.py file by passing it a dictionary with the path/name of the file and the file format.

For JSON format:

# settings.py 

FEEDS = {
    'data.json': {'format': 'json'}
}

For JSON lines format:

# settings.py 

FEEDS = {
    'data.jsonl': {'format': 'jsonlines'}
}

You can also configure this for an individual spider by setting the custom_settings attribute in your spider.

# bookspider.py

import scrapy
from proxy_waterfall.items import BookItem


class BookSpider(scrapy.Spider):
    name = 'bookspider'
    start_urls = ["http://books.toscrape.com"]

    # Per-spider settings override the project-wide settings.py values.
    custom_settings = {
        'FEEDS': {'data.jsonl': {'format': 'jsonlines'}}
    }

    def parse(self, response):
        # Each book on the listing page sits in an <article class="product_pod">.
        for article in response.css('article.product_pod'):
            book_item = BookItem(
                url=article.css("h3 > a::attr(href)").get(),
                title=article.css("h3 > a::attr(title)").get(),
                price=article.css(".price_color::text").get(),
            )
            yield book_item

The default overwriting behaviour of the FEEDS functionality depends on where the data is going to be stored. However, you can explicitly choose whether to overwrite existing data by adding an overwrite key to the feed's options with either True or False.

# settings.py 

FEEDS = {
    'data.jsonl': {'format': 'jsonlines', 'overwrite': True}
}

When saving locally, by default overwrite is set to False. The full set of defaults can be found in the Feeds docs.
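
FEEDS can hold more than one entry, so the same crawl can write several files at once, each with its own overwrite behaviour. A sketch of a settings.py that writes both formats, with placeholder file names:

```python
# settings.py (sketch; file names are placeholders)

FEEDS = {
    # Regenerated from scratch on every run.
    "data.json": {"format": "json", "overwrite": True},
    # Kept between runs; new items are appended.
    "data.jsonl": {"format": "jsonlines", "overwrite": False},
}
```

Pairing overwrite: True with the json format avoids ending up with several arrays in one file, while the JSON lines feed can safely accumulate items across runs.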


1. Setting Dynamic File Paths/Names

Setting a static file path is okay for development or very small projects. In production, however, you likely won't want all your data saved into one big file. To solve this, Scrapy allows you to create dynamic file paths/names using spider variables.

For example, here we tell Scrapy to create a JSON lines file in the data folder, inside a subfolder named after the spider, with a file name that includes the spider name and the date it was scraped.

# settings.py 

FEEDS = {
    'data/%(name)s/%(name)s_%(time)s.jsonl': {
        'format': 'jsonlines',
    }
}

The generated path would look something like this.


"data/bookspider/bookspider_2022-05-18T07-47-03.jsonl"

Any other named parameter gets replaced by the spider attribute of the same name. For example, %(site_id)s would get replaced by the spider.site_id attribute the moment the feed is being created.
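
The placeholders use Python's printf-style string formatting, so the substitution can be reproduced outside Scrapy. This sketch is an illustration, not Scrapy's own code: the FakeSpider class, its site_id attribute, and the hard-coded timestamp are assumptions made for the demo.

```python
class FakeSpider:
    """Stand-in for a spider; only the attributes used in the URI."""
    name = "bookspider"
    site_id = "uk"  # hypothetical custom attribute


uri_template = "data/%(name)s/%(site_id)s_%(time)s.jsonl"

params = {
    "name": FakeSpider.name,
    "site_id": FakeSpider.site_id,
    # Scrapy fills this in when the feed is created; fixed here for the demo.
    "time": "2022-05-18T07-47-03",
}

path = uri_template % params
print(path)  # data/bookspider/uk_2022-05-18T07-47-03.jsonl
```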


2. Configuring Extra Functionality

The Feeds functionality has other settings that you can configure by passing key/value pairs to the FEEDS dictionary you define.

Key                 Description
encoding            The encoding to be used for the feed. If unset or set to None (default), it uses UTF-8 for everything except JSON output, which uses safe numeric encoding (\uXXXX sequences) for historic reasons.
fields              A list of fields to export, allowing you to save only certain fields from your Items.
item_classes        A list of item classes to export. If undefined or empty, all items are exported.
item_filter         A filter class to filter items to export. ItemFilter is used by default.
indent              Amount of spaces used to indent the output on each level.
store_empty         Whether to export empty feeds (i.e. feeds with no items).
uri_params          A string with the import path of a function to set the parameters to apply with printf-style string formatting to the feed URI.
postprocessing      List of plugins to use for post-processing.
batch_item_count    If assigned an integer number higher than 0, Scrapy generates multiple output files storing up to the specified number of items in each output file.

An example FEEDS setting that uses several of these would be:

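# settings.py

# MyItemClass1 has to be imported before it can be referenced directly
# (assumed here to live in myproject.items, alongside MyItemClass2);
# MyItemClass2 is referenced by its import path instead.
from myproject.items import MyItemClass1

FEEDS = {
    'data/%(name)s/%(name)s_%(time)s.jsonl': {
        'format': 'jsonlines',
        'encoding': 'utf8',
        'store_empty': False,
        'item_classes': [MyItemClass1, 'myproject.items.MyItemClass2'],
        'fields': None,
        'indent': 4,
        'item_export_kwargs': {
            'export_empty_fields': True,
        },
    }
}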
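
Two of the keys listed earlier, item_filter and uri_params, point at code you write yourself. The sketch below shows the general shape of each, assuming a filter class exposes an accepts(item) method and a uri_params function takes (params, spider) and returns a new dict, which is how the Scrapy docs describe them. The names PricedItemFilter, add_site_id, and site_id are hypothetical, and a SimpleNamespace stands in for a real spider so the logic can be checked without running a crawl.

```python
from types import SimpleNamespace


class PricedItemFilter:
    """Hypothetical filter: only export items that have a price."""

    def __init__(self, feed_options):
        self.feed_options = feed_options

    def accepts(self, item):
        return bool(item.get("price"))


def add_site_id(params, spider):
    """Hypothetical uri_params function: makes %(site_id)s available in the URI."""
    return {**params, "site_id": getattr(spider, "site_id", "unknown")}


# Quick check outside Scrapy with stand-in objects.
item_filter = PricedItemFilter(feed_options={})
sample_items = [{"title": "Book A", "price": "51.77"}, {"title": "Book B", "price": ""}]
kept = [item for item in sample_items if item_filter.accepts(item)]

spider = SimpleNamespace(name="bookspider", site_id="uk")
params = add_site_id({"name": spider.name}, spider)

print(len(kept))
print(params)
```

In a feed's options these would then be referenced by import path, for example 'uri_params': 'myproject.utils.add_site_id', where the module path is again only a placeholder.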


Saving Data To Multiple JSON File Batches

Depending on your job, you may want to store the scraped data in numerous file batches instead of in one large JSON lines file to make it more manageable. Scrapy makes this very easy with the batch_item_count key, which you can set in your FEEDS settings.

Simply add the batch_item_count key to your FEEDS settings and set the number of items you would like in each file. Scrapy will then start a new file each time this limit is reached.

Note: You will also need to add at least one of the following placeholders in the feed URI to indicate how the different output file names are generated:

Placeholder       Description
%(batch_time)s    Inserts a timestamp when the batch is being created.
%(batch_id)d      Inserts a 1-based sequence number of the batch.

For example, these Feed settings will break the data up into numerous batches of equal size (except the last batch).

# settings.py 

FEEDS = {
    'data/%(name)s/%(name)s_batch_%(batch_id)d.jsonl': {
        'format': 'jsonlines',
        'batch_item_count': 10,
    }
}
}

This produces the following batch files, with up to 10 items in each:


"data/bookspider/bookspider_batch_1.jsonl"
"data/bookspider/bookspider_batch_2.jsonl"
"data/bookspider/bookspider_batch_3.jsonl"
"data/bookspider/bookspider_batch_4.jsonl"
"data/bookspider/bookspider_batch_5.jsonl"
"data/bookspider/bookspider_batch_6.jsonl"

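The batch file names follow directly from printf-style formatting of the feed URI. As a rough illustration outside Scrapy, the sketch below assumes a crawl that yielded 55 items (a made-up number) and works out which batch files a batch_item_count of 10 would give. It also shows that a format such as %(batch_id)05d zero-pads the batch number, which keeps the files in order in directory listings.

```python
import math

uri_template = "data/%(name)s/%(name)s_batch_%(batch_id)d.jsonl"
batch_item_count = 10
total_items = 55  # assumed number of scraped items

# The last batch holds the remainder, so round up.
batch_total = math.ceil(total_items / batch_item_count)

paths = [
    uri_template % {"name": "bookspider", "batch_id": batch_id}
    for batch_id in range(1, batch_total + 1)  # batch ids are 1-based
]
print(paths[0])    # data/bookspider/bookspider_batch_1.jsonl
print(len(paths))  # 6

# Zero-padded variant of the same placeholder.
padded = "bookspider_batch_%(batch_id)05d.jsonl" % {"batch_id": 3}
print(padded)      # bookspider_batch_00003.jsonl
```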

More Scrapy Tutorials

We've covered everything you need to know about saving data to JSON files with Scrapy. If you would like to save your JSON files to AWS S3, then check out our Saving CSV/JSON Files to Amazon AWS S3 Bucket guide.

If you would like to learn more about Scrapy in general, then be sure to check out The Scrapy Playbook.