Saving Scraped Data To Postgres Database With Scrapy Pipelines
If you're scraping a website, you need to save that data somewhere. Postgres is a great option: it is a battle-tested database, trusted by thousands of companies to handle huge amounts of data.
In this guide, we will go through how to save our data to a Postgres database using Scrapy pipelines:
- What Are Scrapy Item Pipelines?
- Setting Up A Postgres Database
- Saving Data To A Postgres Database
- Only Saving New Data
First, let's go over what Scrapy Item Pipelines are.
What Are Scrapy Item Pipelines?
Item Pipelines are Scrapy's way of processing the data scraped by your spiders.
After an item has been scraped by a spider, it is sent to the Item Pipeline, which passes it through a configurable sequence of steps that clean and process the scraped data before ultimately saving it somewhere.
You can use Item Pipelines to:
- Clean HTML data
- Validate scraped data
- Check for and remove duplicate data (see the sketch just after this list)
- Store the data in a database
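To make the last two points concrete, here is a minimal sketch of a standalone pipeline that validates and de-duplicates items. It is only an illustration: the ValidationPipeline class name is made up, and it assumes items with a text field like the ones we scrape later in this guide.
# pipelines.py (illustrative sketch only, not part of the project below)
from scrapy.exceptions import DropItem

class ValidationPipeline:
    def __init__(self):
        self.seen_texts = set()

    def process_item(self, item, spider):
        ## Drop items that are missing the field we care about
        if not item.get('text'):
            raise DropItem("Missing text in item")

        ## Drop items we have already seen during this crawl
        if item['text'] in self.seen_texts:
            raise DropItem("Duplicate item found")

        self.seen_texts.add(item['text'])
        return item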
For the purpose of this guide, we're going to focus on using Item Pipelines to store data in a Postgres database.
Setting Up A Postgres Database
To get started, we first need to set up a Postgres database.
You can either set one up on your local machine using one of the official Postgres installers, or use a hosted version with a cloud provider like DigitalOcean.
Once set up, you should have access to your database connection details, for example:
host="localhost",
database="my_database",
user="root",
password="123456"
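You will also need a database for this guide to connect to (we use one called quotes later on). Assuming a local install with the bundled command line tools and the default postgres superuser, one way to create it is:
createdb -U postgres quotes
Alternatively, you can create the database from whatever admin UI your Postgres install or cloud provider gives you.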
Saving Data to a Postgres Database
Okay, now let's integrate saving data into our Postgres database.
1. Install psycopg2
To interact with our Postgres database we need a library to handle the connection. For this we will install psycopg2:
pip install psycopg2
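To check that the install worked and that your connection details are correct, you can run a quick smoke test like the one below. The connection values are just the example ones from this guide; swap in your own.
# check_connection.py (optional smoke test)
import psycopg2

connection = psycopg2.connect(
    host="localhost",
    dbname="quotes",
    user="postgres",
    password="*******"  # your password
)

cur = connection.cursor()
cur.execute("SELECT version();")
print(cur.fetchone())

cur.close()
connection.close()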
2. Set Up Our Pipeline
Next, we need to open our pipelines.py file and set up our pipeline. The default file should look like this:
# pipelines.py
from itemadapter import ItemAdapter

class PostgresDemoPipeline:
    def process_item(self, item, spider):
        return item
Now we will configure this empty pipeline to store our data.
Note: For this guide I created a Scrapy project called postgres_demo (hence the default pipeline is PostgresDemoPipeline), and am using this spider:
# spiders/quotes.py
import scrapy
from postgres_demo.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = 'https://quotes.toscrape.com/'
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        quote_item = QuoteItem()
        for quote in response.css('div.quote'):
            quote_item['text'] = quote.css('span.text::text').get()
            quote_item['author'] = quote.css('small.author::text').get()
            quote_item['tags'] = quote.css('div.tags a.tag::text').getall()
            yield quote_item
And the Item:
# items.py
from scrapy.item import Item, Field

class QuoteItem(Item):
    text = Field()
    tags = Field()
    author = Field()
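One thing to remember before moving on: Scrapy will only run our pipeline if it is enabled in settings.py. Assuming the default layout of the postgres_demo project, that looks like this:
# settings.py
ITEM_PIPELINES = {
    'postgres_demo.pipelines.PostgresDemoPipeline': 300,
}
The number (300 here) sets the order pipelines run in when you have several; lower numbers run first.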
3. Connect to Postgres & Create a Table
First, we're going to import psycopg2 into our pipelines.py file, and create an __init__ method that we will use to connect to our database and create our table.
# pipelines.py
import psycopg2

class PostgresDemoPipeline:
    def __init__(self):
        pass

    def process_item(self, item, spider):
        return item
Inside the __init__ method, we will configure the pipeline to do the following every time the pipeline gets activated by a spider:
- Connect to our quotes database (the database itself needs to exist already; you can create it with the createdb command shown earlier).
- Create a cursor which we will use to execute SQL commands in the database.
- Create a new table quotes with the columns content, tags and author, if one doesn't already exist in the database.
# pipelines.py
import psycopg2

class PostgresDemoPipeline:
    def __init__(self):
        ## Connection Details
        hostname = 'localhost'
        username = 'postgres'
        password = '*******' # your password
        database = 'quotes'

        ## Create/Connect to database
        self.connection = psycopg2.connect(host=hostname, user=username, password=password, dbname=database)

        ## Create cursor, used to execute commands
        self.cur = self.connection.cursor()

        ## Create quotes table if none exists
        self.cur.execute("""
        CREATE TABLE IF NOT EXISTS quotes(
            id serial PRIMARY KEY,
            content text,
            tags text,
            author VARCHAR(255)
        )
        """)

    def process_item(self, item, spider):
        return item
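At this stage the pipeline connects to Postgres and creates the quotes table whenever a spider starts, but process_item does not save anything yet. To confirm the table is being created, run the spider:
scrapy crawl quotes
Then you can inspect the table with a small throwaway script like the one below. It is not part of the pipeline, and it assumes the same connection details as above.
# check_table.py (optional, just to verify the quotes table exists)
import psycopg2

connection = psycopg2.connect(host='localhost', user='postgres', password='*******', dbname='quotes')
cur = connection.cursor()
cur.execute("""
    SELECT column_name, data_type
    FROM information_schema.columns
    WHERE table_name = 'quotes'
""")
for column in cur.fetchall():
    print(column)

cur.close()
connection.close()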