

freeCodeCamp Scrapy Beginners Course Part 3: Creating Scrapy Project

In Part 3 of the Scrapy Beginner Course, we go through how to create a Scrapy project and explain each of its components.

We will walk through creating the project and then give an overview of each of its main building blocks.

The code for this part of the course is available on Github here!

If you prefer video tutorials, then check out the video version of this course on the freeCodeCamp channel here.




How To Create A Scrapy Project

Now that we have our virtual environment setup and Scrapy installed, we can get onto the fun stuff: creating our first Scrapy project.

Our Scrapy project will hold all the code for our scrapers, and gives us a pre-built template for how we should structure our scrapers when using Scrapy.

To create a Scrapy project, we need to run the following command in our command line:

scrapy startproject <project_name>

In our project's case, as we're going to be scraping the BooksToScrape website, we will call our project bookscraper, but you can use any project name you like.

scrapy startproject bookscraper

Now if we enter the ls command into the command line we should see the following files/folders:


├── scrapy.cfg
└── bookscraper

Overview of The Scrapy Project Structure

To help us understand what we've just done, and how Scrapy structures its projects, we're going to pause for a second.

First, let's look at what the scrapy startproject bookscraper command we just ran actually did. If you open the folder in VS Code or another code editor, you should see the full folder structure.

You should see something like this:

├── scrapy.cfg
└── bookscraper
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

When we ran the scrapy startproject bookscraper command, Scrapy automatically generated a template project for us to use.

This folder structure illustrates the 5 main building blocks of every Scrapy project: Spiders, Items, Middlewares, Pipelines and Settings.

Using these 5 building blocks you can create a scraper to do pretty much anything.

We won't be using all of these files in this beginner project, but we will give a quick explanation of each, as each one has a specific purpose:

  • settings.py is where all your project settings are contained, like activating pipelines, middlewares etc. Here you can change the delays, concurrency, and lots more things (a short example sketch follows this list).
  • items.py is a model for the extracted data. You can define a custom model (like a ProductItem) that will inherit the Scrapy Item class and contain your scraped data.
  • pipelines.py is where the item yielded by the spider gets passed. It's mostly used to clean the text and connect to file outputs or databases (CSV, JSON, SQL, etc).
  • middlewares.py is useful when you want to modify how the request is made and how Scrapy handles the response.
  • scrapy.cfg is a configuration file used to change some deployment settings, etc.
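
As a rough illustration of the kind of options you will tweak in settings.py, here is a minimal sketch (the values shown are just placeholders, not recommendations):

# settings.py (illustrative values only)

BOT_NAME = "bookscraper"

## Respect robots.txt rules (enabled by default in new projects)
ROBOTSTXT_OBEY = True

## Wait 1 second between requests to the same site
DOWNLOAD_DELAY = 1

## Limit how many requests Scrapy makes in parallel
CONCURRENT_REQUESTS = 8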

The most fundamental of these are Spiders.

More Complex Explanations

Below, we explain the 5 main building blocks of every Scrapy project (Spiders, Items, Middlewares, Pipelines and Settings) in more detail. If you are new to Python and/or Scrapy, this might be too much information too fast, so feel free to skip this section, as we will be covering each building block in more detail later.

However, if you would like a high-level overview of Spiders, Items, Middlewares, Pipelines and Settings, then check out the following sections.


Scrapy Spiders Explained

Scrapy spiders are where the magic happens. "Spiders" is the Scrapy name for the main Python class that extracts the data you need from a website.

In your Scrapy project, you can have multiple Spiders all scraping the same or different websites and storing the data in different places.

Anything you could do with a Python Requests/BeautifulSoup scraper you can do with a Scrapy Spider.


import scrapy

class BooksSpider(scrapy.Spider):
    name = 'books'

    def start_requests(self):
        url = 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        item = {}
        product = response.css("div.product_main")
        item["title"] = product.css("h1 ::text").extract_first()
        item['category'] = response.xpath(
            "//ul[@class='breadcrumb']/li[@class='active']/preceding-sibling::li[1]/a/text()"
        ).extract_first()
        item['description'] = response.xpath(
            "//div[@id='product_description']/following-sibling::p/text()"
        ).extract_first()
        item['price'] = response.css('p.price_color ::text').extract_first()
        yield item

To run this Spider, you simply need to run:


scrapy crawl books

When the above Spider is run, it will send a request to https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html and once it has responded it will scrape all book data from the page.

There are a couple of things to point out here:

  1. Asynchronous - As Scrapy is built using the Twisted framework, when you send a request to a website it isn't blocking. Scrapy will send the request to the website, and once it has retrieved a successful response it will trigger the parse method using the callback defined in the original Scrapy Request yield scrapy.Request(url, callback=self.parse).
  2. Spider Name - Every spider in your Scrapy project must have a unique name so that Scrapy can identify it. You set this using the name = 'books' attribute.
  3. Start Requests - You define the starting points for your spider using the start_requests() method. Subsequent requests can be generated successively from these initial requests (see the sketch after this list).
  4. Parse - You use the parse() method to process the response from the website and extract the data you need. After extraction this data is sent to the Item Pipelines using the yield command.
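
To make point 3 concrete, here is a minimal sketch of a spider that generates follow-up requests from its initial request by following pagination links (the BooksPaginationSpider name and selectors are just for illustration and are not part of this project):

import scrapy

class BooksPaginationSpider(scrapy.Spider):
    name = 'books_pagination'

    def start_requests(self):
        yield scrapy.Request('https://books.toscrape.com/', callback=self.parse)

    def parse(self, response):
        ## Extract some data from the current page
        for book in response.css('article.product_pod'):
            yield {'title': book.css('h3 a::attr(title)').get()}

        ## Generate a follow-up request for the next page, if there is one
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)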

Although this Scrapy spider is a bit more structured than your typical Python Requests/BeautifulSoup scraper, it accomplishes the same things.

However, it is with Scrapy Items, Middlewares, Pipelines and Settings that Scrapy really stands out versus Python Requests/BeautifulSoup.


Scrapy Items Explained

Scrapy Items are how we store and process our scraped data. They provide a structured container for the data we scrape so that we can clean, validate and store it easily with Scrapy ItemLoaders, Item Pipelines, and Feed Exporters.

Using Scrapy Items has a number of advantages:

  • Structures your data and gives it a clear schema.
  • Enables you to easily clean and process your scraped data.
  • Enables you to validate, deduplicate and monitor your data feeds.
  • Enables you to easily store and export your data with Scrapy Feed Exports.
  • Makes it easy to use Scrapy Item Pipelines & Item Loaders.

We typically define our Items in our items.py file.

# items.py

from scrapy.item import Item, Field

class BookItem(Item):
    title = Field()
    category = Field()
    description = Field()
    price = Field()

Then inside your spider, instead of yielding a dictionary, you would create a new Item with the scraped data before yielding it.


import scrapy
from bookscraper.items import BookItem

class BooksSpider(scrapy.Spider):
    name = 'books'

    def start_requests(self):
        url = 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        book_item = BookItem()
        product = response.css("div.product_main")
        book_item["title"] = product.css("h1 ::text").extract_first()
        book_item['category'] = response.xpath(
            "//ul[@class='breadcrumb']/li[@class='active']/preceding-sibling::li[1]/a/text()"
        ).extract_first()
        book_item['description'] = response.xpath(
            "//div[@id='product_description']/following-sibling::p/text()"
        ).extract_first()
        book_item['price'] = response.css('p.price_color ::text').extract_first()
        yield book_item
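
Related to this, Scrapy also ships with ItemLoaders, which populate Items for you and can apply processors to the scraped values along the way. They aren't needed for this beginner project, but as a rough sketch of the idea (the BookItemLoader class below is just for illustration):

from itemloaders.processors import TakeFirst
from scrapy.loader import ItemLoader

class BookItemLoader(ItemLoader):
    ## Take the first extracted value for each field instead of a list
    default_output_processor = TakeFirst()

## Inside a spider callback you could then do something like:
## loader = BookItemLoader(item=BookItem(), response=response)
## loader.add_css('title', 'div.product_main h1::text')
## loader.add_css('price', 'p.price_color::text')
## yield loader.load_item()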


Scrapy Item Pipelines Explained

Item Pipelines are the data processors of Scrapy: all our scraped Items pass through them, and they are where we can clean, process, validate, and store our data.

Using Scrapy Pipelines we can:

  • Clean our data (ex. remove currency signs from prices)
  • Format our data (ex. convert strings to ints)
  • Enrich our data (ex. convert relative links to absolute links)
  • Validate our data (ex. make sure the price scraped is a viable price)
  • Store our data in databases, queues, files or object storage buckets.
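
As a quick sketch of the cleaning and formatting use cases, a pipeline that strips the currency sign from the scraped price and converts it to a float might look something like this (the PriceCleanerPipeline name is made up for illustration):

# pipelines.py (illustrative sketch)

class PriceCleanerPipeline:

    def process_item(self, item, spider):
        ## Turn a price string like '£51.77' into the float 51.77
        if item.get('price'):
            item['price'] = float(item['price'].replace('£', '').strip())
        return item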

Pipelines can also handle storage. For example, here is an Item Pipeline that stores our scraped data in a Postgres database:

# pipelines.py

import psycopg2

class PostgresDemoPipeline:

    def __init__(self):
        ## Connection Details
        hostname = 'localhost'
        username = 'postgres'
        password = '******' # your password
        database = 'books'

        ## Create/Connect to database
        self.connection = psycopg2.connect(host=hostname, user=username, password=password, dbname=database)

        ## Create cursor, used to execute commands
        self.cur = self.connection.cursor()

        ## Create books table if none exists
        self.cur.execute("""
        CREATE TABLE IF NOT EXISTS books(
            id serial PRIMARY KEY,
            title text,
            category text,
            description text
        )
        """)

    def process_item(self, item, spider):

        ## Define insert statement
        self.cur.execute(""" insert into books (title, category, description) values (%s,%s,%s)""", (
            item["title"],
            str(item["category"]),
            item["description"]
        ))

        ## Execute insert of data into database
        self.connection.commit()
        return item

    def close_spider(self, spider):

        ## Close cursor & connection to database
        self.cur.close()
        self.connection.close()
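
For a pipeline like this to actually run, it also needs to be enabled in your settings.py file. Assuming the project is called bookscraper, the entry would look something like this:

# settings.py

ITEM_PIPELINES = {
    'bookscraper.pipelines.PostgresDemoPipeline': 300,
}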


Scrapy Middlewares Explained

As we've discussed, Scrapy is a complete web scraping framework that manages a lot of the complexity of scraping at scale for you behind the scenes without you having to configure anything.

Most of this functionality is contained within Middlewares in the form of Downloader Middlewares and Spider Middlewares.


Downloader Middlewares

Downloader middlewares are specific hooks that sit between the Scrapy Engine and the Downloader, which process requests as they pass from the Engine to the Downloader, and responses as they pass from Downloader to the Engine.

By default Scrapy has the following downloader middlewares enabled:

# settings.py

DOWNLOADER_MIDDLEWARES_BASE = {
    # Engine side
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
    # Downloader side
}


These middlewares control everything from:

  • Timing out requests
  • What headers to send with your requests
  • What user agents to use with your requests
  • Retrying failed requests
  • Managing cookies, caches and response compression

You can disable any of these default middlewares by setting it to None in your settings.py file. Here is an example of disabling the RobotsTxtMiddleware.

# settings.py

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': None,
}


You can also override existing middlewares, or insert your own completely new middlewares if you want to:

  • alter a request just before it is sent to the website (change the proxy, user-agent, etc.)
  • change received response before passing it to a spider
  • retry a request if the response doesn't contain the correct data instead of passing received response to a spider
  • pass response to a spider without fetching a web page
  • silently drop some requests

Here is an example of inserting our own middleware to use a proxy with all of our requests. We will create this in our middlewares.py file:

## middlewares.py

import base64

class MyProxyMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def __init__(self, settings):
        self.user = settings.get('PROXY_USER')
        self.password = settings.get('PROXY_PASSWORD')
        self.endpoint = settings.get('PROXY_ENDPOINT')
        self.port = settings.get('PROXY_PORT')

    def process_request(self, request, spider):
        user_credentials = '{user}:{passw}'.format(user=self.user, passw=self.password)
        basic_authentication = 'Basic ' + base64.b64encode(user_credentials.encode()).decode()
        host = 'http://{endpoint}:{port}'.format(endpoint=self.endpoint, port=self.port)
        request.meta['proxy'] = host
        request.headers['Proxy-Authorization'] = basic_authentication

We would then enable it in our settings.py file, and fill in our proxy connection details:

## settings.py

PROXY_USER = 'username'
PROXY_PASSWORD = 'password'
PROXY_ENDPOINT = 'proxy.proxyprovider.com'
PROXY_PORT = '8000'

DOWNLOADER_MIDDLEWARES = {
    'bookscraper.middlewares.MyProxyMiddleware': 350,
}


Spider Middlewares

Spider middlewares are specific hooks that sit between the Scrapy Engine and the Spiders, and which process spider input (responses) and output (items and requests).

By default Scrapy has the following spider middlewares enabled:

# settings.py

SPIDER_MIDDLEWARES_BASE = {
    # Engine side
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,
    'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,
    'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800,
    'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,
    # Spider side
}

Spider middlewares are used to:

  • post-process output of spider callbacks - change/add/remove requests or items
  • post-process start_requests
  • handle spider exceptions
  • call errback instead of callback for some of the requests based on response content
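
As a rough sketch of the first use case, a custom spider middleware that post-processes the output of your spider callbacks might look something like this (the DropItemsWithoutPriceMiddleware name is made up, and it assumes the spider yields plain dictionaries):

## middlewares.py (illustrative sketch)

class DropItemsWithoutPriceMiddleware:

    def process_spider_output(self, response, result, spider):
        ## result is everything the spider callback yielded (items and requests)
        for item_or_request in result:
            ## Silently drop dictionary items that have no price set
            if isinstance(item_or_request, dict) and not item_or_request.get('price'):
                continue
            yield item_or_request

You would enable it with the SPIDER_MIDDLEWARES setting, in the same way downloader middlewares are enabled with DOWNLOADER_MIDDLEWARES.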

Like downloader middlewares, you can disable any of these default spider middlewares by setting it to None in your settings.py file. Here is an example of disabling the RefererMiddleware.

# settings.py

SPIDER_MIDDLEWARES = {
    'scrapy.spidermiddlewares.referer.RefererMiddleware': None,
}


Scrapy Settings Explained

The settings.py file is the central control panel for your Scrapy project. You can enable/disable default functionality or integrate your own custom middlewares and extensions.

You can change the settings on a project basis by updating the settings.py file, or on an individual Spider basis by adding custom_settings to each spider.

In the following example, we use the custom_settings attribute to add custom settings to our spider so that the scraped data will be saved to a data.csv file.


import scrapy
from bookscraper.items import BookItem

class BooksSpider(scrapy.Spider):
    name = 'books'
    custom_settings = {
        'FEEDS': { 'data.csv': { 'format': 'csv',}}
    }

    def start_requests(self):
        url = 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        book_item = BookItem()
        product = response.css("div.product_main")
        book_item["title"] = product.css("h1 ::text").extract_first()
        book_item['category'] = response.xpath(
            "//ul[@class='breadcrumb']/li[@class='active']/preceding-sibling::li[1]/a/text()"
        ).extract_first()
        book_item['description'] = response.xpath(
            "//div[@id='product_description']/following-sibling::p/text()"
        ).extract_first()
        book_item['price'] = response.css('p.price_color ::text').extract_first()
        yield book_item
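
Alternatively, if you wanted the same behaviour for every spider in the project, you could set the equivalent option in settings.py instead of on the individual spider. A minimal sketch:

# settings.py

FEEDS = {
    'data.csv': {'format': 'csv'},
}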

There is a huge range of settings you can configure in Scrapy, so if you'd like to explore them all, here is the complete list of default settings Scrapy provides.


Next Steps

Now that we have our Scrapy Project setup we will move onto creating our first Scrapy spider.

All parts of the 12 Part freeCodeCamp Scrapy Beginner Course are as follows: