How To Customise Scrapy: Extensions, Middlewares & Pipelines Explained
What makes Scrapy great isn't just the amount of functionality it ships with out of the box; it's that Scrapy's core behavior is so easy to customize once you understand how Scrapy works and how to create your own Scrapy Extensions, Downloader Middlewares, and Spider Middlewares.
With over 150 open-source Scrapy extensions available and the ability to easily create your own, you can greatly improve how Scrapy performs for your particular use case with little to no effort.
In this guide, we're going to explain how to customize Scrapy using Extensions, Middlewares and Pipelines.
- Why Customize Scrapy?
- Extensions, Middlewares, and Pipelines: An Overview and Comparison
- Understanding Scrapy Extensions
- Scrapy Middlewares: Customizing the Spider Workflow
- Scrapy Pipelines: Managing Item Processing
- Integrating Custom Components in Scrapy
- Advanced Tips for Customizing Scrapy
- Real-World Examples of Scrapy Customization
- Conclusion
- More Python Web Scraping Guides
Why Customize Scrapy?
Scrapy is a full-fledged and fully customizable web scraping framework. When it comes to performance, almost nothing beats Scrapy.
It's asynchronous by default which allows you to crawl multiple pages concurrently right out of the box! Scrapy is arguably the most efficient Python scraping framework out there.
However, when you start a new Scrapy project, it's pretty barebones. Every scrape is different and the developers behind Scrapy understand this. When you customize your Scrapy project, you can tweak it to better fit your target site and your desired results.
With custom extensions, you can alter the global behavior of your project. If you want to save parts of your log to a file, you can do that in well under 50 lines of code.
With custom middleware, you can alter the behavior of your scraper during operation to add things like retry logic.
With custom pipelines, Scrapy allows you to do whatever you want with your data, whether you want to save it to a CSV, or update a database directly!
Before continuing, you should create a new Scrapy project. You can use it to follow along and customize your own Scrapy project.
scrapy startproject my_custom_crawler
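This command generates a standard project skeleton. The exact files can vary slightly between Scrapy versions, but the layout generally looks like this, and it's where we'll add our custom extensions, middlewares, and pipelines throughout this guide:
my_custom_crawler/
    scrapy.cfg
    my_custom_crawler/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py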
Extensions, Middlewares, and Pipelines: An Overview and Comparison
In Scrapy, extensions, middlewares, and pipelines play distinct yet interconnected roles in customizing and optimizing your web scraping workflow.
- Extensions extend Scrapy’s core functionality, enabling tasks like monitoring performance or sending alerts.
- Middlewares act as intermediaries between Scrapy’s engine and requests or responses, making them ideal for handling user-agent rotation, proxy management, or request modifications.
- Finally, pipelines focus on processing and storing scraped data, handling tasks like cleaning, validating, and exporting items.
To help you better understand their roles and how they differ, we’ve included a comparison table that outlines their key functions, typical use cases, and implementation details.
This table will serve as a quick reference to guide you in determining when and how to use each component effectively in your projects.
Aspect | Extensions | Middleware | Pipelines |
---|---|---|---|
Primary Purpose | Modify global project behavior | Customize request and response processing | Process and store scraped data |
Scope | Project-wide (affects all spiders and processes) | Operates on requests (Downloader Middleware) or responses (Spider Middleware) during the crawl | Focused on individual scraped items |
Common Use Cases | - Logging - Monitoring - Error reporting - Alerts | - Retry logic - Proxy integration - Adding custom headers - Response modification | - Data cleaning - Validation - Storing data (e.g., saving to database or file) |
Integration Level | High-level hooks triggered by Scrapy signals | Middle-layer between Scrapy engine and Downloader/Spider | Post-processing step for handling extracted items |
Advantages | - Centralized modifications - Easy to monitor spider activity | - Flexibility for handling requests/responses - Granular control over scraping logic | - Simplifies data processing and storage logic |
Disadvantages | - Errors can impact the entire project - Requires careful testing | - Can introduce complexity in managing requests/responses - Needs configuration for specific tasks | - Can be resource-intensive with large datasets - Requires proper validation to avoid errors |
Understanding Scrapy Extensions
Scrapy extensions provide a powerful way to augment the functionality of your scraping projects, allowing you to customize and enhance Scrapy’s behavior without modifying its core framework.
Extensions are particularly useful for tasks such as monitoring the performance of your spiders, logging important events, sending alerts, or integrating with external tools and services.
What are Scrapy Extensions?
Extensions are used to modify the global behavior of your project. Whether you run multiple crawls using multiple spiders, or just one single spider, your project will always follow the rules set by your extensions.
If we create an extension to log the final status of each response, every time we're finished with a response, the status will come up in our log.
- Scope: Extensions modify the global behavior of the project. If you implement a logging extension, it will trigger whenever a spider is opened or a crawl starts.
- Use cases: Extensions are great for monitoring your entire Scrapy system. Think of the logging example again: if you receive a bad HTTP response, it gets recorded in the log for you to inspect later. If you want to receive an email when your scraper encounters an error, an extension is the right place to implement it.
- Pros: Extensions greatly enhance the flexibility of the full project. With a relatively small amount of code, you can make large changes to the project in a single place.
- Cons: Because extensions operate at a global scope, a mistake in one can lead to project-wide bugs. For the same reason, extensions shouldn't interact with your request/response handling or your extraction logic.
Built-in vs. Custom Extensions
There are many extensions that come built-in with Scrapy. These extensions are useful tools for debugging, pausing, throttling, memory management and much more.
You can get a better understanding of these extensions below.
- TelnetConsole: This allows you to debug, inspect and modify your scraper during runtime.
- LogStats: This extension logs basic stats from your scraper such as requests and pages you've scraped.
- CoreStats: Collects core stats such as item counts and status code counts and saves them to the stats collector.
- AutoThrottle: Throttles your scraper based on server responses. This is incredibly useful in managing latency.
- MemoryUsage: Monitor the memory usage of your scraper. You can set a shutoff when your memory exceeds a specific usage level.
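Most built-in extensions are configured through settings rather than code. As a rough sketch (the values below are illustrative starting points, not recommendations), you could tune AutoThrottle and MemoryUsage in settings.py like this:
# settings.py -- example values for two built-in extensions
AUTOTHROTTLE_ENABLED = True              # enable the AutoThrottle extension
AUTOTHROTTLE_START_DELAY = 1.0           # initial download delay in seconds
AUTOTHROTTLE_MAX_DELAY = 30.0            # cap the delay when the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0    # average concurrent requests per domain

MEMUSAGE_ENABLED = True                  # enable the MemoryUsage extension
MEMUSAGE_LIMIT_MB = 512                  # shut the crawl down above this usage
MEMUSAGE_NOTIFY_MAIL = ["you@example.com"]  # placeholder address for warning emails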
Limitations of Built-in Extensions
Built-in extensions are sort of like a "one size fits all" piece of clothing. They're meant to cover basic usage for the average person.
If you want to perform specific actions, these built-in extensions limit what you're able to do because they're meant to be generic.
There's also a bit of performance overhead with these extensions because they're made to do so much.
If you're using an extension for one piece of functionality, you don't need the other 5 functions that come with it.
Popular Open Source Scrapy Extensions
The Scrapy ecosystem boasts a variety of open-source extensions that enhance its functionality and simplify common scraping tasks.
Here are some popular open-source Scrapy extensions:
- ScrapeOps Scrapy Extension: This one comes right from us. It sends your job stats to the ScrapeOps dashboard so you can monitor your scrapers and set up alerts in one place.
- Spidermon: Monitor your spiders to check your output data and create custom alerts. Get custom notifications via Slack, Telegram, Discord and email.
- Spider Feeder: Place a file inside your Scrapy project and crawl every URL from the file. It handles almost any file type (.txt, .csv, .json) quickly, without the need for boilerplate code.
- Scrapy Statsd: Send your stats to a hosted server. This can help you manage many scrapers all from one place.
- Scrapy JSONRPC: Control your scraper using JSON-RPC commands.
Steps to Create a Custom Extension
When creating an extension, we need to follow several steps.
- We need to identify the hook for the extension.
- Then, we write the extension class.
- Finally, we adjust our Scrapy settings to use the new extension.
1. Identify the Hook
There are a number of hooks we can use to trigger our extension. The table below outlines a few of the most common ones.
Hook | Triggered When | Use Case |
---|---|---|
spider_opened | A spider starts crawling. | Initialize resources |
spider_closed | A spider finishes crawling. | Cleanup resources/Create a report. |
response_received | A response is received. | Process/Log response data globally. |
engine_started | The Scrapy engine starts running. | Setup global monitoring systems. |
item_scraped | An item is successfully scraped. | Track or process scraped data. |
request_dropped | Request has been dropped/filtered. | Log dropped requests for debugging. |
2. Write the Extension Class
Next, you need to write your extension class. Here's an extension for a custom logger that we'll use.
class CustomLoggerExtension:
def __init__(self, stats, log_file):
self.stats = stats
self.log_file = log_file
self.start_time = None
self.end_time = None
self.logger = logging.getLogger("custom_logger")
self.logger.setLevel(logging.INFO)
handler = logging.FileHandler(self.log_file)
handler.setLevel(logging.INFO)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)
self.logger.addHandler(handler)
@classmethod
def from_crawler(cls, crawler):
log_file = crawler.settings.get("CUSTOM_LOG_FILE", "custom_log.txt")
ext = cls(crawler.stats, log_file)
crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
crawler.signals.connect(ext.response_received, signal=signals.response_received)
return ext
def spider_opened(self, spider):
self.logger.info(f"Spider {spider.name} opened.")
self.start_time = datetime.now()
def spider_closed(self, spider):
self.end_time = datetime.now()
runtime = self.end_time - self.start_time
self.logger.info(f"Spider {spider.name} closed. Time elapsed: {runtime} seconds")
def response_received(self, response, request, spider):
self.logger.info(f"Response received, status: {response.status} for {response.url}")
3. Activate the Extension
Once you've created your class, you need to adjust your settings to account for this new class.
Scroll down to the EXTENSIONS section of settings.py.
Uncomment the section to turn on custom extensions and add the path to the extension you just created.
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
EXTENSIONS = {
"scrapy.extensions.telnet.TelnetConsole": None,
"my_custom_crawler.extensions.custom_logger.CustomLoggerExtension": 500
}
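Because the extension reads CUSTOM_LOG_FILE from the crawler settings (falling back to custom_log.txt), you can optionally point it at a different file in settings.py; the path below is just an example:
CUSTOM_LOG_FILE = "logs/custom_log.txt"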
Example: Creating an Extension for Runtime Stats Collection
The following extension monitors some basic stats for us and saves them to a log file.
- Whenever the spider is opened, we get the current time with the datetime module.
- Once the spider is closed, we get the datetime again.
- We use these times to calculate the total runtime of the spider.
- We also log the status code of any responses that the spider receives.
Here is the full code to our custom logger.
import logging
from datetime import datetime
from scrapy import signals
class CustomLoggerExtension:
def __init__(self, stats, log_file):
self.stats = stats
self.log_file = log_file
self.start_time = None
self.end_time = None
self.logger = logging.getLogger("custom_logger")
self.logger.setLevel(logging.INFO)
handler = logging.FileHandler(self.log_file)
handler.setLevel(logging.INFO)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)
self.logger.addHandler(handler)
@classmethod
def from_crawler(cls, crawler):
log_file = crawler.settings.get("CUSTOM_LOG_FILE", "custom_log.txt")
ext = cls(crawler.stats, log_file)
crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
crawler.signals.connect(ext.response_received, signal=signals.response_received)
return ext
def spider_opened(self, spider):
self.logger.info(f"Spider {spider.name} opened.")
self.start_time = datetime.now()
def spider_closed(self, spider):
self.end_time = datetime.now()
runtime = self.end_time - self.start_time
self.logger.info(f"Spider {spider.name} closed. Time elapsed: {runtime} seconds")
def response_received(self, response, request, spider):
self.logger.info(f"Response received, status: {response.status} for {response.url}")
- from_crawler(): The factory method Scrapy calls when the crawler starts. It creates the extension, reads the log file path from settings, and connects the extension's methods to Scrapy's signals.
- spider_opened(): Saves the current time when the spider is opened.
- spider_closed(): Checks the time again and calculates the total runtime of the spider.
- response_received(): Logs the status code and URL of every response the spider receives.
Scrapy Middlewares: Customizing the Spider Workflow
Middlewares in Scrapy serve as powerful intermediaries that allow you to customize and control the flow of requests and responses between the Scrapy engine and your spiders.
What are Scrapy Middlewares?
Scrapy middleware is designed to interact directly with your scraping process. Middleware executes custom logic during the request and response phase of your crawl.
There are two main types of middleware:
- downloader middleware and
- spider middleware.
Scrapy's engine coordinates both the downloader (which sends requests and receives responses) and your spiders (which extract data). Custom middleware adds a layer between the engine and these components.
- Downloader Middleware: Use downloader middleware to customize your request and response processing. Downloader middleware is ideal for adding custom cookies, adding retry logic, and setting up proxy integration.
- Spider Middleware: Spider middleware processes responses before they reach the spider and the items and requests the spider yields. It's useful for tasks like filtering bad data or modifying responses before they get to the spider (see the sketch after this list).
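To make the spider middleware side concrete, here's a minimal sketch, not a built-in class: it filters incomplete items out of a spider's output before they reach the pipelines. The class name and the title field are assumptions for illustration only.
class DropIncompleteItemsMiddleware:
    # Spider middleware hook: runs over everything a spider callback yields.
    def process_spider_output(self, response, result, spider):
        for obj in result:
            # Requests pass through untouched; only dict-style items are checked.
            if isinstance(obj, dict) and not obj.get("title"):
                spider.logger.warning(f"Dropping incomplete item from {response.url}")
                continue
            yield obj
You would enable a class like this through the SPIDER_MIDDLEWARES setting, the spider-side counterpart of the DOWNLOADER_MIDDLEWARES setting used later in this guide.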
Popular Open Source Scrapy Middlewares
Scrapy middlewares play a vital role in enhancing and customizing the scraping process, and the open-source community has developed several middlewares to address common challenges.
Here are some of the most popular open-source Scrapy middlewares:
Built-in Scrapy Middlewares
Scrapy comes equipped with a range of built-in middlewares that simplify common web scraping tasks and enhance the efficiency of your spiders.
- RetryMiddleware: When Scrapy receives a failed response (by default, server errors and timeouts such as 500, 502, 503, 504, 522, 524, and 408), it retries the request automatically. This eliminates boilerplate code for stable, redundant scrapers.
- UserAgentMiddleware: Sets the user-agent when making requests to a web server. This allows our scraper to appear as a normal browser, or anything else we set the user-agent to.
- HttpProxyMiddleware: Use this to create and configure HTTP proxy connections.
- HttpCompressionMiddleware: Lets Scrapy send and receive compressed (gzip, deflate, and optionally brotli) responses and decompresses them before they reach your spider, saving bandwidth on large pages.
- CookiesMiddleware: Control your cookies for authentication and other purposes.
- RefererMiddleware: Automatically sets the Referer header on each request based on the page that generated it. This makes it much easier to get legitimate-looking responses as a web scraper.
- DepthMiddleware: Tracks and limits how deep your crawl goes. With this middleware, you can prevent your scraper from getting stuck crawling endlessly through a website.
- ChunkedTransferMiddleware: Handles chunked transfer encoding. When you're downloading a large response, the data arrives in multiple chunks, and this middleware reassembles them for you.
- AjaxCrawlMiddleware: AJAX-heavy sites can be a huge pain. This middleware makes it easier to crawl pages that advertise themselves as AJAX crawlable.
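Most of these built-in middlewares are enabled by default and driven by settings rather than code. A few illustrative knobs (the values are examples only):
# settings.py -- example values for some built-in middlewares
RETRY_TIMES = 2                          # RetryMiddleware: extra retries per request
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408]
COOKIES_ENABLED = True                   # CookiesMiddleware
USER_AGENT = "my_custom_crawler (+https://example.com)"  # UserAgentMiddleware; placeholder UA
DEPTH_LIMIT = 3                          # DepthMiddleware: maximum crawl depth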
3rd Party Middlewares
Third-party Scrapy middlewares, developed and maintained by the open-source community, extend the framework’s capabilities and address specialized scraping challenges.
- Scrapy-Splash: Splash is a headless browser that you control through HTTP requests. This middleware lets you interface with Splash easily and handle sites that require JavaScript rendering.
- Scrapy-Redis: Uses Redis to share the request queue and duplicate filter between spider instances, making distributed crawls straightforward; it can also push your scraped items into Redis.
- Scrapy-UserAgents: Lets you set and customize user-agents easily. It includes a list of user-agent strings and rotates them dynamically to strengthen your scraper.
- Scrapy-Antiban: Combines user-agents, delays, and retries to avoid detection. This middleware helps you get past anti-bots in difficult situations.
Writing a Custom Middleware
When we create a middleware, we need to identify what this middleware will do.
- Does this middleware operate on our request and response objects?
- Does it modify the content we receive before passing it on to the spider?
Middleware Type | Purpose |
---|---|
Downloader | Proxies, retries, user agents and other HTTP settings. |
Spider | Modify and filter data before it gets to the spider. |
Example: Creating a Middleware for Retrying Requests With Custom Logic
The example below contains a custom downloader middleware.
- Create a new middlewares folder inside your crawler project.
- Add a blank file called __init__.py.
- Inside the middlewares folder, add a file with the following code.
- We define a list of status codes that trigger a retry. While our retries haven't maxed out, we'll keep retrying a request until it succeeds.
from scrapy.exceptions import IgnoreRequest
from scrapy.http import Request
from scrapy.utils.response import response_status_message
class CustomMiddleware:
def __init__(self, retry_times, retry_http_codes):
self.retry_times = retry_times
self.retry_http_codes = set(retry_http_codes)
@classmethod
def from_crawler(cls, crawler):
return cls(
retry_times=crawler.settings.getint("RETRY_TIMES", 3),
retry_http_codes=crawler.settings.getlist("RETRY_HTTP_CODES", [500, 502, 503, 504, 522, 524, 408])
)
def process_request(self, request, spider):
spider.logger.info(f"Modifying request: {request.url}")
return None
def process_response(self, request, response, spider):
retries = request.meta.get("retry_times", 0)
if retries > 0:
spider.logger.info(f"Retry response received: {response.status} for {response.url} (Retry {retries})")
else:
spider.logger.info(f"Processing response: {response.status} for {response.url}")
if response.status in self.retry_http_codes:
spider.logger.warning(f"Retrying {response.url} due to HTTP {response.status}")
return self._retry(request, response_status_message(response.status), spider)
return response
def process_exception(self, request, exception, spider):
spider.logger.warning(f"Exception encountered: {exception} for {request.url}")
return self._retry(request, str(exception), spider)
def _retry(self, request, reason, spider):
retries = request.meta.get("retry_times", 0) + 1
if retries <= self.retry_times:
spider.logger.info(f"Retrying {request.url} ({retries}/{self.retry_times}) due to: {reason}")
retry_req = request.copy()
retry_req.meta["retry_times"] = retries
retry_req.dont_filter = True
return retry_req
else:
spider.logger.error(f"Gave up retrying {request.url} after {self.retry_times} attempts")
raise IgnoreRequest(f"Request failed after retries: {request.url}")
- from_crawler(): Reads the retry limit and the list of bad status codes from settings. If we receive a response with one of these status codes and we haven't reached our retry limit, we retry the request.
- process_request(): Logs a message that we're modifying the request before it's sent; returning None lets the request continue through the middleware chain.
- process_response(): Our real retry logic lives here. If a response has one of the bad status codes, we call the _retry() method.
- process_exception(): If we encounter an exception, instead of letting the scraper crash, we retry the request.
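To use this middleware, point DOWNLOADER_MIDDLEWARES at it in settings.py (the full settings block appears again in the integration section below). If you want only your custom retry logic to run, you can also disable Scrapy's built-in RetryMiddleware; in the test run later in this guide it's left enabled, which is why you'll see both sets of retry messages in the log.
DOWNLOADER_MIDDLEWARES = {
    "my_custom_crawler.middlewares.custom_request.CustomMiddleware": 501,
    # Optional: disable the built-in retry middleware so retries aren't doubled.
    # "scrapy.downloadermiddlewares.retry.RetryMiddleware": None,
}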
Scrapy Pipelines: Managing Item Processing
Scrapy pipelines are essential for processing and managing the data extracted by your spiders, allowing you to refine, validate, store, and export items with ease.
What are Scrapy Pipelines?
Pipelines are used to store our data. Once your data gets extracted, you need to do something with it.
Pipelines take in this extracted data and store it wherever you want. This can be a CSV file, a database... whatever you want! When built properly, pipelines will also filter bad data and duplicates from getting stored.
Here are some of the item-handling tools that come built into Scrapy.
- Feed Exports: Save scraped data automatically to files like JSON, CSV, and XML.
- Media Pipelines: Used for downloading media such as images and files and saving them to a specific location.
- DropItem: An exception you raise inside a pipeline to discard duplicate or invalid items before they get saved to the output (a minimal sketch follows below).
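As a quick illustration of the DropItem pattern, here's a minimal duplicates-filter pipeline. It's a sketch, not a built-in Scrapy class, and it assumes items have a title field like the quotes we scrape later in this guide.
from scrapy.exceptions import DropItem

class DuplicatesPipeline:
    # Drops any item whose title has already been seen during this crawl.
    def open_spider(self, spider):
        self.seen_titles = set()

    def process_item(self, item, spider):
        title = item.get("title")
        if title in self.seen_titles:
            raise DropItem(f"Duplicate item found: {title}")
        self.seen_titles.add(title)
        return item
Like any pipeline, it only runs if you register it in ITEM_PIPELINES.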
How To Create a Custom Pipeline
When writing a custom pipeline, you need to create a pipelines folder within your Scrapy project. Make sure to add an __init__.py file to the folder; without this file, your pipeline will not execute.
Inside your settings file, you set a priority for your pipeline, just as we did with our extension and our middleware.
Uncomment the pipelines section and add the path to your pipeline.
When setting multiple pipelines, the lower numbers are executed first. For example, if you have two pipelines, one with a priority of 300 and one with 400, the pipeline with 300 will be executed first.
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
"my_custom_crawler.pipelines.custom_pipeline.CustomPipeline": 300,
}
Example: Creating a Pipeline for Saving Data to a Database
Before we get started with our pipeline, we need a database to work with. Here, we'll go with sqlite3.
Once you've got the sqlite3 command-line tool installed, use the following commands to create your database.
Create the database.
sqlite3 data.db
Create a table to hold the quotes we scrape.
CREATE TABLE quotes (
id INTEGER PRIMARY KEY AUTOINCREMENT,
title TEXT NOT NULL,
author TEXT NOT NULL
);
In the code below, our pipeline takes in data and saves it to our database, data.db.
If an item is missing either its title or author, it gets dropped from the pipeline before it can reach our database.
import sqlite3
from scrapy.exceptions import DropItem
class CustomPipeline:
def open_spider(self, spider):
self.connection = sqlite3.connect('data.db')
self.cursor = self.connection.cursor()
spider.logger.info("SQLitePipeline: Database connection opened.")
self.cursor.execute('''
CREATE TABLE IF NOT EXISTS quotes (
id INTEGER PRIMARY KEY AUTOINCREMENT,
title TEXT NOT NULL,
author TEXT NOT NULL
)
''')
self.connection.commit()
def process_item(self, item, spider):
if not item.get('title') or not item.get('author'):
raise DropItem(f"Missing title or author in {item}")
self.cursor.execute('''
INSERT INTO quotes (title, author) VALUES (?, ?)
''', (item['title'], item['author']))
self.connection.commit()
spider.logger.info(f"SQLitePipeline: Item stored in database: {item}")
return item
def close_spider(self, spider):
self.connection.close()
spider.logger.info("SQLitePipeline: Database connection closed.")
Integrating Custom Components in Scrapy
Now that we've built our custom components, we need to test them all out.
Follow the instructions below to test your code. First, make sure your settings are pointing to the correct paths.
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
"my_custom_crawler.middlewares.custom_request.CustomMiddleware": 501,
}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
EXTENSIONS = {
"scrapy.extensions.telnet.TelnetConsole": None,
"my_custom_crawler.extensions.custom_logger.CustomLoggerExtension": 500
}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
"my_custom_crawler.pipelines.custom_pipeline.CustomPipeline": 300,
}
Run a simple crawl on quotes.toscrape.com.
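If you haven't written a spider yet, the quickest way to generate some traffic is to fetch the page directly. The throwaway spider that scrapy fetch creates shows up as "default" in the logs below.
scrapy fetch https://quotes.toscrape.com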
You should see a log file in your project folder. If you open it up, it'll look similar to this.
2024-12-21 02:07:36,083 - custom_logger - INFO - Spider default opened.
2024-12-21 02:07:37,687 - custom_logger - INFO - Response received, status: 404 for https://quotes.toscrape.com/robots.txt
2024-12-21 02:07:37,890 - custom_logger - INFO - Response received, status: 200 for https://quotes.toscrape.com
2024-12-21 02:07:37,996 - custom_logger - INFO - Spider default closed. Time elapsed: 0:00:01.913207 seconds
Now we'll test our retry logic. We'll run fetch on a site that automatically gives us an error.
scrapy fetch https://httpstat.us/500
You'll need to search the terminal output for your retry logic. You'll find something that looks like what you see below.
2024-12-21 02:08:42 [custom_logger] INFO: Spider default opened.
2024-12-21 02:08:42 [default] INFO: Modifying request: https://httpstat.us/robots.txt
2024-12-21 02:08:43 [default] INFO: Processing response: 404 for https://httpstat.us/robots.txt
2024-12-21 02:08:43 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://httpstat.us/robots.txt> (referer: None)
2024-12-21 02:08:43 [custom_logger] INFO: Response received, status: 404 for https://httpstat.us/robots.txt
2024-12-21 02:08:43 [default] INFO: Modifying request: https://httpstat.us/500
2024-12-21 02:08:44 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://httpstat.us/500> (failed 1 times): 500 Internal Server Error
2024-12-21 02:08:44 [default] INFO: Modifying request: https://httpstat.us/500
2024-12-21 02:08:44 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://httpstat.us/500> (failed 2 times): 500 Internal Server Error
2024-12-21 02:08:44 [default] INFO: Modifying request: https://httpstat.us/500
2024-12-21 02:08:45 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://httpstat.us/500> (failed 3 times): 500 Internal Server Error
2024-12-21 02:08:45 [default] INFO: Retry response received: 500 for https://httpstat.us/500 (Retry 2)
2024-12-21 02:08:45 [default] WARNING: Retrying https://httpstat.us/500 due to HTTP 500
2024-12-21 02:08:45 [default] ERROR: Gave up retrying https://httpstat.us/500 after 2 attempts
2024-12-21 02:08:45 [scrapy.core.engine] INFO: Closing spider (finished)
Before testing our pipeline, we need a spider to extract our data. This one will extract the text and author of each quote.
Create a new file inside the spiders folder and paste the following code.
import scrapy
class TestSpider(scrapy.Spider):
name = "test_pipeline"
start_urls = ['https://quotes.toscrape.com']
def parse(self, response):
for quote in response.css('div.quote'):
yield {
'title': quote.css('span.text::text').get(),
'author': quote.css('span small.author::text').get(),
}
Now, you can go ahead and run your spider.
scrapy crawl test_pipeline
If the spider succeeded, you can run the following commands to check your database.
Open the database.
sqlite3 data.db
Print its contents.
SELECT * FROM quotes;
If your database stored the quotes properly, you should get output similar to this.
1|“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”|Albert Einstein
2|“It is our choices, Harry, that show what we truly are, far more than our abilities.”|J.K. Rowling
3|“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”|Albert Einstein
4|“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”|Jane Austen
5|“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”|Marilyn Monroe
6|“Try not to become a man of success. Rather become a man of value.”|Albert Einstein
7|“It is better to be hated for what you are than to be loved for what you are not.”|André Gide
8|“I have not failed. I've just found 10,000 ways that won't work.”|Thomas A. Edison
9|“A woman is like a tea bag; you never know how strong it is until it's in hot water.”|Eleanor Roosevelt
10|“A day without sunshine is like, you know, night.”|Steve Martin
sqlite>
Advanced Tips for Customizing Scrapy
When you customize Scrapy, you should utilize every tool at your disposal. With our extensions and middlewares, we utilized signals to trigger our custom logic.
Take a look at some of the third-party libraries we mentioned in the sections above.
Using Signals to Create Complex Extensions
You can combine signals for robust error handling. Use spider_error, item_dropped, and request_dropped to run custom logic based on what the scraper encounters.
For example, send email alerts when your spider encounters an error. You could also save your dropped items and requests to individual files to review later.
This can make your debugging tools more comprehensive and easier to manage at the same time.
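Below is a rough sketch of such an extension. The ERROR_REPORT_FILE setting and the class name are hypothetical; it appends one JSON line per event, and you could swap the _write() helper for an email alert instead. Enable it through EXTENSIONS just like the custom logger above.
import json
from scrapy import signals

class ErrorReportExtension:
    def __init__(self, report_file):
        self.report_file = report_file

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls(crawler.settings.get("ERROR_REPORT_FILE", "error_report.jsonl"))
        crawler.signals.connect(ext.spider_error, signal=signals.spider_error)
        crawler.signals.connect(ext.item_dropped, signal=signals.item_dropped)
        crawler.signals.connect(ext.request_dropped, signal=signals.request_dropped)
        return ext

    def _write(self, record):
        # Append one JSON object per line so the report is easy to grep or parse.
        with open(self.report_file, "a") as f:
            f.write(json.dumps(record) + "\n")

    def spider_error(self, failure, response, spider):
        # Fired when a spider callback raises an exception.
        self._write({"type": "spider_error", "url": response.url, "error": str(failure.value)})

    def item_dropped(self, item, response, exception, spider):
        # Fired when a pipeline raises DropItem.
        self._write({"type": "item_dropped", "reason": str(exception)})

    def request_dropped(self, request, spider):
        # Fired when the scheduler rejects a request (e.g. the duplicate filter).
        self._write({"type": "request_dropped", "url": request.url})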
Third-Party Libraries in Your Custom Components
You can utilize Scrapy-Splash for rendering JavaScript-heavy pages. Combine it with a custom downloader middleware and cache the pages.
This allows you to access the page on multiple instances without additional requests to the server.
Scrapy-Redis can help you create a distributed scraper. You could write a custom scheduler to prioritize different urls based on how frequently they're updated. This can give you an entirely automated crawling process.
The only time you need to mess with it is when something goes wrong... Which, if you implement custom notifications, you could find out about using your smartphone.
Optimizing Performance
If you upgrade your database to something like MySQL or Postgres, you can use asynchronous database clients and connection pooling. Combine this with multiple crawlers, and you've got a single database being fed by several scrapers at the same time.
Because these databases are built for concurrent access, you're far less likely to run into the locking and corruption issues you can hit with a single SQLite file.
Use custom settings with the built-in AutoThrottle extension. Depending on how you configure it, your scraper can adjust its request rate dynamically to avoid rate limiting and prevent itself from getting blocked.
Because it auto-adjusts, it can find a near-optimal setting for each site it scrapes.
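Spiders also expose a custom_settings attribute, so you can tune AutoThrottle per spider. The spider name, URL, and numbers in this sketch are illustrative only.
import scrapy

class GentleSpider(scrapy.Spider):
    name = "gentle"
    start_urls = ["https://quotes.toscrape.com"]

    # Per-spider overrides; these only apply while this spider runs.
    custom_settings = {
        "AUTOTHROTTLE_ENABLED": True,
        "AUTOTHROTTLE_START_DELAY": 1.0,
        "AUTOTHROTTLE_MAX_DELAY": 30.0,
        "AUTOTHROTTLE_TARGET_CONCURRENCY": 2.0,
    }

    def parse(self, response):
        yield {"url": response.url}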
Real-World Examples of Scrapy Customization
Below are some real world examples where custom extensions can help you greatly.
Take a look and see how your new skills can help you.
E-Commerce
E-commerce platforms can be incredibly difficult to scrape. Depending on the site, you can run into rate limiting, dynamic page content, and varying data formats.
You can use AutoThrottle with a custom downloader middleware to retry bad responses. You can use spider middlewares to alter the data and format it before extraction.
Use a custom pipeline to save all of this in an async database for a website telling people where the best deals are.
News Aggregator
Each day, thousands of news sites publish countless articles, and it's often difficult to separate the important news from the nonsense.
You can write custom middlewares to search multiple news sites for keywords. Store the results in a database, and you've got a database of top news articles of the day. Sites like this are great.
People often don't know what to read when looking for news online, and you can help them figure it out.
Conclusion
Out of the box, Scrapy gives us a powerful and effective way to crawl and scrape the web.
With extensions, middlewares, and pipelines, you can take Scrapy to new heights.
Because its networking is asynchronous, a well-tuned Scrapy project in Python can even hold its own against scrapers written in compiled languages like Rust, Go, and C++.
When you understand what your scraping project needs, you can write all sorts of components tailored to fit the job perfectly.
- Extensions: Use custom Scrapy extensions to handle global behavior such as logging and error notifications.
- Middlewares: Use these for retry logic, and even to render content on demand when a site relies on embedded JavaScript.
- Pipelines: Whether you want to save your data to a CSV, a JSON file, or a full-blown database, pipelines will help you filter out duplicates and save items in your desired format.
If you'd like to view Scrapy's documentation, you can find it at https://docs.scrapy.org/en/latest/.
More Scrapy Web Scraping Guides
At ScrapeOps, Scrapy is one of our favorite tools for scraping the web.
Our Scrapy Playbook contains a bunch of comprehensive examples for those of you looking to enhance your extraction with one of the best tools out there.
You can view a couple samples in the links below.