Skip to main content

Python Scrapy: Build A Amazon Products Scraper

Python Scrapy: Build A Wayfair Products Scraper [2023]

In this guide for our "How To Scrape X With Python Scrapy" series, we're going to look at how to build a Python Scrapy spider that will crawl Wayfair.com for products and scrape product pages.

Wayfair is one of the most popular e-commerce websites in the world. It specialises in selling furniture and home decor items.

In this guide we will go through:

GitHub Code

The full code for this Wayfair Spider is available on Github here.

If you prefer to follow along with a video then check out the video tutorial version here:

Need help scraping the web?

Then check out ScrapeOps, the complete toolkit for web scraping.


How To Architect Our Wayfair Product Scraper

How we design our Wayfair product scraper is going to heavily depend on:

  • The use case for scraping this data?
  • What data we want to extract from Wayfair?
  • How often do we want to extract data?
  • How much data do we want to extract?
  • Your technical sophistication?

How you answer these questions will change what type of scraping architecture we build.

For this Wayfair scraper example we will assume the following:

  • Objective: The objective for this scraping system is to monitor product rankings for our target keywords and monitor the individual products every day.
  • Required Data: We want to store the product rankings for each keyword and the essential product data (price, reviews, etc.)
  • Scale: This will be a relatively small scale scraping process (handful of keywords), so no need to design a more sophisticated infrastructure.
  • Data Storage: To keep things simple for the example we will store to a CSV file, but provide examples on how to store to MySQL & Postgres DBs.

To do this will design a Scrapy spider that combines both a product discovery crawler and a product data scraper.

As the spider runs it will crawl Wayfair's product search pages, extract product URLs and then send them to the product scraper via a callback. Saving the data to a CSV file via Scrapy Feed Exports.

The advantage of this scraping architecture is that is pretty simple to build and completely self-contained.


How To Build a Wayfair Product Crawler

The first part of scraping Wayfair is designing a web crawler that will build a list of product URLs for our product scraper to scrape.

Step 1: Understand Wayfair Search Pages

With Wayfair.com the easiest way to do this is to build a Scrapy crawler that uses the Wayfair search pages which returns up to 20 products per page.

How To Scrape Wayfair.com Search Pages

For example, here is how we would get search results for bed.


'https://www.wayfair.com/keyword.php?keyword=bed'

This URL contains a number of parameters that we will explain:

  • keyword stands for the search keyword. In our case, keyword=bed. Note: If you want to search for a keyword that contains spaces or special characters then remember you need to encode this value.

  • curpage stands for the page number. In our cases, we've requested curpage=1.

Using these parameters we can query the Wayfair search endpoint to start building a list of URLs to scrape.

To extract product URLs from this page, we need to look through every product on this page, extract the relative URL to the product and the either create an absolute product URL.


Step 2: Build Wayfair Search Crawler

The first thing we need to do is to build a Scrapy spider that will send a request to the Amazon Wayfair page, and paginate through every available results page.

Here is an example Python Scapy crawler that will paginate through each page of search results for each keyword in our keyword_list.


import json
import scrapy
from urllib.parse import urljoin
import re

class WayfairSearchProductSpider(scrapy.Spider):
name = "wayfair_search_product"

def start_requests(self):
keyword_list = ['bed','couch']
for keyword in keyword_list:
wayfair_search_url = f'https://www.wayfair.com/keyword.php?keyword={keyword}&curpage=1'
yield scrapy.Request(url=wayfair_search_url, callback=self.discover_product_urls, meta={'keyword': keyword, 'page': 1})

def discover_product_urls(self, response):
page = response.meta['page']
keyword = response.meta['keyword']

## Get All Pages
if page == 1:
available_pages = response.xpath(
'//*[contains(@class, "s-pagination-item")][not(has-class("s-pagination-separator"))]/text()'
).getall()

last_page = available_pages[-1]
for page_num in range(2, int(last_page)):
wayfair_search_url = f'https://www.wayfair.com/keyword.php?keyword={keyword}&curpage={page_num}'
yield scrapy.Request(url=wayfair_search_url, callback=self.parse_search_results, meta={'keyword': keyword, 'page': page_num})


We can run this spider using the following command:


scrapy crawl wayfair_search_product

When we run this spider it will crawl through every available Wayfair search page for your target keyword (in this case bed). However, it won't output any data.

Now that we have a product discovery spider we can extract the product URLs and scrape each individual Amazon product page.


How To Build a Wayfair Product Scraper

To scrape actual product data we will add a callback to our product discovery crawler, that will request each product page and then a product scraper to scrape all the product information we want.

Step 1: Add Product Scraper Callback

First, we need to update our discover_product_urls() method to extract all the product URLs from the search_products and then send a request to each one.


import json
import scrapy
from urllib.parse import urljoin
import re

class AmazonSearchProductSpider(scrapy.Spider):
name = "amazon_search_product"

def start_requests(self):
keyword_list = ['ipad']
for keyword in keyword_list:
amazon_search_url = f'https://www.amazon.com/s?k={keyword}&page=1'
yield scrapy.Request(url=amazon_search_url, callback=self.discover_product_urls, meta={'keyword': keyword, 'page': 1})

def discover_product_urls(self, response):
page = response.meta['page']
keyword = response.meta['keyword']

## Discover Product URLs
search_products = response.css("div.s-result-item[data-component-type=s-search-result]")
for product in search_products:
relative_url = product.css("h2>a::attr(href)").get()
product_url = urljoin('https://www.amazon.com/', relative_url).split("?")[0]
yield scrapy.Request(url=product_url, callback=self.parse_product_data, meta={'keyword': keyword, 'page': page})

## Get All Pages
if page == 1:
available_pages = response.xpath(
'//*[contains(@class, "s-pagination-item")][not(has-class("s-pagination-separator"))]/text()'
).getall()

last_page = available_pages[-1]
for page_num in range(2, int(last_page)):
amazon_search_url = f'https://www.amazon.com/s?k={keyword}&page={page_num}'
yield scrapy.Request(url=amazon_search_url, callback=self.parse_search_results, meta={'keyword': keyword, 'page': page_num})


def parse_product_data(self, response):
## Add product data extraction code here
pass


This will extract all the URLs from the Amazon search page, request the URL and then trigger a parse_product_data scraper when it recieves a response.


Step 2: Understand Amazon Product Page

Here is an example Amazon product page URL:


'https://www.amazon.com/2021-Apple-10-2-inch-iPad-Wi-Fi/dp/B09G9FPHY6/ref=sr_1_1'

Amazon ASIN Product URLs

Alternatively, we could use URLs solely based on the products ASIN code:


'https://www.amazon.com/dp/B09G9FPHY6'

Which looks like this in our browser:

How To Scrape Amazon.com Product Pages

Unlike websites like Walmart.com, who return the data in JSON format within the HTML response, with Amazon, we will need to create parsers to extract the data from the actual HTML.


Step 3: Build Our Amazon Product Page Scraper

To scrape the resulting Amazon product page we need to populate our callback parse_product_data(), which will parse the data from the Amazon product page after Scrapy has recieved a response:


import json
import scrapy
from urllib.parse import urljoin
import re

class AmazonSearchProductSpider(scrapy.Spider):
name = "amazon_search_product"

def start_requests(self):
keyword_list = ['ipad']
for keyword in keyword_list:
amazon_search_url = f'https://www.amazon.com/s?k={keyword}&page=1'
yield scrapy.Request(url=amazon_search_url, callback=self.discover_product_urls, meta={'keyword': keyword, 'page': 1})

def discover_product_urls(self, response):
page = response.meta['page']
keyword = response.meta['keyword']

## Discover Product URLs
search_products = response.css("div.s-result-item[data-component-type=s-search-result]")
for product in search_products:
relative_url = product.css("h2>a::attr(href)").get()
product_url = urljoin('https://www.amazon.com/', relative_url).split("?")[0]
yield scrapy.Request(url=product_url, callback=self.parse_product_data, meta={'keyword': keyword, 'page': page})

## Get All Pages
if page == 1:
available_pages = response.xpath(
'//*[contains(@class, "s-pagination-item")][not(has-class("s-pagination-separator"))]/text()'
).getall()

last_page = available_pages[-1]
for page_num in range(2, int(last_page)):
amazon_search_url = f'https://www.amazon.com/s?k={keyword}&page={page_num}'
yield scrapy.Request(url=amazon_search_url, callback=self.parse_search_results, meta={'keyword': keyword, 'page': page_num})


def parse_product_data(self, response):
image_data = json.loads(re.findall(r"colorImages':.*'initial':\s*(\[.+?\])},\n", response.text)[0])
variant_data = re.findall(r'dimensionValuesDisplayData"\s*:\s* ({.+?}),\n', response.text)
feature_bullets = [bullet.strip() for bullet in response.css("#feature-bullets li ::text").getall()]
price = response.css('.a-price span[aria-hidden="true"] ::text').get("")
if not price:
price = response.css('.a-price .a-offscreen ::text').get("")
yield {
"name": response.css("#productTitle::text").get("").strip(),
"price": price,
"stars": response.css("i[data-hook=average-star-rating] ::text").get("").strip(),
"rating_count": response.css("div[data-hook=total-review-count] ::text").get("").strip(),
"feature_bullets": feature_bullets,
"images": image_data,
"variant_data": variant_data,
}


Now when we run our scraper:


scrapy crawl amazon_search_product

We will get an ouptut like this:


{"name": "Apple iPad 9.7inch with WiFi 32GB- Space Gray (2017 Model) (Renewed)",
"price": "$137.00",
"stars": "4.6 out of 5 stars",
"rating_count": "8,532 global ratings",
"feature_bullets": [
"Make sure this fits by entering your model number.",
"9.7-Inch Retina Display, wide Color and True Tone",
"A9 third-generation chip with 64-bit architecture",
"M9 motion coprocessor, 1.2MP FaceTime HD Camera",
"8MP insight Camera, touch ID, Apple Pay"],
"images": [{"hiRes": "https://m.media-amazon.com/images/I/51dBcW+NXPL._AC_SL1000_.jpg",
"thumb": "https://m.media-amazon.com/images/I/51pGtRLfaZL._AC_US40_.jpg",
"large": "https://m.media-amazon.com/images/I/51pGtRLfaZL._AC_.jpg",
"main": {...},
"variant": "MAIN",
"lowRes": None,
"shoppableScene": None},
{"hiRes": "https://m.media-amazon.com/images/I/51c43obovcL._AC_SL1000_.jpg",
"thumb": "https://m.media-amazon.com/images/I/415--n36L8L._AC_US40_.jpg",
"large": "https://m.media-amazon.com/images/I/415--n36L8L._AC_.jpg",
"main": {...},
"variant": "PT01",
"lowRes": None,
"shoppableScene": None},
"variant_data": ["{`B074PXZ5GC`:[`9.7 inches`,`Wi-Fi`,`Silver`],`B00TJGN4NG`:[`16GB`,`Wi-Fi`,`White`],`B07F93611L`:[`5 Pack`,`Wi-Fi`,`Space grey`],`B074PWW6NS`:[`Refurbished`,`Wi-Fi`,`Black`],`B0725LCLYQ`:[`9.7`,`Wi-Fi`,`Space Gray`],`B07D3DDJ4L`:[`32GB`,`Wi-Fi`,`Space Gray`],`B07G9N7J3S`:[`32GB`,`Wi-Fi`,`Gold`]}"]}


Storing Data To Database Or S3 Bucket

With Scrapy, it is very easy to save our scraped data to CSV files, databases or file storage systems (like AWS S3) using Scrapy's Feed Export functionality.

To configure Scrapy to save all our data to a new CSV file everytime we run the scraper we simply need to create a Scrapy Feed and configure a dynamic file path.

If we add the following code to our settings.py file, Scrapy will create a new CSV file in our data folder using the spider name and time the spider was run.

# settings.py 

FEEDS = {
'data/%(name)s_%(time)s.csv': {
'format': 'csv',
}
}

If you would like to save your CSV files to a AWS S3 bucket then check out our Saving CSV/JSON Files to Amazon AWS S3 Bucket guide here

Or if you would like to save your data to another type of database then be sure to check out these guides:


Bypassing Amazon's Anti-Bot Protection

As you might have seen already if you run this code Amazon might be blocking you and returning a error page like this:

How To Scrape Amazon.com - Error Page

Or telling you that if you want automated access to their data reach out to them.

This is because Amazon uses anti-bot protection to try and prevent (or at least make it harder) developers from scraping their site.

You will need to using rotating proxies, browser-profiles and possibly fortify your headless browser if you want to scrape Amazon reliably at scale.

We have written guides about how to do this here:

However, if you don't want to implement all this anti-bot bypassing logic yourself, the easier option is to use a smart proxy solution like ScrapeOps Proxy Aggregator.

The ScrapeOps Proxy Aggregator is a smart proxy that handles everything for you:

  • Proxy rotation & selection
  • Rotating user-agents & browser headers
  • Ban detection & CAPTCHA bypassing
  • Country IP geotargeting
  • Javascript rendering with headless browsers

You can get a ScrapeOps API key with 1,000 free API credits by signing up here.

To use the ScrapeOps Proxy Aggregator with our Amazon Scrapy Spider, we just need to send the URL we want to scrape to the Proxy API instead of making the request directly ourselves. You can test it out with Curl using the command below:


curl 'https://proxy.scrapeops.io/v1/?api_key=YOUR_API_KEY&url=https://amazon.com'

We can integrate the proxy easily into our scrapy project by installing the ScrapeOps Scrapy Proxy SDK a Downloader Middleware. We can quickly install it into our project using the following command:


pip install scrapeops-scrapy-proxy-sdk

And then enable it in your project in the settings.py file.

## settings.py

SCRAPEOPS_API_KEY = 'YOUR_API_KEY'
SCRAPEOPS_PROXY_ENABLED = True

DOWNLOADER_MIDDLEWARES = {
'scrapeops_scrapy_proxy_sdk.scrapeops_scrapy_proxy_sdk.ScrapeOpsScrapyProxySdk': 725,
}

Now when we make requests with our scrapy spider they will be routed through the proxy and Amazon won't block them.

Full documentation on how to integrate the ScrapeOps Proxy here.


Monitoring Your Amazon Scraper

When scraping in production it is vital that you can see how your scrapers are doing so you can fix problems early.

You could see if your jobs are running correctly by checking the output in your file or database but the easier way to do it would be to install the ScrapeOps Monitor.

ScrapeOps gives you a simple to use, yet powerful way to see how your jobs are doing, run your jobs, schedule recurring jobs, setup alerts and more. All for free!

Live demo here: ScrapeOps Demo

ScrapeOps Promo

You can create a free ScrapeOps API key here.

We'll just need to run the following to install the ScrapeOps Scrapy Extension:


pip install scrapeops-scrapy

Once that is installed you need to add the following to your Scrapy projects settings.py file if you want to be able to see your logs in ScrapeOps:


# Add Your ScrapeOps API key
SCRAPEOPS_API_KEY = 'YOUR_API_KEY'


# Add In The ScrapeOps Extension
EXTENSIONS = {
'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500,
}


# Update The Download Middlewares
DOWNLOADER_MIDDLEWARES = {
'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550,
'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}

Now, every time we run our Amazon spider (scrapy crawl amazon_search_product), the ScrapeOps SDK will monitor the performance and send the data to ScrapeOps dashboard.

Full documentation on how to integrate the ScrapeOps Monitoring here.


Scheduling & Running Our Scraper In The Cloud

Lastly, we will want to deploy our Amazon scraper to a server so that we can schedule it to run every day, week, etc.

To do this you have a couple of options.

However, one of the easiest ways is via ScrapeOps Job Scheduler. Plus it is free!

ScrapeOps Job Scheduler Demo

Here is a video guide on how to connect a Digital Ocean to ScrapeOps and schedule your jobs to run.

You could also connect ScrapeOps to any server like Vultr or Amazon Web Services(AWS).


More Web Scraping Guides

In this edition of our "How To Scrape X" series, we went through how you can scrape Amazon.com including how to bypass its anti-bot protection.

The full code for this Amazon Spider is available on Github here.

If you would like to learn how to scrape other popular websites then check out our other How To Scrape With Scrapy Guides here:

Of if you would like to learn more about web scraping in general, then be sure to check out The Web Scraping Playbook, or check out one of our more in-depth guides: