Scrapy Playwright Guide: Render & Scrape JS Heavy Websites
Released by Microsoft in 2020, Playwright is quickly becoming the most popular headless browser library for browser automation and web scraping, thanks to its cross-browser support (it can drive Chromium, WebKit, and Firefox, whilst Puppeteer only drives Chromium) and its developer-experience improvements over Puppeteer.
So it is great to see that a number of the core Scrapy maintainers developed a Playwright integration for Scrapy: scrapy-playwright.
Scrapy Playwright is one of the best headless browser options you can use with Scrapy, so in this guide we will go through:
- How To Install Scrapy Playwright
- How To Use Scrapy Playwright In Your Spiders
- How To Wait For The Page To Load
- How To Scrape Multiple Pages
- How To Scroll The Page Elements With Scrapy Playwright
- How To Take Screenshots With Scrapy Playwright
As of writing this guide, Scrapy Playwright doesn't work on Windows. However, it is possible to run it with WSL (Windows Subsystem for Linux).
Base Scrapy Project
If you'd like to follow along with a project that is already set up and ready to go, you can clone our Scrapy project that was made especially for this tutorial.
Once you download the code from our GitHub repo, you can just copy/paste in the code snippets we use below and see the code working correctly on your computer.
The only thing that you need to do after downloading the code is to set up a Python virtual environment. If you don't know how to do that, you can check out our guide here.
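For example, a typical virtual environment setup on macOS/Linux looks like this (the folder name venv is just an illustrative choice, and your project may have further dependencies to install):
python3 -m venv venv
source venv/bin/activate
pip install scrapy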
If you prefer video tutorials, then check out the video version of this article.
How To Install Scrapy Playwright
Installing scrapy-playwright into your Scrapy projects is very straightforward.
First, you need to install scrapy-playwright itself:
pip install scrapy-playwright
Then, if you haven't already done so, you will need to install the Playwright browser binaries using the following command in your command line:
playwright install
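If you only plan on using one of the browsers, you can install just that browser's binaries instead, for example Chromium only:
playwright install chromium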
Next, we will need to update our Scrapy project's settings to activate scrapy-playwright in the project:
# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
The ScrapyPlaywrightDownloadHandler class inherits from Scrapy's default http/https handler, so unless you explicitly activate scrapy-playwright in your Scrapy Request, those requests will be processed by the regular Scrapy download handler.
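scrapy-playwright also lets you choose which browser it launches and how, via the optional PLAYWRIGHT_BROWSER_TYPE and PLAYWRIGHT_LAUNCH_OPTIONS settings. For example, a minimal sketch:
# settings.py (optional)
PLAYWRIGHT_BROWSER_TYPE = "firefox"  # "chromium" (default), "firefox" or "webkit"
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "headless": False,  # show the browser window while debugging
    "timeout": 20 * 1000,  # launch timeout in milliseconds
}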
How To Use Scrapy Playwright In Your Spiders
Now, let's integrate scrapy-playwright into a Scrapy spider so all our requests will be JS rendered.
To route our requests through scrapy-playwright we just need to enable it in the Request meta dictionary by setting meta={'playwright': True}.
# spiders/quotes.py
import scrapy
from quotes_js_scraper.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = "https://quotes.toscrape.com/js/"
        yield scrapy.Request(url, meta={'playwright': True})

    def parse(self, response):
        for quote in response.css('div.quote'):
            quote_item = QuoteItem()
            quote_item['text'] = quote.css('span.text::text').get()
            quote_item['author'] = quote.css('small.author::text').get()
            quote_item['tags'] = quote.css('div.tags a.tag::text').getall()
            yield quote_item
The response will now contain the rendered page as seen by the browser. However, sometimes Playwright may stop rendering before the entire page has loaded, which we can solve using Playwright PageMethods.
Note:
If you are getting the following error when running scrapy crawl:
scrapy.exceptions.NotSupported: Unsupported URL scheme 'https': No module named 'scrapy_playwright'
What usually resolves this error is running deactivate to deactivate your virtual environment and then re-activating it.
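For example, assuming your virtual environment lives in a folder called venv:
deactivate
source venv/bin/activate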
Interacting With The Page Using Playwright PageMethods
To interact with the page using scrapy-playwright we need to use the PageMethod class.
PageMethods allow us to do a lot of different things on the page, including:
- Waiting for elements to load before returning the response
- Scrolling the page
- Clicking on page elements (a click example follows section 1 below)
- Taking a screenshot of the page
- Creating PDFs of the page (a PDF example follows section 4 below)
First, to use the PageMethod functionality in your spider you will need to set playwright_include_page equal to True so we can access the Playwright Page object, and also define any callbacks (e.g. def parse) as coroutine functions (async def) in order to await the provided Page object.
# spiders/quotes.py
import scrapy
from quotes_js_scraper.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = 'https://quotes.toscrape.com/js/'
        yield scrapy.Request(url, meta=dict(
            playwright=True,
            playwright_include_page=True,
        ))

    async def parse(self, response):
        ...
Note: When setting 'playwright_include_page': True it is also recommended that you set a Request errback to make sure pages are closed even if a request fails (if playwright_include_page is False or unset, pages are automatically closed when an exception is encountered).
# spiders/quotes.py
import scrapy
from quotes_js_scraper.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = 'https://quotes.toscrape.com/js/'
        # errback is a Request argument, not a meta key
        yield scrapy.Request(
            url,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
            ),
            errback=self.errback,
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        await page.close()
        for quote in response.css('div.quote'):
            quote_item = QuoteItem()
            quote_item['text'] = quote.css('span.text::text').get()
            quote_item['author'] = quote.css('small.author::text').get()
            quote_item['tags'] = quote.css('div.tags a.tag::text').getall()
            yield quote_item

    async def errback(self, failure):
        page = failure.request.meta["playwright_page"]
        await page.close()
1. Waiting For Page Elements
To wait for a specific page element before stopping the JavaScript rendering and returning a response to our scraper, we just need to add a PageMethod to the playwright_page_methods key in our Playwright settings and define a wait_for_selector.
Now, when we run the spider, scrapy-playwright will render the page until a div with the class quote appears on the page.
# spiders/quotes.py
import scrapy
from quotes_js_scraper.items import QuoteItem
from scrapy_playwright.page import PageMethod

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = "https://quotes.toscrape.com/js/"
        yield scrapy.Request(
            url,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_methods=[PageMethod('wait_for_selector', 'div.quote')],
            ),
            errback=self.errback,
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        await page.close()
        for quote in response.css('div.quote'):
            quote_item = QuoteItem()
            quote_item['text'] = quote.css('span.text::text').get()
            quote_item['author'] = quote.css('small.author::text').get()
            quote_item['tags'] = quote.css('div.tags a.tag::text').getall()
            yield quote_item

    async def errback(self, failure):
        page = failure.request.meta["playwright_page"]
        await page.close()
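The same pattern covers the other PageMethod actions from the list above. As a sketch, here is the same request with an added click step using Playwright's Page.click() method; the li.next > a selector is our assumption about the pagination markup on quotes.toscrape.com:
# Sketch: click the "Next" link before the response is returned to the spider
yield scrapy.Request(
    url,
    meta=dict(
        playwright=True,
        playwright_include_page=True,
        playwright_page_methods=[
            PageMethod("wait_for_selector", "div.quote"),
            PageMethod("click", "li.next > a"),  # assumed selector for the next-page link
        ],
    ),
    errback=self.errback,
)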
2. Scraping Multiple Pages
Usually we need to scrape multiple pages on a JavaScript-rendered website. We will do this by checking if there is a next-page link present on the page and, if there is, requesting the URL we scrape from it.
# spiders/quotes.py
import scrapy
from quotes_js_scraper.items import QuoteItem
from scrapy_playwright.page import PageMethod

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = "https://quotes.toscrape.com/js/"
        yield scrapy.Request(
            url,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_methods=[
                    PageMethod('wait_for_selector', 'div.quote'),
                ],
            ),
            errback=self.errback,
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        await page.close()
        for quote in response.css('div.quote'):
            quote_item = QuoteItem()
            quote_item['text'] = quote.css('span.text::text').get()
            quote_item['author'] = quote.css('small.author::text').get()
            quote_item['tags'] = quote.css('div.tags a.tag::text').getall()
            yield quote_item

        next_page = response.css('.next > a::attr(href)').get()
        if next_page is not None:
            # urljoin resolves the relative link against the current page's URL
            next_page_url = response.urljoin(next_page)
            yield scrapy.Request(
                next_page_url,
                meta=dict(
                    playwright=True,
                    playwright_include_page=True,
                    playwright_page_methods=[
                        PageMethod('wait_for_selector', 'div.quote'),
                    ],
                ),
                errback=self.errback,
            )

    async def errback(self, failure):
        page = failure.request.meta["playwright_page"]
        await page.close()
3. Scroll Down Infinite Scroll Pages
We can also configure scrapy-playwright to scroll down a page when a website uses an infinite scroll to load in data.
In this example, Playwright will wait for div.quote to appear before scrolling down the page until the 11th quote has loaded (each page contains 10 quotes, so an 11th quote indicates the next batch has been rendered).
# spiders/quotes.py
import scrapy
from quotes_js_scraper.items import QuoteItem
from scrapy_playwright.page import PageMethod

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = "https://quotes.toscrape.com/scroll"
        yield scrapy.Request(
            url,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_methods=[
                    PageMethod("wait_for_selector", "div.quote"),
                    PageMethod("evaluate", "window.scrollBy(0, document.body.scrollHeight)"),
                    PageMethod("wait_for_selector", "div.quote:nth-child(11)"),  # 10 per page
                ],
            ),
            errback=self.errback,
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        await page.close()
        for quote in response.css('div.quote'):
            quote_item = QuoteItem()
            quote_item['text'] = quote.css('span.text::text').get()
            quote_item['author'] = quote.css('small.author::text').get()
            quote_item['tags'] = quote.css('div.tags a.tag::text').getall()
            yield quote_item

    async def errback(self, failure):
        page = failure.request.meta["playwright_page"]
        await page.close()
4. Take Screenshot Of Page
Taking screenshots of the page is simple too.
Here we wait for Playwright to see the selector div.quote, then it takes a screenshot of the page.
# spiders/quotes.py
import scrapy
from quotes_js_scraper.items import QuoteItem
from scrapy_playwright.page import PageMethod

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = "https://quotes.toscrape.com/js/"
        yield scrapy.Request(url, meta=dict(
            playwright=True,
            playwright_include_page=True,
            playwright_page_methods=[
                PageMethod("wait_for_selector", "div.quote"),
            ],
        ))

    async def parse(self, response):
        page = response.meta["playwright_page"]
        screenshot = await page.screenshot(path="example.png", full_page=True)
        # screenshot contains the image's bytes
        await page.close()
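Creating PDFs of the page, the last action on the PageMethod list above, works the same way through Playwright's page.pdf() method. Note that PDF generation only works in headless Chromium. A minimal sketch of the parse callback:
async def parse(self, response):
    page = response.meta["playwright_page"]
    # page.pdf() is Chromium-only; it writes the file to `path` and returns the PDF bytes
    pdf_bytes = await page.pdf(path="example.pdf")
    await page.close()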
Using Proxies With Scrapy Playwright
In Scrapy Playwright, proxies can be configured at the browser level by specifying the proxy key in the PLAYWRIGHT_LAUNCH_OPTIONS setting. We also need to set the ignore_https_errors key in the playwright_context_kwargs of the scrapy.Request.
In our example below we will show how it works if you are using the ScrapeOps Proxy API Aggregator.
The only part you should have to change is YOUR_API_KEY_HERE: replace this with your ScrapeOps API key.
(The following code is working as of October 2023.)
# spiders/quotes.py
import scrapy

class ProxySpider(scrapy.Spider):
    name = "quotes"
    custom_settings = {
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "proxy": {
                "server": "http://proxy.scrapeops.io:5353",
                "username": "scrapeops",
                "password": "YOUR_API_KEY_HERE",
            },
        }
    }

    def start_requests(self):
        # Request httpbin.org/get so the response shows the proxied request's details
        yield scrapy.Request(
            "http://httpbin.org/get",
            meta=dict(
                playwright=True,
                playwright_context_kwargs={
                    "ignore_https_errors": True,
                },
            ),
        )

    def parse(self, response):
        print(response.text)
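Playwright also accepts a proxy option when creating a browser context, so as an alternative sketch you can pass the proxy details through playwright_context_kwargs instead of the launch options (same placeholder credentials as above):
def start_requests(self):
    yield scrapy.Request(
        "http://httpbin.org/get",
        meta=dict(
            playwright=True,
            playwright_context_kwargs={
                "proxy": {
                    "server": "http://proxy.scrapeops.io:5353",
                    "username": "scrapeops",
                    "password": "YOUR_API_KEY_HERE",  # placeholder: your ScrapeOps API key
                },
                "ignore_https_errors": True,
            },
        ),
    )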
More Functionality
Scrapy Playwright has a huge amount of functionality and is highly customisable, so much so that it is hard to cover everything properly in a single guide.
So if you would like to learn more about Scrapy Playwright, then check out the official documentation here.
More Scrapy Tutorials
In this guide we've introduced you to the fundamental functionality of Scrapy Playwright and how to use it in your own projects.
If you would like to learn more about the different JavaScript rendering options for Scrapy, then be sure to check out our other guides:
If you would like to learn more about Scrapy in general, then be sure to check out The Scrapy Playbook.