Playwright Guide: How To Block Images and Resources
When running Playwright, Selenium, or any other automated browser, we often run into performance issues. Rendering and loading content is itself a resource-intensive job, and the more content there is to render, the more it affects performance.
This guide is intended to give developers a clearer understanding of how to fine tune their Playwright project by blocking resources and only loading the information that they require.
- TLDR - How To Block Images and Resources
- Understanding Playwright's Capabilities
- Images and Resources in a Web Page
- Why Block Images and Resources
- Blocking Images and Resources: A Step By Step Guide
- Advanced Concepts
- Real World Applications
- Best Practices and Considerations
- Conclusion
- More Web Scraping Guides
TLDR - How To Block Images and Resources
Here's a quick and short step-by-step explanation of how to block images and resources in web pages:
//import playwright
const playwright = require("playwright");
//create an async function to scrape the data
async function scrapeData() {
//launch Chromium
const browser = await playwright.chromium.launch({
headless: false
});
//create a constant "page" instance
const page = await browser.newPage();
//set up rules for all routes on the page
await page.route("**/*", (route, request) => {
//block images
if (request.resourceType() === "image") {
//abort the route
route.abort();
//block stylesheets
} else if (request.resourceType() === "stylesheet") {
//abort the route
route.abort();
//block media
} else if (request.resourceType() === "media") {
//abort the route
route.abort();
//block javascript
} else if (request.resourceType() === "script") {
//abort the route
route.abort();
//block xhr and fetch requests
} else if (request.resourceType() === "xhr" || request.resourceType() === "fetch") {
//abort the route
route.abort();
} else {
//let it through
route.continue();
}
});
//now that we've set our rules, navigate to the site
await page.goto("https://unsplash.com/");
await page.screenshot({path: "noImages.png"})
//close the browser
await browser.close();
}
//run the scrapeData function
scrapeData();
In summary, this script launches a Chromium browser, sets up a rule that blocks several types of network requests (images, stylesheets, media, scripts, XHR, and fetch requests), navigates to the Unsplash website, takes a screenshot, and then closes the browser.
If you would like to block a certain type of resource in a web page, you can use the code blocks below:
How To Block Images
This code is designed to prevent the loading of images on a web page.
await page.route("**/*", (route, request) => {
if (request.resourceType() === "image") {
route.abort();
} else {
route.continue();
}
});
How To Block CSS
Similar to blocking images, this code focuses on preventing the loading of CSS (stylesheet) files.
await page.route("**/*", (route, request) => {
if (request.resourceType() === "stylesheet") {
route.abort();
} else {
route.continue();
}
});
How To Block Media Loading
This code targets the blocking of media files, such as audio or video.
await page.route("**/*", (route, request) => {
if (request.resourceType() === "media") {
route.abort();
} else {
route.continue();
}
});
How To Block Script Loading
In this code snippet, the goal is to block the loading and execution of JavaScript scripts.
await page.route("**/*", (route, request) => {
if (request.resourceType() === "script") {
route.abort();
} else {
route.continue();
}
});
How To Block XHR & Fetch Requests
This code showcases how to block XMLHttpRequest (XHR) and Fetch requests.
await page.route("**/*", (route, request) => {
if (request.resourceType() === "xhr" || request.resourceType() === "fetch") {
route.abort();
} else {
route.continue();
}
});
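The five snippets above differ only in which resource types they abort, so they can be folded into one handler driven by a set. The sketch below is one possible way to do that. `createBlocker`, `BLOCKED_TYPES`, and `blockHandler` are names made up for this illustration and are not part of Playwright.

```javascript
// Hypothetical helper: builds a page.route() handler from a list of resource types.
function createBlocker(blockedTypes) {
  const blocked = new Set(blockedTypes);
  // Playwright calls the handler with (route, request)
  return (route, request) => {
    if (blocked.has(request.resourceType())) {
      // abort requests whose type is in the set
      return route.abort();
    }
    // let everything else through
    return route.continue();
  };
}

// Illustrative choice of types; adjust it to what your scraper can live without.
const BLOCKED_TYPES = ["image", "stylesheet", "media"];
const blockHandler = createBlocker(BLOCKED_TYPES);
```

Inside `scrapeData()`, the handler would be registered with `await page.route("**/*", blockHandler);` before calling `page.goto()`.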
Understanding Playwright's Capabilities
Playwright controls a web browser from inside a JavaScript file. While resource intensive, this gives Playwright extreme power and flexibility when using it to scrape data. As you've read in the introduction, Playwright has the power to block content from loading.
When using Playwright, we can block images, CSS, media, script loading, and even HTTP requests! Playwright can actually intercept and block outgoing requests from a website frontend.
If you don't want to wait for a bunch of irrelevant API calls to return values you don't need, you can simply block the API calls from happening!
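As a rough sketch of what blocking API calls can look like, the handler below aborts `xhr` and `fetch` requests whose URL contains one of a few fragments. The fragments (`/analytics`, `/ads/`) and the helper name `createApiBlocker` are hypothetical and only there for illustration.

```javascript
// Hypothetical helper: aborts xhr/fetch requests whose URL contains any given fragment.
function createApiBlocker(urlFragments) {
  return (route, request) => {
    const type = request.resourceType();
    const isApiCall = type === "xhr" || type === "fetch";
    const isUnwanted = urlFragments.some((fragment) => request.url().includes(fragment));
    if (isApiCall && isUnwanted) {
      // an API call we do not need: stop it before it leaves the browser
      return route.abort();
    }
    // everything else is left alone
    return route.continue();
  };
}

// Made-up fragments; replace them with the calls you actually see in the network tab.
const apiBlocker = createApiBlocker(["/analytics", "/ads/"]);
```

The handler is passed to `page.route("**/*", apiBlocker)` in the same way as the other rules in this guide.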
Images and Resources in a Web Page
"Images and Resources" play a pivotal role in shaping the visual and interactive elements of a web page. Let's quickly introduce each of the resources in a web page:
- Images: Graphics and photographs contribute to the aesthetic appeal of a page, convey information, and enhance user engagement.
- CSS (stylesheet): Stylesheets define the layout and presentation of web pages, ensuring a consistent visual theme.
- Media: Media files, encompassing audio and video, contribute to a richer multimedia experience, making web pages more engaging and versatile.
- Scripts: Scripts, often written in JavaScript, add dynamic behavior, enabling features like interactivity and real-time updates.
- XHR & Fetch Requests: Enable web browsers to make HTTP requests to retrieve data from a server or send data to a server.
When finding or filtering content on the page, we can use page.route() to handle different kinds of content on the page. When dealing with routes, we can use **/* to specify a global route that matches every request.
Take a look at the snippet below:
//for all global routes
await page.route("**/*", (route, request) => {
//do whatever we specify in this block of code
})
Playwright has a dedicated method for identifying resources on a web page. The resourceType() method is loosely similar to the typeof keyword in JavaScript: resourceType() returns the type of resource that the browser is trying to access. Once we know the type of the resource, we can then handle each resource in a specific way.
When blocking images, or any other resource for that matter, we use the request.resourceType() method to identify the resource, and we can then use route.abort() to abort the content request. If we wish to allow the resource, we would instead use route.continue().
Here is a more complete (but still incomplete) snippet of what our code will look like.
await page.route("**/*", (route, request) => {
if (request.resourceType() === "someTypeOfResource") {
//do something different here
} else {
//continue with the route as directed by the page
route.continue();
}
});
Why Block Images and Resources?
As previously mentioned, running a browser is a resource intensive task to begin with. Unless we tell the browser otherwise, the browser will try to render all content on the page by default. Rendering not only uses resources, but also takes time. When resources are strained, content takes even longer to load.
Let's pretend a scraper saves 10% of its time when blocking unneeded content. If you're only viewing one page and said page takes 2 seconds to load instead of 2.2 seconds, this probably won't matter.
If you're looking to scrape a bunch of pages using a crawler that runs for 10 hours, the crawler could instead have the job done in 9. When scraping at scale, time and resources are among the most important things to consider.
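The arithmetic behind that estimate is simple enough to sketch. The function name `projectedDuration` is made up, and the 10% saving is the hypothetical figure from the paragraphs above, not a measurement.

```javascript
// Projected duration after removing a fraction of the work.
// The unit of the result is whatever unit is passed in.
function projectedDuration(currentDuration, savedFraction) {
  return currentDuration * (1 - savedFraction);
}

const crawlHours = projectedDuration(10, 0.1);   // the 10 hour crawl
const pageSeconds = projectedDuration(2.2, 0.1); // the single 2.2 second page load

console.log(`crawl: ${crawlHours.toFixed(2)} h, page: ${pageSeconds.toFixed(2)} s`);
```

The same fraction saves a fraction of a second on one page and a full hour on the long crawl, which is why blocking matters most at scale.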
Blocking Images and Resources: A Step-by-Step Guide
Intercepting Network Requests
As mentioned earlier in this article, we use a combination of page.route(), request.resourceType(), and route.abort() to block content. request.resourceType() should always return one of the following types:
- document: A full page document, like a link to another page.
- stylesheet: We can use this to alter or block the style of the page.
- image: An image of some kind to display on the page.
- media: Audio and video files that are used inside of the page display.
- font: Defines a custom font on the web page.
- script: Links and/or loads JavaScript content to the page.
- texttrack: A resource type for handling the <track> element, more specifically: text tracks on media pages.
- xhr: Short for XMLHttpRequest, this type of resource makes an HTTP request to an external source (most often a web server) and returns new information to the client (our web page).
- fetch: Also used for making HTTP requests, but uses a newer API that returns a Promise. This is used mainly for making asynchronous calls and then returning them to the page.
- eventsource: Used to keep a persistent connection to a server. It is often used to update the page based on server-side events.
- websocket: Also used for a continuous connection to the server. It listens for events, and new information is often used to update the page accordingly.
- manifest: Links to a JSON file that describes the page.
- other: Anything that is not listed above. Sometimes includes texttracks and many other non-standard resources.
As I'm sure you've noticed, there are a lot of different resource types. We will only focus on the ones that have significant impact on performance. The list of resources for you to actually remember will be much shorter than the one you see above.
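Because the type names are plain strings, a typo such as "css" instead of "stylesheet" silently matches nothing. One way to guard against that, sketched below, is to check a rule list against the names listed above before using it. `KNOWN_RESOURCE_TYPES` and `findUnknownTypes` are hypothetical names for this example, not Playwright exports.

```javascript
// Resource types from the list above, as returned by request.resourceType().
const KNOWN_RESOURCE_TYPES = new Set([
  "document", "stylesheet", "image", "media", "font", "script", "texttrack",
  "xhr", "fetch", "eventsource", "websocket", "manifest", "other",
]);

// Hypothetical helper: returns the names that are not in the list, e.g. typos.
function findUnknownTypes(types) {
  return types.filter((type) => !KNOWN_RESOURCE_TYPES.has(type));
}

// "css" is not a resource type ("stylesheet" is), so it gets reported.
console.log(findUnknownTypes(["image", "css", "script"]));
```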
Writing Rules to Block Resources
Let's expand our previous snippet into something that actually blocks images from loading. Below is the full code for a scraper that blocks images.
//import playwright
const playwright = require("playwright");
//create an async function to scrape the data
async function scrapeData() {
//launch Chromium
const browser = await playwright.chromium.launch({
headless: false
});
//create a constant "page" instance
const page = await browser.newPage();
//set up rules for all routes on the page
await page.route("**/*", (route, request) => {
//if the resourceType is an image
if (request.resourceType() === "image") {
//abort the route
route.abort();
//if it's something else
} else {
//let it through
route.continue();
}
});
//now that we've set our rules, navigate to the site
await page.goto("https://unsplash.com/");
await page.screenshot({path: "noImages.png"})
//close the browser
await browser.close();
}
//run the scrapeData function
scrapeData();
In the code above, we:
- Import playwright using require()
- Define an async function, scrapeData()
- Launch the browser in headed (non-headless) mode using const browser = await playwright.chromium.launch({headless: false});
- Open a new page
- Set our page rules with page.route()
- Set a rule: If request.resourceType() returns "image", we abort the request.
- Navigate to the page using page.goto()
- Take a screenshot using page.screenshot()
- Close the browser
If you look at the screenshot, you should see something very important:
There are no images on the page!
Handling Edge Cases
In some cases, lazy loaded images are slightly trickier to block. The following example builds on the previous one with one key difference: it also blocks script requests.
//import playwright
const playwright = require("playwright");
//create an async function to scrape the data
async function scrapeData() {
//launch Chromium
const browser = await playwright.chromium.launch({
headless: false
});
//create a constant "page" instance
const page = await browser.newPage();
//set up rules for all routes on the page
await page.route("**/*", (route, request) => {
//if the resourceType is an image
if (request.resourceType() === "image") {
//abort the route
route.abort();
//if the request is JavaScript
} else if (request.resourceType() === "script") {
//abort the route
route.abort();
} else {
//let it through
route.continue();
}
});
//now that we've set our rules, navigate to the site
await page.goto("https://shopping.google.com/m/bestthings/");
await page.screenshot({path: "noScriptImages.png"})
//close the browser
await browser.close();
}
//run the scrapeData function
scrapeData();
The main differences:
- We define another rule, if request.resourceType() === "script", that uses route.abort() to block script requests
Here is a screenshot from running it.
Testing and Debugging
Testing and debugging are vital to ensure that our scraper is indeed working correctly. While unit tests are proper for any large piece of software, we can easily use our screenshots as a test themselves.
Simply run the first example above: if you get a screenshot without images, images are correctly blocked. Run the second example in the same way to check that script is being blocked as well.
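Screenshots show the end result, but not which requests were stopped. One possible complement, sketched below, is a handler that tallies its decisions per resource type so they can be printed after the run. `createCountingBlocker` and the shape of `stats` are assumptions made for this example.

```javascript
// Hypothetical helper: blocks the given types and counts every decision.
function createCountingBlocker(blockedTypes) {
  const blocked = new Set(blockedTypes);
  // one counter object for aborted requests, one for requests let through
  const stats = { aborted: {}, continued: {} };
  const handler = (route, request) => {
    const type = request.resourceType();
    const shouldAbort = blocked.has(type);
    const bucket = shouldAbort ? stats.aborted : stats.continued;
    bucket[type] = (bucket[type] || 0) + 1;
    return shouldAbort ? route.abort() : route.continue();
  };
  return { handler, stats };
}
```

After registering `handler` with `page.route("**/*", handler)` and letting `page.goto()` resolve, `console.log(stats)` shows how many requests of each type were aborted or allowed.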
Advanced Concepts
Handling Dynamic Content Loading
If we wish to block dynamic content, we simply block script requests as we did in the example above. When we block script, external JavaScript files never load, so the code in them never runs inside the page. Inline scripts embedded in the HTML document are not separate requests, so this rule does not stop them.
Without those scripts, most dynamic content never appears. If we wish to filter our content (or JavaScript) more finely, we can use conditional blocking to refine our page.
What if we only want to allow certain images through?
We can combine multiple conditions using the || and && operators. If you are coming from Python: most C-descendant languages (JavaScript is one of them) use || as the equivalent of the or keyword that you've become accustomed to, and && as the equivalent of the and keyword.
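To see the two operators side by side, here is a small sketch of a predicate that combines them. The function name `shouldBlock` and the example.com URLs are hypothetical. The rule it encodes (block images or media unless the URL contains "essential") is just one illustrative choice.

```javascript
// Hypothetical predicate: true when a request should be blocked.
function shouldBlock(resourceType, url) {
  // || : either type counts as "heavy"
  const isHeavy = resourceType === "image" || resourceType === "media";
  // && : both conditions must hold at the same time
  return isHeavy && !url.includes("essential");
}

console.log(shouldBlock("image", "https://example.com/banner.png"));         // true
console.log(shouldBlock("image", "https://example.com/essential/logo.png")); // false
console.log(shouldBlock("script", "https://example.com/app.js"));            // false
```

Inside a route handler, the predicate would be called as `shouldBlock(request.resourceType(), request.url())` to choose between `route.abort()` and `route.continue()`.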
Conditional Resource Blocking
The snippet below redefines our image blocker so that only certain images get through. In this case we block every image except those whose URL includes the string "essential".
if (request.resourceType() === "image" && !request.url().includes("essential")) {
route.abort();
}
In the above snippet, we showed a theoretical example that blocks every image unless its URL contains the word "essential".
Let's put this into action and see what happens.
//import playwright
const playwright = require("playwright");
//create an async function to scrape the data
async function scrapeData() {
//launch Chromium
const browser = await playwright.chromium.launch({
headless: false
});
//create a constant "page" instance
const page = await browser.newPage();
//set up rules for all routes on the page
await page.route("**/*", (route, request) => {
//if the resourceType is a "premium" image
if (request.resourceType() === "image" && request.url().includes("premium")) {
//abort the route
route.abort();
//if it's something else
} else {
//let it through
route.continue();
}
});
//now that we've set our rules, navigate to the site
await page.goto("https://unsplash.com/");
await page.screenshot({path: "conditionalImages.png"})
//close the browser
await browser.close();
}
//run the scrapeData function
scrapeData();
There is one main difference between this example and our first image blocker:
- if (request.resourceType() === "image" && request.url().includes("premium")) uses the && operator so that the condition is only met if our resource is an image and its URL contains the word "premium".
Take a look at the screenshot and you'll see that only certain images have been blocked.
Real-World Applications
Let's build a couple of real-world examples that each visit a list of different websites.
Images and Script Allowed
//import playwright
const playwright = require("playwright");
//create an async function to scrape the data
async function scrapeData() {
const urls = ["https://quotes.toscrape.com", "https://www.amazon.com", "https://ebay.com"]
//launch Chromium
const browser = await playwright.chromium.launch({
headless: false
});
//create a constant "page" instance
const page = await browser.newPage();
//no routing rules in this version: every resource is allowed to load
//navigate to each site in turn
for (var i=0; i<urls.length; i++) {
await page.goto(urls[i]);
}
//close the browser
await browser.close();
}
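//run the scrapeData function and time it
console.time();
//scrapeData() is async, so stop the timer only after its promise settles
scrapeData().finally(() => console.timeEnd());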
Images and Script Blocked
//import playwright
const playwright = require("playwright");
//create an async function to scrape the data
async function scrapeData() {
const urls = ["https://quotes.toscrape.com", "https://www.amazon.com", "https://ebay.com"]
//launch Chromium
const browser = await playwright.chromium.launch({
headless: false
});
//create a constant "page" instance
const page = await browser.newPage();
//set up rules for all routes on the page
await page.route("**/*", (route, request) => {
//if the resourceType is an image
if (request.resourceType() === "image") {
//abort the route
route.abort();
//if the request is JavaScript
} else if (request.resourceType() === "script") {
//abort the route
route.abort();
} else {
//let it through
route.continue();
}
});
//now that we've set our rules, navigate to the site
for (var i=0; i<urls.length; i++) {
await page.goto(urls[i]);
}
//close the browser
await browser.close();
}
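//run the scrapeData function and time it
console.time();
//scrapeData() is async, so stop the timer only after its promise settles
scrapeData().finally(() => console.timeEnd());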
The two examples finished on our test machine (Lenovo Ideapad 1i) with the following times:
- Images and Script Allowed: 4.731 seconds
- Images and Script Blocked: 3.813 seconds
The scraper that blocked images and script finished approximately 0.9 seconds faster, a speed boost of almost 20%. At scale, this 20% is huge.
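The percentage can be checked directly from the two timings reported above. This is only the arithmetic on those two numbers; the variable names are chosen for this example.

```javascript
// Timings reported above, in seconds.
const allowedSeconds = 4.731;
const blockedSeconds = 3.813;

// absolute and relative saving of the blocked run versus the allowed run
const savedSeconds = allowedSeconds - blockedSeconds;
const savedPercent = (savedSeconds / allowedSeconds) * 100;

console.log(`${savedSeconds.toFixed(3)} s saved (${savedPercent.toFixed(1)}%)`);
// prints "0.918 s saved (19.4%)"
```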
Best Practices and Considerations
When blocking resources on any web page, be careful and pay attention to what you're blocking. You don't want to block script resources if the info you need gets fetched by a JavaScript function! You also don't want to block all images if you're only looking for a specific subset of images.
Always write your code carefully and write your conditions with your end goal in mind. If you don't, you could spend an hour debugging only to find out that you wrote an if statement incorrectly!
Conclusion
Congratulations! You've finished this tutorial. You should now have a solid grasp of page.route(), and of the resourceType() and abort() methods that are used to block resources on a web page.
Go on and build a scraper. Experiment on a few other sites and see how quickly you can scrape them!
You can visit the Official Playwright Docs to get more information.
More Web Scraping Guides
Want to learn more? Check out the links below: