Puppeteer Guide: How to Capture Background XHR Requests
Puppeteer is a robust tool widely used for automating web browsers, particularly in testing and web scraping scenarios. However, a common challenge that users encounter is the need to capture background XMLHttpRequests (XHR).
In this article, we'll explore the causes of this challenge and provide effective solutions to overcome it.
- TLDR - How to Capture Background XHR Requests with Puppeteer
- What is XHR and Why is it Important?
- Benefits of Scraping Background Requests
- XHR Request Capture Methods
- Real-World Example: Capturing XHR Requests in Puppeteer
- Common Challenges and Their Solutions in Capturing Background XHR Requests
- Conclusion
TLDR - How to Capture Background XHR Requests with Puppeteer
The most reliable way to capture XHR requests in Puppeteer is request interception, which pauses each request in the browser before it is sent to the target server.
While a request is paused you can inspect, modify, or abort it, and a separate response listener lets you capture the response content as well.
The following code is a simple example of this.
const puppeteer = require("puppeteer");
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Enable Request Interception
  await page.setRequestInterception(true);
  // Listen for Request events
  page.on("request", (request) => {
    // Catch XHR requests, ignoring XHR requests to Google Analytics
    if (
      request.resourceType() === "xhr" &&
      !request.url().includes("google-analytics")
    ) {
      // Capture some XHR request data and log it to the console
      console.log("XHR Request", request.method(), request.url());
      console.log("Headers", request.headers());
      console.log("Post Data", request.postData());
    }
    // Allow the request to be sent
    request.continue();
  });
  // Listen for response events
  page.on("response", async (response) => {
    const request = response.request();
    // Only check responses for XHR requests and ignore google-analytics
    if (
      request.resourceType() === "xhr" &&
      !request.url().includes("google-analytics")
    ) {
      // Log response status and URL
      console.log("XHR Response", response.status(), response.url());
      // Log response headers
      console.log("Headers", response.headers());
      // Log response body; reading it can fail, so catch errors
      try {
        console.log("Body", await response.text());
      } catch (error) {
        console.log("Body unavailable:", error.message);
      }
    }
  });
  // Navigate to page and wait until network activity (including XHR) has settled
  await page.goto("https://www.scrapethissite.com/pages/ajax-javascript/#2015", {
    waitUntil: "networkidle0",
  });
  await browser.close();
})();
From the above code you see how we set up request and response interception to begin working with XHR requests.
- The two key portions are page.on("request", ...) and page.on("response", ...), which handle requests and responses respectively.
- In both handlers, we verify that the request's resourceType() is equal to "xhr".
- In the same if statement, we also check that the request URL does not include "google-analytics", because we are not concerned with that request for this page.
- In the response handler, we first have to get the request using the response.request() method before we can check its resourceType().
- In the request handler, we log the method, URL, headers and POST body (if there is one).
- In the response handler, we log the status code, URL, response headers and response body.
- Finally, in the request handler, we call continue() so that the XHR request is sent. Once interception is enabled, a request stays paused until it is continued, aborted, or answered. The response handler has no equivalent step because a response that has already arrived cannot be allowed or denied. If you want to control what the page receives, the technique to look into is mocking responses.
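Because the response handler cannot alter a response, serving your own response has to happen in the request handler. The following is a minimal sketch of that idea, assuming Puppeteer's request.respond() with interception enabled. The names buildMockResponse and mockXhr, and the urlFragment parameter, are hypothetical and only used for illustration.

```javascript
// Hypothetical helper: builds the options object passed to request.respond()
function buildMockResponse(data, status = 200) {
  return {
    status,
    contentType: "application/json",
    body: JSON.stringify(data),
  };
}

// Hypothetical wiring: call once on a page before navigating
async function mockXhr(page, urlFragment, data) {
  await page.setRequestInterception(true);
  page.on("request", (request) => {
    const isTarget =
      request.resourceType() === "xhr" && request.url().includes(urlFragment);
    if (isTarget) {
      // Answer from the script instead of the network
      request.respond(buildMockResponse(data));
    } else {
      // Every other request goes out unchanged
      request.continue();
    }
  });
}
```

Each intercepted request must be handled exactly once, which is why the sketch either responds or continues, never both.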
What is XHR and Why is it Important?
What is XMLHttpRequest (XHR)
In simple terms, XHR (XMLHttpRequest) is a JavaScript API that allows web browsers to communicate with a server asynchronously. This means your web application can send requests and receive responses from the server without reloading the entire page.
Think of it like a behind-the-scenes messenger, fetching data without users ever noticing.
XHR plays a crucial role in creating dynamic and interactive web experiences. It powers features like:
- Live updates: XHR enables features like live chat, stock tickers, and dynamic news feeds that update without refreshing.
- Partial page rendering: Updating specific sections of a page (e.g., a shopping cart) without reloading the entire layout.
- Single-page applications (SPAs): XHR helps SPAs communicate with the server behind the scenes, creating seamless navigation and fluid user interactions.
Why Capturing XHR Requests Matters
Understanding and capturing background XHR requests is essential for various reasons:
- Debugging and testing: Analyzing XHR calls is crucial for debugging application logic and ensuring correct data retrieval.
- Security analysis: Capturing XHR requests helps identify potential security vulnerabilities like unauthorized data leaks.
- Performance optimization: Monitoring XHR performance can reveal bottlenecks and opportunities for improving website responsiveness.
- Data scraping: In specific contexts, capturing XHR requests can be used for data extraction and analysis.
Benefits of Scraping Background Requests
While traditional scraping tools skim the HTML surface, background XHR scraping grants access to a hidden data treasure trove. Here's why it shines:
- Deeper Data Dives:
- Uncover hidden product details, reviews, and dynamic info.
- Capture real-time data like auction bids or stock tickers.
- Enhanced Accuracy & Reliability:
- Bypass HTML obfuscation to grab raw, unmanipulated data.
- Capture dynamic content that traditional scraping misses.
- Improved Efficiency & Scalability:
- Reduce download size and scraping latency.
- Minimize server load for smooth, scalable scraping.
- New Frontiers Unleashed:
- Extract user interactions, analytics, and even chat logs.
- Open doors for data analysis, research, and market insights.
XHR Request Capture Methods
There are multiple parts to capturing XHR background requests from a page. The most important ones are:
- Capturing XHR Requests
- Capturing XHR Responses
- Capturing Requests and Responses
The request examples below rely on Request Interception, which also lets you modify or block requests; listening for responses works without it. There are many resource types you can intercept and a variety of uses for request interception. For the purposes of this guide we will only discuss interception with XHR.
Capturing XHR Requests
This is the building block for interacting with XHR requests in Puppeteer with Node.js. You can intercept and log all network requests in the browser, but we will filter for XHR requests specifically.
const puppeteer = require("puppeteer");
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Enable Request Interception
  await page.setRequestInterception(true);
  // Listen for Request events
  page.on("request", (request) => {
    // If statement to catch XHR requests
    if (request.resourceType() === "xhr") {
      console.log(request.url() + " - " + request.method());
    }
    // Allow the request to be sent
    request.continue();
  });
  // Navigate to page and wait until network activity (including XHR) has settled
  await page.goto("https://www.scrapethissite.com/pages/ajax-javascript/#2015", {
    waitUntil: "networkidle0",
  });
  await browser.close();
})();
In the above code, we enable request interception and then listen for requests. When handling requests we only process XHR requests and we log some basic info about them.
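Many sites issue background requests with the Fetch API rather than XMLHttpRequest, and Puppeteer reports those with a resourceType() of "fetch" instead of "xhr". The sketch below shows one way to treat both as background requests. The isBackgroundRequest helper and its ignored parameter are hypothetical names, not part of Puppeteer.

```javascript
// Resource types Puppeteer reports for XMLHttpRequest and Fetch API calls
const BACKGROUND_TYPES = new Set(["xhr", "fetch"]);

// Hypothetical helper: true for XHR/fetch requests whose URL
// contains none of the ignored fragments
function isBackgroundRequest(request, ignored = ["google-analytics"]) {
  if (!BACKGROUND_TYPES.has(request.resourceType())) return false;
  const url = request.url();
  return !ignored.some((fragment) => url.includes(fragment));
}

// Usage inside a request handler (with interception enabled):
// page.on("request", (request) => {
//   if (isBackgroundRequest(request)) {
//     console.log(request.method(), request.url());
//   }
//   request.continue();
// });
```

If a page you expect to produce XHR traffic logs nothing, checking for "fetch" is a reasonable first thing to try.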
Capturing XHR Responses
Using a similar format, we can attach a listener to the "response" event. Note that we do not need to enable request interception to listen for responses.
const puppeteer = require("puppeteer");
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Listen for response events (no request interception needed)
  page.on("response", (response) => {
    const request = response.request();
    // Only check responses for XHR requests
    if (request.resourceType() === "xhr") {
      console.log(response.status(), response.url());
    }
  });
  // Navigate to page and wait until network activity (including XHR) has settled
  await page.goto("https://www.scrapethissite.com/pages/ajax-javascript/#2015", {
    waitUntil: "networkidle0",
  });
  await browser.close();
})();
In the above code, we attach a handler to the "response" event. This handler extracts the request object from the response object. From there we verify the request is actually an XHR request and then print some basic information about the response.
Capturing Requests and Responses
Finally, you can put both together to capture requests and responses in the same program.
const puppeteer = require("puppeteer");
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Enable Request Interception
  await page.setRequestInterception(true);
  // Listen for Request events
  page.on("request", (request) => {
    // If statement to catch XHR requests
    if (request.resourceType() === "xhr") {
      console.log(request.url() + " - " + request.method());
    }
    // Allow the request to be sent
    request.continue();
  });
  // Listen for response events
  page.on("response", (response) => {
    const request = response.request();
    // Only check responses for XHR requests
    if (request.resourceType() === "xhr") {
      console.log(response.status(), response.url());
    }
  });
  // Navigate to page and wait until network activity (including XHR) has settled
  await page.goto("https://www.scrapethissite.com/pages/ajax-javascript/#2015", {
    waitUntil: "networkidle0",
  });
  await browser.close();
})();
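When both events are captured, it is often more useful to store each request together with its response than to log them separately. The following is a rough sketch of one way to do that. It assumes that response.request() returns the same request object that was passed to the "request" handler, and createExchangeLog is a hypothetical helper name.

```javascript
// Hypothetical helper: pairs each XHR request with its response in one record
function createExchangeLog() {
  const pending = new Map(); // request object -> record awaiting a response
  const exchanges = []; // completed request/response records
  return {
    exchanges,
    onRequest(request) {
      if (request.resourceType() !== "xhr") return;
      pending.set(request, {
        method: request.method(),
        url: request.url(),
        status: null,
      });
    },
    onResponse(response) {
      const request = response.request();
      const record = pending.get(request);
      if (!record) return; // not an XHR request we recorded
      record.status = response.status();
      pending.delete(request);
      exchanges.push(record);
    },
  };
}

// Usage:
// const log = createExchangeLog();
// page.on("request", (request) => { log.onRequest(request); request.continue(); });
// page.on("response", (response) => log.onResponse(response));
// ...after navigation, log.exchanges holds one record per completed XHR
```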
Real-World Example: Capturing XHR Requests in Puppeteer
One real-world example is trying to catch responses to requests that are fired after the page loads. In the following example, we are looking at an auction listing website.
The website loads a page for the listing which fires a web request to load the listing details.
The following code catches that web request and works directly with the JSON so that we do not have to worry about manipulating the HTML or extracting data from the browser.
// Import the puppeteer library
const puppeteer = require("puppeteer");
// Immediately invoked function expression (IIFE) to handle async/await
(async () => {
  // Launch a new browser instance
  const browser = await puppeteer.launch();
  // Create a new page in the browser
  const page = await browser.newPage();
  // Listen for all response events on the page
  page.on("response", async (response) => {
    // If the response URL includes "bid-status", process the response
    if (response.url().includes("bid-status")) {
      try {
        // Parse the response body as JSON
        const json = await response.json();
        // Log the relevant information from the response
        console.log("-- Listing: " + json.listing_id + " --");
        console.log("Bid Count: " + json.bid_count);
        console.log("Current Bid: " + json.current_bid_amount);
        console.log("Starting Bid: " + json.start_bid_amount);
        console.log("Bid Increment: " + json.bid_increment);
      } catch (error) {
        // The body was not valid JSON or could not be read
        console.log("Could not read bid-status response:", error.message);
      }
    }
  });
  // Navigate to the auction website
  await page.goto("https://www.auction.com/");
  // Extract the href attribute from every anchor tag on the page
  const hrefList = await page.$$eval("a", (anchors) =>
    anchors.map((anchor) => anchor.href)
  );
  // Loop over each href
  for (const href of hrefList) {
    // If the href includes "/details/", navigate to that page
    if (href.includes("/details/")) {
      await page.goto(href);
      // Give the background request a moment to complete
      await new Promise((r) => setTimeout(r, 1000));
    }
  }
  await browser.close();
})();
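Logging is fine for a demonstration, but a scraper usually wants the captured JSON as structured records. As a small sketch, the fields logged above could be collected like this. The field names come from the example, while summarizeBidStatus and the listings array are hypothetical names.

```javascript
// Hypothetical helper: keeps only the fields logged in the example above
function summarizeBidStatus(json) {
  return {
    listingId: json.listing_id,
    bidCount: json.bid_count,
    currentBid: json.current_bid_amount,
    startingBid: json.start_bid_amount,
    bidIncrement: json.bid_increment,
  };
}

// One record per captured bid-status response
const listings = [];

// Usage inside the response handler:
// const json = await response.json();
// listings.push(summarizeBidStatus(json));
```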
Common Challenges and Their Solutions in Capturing Background XHR Requests
Capturing background XHR requests with Puppeteer isn't always a smooth ride.
Here's a guide to navigate common challenges and keep your data flowing:
Handling Asynchronous Requests
XHR requests often pop up unexpectedly and may require asynchronous work, making it tough to capture them all.
Solution:
Employ Promise.all or similar techniques to patiently wait for request completion. Track ongoing requests and responses using counters or flags.
const requestPromises = [];
page.on("request", (request) => {
  requestPromises.push(
    new Promise((resolve, reject) => {
      // Handle request logic here
      resolve(); // Or reject() if an error occurs
    })
  );
  request.continue();
});
await Promise.all(requestPromises); // Wait for all requests to finish
The above code is an example of how you can perform asynchronous work on outgoing requests and then make sure you wait for all of the work to be completed before continuing. The same can be applied to the "response" listener as well.
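The counter approach mentioned in the solution can be sketched as follows. It relies on Puppeteer's "request", "requestfinished" and "requestfailed" page events. The createInflightTracker helper and its method names are hypothetical, and this is only one possible way to structure it.

```javascript
// Hypothetical helper: counts requests that have started but not yet
// finished or failed
function createInflightTracker() {
  let inflight = 0;
  let waiters = [];
  const settle = () => {
    inflight = Math.max(0, inflight - 1);
    if (inflight === 0) {
      // Wake up everything waiting for the page to go quiet
      waiters.forEach((resolve) => resolve());
      waiters = [];
    }
  };
  return {
    attach(page) {
      page.on("request", () => {
        inflight += 1;
      });
      page.on("requestfinished", settle);
      page.on("requestfailed", settle);
    },
    count: () => inflight,
    // Resolves immediately when nothing is pending,
    // otherwise when the counter drops back to zero
    whenIdle: () =>
      inflight === 0
        ? Promise.resolve()
        : new Promise((resolve) => waiters.push(resolve)),
  };
}

// Usage:
// const tracker = createInflightTracker();
// tracker.attach(page);
// await page.goto(url);
// await tracker.whenIdle();
```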
Handling Redirects
XHR requests might take detours through redirects, requiring you to capture both the initial request and the final destination.
Solution:
Use the request.redirectChain() method to track these twists and turns in your code and check for redirect status codes.
page.on("response", async (response) => {
  const request = response.request();
  const redirects = request.redirectChain();
  // status() is a method; 3xx codes indicate a redirect
  const status = response.status();
  if (status >= 300 && status < 400) {
    if (redirects.length > 0) {
      console.log(
        request.url() + " has been redirected [" + redirects.length + "]"
      );
    }
  } else if (redirects.length > 0) {
    console.log(
      response.url() + " finished with " + redirects.length + " redirects"
    );
  }
});
The above code checks if incoming responses are a redirect by using status codes. Otherwise it checks if a non-redirect response has still originated from redirects by checking the length of the redirectChain().
Capturing POST Request Data
Extracting data sent in POST requests can be crucial but not always obvious.
Solution:
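The request.postData() method returns the POST body when it is readily available, as in the TLDR example. In recent Puppeteer versions, request.hasPostData() tells you whether a body exists, and the asynchronous request.fetchPostData() method retrieves it from the browser when postData() returns undefined.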
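The POST body arrives as a raw string, so it usually needs decoding before it is useful. The sketch below shows one way to do that based on the request's content-type header. parsePostData is a hypothetical helper, and it only covers JSON and URL-encoded form bodies.

```javascript
// Hypothetical helper: decodes a raw POST body based on the request's
// content-type header
function parsePostData(raw, contentType = "") {
  if (raw === undefined || raw === null) return null;
  if (contentType.includes("application/json")) {
    try {
      return JSON.parse(raw);
    } catch {
      return raw; // Not valid JSON after all; keep the raw string
    }
  }
  if (contentType.includes("application/x-www-form-urlencoded")) {
    // Turn "a=1&b=two" into { a: "1", b: "two" }
    return Object.fromEntries(new URLSearchParams(raw));
  }
  return raw;
}

// Usage inside a request handler:
// const body = parsePostData(
//   request.postData(),
//   request.headers()["content-type"]
// );
```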
Managing Diverse Response Data
Response bodies come in various shapes and sizes (JSON, XML, binary, etc.), requiring careful handling.
Solution:
Check the Content-Type header and handle each type accordingly, using methods like response.json(), response.text(), or response.buffer().
page.on("response", async (response) => {
  if (response.request().resourceType() === "xhr") {
    // headers() returns a plain object with lower-case header names
    const contentType = response.headers()["content-type"] || "";
    if (contentType.includes("application/json")) {
      const data = await response.json();
      // Handle JSON data here
    } else if (contentType.includes("text/html")) {
      const text = await response.text();
      // Handle text or HTML data here
    } else {
      // Handle other content types as needed
    }
  }
});
Handling Large Volumes of Data
Data overload can overwhelm memory.
Solution:
Be selective! Store only essential data and consider writing to files or databases instead of keeping everything in memory.
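One simple way to keep captured data out of memory is to append each record to a JSON Lines file as it arrives. The following is a minimal sketch using Node's built-in fs module. appendRecord and the file name in the usage comment are hypothetical, and a write stream or database may suit higher volumes better.

```javascript
const fs = require("node:fs");

// Hypothetical helper: appends one record per line (JSON Lines)
// instead of accumulating records in an array
function appendRecord(filePath, record) {
  fs.appendFileSync(filePath, JSON.stringify(record) + "\n");
}

// Usage inside a response handler, storing only the essential fields:
// appendRecord("xhr-capture.jsonl", {
//   url: response.url(),
//   status: response.status(),
// });
```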
Mastering Cookies and Session Data
Accurate XHR capture often relies on proper cookie handling, especially in authenticated sessions.
Solution:
Use Puppeteer's cookie management methods (page.cookies(), page.setCookie()) to control cookies like a pro. Ensure your session is properly authenticated if needed.
Check our Puppeteer Guide - Managing Cookies in order to master cookie management in Puppeteer.
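As a small illustration, the cookie objects returned by page.cookies() each carry a name and a value, which is enough to rebuild a Cookie header if you want to replay a captured XHR request outside the browser. toCookieHeader is a hypothetical helper written for this sketch.

```javascript
// Hypothetical helper: turns cookie objects ({ name, value, ... })
// into a single Cookie header string
function toCookieHeader(cookies) {
  return cookies.map((cookie) => cookie.name + "=" + cookie.value).join("; ");
}

// Usage:
// const cookies = await page.cookies();
// const header = toCookieHeader(cookies);
//
// Restoring a saved session on a fresh page before navigating:
// await page.setCookie(...cookies);
```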
Embracing the Unexpected: Error Handling
Network hiccups and server errors can disrupt requests.
Solution:
Implement graceful error handling with try-catch blocks and the requestfailed event to keep your script resilient.
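A sketch of both ideas follows, assuming that request.failure() returns an object with an errorText property for failed requests and null otherwise. describeFailure and readBodySafely are hypothetical helper names.

```javascript
// Hypothetical helper: formats a failed request for logging
function describeFailure(request) {
  const failure = request.failure(); // null when the request did not fail
  const reason = failure ? failure.errorText : "unknown error";
  return request.method() + " " + request.url() + " failed: " + reason;
}

// Hypothetical helper: reads a response body without letting a
// read error crash the script
async function readBodySafely(response) {
  try {
    return await response.text();
  } catch (error) {
    return null; // Body unavailable
  }
}

// Usage:
// page.on("requestfailed", (request) => console.log(describeFailure(request)));
// page.on("response", async (response) => {
//   const body = await readBodySafely(response);
//   if (body !== null) console.log(body.length);
// });
```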
Creating Private Spaces: Browser Context and Isolation
Isolating different scraping tasks or sessions within the same script can get messy.
Solution:
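Use different browser contexts (browser.createIncognitoBrowserContext()) for each task to maintain session privacy and avoid conflicts. Note that Puppeteer v22 renamed this method to browser.createBrowserContext().
// On Puppeteer v22 and later, call browser.createBrowserContext() instead
const context = await browser.createIncognitoBrowserContext();
const page = await context.newPage();
// Set up request interception here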
Conclusion
Capturing background XHR requests with Puppeteer is a powerful technique. In this article you have learned a number of ways to work with outgoing requests and incoming responses in the Puppeteer browser.
More Web Scraping Guides
If you would like to learn more about Web Scraping with Puppeteer, then be sure to check out The Puppeteer Web Scraping Playbook.