Puppeteer Guide: How to Capture Background XHR Requests

Puppeteer is a robust tool widely used for automating web browsers, particularly in testing and web scraping scenarios. However, a common challenge that users encounter is the need to capture background XMLHttpRequests (XHR).

In this article, we'll explore the causes of this challenge and provide effective solutions to overcome it.

TLDR - How to Capture Background XHR Requests with Puppeteer
What is XHR and Why is it Important?
Benefits of Scraping Background Requests
XHR Request Capture Methods
Real-World Example: Capturing XHR Requests in Puppeteer
Common Challenges and Their Solutions in Capturing Background XHR Requests
Conclusion

TLDR - How to Capture Background XHR Requests with Puppeteer

The most direct way to capture XHR requests in Puppeteer is request interception. With interception enabled, the browser pauses each request before it is sent to the target, so you can inspect, modify, or block it.

Combined with a "response" listener, this lets you capture both what the page sends and what the server returns.

The following code is a simple example of this.

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Enable Request Interception
  await page.setRequestInterception(true);

  // Listen for Request events
  page.on("request", (request) => {
    // If statement to catch XHR requests and Ignore XHR requests to Google Analytics
    if (
      request.resourceType() === "xhr" &&
      !request.url().includes("google-analytics")
    ) {
      // Capture some XHR request data and log it to the console
      console.log("XHR Request", request.method(), request.url());
      console.log("Headers", request.headers());
      console.log("Post Data", request.postData());
    }

    // Allow the request to be sent
    request.continue();
  });

  // Listen for response events
  page.on("response", (response) => {
    const request = response.request();
    // Only check responses for XHR requests and ignore google-analytics
    if (
      request.resourceType() === "xhr" &&
      !request.url().includes("google-analytics")
    ) {
      // Log response status and URL
      console.log("XHR Response", response.status(), response.url());
      // Log response headers
      console.log("Headers", response.headers());
      // Log response body
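      // Reading the body can fail (for example on redirects), so handle the rejection
      response
        .text()
        .then((text) => console.log("Body", text))
        .catch((err) => console.log("Body unavailable", err.message));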
    }
  });

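  // Navigate to page and wait for the network to go idle so background XHR calls can finish
  await page.goto("https://www.scrapethissite.com/pages/ajax-javascript/#2015", {
    waitUntil: "networkidle0",
  });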

  await browser.close();
})();

The code above sets up a request handler and a response handler for working with XHR traffic.

The two key portions are page.on("request", ...) and page.on("response", ...), which handle requests and responses respectively.
In both handlers, we verify that the request resourceType is equal to "xhr".
In the same if statement, we also check that the request URL does not include "google-analytics", because we are not concerned with that request for this page.
In the response handler, we first have to get the request using the response.request() method before we can check the resourceType.
In the request handler, we log the method, URL, headers and POST body (if there is one).
In the response handler, we log the status code, URL, response headers and response body.
Finally, in the request handler, we call continue() so that the XHR request is sent. With interception enabled, every intercepted request must be continued, aborted, or responded to, otherwise it stalls. The response handler needs no such call because a response that has already arrived cannot be allowed or denied. If you want to return a substitute (mocked) response, fulfil the request yourself in the request handler with request.respond().

What is XHR and Why is it Important?

What is XMLHttpRequest (XHR)

In simple terms, XHR (XMLHttpRequest) is a JavaScript API that allows web browsers to communicate with a server asynchronously. This means your web application can send requests and receive responses from the server without reloading the entire page.

Think of it like a behind-the-scenes messenger, fetching data without users ever noticing.

XHR plays a crucial role in creating dynamic and interactive web experiences. It powers features like:

Live updates: XHR enables features like live chat, stock tickers, and dynamic news feeds that update without refreshing.
Partial page rendering: Updating specific sections of a page (e.g., a shopping cart) without reloading the entire layout.
Single-page applications (SPAs): XHR helps SPAs communicate with the server behind the scenes, creating seamless navigation and fluid user interactions.

Why Capturing XHR Requests Matters

Understanding and capturing background XHR requests is essential for various reasons:

Debugging and testing: Analyzing XHR calls is crucial for debugging application logic and ensuring correct data retrieval.
Security analysis: Capturing XHR requests helps identify potential security vulnerabilities like unauthorized data leaks.
Performance optimization: Monitoring XHR performance can reveal bottlenecks and opportunities for improving website responsiveness.
Data scraping: In specific contexts, capturing XHR requests can be used for data extraction and analysis.

Benefits of Scraping Background Requests

While traditional scraping tools skim the HTML surface, background XHR scraping grants access to a hidden data treasure trove. Here's why it shines:

Deeper Data Dives:
- Uncover hidden product details, reviews, and dynamic info.
- Capture real-time data like auction bids or stock tickers.
Enhanced Accuracy & Reliability:
- Bypass HTML obfuscation to grab raw, unmanipulated data.
- Capture dynamic content that traditional scraping misses.
Improved Efficiency & Scalability:
- Reduce download size and scraping latency.
- Minimize server load for smooth, scalable scraping.
New Frontiers Unleashed:
- Extract user interactions, analytics, and even chat logs.
- Open doors for data analysis, research, and market insights.

XHR Request Capture Methods

There are multiple parts to capturing XHR background requests from a page. The most important ones are:

Capturing XHR Requests
Capturing XHR Responses
Capturing Requests and Responses

The request examples in this guide use Request Interception, while responses can be observed with an event listener alone. Interception can be applied to many resource types and has a variety of uses, but for the purposes of this guide we will only discuss XHR.

Capturing XHR Requests

This is the building block for interacting with XHR requests in Puppeteer with Node.js. You can intercept and log all network requests in the browser, but we will filter for XHR requests specifically.

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Enable Request Interception
  await page.setRequestInterception(true);

  // Listen for Request events
  page.on("request", (request) => {
    // If statement to catch XHR requests
    if (request.resourceType() === "xhr") {
      console.log(request.url() + " - " + request.method());
    }

    // Allow the request to be sent
    request.continue();
  });

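  // Navigate to page and wait for the network to go idle so background XHR calls can finish
  await page.goto("https://www.scrapethissite.com/pages/ajax-javascript/#2015", {
    waitUntil: "networkidle0",
  });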

  await browser.close();
})();

In the above code, we enable request interception and then listen for requests. When handling requests we only process XHR requests and we log some basic info about them.
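Many modern pages make background calls with the Fetch API rather than XMLHttpRequest, and Puppeteer reports those with the resource type "fetch" instead of "xhr". The following is a minimal sketch of a filter that accepts both. The helper name isBackgroundRequest and its ignore list are illustrative assumptions, not part of Puppeteer.

```javascript
// Hypothetical helper: decides whether a Puppeteer HTTPRequest is a
// background data call worth capturing.
const BACKGROUND_TYPES = new Set(["xhr", "fetch"]);

function isBackgroundRequest(request, ignoredSubstrings = ["google-analytics"]) {
  // resourceType() is "xhr" for XMLHttpRequest and "fetch" for fetch() calls
  if (!BACKGROUND_TYPES.has(request.resourceType())) {
    return false;
  }
  const url = request.url();
  // Skip URLs containing any ignored substring (e.g. analytics beacons)
  return !ignoredSubstrings.some((part) => url.includes(part));
}
```

Inside a "request" or "response" handler, a call such as isBackgroundRequest(request) could stand in for the request.resourceType() === "xhr" check.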

Capturing XHR Responses

Using a similar format, we can attach a listener to the "response" event. Note that we do not need to enable request interception to listen for responses.

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Listen for response events (no request interception needed)
  page.on("response", (response) => {
    const request = response.request();
    // Only check responses for XHR requests
    if (request.resourceType() === "xhr") {
      console.log(response.status(), response.url());
    }
  });

  // Navigate to page and wait for the network to go idle so background XHR calls can finish
  await page.goto("https://www.scrapethissite.com/pages/ajax-javascript/#2015", {
    waitUntil: "networkidle0",
  });

  await browser.close();
})();

In the above code, we attach a handler to the "response" event. This handler extracts the request object from the response object. From there we verify the request is actually an XHR request and then print some basic information about the response.
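Logging is enough for a quick look, but a scraper usually wants to keep the responses. As a rough sketch, the handler can be wrapped in a small collector that also reads each body and tolerates bodies that cannot be read. The name collectXhrResponses and the shape of the stored records are assumptions made for illustration.

```javascript
// Hypothetical helper: records XHR responses seen on a Puppeteer page.
// `page` only needs to emit "response" events, as a Puppeteer Page does.
function collectXhrResponses(page) {
  const records = [];
  const pendingBodies = [];

  page.on("response", (response) => {
    if (response.request().resourceType() !== "xhr") return;

    const record = { url: response.url(), status: response.status(), body: null };
    records.push(record);

    // response.text() can reject (for example on redirect responses),
    // so failures are stored on the record instead of being thrown.
    pendingBodies.push(
      response
        .text()
        .then((text) => {
          record.body = text;
        })
        .catch((err) => {
          record.error = err.message;
        })
    );
  });

  return {
    records,
    // Resolves once every body read started so far has settled
    settle: () => Promise.all(pendingBodies),
  };
}
```

A script would call collectXhrResponses(page) before page.goto(), then await settle() before closing the browser and reading records.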

Capturing Requests and Responses

Finally, you can put both together to capture requests and responses in the same program.

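const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Enable Request Interception
  await page.setRequestInterception(true);

  // Listen for Request events
  page.on("request", (request) => {
    // If statement to catch XHR requests
    if (request.resourceType() === "xhr") {
      console.log(request.url() + " - " + request.method());
    }

    // Allow the request to be sent
    request.continue();
  });

  // Listen for response events
  page.on("response", (response) => {
    const request = response.request();
    // Only check responses for XHR requests
    if (request.resourceType() === "xhr") {
      console.log(response.status(), response.url());
    }
  });

  // Navigate to page and wait for the network to go idle so background XHR calls can finish
  await page.goto("https://www.scrapethissite.com/pages/ajax-javascript/#2015", {
    waitUntil: "networkidle0",
  });

  await browser.close();
})();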

Real-World Example: Capturing XHR Requests in Puppeteer

One real-world example is trying to catch responses to requests that are fired after the page loads. In the following example, we are looking at an auction listing website.

The website loads a page for the listing which fires a web request to load the listing details.

The following code catches that web request and works directly with the JSON so that we do not have to worry about manipulating the HTML or extracting data from the browser.

// Import the puppeteer library
const puppeteer = require("puppeteer");

// Immediately invoked function expression (IIFE) to handle async/await
(async () => {
  // Launch a new browser instance
  const browser = await puppeteer.launch();

  // Create a new page in the browser
  const page = await browser.newPage();

  // Listen for all response events on the page
  page.on("response", async (response) => {
    // If the response URL includes "bid-status", process the response
    if (response.url().includes("bid-status")) {
      // Parse the response body as JSON
      const json = await response.json();

      // Log the relevant information from the response
      console.log("-- Listing: " + json.listing_id + " --");
      console.log("Bid Count: " + json.bid_count);
      console.log("Current Bid: " + json.current_bid_amount);
      console.log("Starting Bid: " + json.start_bid_amount);
      console.log("Bid Increment: " + json.bid_increment);
    }
  });

  // Navigate to the auction website
  await page.goto("https://www.auction.com/");

  // Get all the anchor tags on the page
  const links = await page.$$("a");

  // Extract the href attribute from each anchor tag
  const hrefList = await Promise.all(
    links.map(
      async (link) => await (await link.getProperty("href")).jsonValue()
    )
  );

  // Loop over each href
  for (const href of hrefList) {
    // If the href includes "/details/", navigate to that page
    if (href.includes("/details/")) {
      await page.goto(href);
      await new Promise((r) => setTimeout(r, 1000));
    }
  }

  await browser.close();
})();
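The fixed one-second pause in the loop above is a guess at how long the bid-status call takes. As an alternative sketch, page.waitForResponse() can wait for the matching response itself. The helper name fetchBidStatus and the 10-second default timeout are illustrative assumptions; the "bid-status" URL fragment comes from the example above.

```javascript
// Hypothetical helper: opens a listing page and resolves with the parsed
// JSON of the first response whose URL contains "bid-status".
async function fetchBidStatus(page, href, timeoutMs = 10000) {
  // Start waiting before navigating so the response cannot be missed
  const responsePromise = page.waitForResponse(
    (response) => response.url().includes("bid-status"),
    { timeout: timeoutMs }
  );
  // Mark a possible rejection as handled in case goto() throws first;
  // the original promise is still awaited below, so a timeout surfaces there.
  responsePromise.catch(() => {});

  await page.goto(href);
  const response = await responsePromise;
  return response.json();
}
```

In the loop, const json = await fetchBidStatus(page, href) would take the place of the page.goto() call and the timeout, with the logging done on the returned object.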

Common Challenges and Their Solutions in Capturing Background XHR Requests

Capturing background XHR requests with Puppeteer isn't always a smooth ride.

Here's a guide to navigate common challenges and keep your data flowing:

Handling Asynchronous Requests

XHR requests often pop up unexpectedly and may require asynchronous work, making it tough to capture them all.

Solution:

Employ Promise.all or similar techniques to patiently wait for request completion. Track ongoing requests and responses using counters or flags.

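// Assumes page.setRequestInterception(true) was called earlier
const requestPromises = [];
page.on("request", (request) => {
  requestPromises.push(
    new Promise((resolve, reject) => {
      // Handle request logic here
      resolve(); // Or reject() if an error occurs
    })
  );

  request.continue();
});

// Navigate only after the listener is attached, so its requests are collected
await page.goto("https://www.scrapethissite.com/pages/ajax-javascript/#2015");

await Promise.all(requestPromises); // Wait for all queued request work to finish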

The above code is an example of how you can perform asynchronous work on outgoing requests and then make sure you wait for all of the work to be completed before continuing. The same can be applied to the "response" listener as well.
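The "counters or flags" idea mentioned in the solution can be sketched with Puppeteer's "request", "requestfinished" and "requestfailed" events. The helper below is only an illustration: the name trackPendingXhr and its return shape are assumptions, and it tracks XHR requests only.

```javascript
// Hypothetical helper: tracks XHR requests that have started but not yet
// finished or failed. `page` only needs to emit Puppeteer's "request",
// "requestfinished" and "requestfailed" events.
function trackPendingXhr(page) {
  const pending = new Set();
  let waiters = [];

  const settle = (request) => {
    pending.delete(request);
    if (pending.size === 0) {
      // Wake everyone waiting for the page to go quiet
      waiters.forEach((resolve) => resolve());
      waiters = [];
    }
  };

  page.on("request", (request) => {
    if (request.resourceType() === "xhr") pending.add(request);
  });
  page.on("requestfinished", settle);
  page.on("requestfailed", settle);

  return {
    count: () => pending.size,
    // Resolves immediately if nothing is in flight, otherwise when the
    // last tracked request finishes or fails
    whenIdle: () =>
      pending.size === 0
        ? Promise.resolve()
        : new Promise((resolve) => waiters.push(resolve)),
  };
}
```

A script could create the tracker before navigating and then await tracker.whenIdle() before closing the browser. Note that this only covers requests that have already started; a page that fires a new XHR later needs another wait.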

Handling Redirects

XHR requests might take detours through redirects, requiring you to capture both the initial request and the final destination.

Solution:

Use the request.redirectChain() method to track these twists and turns in your code and check for redirect status codes.

page.on("response", async (response) => {
  const request = response.request();
  const redirects = request.redirectChain();
  // status is a method on the response object, so it must be called
  if (response.status() >= 300 && response.status() < 400) {
    if (redirects.length > 0) {
      console.log(
        request.url() + " has been redirected [" + redirects.length + "]"
      );
    }
  } else if (redirects.length > 0) {
    console.log(
      response.url() + " finished with " + redirects.length + " redirects"
    );
  }
});

The above code checks if incoming responses are a redirect by using status codes. Otherwise it checks if a non-redirect response has still originated from redirects by checking the length of the redirectChain().

Capturing POST Request Data

Extracting data sent in POST requests can be crucial but not always obvious.

Solution:

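The request.postData() method, used in the TLDR example, returns the request body when it is available. For larger bodies, recent Puppeteer versions also offer request.fetchPostData(), which fetches the POST data from the browser.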
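POST bodies arrive as raw strings, so it often helps to decode them by content type. The following is a minimal sketch under that assumption; the helper name parsePostData is hypothetical, and it only covers JSON and URL-encoded form bodies.

```javascript
// Hypothetical helper: returns the POST body of a Puppeteer HTTPRequest
// as a parsed value where the content type allows it.
function parsePostData(request) {
  const raw = request.postData(); // undefined when there is no body
  if (raw === undefined) return null;

  // headers() returns an object whose keys are lower-case header names
  const contentType = request.headers()["content-type"] || "";

  if (contentType.includes("application/json")) {
    try {
      return JSON.parse(raw);
    } catch {
      return raw; // Malformed JSON: fall back to the raw string
    }
  }
  if (contentType.includes("application/x-www-form-urlencoded")) {
    // URLSearchParams is a Node.js global; convert it to a plain object
    return Object.fromEntries(new URLSearchParams(raw));
  }
  return raw;
}
```

In a "request" handler, console.log("Post Data", parsePostData(request)) would print an object for JSON or form bodies and the raw string otherwise.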

Managing Diverse Response Data

Response bodies come in various shapes and sizes (JSON, XML, binary, etc.), requiring careful handling.

Solution:

Check the Content-Type header and handle each type accordingly, using methods like response.json(), response.text(), or response.buffer().

page.on("response", async (response) => {
  if (response.request().resourceType() === "xhr") {
    // headers() returns a plain object with lower-case header names
    const contentType = response.headers()["content-type"] || "";
    if (contentType.includes("application/json")) {
      const data = await response.json();
      // Handle JSON data here
    } else if (contentType.includes("text/html")) {
      const text = await response.text();
      // Handle text or HTML data here
    } else {
      // Handle other content types as needed
    }
  }
});

Handling Large Volumes of Data

Data overload can overwhelm memory.

Solution:

Be selective! Store only essential data and consider writing to files or databases instead of keeping everything in memory.
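One way to follow this advice is to append each captured record to a JSON Lines file as it arrives, keeping only the fields you need. This is a sketch rather than a recommended storage layer; the helper name createJsonlWriter, the default field list, and the file path are all assumptions.

```javascript
const fs = require("fs");

// Hypothetical helper: appends one captured record per line (JSON Lines)
// so results go to disk instead of accumulating in an in-memory array.
function createJsonlWriter(filePath, keepFields = ["url", "status", "body"]) {
  return function write(record) {
    // Keep only the fields that are actually needed
    const slim = {};
    for (const field of keepFields) {
      if (field in record) slim[field] = record[field];
    }
    fs.appendFileSync(filePath, JSON.stringify(slim) + "\n");
  };
}
```

For example, const write = createJsonlWriter("xhr-capture.jsonl") could be created once, and write({ url: response.url(), status: response.status() }) called from the "response" handler.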

Mastering Cookies and Session Data

Accurate XHR capture often relies on proper cookie handling, especially in authenticated sessions.

Solution:

Use Puppeteer's cookie management methods (page.cookies(), page.setCookie()) to control cookies like a pro. Ensure your session is properly authenticated if needed.

Check our Puppeteer Guide - Managing Cookies in order to master cookie management in Puppeteer.
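As a small illustration of the two methods named above, the sketch below copies cookies from one page to another, for example from a page that has logged in to a page that will make the captured XHR calls. The helper name copyCookies is an assumption.

```javascript
// Hypothetical helper: copies cookies from one Puppeteer page to another.
async function copyCookies(fromPage, toPage) {
  // page.cookies() resolves with the cookies for the page's current URL
  const cookies = await fromPage.cookies();
  if (cookies.length > 0) {
    // page.setCookie() accepts one or more cookie objects as arguments
    await toPage.setCookie(...cookies);
  }
  return cookies.length;
}
```

The return value is the number of cookies copied, which is handy for checking that a login actually produced a session cookie.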

Embracing the Unexpected: Error Handling

Network hiccups and server errors can disrupt requests.

Solution:

Implement graceful error handling with try-catch blocks and the requestfailed event to keep your script resilient.
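A minimal sketch of the requestfailed part might look like the following. The helper name watchFailedXhr and the shape of the stored entries are assumptions; request.failure() is the Puppeteer call that exposes the error text.

```javascript
// Hypothetical helper: records failed XHR requests on a Puppeteer page.
// `page` only needs to emit "requestfailed" events, as a Puppeteer Page does.
function watchFailedXhr(page, onFailure = console.log) {
  const failures = [];
  page.on("requestfailed", (request) => {
    if (request.resourceType() !== "xhr") return;
    // request.failure() returns { errorText } for failed requests, or null
    const failure = request.failure();
    const entry = {
      url: request.url(),
      method: request.method(),
      reason: failure ? failure.errorText : "unknown",
    };
    failures.push(entry);
    onFailure("XHR failed", entry.method, entry.url, entry.reason);
  });
  return failures;
}
```

The returned array can be inspected after navigation to decide whether to retry a page or skip it.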

Creating Private Spaces: Browser Context and Isolation

Isolating different scraping tasks or sessions within the same script can get messy.

Solution:

Use different browser contexts (browser.createIncognitoBrowserContext()) for each task to maintain session privacy and avoid conflicts. In Puppeteer v22 and later, this method is named browser.createBrowserContext().

const context = await browser.createIncognitoBrowserContext();
const page = await context.newPage();
// Set up request interception here

Conclusion

Capturing background XHR requests with Puppeteer is a powerful technique. From this article you have learned a number of ways to perform work on outgoing and incoming requests/responses in the Puppeteer browser.

More Web Scraping Guides

If you would like to learn more about Web Scraping with Puppeteer, then be sure to check out The Puppeteer Web Scraping Playbook.
