How to Scroll a Page with NodeJS

Web scraping is all about extracting data from websites. This may become difficult when the desired data isn't loaded initially or is beyond the initial viewport of the browser.

This article explores web scraping with NodeJS, focusing on scrolling techniques using Puppeteer and Playwright.

Need help scraping the web?

Then check out ScrapeOps, the complete toolkit for web scraping.


TLDR: Quick Guide to Scrolling with Puppeteer and Playwright

Here's a quick look at simple scrolling with code snippets for both frameworks:

Puppeteer:

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://example.com");
  await page.evaluate(() => window.scrollBy(0, 500)); // Scroll down 500px
  // ...
})();

Playwright:

const { chromium } = require("playwright");

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto("https://example.com");
  await page.evaluate(() => window.scrollBy(0, 500)); // Scroll down 500px
  // ...
})();

As you can see, both snippets are nearly identical because the two libraries share much of the same function naming. In both snippets, the same process occurs:

  1. Launch a browser
  2. Open a new page
  3. Navigate that page to example.com
  4. Evaluate window.scrollBy(0, 500) as JavaScript on the page.

Understanding Page Scrolling

Scrolling is necessary when the target content resides beyond the initially visible portion of the webpage.

Common scenarios include:

  • Infinite Scrolling: Content loads dynamically as you scroll down, often indicated by a "Load More" button or a progress indicator.
  • Pagination: Content is divided into multiple pages, requiring navigation through links or buttons rather than scrolling.
  • Lazy Loading: Images or other elements load only when they come into view to improve initial page load speed.

Scrolling Techniques:

  • Smooth Scrolling: Gradually scrolls to a specific position, mimicking user behavior. This technique is usually best for lazy loading, because every element has to pass briefly through the viewport to trigger its load; jump scrolling could skip right past lazy-loaded elements.
  • Jump Scrolling: Instantly jumps to a specific location on the page. This technique is useful for infinite scroll and paginated-style scenarios, because reaching the end of the page is what triggers the next batch of data to load. You do not have to gradually scroll the entire page; you only need to trigger the end. Both techniques are sketched in the snippet after this list.
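
To make the distinction concrete, here is a minimal sketch of both techniques as they would run inside the browser context (for example via page.evaluate() in either library); the options-object form of window.scrollTo() is standard DOM API:

// Jump scrolling: move instantly to an absolute position.
window.scrollTo(0, document.body.scrollHeight);

// Smooth scrolling: let the browser animate the movement, giving
// lazy-loaded elements time to pass through the viewport.
window.scrollTo({ top: document.body.scrollHeight, behavior: "smooth" });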

Setting Up Your Environment

1. Node.js

Ensure you have Node.js installed. Download it from the official website (https://nodejs.org/en).

2. Installing Puppeteer and Playwright

Use npm to install the libraries:

npm install puppeteer
npm install playwright

Scrolling with Puppeteer

Puppeteer is a high-level Node.js library for controlling headless (or full) Chrome and Chromium browsers over the DevTools Protocol.

Steps to Scroll:

  1. Launch a Puppeteer browser instance.
  2. Open the target webpage using page.goto().
  3. Use page.evaluate() to execute JavaScript within the browser context and scroll.
  4. Utilize appropriate scrolling methods like window.scrollTo() or window.scrollBy().

Basic Scrolling Technique

This example scrolls the page to a specific vertical position (1000px) using window.scrollTo():

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://example.com");
  await page.evaluate(() => window.scrollTo(0, 1000));
  // ...
})();

Alternatively, you could use window.scrollBy to scroll down a specific number of pixels.

Advanced Scrolling Techniques

Advanced scrolling techniques in Puppeteer can involve more complex scenarios, such as:

1. Scroll Until Element Visible

This example utilizes page.$() to check if a specific selector (div.box1:nth-child(89)) becomes available after scrolling. If not, it scrolls again by a certain amount:

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch({
    headless: false,
  });
  const page = await browser.newPage();

  // Navigate to page
  await page.goto(
    "https://scrollmagic.io/examples/advanced/infinite_scrolling.html"
  );

  // Loop until querySelector is no longer null
  while ((await page.$("div.box1:nth-child(89)")) == null) {
    // Scroll by 200 pixels each time
    await page.evaluate(() => window.scrollBy(0, 200));

    // Small delay to give loading time and prevent CPU spikes
    await new Promise((resolve) => setTimeout(resolve, 100));
  }

  // Take a screenshot
  await page.screenshot({
    path: "puppeteer-wait-selector.png",
  });

  // Close the browser
  await browser.close();
})();

Explanation:

  1. We visit the website with scrollable content.
  2. We enter a loop that runs until a given query selector is found (i.e., no longer equal to null).
  3. In the loop, we scroll by 200 pixels, then wait 100 milliseconds.
  4. Once the loop exits, we take a screenshot and close the browser.

2. Scroll Until Network Idle

If you are not looking for a specific element, you can instead scroll until the browser's network requests go idle, which likely means no more data can be fetched.

The following example repeatedly scrolls to the bottom of the page, waits for the network to go idle, and then scrolls to the new bottom again:

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch({
    headless: false,
  });
  const page = await browser.newPage();

  // Navigate to page
  await page.goto("https://infinite-scroll.com/demo/masonry/");

  // Function to check if we've reached the bottom of the page
  const isBottom = async () => {
    return await page.evaluate(() => {
      return window.innerHeight + window.scrollY >= document.body.offsetHeight;
    });
  };

  // Loop until we've reached the bottom of the page
  // This is because as new elements load we will no longer
  // be at the bottom of the page.
  while (!(await isBottom())) {
    // Scroll to the bottom of the page
    await page.evaluate(() => {
      window.scrollTo(0, document.body.scrollHeight);
    });

    // Check if network is idle after scrolling
    await page.waitForNetworkIdle();
  }

  // Take a screenshot
  await page.screenshot({
    path: "puppeteer-wait-networkidle.png",
  });

  // Close the browser
  await browser.close();
})();

Explanation:

  1. We launch a browser and visit the website.
  2. We define an isBottom function that uses page.evaluate to determine if the bottom of the page is within our viewport.
  3. We enter a loop that runs until we've hit the bottom of the page.
  4. Inside the loop, we scroll to the bottom of the available page and then wait for network activity to idle. If new elements were loaded, the position we previously scrolled to is no longer the bottom.
  5. After the loop, we take a screenshot and close the browser.

This approach is particularly useful for infinite scrolling. As mentioned earlier, it jumps straight to the bottom of the page and waits for loading.

To adapt this for lazy loading, use smooth scrolling instead: change the window.scrollTo call to a series of more gradual window.scrollBy steps, as in the sketch below.
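
A minimal sketch of that change, assuming the same demo page (the 300px step and 100ms pause are arbitrary values you would tune to the target site):

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://infinite-scroll.com/demo/masonry/");

  // Scroll gradually in small steps instead of jumping to the bottom,
  // so lazy-loaded elements pass through the viewport and get triggered.
  let atBottom = false;
  while (!atBottom) {
    atBottom = await page.evaluate(() => {
      window.scrollBy(0, 300);
      return window.innerHeight + window.scrollY >= document.body.offsetHeight;
    });

    // Short pause between steps to give lazy loading time to fire
    await new Promise((resolve) => setTimeout(resolve, 100));
  }

  await browser.close();
})();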


Scrolling with Playwright

Playwright is a Node.js library that provides a single API to control Chromium, Firefox, and WebKit browsers, in both headless and headed modes.

Steps to Scroll:

  1. Launch a Playwright browser instance (Chromium in this example).
  2. Open the target webpage using page.goto().
  3. Use page.evaluate() to execute JavaScript within the browser context and scroll with methods like window.scrollTo() or window.scrollBy().

Basic Scrolling Technique

This example scrolls the page down by 500px using window.scrollBy():

const { chromium } = require("playwright");

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto("https://example.com");
  await page.evaluate(() => window.scrollBy(0, 500));
  // ...
})();

Advanced Scrolling Techniques

Advanced scrolling techniques in Playwright can involve more complex scenarios, such as:

1. Scroll Until Element Visible

This example utilizes page.locator to check if a specific selector (div.box1:nth-child(89)) becomes available after scrolling. If not, it scrolls again by a certain amount:

const { chromium } = require("playwright");

(async () => {
  const browser = await chromium.launch({
    headless: false,
  });
  const context = await browser.newContext();
  const page = await context.newPage();

  // Navigate to page
  await page.goto(
    "https://scrollmagic.io/examples/advanced/infinite_scrolling.html"
  );

  // Loop until the selector's match count is greater than 0
  while ((await page.locator("div.box1:nth-child(89)").count()) < 1) {
    // Scroll by 200 pixels each time
    await page.evaluate(() => window.scrollBy(0, 200));

    // Small delay to give loading time and prevent CPU spikes
    await page.waitForTimeout(100);
  }

  // Take a screenshot
  await page.screenshot({
    path: "playwright-wait-selector.png",
  });

  // Close the browser
  await browser.close();
})();

Explanation:

  1. We visit the website with scrollable content.
  2. We enter a loop that runs until a given query selector is found (i.e., the count of matching elements is greater than zero).
  3. In the loop, we scroll by 200 pixels, then wait 100 milliseconds.
  4. Once the loop exits, we take a screenshot and close the browser.

2. Scroll Until Network Idle

If you are not looking for a specific element, you can instead scroll until the browser's network requests go idle, which likely means no more data can be fetched. The following example repeatedly scrolls to the bottom of the page, waits for the network to go idle, and then scrolls to the new bottom again:

const { chromium } = require("playwright");

(async () => {
  const browser = await chromium.launch({
    headless: false,
  });
  const context = await browser.newContext();
  const page = await context.newPage();

  // Navigate to page
  await page.goto("https://infinite-scroll.com/demo/masonry/");

  // Function to check if we've reached the bottom of the page
  const isBottom = async () => {
    return await page.evaluate(() => {
      return window.innerHeight + window.scrollY >= document.body.offsetHeight;
    });
  };

  // Loop until we've reached the bottom of the page
  // This is because as new elements load we will no longer
  // be at the bottom of the page.
  while (!(await isBottom())) {
    // Scroll to the bottom of the page
    await page.evaluate(() => {
      window.scrollTo(0, document.body.scrollHeight);
    });

    // Check if network is idle after scrolling
    await page.waitForLoadState("networkidle");

    // Give images some time to render
    await page.waitForTimeout(300);
  }

  // Take a screenshot
  await page.screenshot({
    path: "playwright-wait-networkidle.png",
  });

  // Close the browser
  await browser.close();
})();

Explanation:

  1. We launch a browser and visit the website.
  2. We define an isBottom function that uses page.evaluate to determine if the bottom of the page is within our viewport.
  3. We enter a loop that runs until we've hit the bottom of the page.
  4. Inside the loop, we scroll to the bottom of the available page and then wait for network activity to idle. If new elements were loaded, the position we previously scrolled to is no longer the bottom.
  5. After the loop, we take a screenshot and close the browser.

This approach is particularly useful for infinite scrolling. As mentioned earlier, it jumps straight to the bottom of the page and waits for loading. To adapt it for lazy loading, use smooth scrolling instead by changing the window.scrollTo call to a more gradual window.scrollBy, exactly as shown in the Puppeteer section; a Playwright-specific alternative is sketched below.
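
Playwright additionally offers page.mouse.wheel(), which dispatches real wheel events and works well for triggering lazy loading. Here is a minimal sketch of the gradual approach using it (the 300px step and 100ms pause are arbitrary):

const { chromium } = require("playwright");

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto("https://infinite-scroll.com/demo/masonry/");

  // Helper: is the bottom of the page inside the viewport?
  const atBottom = () =>
    page.evaluate(
      () => window.innerHeight + window.scrollY >= document.body.offsetHeight
    );

  // Scroll gradually with real wheel events until we hit the bottom
  while (!(await atBottom())) {
    await page.mouse.wheel(0, 300);
    await page.waitForTimeout(100);
  }

  await browser.close();
})();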


Scroll Via Page URL

While scrolling works when all the data lives on a single page, some websites implement pagination through changes to the URL.

Reverse Engineering Pagination: This means understanding how a website structures its pagination system and constructing the URLs or requests needed to navigate through different pages programmatically. The process can be generalized as follows:

  1. Inspect Network Requests: Use browser developer tools to monitor network requests as you navigate through pages. Look for patterns in the URL structure that change with each page.
  2. Identify Parameters: Often, pagination is implemented using parameters like page, pg, or offset within the URL.

Because the page is controlled by a query parameter, it is very easy to follow from a browser. You may not even need to interact with elements:

const { chromium } = require("playwright");

(async () => {
  const browser = await chromium.launch({
    headless: false,
  });
  const context = await browser.newContext();
  const page = await context.newPage();

  const maxPage = 10;
  for (let i = 1; i <= maxPage; i++) {
    await page.goto("https://example.com?page=" + i);
    // ...
  }
})();

The above code is for Playwright, but it is almost the same for Puppeteer:

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  const maxPage = 10;
  for (let i = 1; i <= maxPage; i++) {
    await page.goto("https://example.com?page=" + i);
    // ...
  }
})();

In both examples:

  1. We launch a browser
  2. We set a max number of pages
  3. We visit the appropriate URL for the current page

Of course, in your own implementation you would add the actual scraping logic after each navigation. To make it more sophisticated, you could also derive maxPage dynamically from elements on the page, as sketched below.
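
As a sketch of that idea, you could read the highest page number from a pagination widget before entering the loop; the .pagination selector here is hypothetical and must be adapted to the target site's markup:

// Hypothetical example: derive maxPage from a pagination widget
// instead of hard-coding it. The selector is an assumption and
// must be adapted to the actual markup of the target site.
const maxPage = await page.evaluate(() => {
  const lastLink = document.querySelector(".pagination a:last-child");
  return lastLink ? parseInt(lastLink.textContent, 10) : 1;
});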


Scroll Via Internal API Endpoints

Some websites might use internal API endpoints to load content for infinite scrolling. This means that, as the page is scrolled, JavaScript is used to make HTTP requests in the background for data.

In most cases, you can make the same HTTP requests yourself, without the browser. Here's how to handle that:

Reverse Engineering API Endpoints: This means understanding how the website loads additional content dynamically as you scroll, then replicating those requests yourself. Here's a step-by-step guide:

  1. Monitor Network Requests: Similar to pagination, look for API requests triggered during scrolling that load new content.
  2. Identify Endpoint URL and Parameters: Analyze the request URL and any parameters sent for data retrieval.

Fetching Data via API:

Use libraries like Axios or Node.js fetch to make HTTP requests to the identified API endpoint:

const axios = require("axios");

async function fetchDataFromAPI(endpoint, payload) {
  const response = await axios.post(endpoint, payload);
  // Process the fetched data here, or return it to the caller
  return response.data;
}

This example demonstrates sending a POST request with potential payload data to the API endpoint using Axios. Remember, you can find the endpoint and values needed for the payload by monitoring requests from your browser.
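
If you prefer to avoid a dependency, Node.js 18+ ships a global fetch that can make the same request; the endpoint and payload remain placeholders you would fill in from your browser's network tab:

// Requires Node.js 18+ (global fetch). Endpoint and payload are
// placeholders discovered by monitoring the browser's network tab.
async function fetchDataFromAPI(endpoint, payload) {
  const response = await fetch(endpoint, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  });
  return await response.json();
}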


Common Issues and Troubleshooting

When using Puppeteer or Playwright to automate scrolling pages, several common issues can arise. Here are the most frequent problems along with troubleshooting tips:

Infinite Scrolling Not Triggering

  • Check for dynamic loading elements like "Load More" buttons and click them if necessary.
  • Implement a small delay between scrolls to mimic user behavior and avoid overwhelming the server. This may require some experimentation, as the delay may need different lengths and may work better before or after the scroll. A randomized delay helper is sketched after this list.
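
A small helper like the following keeps the delays slightly irregular so the scroll pattern looks less robotic (the 500-1500ms range is an arbitrary starting point):

// Sleep for a random duration between min and max milliseconds so
// scroll timing does not follow a perfectly regular pattern.
const randomDelay = (min, max) =>
  new Promise((resolve) => setTimeout(resolve, min + Math.random() * (max - min)));

// Usage between scrolls:
await page.evaluate(() => window.scrollBy(0, 300));
await randomDelay(500, 1500);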

Infinite Scroll Lazy Loading Takes Too Long

  • Increase the delay between scrolls to allow content time to load.
  • Use page.waitForSelector() (available in both Puppeteer and Playwright) to wait for specific elements to appear before proceeding; see the sketch after this list.
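
For example, page.waitForSelector() blocks until the element appears or the timeout elapses; the selector and timeout here are illustrative:

// Wait up to 10 seconds for a newly loaded item to appear.
// The selector is illustrative; adapt it to your target site.
await page.waitForSelector("div.box1:nth-child(89)", { timeout: 10000 });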

Load More Buttons/Pagination

  • Use page.click() (available in both Puppeteer and Playwright) to interact with buttons or pagination links; a click-loop sketch follows this list.
  • Extract URLs from pagination links and process them individually using techniques from the "Scroll Via Page URL" section.
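
A sketch of the button-clicking approach in Puppeteer (the button selector is a placeholder to adapt to the target site):

// Keep clicking a "Load More" button until it is no longer present.
while ((await page.$("button.load-more")) !== null) {
  await page.click("button.load-more");

  // Give the newly requested content time to arrive
  await new Promise((resolve) => setTimeout(resolve, 1000));
}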

Best Practices for Scrolling in Web Scraping

When implementing scrolling in web scraping, following best practices ensures efficient, reliable, and ethical scraping. Here are some best practices to consider:

  • Implement Delays: Avoid overwhelming the server with rapid requests.
  • Scrape Responsibly: Extract only the necessary data and avoid excessive scraping that can impact performance.
  • Handle Rate Limiting: Websites might implement rate limiting to prevent excessive scraping. Implement logic to retry requests after a delay when this happens, as sketched below.
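
As a sketch of that retry logic (the attempt count and base delay are arbitrary starting points):

// Retry an async action with exponential backoff, e.g. when a page
// load fails due to rate limiting.
async function withRetries(action, attempts = 3, baseDelayMs = 2000) {
  for (let i = 0; i < attempts; i++) {
    try {
      return await action();
    } catch (err) {
      if (i === attempts - 1) throw err;
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
    }
  }
}

// Usage:
await withRetries(() => page.goto("https://example.com?page=2"));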

Conclusion

This article explored web scraping with NodeJS, focusing on scrolling techniques using Puppeteer and Playwright. Remember to experiment with different approaches, prioritize responsible scraping practices, and explore the vast functionalities of Puppeteer and Playwright for web scraping tasks!

Don't forget to check the official documentation of Puppeteer and Playwright to get more information.


More Web Scraping Guides

For more NodeJS resources, feel free to check out the NodeJS Web Scraping Playbook or some of our in-depth guides: