How to Scroll a Page with NodeJS
Web scraping is all about extracting data from websites. This becomes difficult when the desired data isn't loaded with the initial page or sits beyond the browser's initial viewport.
This article explores web scraping with NodeJS, focusing on scrolling techniques using Puppeteer and Playwright.
- TLDR: Quick Guide to Scrolling with Puppeteer and Playwright
- Understanding Page Scrolling
- Setting Up Your Environment
- Scrolling with Puppeteer
- Scrolling with Playwright
- BONUS: Pagination
- BONUS: Internal API Endpoints
- Common Issues and Troubleshooting
- Best Practices for Scrolling in Web Scraping
- Conclusion
- More Web Scraping Guides
Need help scraping the web?
Then check out ScrapeOps, the complete toolkit for web scraping.
TLDR: Quick Guide to Scrolling with Puppeteer and Playwright
Here's a quick look at simple scrolling with code snippets for both frameworks:
Puppeteer:
const puppeteer = require("puppeteer");
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto("https://example.com");
await page.evaluate(() => window.scrollBy(0, 500)); // Scroll down 500px
// ...
})();
Playwright:
const { chromium } = require("playwright");
(async () => {
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto("https://example.com");
await page.evaluate(() => window.scrollBy(0, 500)); // Scroll down 500px
// ...
})();
You can see that the two snippets are nearly identical; the libraries share much of their API naming. In both snippets, the same process occurs:
- Launch a browser
- Open a new page
- Navigate that page to example.com
- Evaluate window.scrollBy(0, 500) as JavaScript on the page.
Understanding Page Scrolling
Scrolling is necessary when the target content resides beyond the initially visible portion of the webpage.
Common scenarios include:
- Infinite Scrolling: Content loads dynamically as you scroll down, often indicated by a "Load More" button or a progress indicator.
- Pagination: Content is divided into multiple pages, requiring navigation through links or buttons rather than scrolling.
- Lazy Loading: Images or other elements load only when they come into view to improve initial page load speed.
Scrolling Techniques:
- Smooth Scrolling: Gradually scrolls to a specific position, mimicking user behavior. This technique is usually good for lazy loading, because each desired element must briefly enter the viewport to trigger its load; with jump scrolling, you might skip lazy-loaded elements entirely.
- Jump Scrolling: Instantly jumps to a specific location on the page. This technique is useful for infinite scroll and pagination-style scenarios, because reaching the end of the page is what triggers the next batch of data: you do not have to gradually scroll the entire page, you only need to trigger the end. Both approaches are sketched below.
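In both Puppeteer and Playwright, each technique boils down to a single call inside page.evaluate(). Here is a minimal Puppeteer sketch of the two; the target positions are arbitrary examples:
const puppeteer = require("puppeteer");
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://example.com");
  // Jump scrolling: move instantly to the bottom of the loaded page
  await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
  // Smooth scrolling: the browser animates toward the target, so
  // lazy-loaded elements briefly pass through the viewport.
  // The animation runs asynchronously in the browser, so give it
  // a moment to finish before reading the page.
  await page.evaluate(() => window.scrollTo({ top: 0, behavior: "smooth" }));
  await new Promise((resolve) => setTimeout(resolve, 1000));
  await browser.close();
})();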
Setting Up Your Environment
1. Node.js
Ensure you have Node.js installed. Download it from the official website (https://nodejs.org/en).
2. Installing Puppeteer and Playwright
Use npm to install the libraries:
npm install puppeteer
npm install playwright
Scrolling with Puppeteer
Puppeteer is a Node.js library that provides a high-level API for controlling headless (or full) Chrome and Chromium browsers over the DevTools Protocol.
Steps to Scroll:
- Launch a Puppeteer browser instance.
- Open the target webpage using page.goto().
- Use page.evaluate() to execute JavaScript within the browser context and scroll.
- Utilize appropriate scrolling methods like window.scrollTo() or window.scrollBy().
Basic Scrolling Technique
This example scrolls the page to a specific vertical position (1000px) using window.scrollTo():
const puppeteer = require("puppeteer");
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto("https://example.com");
await page.evaluate(() => window.scrollTo(0, 1000));
// ...
})();
Alternatively, you could use window.scrollBy to scroll down by a specific number of pixels.
Advanced Scrolling Techniques
Advanced scrolling techniques in Puppeteer can involve more complex scenarios, such as:
1. Scroll Until Element Visible
This example uses page.$() to check whether a specific selector (div.box1:nth-child(89)) becomes available after scrolling. If not, it scrolls again by a fixed amount:
const puppeteer = require("puppeteer");
(async () => {
const browser = await puppeteer.launch({
headless: false,
});
const page = await browser.newPage();
// Navigate to page
await page.goto(
"https://scrollmagic.io/examples/advanced/infinite_scrolling.html"
);
// Loop until querySelector is no longer null
while ((await page.$("div.box1:nth-child(89)")) == null) {
// Scroll by 200 pixels each time
await page.evaluate(() => window.scrollBy(0, 200));
// Small delay to give loading time and prevent CPU spikes
await new Promise((resolve) => setTimeout(resolve, 100));
}
// Take a screenshot
await page.screenshot({
path: "puppeteer-wait-selector.png",
});
// Close the browser
await browser.close();
})();
Explanation:
- We visit the website with scrollable content.
- We enter a loop that runs until the given query selector is found (i.e., page.$() no longer returns null).
- In the loop, we scroll by 200 pixels, then wait 100 milliseconds.
- Once the loop exits, we take a screenshot and close the browser.
2. Scroll Until Network Idle
If you are not looking for a specific element, you can instead scroll until the browser's network requests go idle, which likely means no more data can be fetched.
The following example repeatedly scrolls to the bottom of the page, waits for the network to go idle, and then scrolls again to the new bottom:
const puppeteer = require("puppeteer");
(async () => {
const browser = await puppeteer.launch({
headless: false,
});
const page = await browser.newPage();
// Navigate to page
await page.goto("https://infinite-scroll.com/demo/masonry/");
// Function to check if we've reached the bottom of the page
const isBottom = async () => {
return await page.evaluate(() => {
return window.innerHeight + window.scrollY >= document.body.offsetHeight;
});
};
// Loop until we've reached the bottom of the page
// This is because as new elements load we will no longer
// be at the bottom of the page.
while (!(await isBottom())) {
// Scroll to the bottom of the page
await page.evaluate(() => {
window.scrollTo(0, document.body.scrollHeight);
});
// Check if network is idle after scrolling
await page.waitForNetworkIdle();
}
// Take a screenshot
await page.screenshot({
path: "puppeteer-wait-networkidle.png",
});
// Close the browser
await browser.close();
})();
Explanation:
- We launch a browser and visit the website.
- We define an isBottom function that uses page.evaluate to determine whether the bottom of the page is within our viewport.
- We enter a loop that runs until we've hit the bottom of the page.
- Inside the loop, we scroll to the bottom of the currently loaded page, then wait for network activity to go idle. If new elements were loaded, the position we scrolled to is no longer the bottom.
- After the loop, we take a screenshot and close the browser.
This approach is particularly useful for infinite scrolling. As mentioned earlier, it jumps straight to the bottom of the page and waits for loading.
To adapt this for lazy loading, you would instead use smooth scrolling, changing the window.scrollTo call to gradual window.scrollBy steps.
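Here is a minimal sketch of that modification against the same demo page; the 200px step and 100ms delay are arbitrary values you would tune to the site's loading speed:
const puppeteer = require("puppeteer");
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://infinite-scroll.com/demo/masonry/");
  // Track the last scroll position so we know when scrolling stops
  let previousY = -1;
  while (true) {
    // Scroll down gradually so lazy-loaded elements enter the viewport
    const currentY = await page.evaluate(() => {
      window.scrollBy(0, 200);
      return window.scrollY;
    });
    // If the scroll position stopped changing, we've reached the bottom
    // (or content stopped loading within the delay)
    if (currentY === previousY) break;
    previousY = currentY;
    // Give lazy-loaded content time to request and render
    await new Promise((resolve) => setTimeout(resolve, 100));
  }
  await browser.close();
})();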
Scrolling with Playwright
Playwright is a Node.js library that provides a single API to control Chromium, Firefox, and WebKit browsers for headless and browser testing.
Steps to Scroll:
- Launch a Playwright browser instance (Chromium in this example).
- Open the target webpage using page.goto().
- Use page.evaluate() to execute JavaScript within the browser context.
Basic Scrolling Technique
This example scrolls the page down by 500px using window.scrollBy():
const { chromium } = require("playwright");
(async () => {
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto("https://example.com");
await page.evaluate(() => window.scrollBy(0, 500));
// ...
})();
Advanced Scrolling Techniques
Advanced scrolling techniques in Playwright can involve more complex scenarios, such as:
1. Scroll Until Element Visible
This example uses page.locator() to check whether a specific selector (div.box1:nth-child(89)) becomes available after scrolling. If not, it scrolls again by a fixed amount:
const { chromium } = require("playwright");
(async () => {
const browser = await chromium.launch({
headless: false,
});
const context = await browser.newContext();
const page = await context.newPage();
// Navigate to page
await page.goto(
"https://scrollmagic.io/examples/advanced/infinite_scrolling.html"
);
// Loop until count of querySelector more than 0
while ((await page.locator("div.box1:nth-child(89)").count()) < 1) {
// Scroll by 200 pixels each time
await page.evaluate(() => window.scrollBy(0, 200));
// Small delay to give loading time and prevent CPU spikes
await page.waitForTimeout(100);
}
// Take a screenshot
await page.screenshot({
path: "playwright-wait-selector.png",
});
// Close the browser
await browser.close();
})();
Explanation:
- We visit the website with scrollable content.
- We enter a loop that runs until the given locator matches at least one element (its count() is greater than zero).
- In the loop, we scroll by 200 pixels, then wait 100 milliseconds.
- Once the loop exits, we take a screenshot and close the browser.
2. Scroll Until Network Idle
If you are not looking for a specific element, you can instead scroll until the browser's network requests go idle, which likely means no more data can be fetched. The following example repeatedly scrolls to the bottom of the page, waits for the network to go idle, and then scrolls again to the new bottom:
const { chromium } = require("playwright");
(async () => {
const browser = await chromium.launch({
headless: false,
});
const context = await browser.newContext();
const page = await context.newPage();
// Navigate to page
await page.goto("https://infinite-scroll.com/demo/masonry/");
// Function to check if we've reached the bottom of the page
const isBottom = async () => {
return await page.evaluate(() => {
return window.innerHeight + window.scrollY >= document.body.offsetHeight;
});
};
// Loop until we've reached the bottom of the page
// This is because as new elements load we will no longer
// be at the bottom of the page.
while (!(await isBottom())) {
// Scroll to the bottom of the page
await page.evaluate(() => {
window.scrollTo(0, document.body.scrollHeight);
});
// Check if network is idle after scrolling
await page.waitForLoadState("networkidle");
// Give images some time to render
await page.waitForTimeout(300);
}
// Take a screenshot
await page.screenshot({
path: "playwright-wait-networkidle.png",
});
// Close the browser
await browser.close();
})();
Explanation:
- We launch a browser and visit the website.
- We define an isBottom function that uses page.evaluate to determine whether the bottom of the page is within our viewport.
- We enter a loop that runs until we've hit the bottom of the page.
- Inside the loop, we scroll to the bottom of the currently loaded page, then wait for network activity to go idle. If new elements were loaded, the position we scrolled to is no longer the bottom.
- After the loop, we take a screenshot and close the browser.
This approach is particularly useful for infinite scrolling. As mentioned earlier, it jumps straight to the bottom of the page and waits for loading. To adapt this for lazy loading, you would instead use smooth scrolling, changing the window.scrollTo call to gradual window.scrollBy steps as in the Puppeteer sketch earlier.
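Playwright also offers page.mouse.wheel(), which dispatches real wheel events rather than calling the DOM scrolling APIs; some sites that listen for wheel events respond better to this. A minimal sketch, where the step size, iteration count, and delay are arbitrary values to tune:
const { chromium } = require("playwright");
(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto("https://infinite-scroll.com/demo/masonry/");
  for (let i = 0; i < 20; i++) {
    // Dispatch a real wheel event, scrolling down 200 pixels
    await page.mouse.wheel(0, 200);
    // Give lazy-loaded content time to request and render
    await page.waitForTimeout(100);
  }
  await browser.close();
})();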
Scroll Via Page URL
While scrolling is effective for all the data on a single page, some websites may implement pagination through modifications in the URL.
Reverse Engineering Pagination: This means understanding how a website structures its pagination system and constructing the URLs or requests needed to navigate through different pages programmatically. The process can be generalized as follows:
- Inspect Network Requests: Use browser developer tools to monitor network requests as you navigate through pages. Look for patterns in the URL structure that change with each page.
- Identify Parameters: Often, pagination is implemented using query parameters like page, pg, or offset within the URL.
Because the page is controlled by a query parameter, it is very easy to follow from the browser; you may not even need to interact with any elements:
const { chromium } = require("playwright");
(async () => {
const browser = await chromium.launch({
headless: false,
});
const context = await browser.newContext();
const page = await context.newPage();
const maxPage = 10;
for (let i = 1; i <= maxPage; i++) {
await page.goto("https://example.com?page=" + i);
// ...
}
})();
The above code is for Playwright, but it is almost the same for Puppeteer:
const puppeteer = require("puppeteer");
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
const maxPage = 10;
for (let i = 1; i <= maxPage; i++) {
await page.goto("https://example.com?page=" + i);
// ...
}
})();
In both examples:
- We launch a browser
- We set a max number of pages
- We visit the appropriate URL for the current page
In your own implementation, you would add scraping logic after each navigation. To make it more sophisticated, you could also determine the maxPage number dynamically from elements on the page, as sketched below.
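Here is one way to do that in Playwright, assuming a hypothetical pagination markup with links like a.page-link; the selector will differ from site to site:
const { chromium } = require("playwright");
(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto("https://example.com?page=1");
  // Read the page numbers from the (hypothetical) pagination links
  // and take the largest one as maxPage
  const maxPage = await page.evaluate(() => {
    const links = Array.from(document.querySelectorAll("a.page-link"));
    const numbers = links
      .map((link) => parseInt(link.textContent, 10))
      .filter((n) => !isNaN(n));
    return numbers.length ? Math.max(...numbers) : 1;
  });
  for (let i = 1; i <= maxPage; i++) {
    await page.goto("https://example.com?page=" + i);
    // ... scrape here
  }
  await browser.close();
})();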
Scroll Via Internal API Endpoints
Some websites might use internal API endpoints to load content for infinite scrolling. This means that, as the page is scrolled, JavaScript is used to make HTTP requests in the background for data.
In most cases, you can make the same HTTP requests yourself, without the browser. Here's how to handle that:
Reverse Engineering API Endpoints: This means working out how the website loads additional content dynamically as you scroll. Here's a step-by-step guide:
- Monitor Network Requests: Similar to pagination, look for API requests triggered during scrolling that load new content.
- Identify Endpoint URL and Parameters: Analyze the request URL and any parameters sent for data retrieval.
Fetching Data via API: Use libraries like Axios or the built-in Node.js fetch to make HTTP requests to the identified API endpoint:
const axios = require("axios");
async function fetchDataFromAPI(endpoint, payload) {
const response = await axios.post(endpoint, payload);
// Process the fetched data here
}
This example demonstrates sending a POST request with payload data to the API endpoint using Axios. Remember, you can find the endpoint and the values needed for the payload by monitoring requests from your browser.
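Here is a hedged usage sketch that pages through such an endpoint; the URL and the offset/limit payload shape are hypothetical placeholders for whatever you discover in the network tab:
const axios = require("axios");
(async () => {
  const endpoint = "https://example.com/api/items"; // hypothetical endpoint
  let offset = 0;
  const limit = 20;
  while (true) {
    // Request the next batch, mimicking what the page's JavaScript sends
    const response = await axios.post(endpoint, { offset, limit });
    const items = response.data.items || []; // hypothetical response shape
    if (items.length === 0) break; // no more data to fetch
    // ... process items here
    offset += limit;
  }
})();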
Common Issues and Troubleshooting
When using Puppeteer or Playwright to automate scrolling pages, several common issues can arise. Here are the most frequent problems along with troubleshooting tips:
Infinite Scrolling Not Triggering
- Check for dynamic loading elements like "Load More" buttons and click them if necessary.
- Implement a small delay between scrolls to mimic user behavior and avoid overwhelming the server. This may require some experimentation: the delay length varies by site, and it sometimes belongs before the scroll and sometimes after.
Infinite Scroll Lazy Loading Takes Too Long
- Increase the delay between scrolls to allow for content loading.
- Consider using the built-in waiting methods, like Puppeteer's page.waitForSelector() or Playwright's locator.waitFor(), to wait for specific elements to appear before proceeding.
Load More Buttons/Pagination
- Use Puppeteer's page.click() or Playwright's page.click() to interact with buttons or pagination links, as in the sketch below.
- Extract URLs from pagination links and process them individually using techniques from the "Scroll Via Page URL" section.
Best Practices for Scrolling in Web Scraping
When implementing scrolling in web scraping, following best practices ensures efficient, reliable, and ethical scraping. Here are some best practices to consider:
- Implement Delays: Avoid overwhelming the server with rapid requests.
- Scrape Responsibly: Extract only the necessary data and avoid excessive scraping that can impact performance.
- Handle Rate Limiting: Websites might implement rate limiting to prevent excessive scraping. Implement logic to retry requests after a delay if you encounter it, as sketched below.
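Here is a minimal retry sketch with exponential backoff, assuming Axios and treating HTTP 429 as the rate-limit signal; adjust the status code and delays to the site you're scraping:
const axios = require("axios");
async function fetchWithRetry(url, maxRetries = 3) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      const response = await axios.get(url);
      return response.data;
    } catch (error) {
      // Retry only on rate limiting (HTTP 429), and only while attempts remain
      const status = error.response ? error.response.status : null;
      if (status !== 429 || attempt === maxRetries) throw error;
      // Exponential backoff: 1s, 2s, 4s, ...
      const delay = 1000 * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}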
Conclusion
This article explored web scraping with NodeJS, focusing on scrolling techniques using Puppeteer and Playwright. Remember to experiment with different approaches, prioritize responsible scraping practices, and explore the vast functionalities of Puppeteer and Playwright for web scraping tasks!
Don't forget to check the official documentation of Puppeteer and Playwright to get more information.
More Web Scraping Guides
For more NodeJS resources, feel free to check out the NodeJS Web Scraping Playbook or some of our in-depth guides: