NodeJs Puppeteer Examples

The following are code examples showing how to integrate the ScrapeOps Proxy Aggregator with your NodeJs Puppeteer scrapers.

Residential & Mobile Proxy Aggregator

We have launched a Residential & Mobile Proxy Aggregator that aggregates all the residential and mobile proxy providers into a single proxy port and charges based on bandwidth consumed, not successful requests.

This proxy solution is better suited for use with headless browsers, as headless browsers often generate 10-100+ individual requests to retrieve a single web page. With the Proxy API Aggregator you are charged for each one of these requests, whereas with our Residential Proxy Aggregator you are only charged for the bandwidth consumed.

This form of proxy is also more reliable for use with headless browsers.

Here is the Residential & Mobile Proxy Aggregator and the documentation.


Authorisation - API Key

To use the ScrapeOps proxy, you first need an API key which you can get by signing up for a free account here.

Your API key must be included with every request using the api_key query parameter; otherwise, the API will return a 403 Forbidden status code.
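For example, a direct request to the API endpoint looks like this (a minimal sketch using Node 18+'s built-in fetch, assuming the standard https://proxy.scrapeops.io/v1/ endpoint):

(async () => {
  const apiKey = 'YOUR_API_KEY';
  const targetUrl = 'https://quotes.toscrape.com/page/1/';

  // The api_key query parameter must be included, otherwise the API
  // responds with a 403 Forbidden status code
  const response = await fetch(
    `https://proxy.scrapeops.io/v1/?api_key=${apiKey}&url=${encodeURIComponent(targetUrl)}`
  );
  console.log(response.status);
})();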


Proxy Port Integration

When integrating the Proxy Aggregator with your Puppeteer scrapers, we recommend using our proxy port integration over the API endpoint integration.

The proxy port integration is a light front-end for the API. It has all the same functionality and performance as sending requests to the API endpoint, but it allows you to integrate our proxy aggregator as you would any normal proxy.

The username for the proxy is scrapeops and the password is your API key.


"http://scrapeops.headless_browser_mode=true:YOUR_API_KEY@proxy.scrapeops.io:5353"

Here are the individual connection details:

  • Proxy: proxy.scrapeops.io
  • Port: 5353
  • Username: scrapeops.headless_browser_mode=true
  • Password: YOUR_API_KEY
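These details map onto a proxy URL like so (a quick sketch in NodeJs; the same values are used in the full Puppeteer example below):

const PROXY_USERNAME = 'scrapeops.headless_browser_mode=true';
const PROXY_PASSWORD = 'YOUR_API_KEY';
const PROXY_SERVER = 'proxy.scrapeops.io';
const PROXY_SERVER_PORT = '5353';

// "http://scrapeops.headless_browser_mode=true:YOUR_API_KEY@proxy.scrapeops.io:5353"
const proxyUrl = `http://${PROXY_USERNAME}:${PROXY_PASSWORD}@${PROXY_SERVER}:${PROXY_SERVER_PORT}`;
console.log(proxyUrl);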

We recommend the proxy port integration because the API endpoint can create issues when loading pages and following links if the website uses relative links rather than absolute links.

We've also added the key/value pair headless_browser_mode=true to the username section of the proxy string, as this optimizes the proxy port for use with headless browsers like Puppeteer, Selenium, etc.

SSL Certificate Verification

Note: So that we can properly direct your requests through the API, your code must be configured to ignore SSL certificate verification errors by setting ignoreHTTPSErrors: true.
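In Puppeteer this is a launch option (a minimal sketch; the full example below combines it with the proxy settings):

const puppeteer = require('puppeteer');

(async () => {
  // Ignore SSL certificate verification errors so requests can be
  // routed through the ScrapeOps proxy
  const browser = await puppeteer.launch({ ignoreHTTPSErrors: true });

  // ... scraping logic ...

  await browser.close();
})();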

To enable extra/advanced functionality, you can pass parameters by adding them to the username, separated by periods.

For example, if you want to enable country geotargeting with US-based proxies, the username would be scrapeops.country=us.

Multiple parameters can also be included by separating them with periods, for example:


"http://scrapeops.headless_browser_mode=true.country=us:YOUR_API_KEY@proxy.scrapeops.io:5353"


Integrating With Puppeteer Scrapers

To integrate our proxy with your Puppeteer scrapers, you just need to define the proxy port settings, set Puppeteer to ignore HTTPS errors, and configure the proxy authentication:


const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

// ScrapeOps proxy configuration
const PROXY_USERNAME = 'scrapeops.headless_browser_mode=true';
const PROXY_PASSWORD = 'YOUR_API_KEY'; // <-- enter your API key here
const PROXY_SERVER = 'proxy.scrapeops.io';
const PROXY_SERVER_PORT = '5353';

(async () => {
  // Route all browser traffic through the ScrapeOps proxy port
  // and ignore SSL certificate verification errors
  const browser = await puppeteer.launch({
    ignoreHTTPSErrors: true,
    args: [
      `--proxy-server=http://${PROXY_SERVER}:${PROXY_SERVER_PORT}`
    ]
  });
  const page = await browser.newPage();

  // Authenticate with the proxy: the username carries any extra
  // parameters, the password is your API key
  await page.authenticate({
    username: PROXY_USERNAME,
    password: PROXY_PASSWORD,
  });

  try {
    await page.goto('https://quotes.toscrape.com/page/1/', { timeout: 180000 });
    const bodyHTML = await page.evaluate(() => document.body.innerHTML);
    const $ = cheerio.load(bodyHTML);

    // find H1 text
    const h1Text = $('h1').text();
    console.log('h1Text:', h1Text);
  } catch (err) {
    console.log(err);
  }

  await browser.close();
})();

ScrapeOps will take care of proxy selection and rotation for you, so you just need to send us the URL you want to scrape.


Response Format

After receiving a response from one of our proxy providers, the ScrapeOps Proxy API Aggregator will respond with the raw HTML content of the target URL along with a response code:


<html>
<head>
...
</head>
<body>
...
</body>
</html>

The ScrapeOps Proxy API Aggregator will return a 200 status code when it successfully receives a response from the target website that also passes response validation, or a 404 status code if the website itself responds with a 404. Both of these status codes are considered successful requests.
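You can read this status code from the response object that page.goto() returns, for example:

// Inside the async function from the example above
const response = await page.goto('https://quotes.toscrape.com/page/1/', { timeout: 180000 });

// 200 and 404 both count as successful requests
if (response && [200, 404].includes(response.status())) {
  console.log('Successful response:', response.status());
} else {
  console.log('Request failed with status:', response && response.status());
}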

Here is the full list of status codes the Proxy API returns.

Limiting API Requests

When you use a headless browser to scrape a page, it will generate tens or hundreds of additional requests to download extra JS, CSS, and image files and to query external APIs. Each of these additional requests counts as a request to our Proxy API, so we recommend configuring your Puppeteer scraper to make only the requests that are absolutely necessary to retrieve the data you want to extract.
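One way to do this in Puppeteer is request interception, aborting the resource types your scraper doesn't need (a sketch; adjust the blocked types to suit your target site):

// Call this after creating the page and before page.goto()
await page.setRequestInterception(true);

page.on('request', (request) => {
  const blockedResourceTypes = ['image', 'stylesheet', 'font', 'media'];
  if (blockedResourceTypes.includes(request.resourceType())) {
    request.abort(); // this request never reaches the Proxy API
  } else {
    request.continue();
  }
});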