NodeJs Puppeteer Examples

The following code examples show how to integrate the ScrapeOps Proxy Aggregator with your NodeJs Puppeteer scrapers.

Authorisation - API Key

To use the ScrapeOps proxy, you first need an API key which you can get by signing up for a free account here.

Your API key must be included with every request using the api_key query parameter, otherwise the API will return a 403 Forbidden status code.
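For example, if you were using the API endpoint integration directly, each request would pass the api_key parameter in the query string. Here is a minimal sketch using Node 18+'s built-in fetch; the https://proxy.scrapeops.io/v1/ endpoint URL is an assumption based on the main proxy API docs:


// A minimal sketch of an API endpoint request; the endpoint URL is an
// assumption based on the main ScrapeOps proxy API docs.
const apiKey = 'YOUR_API_KEY';
const targetUrl = 'https://quotes.toscrape.com/page/1/';

fetch(`https://proxy.scrapeops.io/v1/?api_key=${apiKey}&url=${encodeURIComponent(targetUrl)}`)
  .then((res) => console.log('Status:', res.status)) // 403 means a missing or invalid key
  .catch((err) => console.error(err));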


Proxy Port Integration

When integrating the Proxy Aggregator with your Puppeteer scrapers, we recommend that you use our proxy port integration over the API endpoint integration.

The proxy port integration is a light front-end for the API. It has all the same functionality and performance as sending requests to the API endpoint, but allows you to integrate our proxy aggregator as you would any normal proxy.

The username for the proxy is scrapeops and the password is your API key.


"http://scrapeops.headless_browser_mode=true:YOUR_API_KEY@proxy.scrapeops.io:5353"

Here are the individual connection details:

  • Proxy: proxy.scrapeops.io
  • Port: 5353
  • Username: scrapeops.headless_browser_mode=true
  • Password: YOUR_API_KEY

We recommend the proxy port integration because using the API endpoint can create issues when loading pages and following links if the website uses relative links instead of absolute links.

We've also added the key/value pair headless_browser_mode=true to the username section of the proxy string, as this will optimize the proxy port for use with headless browsers like Puppeteer, Selenium, etc.

SSL Certificate Verification

Note: So that we can properly direct your requests through the API, your code must be configured to ignore SSL certificate verification errors by setting ignoreHTTPSErrors: true.
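For example, a minimal launch call looks like this (note: in recent Puppeteer versions this launch option was renamed to acceptInsecureCerts, so use whichever option your version supports):


const browser = await puppeteer.launch({
  ignoreHTTPSErrors: true, // renamed to acceptInsecureCerts in newer Puppeteer versions
});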

To enable extra/advanced functionality, you can pass parameters by adding them to the username, separated by periods.

For example, if you want to enable country geotargeting with US-based proxies, the username would be scrapeops.country=us.

Also, multiple parameters can be included by separating them with periods, for example:


"http://scrapeops.headless_browser_mode=true.country=us:YOUR_API_KEY@proxy.scrapeops.io:5353"


Integrating With Puppeteer Scrapers

To integrate our proxy with your Puppeteer scraper, you just need to define the proxy port settings, set Puppeteer to ignore HTTPS errors, and configure the proxy authentication:


const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

// ScrapeOps proxy configuration
const PROXY_USERNAME = 'scrapeops.headless_browser_mode=true';
const PROXY_PASSWORD = 'YOUR_API_KEY'; // <-- enter your API key here
const PROXY_SERVER = 'proxy.scrapeops.io';
const PROXY_SERVER_PORT = '5353';

(async () => {
  // Launch the browser with the ScrapeOps proxy server and ignore SSL errors
  const browser = await puppeteer.launch({
    ignoreHTTPSErrors: true,
    args: [
      `--proxy-server=http://${PROXY_SERVER}:${PROXY_SERVER_PORT}`
    ]
  });
  const page = await browser.newPage();

  // Authenticate with the proxy using your ScrapeOps credentials
  await page.authenticate({
    username: PROXY_USERNAME,
    password: PROXY_PASSWORD,
  });

  try {
    await page.goto('https://quotes.toscrape.com/page/1/', { timeout: 180000 });

    // Load the rendered HTML into cheerio for parsing
    const bodyHTML = await page.evaluate(() => document.body.innerHTML);
    const $ = cheerio.load(bodyHTML);

    // Find the H1 text
    const h1Text = $('h1').text();
    console.log('h1Text:', h1Text);
  } catch (err) {
    console.log(err);
  }

  await browser.close();
})();

ScrapeOps will take care of the proxy selection and rotation for you, so you just need to send us the URL you want to scrape.


Response Format

After receiving a response from one of our proxy providers, the ScrapeOps Proxy API will respond with the raw HTML content of the target URL along with a response code:


<html>
<head>
...
</head>
<body>
...
</body>
</html>

The ScrapeOps Proxy API will return a 200 status code when it successfully gets a response from the website that also passes response validation, or a 404 status code if the website responds with a 404 status code. Both of these status codes are considered successful requests.
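To act on these status codes in your scraper, you can read them from the response that page.goto() returns. This sketch assumes the page object from the Puppeteer example above (i.e. it runs inside that async function):


const response = await page.goto('https://quotes.toscrape.com/page/1/', { timeout: 180000 });
const status = response.status();

// 200 and 404 both count as successful requests to the Proxy API
if (status === 200 || status === 404) {
  console.log('Successful response, status:', status);
} else {
  console.log('Request failed, status:', status);
}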

Here is the full list of status codes the Proxy API returns.

Limiting API Requests

When you use a headless browser to scrape a page, it will generate tens or hundreds of additional requests to download extra JS, CSS, and image files and to query external APIs. Each of these counts as an additional request to our Proxy API, so we recommend configuring your Puppeteer scraper to only make the requests that are absolutely necessary to retrieve the data you want to extract.
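One way to do this is with Puppeteer's built-in request interception, which lets you abort requests for resource types you don't need. A minimal sketch, to be run before page.goto() in the example above (blocking images, stylesheets, fonts, and media is an illustrative choice; adjust the list to what your target pages require):


// Enable request interception before navigating to the page
await page.setRequestInterception(true);

page.on('request', (request) => {
  // Abort requests for resource types that aren't needed to extract the data
  const blockedResourceTypes = ['image', 'stylesheet', 'font', 'media'];
  if (blockedResourceTypes.includes(request.resourceType())) {
    request.abort();
  } else {
    request.continue();
  }
});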