
NodeJS Puppeteer Beginners Series Part 6 - Using Proxies To Avoid Getting Blocked

So far in this Node.js Puppeteer 6-Part Beginner Series, we have learned how to build a basic web scraper in Part 1, clean and structure messy scraped data in Part 2, save it to a file or database in Part 3, and make our scraper more robust and scalable by handling failed requests and using concurrency in Part 4. In Part 5, we learned how to use fake user-agents and browser headers to bypass restrictions on sites that try to prevent scraping.

In Part 6, we'll explore how to use proxies to bypass various website restrictions by hiding your real IP address and location without needing to worry about user agents and headers.

Node.js Puppeteer 6-Part Beginner Series

  • Part 1: Basic Node.js Puppeteer Scraper - We'll learn the fundamentals of web scraping with Node.js and build your first scraper using Node.js Puppeteer. (Part 1)

  • Part 2: Cleaning Unruly Data & Handling Edge Cases - Web data can be messy and unpredictable. In this part, we'll create a robust scraper using data structures and cleaning techniques to handle these challenges. (Part 2)

  • Part 3: Storing Scraped Data in AWS S3, MySQL & Postgres DBs - Explore various options for storing your scraped data, including databases like MySQL or Postgres, cloud storage like AWS S3, and file formats like CSV and JSON. We'll discuss their pros, cons, and suitable use cases. (Part 3)

  • Part 4: Managing Retries & Concurrency - Enhance your scraper's reliability and scalability by handling failed requests and utilizing concurrency. (Part 4)

  • Part 5: Faking User-Agents & Browser Headers - Learn how to create a production-ready scraper by simulating real users through user-agent and browser header manipulation. (Part 5)

  • Part 6: Using Proxies To Avoid Getting Blocked - Discover how to use proxies to bypass anti-bot systems by disguising your real IP address and location. (This article)

Need help scraping the web?

Then check out ScrapeOps, the complete toolkit for web scraping.


Why Use Proxies?

When scraping data from websites, obstacles like location-based restrictions or IP bans can pose significant challenges. This is where proxies become invaluable.

Proxies enable you to bypass these restrictions by concealing your actual IP address and location. When you employ a proxy, your request first goes through a proxy server, which acts as an intermediary. Consequently, the website only sees the proxy's IP address, not yours.

Websites often display different content based on the user's location. Without a proxy, you might not be able to access location-specific information that you need.

Moreover, some proxies (for example, HTTPS proxies) add a layer of security by encrypting your data as it travels between your device and the proxy server, helping protect it from interception by third parties.

Additionally, using multiple proxies simultaneously allows you to distribute your scraping requests across different IP addresses, helping you avoid website rate limits.
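
A quick way to confirm a proxy is actually being applied is to compare the IP address a site reports with and without one. Here's a minimal sketch, where 'http://your-proxy:port' is a placeholder you'd replace with a proxy from your provider:

const puppeteer = require('puppeteer');

// Compare the IP a site sees with and without a proxy.
// 'http://your-proxy:port' is a placeholder; swap in a real proxy from your provider.
const checkIp = async (proxyServer) => {
  const browser = await puppeteer.launch({
    args: proxyServer ? [`--proxy-server=${proxyServer}`] : []
  });
  const page = await browser.newPage();
  await page.goto('https://icanhazip.com/');
  const ip = await page.evaluate(() => document.body.textContent.trim());
  await browser.close();
  return ip;
};

(async () => {
  console.log('Direct IP:', await checkIp());
  console.log('Proxied IP:', await checkIp('http://your-proxy:port'));
})();

If the two IP addresses differ, your traffic is being routed through the proxy.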


The 3 Most Common Proxy Integrations

Let's explore the three most common ways to integrate proxies with Puppeteer:

  1. Rotating Through a List of Proxy IPs
  2. Using Proxy Gateways
  3. Using Proxy API Endpoints

Previously, proxy providers offered lists of IP addresses, and you'd configure your scraper to cycle through them, using a new IP for each request. However, this method is less common now.

Many providers now offer access through proxy gateways or proxy API endpoints instead of raw lists. These gateways act as intermediaries, routing your requests through their pool of IPs.

Proxy Integration #1: Rotating Through a Proxy IP List

Rotating proxies is essential because websites can block scrapers sending numerous requests from the same IP address. By frequently changing the IP address used, this technique makes it harder for websites to detect and block your scraping activity.

Here’s how you can rotate through a list of proxies using Puppeteer:

const puppeteer = require('puppeteer');
const proxyList = ['http://proxy1:port', 'http://proxy2:port', 'http://proxy3:port']; // Add your proxies here

(async () => {
  for (let i = 0; i < proxyList.length; i++) {
    const browser = await puppeteer.launch({
      args: [`--proxy-server=${proxyList[i]}`]
    });
    const page = await browser.newPage();
    try {
      await page.goto('https://icanhazip.com/');
      const ip = await page.evaluate(() => document.body.textContent.trim());
      console.log(`Proxy IP: ${proxyList[i]}, Actual IP: ${ip}`);
    } catch (error) {
      console.log(`Failed to use proxy ${proxyList[i]}`);
    }
    await browser.close();
  }
})();

This script launches a new browser instance for each proxy in the list, navigates to a website that displays the IP address, and logs the result. If a proxy fails, the script moves on to the next one in the list.
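
In practice, you usually won't launch a separate browser per proxy up front. Instead, you pick the next proxy from the list each time you launch a browser or make a request. Here's a minimal round-robin helper, assuming the same proxyList array from the example above:

// Round-robin helper: returns the next proxy from the list on each call.
// Assumes the proxyList array defined in the previous example.
const makeProxyRotator = (proxies) => {
  let index = 0;
  return () => {
    const proxy = proxies[index % proxies.length];
    index += 1;
    return proxy;
  };
};

const nextProxy = makeProxyRotator(proxyList);
// Each new launch then uses the next proxy in the rotation:
// const browser = await puppeteer.launch({ args: [`--proxy-server=${nextProxy()}`] });

This keeps the rotation logic in one place, so the rest of your scraper doesn't need to know how many proxies you have.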

Proxy Integration #2: Using Proxy Gateway

Many proxy providers are now offering access through proxy gateways, eliminating the need to manage and rotate individual IP addresses. The provider handles this for you, making it a preferred method for residential and mobile proxies.

Here's an example of how to integrate a proxy gateway with Puppeteer:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    args: ['--proxy-server=http://gateway.proxyprovider.com:port'] // Replace with your proxy gateway
  });
  const page = await browser.newPage();
  await page.authenticate({
    username: 'your-username',
    password: 'your-password'
  });
  await page.goto('https://icanhazip.com/');
  const ip = await page.evaluate(() => document.body.textContent.trim());
  console.log(`IP Address: ${ip}`);
  await browser.close();
})();

Using a proxy gateway simplifies the integration as you don’t need to handle the rotation logic manually. The provider’s gateway takes care of routing requests through different IP addresses.
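
One practical tip: rather than hardcoding the gateway address and credentials, you can read them from environment variables. Here's a small sketch, assuming you export PROXY_GATEWAY, PROXY_USERNAME, and PROXY_PASSWORD before running the script (these variable names are just examples):

const puppeteer = require('puppeteer');

// Read the gateway details from environment variables instead of hardcoding them.
// PROXY_GATEWAY, PROXY_USERNAME and PROXY_PASSWORD are example names; use whatever your setup defines.
(async () => {
  const browser = await puppeteer.launch({
    args: [`--proxy-server=${process.env.PROXY_GATEWAY}`]
  });
  const page = await browser.newPage();
  await page.authenticate({
    username: process.env.PROXY_USERNAME,
    password: process.env.PROXY_PASSWORD
  });
  await page.goto('https://icanhazip.com/');
  console.log(await page.evaluate(() => document.body.textContent.trim()));
  await browser.close();
})();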

Proxy Integration #3: Using Proxy API Endpoint

Many proxy providers now offer smart proxy APIs that manage your proxy infrastructure by automatically rotating proxies and headers. This allows you to focus on extracting the data you need.

Here’s an example using Puppeteer with a proxy API endpoint:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const targetUrl = 'https://httpbin.org/ip';
  const proxyApiUrl = `https://proxyapi.provider.com?api_key=YOUR_API_KEY&url=${encodeURIComponent(targetUrl)}`;

  await page.goto(proxyApiUrl);
  const ipData = await page.evaluate(() => document.body.textContent.trim());
  console.log(`IP Data: ${ipData}`);
  await browser.close();
})();

In this example, the target URL is sent to the proxy API endpoint, which handles the proxy rotation and returns the response.
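
It's often convenient to wrap this pattern in a small helper that builds the proxied URL and checks the HTTP status before you try to parse anything. Here's a sketch, reusing the hypothetical proxyapi.provider.com endpoint and its api_key/url parameters from the example above:

// Helper around a proxy API endpoint; the endpoint and parameters are the placeholders used above.
const fetchViaProxyApi = async (page, targetUrl, apiKey) => {
  const proxyApiUrl = `https://proxyapi.provider.com?api_key=${apiKey}&url=${encodeURIComponent(targetUrl)}`;
  const response = await page.goto(proxyApiUrl, { waitUntil: 'domcontentloaded' });
  if (!response || response.status() !== 200) {
    throw new Error(`Proxy API request failed with status ${response ? response.status() : 'unknown'}`);
  }
  return page.content();
};

// Usage: const html = await fetchViaProxyApi(page, 'https://httpbin.org/ip', 'YOUR_API_KEY');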


Integrate Proxy Aggregator into the Existing Scraper

Integrating a proxy aggregator like ScrapeOps Proxy Aggregator simplifies proxy management further. You don’t need to worry about user agents or headers, as these are managed by the proxy aggregator.

Here’s how you can integrate it into an existing Puppeteer scraper:

const puppeteer = require('puppeteer');
const { encode } = require('querystring');

(async () => {
  const scrapeOpsApiKey = 'YOUR_API_KEY';
  const targetUrl = 'https://example.com';
  const proxyApiUrl = `https://proxy.scrapeops.io/v1/?${encode({ api_key: scrapeOpsApiKey, url: targetUrl })}`;

  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(proxyApiUrl);
  const content = await page.content();

  // Process the content as needed
  console.log(content);

  await browser.close();
})();

The proxy aggregator routes your requests through a pool of proxies, providing different user-agent strings and headers to help avoid detection and blocks.
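
To retrofit this into an existing scraper with minimal changes, you can build the proxied URL in one small wrapper and call it wherever you previously passed the target URL to page.goto(). A minimal sketch:

const { encode } = require('querystring');

const scrapeOpsApiKey = 'YOUR_API_KEY';

// Build the proxied URL in one place so the rest of the scraper only deals with target URLs.
const proxyUrl = (targetUrl) =>
  `https://proxy.scrapeops.io/v1/?${encode({ api_key: scrapeOpsApiKey, url: targetUrl })}`;

// Anywhere the scraper previously did page.goto(url), it can now do:
// await page.goto(proxyUrl(url), { waitUntil: 'networkidle2' });

The complete scraper below follows the same idea: its RetryLogic class builds the ScrapeOps URL internally, so the scraping code itself only works with the original target URLs.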


Complete Code

Below is a complete example of a Puppeteer scraper that integrates proxy management and scrapes data from a website, storing the results in a CSV file.

const puppeteer = require('puppeteer');
const cheerio = require('cheerio');
const fs = require('fs');
const path = require('path');

// Define the Product class
class Product {
  constructor(name, priceString, url) {
    this.name = this.cleanName(name);
    this.priceGBP = this.cleanPrice(priceString);
    this.priceUSD = this.convertPriceToUSD();
    this.url = this.createAbsoluteURL(url);
  }

  cleanName(name) {
    return name.trim() || 'missing';
  }

  cleanPrice(priceString) {
    if (!priceString) return 0.0;
    priceString = priceString.replace(/[^0-9\.]+/g, '');
    return parseFloat(priceString) || 0.0;
  }

  convertPriceToUSD() {
    const exchangeRate = 1.21;
    return this.priceGBP * exchangeRate;
  }

  createAbsoluteURL(relativeURL) {
    const baseURL = 'https://www.chocolate.co.uk';
    return relativeURL ? `${baseURL}${relativeURL}` : 'missing';
  }
}

// Define the ProductDataPipeline class
class ProductDataPipeline {
  constructor(csvFilename = '', storageQueueLimit = 5) {
    this.namesSeen = [];
    this.storageQueue = [];
    this.storageQueueLimit = storageQueueLimit;
    this.csvFilename = csvFilename;
    this.csvFileOpen = false;
  }

  saveToCSV() {
    this.csvFileOpen = true;
    const productsToSave = [...this.storageQueue];
    this.storageQueue = [];

    if (productsToSave.length === 0) return;

    const headers = Object.keys(productsToSave[0]);
    const fileExists = fs.existsSync(this.csvFilename);

    const csvWriter = fs.createWriteStream(this.csvFilename, { flags: 'a' });
    if (!fileExists) {
      csvWriter.write(headers.join(',') + '\n');
    }

    productsToSave.forEach(product => {
      const row = headers.map(header => product[header]).join(',');
      csvWriter.write(row + '\n');
    });

    csvWriter.end();
    this.csvFileOpen = false;
  }

  cleanRawProduct(scrapedData) {
    return new Product(scrapedData.name || '', scrapedData.price || '', scrapedData.url || '');
  }

  isDuplicate(product) {
    if (this.namesSeen.includes(product.name)) {
      console.log(`Duplicate item found: ${product.name}. Item dropped.`);
      return true;
    }
    this.namesSeen.push(product.name);
    return false;
  }

  addProduct(scrapedData) {
    const product = this.cleanRawProduct(scrapedData);
    if (!this.isDuplicate(product)) {
      this.storageQueue.push(product);
      if (this.storageQueue.length >= this.storageQueueLimit && !this.csvFileOpen) {
        this.saveToCSV();
      }
    }
  }

  closePipeline() {
    if (this.csvFileOpen) {
      setTimeout(() => this.saveToCSV(), 3000);
    } else if (this.storageQueue.length > 0) {
      this.saveToCSV();
    }
  }
}

// Define the RetryLogic class
class RetryLogic {
  constructor(retryLimit = 5, antiBotCheck = false, useFakeBrowserHeaders = false, scrapeOpsApiKey = '') {
    this.retryLimit = retryLimit;
    this.antiBotCheck = antiBotCheck;
    this.useFakeBrowserHeaders = useFakeBrowserHeaders;
    this.scrapeOpsApiKey = scrapeOpsApiKey;
  }

  async makeScrapeOpsRequest(page, url) {
    const payload = {
      api_key: this.scrapeOpsApiKey,
      url: url
    };
    const proxyUrl = `https://proxy.scrapeops.io/v1/?${new URLSearchParams(payload)}`;

    return this.makeRequest(page, proxyUrl);
  }

  async makeRequest(page, url) {
    for (let i = 0; i < this.retryLimit; i++) {
      try {
        await page.goto(url, { waitUntil: 'networkidle2' });
        const status = page.statusCode;
        if ([200, 404].includes(status)) {
          if (this.antiBotCheck && status === 200) {
            const passed = await this.passedAntiBotCheck(page);
            if (!passed) return { success: false, page };
          }
          return { success: true, page };
        }
      } catch (error) {
        console.log('Error:', error);
      }
    }
    return { success: false, page };
  }

  async passedAntiBotCheck(page) {
    const content = await page.content();
    return !content.includes('<title>Robot or human?</title>');
  }
}

// Define the scraping function
const scrapePage = async (url, retryLogic, dataPipeline, listOfUrls) => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  listOfUrls.splice(listOfUrls.indexOf(url), 1);

  // Keep track of the page status code (must be registered before navigating)
  page.on('response', response => {
    if (response.request().resourceType() === 'document') {
      page.statusCode = response.status();
    }
  });

  const { success, page: responsePage } = await retryLogic.makeScrapeOpsRequest(page, url);

  if (success) {
    const content = await responsePage.content();
    const $ = cheerio.load(content);
    const products = $('.product-item');
    products.each((index, product) => {
      const name = $(product).find('a.product-item-meta__title').text();
      const price = $(product).find('span.price').text().replace(/[^0-9\.]+/g, '');
      const url = $(product).find('div.product-item-meta a').attr('href');

      dataPipeline.addProduct({ name, price, url });
    });

    const nextPage = $('a[rel="next"]').attr('href');
    if (nextPage) {
      listOfUrls.push(`https://www.chocolate.co.uk${nextPage}`);
    }
  }

  await browser.close();
};

// Define the function to start concurrent scraping
const startConcurrentScrape = async (numThreads = 5, retryLogic, dataPipeline, listOfUrls) => {
  while (listOfUrls.length) {
    await Promise.all(listOfUrls.slice(0, numThreads).map(url => scrapePage(url, retryLogic, dataPipeline, listOfUrls)));
  }
};

// Initialize and run the scraper
const listOfUrls = ['https://www.chocolate.co.uk/collections/all'];

const dataPipeline = new ProductDataPipeline('product_data.csv');
const retryRequest = new RetryLogic(3, false, false, 'YOUR_API_KEY');

startConcurrentScrape(10, retryRequest, dataPipeline, listOfUrls).then(() => {
  dataPipeline.closePipeline();
});

The scraped data is written to product_data.csv with columns for name, priceGBP, priceUSD, and url.


Conclusion

This guide explored using proxies with Puppeteer to bypass website restrictions by concealing your real IP address and location. We discussed the three most common proxy integration methods in detail.

Finally, we integrated ScrapeOps Proxy Aggregator into a Puppeteer scraper to manage proxies efficiently.

You can revisit any of our previous articles in the Node.js Puppeteer 6-Part Beginner Series:

  • Part 1: Basic Node.js Puppeteer Scraper - We'll learn the fundamentals of web scraping with Node.js and build your first scraper using Node.js Puppeteer. (Part 1)

  • Part 2: Cleaning Unruly Data & Handling Edge Cases - Web data can be messy and unpredictable. In this part, we'll create a robust scraper using data structures and cleaning techniques to handle these challenges. (Part 2)

  • Part 3: Storing Scraped Data in AWS S3, MySQL & Postgres DBs - Explore various options for storing your scraped data, including databases like MySQL or Postgres, cloud storage like AWS S3, and file formats like CSV and JSON. We'll discuss their pros, cons, and suitable use cases. (Part 3)

  • Part 4: Managing Retries & Concurrency - Enhance your scraper's reliability and scalability by handling failed requests and utilizing concurrency. (Part 4)

  • Part 5: Faking User-Agents & Browser Headers - Learn how to create a production-ready scraper by simulating real users through user-agent and browser header manipulation. (Part 5)

  • Part 6: Using Proxies To Avoid Getting Blocked - Discover how to use proxies to bypass anti-bot systems by disguising your real IP address and location. (This article)