

Node.js Playwright Beginners Series Part 6: Using Proxies

So far in this Node.js Playwright Beginner Series, we have learned how to build a basic web scraper in Part 1, clean up messy data and handle edge cases in Part 2, save the data to files and databases in Part 3, and make our scraper more robust and scalable by handling failed requests and using concurrency in Part 4. In Part 5, we learned how to use fake user-agents and browser headers to bypass restrictions on sites that try to prevent scraping.

In Part 6, we'll explore how to use proxies to bypass various website restrictions by hiding your real IP address and location without needing to worry about user agents and headers.

Node.js Playwright 6-Part Beginner Series

  • Part 1: Basic Node.js Playwright Scraper - Learn the fundamentals of web scraping with Node.js and build your first scraper using Playwright. (Part 1)

  • Part 2: Cleaning Unruly Data & Handling Edge Cases - Web data can be messy and unpredictable. In this part, we'll create a robust scraper using data structures and cleaning techniques to handle these challenges. (Part 2)

  • Part 3: Storing Scraped Data in AWS S3, MySQL & Postgres DBs - Explore various options for storing your scraped data, including databases like MySQL or Postgres, cloud storage like AWS S3, and file formats like CSV and JSON. We'll discuss their pros, cons, and suitable use cases. (Part 3)

  • Part 4: Managing Retries & Concurrency - Enhance your scraper's reliability and scalability by handling failed requests and utilizing concurrency. (Part 4)

  • Part 5: Faking User-Agents & Browser Headers - Learn how to create a production-ready scraper by simulating real users through user-agent and browser header manipulation. (Part 5)

  • Part 6: Using Proxies To Avoid Getting Blocked - Discover how to use proxies to bypass anti-bot systems by disguising your real IP address and location. (This Article)


Why Use Proxies?

Proxies are intermediary servers that route your web requests through a different IP address, effectively masking your real identity and location on the web. They act as middlemen between your device and the target website.

Scraping data from websites can be challenging due to restrictions like location-based blocks or IP bans.

Proxies help overcome these obstacles by:

  • Bypassing Restrictions:
    • Proxies mask your real IP address and location.
    • Your request is routed through a proxy server, so the website only sees the proxy’s IP, not yours.

  • Accessing Geo-Restricted Content:
    • Websites often show different content based on location.
    • Proxies allow you to appear as if you're browsing from different regions, enabling access to location-specific data.

  • Enhancing Security:
    • Some proxies (for example, HTTPS proxies) encrypt traffic between your device and the proxy server.
    • This adds a layer of protection against third-party interception.

  • Avoiding Rate Limits:
    • Using multiple proxies allows you to distribute requests across different IPs.
    • This helps prevent hitting rate limits or being flagged by the website.

Proxies are essential for scalable, secure, and unrestricted web scraping.
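
In Playwright, routing a request through a proxy only takes a proxy option at browser launch. Here’s a minimal sketch, assuming a placeholder proxy address (proxy.example.com:8080 is not a real proxy; substitute one of your own):

const { chromium } = require("playwright");

(async () => {
  // All traffic from this browser is routed through the proxy,
  // so the target site sees the proxy's IP instead of yours.
  const browser = await chromium.launch({
    proxy: { server: "http://proxy.example.com:8080" } // placeholder address
  });

  const page = await browser.newPage();
  await page.goto("https://httpbin.org/ip"); // echoes the IP the site sees
  console.log(await page.textContent("body"));

  await browser.close();
})();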


The 3 Most Common Proxy Integration Methods

When integrating proxies into your web scraper, there are several approaches to choose from:

  1. Rotating Through a List of Proxy IPs
  2. Using Proxy Gateways
  3. Using Proxy API Endpoints

While manually rotating through a list of raw proxy IPs can work, it’s often inefficient and requires constant maintenance (e.g., replacing blocked or dead IPs).

Instead, using proxy gateways or proxy API endpoints is a preferred approach. These services act as intermediaries, managing the rotation and availability of IPs, making the integration smoother and more reliable.

Let’s dive into each method!

Proxy Integration #1: Rotating Through Proxy IP List

Rotating through a proxy IP list involves manually managing a list of static proxies and cycling through them during scraping. This method is suitable for scenarios where you have a relatively small number of requests and don’t need to scale rapidly.

For this demonstration, we'll use Free Proxy List to extract proxies that support HTTPS. We will then verify each proxy's functionality by accessing a test website.

Free Proxy List

Here’s how you can implement this:

const playwright = require("playwright");

async function getProxies() {
  const url = "https://free-proxy-list.net/";
  const browser = await playwright.chromium.launch();
  const page = await browser.newPage();
  await page.goto(url);

  // Grab every proxy row and keep only those that support HTTPS
  const proxies = await page.evaluate(() => {
    const rows = Array.from(document.querySelectorAll("tbody tr"));
    return rows
      .filter(row => row.children[6].innerText === "yes")
      .map(row => ({
        ip: row.children[0].innerText,
        port: row.children[1].innerText
      }));
  });

  await browser.close();
  return proxies;
}

async function testProxy(proxy, index) {
  // Launch a browser that routes all traffic through the proxy
  const browser = await playwright.chromium.launch({
    proxy: {
      server: `http://${proxy.ip}:${proxy.port}`
    }
  });

  const page = await browser.newPage();

  try {
    // httpbin.org/ip echoes back the IP address it sees
    await page.goto("https://httpbin.org/ip");
    const content = await page.content();
    console.log(index, content);
  } catch (e) {
    console.log(index, `Failed to use ${proxy.ip}:${proxy.port}`);
  }

  await browser.close();
}

(async () => {
  const proxies = await getProxies();
  const proxyCount = proxies.length;

  for (let i = 0; i < proxyCount; i++) {
    const proxy = proxies[i];
    await testProxy(proxy, i);
  }
})();


In the above code:

  • The getProxies() function scrapes the proxies from the Free Proxy List, specifically targeting rows where the "HTTPS" column indicates availability.
  • Then the testProxy() function tests each proxy by accessing httpbin.org/ip to verify its functionality.

Proxies from Free Proxy List are often unreliable and may not work as expected. For example, only a few of the proxies we tested were successful. Here are some results from our tests:

// 0 <html><head><meta name="color-scheme" content="light dark"><meta charset="utf-8"></head><body><pre>{
// "origin": "189.240.60.168"
// }
// </pre><div class="json-formatter-container"></div></body></html>
// 1 Failed to use 68.178.168.41:80
// 2 Failed to use 160.86.242.23:8080
// 3 <html><head><meta name="color-scheme" content="light dark"><meta charset="utf-8"></head><body><pre>{
// "origin": "164.52.206.180"
// }
// </pre><div class="json-formatter-container"></div></body></html>

This approach can work for a limited number of requests but may require proxy monitoring for large-scale scraping to handle and replace blocked proxies.
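
If you do stay with a static list, the usual pattern is to rotate through the working proxies round-robin so each request leaves from a different IP. Here’s a minimal sketch, assuming the proxies array returned by getProxies() above:

// Round-robin rotation over the proxy list: each call uses the next proxy in turn
let proxyIndex = 0;

function getNextProxy(proxies) {
  const proxy = proxies[proxyIndex % proxies.length];
  proxyIndex++;
  return proxy;
}

async function fetchWithRotation(proxies, url) {
  const proxy = getNextProxy(proxies);
  const browser = await playwright.chromium.launch({
    proxy: { server: `http://${proxy.ip}:${proxy.port}` }
  });
  const page = await browser.newPage();

  try {
    await page.goto(url);
    return await page.content();
  } finally {
    await browser.close();
  }
}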

In subsequent sections, we'll explore more advanced and scalable approaches.

Proxy Integration #2: Using Proxy Gateway

As the landscape of proxy services evolves, many providers are shifting from offering static IP lists to providing access via proxy gateways.

This approach simplifies proxy management by allowing the provider to handle IP rotation and management. It’s particularly advantageous for accessing residential, mobile, and increasingly datacenter proxies.

Here are the advantages of using a proxy gateway:

  • You no longer need to manually manage and rotate individual IP addresses. The proxy gateway takes care of this for you.
  • Proxy gateways often come with built-in features like IP rotation, blacklist management, and performance monitoring.
  • Ideal for high-volume scraping, as you don’t need to worry about the maintenance and rotation of a large list of proxies.

BrightData (formerly Luminati) is a well-known provider offering such proxy gateway services. Here’s a basic example of how to integrate BrightData’s residential proxy gateway into a Node.js scraper using Playwright:

const { chromium } = require("playwright");

(async () => {
  for (let i = 0; i < 100; i++) {
    // Playwright expects proxy credentials in dedicated username/password fields
    const browser = await chromium.launch({
      proxy: {
        server: "http://brd.superproxy.io:22225",
        username: "username",
        password: "password"
      }
    });
    const page = await browser.newPage();

    try {
      await page.goto("https://httpbin.org/ip");
      const content = await page.textContent("body");
      console.log(content);
    } catch (e) {
      console.log("Failed to use proxy");
    }

    await browser.close();
  }
})();

In the script above:

  • The proxy object in chromium.launch() specifies the BrightData gateway endpoint in server, with the authentication credentials passed through the separate username and password fields, which is where Playwright expects them. Be sure to replace "username", "password", and "brd.superproxy.io:22225" with your actual credentials and endpoint.

  • The code runs a loop 100 times, each time launching a new browser instance with the proxy settings. This allows you to test multiple requests through the proxy.

  • Each browser instance navigates to httpbin.org/ip, which returns the IP address of the request. The content of the page is logged to verify if the proxy is working correctly.

Using a proxy gateway significantly reduces the complexity of proxy management, making it easier to focus on scraping tasks without worrying about the intricacies of proxy rotation and maintenance. In the next sections, we’ll explore even more advanced proxy integration methods.
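
One refinement worth knowing about: instead of launching a fresh browser for every request, Playwright can also apply a proxy per browser context via browser.newContext({ proxy: ... }), letting you reuse a single browser instance. Here’s a rough sketch using the same placeholder BrightData credentials as above (note that, depending on your Playwright version and platform, Chromium may need a proxy set at launch before per-context proxies work, so check the Playwright docs):

const { chromium } = require("playwright");

(async () => {
  // Launch the browser once and give each context its own proxied session
  const browser = await chromium.launch();

  for (let i = 0; i < 5; i++) {
    const context = await browser.newContext({
      proxy: {
        server: "http://brd.superproxy.io:22225", // placeholder gateway endpoint
        username: "username",
        password: "password"
      }
    });
    const page = await context.newPage();
    await page.goto("https://httpbin.org/ip");
    console.log(await page.textContent("body"));
    await context.close();
  }

  await browser.close();
})();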

Proxy Integration #3: Using Proxy API Endpoint

Managing proxies manually can be a time-consuming and error-prone process, especially when scaling your scraping efforts. To simplify this, many modern proxy providers offer smart proxy APIs that take care of proxy management and rotation for you.

One of the standout solutions in this space is ScrapeOps Proxy Manager, which automates the entire proxy process.

Why Use ScrapeOps Proxy Manager?

  • Seamless Integration: You only need to send the target URL and your API key, and ScrapeOps handles the rest.
  • Automated Proxy Rotation: The service automatically rotates IP addresses and optimizes the connection, saving you from manually managing proxy lists.
  • Advanced Features: Built-in functionalities like header rotation and request retries ensure you can focus purely on scraping without worrying about proxy infrastructure.
  • Scalable: Ideal for large-scale web scraping projects as it handles high-volume requests effortlessly.

You can get your free API key with 1,000 free requests by signing up here.
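
Before wiring the key into Playwright, it can be worth sanity-checking it with a plain HTTP request. Here’s a quick sketch using Node’s built-in fetch (available in Node 18+):

(async () => {
  const apiKey = "<YOUR_SCRAPE_OPS_KEY>"; // replace with your real key
  const targetUrl = encodeURIComponent("https://httpbin.org/ip");
  const url = `https://proxy.scrapeops.io/v1/?api_key=${apiKey}&url=${targetUrl}`;

  // A 200 status with an IP address in the body means the key and proxy are working
  const response = await fetch(url);
  console.log(response.status, await response.text());
})();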

Here’s a quick demonstration of how to integrate ScrapeOps Proxy Manager into a Playwright scraper.

const playwright = require("playwright");

(async () => {
  const apiKey = "<YOUR_SCRAPE_OPS_KEY>";
  const targetUrl = encodeURIComponent("https://httpbin.org/ip");
  const scrapeOpsUrl = `https://proxy.scrapeops.io/v1/?api_key=${apiKey}&url=${targetUrl}`;

  const browser = await playwright.chromium.launch();
  const page = await browser.newPage();

  try {
    await page.goto(scrapeOpsUrl);
    const content = await page.textContent("body");
    console.log(content);
  } catch (e) {
    console.log("Failed to fetch the URL through ScrapeOps proxy");
  }

  await browser.close();
})();

  • Replace <YOUR_SCRAPE_OPS_KEY> with your actual API key to authenticate with the ScrapeOps service.
  • Ensure the target URL is encoded with encodeURIComponent() to prevent issues with query parameters.

With ScrapeOps, you can eliminate the hassle of managing proxies manually, making your scraping process more reliable and efficient. If you're serious about scaling your scraping operations, this is one of the best solutions out there.


Integrating the ScrapeOps Proxy Aggregator into the Existing Scraper

In the previous part of this series, we manually configured user agents and headers to help evade detection.

However, with ScrapeOps Proxy Aggregator, this manual setup is no longer necessary. ScrapeOps handles the rotation of both proxies and headers automatically, providing a seamless experience that greatly reduces the chance of your scraper being blocked.

To integrate the ScrapeOps Proxy Aggregator into our scraper, we’ll create a new function named makeScrapeOpsRequest(). This function will construct the ScrapeOps API URL using your API key and the target URL, and then pass it to our existing makeRequest() function.

Here’s how the code looks:

async function makeScrapeOpsRequest(page, url) {
  const payload = {
    api_key: "<YOUR_SCRAPE_OPS_KEY>",
    url: url // URLSearchParams will URL-encode this for us
  };

  const proxyUrl = `https://proxy.scrapeops.io/v1/?${new URLSearchParams(payload).toString()}`;

  return makeRequest(page, proxyUrl, 3, true);
}

In this snippet, we use URLSearchParams() to convert our payload object (which contains the API key and the target URL) into a query string. URLSearchParams takes care of URL-encoding each value, so the target URL doesn’t need to be passed through encodeURIComponent() first, and the key-value pairs are appended to the base ScrapeOps API URL in a clean, structured format.
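
For example, here’s the query string URLSearchParams produces (using a made-up API key purely for illustration):

// "abc123" is a placeholder API key for illustration only
const qs = new URLSearchParams({
  api_key: "abc123",
  url: "https://httpbin.org/ip"
}).toString();

console.log(qs);
// api_key=abc123&url=https%3A%2F%2Fhttpbin.org%2Fip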

By passing this newly formed ScrapeOps API URL to our makeRequest() function, ScrapeOps will handle everything for you, including proxy rotation and request headers.
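
As a quick usage sketch (assuming this runs inside an async function with an open Playwright page, and that makeRequest() is the retry helper from Part 4):

// Fetch the target page through the ScrapeOps proxy, with retries and an anti-bot check
const response = await makeScrapeOpsRequest(page, "https://www.chocolate.co.uk/collections/all");

if (response) {
  console.log("Fetched page with status", response.status());
} else {
  console.log("Request failed after retries or was blocked");
}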


Complete Code

We now have a fully functional scraper that gathers all the required data and saves it to a JSON file. Here's the complete code for our scraper:

const { chromium } = require('playwright');
const fs = require('fs');
const { Worker, isMainThread, parentPort, workerData } = require('worker_threads');

class Product {
  constructor(name, priceStr, url) {
    this.name = this.cleanName(name);
    this.priceGb = this.cleanPrice(priceStr);
    this.priceUsd = this.convertPriceToUsd(this.priceGb);
    this.url = this.createAbsoluteUrl(url);
  }

  cleanName(name) {
    if (name === " " || name === "" || name == null) {
      return "missing";
    }
    return name.trim();
  }

  cleanPrice(priceStr) {
    if (!priceStr) {
      return 0.0;
    }
    priceStr = priceStr.trim();
    priceStr = priceStr.replace("Sale price£", "");
    priceStr = priceStr.replace("Sale priceFrom £", "");
    if (priceStr === "") {
      return 0.0;
    }
    return parseFloat(priceStr);
  }

  convertPriceToUsd(priceGb) {
    return priceGb * 1.29;
  }

  createAbsoluteUrl(url) {
    if (url === "" || url == null) {
      return "missing";
    }
    return "https://www.chocolate.co.uk" + url;
  }
}

class ProductDataPipeline {
  constructor(jsonFileName = "", storageQueueLimit = 5) {
    this.seenProducts = new Set();
    this.storageQueue = [];
    this.jsonFileName = jsonFileName;
    this.jsonFileOpen = false;
    this.storageQueueLimit = storageQueueLimit;
  }

  saveToJson() {
    if (this.storageQueue.length <= 0) {
      return;
    }

    const fileExists = fs.existsSync(this.jsonFileName);
    let existingData = [];
    if (fileExists) {
      const fileContent = fs.readFileSync(this.jsonFileName, "utf8");
      existingData = JSON.parse(fileContent);
    }

    const mergedData = [...existingData, ...this.storageQueue];
    fs.writeFileSync(this.jsonFileName, JSON.stringify(mergedData, null, 2));
    this.storageQueue = []; // Clear the queue after saving
  }

  cleanRawProduct(rawProduct) {
    return new Product(rawProduct.name, rawProduct.price, rawProduct.url);
  }

  isDuplicateProduct(product) {
    if (!this.seenProducts.has(product.url)) {
      this.seenProducts.add(product.url);
      return false;
    }
    return true;
  }

  addProduct(rawProduct) {
    const product = this.cleanRawProduct(rawProduct);
    if (!this.isDuplicateProduct(product)) {
      this.storageQueue.push(product);
      if (this.storageQueue.length >= this.storageQueueLimit) {
        this.saveToJson();
      }
    }
  }

  async close() {
    if (this.storageQueue.length > 0) {
      this.saveToJson();
    }
  }
}

const listOfUrls = ["https://www.chocolate.co.uk/collections/all"];

async function makeRequest(page, url, retries = 3, antiBotCheck = false) {
  for (let i = 0; i < retries; i++) {
    try {
      const response = await page.goto(url);
      const status = response.status();
      if ([200, 404].includes(status)) {
        if (antiBotCheck && status == 200) {
          const content = await page.content();
          if (content.includes("<title>Robot or human?</title>")) {
            return null;
          }
        }
        return response;
      }
    } catch (e) {
      console.log(`Failed to fetch ${url}, retrying...`);
    }
  }
  return null;
}

async function makeScrapeOpsRequest(page, url) {
  const payload = {
    api_key: "<YOUR_SCRAPE_OPS_KEY>",
    url: url // URLSearchParams will URL-encode this for us
  };

  const proxyUrl = `https://proxy.scrapeops.io/v1/?${new URLSearchParams(payload).toString()}`;

  return makeRequest(page, proxyUrl, 3, true);
}

async function scrape(url) {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();

  const response = await makeScrapeOpsRequest(page, url);
  if (!response) {
    await browser.close();
    return { nextUrl: null, products: [] };
  }

  const productItems = await page.$$eval("product-item", items =>
    items.map(item => {
      const titleElement = item.querySelector(".product-item-meta__title");
      const priceElement = item.querySelector(".price");
      return {
        name: titleElement ? titleElement.textContent.trim() : null,
        price: priceElement ? priceElement.textContent.trim() : null,
        url: titleElement ? titleElement.getAttribute("href") : null
      };
    })
  );

  const nextUrl = await nextPage(page);
  await browser.close();

  return {
    nextUrl: nextUrl,
    products: productItems.filter(item => item.name && item.price && item.url)
  };
}

async function nextPage(page) {
  let nextUrl = null;
  try {
    nextUrl = await page.$eval("a.pagination__nav-item:nth-child(4)", item => item.href);
  } catch (error) {
    console.log('Last Page Reached');
  }
  return nextUrl;
}

if (isMainThread) {
  const pipeline = new ProductDataPipeline("chocolate.json", 5);
  const workers = [];

  for (const url of listOfUrls) {
    workers.push(
      new Promise((resolve, reject) => {
        const worker = new Worker(__filename, {
          workerData: { startUrl: url }
        });
        console.log("Worker created", worker.threadId, url);

        worker.on("message", (product) => {
          pipeline.addProduct(product);
        });

        worker.on("error", reject);
        worker.on("exit", (code) => {
          if (code !== 0) {
            reject(new Error(`Worker stopped with exit code ${code}`));
          } else {
            console.log("Worker exited");
            resolve();
          }
        });
      })
    );
  }

  Promise.all(workers)
    .then(() => pipeline.close())
    .then(() => console.log("Pipeline closed"));
} else {
  const { startUrl } = workerData;
  const handleWork = async (workUrl) => {
    const { nextUrl, products } = await scrape(workUrl);
    for (const product of products) {
      parentPort.postMessage(product);
    }

    if (nextUrl) {
      console.log("Worker working on", nextUrl);
      await handleWork(nextUrl);
    }
  };

  handleWork(startUrl).then(() => console.log("Worker finished"));
}

Here’s the "chocolate.json" file that will be generated when you run the code above:

Final JSON Output


Conclusion

In this guide, we explored how proxies help bypass website restrictions by masking your real IP address and location. We discussed the three most common proxy integration methods in detail and finished by integrating the ScrapeOps Proxy Aggregator into our existing scraper code.

This six-part guide has walked you through the complete process of building a production-ready web scraper from scratch, resulting in a powerful and efficient tool capable of tackling real-world challenges.

Along the way, we explored key web scraping concepts, including setting up your scraper, managing data, and handling website restrictions using proxy integration.

I hope you enjoyed following along and feel confident in applying these techniques to your projects.

Happy scraping!

You can revisit any of the previous articles in the Node.js Playwright 6-Part Beginner Series:

  • Part 1: Basic Node.js Playwright Scraper - Learn the fundamentals of web scraping with Node.js and build your first scraper using Playwright. (Part 1)

  • Part 2: Cleaning Unruly Data & Handling Edge Cases - Web data can be messy and unpredictable. In this part, we'll create a robust scraper using data structures and cleaning techniques to handle these challenges. (Part 2)

  • Part 3: Storing Scraped Data in AWS S3, MySQL & Postgres DBs - Explore various options for storing your scraped data, including databases like MySQL or Postgres, cloud storage like AWS S3, and file formats like CSV and JSON. We'll discuss their pros, cons, and suitable use cases. (Part 3)

  • Part 4: Managing Retries & Concurrency - Enhance your scraper's reliability and scalability by handling failed requests and utilizing concurrency. (Part 4)

  • Part 5: Faking User-Agents & Browser Headers - Learn how to create a production-ready scraper by simulating real users through user-agent and browser header manipulation. (Part 5)

  • Part 6: Using Proxies To Avoid Getting Blocked - Discover how to use proxies to bypass anti-bot systems by disguising your real IP address and location. (This Article)