
Node.js Cheerio Beginners Series Part 6: Avoiding Detection with Proxies

So far in this Node.js Cheerio Beginners Series, we have learned how to build a basic web scraper in Part 1, clean and structure messy scraped data in Part 2, save that data to a file or database in Part 3, and make our scraper more robust and scalable by handling failed requests and using concurrency in Part 4. In Part 5, we learned how to use fake user-agents and browser headers to get past restrictions on sites that try to prevent scraping.

In Part 6, we'll explore how to use proxies to bypass various website restrictions by hiding your real IP address and location without needing to worry about user agents and headers.

Node.js Axios/CheerioJS 6-Part Beginner Series

This 6-part Node.js Axios/CheerioJS Beginner Series will walk you through building a web scraping project from scratch, covering everything from creating the scraper to deployment and scheduling.

  • Part 1: Basic Node.js Cheerio Scraper - We'll learn the fundamentals of web scraping with Node.js and build your first scraper using Cheerio. (Part 1)
  • Part 2: Cleaning Unruly Data & Handling Edge Cases - Web data can be messy and unpredictable. In this part, we'll create a robust scraper using data structures and cleaning techniques to handle these challenges. (Part 2)
  • Part 3: Storing Scraped Data - Explore various options for storing your scraped data, including databases like MySQL or Postgres, cloud storage like AWS S3, and file formats like CSV and JSON. We'll discuss their pros, cons, and suitable use cases. (Part 3)
  • Part 4: Managing Retries & Concurrency - Enhance your scraper's reliability and scalability by handling failed requests and utilizing concurrency. (Part 4)
  • Part 5: Mimicking User Behavior - Learn how to create a production-ready scraper by simulating real users through user-agent and browser header manipulation. (Part 5)
  • Part 6: Avoiding Detection with Proxies - Discover how to use proxies to bypass anti-bot systems by disguising your real IP address and location. (This article)
GitHub Code

The code for this project is available on GitHub.


Why Use Proxies?

Scraping data from websites can be tricky sometimes. Websites might restrict you based on your location or block your IP address. This is where proxies come in handy.

Proxies help you bypass these restrictions by hiding your real IP address and location. When you use a proxy, your request gets routed through a proxy server first, acting as an intermediary. This way, the website only sees the proxy's IP address, not yours.

Websites often serve different content depending on the visitor's location. Without a proxy, you may not be able to access region-specific information unless you are actually browsing from that region.

Furthermore, proxies can offer an extra layer of security by encrypting your data as it travels between your device and the server, protecting it from being intercepted by third parties.

Additionally, you can use multiple proxies at the same time to distribute your scraping requests across different IP addresses, avoiding website rate limits.
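To make this concrete, here is a minimal sketch of routing a single Axios request through a proxy using Axios's built-in proxy option. The proxy host and port below are placeholders; you would substitute values from your own proxy provider.

const axios = require("axios");

(async () => {
  // Route the request through a proxy server instead of connecting directly.
  // The target website only sees the proxy's IP address, not ours.
  const response = await axios.get("https://httpbin.org/ip", {
    proxy: {
      protocol: "http",
      host: "203.0.113.10", // placeholder proxy IP
      port: 8080, // placeholder proxy port
    },
  });

  // httpbin.org/ip echoes back the IP it sees, which should be the proxy's.
  console.log(response.data);
})();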


The 3 Most Common Proxy Integration Methods

Let's dive into integrating Node.js Axios with the 3 most common proxy integration methods:

  1. Rotating Through a List of Proxy IPs
  2. Using Proxy Gateways
  3. Using Proxy API Endpoints

Previously, proxy providers offered lists of IP addresses, and you'd configure your scraper to cycle through them, using a new IP for each request. However, this method is less common now.

Many providers now offer access through proxy gateways or proxy API endpoints instead of raw lists. These gateways act as intermediaries, routing your requests through their pool of IPs.

Proxy Integration #1: Rotating Through Proxy IP List

Using rotating proxies is crucial because websites can restrict access to scrapers that send many requests from the same IP address. This technique makes it harder for websites to track and block your scraping activity by constantly changing the IP address used.

The code snippet below fetches a list of free proxies from the Free Proxy List website. It extracts each proxy's IP address and port, filters out proxies that do not support HTTPS, and returns the resulting list.

const axios = require("axios");
const cheerio = require("cheerio");

async function getProxies() {
  const url = "https://free-proxy-list.net/";
  const response = await axios.get(url);
  const html = response.data;
  const $ = cheerio.load(html);

  const rows = $("tbody tr");
  const proxies = [];
  for (const row of rows) {
    const columns = $(row).find("td");
    const ip = $(columns[0]).text();
    const port = $(columns[1]).text();
    if ($(columns[6]).text() === "yes") {
      proxies.push({ ip, port });
    }
  }

  return proxies;
}

(async () => {
  const proxies = await getProxies();
  const proxyCount = proxies.length;
  let proxyIndex = 0;

  for (let i = 0; i < 100; i++) {
    const proxy = proxies[proxyIndex];

    try {
      const response = await axios.get("https://httpbin.org/ip", {
        proxy: {
          protocol: "http",
          host: proxy.ip,
          port: proxy.port,
        },
      });

      console.log(proxyIndex, response.data);
    } catch (e) {
      console.log(proxyIndex, `Failed to use ${proxy.ip}:${proxy.port}`);
    }

    proxyIndex = (proxyIndex + 1) % proxyCount;
  }
})();

The proxyIndex variable and the % (modulo) operation are used to cycle through the list of proxies we fetched. Keep in mind that free proxies come with limitations: not all of them will work reliably (or at all), which is why we wrap each request in a try/catch.

Running this, the output shows that 5 out of 11 requests were successful. Note that the returned IP address is different each time because the requests go out through the proxies rather than from our own address.

This is a simplistic example. For larger-scale scraping, you would need to monitor the performance of each IP and remove banned or blocked ones from the proxy pool.
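As a rough illustration of what that monitoring could look like, here is a simplified sketch (our own addition, not part of the scraper built in this series) that counts failures per proxy and removes a proxy from the pool after three failed requests. It assumes the proxies array returned by getProxies() above.

// Sketch: track failures per proxy and drop consistently failing ones.
const MAX_FAILURES = 3;
const failureCounts = new Map();

async function fetchWithHealthyProxy(proxies, targetUrl) {
  while (proxies.length > 0) {
    // Pick a random proxy from the remaining pool.
    const proxy = proxies[Math.floor(Math.random() * proxies.length)];
    const key = `${proxy.ip}:${proxy.port}`;

    try {
      const response = await axios.get(targetUrl, {
        proxy: { protocol: "http", host: proxy.ip, port: Number(proxy.port) },
        timeout: 10000, // don't wait forever on a dead proxy
      });
      failureCounts.set(key, 0); // reset the counter on success
      return response;
    } catch (e) {
      const failures = (failureCounts.get(key) || 0) + 1;
      failureCounts.set(key, failures);
      if (failures >= MAX_FAILURES) {
        // Remove the proxy so we stop wasting requests on it.
        proxies.splice(proxies.indexOf(proxy), 1);
      }
    }
  }
  throw new Error("No working proxies left in the pool");
}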

Proxy Integration #2: Using Proxy Gateway

Many proxy providers are moving away from selling static IP lists and instead offer access to their proxy pools through a gateway. This eliminates the need to manage and rotate individual IP addresses, as the provider handles that on your behalf. This has become the preferred method for using residential and mobile proxies, and increasingly for datacenter proxies as well.

Here is an example of how to integrate BrightData's residential proxy gateway into your Node.js Axios scraper:

const axios = require("axios");

(async () => {
  for (let i = 0; i < 100; i++) {
    try {
      const response = await axios.get("https://httpbin.org/ip", {
        proxy: {
          protocol: "http",
          host: "brd.superproxy.io",
          port: 22225,
          auth: {
            username: "username",
            password: "password",
          },
        },
      });

      console.log(response.data);
    } catch (e) {
      console.log("Failed to use proxy");
    }
  }
})();

Integrating via a gateway is significantly easier compared to a proxy list as you don't have to worry about implementing all the proxy rotation logic.
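If you'd rather not repeat the proxy settings on every request, one option is to create a dedicated Axios instance with the gateway set as a default and reuse it throughout your scraper. The sketch below uses the same placeholder gateway credentials as the example above.

const axios = require("axios");

// Reusable Axios instance that always routes requests through the gateway.
// Host, port, username, and password are placeholders for your provider's values.
const proxyClient = axios.create({
  proxy: {
    protocol: "http",
    host: "brd.superproxy.io",
    port: 22225,
    auth: {
      username: "username",
      password: "password",
    },
  },
});

(async () => {
  // Every request made with proxyClient goes through the gateway automatically.
  const response = await proxyClient.get("https://httpbin.org/ip");
  console.log(response.data);
})();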

Proxy Integration #3: Using Proxy API Endpoint

Recently, many proxy providers have begun offering smart proxy APIs. These APIs manage your proxy infrastructure for you, automatically rotating proxies and headers so you can focus on extracting the data you need.

Typically, you send the URL you want to scrape to an API endpoint, and the API returns the HTML response. While each provider's API integration differs slightly, most are very similar and easy to integrate with.

Here's an example of integrating with the ScrapeOps Proxy Manager:

const axios = require("axios");

(async () => {
  const response = await axios.get("https://proxy.scrapeops.io/v1/", {
    params: {
      api_key: "<YOUR_SCRAPE_OPS_KEY>",
      url: encodeURIComponent("https://httpbin.org/ip"),
    },
  });

  console.log(response.data);
})();

Here you simply send the URL you want to scrape to the ScrapeOps API endpoint in the URL query parameter, along with your API key in the api_key query parameter. ScrapeOps will then locate the optimal proxy for the target domain and deliver the HTML response directly to you.

You can get your free API key with 1,000 free requests by signing up here.

Note that when using proxy API endpoints, it is very important to encode the URL you want to scrape before sending it to the proxy API endpoint. If the URL contains query parameters, the proxy API might treat those parameters as its own rather than as part of the target website's URL.

To encode your URL you just need to use the encodeURIComponent function as we've done above in the example.
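For example, if the target URL itself contains query parameters, encoding keeps them attached to the target URL instead of being read as parameters of the proxy API (the API key below is a placeholder):

const targetUrl = "https://httpbin.org/get?category=chocolate&page=2";

// Without encoding, "&page=2" would look like a parameter of the proxy API itself.
const encodedUrl = encodeURIComponent(targetUrl);
console.log(encodedUrl);
// https%3A%2F%2Fhttpbin.org%2Fget%3Fcategory%3Dchocolate%26page%3D2

const proxyApiUrl = `https://proxy.scrapeops.io/v1/?api_key=<YOUR_SCRAPE_OPS_KEY>&url=${encodedUrl}`;
console.log(proxyApiUrl);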


Integrate Proxy Aggregator into the Existing Scraper

After integrating the ScrapeOps Proxy Aggregator, you won't need to worry about user agents and headers. ScrapeOps Proxy Aggregator acts as an intermediary between your scraper and the target website. It routes your requests through a pool of high-performing proxies from various providers.

These proxies already have different user-agent strings and other headers pre-configured that help you avoid detection and blocks, even without additional middleware.

So for our scraper, we can get rid of our existing header logic. Then we can add a new method named makeScrapeOpsRequest. This new method will craft the ScrapeOps API URL, which we then pass to our existing makeRequest method.

async function makeScrapeOpsRequest(url) {
  const payload = {
    api_key: "<YOUR_SCRAPE_OPS_KEY>",
    url: encodeURIComponent(url),
  };

  const proxyUrl = `https://proxy.scrapeops.io/v1?${new URLSearchParams(
    payload
  ).toString()}`;

  return makeRequest(proxyUrl, 3, true);
}

The above method starts by creating an object with the query parameter values we need: the API key and the URL. Note that we encode the URL. Then we use the URLSearchParams class to convert that object into a query string and append it to the ScrapeOps API URL.

Then we pass the appended URL to the existing makeRequest method. The integration is incredibly simple. You no longer need to worry about the user agents and browser headers we used before. Just send the URL you want to scrape to the ScrapeOps API endpoint, and it will return the HTML response.
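In practice, the only change inside the existing scrape function is swapping the request call, as you'll see in the complete code below:

// scrape() now routes the target URL through the ScrapeOps Proxy Aggregator.
const response = await makeScrapeOpsRequest(url);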


Complete Code

We did it! We have a fully functional scraper that creates a final CSV file containing all the desired data.

const axios = require("axios");
const cheerio = require("cheerio");
const fs = require("fs");
const {
  Worker,
  isMainThread,
  parentPort,
  workerData,
} = require("worker_threads");

class Product {
  constructor(name, priceStr, url) {
    this.name = this.cleanName(name);
    this.priceGb = this.cleanPrice(priceStr);
    this.priceUsd = this.convertPriceToUsd(this.priceGb);
    this.url = this.createAbsoluteUrl(url);
  }

  cleanName(name) {
    if (name == " " || name == "" || name == null) {
      return "missing";
    }
    return name.trim();
  }

  cleanPrice(priceStr) {
    priceStr = priceStr.trim();
    priceStr = priceStr.replace("Sale price£", "");
    priceStr = priceStr.replace("Sale priceFrom £", "");
    if (priceStr == "") {
      return 0.0;
    }
    return parseFloat(priceStr);
  }

  convertPriceToUsd(priceGb) {
    return priceGb * 1.29;
  }

  createAbsoluteUrl(url) {
    if (url == "" || url == null) {
      return "missing";
    }
    return "https://www.chocolate.co.uk" + url;
  }
}

class ProductDataPipeline {
  constructor(csvFilename = "", storageQueueLimit = 5) {
    this.seenProducts = new Set();
    this.storageQueue = [];
    this.csvFilename = csvFilename;
    this.csvFileOpen = false;
    this.storageQueueLimit = storageQueueLimit;
  }

  saveToCsv() {
    this.csvFileOpen = true;
    const fileExists = fs.existsSync(this.csvFilename);
    const file = fs.createWriteStream(this.csvFilename, { flags: "a" });
    if (!fileExists) {
      file.write("name,priceGb,priceUsd,url\n");
    }
    for (const product of this.storageQueue) {
      file.write(
        `${product.name},${product.priceGb},${product.priceUsd},${product.url}\n`
      );
    }
    file.end();
    this.storageQueue = [];
    this.csvFileOpen = false;
  }

  cleanRawProduct(rawProduct) {
    return new Product(rawProduct.name, rawProduct.price, rawProduct.url);
  }

  isDuplicateProduct(product) {
    if (!this.seenProducts.has(product.url)) {
      this.seenProducts.add(product.url);
      return false;
    }
    return true;
  }

  addProduct(rawProduct) {
    const product = this.cleanRawProduct(rawProduct);
    if (!this.isDuplicateProduct(product)) {
      this.storageQueue.push(product);
      if (
        this.storageQueue.length >= this.storageQueueLimit &&
        !this.csvFileOpen
      ) {
        this.saveToCsv();
      }
    }
  }

  async close() {
    while (this.csvFileOpen) {
      // Wait for the file to be written
      await new Promise((resolve) => setTimeout(resolve, 100));
    }
    if (this.storageQueue.length > 0) {
      this.saveToCsv();
    }
  }
}

const listOfUrls = ["https://www.chocolate.co.uk/collections/all"];

async function makeScrapeOpsRequest(url) {
  const payload = {
    api_key: "<YOUR_SCRAPE_OPS_KEY>",
    url: encodeURIComponent(url),
  };

  const proxyUrl = `https://proxy.scrapeops.io/v1?${new URLSearchParams(
    payload
  ).toString()}`;

  return makeRequest(proxyUrl, 3, true);
}

async function makeRequest(
  url,
  retries = 3,
  antiBotCheck = false,
  headers = {}
) {
  for (let i = 0; i < retries; i++) {
    try {
      const response = await axios.get(url, {
        headers: headers,
      });
      if ([200, 404].includes(response.status)) {
        if (antiBotCheck && response.status == 200) {
          if (response.data.includes("<title>Robot or human?</title>")) {
            return null;
          }
        }
        return response;
      }
    } catch (e) {
      console.log(`Failed to fetch ${url}, retrying...`);
    }
  }
  return null;
}

async function scrape(url) {
  const response = await makeScrapeOpsRequest(url);
  if (!response) {
    throw new Error(`Failed to fetch ${url}`);
  }

  const html = response.data;
  const $ = cheerio.load(html);
  const productItems = $("product-item");

  const products = [];
  for (const productItem of productItems) {
    const title = $(productItem).find(".product-item-meta__title").text();
    const price = $(productItem).find(".price").first().text();
    const url = $(productItem).find(".product-item-meta__title").attr("href");
    products.push({ name: title, price: price, url: url });
  }

  const nextPage = $("a[rel='next']").attr("href");
  return {
    nextUrl: nextPage ? "https://www.chocolate.co.uk" + nextPage : null,
    products: products,
  };
}

if (isMainThread) {
  const pipeline = new ProductDataPipeline("chocolate.csv", 5);
  const workers = [];

  for (const url of listOfUrls) {
    workers.push(
      new Promise((resolve, reject) => {
        const worker = new Worker(__filename, {
          workerData: { startUrl: url },
        });
        console.log("Worker created", worker.threadId, url);

        worker.on("message", (product) => {
          pipeline.addProduct(product);
        });

        worker.on("error", reject);
        worker.on("exit", (code) => {
          if (code !== 0) {
            reject(new Error(`Worker stopped with exit code ${code}`));
          } else {
            console.log("Worker exited");
            resolve();
          }
        });
      })
    );
  }

  Promise.all(workers)
    .then(() => pipeline.close())
    .then(() => console.log("Pipeline closed"));
} else {
  // Perform work
  const { startUrl } = workerData;

  const handleWork = async (workUrl) => {
    const { nextUrl, products } = await scrape(workUrl);
    for (const product of products) {
      parentPort.postMessage(product);
    }

    if (nextUrl) {
      console.log("Worker working on", nextUrl);
      await handleWork(nextUrl);
    }
  };

  handleWork(startUrl).then(() => console.log("Worker finished"));
}

Conclusion

The guide explored using proxies to bypass website restrictions by masking your real IP address and location. We discussed the three most common proxy integration methods in detail. Finally, we successfully integrated the ScrapeOps Proxy Aggregator into our existing scraper code.
