
NodeJs SuperAgent: Make Concurrent Requests

In this guide for The NodeJs Web Scraping Playbook, we will look at how to configure the Node.js SuperAgent library to make concurrent requests so that you can increase the speed of your scrapers.

The more concurrent threads you have, the more requests you can have active in parallel, and the faster you can scrape.

So in this guide we will walk you through the best way to send concurrent requests with SuperAgent.

Let's begin...



Make Concurrent Requests Using Promise.all() & Bottleneck

The first approach to making concurrent requests with SuperAgent is to use JavaScript's Promise.all() functionality and the bottleneck package to control the concurrency.

Here is an example:


const request = require("superagent");
const cheerio = require('cheerio');
const Bottleneck = require('bottleneck');

const NUM_THREADS = 5;

// Example list of URLs to scrape
const list_of_urls = [
  'http://quotes.toscrape.com/page/1/',
  'http://quotes.toscrape.com/page/2/',
  'http://quotes.toscrape.com/page/3/',
  'http://quotes.toscrape.com/page/4/',
  'http://quotes.toscrape.com/page/5/'
];

const output_data_list = [];
const limiter = new Bottleneck({ maxConcurrent: NUM_THREADS });

async function scrape_page(url) {
  try {
    const response = await request.get(url);
    const html = response.text;
    if (response.status === 200) {
      const $ = cheerio.load(html);

      const title = $('h1').text();

      // Add scraped data to "output_data_list" array
      output_data_list.push({
        'title': title
      });
    }
  } catch (error) {
    console.log('Error', error);
  }
}

(async () => {
  await Promise.all(
    list_of_urls.map(url =>
      limiter.schedule(() => scrape_page(url))
    )
  );

  console.log(output_data_list);
})();

Here we:

  1. We import the necessary libraries: superagent, cheerio, and bottleneck. The request.get() method from superagent is used for making HTTP requests, cheerio is used for parsing the HTML response, and Bottleneck is used for limiting the number of concurrent requests.

  2. We define the NUM_THREADS constant, which represents the maximum number of concurrent threads we want to allow for scraping.

  3. We create an array list_of_urls containing the URLs we want to scrape.

  4. We define an empty array output_data_list to store the scraped data. In your code, you could have this data being saved to a queue.

  5. We create an instance of Bottleneck called limiter with maxConcurrent option set to NUM_THREADS. This ensures that at most NUM_THREADS requests are executed concurrently.

  6. We define the scrape_page function, which takes a url as an argument and performs the scraping for that url. Inside the function:

  • We call the request.get() method from superagent to send an HTTP GET request to the URL and obtain the response, then store response.text in the html variable.
  • If the response's status code is 200 (indicating a successful response), we load the response into cheerio using cheerio.load(html).
  • We extract the title from the HTML using $('h1').text() and add it to the output_data_list array as an object with a 'title' property.
  7. We use Promise.all() to map each URL in list_of_urls to a call to limiter.schedule(). This schedules the scrape_page function for each URL, enforcing the concurrency limit set by limiter.
  8. limiter.schedule() returns a promise for each URL, and Promise.all() waits for all of those promises to resolve. Once they have all resolved, execution continues past the await line.
  9. We log output_data_list to the console, which contains the scraped data from all the URLs.
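If you would rather avoid the extra dependency, you can get a similar (if slightly less efficient) effect with plain JavaScript by splitting the URL list into fixed-size batches and awaiting each batch with Promise.all(). The sketch below is a minimal, dependency-free illustration of that idea; the scrape function passed in is a stand-in for a real request, not the scrape_page function above:

```javascript
// Dependency-free alternative to Bottleneck: process URLs in fixed-size
// batches. Each batch runs concurrently, and the next batch only starts
// once the previous one has fully settled, capping concurrency at the
// batch size.
const NUM_THREADS = 2;

function chunk(array, size) {
  const chunks = [];
  for (let i = 0; i < array.length; i += size) {
    chunks.push(array.slice(i, i + size));
  }
  return chunks;
}

async function scrapeInBatches(urls, scrapeFn) {
  const results = [];
  for (const batch of chunk(urls, NUM_THREADS)) {
    // All requests in this batch run concurrently
    const batchResults = await Promise.all(batch.map(scrapeFn));
    results.push(...batchResults);
  }
  return results;
}

// Example usage with a stand-in async function instead of a real request
const fakeScrape = async (url) => ({ url, title: `Title for ${url}` });
scrapeInBatches(['u1', 'u2', 'u3', 'u4', 'u5'], fakeScrape)
  .then(results => console.log(results.length)); // prints 5
```

The trade-off is that batching sits idle until the slowest request in each batch finishes, whereas Bottleneck starts a new request as soon as any slot frees up, so Bottleneck keeps the pipeline fuller under uneven response times.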

Overall, the code sets a concurrency limit using Bottleneck and utilizes superagent and cheerio for making HTTP requests and parsing the HTML response, respectively. The scraping is done concurrently for multiple URLs, and the scraped data is collected in an array.

Using this approach we can significantly increase the speed at which we can make requests with the SuperAgent library.


Adding Concurrency To ScrapeOps Scrapers

The following is an example of sending requests to the ScrapeOps Proxy API Aggregator, which lets you use as many concurrent threads as your proxy plan allows.

Just change the NUM_THREADS value to the number of concurrent threads your proxy plan allows.


const request = require("superagent");
const cheerio = require('cheerio');
const Bottleneck = require('bottleneck');
const querystring = require('querystring');

const NUM_THREADS = 5;

const SCRAPEOPS_API_KEY = 'YOUR_API_KEY'; // Replace with your actual API key

function get_scrapeops_url(url) {
  const payload = {
    api_key: SCRAPEOPS_API_KEY,
    url: url
  };

  const proxy_url = `https://proxy.scrapeops.io/v1/?${querystring.stringify(payload)}`;
  return proxy_url;
}

// Example list of URLs to scrape
const list_of_urls = [
  'http://quotes.toscrape.com/page/1/',
  'http://quotes.toscrape.com/page/2/',
  'http://quotes.toscrape.com/page/3/',
  'http://quotes.toscrape.com/page/4/',
  'http://quotes.toscrape.com/page/5/'
];

const output_data_list = [];
const limiter = new Bottleneck({ maxConcurrent: NUM_THREADS });

async function scrape_page(url) {
  try {
    const response = await request.get(get_scrapeops_url(url));
    const html = response.text;
    if (response.status === 200) {
      const $ = cheerio.load(html);

      const title = $('h1').text();

      // Add scraped data to "output_data_list" array
      output_data_list.push({
        'title': title
      });
    }
  } catch (error) {
    console.log('Error', error);
  }
}

(async () => {
  await Promise.all(
    list_of_urls.map(url =>
      limiter.schedule(() => scrape_page(url))
    )
  );

  console.log(output_data_list);
})();

You can get your own free API key with 1,000 free requests by signing up here.


More Web Scraping Tutorials

So that's how you can configure SuperAgent to send requests concurrently.

If you would like to learn more about Web Scraping, then be sure to check out The Web Scraping Playbook.

Or check out one of our more in-depth guides: