NodeJs SuperAgent: Make Concurrent Requests
In this guide for The NodeJs Web Scraping Playbook, we will look at how to configure the Node.js SuperAgent library to make concurrent requests so that you can increase the speed of your scrapers.
The more requests you can keep active in parallel, the faster you can scrape. Node.js runs these as concurrent asynchronous requests rather than operating-system threads, so "threads" in this guide simply means requests in flight at the same time.
So in this guide we will walk you through how to send concurrent requests with SuperAgent.
Let's begin...
Make Concurrent Requests Using Promise.all() & Bottleneck
One approach to making concurrent requests with SuperAgent is to use JavaScript's Promise.all() functionality together with the bottleneck package, which controls the concurrency.
Here is an example:
const request = require('superagent');
const cheerio = require('cheerio');
const Bottleneck = require('bottleneck');

const NUM_THREADS = 5;

// Example list of URLs to scrape
const list_of_urls = [
  'http://quotes.toscrape.com/page/1/',
  'http://quotes.toscrape.com/page/2/',
  'http://quotes.toscrape.com/page/3/',
  'http://quotes.toscrape.com/page/4/',
  'http://quotes.toscrape.com/page/5/'
];

const output_data_list = [];

const limiter = new Bottleneck({ maxConcurrent: NUM_THREADS });

async function scrape_page(url) {
  try {
    const response = await request.get(url);
    const html = response.text;
    if (response.status === 200) {
      const $ = cheerio.load(html);
      const title = $('h1').text();
      // Add scraped data to "output_data_list" array
      output_data_list.push({
        'title': title
      });
    }
  } catch (error) {
    console.log('Error', error);
  }
}

(async () => {
  await Promise.all(
    list_of_urls.map(url =>
      limiter.schedule(() => scrape_page(url))
    )
  );
  console.log(output_data_list);
})();
Here is what the code does:
- We import the necessary libraries: superagent, cheerio, and Bottleneck. The request.get method from superagent is used for making HTTP requests, cheerio is used for parsing the HTML response, and Bottleneck is used for limiting the number of concurrent threads.
- We define the NUM_THREADS constant, which represents the maximum number of concurrent threads we want to allow for scraping.
- We create an array, list_of_urls, containing the URLs we want to scrape.
- We define an empty array, output_data_list, to store the scraped data. In your own code, you could save this data to a queue instead.
- We create an instance of Bottleneck called limiter with the maxConcurrent option set to NUM_THREADS. This ensures that at most NUM_THREADS requests are executed concurrently.
- We define the scrape_page function, which takes a url as an argument and performs the scraping for that URL. Inside the function:
  - We call the request.get method from superagent to send an HTTP GET request to the URL and obtain the response. Then we store response.text in the html variable.
  - If the response's status code is 200 (indicating a successful response), we load the response into cheerio using cheerio.load(html).
  - We extract the title from the HTML using $('h1').text() and add it to the output_data_list array as an object with a 'title' property.
- We use Promise.all() to map each URL in list_of_urls to a call to limiter.schedule(). This schedules the scrape_page function for each URL, enforcing the concurrency limit set by limiter.
- limiter.schedule() returns a promise for each URL, and Promise.all() waits for all of those promises to resolve. Once they have all resolved, the await completes and execution continues to the next line.
- We log output_data_list to the console, which contains the scraped data from all the URLs.
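To make the role of the limiter more concrete, here is a minimal sketch of what a concurrency cap does. It is a hand-rolled illustration and not how Bottleneck is implemented: `createLimiter`, `fakeFetch`, and `runDemo` are hypothetical names, and `fakeFetch` uses a timer as a stand-in for a real HTTP call so the example runs without network access.

```javascript
// Hypothetical stand-in for a real HTTP call: resolves after a short delay.
function fakeFetch(url, delayMs = 20) {
  return new Promise((resolve) => setTimeout(() => resolve(`body of ${url}`), delayMs));
}

// Minimal limiter: runs at most `maxConcurrent` tasks at a time.
// Illustration only; this is not Bottleneck's implementation.
function createLimiter(maxConcurrent) {
  let active = 0;
  const waiting = [];

  const next = () => {
    if (active >= maxConcurrent || waiting.length === 0) return;
    active += 1;
    const { task, resolve, reject } = waiting.shift();
    Promise.resolve()
      .then(task)
      .then(resolve, reject)
      .finally(() => {
        // A slot has freed up, so start the next queued task (if any).
        active -= 1;
        next();
      });
  };

  return {
    schedule(task) {
      return new Promise((resolve, reject) => {
        waiting.push({ task, resolve, reject });
        next();
      });
    },
  };
}

async function runDemo(urls, maxConcurrent) {
  const limiter = createLimiter(maxConcurrent);
  let inFlight = 0;
  let maxInFlight = 0;

  // Promise.all resolves with results in the same order as the input array.
  const results = await Promise.all(
    urls.map((url) =>
      limiter.schedule(async () => {
        inFlight += 1;
        maxInFlight = Math.max(maxInFlight, inFlight);
        const body = await fakeFetch(url);
        inFlight -= 1;
        return body;
      })
    )
  );

  return { results, maxInFlight };
}

const demoUrls = ['page/1', 'page/2', 'page/3', 'page/4', 'page/5'];
runDemo(demoUrls, 2).then(({ results, maxInFlight }) => {
  console.log(results.length, 'results, peak concurrency:', maxInFlight);
});
```

Note that this sketch returns each result from the scheduled task and lets Promise.all() collect them, which is one alternative to pushing into a shared array. In a real scraper, a maintained package such as bottleneck is likely the better choice than a hand-rolled limiter.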
Overall, the code sets a concurrency limit using Bottleneck and utilizes superagent and cheerio for making HTTP requests and parsing the HTML response, respectively. The scraping is done concurrently for multiple URLs, and the scraped data is collected in an array.
Using this approach, we can significantly increase the speed at which we make requests with the SuperAgent library.
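To get a rough feel for the speed-up without touching the network, the sketch below compares running five simulated requests one after another with starting them all at once. `fakeRequest`, `timeIt`, and `compare` are hypothetical helpers, the 50 ms delay is an arbitrary assumption, and the timings will vary from run to run; real gains depend on the target site and your proxy plan.

```javascript
// Hypothetical stand-in for a network request that takes `delayMs` to complete.
const fakeRequest = (url, delayMs = 50) =>
  new Promise((resolve) => setTimeout(() => resolve(url), delayMs));

// Runs an async function and returns the elapsed time in milliseconds.
async function timeIt(fn) {
  const start = process.hrtime.bigint();
  await fn();
  return Number(process.hrtime.bigint() - start) / 1e6;
}

async function compare(urls) {
  // One request at a time: total time is roughly the sum of all delays.
  const sequentialMs = await timeIt(async () => {
    for (const url of urls) {
      await fakeRequest(url);
    }
  });

  // All requests started together: total time is roughly the slowest request.
  const concurrentMs = await timeIt(() =>
    Promise.all(urls.map((url) => fakeRequest(url)))
  );

  return { sequentialMs, concurrentMs };
}

const sampleUrls = ['page/1', 'page/2', 'page/3', 'page/4', 'page/5'];
compare(sampleUrls).then(({ sequentialMs, concurrentMs }) => {
  console.log(
    `sequential: ${sequentialMs.toFixed(0)} ms, concurrent: ${concurrentMs.toFixed(0)} ms`
  );
});
```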
Adding Concurrency To ScrapeOps Scrapers
The following example sends requests through the ScrapeOps Proxy API Aggregator, which lets you use all the concurrent threads your proxy plan allows.
Just change the NUM_THREADS value to the number of concurrent threads your proxy plan allows.
const request = require('superagent');
const cheerio = require('cheerio');
const Bottleneck = require('bottleneck');
const querystring = require('querystring');

const NUM_THREADS = 5;
const SCRAPEOPS_API_KEY = 'YOUR_API_KEY'; // Replace with your actual API key

function get_scrapeops_url(url) {
  const payload = {
    api_key: SCRAPEOPS_API_KEY,
    url: url
  };
  const proxy_url = `https://proxy.scrapeops.io/v1/?${querystring.stringify(payload)}`;
  return proxy_url;
}

// Example list of URLs to scrape
const list_of_urls = [
  'http://quotes.toscrape.com/page/1/',
  'http://quotes.toscrape.com/page/2/',
  'http://quotes.toscrape.com/page/3/',
  'http://quotes.toscrape.com/page/4/',
  'http://quotes.toscrape.com/page/5/'
];

const output_data_list = [];

const limiter = new Bottleneck({ maxConcurrent: NUM_THREADS });

async function scrape_page(url) {
  try {
    const response = await request.get(get_scrapeops_url(url));
    const html = response.text;
    if (response.status === 200) {
      const $ = cheerio.load(html);
      const title = $('h1').text();
      // Add scraped data to "output_data_list" array
      output_data_list.push({
        'title': title
      });
    }
  } catch (error) {
    console.log('Error', error);
  }
}

(async () => {
  await Promise.all(
    list_of_urls.map(url =>
      limiter.schedule(() => scrape_page(url))
    )
  );
  console.log(output_data_list);
})();
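The URL-building step in the example above can be checked on its own, without sending any requests. The following is a small sketch that swaps the querystring module for the built-in URLSearchParams class (a substitution made here for illustration, not part of the original example) and confirms that the target URL survives encoding as a single query parameter. The API key is the same placeholder as above.

```javascript
// Placeholder key, as in the example above; replace with your actual API key.
const SCRAPEOPS_API_KEY = 'YOUR_API_KEY';

function get_scrapeops_url(url) {
  // URLSearchParams percent-encodes the target URL so it is passed as one query value.
  const params = new URLSearchParams({ api_key: SCRAPEOPS_API_KEY, url: url });
  return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}

const proxy_url = get_scrapeops_url('http://quotes.toscrape.com/page/1/');
console.log(proxy_url);
// https://proxy.scrapeops.io/v1/?api_key=YOUR_API_KEY&url=http%3A%2F%2Fquotes.toscrape.com%2Fpage%2F1%2F

// Round-trip check: parsing the proxy URL gives back the original target URL.
const decoded_url = new URL(proxy_url).searchParams.get('url');
console.log(decoded_url);
// http://quotes.toscrape.com/page/1/
```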
You can get your own free API key with 1,000 free requests by signing up here.
More Web Scraping Tutorials
So that's how you can configure SuperAgent to send requests concurrently.
If you would like to learn more about Web Scraping, then be sure to check out The Web Scraping Playbook.