
NodeJs Got: Make Concurrent Requests

In this guide for The NodeJs Web Scraping Playbook, we will look at how to configure the Node.js Got library to make concurrent requests so that you can increase the speed of your scrapers.

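The more requests you have in flight at once, the less time your scraper spends idle waiting on network responses, and the faster it can work through a list of URLs. Note that these are concurrent requests rather than true threads: Node.js runs your JavaScript on a single thread and overlaps the network waits.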

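So in this guide we will walk you through one way to send concurrent requests with Got, using Promise.all() and Bottleneck, and then show how to apply it when sending requests through the ScrapeOps Proxy.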

Let's begin...


Make Concurrent Requests Using Promise.all() & Bottleneck

One approach to making concurrent requests with Got is to use JavaScript's Promise.all() together with the bottleneck package, which controls how many requests run at the same time.

Here is an example:


import got from 'got';
// Namespace import: cheerio 1.0.0 removed the default export
import * as cheerio from 'cheerio';
import Bottleneck from 'bottleneck';

// Maximum number of requests allowed in flight at once
const NUM_THREADS = 5;

// Example list of URLs to scrape
const list_of_urls = [
  'http://quotes.toscrape.com/page/1/',
  'http://quotes.toscrape.com/page/2/',
  'http://quotes.toscrape.com/page/3/',
  'http://quotes.toscrape.com/page/4/',
  'http://quotes.toscrape.com/page/5/'
];

const output_data_list = [];
const limiter = new Bottleneck({ maxConcurrent: NUM_THREADS });

async function scrape_page(url) {
  try {
    // By default Got rejects on non-2xx responses, so those land in the catch block
    const response = await got.get(url);
    const html = response.body;
    if (response.statusCode === 200) {
      const $ = cheerio.load(html);

      const title = $('h1').text();

      // Add scraped data to "output_data_list" array
      output_data_list.push({
        'title': title
      });
    }
  } catch (error) {
    console.log('Error', error);
  }
}

(async () => {
  // Schedule every URL through the limiter and wait for all of them to finish
  await Promise.all(
    list_of_urls.map(url =>
      limiter.schedule(() => scrape_page(url))
    )
  );

  console.log(output_data_list);
})();

Here:

  1. We import the necessary libraries: got, cheerio, and Bottleneck. got makes the HTTP requests, cheerio parses the HTML response, and Bottleneck limits how many requests run at the same time.

  2. We define the NUM_THREADS constant, which sets the maximum number of requests we allow in flight at once. Despite the name, these are not operating-system threads; Node.js overlaps the network waits on a single JavaScript thread.

  3. We create an array list_of_urls containing the URLs we want to scrape.

  4. We define an empty array output_data_list to store the scraped data. In your own code, you could save this data to a queue instead.

  5. We create an instance of Bottleneck called limiter with the maxConcurrent option set to NUM_THREADS. This ensures that at most NUM_THREADS requests are executed concurrently.

  6. We define the scrape_page function, which takes a url as an argument and performs the scraping for that url. Inside the function:

  • We call got.get to send an HTTP GET request to the URL, then store response.body in the html variable.
  • If the response's status code is 200 (indicating a successful response), we load the HTML into cheerio using cheerio.load(html).
  • We extract the title from the HTML using $('h1').text() and add it to the output_data_list array as an object with a 'title' property.

  7. We map each URL in list_of_urls to a call to limiter.schedule() and pass the resulting array of promises to Promise.all(). limiter.schedule() queues the scrape_page function for each URL and enforces the concurrency limit set on limiter.

  8. limiter.schedule() returns a promise for each URL, and Promise.all() waits for all of those promises to resolve. Only then does the await complete and execution continue to the next line.

  9. We log output_data_list to the console, which contains the scraped data from all the URLs.
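
One detail worth noting about the steps above: because each task pushes into output_data_list when it finishes, the array ends up in completion order, which may differ from the order of list_of_urls. If order matters, one option is to return the data from each task and use the array that Promise.all() resolves to, which follows the input order. The sketch below illustrates the difference with timers standing in for real Got requests; fakeScrape, jobs, and the delay values are hypothetical and not part of the original example.

```javascript
// Simulated scrape: resolves after `delayMs` with a made-up record.
// (Hypothetical stand-in for a real got.get() call.)
function fakeScrape(page, delayMs) {
  return new Promise(resolve =>
    setTimeout(() => resolve({ page }), delayMs)
  );
}

// Filled as each task finishes, so it reflects completion order.
const pushed = [];

// Page 1 is deliberately the slowest so that it finishes last.
const jobs = [
  { page: 1, delayMs: 60 },
  { page: 2, delayMs: 10 },
  { page: 3, delayMs: 30 }
];

const ordered = Promise.all(
  jobs.map(async ({ page, delayMs }) => {
    const record = await fakeScrape(page, delayMs);
    pushed.push(record.page); // shared-array approach
    return record.page;       // return-value approach
  })
).then(returned => {
  console.log('push order:  ', pushed);
  console.log('return order:', returned);
  return returned;
});
```

With these delays the pushed array comes out as [2, 3, 1], while the array returned by Promise.all() is [1, 2, 3]. The same idea carries over when each task is wrapped in limiter.schedule(), since schedule() resolves with the value the wrapped function returns.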

Overall, the code sets a concurrency limit using Bottleneck and utilizes got and cheerio for making HTTP requests and parsing the HTML response, respectively. The scraping is done concurrently for multiple URLs, and the scraped data is collected in an array.

Using this approach, we can significantly increase the rate at which we make requests with the Got library.
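
To make the effect of a concurrency limit more concrete, here is a minimal sketch of a limiter written in plain JavaScript. It is not how Bottleneck is implemented, and it has none of Bottleneck's extra features; runWithLimit and fakeRequest are hypothetical names, and a timer stands in for the network call so the sketch runs without any dependencies.

```javascript
// Hypothetical helper: run async task functions with at most `limit` in flight.
async function runWithLimit(tasks, limit) {
  const results = new Array(tasks.length);
  let next = 0;

  // Each worker keeps claiming the next unstarted task until none remain.
  async function worker() {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  }

  // Start `limit` workers (or fewer if there are not enough tasks).
  const workerCount = Math.min(limit, tasks.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Track how many simulated requests are active at once.
let active = 0;
let peak = 0;

// Simulated request: a 20 ms timer stands in for a real got.get() call.
function fakeRequest(url) {
  active++;
  peak = Math.max(peak, active);
  return new Promise(resolve =>
    setTimeout(() => {
      active--;
      resolve(`fetched ${url}`);
    }, 20)
  );
}

const urls = [1, 2, 3, 4, 5, 6].map(
  n => `http://quotes.toscrape.com/page/${n}/`
);

// Six tasks, but never more than two running at the same time.
const demo = runWithLimit(urls.map(url => () => fakeRequest(url)), 2).then(
  results => {
    console.log(`${results.length} results, peak concurrency ${peak}`);
    return results;
  }
);
```

Each worker starts a new task only after its previous one has settled, so the number of active requests never exceeds the limit. In a real scraper a maintained library such as Bottleneck is usually the better choice, since this sketch does not handle rejected tasks or rate limits.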


Adding Concurrency To ScrapeOps Scrapers

The following example sends requests through the ScrapeOps Proxy API Aggregator, so that you can use all the concurrent threads your proxy plan allows.

Just change the NUM_THREADS value to the number of concurrent threads your proxy plan allows.


import got from 'got';
// Namespace import: cheerio 1.0.0 removed the default export
import * as cheerio from 'cheerio';
import Bottleneck from 'bottleneck';
import querystring from 'querystring';

// Maximum number of requests allowed in flight at once
const NUM_THREADS = 5;

const SCRAPEOPS_API_KEY = 'YOUR_API_KEY'; // Replace with your actual API key

// Build the proxy URL that wraps the target URL
function get_scrapeops_url(url) {
  const payload = {
    api_key: SCRAPEOPS_API_KEY,
    url: url
  };

  const proxy_url = `https://proxy.scrapeops.io/v1/?${querystring.stringify(payload)}`;
  return proxy_url;
}

// Example list of URLs to scrape
const list_of_urls = [
  'http://quotes.toscrape.com/page/1/',
  'http://quotes.toscrape.com/page/2/',
  'http://quotes.toscrape.com/page/3/',
  'http://quotes.toscrape.com/page/4/',
  'http://quotes.toscrape.com/page/5/'
];

const output_data_list = [];
const limiter = new Bottleneck({ maxConcurrent: NUM_THREADS });

async function scrape_page(url) {
  try {
    // Send the request through the proxy instead of directly to the site
    const response = await got.get(get_scrapeops_url(url));
    const html = response.body;
    if (response.statusCode === 200) {
      const $ = cheerio.load(html);

      const title = $('h1').text();

      // Add scraped data to "output_data_list" array
      output_data_list.push({
        'title': title
      });
    }
  } catch (error) {
    console.log('Error', error);
  }
}

(async () => {
  // Schedule every URL through the limiter and wait for all of them to finish
  await Promise.all(
    list_of_urls.map(url =>
      limiter.schedule(() => scrape_page(url))
    )
  );

  console.log(output_data_list);
})();

You can get your own free API key with 1,000 free requests by signing up here.
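
The Node.js documentation marks the querystring module as legacy and suggests URLSearchParams for new code. As an optional variation on get_scrapeops_url, the proxy URL could be built with the built-in URL class, roughly as sketched below. The function name buildScrapeOpsUrl is hypothetical; the endpoint and the api_key and url parameters are the ones used in the example above.

```javascript
// Placeholder key, as in the example above. Replace with your actual API key.
const SCRAPEOPS_API_KEY = 'YOUR_API_KEY';

// Hypothetical variant of get_scrapeops_url using the built-in URL API.
function buildScrapeOpsUrl(targetUrl) {
  const proxyUrl = new URL('https://proxy.scrapeops.io/v1/');
  // searchParams percent-encodes the values, including the target URL
  proxyUrl.searchParams.set('api_key', SCRAPEOPS_API_KEY);
  proxyUrl.searchParams.set('url', targetUrl);
  return proxyUrl.toString();
}

console.log(buildScrapeOpsUrl('http://quotes.toscrape.com/page/1/'));
// prints: https://proxy.scrapeops.io/v1/?api_key=YOUR_API_KEY&url=http%3A%2F%2Fquotes.toscrape.com%2Fpage%2F1%2F
```

URL and URLSearchParams are globals in Node.js, so this version needs no import.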


More Web Scraping Tutorials

So that's how you can configure Got to send requests concurrently.

If you would like to learn more about Web Scraping, then be sure to check out The Web Scraping Playbook.

Or check out one of our more in-depth guides: