Apache HttpClient: Make Concurrent Requests

In this guide for The Java Web Scraping Playbook, we will look at how to configure the Java Apache HttpClient library to make concurrent requests so that you can increase the speed of your scrapers.

The more concurrent threads you have, the more requests you can have active in parallel, and the faster you can scrape.

So in this guide we will walk you through the best way to send concurrent requests with Apache HttpClient.

Let's begin...

Need help scraping the web?

Then check out ScrapeOps, the complete toolkit for web scraping.


Make Concurrent Requests Using Java util.concurrent Package

The first approach to making concurrent requests with Apache HttpClient is to use an ExecutorService from the java.util.concurrent package to execute our requests concurrently.

Here is an example:


import org.apache.hc.client5.http.async.methods.SimpleHttpRequest;
import org.apache.hc.client5.http.async.methods.SimpleHttpResponse;
import org.apache.hc.client5.http.async.methods.SimpleRequestBuilder;
import org.apache.hc.client5.http.impl.async.CloseableHttpAsyncClient;
import org.apache.hc.client5.http.impl.async.HttpAsyncClients;
import org.jsoup.Jsoup;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.*;

public class ConcurrentThreads {

    public static void main(String[] args) throws Exception {
        CloseableHttpAsyncClient client = HttpAsyncClients.custom().build();
        client.start();

        String[] requestUris = new String[]{
                "http://quotes.toscrape.com/page/1/",
                "http://quotes.toscrape.com/page/2/",
                "http://quotes.toscrape.com/page/3/",
                "http://quotes.toscrape.com/page/4/",
                "http://quotes.toscrape.com/page/5/",
        };
        // Synchronized list so tasks running on different threads can add results safely
        List<String> outputData = Collections.synchronizedList(new ArrayList<>());

        int numberOfThreads = 5; // Number of threads to use for making requests

        ExecutorService executor = Executors.newFixedThreadPool(numberOfThreads);
        List<Callable<Void>> tasks = new ArrayList<>();

        for (String requestUri : requestUris) {
            SimpleHttpRequest request = SimpleRequestBuilder.get(requestUri)
                    .build();
            Callable<Void> task = () -> {
                // Send the request and block this worker thread until the response arrives
                SimpleHttpResponse response = client.execute(request, null).get();
                String html = response.getBodyText();
                String title = Jsoup.parse(html).title();
                outputData.add(title);
                return null;
            };
            tasks.add(task);
        }

        // Run all tasks on the thread pool and wait for them to finish
        executor.invokeAll(tasks);

        outputData.forEach(System.out::println);

        executor.shutdown();
        client.close();
    }
}

Here:

  1. We import the necessary classes from three packages: jsoup, Apache HttpClient and java.util.concurrent. Apache HttpClient is used for making HTTP requests, Jsoup is used for parsing the HTML responses, and Executors from java.util.concurrent is used for executing requests concurrently and configuring the number of concurrent threads.

  2. We define the numberOfThreads variable, which represents the maximum number of concurrent threads we want to allow for scraping.

  3. We create an array requestUris containing the URIs we want to scrape, plus a thread-safe outputData list (wrapped with Collections.synchronizedList) that tasks running on different threads can safely add results to.

  4. We define an empty list called tasks to store the functions used to make our requests.

  5. Then we loop through the requestUris array, and for each requestUri we create a task and add it to the tasks list. A task is an instance of Callable: a lambda function encapsulating the logic for making one request. Inside this task function:

  • We create an instance of SimpleHttpRequest with SimpleRequestBuilder and assign it to the request variable. Then we send this request using the client.execute method and keep track of the resulting response.
  • We read the response body with response.getBodyText and save it to the html variable. Next we parse the html and extract its title using Jsoup.parse(html).title(). We then add the title to the outputData list.

  6. We call the executor.invokeAll method with the tasks list as an argument to run our scraping tasks concurrently, blocking until all of them have completed.

  7. Finally we print out outputData, which contains the scraped data from all the URLs.

Overall, the code uses the java.util.concurrent package to scrape multiple web pages concurrently, and utilizes Apache HttpClient and Jsoup for making HTTP requests and parsing the HTML responses, respectively.
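
Note that each Callable above simply blocks on Future.get(), so the thread pool is what provides the parallelism. Since CloseableHttpAsyncClient is asynchronous under the hood, an alternative is to drop the executor entirely and pass a FutureCallback to client.execute, waiting on a CountDownLatch until every response has arrived. A minimal sketch (same quotes.toscrape.com URLs; the CallbackThreads class name is just for illustration):


import org.apache.hc.client5.http.async.methods.SimpleHttpRequest;
import org.apache.hc.client5.http.async.methods.SimpleHttpResponse;
import org.apache.hc.client5.http.async.methods.SimpleRequestBuilder;
import org.apache.hc.client5.http.impl.async.CloseableHttpAsyncClient;
import org.apache.hc.client5.http.impl.async.HttpAsyncClients;
import org.apache.hc.core5.concurrent.FutureCallback;
import org.jsoup.Jsoup;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;

public class CallbackThreads {

    public static void main(String[] args) throws Exception {
        CloseableHttpAsyncClient client = HttpAsyncClients.custom().build();
        client.start();

        String[] requestUris = new String[]{
                "http://quotes.toscrape.com/page/1/",
                "http://quotes.toscrape.com/page/2/",
                "http://quotes.toscrape.com/page/3/",
        };
        List<String> outputData = Collections.synchronizedList(new ArrayList<>());
        // One count per request; await() returns once every callback has fired
        CountDownLatch latch = new CountDownLatch(requestUris.length);

        for (String requestUri : requestUris) {
            SimpleHttpRequest request = SimpleRequestBuilder.get(requestUri).build();
            client.execute(request, new FutureCallback<SimpleHttpResponse>() {
                @Override
                public void completed(SimpleHttpResponse response) {
                    outputData.add(Jsoup.parse(response.getBodyText()).title());
                    latch.countDown();
                }

                @Override
                public void failed(Exception ex) {
                    latch.countDown();
                }

                @Override
                public void cancelled() {
                    latch.countDown();
                }
            });
        }

        latch.await(); // block the main thread until all requests have completed
        outputData.forEach(System.out::println);
        client.close();
    }
}

With this approach the number of in-flight requests is bounded by the async client's internal connection pool rather than by an explicit thread count, so the fixed-size thread pool shown earlier remains the easier way to cap concurrency at exactly the limit you want.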

Using this approach we can significantly increase the speed at which we can make requests with the Apache HttpClient library.
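
To measure the speed-up on your own machine, you can wrap the invokeAll call in a simple timer. A minimal sketch, meant as a drop-in for the executor.invokeAll(tasks) line in the example above:


long start = System.nanoTime();
executor.invokeAll(tasks); // blocks until every task has completed
long elapsedMs = (System.nanoTime() - start) / 1_000_000;
System.out.println("Scraped " + tasks.size() + " pages in " + elapsedMs + " ms");

Re-running with numberOfThreads = 1 gives you a sequential baseline to compare against.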


Adding Concurrency To ScrapeOps Scrapers

The following is an example of sending requests to the ScrapeOps Proxy API Aggregator, which lets you use all of the concurrent threads your proxy plan allows.

Just set SCRAPEOPS_API_KEY to your ScrapeOps API key, and change the numberOfThreads value to the number of concurrent threads your proxy plan allows.


// same imports as the previous code block, plus:
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class ScrapeOpsProxyConcurrentThreads {

    final public static String SCRAPEOPS_API_KEY = "your_api_key";

    public static void main(String[] args) throws Exception {
        CloseableHttpAsyncClient client = HttpAsyncClients.custom().build();
        client.start();

        String[] requestUris = new String[]{
                "http://quotes.toscrape.com/page/1/",
                "http://quotes.toscrape.com/page/2/",
                "http://quotes.toscrape.com/page/3/",
                "http://quotes.toscrape.com/page/4/",
                "http://quotes.toscrape.com/page/5/",
        };
        // Synchronized list so tasks running on different threads can add results safely
        List<String> outputData = Collections.synchronizedList(new ArrayList<>());

        int numberOfThreads = 5; // Number of threads to use for making requests

        ExecutorService executor = Executors.newFixedThreadPool(numberOfThreads);
        List<Callable<Void>> tasks = new ArrayList<>();

        for (String requestUri : requestUris) {
            // Construct the ScrapeOps proxy URL out of SCRAPEOPS_API_KEY and requestUri,
            // URL-encoding the target URL so it survives as a single query parameter
            String proxyUrl = String.format("https://proxy.scrapeops.io/v1?api_key=%s&url=%s",
                    SCRAPEOPS_API_KEY, URLEncoder.encode(requestUri, StandardCharsets.UTF_8));

            SimpleHttpRequest request = SimpleRequestBuilder.get(proxyUrl)
                    .build();
            Callable<Void> callable = () -> {
                SimpleHttpResponse response = client.execute(request, null).get();
                String html = response.getBodyText();
                String title = Jsoup.parse(html).title();
                outputData.add(title);
                return null;
            };
            tasks.add(callable);
        }

        executor.invokeAll(tasks);
        outputData.forEach(System.out::println);

        executor.shutdown();
        client.close();
    }
}
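
Note that the target URL is URL-encoded (via URLEncoder.encode) before being embedded in the proxy URL's query string. Without this, any query parameters on the page being scraped would be interpreted as parameters of the proxy request itself, and the wrong URL would be fetched.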

You can get your own free API key with 1,000 free requests by signing up here.


More Web Scraping Tutorials

So that's how you can configure Apache HttpClient to send requests concurrently.

If you would like to learn more about Web Scraping, then be sure to check out The Web Scraping Playbook.

Or check out one of our more in-depth guides: