Apache HttpClient: Make Concurrent Requests
In this guide for The Java Web Scraping Playbook, we will look at how to configure the Java Apache HttpClient library to make concurrent requests so that you can increase the speed of your scrapers.
The more concurrent threads you have, the more requests you can have active in parallel, and the faster you can scrape.
So in this guide we will walk you through the best way to send concurrent requests with Apache HttpClient.
Let's begin...
Make Concurrent Requests Using Java util.concurrent Package
The first approach to making concurrent requests with Apache HttpClient is to use an ExecutorService from the java.util.concurrent package to execute our requests concurrently.
Here is an example:
import org.apache.hc.client5.http.async.methods.SimpleHttpRequest;
import org.apache.hc.client5.http.async.methods.SimpleHttpResponse;
import org.apache.hc.client5.http.async.methods.SimpleRequestBuilder;
import org.apache.hc.client5.http.impl.async.CloseableHttpAsyncClient;
import org.apache.hc.client5.http.impl.async.HttpAsyncClients;
import org.jsoup.Jsoup;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.*;

public class ConcurrentThreads {
    public static void main(String[] args) throws Exception {
        CloseableHttpAsyncClient client = HttpAsyncClients.custom().build();
        client.start();

        String[] requestUris = new String[]{
                "http://quotes.toscrape.com/page/1/",
                "http://quotes.toscrape.com/page/2/",
                "http://quotes.toscrape.com/page/3/",
                "http://quotes.toscrape.com/page/4/",
                "http://quotes.toscrape.com/page/5/",
        };

        // Use a synchronized list because multiple threads add results to it concurrently
        List<String> outputData = Collections.synchronizedList(new ArrayList<>());

        int numberOfThreads = 5; // Number of threads to use for making requests
        ExecutorService executor = Executors.newFixedThreadPool(numberOfThreads);

        List<Callable<Void>> tasks = new ArrayList<>();
        for (String requestUri : requestUris) {
            SimpleHttpRequest request = SimpleRequestBuilder.get(requestUri)
                    .build();
            Callable<Void> task = () -> {
                // Send the request and block this worker thread until the response arrives
                SimpleHttpResponse response = client.execute(request, null).get();
                String html = response.getBodyText();
                // Parse the HTML with Jsoup and extract the page title
                String title = Jsoup.parse(html).title();
                outputData.add(title);
                return null;
            };
            tasks.add(task);
        }

        executor.invokeAll(tasks); // Runs all tasks and blocks until they have all completed
        outputData.forEach(System.out::println);

        executor.shutdown();
        client.close();
    }
}
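To compile and run this example you will need the Apache HttpClient 5 async client and Jsoup on your classpath; in Maven these are the org.apache.httpcomponents.client5:httpclient5 and org.jsoup:jsoup artifacts (pick the latest stable versions).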
Here:

- We import the necessary classes from three packages: Jsoup, Apache HttpClient and java.util.concurrent. Apache HttpClient is used for making HTTP requests, Jsoup is used for parsing the HTML responses, and Executors from java.util.concurrent is used for executing requests concurrently and configuring the number of concurrent threads.
- We define the numberOfThreads variable, which represents the maximum number of concurrent threads we want to allow for scraping.
- We create an array requestUris containing the URIs we want to scrape.
- We define an empty list called tasks to store the functions used to make our requests.
- Then we loop through the requestUris array. For each requestUri, we create a task and add it to the tasks list. A task is an instance of Callable and is just a lambda encapsulating the logic for making an asynchronous request. Inside this task:
  - We create an instance of SimpleHttpRequest with SimpleRequestBuilder and assign it to the request variable. Then we send this request using the client.execute method and keep track of the resulting response.
  - We read the response body with response.getBodyText and save it to the html variable. Next we parse the html and get its title using Jsoup.parse(html).title(). We then add the title to the outputData list.
- We call the executor.invokeAll method with the tasks list as an argument to run our scraping tasks concurrently (see the sketch after this list for surfacing per-request failures).
- Finally, we print out outputData, which contains the scraped data from all the URLs.
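The example above calls invokeAll and moves on, but invokeAll also returns a list of Futures, one per task, which you can use to detect failed requests. A minimal sketch (an addition to the original example, replacing the bare executor.invokeAll(tasks) call):

List<Future<Void>> results = executor.invokeAll(tasks);
for (Future<Void> result : results) {
    try {
        result.get(); // rethrows any exception thrown inside the task
    } catch (ExecutionException e) {
        System.err.println("Request failed: " + e.getCause());
    }
}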
Overall, the code uses the java.util.concurrent package to scrape multiple web pages concurrently, and utilizes Apache HttpClient and Jsoup for making HTTP requests and parsing the HTML responses, respectively.
Using this approach we can significantly increase the speed at which we can make requests with the Apache HttpClient library.
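As a side note, CloseableHttpAsyncClient is itself asynchronous, so a lighter-weight alternative (not part of the original example, shown here only as a sketch) is to collect the Futures it returns and wait on them directly, without an ExecutorService. In that case the number of requests in flight is bounded by the client's connection pool settings rather than by a thread pool:

// Sketch: relies only on the async client's own Futures; the class name is illustrative.
import org.apache.hc.client5.http.async.methods.SimpleHttpRequest;
import org.apache.hc.client5.http.async.methods.SimpleHttpResponse;
import org.apache.hc.client5.http.async.methods.SimpleRequestBuilder;
import org.apache.hc.client5.http.impl.async.CloseableHttpAsyncClient;
import org.apache.hc.client5.http.impl.async.HttpAsyncClients;
import org.jsoup.Jsoup;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Future;

public class AsyncFuturesExample {
    public static void main(String[] args) throws Exception {
        CloseableHttpAsyncClient client = HttpAsyncClients.custom().build();
        client.start();

        String[] requestUris = new String[]{
                "http://quotes.toscrape.com/page/1/",
                "http://quotes.toscrape.com/page/2/",
        };

        // Fire off all requests without waiting; the client handles them asynchronously
        List<Future<SimpleHttpResponse>> futures = new ArrayList<>();
        for (String requestUri : requestUris) {
            SimpleHttpRequest request = SimpleRequestBuilder.get(requestUri).build();
            futures.add(client.execute(request, null));
        }

        // Wait for each response and parse the page title
        for (Future<SimpleHttpResponse> future : futures) {
            String html = future.get().getBodyText();
            System.out.println(Jsoup.parse(html).title());
        }

        client.close();
    }
}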
Adding Concurrency To ScrapeOps Scrapers
The following is an example of sending requests through the ScrapeOps Proxy API Aggregator, which lets you use all the concurrent threads your proxy plan allows.
Just set SCRAPEOPS_API_KEY to your ScrapeOps API key, and change the numberOfThreads value to the number of concurrent threads your proxy plan allows.
// Same imports as the previous code block, plus:
// import java.net.URLEncoder;
// import java.nio.charset.StandardCharsets;

public class ScrapeOpsProxyConcurrentThreads {
    public static final String SCRAPEOPS_API_KEY = "your_api_key";

    public static void main(String[] args) throws Exception {
        CloseableHttpAsyncClient client = HttpAsyncClients.custom().build();
        client.start();

        String[] requestUris = new String[]{
                "http://quotes.toscrape.com/page/1/",
                "http://quotes.toscrape.com/page/2/",
                "http://quotes.toscrape.com/page/3/",
                "http://quotes.toscrape.com/page/4/",
                "http://quotes.toscrape.com/page/5/",
        };

        // Use a synchronized list because multiple threads add results to it concurrently
        List<String> outputData = Collections.synchronizedList(new ArrayList<>());

        int numberOfThreads = 5; // Set to the number of concurrent threads your proxy plan allows
        ExecutorService executor = Executors.newFixedThreadPool(numberOfThreads);

        List<Callable<Void>> tasks = new ArrayList<>();
        for (String requestUri : requestUris) {
            // Construct the ScrapeOps proxy URL from SCRAPEOPS_API_KEY and the URL-encoded target URL
            String proxyUrl = String.format("https://proxy.scrapeops.io/v1?api_key=%s&url=%s",
                    SCRAPEOPS_API_KEY, URLEncoder.encode(requestUri, StandardCharsets.UTF_8));
            SimpleHttpRequest request = SimpleRequestBuilder.get(proxyUrl)
                    .build();
            Callable<Void> callable = () -> {
                SimpleHttpResponse response = client.execute(request, null).get();
                String html = response.getBodyText();
                String title = Jsoup.parse(html).title();
                outputData.add(title);
                return null;
            };
            tasks.add(callable);
        }

        executor.invokeAll(tasks);
        outputData.forEach(System.out::println);

        executor.shutdown();
        client.close();
    }
}
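Note that the target URL is URL-encoded before being appended as the url query parameter; this is a precaution so that any query string in the page you want to scrape is passed through to the proxy intact.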
You can get your own free API key with 1,000 free requests by signing up here.
More Web Scraping Tutorials
So that's how you can configure Apache HttpClient to send requests concurrently.
If you would like to learn more about Web Scraping, then be sure to check out The Web Scraping Playbook.
Or check out one of our more in-depth guides: