Java OkHttp & Apache HttpClient: Make Concurrent Requests
In this guide for The Java Web Scraping Playbook, we will look at how to configure the Java OkHttp and Apache HttpClient libraries to make concurrent requests so that you can increase the speed of your scrapers.
The more concurrent threads you have, the more requests you can have active in parallel, and the faster you can scrape.
So in this guide we will walk you through the best way to send concurrent requests with OkHttp and Apache HttpClient:
- Make Concurrent Requests Using Java util.concurrent Package
- Adding Concurrency To ScrapeOps Scrapers
Let's begin...
Need help scraping the web?
Then check out ScrapeOps, the complete toolkit for web scraping.
Make Concurrent Requests Using Java util.concurrent Package
The first approach to making concurrent requests with OkHttp or Apache HttpClient is to use an ExecutorService from the java.util.concurrent package to execute our requests concurrently.
Here is an example:
- OkHttp
- Apache HttpClient
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;
import org.jsoup.Jsoup;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.*;

public class ConcurrentThreads {
    public static void main(String[] args) throws Exception {
        OkHttpClient client = new OkHttpClient();

        String[] requestUris = new String[] {
                "http://quotes.toscrape.com/page/1/",
                "http://quotes.toscrape.com/page/2/",
                "http://quotes.toscrape.com/page/3/",
                "http://quotes.toscrape.com/page/4/",
                "http://quotes.toscrape.com/page/5/"
        };

        // Use a synchronized list because multiple threads add to it concurrently
        List<String> outputData = Collections.synchronizedList(new ArrayList<>());

        int numberOfThreads = 5; // Number of threads to use for making requests
        ExecutorService executor = Executors.newFixedThreadPool(numberOfThreads);

        List<Callable<Void>> tasks = new ArrayList<>();
        for (String requestUri : requestUris) {
            Callable<Void> task = () -> {
                Request request = new Request.Builder()
                        .url(requestUri)
                        .build();
                // try-with-resources ensures the response body is closed
                try (Response response = client.newCall(request).execute()) {
                    String html = response.body().string();
                    String title = Jsoup.parse(html).title();
                    outputData.add(title);
                }
                return null;
            };
            tasks.add(task);
        }

        executor.invokeAll(tasks);
        outputData.forEach(System.out::println);
        executor.shutdown();
    }
}
Here:

- We import the necessary classes from the `jsoup`, `okhttp3` and `java.util.concurrent` packages. `okhttp3` is used for making HTTP requests, `jsoup` is used for parsing the HTML response, and `Executors` from `java.util.concurrent` is used for executing requests concurrently and configuring the number of concurrent threads.
- We define the `numberOfThreads` variable, which represents the maximum number of concurrent threads we want to allow for scraping.
- We create an array `requestUris` containing the URIs we want to scrape.
- We define an empty list called `tasks` to store the functions used to make our requests.
- Then we loop through the `requestUris` array. For each `requestUri`, we create a `task` and add it to the `tasks` list. A `task` is an instance of `Callable` and is just a lambda function that encapsulates the logic for making a request and handling the response. Inside this `task`:
  - We create an instance of `Request` using `Request.Builder` and assign it to the `request` variable. We then send this request with `client.newCall(request).execute()` and keep track of the resulting `response`.
  - We read the response body with `response.body().string()` and store it in the `html` variable. Next we parse the `html` and get its `title` using `Jsoup.parse(html).title()`. We then add the `title` to the `outputData` list.
- We call the `executor.invokeAll` method with the `tasks` list as an argument to run our scraping `tasks` concurrently.
- Finally we print out `outputData`, which contains the scraped data from all the URLs.
Overall, the code uses the java.util.concurrent package to scrape multiple web pages concurrently, and uses OkHttp and Jsoup for making HTTP requests and parsing the HTML responses, respectively.
Using this approach we can significantly increase the speed at which we can make requests with the OkHttp library.
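As a variation on the pattern above, you can avoid sharing a mutable list between threads entirely: have each `Callable` return its scraped value and collect the results from the `Future` objects that `invokeAll` returns. Below is a minimal, self-contained sketch of that variant (the class name `ConcurrentThreadsFutures` is just illustrative):

import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;
import org.jsoup.Jsoup;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class ConcurrentThreadsFutures {
    public static void main(String[] args) throws Exception {
        OkHttpClient client = new OkHttpClient();
        String[] requestUris = new String[] {
                "http://quotes.toscrape.com/page/1/",
                "http://quotes.toscrape.com/page/2/",
                "http://quotes.toscrape.com/page/3/"
        };

        ExecutorService executor = Executors.newFixedThreadPool(3);

        // Each task returns its result instead of writing to a shared list
        List<Callable<String>> tasks = new ArrayList<>();
        for (String requestUri : requestUris) {
            tasks.add(() -> {
                Request request = new Request.Builder().url(requestUri).build();
                try (Response response = client.newCall(request).execute()) {
                    return Jsoup.parse(response.body().string()).title();
                }
            });
        }

        // invokeAll blocks until every task has finished and hands back one Future per task
        List<Future<String>> futures = executor.invokeAll(tasks);
        for (Future<String> future : futures) {
            System.out.println(future.get()); // get() rethrows any exception raised inside the task
        }
        executor.shutdown();
    }
}

Because each task returns its own value, no synchronization is needed on the result collection, and any exception thrown inside a task surfaces when you call `get()` on its `Future`.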
import org.apache.hc.client5.http.async.methods.SimpleHttpRequest;
import org.apache.hc.client5.http.async.methods.SimpleHttpResponse;
import org.apache.hc.client5.http.async.methods.SimpleRequestBuilder;
import org.apache.hc.client5.http.impl.async.CloseableHttpAsyncClient;
import org.apache.hc.client5.http.impl.async.HttpAsyncClients;
import org.jsoup.Jsoup;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.*;

public class ConcurrentThreads {
    public static void main(String[] args) throws Exception {
        CloseableHttpAsyncClient client = HttpAsyncClients.custom().build();
        client.start();

        String[] requestUris = new String[]{
                "http://quotes.toscrape.com/page/1/",
                "http://quotes.toscrape.com/page/2/",
                "http://quotes.toscrape.com/page/3/",
                "http://quotes.toscrape.com/page/4/",
                "http://quotes.toscrape.com/page/5/",
        };

        // Use a synchronized list because multiple threads add to it concurrently
        List<String> outputData = Collections.synchronizedList(new ArrayList<>());

        int numberOfThreads = 5; // Number of threads to use for making requests
        ExecutorService executor = Executors.newFixedThreadPool(numberOfThreads);

        List<Callable<Void>> tasks = new ArrayList<>();
        for (String requestUri : requestUris) {
            SimpleHttpRequest request = SimpleRequestBuilder.get(requestUri)
                    .build();
            Callable<Void> task = () -> {
                SimpleHttpResponse response = client.execute(request, null).get();
                String html = response.getBodyText();
                String title = Jsoup.parse(html).title();
                outputData.add(title);
                return null;
            };
            tasks.add(task);
        }

        executor.invokeAll(tasks);
        outputData.forEach(System.out::println);
        executor.shutdown();
        client.close();
    }
}
Here:

- We import the necessary classes from the `jsoup`, Apache HttpClient and `java.util.concurrent` packages. Apache HttpClient is used for making HTTP requests, `Jsoup` is used for parsing the HTML response, and `Executors` from `java.util.concurrent` is used for executing requests concurrently and configuring the number of concurrent threads.
- We define the `numberOfThreads` variable, which represents the maximum number of concurrent threads we want to allow for scraping.
- We create an array `requestUris` containing the URIs we want to scrape.
- We define an empty list called `tasks` to store the asynchronous functions used to make our requests.
- Then we loop through the `requestUris` array. For each `requestUri`, we create a `task` and add it to the `tasks` list. A `task` is an instance of `Callable` and is just a lambda function that encapsulates the logic for making an asynchronous request. Inside this `task`:
  - We create an instance of `SimpleHttpRequest` with `SimpleRequestBuilder` and assign it to the `request` variable. We then send this request using the `client.execute` function and keep track of the resulting `response`.
  - We read the response body with `response.getBodyText` and save it to the `html` variable. Next we parse the `html` and get its `title` using `Jsoup.parse(html).title()`. We then add the `title` to the `outputData` list.
- We call the `executor.invokeAll` method with the `tasks` list as an argument to run our scraping `tasks` concurrently.
- Finally we print out `outputData`, which contains the scraped data from all the URLs.
Overall, the code uses the java.util.concurrent package to scrape multiple web pages concurrently, and uses Apache HttpClient and Jsoup for making HTTP requests and parsing the HTML responses, respectively.
Using this approach we can significantly increase the speed at which we can make requests with the Apache HttpClient library.
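Concurrency can also be capped at the HTTP client level rather than only through the thread pool. Below is a rough sketch of this idea, assuming the `PoolingAsyncClientConnectionManagerBuilder` class that ships with Apache HttpClient 5; it limits how many connections the async client will open at once so the connection pool stays aligned with `numberOfThreads` (the class name `PooledClientExample` is just illustrative):

import org.apache.hc.client5.http.impl.async.CloseableHttpAsyncClient;
import org.apache.hc.client5.http.impl.async.HttpAsyncClients;
import org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManagerBuilder;
import org.apache.hc.client5.http.nio.AsyncClientConnectionManager;

public class PooledClientExample {
    public static void main(String[] args) throws Exception {
        int numberOfThreads = 5;

        // Cap connections so the client never opens more than numberOfThreads at once
        AsyncClientConnectionManager connectionManager =
                PoolingAsyncClientConnectionManagerBuilder.create()
                        .setMaxConnTotal(numberOfThreads)     // total connections across all hosts
                        .setMaxConnPerRoute(numberOfThreads)  // connections per target host
                        .build();

        CloseableHttpAsyncClient client = HttpAsyncClients.custom()
                .setConnectionManager(connectionManager)
                .build();
        client.start();

        // ... submit tasks exactly as in the example above ...

        client.close();
    }
}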
Adding Concurrency To ScrapeOps Scrapers
The following examples send requests through the ScrapeOps Proxy API Aggregator, which lets you use as many concurrent threads as your proxy plan allows.
Just set SCRAPEOPS_API_KEY to your ScrapeOps API key, and change the numberOfThreads value to the number of concurrent threads your proxy plan allows.
- OkHttp
- Apache HttpClient
// same imports as the previous code block

public class ConcurrentThreads {
    final public static String SCRAPEOPS_API_KEY = "your_api_key";

    public static void main(String[] args) throws Exception {
        OkHttpClient client = new OkHttpClient();

        String[] requestUris = new String[] {
                "http://quotes.toscrape.com/page/1/",
                "http://quotes.toscrape.com/page/2/",
                "http://quotes.toscrape.com/page/3/",
                "http://quotes.toscrape.com/page/4/",
                "http://quotes.toscrape.com/page/5/"
        };

        // Use a synchronized list because multiple threads add to it concurrently
        List<String> outputData = Collections.synchronizedList(new ArrayList<>());

        int numberOfThreads = 5; // Number of threads to use for making requests
        ExecutorService executor = Executors.newFixedThreadPool(numberOfThreads);

        List<Callable<Void>> tasks = new ArrayList<>();
        for (String requestUri : requestUris) {
            // construct the ScrapeOps proxy URL out of SCRAPEOPS_API_KEY and requestUri
            String proxyUrl = String.format("https://proxy.scrapeops.io/v1?api_key=%s&url=%s", SCRAPEOPS_API_KEY, requestUri);
            Callable<Void> task = () -> {
                Request request = new Request.Builder()
                        .url(proxyUrl)
                        .build();
                try (Response response = client.newCall(request).execute()) {
                    String html = response.body().string();
                    String title = Jsoup.parse(html).title();
                    outputData.add(title);
                }
                return null;
            };
            tasks.add(task);
        }

        executor.invokeAll(tasks);
        outputData.forEach(System.out::println);
        executor.shutdown();
    }
}
// same imports as the previous code block

public class ScrapeOpsProxyConcurrentThreads {
    final public static String SCRAPEOPS_API_KEY = "your_api_key";

    public static void main(String[] args) throws Exception {
        CloseableHttpAsyncClient client = HttpAsyncClients.custom().build();
        client.start();

        String[] requestUris = new String[]{
                "http://quotes.toscrape.com/page/1/",
                "http://quotes.toscrape.com/page/2/",
                "http://quotes.toscrape.com/page/3/",
                "http://quotes.toscrape.com/page/4/",
                "http://quotes.toscrape.com/page/5/",
        };

        // Use a synchronized list because multiple threads add to it concurrently
        List<String> outputData = Collections.synchronizedList(new ArrayList<>());

        int numberOfThreads = 5; // Number of threads to use for making requests
        ExecutorService executor = Executors.newFixedThreadPool(numberOfThreads);

        List<Callable<Void>> tasks = new ArrayList<>();
        for (String requestUri : requestUris) {
            // construct the ScrapeOps proxy URL out of SCRAPEOPS_API_KEY and requestUri
            String proxyUrl = String.format("https://proxy.scrapeops.io/v1?api_key=%s&url=%s", SCRAPEOPS_API_KEY, requestUri);
            SimpleHttpRequest request = SimpleRequestBuilder.get(proxyUrl)
                    .build();
            Callable<Void> task = () -> {
                SimpleHttpResponse response = client.execute(request, null).get();
                String html = response.getBodyText();
                String title = Jsoup.parse(html).title();
                outputData.add(title);
                return null;
            };
            tasks.add(task);
        }

        executor.invokeAll(tasks);
        outputData.forEach(System.out::println);
        executor.shutdown();
        client.close();
    }
}
You can get your own free API key with 1,000 free requests by signing up here.
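One detail worth noting when building the proxy URL: the example target URLs above contain no query strings, so passing them straight into the url parameter works. If your target URLs contain query strings or special characters, they should be URL-encoded before being embedded in the proxy URL. Here is a minimal sketch of that idea (the `buildProxyUrl` helper and the example target URL are just illustrative):

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class ProxyUrlHelper {
    // Encode the target URL before embedding it as a query parameter
    static String buildProxyUrl(String apiKey, String targetUrl) {
        String encodedUrl = URLEncoder.encode(targetUrl, StandardCharsets.UTF_8);
        return String.format("https://proxy.scrapeops.io/v1?api_key=%s&url=%s", apiKey, encodedUrl);
    }

    public static void main(String[] args) {
        // A hypothetical target URL with a query string that would break without encoding
        String target = "http://example.com/search?q=web+scraping&page=2";
        System.out.println(buildProxyUrl("your_api_key", target));
    }
}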
More Web Scraping Tutorials
So that's how you can configure OkHttp and Apache HttpClient to send requests concurrently.
If you would like to learn more about Web Scraping, then be sure to check out The Web Scraping Playbook.
Or check out one of our more in-depth guides: