
Using Fake User-Agents With Java Apache HttpClient Library

To use fake user-agents with the Java Apache HttpClient library, call the setHeader method on your SimpleRequestBuilder instance, passing "User-Agent" as the first argument and the user-agent string as the second argument.


import org.apache.hc.client5.http.async.methods.SimpleHttpRequest;
import org.apache.hc.client5.http.async.methods.SimpleHttpResponse;
import org.apache.hc.client5.http.async.methods.SimpleRequestBuilder;
import org.apache.hc.client5.http.impl.async.CloseableHttpAsyncClient;
import org.apache.hc.client5.http.impl.async.HttpAsyncClients;

import java.util.concurrent.Future;

public class FakeUserAgents {
    public static void main(String[] args) throws Exception {
        CloseableHttpAsyncClient client = HttpAsyncClients.createDefault();
        client.start();
        SimpleHttpRequest request = SimpleRequestBuilder.get("http://httpbin.org/headers")
                .setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36")
                .build();

        Future<SimpleHttpResponse> future = client.execute(request, null);
        SimpleHttpResponse response = future.get();
        System.out.println("Response body: " + response.getBodyText());
        client.close();
    }
}

One of the most common reasons for getting blocked whilst web scraping is using bad user-agents.

Fortunately, integrating fake user-agents into Java web scrapers that use Apache HttpClient is very easy.

So in this guide, we will go through:

- What Are Fake User-Agents?
- How To Set A Fake User-Agent In Java Apache HttpClient
- How To Rotate User-Agents
- How To Manage Thousands of Fake User-Agents
- Why Use Fake Browser Headers
- ScrapeOps Fake Browser Headers API

First, let's quickly go over the very basics.

Need help scraping the web?

Then check out ScrapeOps, the complete toolkit for web scraping.


What Are Fake User-Agents?

User Agents are strings that let the website you are scraping identify the application, operating system (OSX/Windows/Linux), browser (Chrome/Firefox/Internet Explorer), etc. of the user sending a request to their website. They are sent to the server as part of the request headers.

Here is an example user-agent sent when you visit a website with a Chrome browser:


'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.82 Safari/537.36'

When scraping a website, you also need to set user-agents on every request as otherwise the website may block your requests because it knows you aren't a real user.

In the case of Apache HttpClient, requests don't include a User-Agent header by default, which clearly signals that the request isn't being made by a real browser:


'User-Agent': '',

An empty user-agent like this clearly marks your requests as suspicious, so the website can easily block you from scraping the site.

That is why we need to manage the user-agents we use with Apache HttpClient when we send requests.
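
If you want to verify this yourself, here is a minimal sketch (using the same async client setup as the rest of this guide, with a hypothetical class name) that sends a request without setting a User-Agent and prints what httpbin.org reports receiving:


import org.apache.hc.client5.http.async.methods.SimpleHttpRequest;
import org.apache.hc.client5.http.async.methods.SimpleHttpResponse;
import org.apache.hc.client5.http.async.methods.SimpleRequestBuilder;
import org.apache.hc.client5.http.impl.async.CloseableHttpAsyncClient;
import org.apache.hc.client5.http.impl.async.HttpAsyncClients;

public class DefaultUserAgentCheck {
    public static void main(String[] args) throws Exception {
        CloseableHttpAsyncClient client = HttpAsyncClients.createDefault();
        client.start();

        // no setHeader("User-Agent", ...) call, so the request is sent without one
        SimpleHttpRequest request = SimpleRequestBuilder.get("http://httpbin.org/headers").build();

        // httpbin.org/headers echoes back the headers it received, so a missing
        // User-Agent is easy to spot in the printed response body
        SimpleHttpResponse response = client.execute(request, null).get();
        System.out.println(response.getBodyText());
        client.close();
    }
}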


How To Set A Fake User-Agent In Java Apache HttpClient

Configuring Apache HttpClient to use a fake user-agent is very easy. While building your request, you simply call the setHeader method, providing "User-Agent" and your user-agent value as arguments.


import org.apache.hc.client5.http.async.methods.SimpleHttpRequest;
import org.apache.hc.client5.http.async.methods.SimpleHttpResponse;
import org.apache.hc.client5.http.async.methods.SimpleRequestBuilder;
import org.apache.hc.client5.http.impl.async.CloseableHttpAsyncClient;
import org.apache.hc.client5.http.impl.async.HttpAsyncClients;

import java.util.concurrent.Future;

public class FakeUserAgent {
    public static void main(String[] args) throws Exception {
        CloseableHttpAsyncClient client = HttpAsyncClients.createDefault();
        client.start();
        SimpleHttpRequest request = SimpleRequestBuilder.get("http://httpbin.org/headers")
                .setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36")
                .build();

        Future<SimpleHttpResponse> future = client.execute(request, null);
        SimpleHttpResponse response = future.get();
        System.out.println("Response body: " + response.getBodyText());
        client.close();
    }
}

From here Apache HttpClient will use the above user-agent when making the request. Because httpbin.org/headers reflects all request headers, including User-Agent, back in its JSON response, you can verify this with response.getBodyText().
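
If you want to check programmatically that the header was applied, you could parse the httpbin response with the org.json library (already used in the later examples in this guide) and read back the echoed User-Agent. A small sketch continuing from the response variable in the example above:


import org.json.JSONObject;

// httpbin.org/headers returns JSON shaped like {"headers": {"User-Agent": "...", ...}},
// so we can read back the header value the server actually received
JSONObject body = new JSONObject(response.getBodyText());
String echoedUserAgent = body.getJSONObject("headers").getString("User-Agent");
System.out.println("Echoed User-Agent: " + echoedUserAgent);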


How To Rotate User-Agents

Rotating through user-agents is also pretty straightforward when using the Java Apache HttpClient library. We just need a list of user-agents in our scraper and then we can simply use a random one with every request.

import org.apache.hc.client5.http.async.methods.SimpleHttpRequest;
import org.apache.hc.client5.http.async.methods.SimpleHttpResponse;
import org.apache.hc.client5.http.async.methods.SimpleRequestBuilder;
import org.apache.hc.client5.http.impl.async.CloseableHttpAsyncClient;
import org.apache.hc.client5.http.impl.async.HttpAsyncClients;

import java.util.Random;
import java.util.concurrent.Future;

public class FakeUserAgents {
    public static String[] userAgents = new String[]{
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36",
        "Mozilla/5.0 (iPhone; CPU iPhone OS 14_4_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Mobile/15E148 Safari/604.1",
        "Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1)",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/87.0.664.75",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18363"
    };

    public static String getRandomUserAgent(String[] userAgents) {
        int rnd = new Random().nextInt(userAgents.length);
        return userAgents[rnd];
    }

    public static void main(String[] args) throws Exception {
        CloseableHttpAsyncClient client = HttpAsyncClients.createDefault();
        client.start();
        SimpleHttpRequest request = SimpleRequestBuilder.get("http://httpbin.org/headers")
                .setHeader("User-Agent", getRandomUserAgent(userAgents))
                .build();

        Future<SimpleHttpResponse> future = client.execute(request, null);
        SimpleHttpResponse response = future.get();
        System.out.println("Response body: " + response.getBodyText());
        client.close();
    }
}

This works, but it has drawbacks, as we would need to build and maintain an up-to-date list of user-agents ourselves.


How To Manage Thousands of Fake User-Agents

A better approach would be to use a free user-agent API like ScrapeOps Fake User-Agent API to download an up-to-date user-agent list when your scraper starts up and then pick a random user-agent for each request.

To use the ScrapeOps Fake User-Agents API you just need to send a request to the API endpoint to retrieve a list of user-agents.


http://headers.scrapeops.io/v1/user-agents?api_key=YOUR_API_KEY

To use the ScrapeOps Fake User-Agent API, you first need an API key which you can get by signing up for a free account here.

Example response from the API:


{
  "result": [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Windows; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/603.3.8 (KHTML, like Gecko) Version/10.1.2 Safari/603.3.8",
    "Mozilla/5.0 (Windows NT 10.0; Windows; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15",
    "Mozilla/5.0 (Windows NT 10.0; Windows; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36"
  ]
}

To integrate the Fake User-Agent API, configure your scraper to retrieve a batch of the most up-to-date user-agents when it starts, and then pick a random user-agent from this list for each request.

Here is an example Java Apache HttpClient scraper integration:


import org.apache.hc.client5.http.async.methods.SimpleHttpRequest;
import org.apache.hc.client5.http.async.methods.SimpleHttpResponse;
import org.apache.hc.client5.http.async.methods.SimpleRequestBuilder;
import org.apache.hc.client5.http.impl.async.CloseableHttpAsyncClient;
import org.apache.hc.client5.http.impl.async.HttpAsyncClients;
import org.json.JSONArray;
import org.json.JSONObject;

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.Future;

public class ScrapeOpsFakeUserAgents {
    public static String SCRAPEOPS_API_KEY = "your_api_key";

    // fetch and return a list of user-agents from the ScrapeOps API
    public static List<String> getUserAgentList(CloseableHttpAsyncClient client) {
        try {
            String url = String.format("http://headers.scrapeops.io/v1/user-agents?api_key=%s", SCRAPEOPS_API_KEY);
            SimpleHttpRequest request = SimpleRequestBuilder.get(url).build();
            SimpleHttpResponse response = client.execute(request, null).get();
            String jsonString = response.getBodyText();
            JSONObject results = new JSONObject(jsonString);
            JSONArray userAgentListJson = results.getJSONArray("result");
            List<String> userAgentList = new ArrayList<String>();

            for (Object userAgent : userAgentListJson) {
                userAgentList.add(userAgent.toString());
            }
            return userAgentList;
        } catch (Exception e) {
            return new ArrayList<String>();
        }
    }

    public static String getRandomUserAgent(List<String> userAgents) {
        int rnd = new Random().nextInt(userAgents.size());
        return userAgents.get(rnd);
    }

    public static void main(String[] args) throws Exception {
        CloseableHttpAsyncClient client = HttpAsyncClients.createDefault();
        client.start();

        List<String> userAgents = getUserAgentList(client);
        SimpleHttpRequest request = SimpleRequestBuilder.get("http://httpbin.org/headers")
                .setHeader("User-Agent", getRandomUserAgent(userAgents))
                .build();

        Future<SimpleHttpResponse> future = client.execute(request, null);
        SimpleHttpResponse response = future.get();
        System.out.println("Response body: " + response.getBodyText());
        client.close();
    }
}

Here the scraper downloads a fresh user-agent list on startup and then uses a randomly chosen user-agent for the request.
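
Note that the example above builds only a single request. If your scraper fetches many pages, pick a fresh random user-agent for each request, for example by looping over your target URLs. A rough sketch reusing the client, userAgents list, and getRandomUserAgent helper from the example above (the URL list is just a placeholder):


// hypothetical list of pages to scrape
String[] urls = {"http://httpbin.org/headers", "http://httpbin.org/headers"};

for (String url : urls) {
    // pick a new random user-agent for every request
    SimpleHttpRequest request = SimpleRequestBuilder.get(url)
            .setHeader("User-Agent", getRandomUserAgent(userAgents))
            .build();
    SimpleHttpResponse response = client.execute(request, null).get();
    System.out.println("Response body: " + response.getBodyText());
}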


Why Use Fake Browser Headers

For simple websites, simply setting an up-to-date user-agent should allow you to scrape a website pretty reliably.

However, a lot of popular websites are increasingly using sophisticated anti-bot technologies to try and prevent developers from scraping data from their websites.

These anti-bot solutions not only look at your request's user-agent when analysing the request, but also at the other headers a real browser normally sends.

By using a full set of browser headers you make your requests look more like real user requests, and as a result harder to detect.

Here are example headers when using a Chrome browser on a macOS machine:

sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="99", "Google Chrome";v="99"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "macOS"
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Sec-Fetch-Dest: document
Accept-Encoding: gzip, deflate, br
Accept-Language: en-GB,en-US;q=0.9,en;q=0.8

As we can see, real browsers don't just send User-Agent strings but also a number of other headers that are used to identify and customize the request.

So to improve the reliability of our scrapers we should also include these headers when making requests.
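
For example, here is a minimal sketch of attaching a static set of Chrome-style headers (values copied from the list above) to a request with SimpleRequestBuilder, using the same builder pattern as the earlier examples; in practice you would rotate through many such header sets rather than hard-coding one:


SimpleHttpRequest request = SimpleRequestBuilder.get("http://httpbin.org/headers")
        // header values copied from the example Chrome/macOS headers above
        .setHeader("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36")
        .setHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9")
        .setHeader("Accept-Language", "en-GB,en-US;q=0.9,en;q=0.8")
        .setHeader("Upgrade-Insecure-Requests", "1")
        .setHeader("Sec-Fetch-Site", "none")
        .setHeader("Sec-Fetch-Mode", "navigate")
        .setHeader("Sec-Fetch-User", "?1")
        .setHeader("Sec-Fetch-Dest", "document")
        .build();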

You could build a list of fake browser headers yourself, or you could use the ScrapeOps Fake Browser Headers API to get an up-to-date list every time your scraper starts up.


ScrapeOps Fake Browser Headers API

The ScrapeOps Fake Browser Headers API is a free API that returns a list of optimized fake browser headers that you can use in your web scrapers to avoid blocks/bans and improve the reliability of your scrapers.

API Endpoint:


http://headers.scrapeops.io/v1/browser-headers?api_key=YOUR_API_KEY

Response:


{
  "result": [
    {
      "upgrade-insecure-requests": "1",
      "user-agent": "Mozilla/5.0 (Windows NT 10.0; Windows; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36",
      "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
      "sec-ch-ua": "\".Not/A)Brand\";v=\"99\", \"Google Chrome\";v=\"103\", \"Chromium\";v=\"103\"",
      "sec-ch-ua-mobile": "?0",
      "sec-ch-ua-platform": "\"Windows\"",
      "sec-fetch-site": "none",
      "sec-fetch-mod": "",
      "sec-fetch-user": "?1",
      "accept-encoding": "gzip, deflate, br",
      "accept-language": "bg-BG,bg;q=0.9,en-US;q=0.8,en;q=0.7"
    },
    {
      "upgrade-insecure-requests": "1",
      "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36",
      "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
      "sec-ch-ua": "\".Not/A)Brand\";v=\"99\", \"Google Chrome\";v=\"103\", \"Chromium\";v=\"103\"",
      "sec-ch-ua-mobile": "?0",
      "sec-ch-ua-platform": "\"Linux\"",
      "sec-fetch-site": "none",
      "sec-fetch-mod": "",
      "sec-fetch-user": "?1",
      "accept-encoding": "gzip, deflate, br",
      "accept-language": "fr-CH,fr;q=0.9,en-US;q=0.8,en;q=0.7"
    }
  ]
}

To use the ScrapeOps Fake Browser Headers API, you first need an API key which you can get by signing up for a free account here.

To integrate the Fake Browser Headers API, configure your scraper to retrieve a batch of the most up-to-date headers when it starts, and then pick a random header set from this list for each request.

Here is an example Java Apache HttpClient scraper integration:


import org.apache.hc.client5.http.async.methods.SimpleHttpRequest;
import org.apache.hc.client5.http.async.methods.SimpleHttpResponse;
import org.apache.hc.client5.http.async.methods.SimpleRequestBuilder;
import org.apache.hc.client5.http.impl.async.CloseableHttpAsyncClient;
import org.apache.hc.client5.http.impl.async.HttpAsyncClients;
import org.json.JSONArray;
import org.json.JSONObject;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class FakeHeaders {
    public static String SCRAPEOPS_API_KEY = "your_api_key";

    // parse the JSON response body containing header information
    public static List<Map<String, String>> parseHeaderList(String jsonString) throws Exception {
        JSONObject results = new JSONObject(jsonString);
        JSONArray headersListJson = results.getJSONArray("result");
        List<Map<String, String>> headersList = new ArrayList<Map<String, String>>();
        for (int i = 0; i < headersListJson.length(); i++) {
            Map<String, Object> headersJson = headersListJson.getJSONObject(i).toMap();
            Map<String, String> headers = new HashMap<>();
            headersJson.forEach((header, value) -> {
                headers.put(header, value.toString());
            });
            headersList.add(headers);
        }
        return headersList;
    }

    // fetch and return a list of headers from the ScrapeOps API
    public static List<Map<String, String>> getHeadersList(CloseableHttpAsyncClient client) {
        try {
            String url = String.format("http://headers.scrapeops.io/v1/browser-headers?api_key=%s", SCRAPEOPS_API_KEY);
            SimpleHttpRequest request = SimpleRequestBuilder.get(url).build();
            SimpleHttpResponse response = client.execute(request, null).get();
            String jsonString = response.getBodyText();
            return parseHeaderList(jsonString);
        } catch (Exception e) {
            e.printStackTrace();
            return new ArrayList<Map<String, String>>();
        }
    }

    public static Map<String, String> getRandomHeaders(List<Map<String, String>> headers) {
        int rnd = new Random().nextInt(headers.size());
        return headers.get(rnd);
    }

    public static void main(String[] args) throws Exception {
        CloseableHttpAsyncClient client = HttpAsyncClients.createDefault();
        client.start();

        List<Map<String, String>> headersList = getHeadersList(client);

        Map<String, String> headers = getRandomHeaders(headersList);

        SimpleRequestBuilder requestBuilder = SimpleRequestBuilder.get("http://httpbin.org/headers");

        // add the fake headers to the request
        headers.forEach((header, value) -> {
            requestBuilder.setHeader(header, value);
        });

        SimpleHttpRequest request = requestBuilder.build();

        SimpleHttpResponse response = client.execute(request, null).get();
        System.out.println("Response body: " + response.getBodyText());
        client.close();
    }
}

For more information, check out the Fake Browser Headers API documentation.


More Web Scraping Tutorials

So that's how you can set fake user-agents when scraping with Java Apache HttpClient.

If you would like to learn more about Web Scraping, then be sure to check out The Web Scraping Playbook.

Or check out one of our more in-depth guides: