Puppeteer Guide: Using Fake User Agents
User agents play a pivotal role in shaping the interaction between browsers and websites, and changing them allows developers to emulate different browsers and devices seamlessly.
This guide will explore various methods to change the user-agent with Puppeteer.
- TLDR: How To Use Fake User-Agents In Puppeteer
- What Are Fake User-Agents?
- Use Random User-Agent for Each Session With random-useragent Package
- Use Puppeteer puppeteer-extra-plugin-anonymize-ua Package
- Use Puppeteer Stealth Plugin
- Obtaining User-Agent Strings
- Troubleshooting and Best Practices
- Conclusion
- More Puppeteer Web Scraping Guides
TLDR - How To Use Fake User-Agents In Puppeteer
There are several ways to employ a fake user-agent to evade detection of Puppeteer bots.
The easiest way to apply a custom fake user-agent is to use the page.setUserAgent() method. This method accepts a UA string as an argument and should be invoked before navigating to a page.
This ensures that the browser requests the new page with the specified user-agent.
Firstly, you'll need a custom user-agent string. Depending on your specific needs, you can obtain it from various resources that we'll discuss later. For now, let's use an example user-agent string:
const puppeteer = require('puppeteer');

puppeteer.launch({ headless: "new" })
  .then(async browser => {
    const page = await browser.newPage();
    const customUserAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.82 Safari/537.36';
    // Set the user-agent before navigating so the first request carries it
    await page.setUserAgent(customUserAgent);
    // Your code here
    await browser.close();
  });
This script will effectively set our customUserAgent on the current page. Keep in mind that the setUserAgent() method takes a UA string as an argument, not a JavaScript object.
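To make the ordering concrete, here is a minimal sketch that wraps both steps in one helper. The helper name openWithUserAgent and the URL in the usage note are illustrative assumptions, not part of Puppeteer; the helper relies only on browser.newPage(), page.setUserAgent(), and page.goto().

```javascript
// Hypothetical helper: opens a new page, applies the user-agent, then navigates.
// `browser` is the object that puppeteer.launch() resolves to.
async function openWithUserAgent(browser, url, userAgent) {
  const page = await browser.newPage();
  // Set the user-agent first so the very first request already carries it.
  await page.setUserAgent(userAgent);
  await page.goto(url);
  return page;
}
```

Inside the script above, it could be called as `const page = await openWithUserAgent(browser, 'https://example.com', customUserAgent);`.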
What Are Fake User-Agents?
User agents are strings that enable websites to identify the application, operating system (OSX/Windows/Linux), browser (Chrome/Firefox/Internet Explorer), etc., of the user making a request to their site. They are transmitted to the server as part of the request headers.
Here's an example of a User-agent sent when visiting a website with a Chrome browser:
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.82 Safari/537.36'
While scraping a website, it's essential to set user agents on every request; otherwise, the website may block requests, recognizing them as abnormal user activity.
In the context of Puppeteer, when a request is sent, the default settings distinctly reveal that the request is originating from headless Chromium in the user-agent string:
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/119.0.0.0 Safari/537.36
This user-agent explicitly signals that the requests come from a headless browser, which is typical of automation scripts such as Puppeteer, making them susceptible to website blocking.
To overcome this issue, it becomes crucial to manage the user-agents used with Puppeteer for requests. By mimicking the user agents of real browsers, you get one step closer to bypassing bot detection and successfully scraping the site.
User-Agent String Components
User-agent strings sent by web browsers typically adhere to the format outlined below:
User-Agent: Mozilla/5.0 (<system-information>) <platform> (<platform-details>) <extensions>
Parsing a user-agent string reveals distinct components, each providing specific details about the applications used by the requesting user. Let's dissect the different components of the following user-agent:
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/119.0.0.0 Safari/537.36
- Mozilla/5.0: This part represents the product token and version. In this case, it is a historical token that indicates compatibility with Mozilla, and the number (5.0) is a reference to the version.
- (X11; Linux x86_64): This is the comment within parentheses. It provides additional information about the user's operating system and environment. In this example, it specifies that the browser is running on the X Window System on a 64-bit Linux system.
- AppleWebKit/537.36 (KHTML, like Gecko): This part identifies the browser engine. The token names WebKit, the engine used by Safari; Chrome's Blink engine was forked from WebKit and keeps the token for compatibility. The "KHTML, like Gecko" is historical and indicates compatibility with KHTML (used by Konqueror) and Gecko (used by Firefox).
- HeadlessChrome/119.0.0.0: This part specifies the browser and its version. In this example, it indicates that the browser is headless Chromium, and the version is 119.0.0.0.
- Safari/537.36: This part further mentions compatibility, indicating that the browser behaves like Safari (Chrome's engine descends from WebKit, the engine Safari uses). The version number is also provided.
You can see your user-agent, divided into components, by visiting UserAgentString.com.
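The same breakdown can be approximated in code. The sketch below is a simplification that assumes the common "Mozilla/5.0 (<system-information>) <platform> (<platform-details>) <extensions>" layout; the function name parseUserAgent and the shape of the returned object are illustrative choices, and unusual user agents will not match.

```javascript
// Splits a browser-style user-agent string into the components described above.
// Returns null when the string does not follow the assumed layout.
function parseUserAgent(ua) {
  const match = ua.match(/^(\S+) \(([^)]*)\) (\S+)(?: \(([^)]*)\))?\s*(.*)$/);
  if (!match) return null;
  const [, product, system, platform, platformDetails, extensions] = match;
  return {
    product,                                               // e.g. Mozilla/5.0
    system: system.split(';').map((part) => part.trim()),  // e.g. X11, Linux x86_64
    platform,                                              // e.g. AppleWebKit/537.36
    platformDetails: platformDetails || '',                // e.g. KHTML, like Gecko
    extensions: extensions ? extensions.split(' ') : [],   // remaining product tokens
  };
}

const parsed = parseUserAgent(
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/119.0.0.0 Safari/537.36'
);
console.log(parsed.extensions);
// [ 'HeadlessChrome/119.0.0.0', 'Safari/537.36' ]
```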
User-Agents & Anti-Bots
User agents serve as a means for servers to identify users and deliver tailored content. For instance, web developers can discern requests from mobile devices by examining the user-agent string, as demonstrated below:
function isMobile() {
  const regex = /Mobi|Android|webOS|iPhone|iPad|iPod|BlackBerry|IEMobile|Opera Mini/i;
  return regex.test(navigator.userAgent);
}

if (isMobile()) {
  console.log("Mobile device detected");
} else {
  console.log("Desktop device detected");
}
The navigator.userAgent getter returns the user-agent string. In headless mode, however, Puppeteer's default user-agent looks like this:
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/119.0.0.0 Safari/537.36
The presence of the substring "HeadlessChrome" in this user-agent exposes the identity of the Puppeteer bot.
Therefore, it becomes crucial to dynamically set and rotate user agents for testing and web scraping purposes.
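To illustrate how little effort this check takes on the server side, here is a rough sketch of the kind of test an anti-bot layer might run. The function name is hypothetical, and it assumes a headers object with lower-cased keys, as Node's http module provides; real detection systems look at far more than this.

```javascript
// Hypothetical server-side check: flags a request when its User-Agent header
// is missing or contains the "HeadlessChrome" token.
function looksLikeHeadlessBrowser(headers) {
  const ua = headers['user-agent'] || '';
  return ua === '' || /HeadlessChrome/i.test(ua);
}
```

A request sent with Puppeteer's default headless user-agent would be flagged, while the same request with the "Headless" marker removed would pass this particular test.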
How Puppeteer Manages User-Agents
Puppeteer defaults to running in headless mode, but it can be configured to operate in full headful mode. In this section, we'll explore the default user agents used by Puppeteer in both headless and headful modes, examining the differences.
Let's create a script to retrieve the user-agent string in headless mode using navigator.userAgent:
const puppeteer = require('puppeteer');

puppeteer.launch({ headless: "new" })
  .then(async browser => {
    const page = await browser.newPage();
    const ua = await page.evaluate(() => {
      return navigator.userAgent;
    });
    console.log(ua);
    await browser.close();
  });
// Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/119.0.0.0 Safari/537.36
The presence of the substring HeadlessChrome/119.0.0.0 indicates that it's a headless browser. To avoid potential blocks by servers, we may want to replace this with a custom user-agent.
Now, let's observe the user-agent Puppeteer uses in headful mode. To achieve this, we'll simply set the headless flag to false in the puppeteer.launch() method:
const puppeteer = require('puppeteer');

puppeteer.launch({ headless: false })
  .then(async browser => {
    const page = await browser.newPage();
    const ua = await page.evaluate(() => {
      return navigator.userAgent;
    });
    console.log(ua);
    await browser.close();
  });
// Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36
In headful mode, the user-agent is less likely to face blocking as it mimics a real browser rather than a headless one. However, using Puppeteer in headful mode is less resource-efficient compared to scripts that run in headless mode.
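Since the two strings differ only in the "Headless" marker, one possible middle ground is to keep running headless but send the headful-looking string. The sketch below assumes Puppeteer's browser.userAgent() method, which resolves to the browser's original user-agent; the helper name getHeadfulUserAgent is made up for illustration.

```javascript
// Reads the browser's own user-agent and removes the "Headless" marker, so the
// Chrome version still matches the binary that Puppeteer is driving.
async function getHeadfulUserAgent(browser) {
  const ua = await browser.userAgent();
  return ua.replace('HeadlessChrome/', 'Chrome/');
}
```

It could then be applied with `await page.setUserAgent(await getHeadfulUserAgent(browser));` before navigating.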
How To Use Fake User-Agents In Puppeteer
In Puppeteer, you can use various methods and techniques to manipulate user agents. Some of these methods are outlined below:
- setUserAgent() Method:
  - The easiest way to apply a custom fake user-agent is to use the page.setUserAgent() method.
  - This method accepts a UA string as an argument and should be invoked before navigating to a page.
  - This ensures that the browser requests the new page with the specified user-agent.
- Random-UserAgent NPM Package:
  - random-useragent is a Node.js library that pulls an extensive collection of fake user agents from a large XML file, provided by the author of the User-agent Switcher extension.
  - You can get a random user-agent or select from an array of user agents by passing an optional filter function to the getRandom(), getRandomData(), getAll(), and getAllData() methods.
- Anonymize-UA Plugin:
  - This plugin operates in conjunction with puppeteer-extra, which is a wrapper around puppeteer.
  - Puppeteer-Extra-Anonymize-UA anonymizes the user-agent on all pages and provides support for dynamic replacement, ensuring that the Chrome version remains intact and up-to-date.
- Stealth Plugin:
  - Puppeteer-Extra-Stealth-Plugin incorporates multiple evasion modules, each intercepting Puppeteer sessions with Chromium and modifying the state to minimize fingerprint differences between bots and real browsers.
  - One of these evasion modules, user-agent-override, addresses Puppeteer's default UserAgent information, comprising UA string, Accept-Language, Platform, and UA hints.
  - This module establishes default languages as "en-US, en", and if the operating system is "Linux", it masks the settings to resemble "Windows".
Since we have covered the setUserAgent() method at the beginning of the article, let's jump into the next option.
Use Random User-Agent for Each Session With random-useragent Package
Recall our earlier discussion about the node package random-useragent. Now, let's explore how it operates in practice. Begin by installing this package using the following npm command:
npm install random-useragent
The package offers four methods, each of which accepts an optional filter callback:
- getRandom():
To obtain a random user-agent, simply execute this method. Here's an example demonstrating how to change the user-agent from the default to one generated with the getRandom() method:
const puppeteer = require('puppeteer');
const randomUserAgent = require('random-useragent');

puppeteer.launch({ headless: "new" })
  .then(async browser => {
    const page = await browser.newPage();
    await page.setUserAgent(randomUserAgent.getRandom());
    const ua = await page.evaluate(() => {
      return navigator.userAgent;
    });
    console.log(ua);
    await browser.close();
  });
// Mozilla/4.0 (compatible; MSIE 6.0; Windows CE; IEMobile 7.11) Sprint:PPC6800
- getRandomData():
This method also retrieves the random user-agent string but in a parsed JSON format. Here's an example:
console.log(randomUserAgent.getRandomData());
// {
// folder: '/Mobile Devices/Devices/HTC',
// description: 'Sensation - Android 4.0.3 - Mobile Safari 534.30',
// userAgent: 'Mozilla/5.0 (Linux; U; Android 4.0.3; de-ch; HTC Sensation Build/IML74K) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30',
// appCodename: '',
// ... More
// }
- getAll():
This method retrieves an array of user agents instead of a single random user-agent:
console.log(randomUserAgent.getAll().length);
// 822
- getAllData():
This method retrieves an array of parsed user-agent string data in JSON format:
console.log(randomUserAgent.getAllData().length);
// 822
// 822
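The optional filter callback mentioned above is what keeps results such as the old IEMobile string out of your sessions. Below is a hedged sketch of such a filter. The helper name makeChromeFilter is hypothetical, and the field names browserName and browserVersion are assumed to be part of the parsed data that getRandomData() returns (the truncated sample above does not show them), so check them against your installed version.

```javascript
// Builds a filter callback that keeps only Chrome entries whose version is at
// least `minVersion`. Field names are assumptions about the parsed data shape.
function makeChromeFilter(minVersion) {
  return (ua) =>
    ua.browserName === 'Chrome' && parseFloat(ua.browserVersion) >= minVersion;
}
```

With the package loaded as in the earlier example, the filter would be passed as `randomUserAgent.getRandom(makeChromeFilter(60))`. Because a strict filter may match nothing, it is safer to keep a fallback user-agent for the case where no value comes back.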
Use Puppeteer puppeteer-extra-plugin-anonymize-ua Package
The puppeteer-extra-plugin-anonymize-ua plugin is specifically crafted to anonymize user agents, making it particularly suitable for scenarios where privacy and anonymity are paramount.
The main objective of anonymize-ua is to rewrite the user-agent string that the browser already sends, on every page, rather than to supply a pool of unrelated user agents. Because the replacement is dynamic, the Chrome version stays intact and up-to-date while the parts that identify a headless browser are masked, so the string discloses less about the client.
To utilize this package, puppeteer-extra is required. You can install puppeteer-extra along with puppeteer-extra-plugin-anonymize-ua using the following command:
npm install puppeteer-extra puppeteer-extra-plugin-anonymize-ua
Then you can proceed to anonymize your user-agent so that the server might not be able to distinguish bot traffic from real traffic:
const puppeteer = require('puppeteer-extra');
const AnonymizeUA = require('puppeteer-extra-plugin-anonymize-ua');

puppeteer.use(AnonymizeUA());

puppeteer.launch({ headless: "new" })
  .then(async browser => {
    const page = await browser.newPage();
    // Your code here
    await browser.close();
  });
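If the default replacement is not what you need, a custom rewrite can be expressed as a plain function from the current user-agent string to the new one. The sketch below is only an illustration of that idea: the function name is made up, and passing it to the plugin through a customFn option is an assumption based on the plugin's README that should be verified against the version you install.

```javascript
// Hypothetical replacer: receives the current user-agent string and returns the
// one to send. It drops the "Headless" marker and swaps the first parenthesised
// group (the system information) for a Windows 10 token.
function anonymizeUserAgent(ua) {
  return ua
    .replace('HeadlessChrome/', 'Chrome/')
    .replace(/\(([^)]+)\)/, '(Windows NT 10.0; Win64; x64)');
}
```

Under that assumption, it would be registered as `puppeteer.use(AnonymizeUA({ customFn: anonymizeUserAgent }));` in place of the plain `AnonymizeUA()` call above.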
Use Puppeteer Stealth Plugin
If you are using a fake user-agent, then you are likely doing it to avoid bot detection. However, setting up a fake user-agent doesn't guarantee that your bot will not be detected, because it is not the only thing that bot detectors investigate.
There are other headers and browser fingerprints that might give away that Puppeteer is driving the browser.
Fortunately, there's an additional plugin for puppeteer-extra that, through its various evasion modules, aims to minimize the fingerprint differences between Puppeteer and a real browser.
The puppeteer-extra-plugin-stealth package includes an evasion module called user-agent-override, which not only removes the headless marker from the user-agent string but also aligns it with related values such as Accept-Language, the platform, and UA hints.
With this module active, the user-agent needs no manual adjustment.
Start by installing it using the following command:
npm install puppeteer-extra-plugin-stealth
Using the stealth plugin in your script is a straightforward process. Simply require it and integrate it using the use() method, like this:
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

puppeteer.use(StealthPlugin());

puppeteer.launch({ headless: "new" })
  .then(async browser => {
    const page = await browser.newPage();
    // Your code here
    await browser.close();
  });
It's important to note that the regular page.setUserAgent() method should not be used alongside this plugin, as it will reset the language and platform values set by the plugin.
By default, the stealth plugin enables all of its other evasion modules alongside user-agent-override to enhance the invisibility of your bot.
If you wish to learn more about all the evasion modules integrated by this plugin, see the plugin's documentation.
Obtaining User-Agent Strings
- UserAgentString.com:
In addition to the packages and tools we covered for obtaining a random user-agent, there are online resources that provide more extensive lists, allowing you to select the specific desired user-agent you need.
One such resource is UserAgentString.com, which compiles a large number of user agents for various browsers and devices.
You can use this website to search for a user-agent string that best suits your requirements.
- WhatIsMyBrowser.com:
If you wish to view detailed information about your browser settings, including user-agent info and other headers, you can visit whatismybrowser.com.
This tool is valuable as it provides comprehensive details about your browser, such as IP address, location, ISP, user-agent, and more.
- Fake User-Agent API:
You also have the option to use the ScrapeOps Fake User-Agent API, which returns a list of fake user-agents that you can use in your Puppeteer session to bypass some simple anti-bot defenses.
To use the ScrapeOps Fake User-Agent API, you first need an API key, which you can get by signing up for a free account. The API is free to use; the API key is only used to prevent users from misusing the API endpoint.
To use the ScrapeOps Fake User-Agents API you just need to send a request to the API endpoint to retrieve a list of user-agents:
const axios = require('axios');

const apiKey = 'YOUR_API_KEY';
const apiUrl = `http://headers.scrapeops.io/v1/user-agents?api_key=${apiKey}`;

axios.get(apiUrl)
  .then(response => {
    console.log(response.data);
  })
  .catch(error => {
    console.error('Error fetching User Agents:', error.message);
  });
The response from the API will look like this:
{
"result": [
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Safari/605.1.15",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Windows; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/603.3.8 (KHTML, like Gecko) Version/10.1.2 Safari/603.3.8",
"Mozilla/5.0 (Windows NT 10.0; Windows; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15",
"Mozilla/5.0 (Windows NT 10.0; Windows; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36"
]
}
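Once you have a response in this shape, you still need to choose one entry per session. The following sketch shows one way to do that; the function name pickUserAgent, the fallback argument, and the injectable random function are illustrative assumptions rather than part of the API.

```javascript
// Picks one user-agent from a response shaped like { result: [...] }.
// Falls back to `fallback` when the list is missing or empty.
// `random` defaults to Math.random and can be swapped out for testing.
function pickUserAgent(apiResponse, fallback, random = Math.random) {
  const list = Array.isArray(apiResponse && apiResponse.result)
    ? apiResponse.result
    : [];
  if (list.length === 0) return fallback;
  return list[Math.floor(random() * list.length)];
}
```

Inside the axios callback above, this could be used as `await page.setUserAgent(pickUserAgent(response.data, customUserAgent));`, so a failed or empty API response still leaves the page with a usable user-agent.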
Troubleshooting and Best Practices
While working with Puppeteer and configuring user agents, you may encounter a few common issues. It's essential to understand these issues to troubleshoot effectively and ensure that your Puppeteer scripts function as intended.
Here are some of the typical problems that may arise:
- Detection as a Bot:
Despite changing the user-agent, some websites might still detect headless browsers and block them. This is often because detection scripts on these websites look for other fingerprints beyond the user-agent, such as certain properties in the window.navigator object or the absence of specific plugins like a PDF viewer that are usually present in regular browsers.
To enhance the stealth of your bot and minimize the risk of detection, you might want to utilize the stealth plugin that we discussed earlier.
- Mismatch Between User-agent and Browser Features:
If you configure a user-agent in a way that conflicts with the attributes of the browser controlled by Puppeteer (for example, applying a mobile user-agent while utilizing desktop browser dimensions and features), it can result in detection.
Sophisticated websites may identify this inconsistency and mark the session as suspicious. To prevent this, ensure that the user-agent being employed aligns with the browser properties.
- Issues with Specific Sites:
Certain websites may necessitate particular user agents to ensure proper rendering and functionality. Incorrect user agents could result in the site not loading as anticipated, and certain features may fail to operate.
Online banking portals serve as an illustrative example. Several banking websites incorporate security measures, employing specific user-agent checks to verify that users access their services through browsers that are both supported and secure.
- Inconsistent User Agents Across Requests:
Using multiple different user agents for the same Puppeteer session can raise red flags for websites that are attempting to identify automated or suspicious traffic.
To avoid this, use consistent and appropriate user-agent strings to mimic authentic browser behavior.
- Overlooking Headers and Other Browser Signatures:
Besides the user-agent, browsers send various other HTTP headers with distinct signatures. Concentrating solely on setting a fake user-agent and disregarding other headers (such as Accept-Language, Accept-Encoding, etc.) may increase the detectability of your bot.
To address these parameters, you can leverage the user-agent-override module in the stealth plugin.
- Failure to Persist User-agent on New Pages/Tabs:
User Agents set for one page in Puppeteer do not automatically extend to newly opened pages or tabs.
To establish a consistent user-agent across multiple pages or tabs in Puppeteer, it is essential to explicitly configure the user-agent for each new page or tab.
- Performance Issues:
Setting up and handling a customized user-agent requires additional processing steps. A single setUserAgent() call is cheap, but if user agents are fetched or rotated frequently or in a complicated manner, it could extend the overall execution time of the script.
Custom user agents that lead to larger or more complex HTTP headers may also add slightly to the cost of each request. Hence, it is worth being aware of the potential impact on performance, particularly in situations where efficiency is crucial.
- Problems with Browser Extensions or Plugins:
Altering the user-agent in your script may lead to the malfunctioning of browser extensions or plugins, particularly if they rely on specific browser attributes for proper functioning.
- Cache and Session Inconsistencies:
Altering user agents within a single browsing session can introduce discrepancies in cached data or session information, potentially resulting in unforeseen issues or errors on websites. To mitigate this, one potential solution involves clearing the browser's cache and cookies before implementing any user-agent changes.
This approach establishes a clean slate, minimizing the chances of conflicts with cached data or session information. Alternatively, if certain features necessitate a distinct user-agent, consider isolating them into separate browsing sessions or instances.
- Challenges in Automation Testing:
User Agents not only indicate the browser but also specify the rendering engine and devices employed. Consequently, when employing Puppeteer for automated testing, variations in user agents can influence the rendering and interaction of the page.
This, in turn, may result in potential inaccuracies, either in the form of false positives or false negatives during testing.
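For the user-agent and browser-feature mismatch described in the list above, one simple safeguard is to derive the viewport from the user-agent instead of setting the two independently. The sketch below is a rough heuristic: the function name viewportFor and the concrete sizes are illustrative assumptions, while the returned keys follow the options accepted by page.setViewport().

```javascript
// Chooses viewport settings that agree with the user-agent string.
// The sizes are illustrative, not values that Puppeteer requires.
function viewportFor(userAgent) {
  const isMobile = /Mobi|Android|iPhone|iPad|iPod/i.test(userAgent);
  return isMobile
    ? { width: 390, height: 844, isMobile: true, hasTouch: true, deviceScaleFactor: 3 }
    : { width: 1366, height: 768, isMobile: false, hasTouch: false, deviceScaleFactor: 1 };
}
```

A page would then be configured with `await page.setUserAgent(ua);` followed by `await page.setViewport(viewportFor(ua));` before navigating.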
To tackle these challenges, it is crucial to carefully handle user-agent strings. Consider other browser fingerprints and HTTP headers, and maintain consistency across your Puppeteer sessions.
Furthermore, conduct thorough testing of your Puppeteer scripts to verify their expected behavior under various scenarios.
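As one example of maintaining that consistency, the issue of user agents not persisting on new pages or tabs can be handled by listening for new targets. This is a minimal sketch that assumes Puppeteer's 'targetcreated' browser event together with target.type() and target.page(); the helper name is hypothetical, and the earliest request of a popup may still be sent before the handler has run.

```javascript
// Applies the same user-agent to every page the browser opens later,
// including tabs created by window.open() or target="_blank" links.
function keepUserAgentOnNewPages(browser, userAgent) {
  browser.on('targetcreated', async (target) => {
    if (target.type() !== 'page') return; // skip workers and other targets
    const page = await target.page();
    if (page) await page.setUserAgent(userAgent);
  });
}
```

Calling `keepUserAgentOnNewPages(browser, customUserAgent);` once, right after launching the browser, covers pages created afterwards; pages that already exist still need their own setUserAgent() call.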
Conclusion
User Agents play a crucial role in web testing and scraping. When scraping, utilizing an appropriate user agent that closely resembles a real browser's user agent is essential to minimize the risk of being blocked by servers.
In this article, we explored various resources to obtain a user agent tailored to your specific requirements and discussed how to set them up using Puppeteer.
However, effectively managing user-agents is only half the battle when it comes to avoiding blocks during web scraping. The more critical aspect involves the use of proxies. To learn how to integrate proxies into your Puppeteer script, check out our Puppeteer Proxy Guide.
More Puppeteer Web Scraping Guides
For those interested in a comprehensive understanding of Puppeteer itself, the following well-written and beginner-friendly articles will be helpful: