Puppeteer Guide: Managing Cookies
Web applications instruct web browsers to store user state, including session data and preferences, using small text files known as cookies.
These cookies are added to the header of each subsequent request to the website, ensuring the continuation of the same session and avoiding the necessity for repeated logins, re-entering the shopping cart, or resetting other preferences.
In this guide, we'll delve into the intricacies of cookie management in Puppeteer, emphasizing its effectiveness in the domains of automation and web scraping.
- Understanding Cookies
- Why Managing Cookies is Important in Web Scraping
- Why Handle Cookies In Puppeteer
- How to Get Cookies with Puppeteer
- How to Accept Cookie Consent Prompts
- How to Save Cookies Locally with Puppeteer
- How to Load Cookies with Puppeteer
- How to Delete Cookies with Puppeteer
- Working with Session Cookies in Puppeteer
- Handling Cookie Changes with Puppeteer
- Working with Multiple Cookies In Puppeteer
- How to Clear the Browser Cache with Puppeteer
- Puppeteer Cookies Best Practices
- Troubleshooting Cookie-Related Issues
- Working with Cookies in E-commerce Websites
- Limitations of Puppeteer in Managing Cookies
- Alternatives to Puppeteer for Managing Cookies
- Conclusion
Understanding Cookies
HTTP is stateless: there is no built-in link between successive requests carried out on the same connection, which poses challenges for users trying to interact smoothly with a site, for example when using an e-commerce shopping cart.
Despite HTTP's inherent statelessness, HTTP cookies come into play, introducing stateful sessions. Through header extensibility, HTTP cookies seamlessly integrate into the workflow, facilitating session creation in every HTTP request and ensuring a consistent context or state for a more user-friendly experience.
What are Cookies and What do They Do
An HTTP cookie, also known as a web cookie or browser cookie, is a small piece of data transmitted by a server to a user's web browser.
The browser then stores this cookie and subsequently sends it back to the same server with future requests. The primary purpose of an HTTP cookie is to identify whether two requests originate from the same browser, facilitating tasks such as maintaining user login status.
It is important to note that the size of a single cookie should not exceed 4 KB, although a domain can store multiple cookies.
Cookies serve three main purposes:
- 1. Session management: Handling logins, managing shopping carts, tracking game scores, or any other information the server needs to retain.
- 2. Personalization: Customizing user experiences based on preferences, themes, and other individual settings.
- 3. Tracking: Recording and analysis of user behavior for various insights and purposes.
The Set-Cookie HTTP response header is used to send a cookie from the server to the user agent, so that the user agent can send it back to the server in subsequent requests. Multiple Set-Cookie headers may be included in the same response to send multiple cookies at once.
Set-Cookie: <cookie-name>=<cookie-value>
Set-Cookie: <cookie-name>=<cookie-value>; Domain=<domain-value>
Set-Cookie: <cookie-name>=<cookie-value>; Expires=<date>
Set-Cookie: <cookie-name>=<cookie-value>; HttpOnly
Set-Cookie: <cookie-name>=<cookie-value>; Max-Age=<number>
Set-Cookie: <cookie-name>=<cookie-value>; Path=<path-value>
Set-Cookie: <cookie-name>=<cookie-value>; Secure
The various attributes that can be associated with a cookie include:
- Name: The unique identifier for the cookie. It is used to retrieve the stored value.
- Value: The data associated with the name. This can be a simple string or more complex data.
- Domain: The domain for which the cookie is valid. The cookie will be sent to this domain and its subdomains.
- Path: The specific path or directory within the domain for which the cookie is valid. The cookie will be sent only to requests that match this path.
- Expires: The date and time when the cookie will expire. Once expired, the cookie will be automatically deleted. If no expires attribute is set, the cookie is considered a session cookie and will be deleted when the browser is closed.
- Secure: If set, the cookie will only be sent over secure, encrypted connections (HTTPS).
- HttpOnly: If set, the cookie cannot be accessed through client-side scripts. This helps mitigate certain types of cross-site scripting (XSS) attacks.
- SameSite: This attribute controls when the cookie is sent to the server. It can have three values: "Strict," "Lax," or "None." "Strict" allows the cookie to be sent only in a first-party context, "Lax" is more permissive, and "None" allows the cookie to be sent in cross-site requests.
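To make the mapping between the header syntax and these attributes concrete, here is a minimal sketch that splits a single Set-Cookie value into a plain object. The parseSetCookie() helper is hypothetical and written only for illustration; it assumes a well-formed header with a name=value pair first and does not handle every edge case a real cookie parser would.

```javascript
// Hypothetical helper for illustration; not part of Puppeteer or Node.js.
// Assumes a well-formed header that starts with "<name>=<value>".
function parseSetCookie(header) {
  const [pair, ...attributes] = header.split(';').map(part => part.trim());
  const separator = pair.indexOf('=');
  const cookie = {
    name: pair.slice(0, separator),
    value: pair.slice(separator + 1),
  };
  for (const attribute of attributes) {
    if (!attribute) continue;
    const eq = attribute.indexOf('=');
    // Attribute names are case-insensitive, so normalize them to lower case
    const key = (eq === -1 ? attribute : attribute.slice(0, eq)).toLowerCase();
    // Flags such as Secure and HttpOnly carry no value, so store true for them
    cookie[key] = eq === -1 ? true : attribute.slice(eq + 1);
  }
  return cookie;
}

const parsed = parseSetCookie('theme=dark; Path=/; Max-Age=3600; Secure; HttpOnly; SameSite=Lax');
console.log(parsed);
```

Attribute values are kept as strings here, so a caller that needs a numeric Max-Age or a Date for Expires would have to convert them.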
Types of Cookies
Cookies can be classified into two types based on their expiration:
- Persistent Cookies: These cookies are deleted on the date specified by the Expires attribute or after the period specified by the Max-Age attribute.
- Session Cookies: Because they have neither a Max-Age nor an Expires attribute, session cookies are erased when the current session ends.
How are Cookies Stored and Accessed by the Browser
The browser locally stores the received cookies on the user's device, with the storage location contingent upon the browser and operating system—typically manifesting as a text file or a local database.
To view all stored cookies, you can utilize browser developer tools and navigate to the "Storage" tab. Within this tab, under cookies, you will find a list of resource URLs, including those from trackers, extensions, and the original webpage's URL.
For example, the inspection of theguardian.com appears as follows:
To manage these cookies, you can delete either all or specific session cookies by right-clicking on them and selecting the desired option. It's important to note that the user interface may vary across different browsers.
Here's a screenshot illustrating the process on Firefox:
Why Managing Cookies is Important in Web Scraping
Cookies play a pivotal role in enhancing the efficiency of web scraping processes, offering numerous advantages that optimize the execution time of scripts and contribute to a more effective scraping experience:
- Session Persistence: Cookies serve to preserve website states, encompassing features like Shopping Carts, Game Scores, and User Settings or Preferences, ensuring seamless recall for subsequent use.
- Authentication: Cookies facilitate user persistence, eliminating the need for repeated logins during subsequent web scraping actions.
- Personalized Content: Websites often personalize content based on user preferences stored in cookies. Managing cookies allows you to access and scrape personalized content, providing a more comprehensive dataset.
- Bot Detection: Cookies can be used to track the client's behavior to detect the presence of Puppeteer bots. Disabling cookie tracking and sanitizing cookies can be helpful against bot detection.
Why Handle Cookies In Puppeteer
Cookies are compact files, limited to 4KB in size, yet their applications span over a wide range.
Let's delve into these applications one by one and examine how each contributes to improving the performance of Puppeteer bots:
- Session Persistence:
When you navigate to a website, the server generates a unique session identifier and sends a corresponding session cookie to your browser.
These cookies track your diverse interactions with the website, including actions like logging in, adding items to your shopping cart, and configuring preferences.
Session cookies are intentionally designed to expire after the browser session or upon user logout.
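// Express route handler that expires the session cookie on logout
app.get('/logout', (req, res) => {
  // An expiry date in the past tells the browser to delete the cookie
  res.cookie('sessionId', '', { expires: new Date(0), httpOnly: true });
  res.redirect('/');
});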
This can be a challenge when you use multiple browser contexts or scripts, because the entire process, including login and other interactions, has to be repeated each time, which is time-consuming.
Puppeteer comes to the rescue with its methods to preserve session cookies before they get deleted. This allows you to subsequently load these cookies into different browser contexts, seamlessly resuming the exact session where you left off.
- Authentication:
Various authentication methods, such as JWT, OAuth 2.0, or OpenID, are employed by websites to verify user credentials, including login details.
Typically, authentication tokens are stored in cookies, with an assigned expiration date for security purposes.
Puppeteer enables us to save these tokens as cookies and subsequently load them into different browser contexts to persist authentications, like login sessions.
This proves invaluable in scenarios such as automated testing or the development of web scraping applications, where maintaining authentication states is crucial.
- Testing Cookie-Based Features:
Various website functionalities leverage cookies, serving as storage for user preferences like language settings, theme choices, or layout preferences.
Cookies are also employed by advertisers to track user interests and deliver targeted ads. (Static resources such as images and scripts are stored in the browser cache rather than in cookies.)
Puppeteer offers automation capabilities for testing these cookie-based features.
- Web Scraping:
Leveraging cookies in web scraping can streamline the process by maintaining sessions and authentications, enabling seamless data extraction.
Additionally, cookies facilitate the ability to pause and resume scraping sessions, which is particularly advantageous when dealing with extensive data sets.
Incorporating personalization cookies into scraping scripts allows for the extraction of user-specific data or helps in avoiding bot detection by mimicking authentic user behavior.
- Avoiding Repeated Logins:
In Puppeteer, automating logins involves populating form fields, submitting the form to the server, and waiting for the subsequent page navigation.
When scraping a site with a login requirement, these steps must be executed each time to access the desired page. However, by utilizing cookies, login auth tokens can be preserved.
By loading these tokens, you can bypass logins, significantly speeding the process and reaching the target page for scraping.
- Bypassing Rate Limits or IP Blocks:
When a website imposes rate limits or blocks access based on IP addresses, it means that users or automated bots are restricted in the frequency or volume of requests they can make within a specific timeframe.
The goal is often to prevent abuse, ensure fair usage, or enhance security. Through careful cookie management with Puppeteer, specifically by preserving session persistence and mimicking user behavior, you can adeptly address the challenges posed by rate limits or IP blocks on a website.
- Bypassing Anti-Bot Systems:
To successfully avoid bot detection, the key is to emulate authentic user behavior. By incorporating cookies into your request, you create the appearance of a returning user.
This aligns with the expectations of web servers, allowing you to navigate undetected and retrieve the necessary data when executed correctly.
- Testing Different User Roles:
Evaluating a web application from diverse user perspectives is essential to guarantee its correct functionality and positive user experience across all user roles. Each user role may entail unique permissions, access levels, and functionalities within the application.
Through comprehensive testing from different viewpoints, developers can pinpoint and resolve issues specific to each user role, ultimately improving the overall resilience and usability of the application.
Efficient cookie management further facilitates smooth transitions between various user accounts in this testing process.
- Compliance Testing:
For assurance that a website adheres to regulations such as GDPR or CCPA, which typically outline specific guidelines on cookie usage,
Puppeteer can be employed to automate the evaluation of how cookies are utilized and managed, ensuring compliance with regulatory standards.
How to Get Cookies with Puppeteer
Puppeteer manages cookies at the page level through two straightforward methods:
- page.cookies() retrieves the cookies for the current page, so it should be called after the page has been visited.
- page.setCookie() sets the given cookies on the page, so it should be called before navigating to the page.
How To Get All Cookies With Puppeteer
Waiting for the load or DOMContentLoaded event is crucial to ensure that the page has loaded before attempting to retrieve cookies.
This precaution is necessary because cookies might be dynamically set or modified during the page loading process. Subsequently, you can employ the page.cookies() method to fetch the cookies and store them in a variable.
The method returns a Promise that resolves to an array of cookies. Here is a practical example:
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch({ headless: "new" });
const page = await browser.newPage();
await page.goto('https://www.goodreads.com/', { waitUntil: 'load' });
const cookies = await page.cookies();
console.log(cookies);
await browser.close();
})();
// [
// {
// name: '_session_id2',
// value: '45fd2bec86f07fcb2fe739358a080299',
// domain: 'www.goodreads.com',
// path: '/',
// expires: 1702219589.329311,
// size: 44,
// httpOnly: true,
// secure: false,
// session: false,
// sameParty: false,
// sourceScheme: 'Secure',
// sourcePort: 443
// },
// {
// name: 'locale',
// value: 'en',
// domain: 'www.goodreads.com',
// path: '/',
// expires: -1,
// size: 8,
// httpOnly: false,
// secure: false,
// session: true,
// sameParty: false,
// sourceScheme: 'Secure',
// sourcePort: 443
// },
// ... More
// ]
In the above example, we waited until the home page of Goodreads was fully loaded before using the page.cookies() method to retrieve cookies.
Cookies carry several attributes, such as expires, httpOnly, and secure, but there's no need to worry about them at this moment; we will cover these details later on.
How To Get Specific Cookies From Url With Puppeteer
The page.cookies() method retrieves cookies of all types. When there are many cookies, or you are only interested in cookies meeting particular criteria, such as session cookies, you can filter them by their attributes.
Here's a thin wrapper around the page.cookies() method that takes a page, a URL, and a predicate function, and returns the filtered cookies:
// Navigate to the URL and return only the cookies accepted by the predicate
async function getCookies(page, url, predicate) {
  await page.goto(url, { waitUntil: 'load' });
  const allCookies = await page.cookies();
  return allCookies.filter(predicate);
}
Now, let's use our getCookies() function to retrieve session cookies. Puppeteer reports session cookies with the session: true flag and expires: -1, meaning that no expiration date is set and the browser deletes the cookie when the session ends:
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch({ headless: "new" });
  const page = await browser.newPage();
  const url = 'https://www.goodreads.com/';
  // Pass the page explicitly so getCookies() does not rely on an outer variable
  const sessionCookies = await getCookies(page, url, cookie => {
    return cookie.expires === -1 || cookie.session === true;
  });
  console.log(sessionCookies);
  await browser.close();
})();
// [
// {
// name: 'locale',
// value: 'en',
// domain: 'www.goodreads.com',
// path: '/',
// expires: -1,
// size: 8,
// httpOnly: false,
// secure: false,
// session: true,
// sameParty: false,
// sourceScheme: 'Secure',
// sourcePort: 443
// }
// ]
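Predicates like the one above can be composed from small reusable pieces. The following sketch shows one possible way to do that; byName(), byDomain(), and allOf() are hypothetical helper names introduced here for illustration, and the sample array only mimics the shape of the cookies printed above.

```javascript
// Hypothetical predicate builders for filtering arrays returned by page.cookies()
const byName = (...names) => cookie => names.includes(cookie.name);

// Matches the domain itself and its subdomains, ignoring a leading dot
const byDomain = domain => cookie => {
  const cookieDomain = cookie.domain.replace(/^\./, '');
  return cookieDomain === domain || cookieDomain.endsWith(`.${domain}`);
};

// Combines several predicates; a cookie must satisfy all of them
const allOf = (...predicates) => cookie => predicates.every(predicate => predicate(cookie));

// Sample data shaped like Puppeteer's cookie objects (values shortened)
const sampleCookies = [
  { name: '_session_id2', domain: 'www.goodreads.com', httpOnly: true, session: false },
  { name: 'locale', domain: 'www.goodreads.com', httpOnly: false, session: true },
  { name: 'consentUUID', domain: '.theguardian.com', httpOnly: false, session: false },
];

const httpOnlyGoodreads = sampleCookies.filter(
  allOf(byDomain('goodreads.com'), cookie => cookie.httpOnly)
);
console.log(httpOnlyGoodreads.map(cookie => cookie.name)); // [ '_session_id2' ]
```

Any of these predicates can be passed wherever a cookie filter function is expected, for example to Array.prototype.filter() on the result of page.cookies().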
How to Accept Cookie Consent Prompts
Certain websites request user consent for specific cookies due to privacy regulations such as the General Data Protection Regulation (GDPR) or the ePrivacy Directive in the EU.
In these situations, the cookies are only sent to the browser if the user agrees to the cookie consent form, respecting their privacy preferences.
Various websites display diverse types of pop-ups featuring accept and reject buttons to request cookie consent. Hence, it is essential to examine the appearance of the pop-up and inspect its HTML before automating and accepting it.
For instance, on theguardian.com, a separate iframe contains the cookie consent form:
To automatically accept this pop-up, wait for the iframe to become visible and then click the accept button inside it. Here is the script to achieve this:
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch({ headless: "new" });
const page = await browser.newPage();
await page.goto("https://www.theguardian.com/international");
const iframeHandle = await page.waitForSelector('#sp_message_iframe_882219');
const iframe = await iframeHandle.contentFrame();
await iframe.waitForSelector('#notice', { visible: true });
const elementHandle = await iframe.$(`button.message-component:nth-child(1)`);
await elementHandle.click();
const cookies = await page.cookies();
console.log(cookies);
await browser.close();
})();
// [
// {
// name: 'consentUUID',
// value: '579cb617-a9d1-43a8-84c0-f237603874c1',
// domain: '.theguardian.com',
// path: '/',
// expires: 1733789490.935974,
// size: 47,
// httpOnly: false,
// secure: true,
// session: false,
// sameSite: 'None',
// sameParty: false,
// sourceScheme: 'Secure',
// sourcePort: 443
// },
// ... More
// ]
- The script launches a headless browser, opens a new page, and navigates to "https://www.theguardian.com/international".
- It waits for the appearance of an iframe with the ID 'sp_message_iframe_882219'.
- The content frame of the iframe is obtained, and then the script waits for the visibility of an element with the ID 'notice' within that iframe.
- It selects a button within the iframe using a CSS selector and clicks on it.
- The script then retrieves the cookies from the current page and logs them to the console.
- Finally, it closes the browser.
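Because the script reads cookies right after the click, a consent cookie that is written asynchronously may not be present yet. One way to guard against this, sketched below under that assumption, is to poll page.cookies() until a cookie with the expected name appears. The waitForCookie() helper and its timeout and interval options are hypothetical names chosen for this example; the page argument is any object exposing Puppeteer's async cookies() method.

```javascript
// Hypothetical helper: poll page.cookies() until a cookie with the given name exists.
async function waitForCookie(page, name, { timeout = 5000, interval = 250 } = {}) {
  const deadline = Date.now() + timeout;
  while (true) {
    const cookies = await page.cookies();
    const match = cookies.find(cookie => cookie.name === name);
    if (match) return match;
    if (Date.now() >= deadline) {
      throw new Error(`Cookie "${name}" did not appear within ${timeout} ms`);
    }
    // Sleep briefly before asking the browser again
    await new Promise(resolve => setTimeout(resolve, interval));
  }
}
```

In the Guardian example it could be called after the click, for instance as await waitForCookie(page, 'consentUUID'), before logging the cookies.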
How to Save Cookies Locally with Puppeteer
Until now, we have been retrieving cookies without storing them for future use. It is essential to save cookies in a text or JSON file to load them later, for persisting sessions or avoiding repeated logins.
For file operations in Node.js, the fs module is used. Its writeFileSync() function takes a path to the file and the data to write as a string. The data obtained with the page.cookies() method is an array, so it needs to be converted into a JSON string using the JSON.stringify(value, replacer, space) method.
Additionally, we will pass a space argument of 2 to pretty-print the cookies in the JSON file instead of writing them all on a single line.
Let's log in to the Goodreads website and attempt to save all cookies, including authentication cookies, for future use.
const puppeteer = require('puppeteer');
const {writeFileSync} = require('fs');
(async () => {
const browser = await puppeteer.launch({ headless: "new" });
const page = await browser.newPage();
await page.goto("https://www.goodreads.com/user/sign_in", { waitUntil: 'load' });
await Promise.all([
page.waitForNavigation({waitUntil: 'load'}),
page.click(".authPortalConnectButton")
]);
await page.type('#ap_email', 'YOUR-USERNAME');
await page.type('#ap_password', 'YOUR-PASSWORD');
await Promise.all([
page.waitForNavigation({waitUntil: 'load'}),
page.click("#signInSubmit")
]);
const cookies = await page.cookies();
writeFileSync('cookies.json', JSON.stringify(cookies, null, 2));
await browser.close();
})();
The script above logs in to Goodreads, waits for the navigation to finish, and then saves the retrieved cookies in a file named cookies.json.
The structure of the saved file will resemble the following:
[
{
"name": "_session_id2",
"value": "190831d6e65882e69e8eb47f4a21a4",
"domain": "www.goodreads.com",
"path": "/",
"expires": 1702279118.765509,
"size": 44,
"httpOnly": true,
"secure": false,
"session": false,
"sameParty": false,
"sourceScheme": "Secure",
"sourcePort": 443
},
{
"name": "likely_has_account",
"value": "true",
"domain": "www.goodreads.com",
"path": "/",
"expires": 1710033515.043604,
"size": 22,
"httpOnly": false,
"secure": false,
"session": false,
"sameParty": false,
"sourceScheme": "Secure",
"sourcePort": 443
},
... More
How to Load Cookies with Puppeteer
Setting cookies before visiting a page is accomplished with page.setCookie(...cookies). You can set as many cookies as needed manually or load them from a file.
In the previous section, we saved cookies after logging into goodreads.com to a file named cookies.json. Let's see whether loading these cookies into a new browser instance bypasses the login process.
const puppeteer = require('puppeteer');
const {readFileSync} = require('fs');
(async () => {
const browser = await puppeteer.launch({
defaultViewport: {
width: 1980,
height: 1080
},
headless: "new"
});
const page = await browser.newPage();
const cookies = JSON.parse(readFileSync("./cookies.json", "utf-8"));
await page.setCookie(...cookies);
await page.goto("https://www.goodreads.com", {
waitUntil: "load"
});
await page.screenshot({path: "goodreads-login.png"});
await browser.close();
})();
As you can see, by visiting the home page of Goodreads with the cookies from cookies.json loaded, we were taken straight to our dashboard without the need for a login.
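A saved cookie file can contain entries that have expired by the time it is loaded. As a minimal sketch, the loader below skips those before calling page.setCookie(). The dropExpired() and loadCookies() names are hypothetical; the code assumes the file was written from page.cookies() output, where expires is in seconds since the Unix epoch and session cookies carry expires: -1.

```javascript
const { readFileSync } = require('fs');

// Keep session cookies (expires: -1 or missing) and cookies that have not expired yet.
function dropExpired(cookies, nowSeconds = Date.now() / 1000) {
  return cookies.filter(cookie =>
    cookie.expires === undefined || cookie.expires === -1 || cookie.expires > nowSeconds
  );
}

// Read a cookie file written earlier and apply the still-valid cookies to the page.
// `page` is a Puppeteer Page (or any object with an async setCookie method).
async function loadCookies(page, filePath) {
  const saved = JSON.parse(readFileSync(filePath, 'utf-8'));
  const valid = dropExpired(saved);
  if (valid.length > 0) {
    await page.setCookie(...valid);
  }
  return valid.length;
}
```

A return value of 0 suggests that the saved session is stale and that the login script should be run again.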
How to Delete Cookies with Puppeteer
You can remove specific cookies using the page.deleteCookie(...cookies) method. While there might not be numerous reasons to do this, it proves beneficial in certain scenarios:
- Different Login Credentials: Different users possess distinct privileges and authorizations on a website. Resetting specific user data, such as login credentials, facilitates the testing of performance across various user roles.
- New Session: Clearing cookies allows you to start your automation with a clean slate, simulating a fresh session. This is valuable for testing and automating scenarios where a user begins a new session without any prior interactions or stored information.
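Continuing from the previous example where we bypassed logging in to goodreads.com with cookies, we will now delete those cookies and reload the page, which reverts the login and brings us back to the home page in a logged-out state:
// page.deleteCookie() only removes the cookies that are passed to it
await page.deleteCookie(...cookies);
await page.reload();
An alternative method is to overwrite the cookies with an expires timestamp in the past (in seconds since the Unix epoch), which the browser should treat as already expired. Note that -1 is not suitable for this, because Puppeteer uses expires: -1 to denote a session cookie:
for (const cookie of cookies) {
  cookie.expires = 1; // one second after the Unix epoch, long in the past
}
await page.setCookie(...cookies);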
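To remove only selected cookies rather than everything that was loaded, the cookies can be looked up by name first. The sketch below shows one way to do this; deleteCookiesByName() is a hypothetical helper, and it passes name, domain, and path to page.deleteCookie() so that only the matching entries are targeted.

```javascript
// Hypothetical helper: delete the cookies on the current page whose names are listed.
// `page` is a Puppeteer Page (or any object with async cookies() and deleteCookie()).
async function deleteCookiesByName(page, names) {
  const cookies = await page.cookies();
  const targets = cookies
    .filter(cookie => names.includes(cookie.name))
    // deleteCookie() identifies a cookie by name plus optional domain and path
    .map(({ name, domain, path }) => ({ name, domain, path }));
  if (targets.length > 0) {
    await page.deleteCookie(...targets);
  }
  return targets.map(cookie => cookie.name);
}
```

For example, await deleteCookiesByName(page, ['_session_id2']) would drop the Goodreads session cookie shown earlier while leaving preference cookies such as locale in place.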
Working with Session Cookies in Puppeteer
Session cookies persist until the browser is closed. You can establish a session cookie by either omitting the expires and maxAge properties or by explicitly setting its expires attribute to -1.
Session cookies are commonly employed for tasks such as retaining logins, shopping cart details, game scores, or any information the server should remember.
When a website authenticates users, it typically generates and re-issues session cookies, even for existing ones, upon user authentication. This practice serves as a security measure against session fixation attacks, where a malicious third party attempts to reuse a user's session.
You can save all the session cookies to a file, allowing you to pause the current session and resume it later in another script.
const allCookies = await page.cookies();
// Session cookies have no expiration date; Puppeteer reports them with session: true and expires: -1
const sessionCookies = allCookies.filter(cookie => cookie.session || cookie.expires === -1);
writeFileSync("./sessionCookies.json", JSON.stringify(sessionCookies, null, 2));
The code snippet above fetches all cookies from the current page, keeps only the session cookies based on their expiration properties, and writes them to a JSON file named sessionCookies.json. It assumes that writeFileSync has been imported from the fs module, as in the earlier examples.
Handling Cookie Changes with Puppeteer
Cookies are often used to remember user login status, language preferences, shopping cart contents, and other session-specific information. They facilitate tracking and analytics, helping website owners understand user behavior, improve user experience, and analyze website performance.
Cookies can change dynamically as users interact with the website. New cookies may be set, existing ones may be updated, and some may be deleted based on user actions or predefined criteria. Cookies can have expiration dates, determining how long they persist on a user's device. Session cookies, for example, expire when the browser session ends, while persistent cookies may have a set expiration date.
Monitoring changes in cookies can be valuable for debugging, troubleshooting, understanding user interactions, and various other purposes:
- Debugging: Tracking cookie changes can be valuable for debugging purposes. This way, you can see how your application or third-party service is modifying cookie values, which can help identify issues related to session management, authentication, or state persistence.
- Automated Testing: In automated testing, you might want to verify that certain actions (like logging in or updating settings) trigger the creation or alteration of cookies as expected. Monitoring cookie changes allows you to assert these behaviors programmatically.
- Handling Authentication and Session Management: For applications that rely on cookies for user sessions or authentication tokens, monitoring cookies changes can help manage user states. For example, you can automate the process of re-authentication once a session cookie is removed or expired.
- Compliance and Security Audits: When auditing a website for security or compliance with privacy laws (like GDPR or CCPA), you can detect cookies changes to monitor and log how cookies are being used and managed, ensuring that they comply with the necessary regulations.
- Synchronizing State Across Multiple Tabs or Windows: If your Puppeteer script is managing multiple pages or browser contexts, monitoring cookie changes can help synchronize state across these contexts by reacting to changes in cookies.
- Monitoring Cookie Changes: Tracking when cookies are added, updated, or removed during a browsing session is especially useful in testing scenarios where you need to ensure that your web application is handling cookies correctly.
While there are no built-in events to track changes in cookies, we can create a custom function that runs periodically to detect when cookies have been added or removed. Note that this simplified function only detects additions and removals of cookies by their name attribute; it does not detect updates to existing cookies, as that is beyond the scope of this article.
let initialCookies = await page.cookies();
const checkCookieChanges = async () => {
const currentCookies = await page.cookies();
const addedCookies = currentCookies.filter(cookie => !initialCookies.some(prevCookie => prevCookie.name === cookie.name));
const removedCookies = initialCookies.filter(prevCookie => !currentCookies.some(cookie => cookie.name === prevCookie.name));
if (addedCookies.length > 0 || removedCookies.length > 0) {
console.log('Cookies changed:');
console.log('Added Cookies:', addedCookies);
console.log('Removed Cookies:', removedCookies);
}
initialCookies = currentCookies;
};
const intervalId = setInterval(checkCookieChanges, 5000);
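If updates to existing cookies matter as well, the comparison can be extended. The following is only a sketch of one possible approach: diffCookies() and cookieKey() are hypothetical helpers that key each cookie by name, domain, and path (names alone are not unique across domains) and treat a changed value or expires as an update.

```javascript
// Hypothetical helpers for comparing two snapshots returned by page.cookies()
const cookieKey = cookie => `${cookie.name}|${cookie.domain}|${cookie.path}`;

function diffCookies(previous, current) {
  const before = new Map(previous.map(cookie => [cookieKey(cookie), cookie]));
  const after = new Map(current.map(cookie => [cookieKey(cookie), cookie]));

  const added = current.filter(cookie => !before.has(cookieKey(cookie)));
  const removed = previous.filter(cookie => !after.has(cookieKey(cookie)));
  // A cookie counts as updated when it exists in both snapshots with a different value or expiry
  const updated = current.filter(cookie => {
    const old = before.get(cookieKey(cookie));
    return old !== undefined && (old.value !== cookie.value || old.expires !== cookie.expires);
  });

  return { added, removed, updated };
}
```

Inside a periodic check, the result of diffCookies(initialCookies, currentCookies) could be logged whenever any of the three arrays is non-empty.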
Working with Multiple Cookies In Puppeteer
In scenarios where managing many cookies is a routine task, JavaScript's higher-order array methods prove invaluable, simplifying our workflow. These methods offer the flexibility to filter cookies based on specific criteria or reshape existing cookies by adjusting their attributes.
- You can use the filter() method to select cookies based on a predicate function. Here's an example that selects cookies with the httpOnly attribute:
const httpOnlyCookies = allCookies.filter(cookie => cookie.httpOnly);
- You can use the forEach() method to iterate through all cookies and modify their attributes. Here's an example that sets the expires attribute to -1, which turns every cookie into a session cookie once the modified cookies are applied again with page.setCookie():
allCookies.forEach(cookie => {
cookie.expires = -1;
});
Other higher-order functions such as map() and reduce() can also be used to apply transformation or aggregation logic for efficient cookie manipulation.
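As a brief illustration of map() and reduce(), the sketch below works on a small sample array shaped like the output of page.cookies(). The sample values are shortened and the variable names are chosen only for this example.

```javascript
// Sample data shaped like Puppeteer's cookie objects (values shortened)
const allCookies = [
  { name: 'locale', value: 'en', domain: 'www.goodreads.com' },
  { name: 'likely_has_account', value: 'true', domain: 'www.goodreads.com' },
  { name: 'consentUUID', value: '579cb617', domain: '.theguardian.com' },
];

// reduce(): group cookie names by the domain they belong to
const namesByDomain = allCookies.reduce((groups, cookie) => {
  groups[cookie.domain] = groups[cookie.domain] || [];
  groups[cookie.domain].push(cookie.name);
  return groups;
}, {});

// filter() + map(): build a Cookie request header value for one domain
const cookieHeader = allCookies
  .filter(cookie => cookie.domain === 'www.goodreads.com')
  .map(cookie => `${cookie.name}=${cookie.value}`)
  .join('; ');

console.log(namesByDomain['www.goodreads.com']); // [ 'locale', 'likely_has_account' ]
console.log(cookieHeader); // locale=en; likely_has_account=true
```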
How to Clear the Browser Cache with Puppeteer
Caching is used to store copies of resources (like images, stylesheets, and scripts) locally on the user's device. By caching static resources, subsequent page loads can be faster since the browser can retrieve resources locally rather than from the server.
In Puppeteer, you can utilize the ability to clear the browser cache to establish a clean state for a new session.
const client = await page.target().createCDPSession();
await client.send('Network.clearBrowserCache');
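For a fully clean state, the cache and the cookies are often cleared together. The sketch below wraps both Chrome DevTools Protocol commands in one function; clearCacheAndCookies() is a hypothetical helper name, and the code assumes a Chromium-based browser where the Network.clearBrowserCache and Network.clearBrowserCookies commands are available.

```javascript
// Hypothetical helper: clear the browser cache and all browser cookies over CDP.
// `page` is a Puppeteer Page.
async function clearCacheAndCookies(page) {
  const client = await page.target().createCDPSession();
  try {
    await client.send('Network.clearBrowserCache');
    await client.send('Network.clearBrowserCookies');
  } finally {
    // Release the CDP session even if one of the commands fails
    await client.detach();
  }
}
```

Unlike page.deleteCookie(), which removes only the cookies passed to it, Network.clearBrowserCookies removes every cookie in the browser, so it is best reserved for resetting state between independent runs.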
Puppeteer Cookies Best Practices
When working with cookies in Puppeteer, there are several best practices to consider.
- Update Cookies Regularly:
It's essential to stay mindful of the expiration dates (expires key) when obtaining cookies. Consider refreshing them periodically and acquiring fresh ones before starting tasks on a target website.
- Consider Using Separate Browser Contexts:
Opt for new pages within separate browser contexts for different tasks to enhance isolation.
Each context maintains its distinct set of cookies, ensuring a cleaner separation of state.
const context = await browser.createIncognitoBrowserContext();
const page = await context.newPage();
- Test in Headless Mode:
Test your Puppeteer scripts in headless: "new" mode, as this is closer to how your automation will run in production. This helps uncover potential issues early in the development process.
- Handle Errors:
Implement error handling with try...catch to manage potential issues when interacting with cookies. This helps prevent script failures and improves the robustness of your automation.
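The first practice, keeping an eye on expiration dates, can be automated with a small check before a run. The sketch below is one possible way to do it; findExpiringCookies() is a hypothetical helper, and it assumes cookie objects in the shape returned by page.cookies(), where expires is in seconds since the Unix epoch and -1 marks a session cookie.

```javascript
// Hypothetical helper: list persistent cookies that have expired
// or will expire within the next `withinSeconds` seconds.
function findExpiringCookies(cookies, withinSeconds, nowSeconds = Date.now() / 1000) {
  return cookies.filter(cookie =>
    cookie.expires !== -1 && cookie.expires - nowSeconds <= withinSeconds
  );
}

// Sample data taken from the shape of the saved cookies.json file
const savedCookies = [
  { name: '_session_id2', expires: 1702219589.329311 },
  { name: 'likely_has_account', expires: 1710033515.043604 },
  { name: 'locale', expires: -1 },
];

// Check against a fixed point in time so the result is reproducible
const expiringSoon = findExpiringCookies(savedCookies, 3600, 1702219000);
console.log(expiringSoon.map(cookie => cookie.name)); // [ '_session_id2' ]
```

If the returned list contains an authentication cookie, that is a signal to log in again and save fresh cookies before starting the main task.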
Troubleshooting Cookie-Related Issues
Here are some common issues that may arise during cookie handling and ways to troubleshoot them:
- Expired Cookies:
Expired cookies can lead to authentication or session issues. To prevent this, regularly clear expired and outdated cookies using page.deleteCookie() and fetch fresh cookies with page.cookies().
- Domain Mismatch:
Cookies are domain-specific, and using them in a mismatched domain can result in errors. Pay attention to the domain attribute of cookies to avoid such issues.
- Cookie Backups:
Before making any changes to cookies, create a backup of the original cookies to a file. This ensures that you can restore them if any undesired changes occur during manipulation.
- Headless Mode:
Test your Puppeteer script in both headless: false mode and headless: true mode before deploying it to production. This helps identify and address any issues related to the script's behavior in different modes.
- User-Agent Consideration:
Prevent bot detection and other potential impacts on cookies by using a realistic user agent. Set the user agent with page.setUserAgent() to enhance the script's authenticity.
- Concurrency Issues:
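When running multiple instances of browsers or pages in Puppeteer, be cautious of race conditions. Pages in the same browser context share cookies, so give concurrent tasks their own browser contexts where possible, and await cookie operations explicitly (for example with Promise.all()) before depending on their results.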
- Proper Timing:
Ensure that you acquire and set cookies at appropriate times. Utilize functions such as waitForSelector(), waitForNavigation(), and page.evaluate() to wait for the right moments before setting or retrieving cookies.
This helps in synchronizing cookie operations with the overall script flow.
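Domain mismatches in particular can be caught before a cookie is ever set. The sketch below approximates the browser's matching rules for domain, path, and the Secure flag; cookieAppliesTo() is a hypothetical helper, and it assumes the convention seen in the outputs above, where a leading dot in domain marks a cookie that is also valid for subdomains.

```javascript
// Hypothetical helper: would the browser send this cookie to the given URL?
// This is a simplified approximation of the real matching rules.
function cookieAppliesTo(cookie, url) {
  const { hostname, pathname, protocol } = new URL(url);

  // A leading dot means the cookie is valid for the domain and its subdomains
  const hostOnly = !cookie.domain.startsWith('.');
  const domain = cookie.domain.replace(/^\./, '');
  const domainOk = hostOnly
    ? hostname === domain
    : hostname === domain || hostname.endsWith(`.${domain}`);

  // The request path must equal the cookie path or sit below it
  const cookiePath = cookie.path || '/';
  const pathOk = pathname === cookiePath ||
    (pathname.startsWith(cookiePath) &&
      (cookiePath.endsWith('/') || pathname[cookiePath.length] === '/'));

  // Secure cookies are only sent over HTTPS
  const secureOk = !cookie.secure || protocol === 'https:';

  return domainOk && pathOk && secureOk;
}

const consentCookie = { name: 'consentUUID', domain: '.theguardian.com', path: '/', secure: true };
console.log(cookieAppliesTo(consentCookie, 'https://www.theguardian.com/international')); // true
console.log(cookieAppliesTo(consentCookie, 'https://www.goodreads.com/')); // false
```

Running such a check over a loaded cookie file before calling page.setCookie() makes it easier to spot cookies that were saved from a different site or subdomain.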
Working with Cookies in E-commerce Websites
Let's dive into a practical example that mirrors the kind of tasks encountered in real-world scenarios, especially when writing production-ready Puppeteer scripts.
We'll use Bruvi.com, a platform focused on selling coffee pods, as a case study. Our first step is writing a Puppeteer script named save-bruvi-session.js.
This script will navigate through three product pages, add each item to the shopping cart, wait for cookies to be set on each page, and then store the cookies in a file named bruvi-cookies.json.
This file will not just store session cookies; it will also retain information about the items added to the cart.
Create a file named save-bruvi-session.js and write the following code:
const puppeteer = require('puppeteer');
const { writeFileSync } = require('fs');
(async () => {
  const browser = await puppeteer.launch({ headless: "new" });
  const page = await browser.newPage();
  // Product pages whose items will be added to the cart
  const productPages = [
    'https://bruvi.com/collections/espresso/products/euphoria-espresso',
    'https://bruvi.com/collections/espresso/products/causu-espresso',
    'https://bruvi.com/collections/espresso/products/espresso-forte'
  ];
  for (const pageUrl of productPages) {
    await page.goto(pageUrl, { waitUntil: 'domcontentloaded' });
    // Click the "Add to Cart" button
    await page.click("button.btn-secondary");
    // Give the site time to update its cookies.
    // Note: page.waitForTimeout() is no longer available in recent Puppeteer releases.
    await page.waitForTimeout(3000);
  }
  // Read the cookies and save them to disk (writeFileSync is synchronous, so no await)
  const cookies = await page.cookies();
  writeFileSync('bruvi-cookies.json', JSON.stringify(cookies, null, 2));
  await browser.close();
})();
Code Explanation:
- Launched the browser in headless mode, which is typically preferred for production-ready code.
- Stored all the product URLs in an array.
- Iterated through the array, visited each page, clicked the "Add to Cart" button, and waited for 3 seconds for cookies to load with page.waitForTimeout().
- Retrieved all cookies and saved them in a file named bruvi-cookies.json.
- Closed the browser after all the tasks finished.
After executing the script with node save-bruvi-session.js, assuming successful execution, you'll obtain a bruvi-cookies.json file that will resemble this:
[
{
"name": "_clsk",
"value": "1oyppj3%6C1701390862398%7C5%7C1%7Co.clarity.ms%2Fcollect",
"domain": ".bruvi.com",
... More
},
{
"name": "_ga",
"value": "GA1.1.1804130381.1702390648",
"domain": ".bruvi.com",
... More
},
... More
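The fixed three-second pause in save-bruvi-session.js is simple, but it either wastes time or turns out too short, and recent Puppeteer releases no longer ship page.waitForTimeout(). One possible alternative is to poll page.cookies() until an expected cookie appears. The sketch below is an illustration under assumptions: delay and waitForCookie are hypothetical helper names, and this guide does not state which cookie Bruvi.com sets when an item is added, so the cookie name must be found by inspecting the site.

```javascript
// Promise-based pause; a stand-in for page.waitForTimeout() where it is unavailable.
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper (not a Puppeteer API): poll page.cookies() until a cookie
// named `name` exists. Resolves with the cookie, or rejects after `timeout` ms.
async function waitForCookie(page, name, { timeout = 5000, interval = 250 } = {}) {
  const deadline = Date.now() + timeout;
  while (true) {
    const cookies = await page.cookies();
    const match = cookies.find((cookie) => cookie.name === name);
    if (match) return match;
    if (Date.now() >= deadline) {
      throw new Error(`Cookie "${name}" did not appear within ${timeout} ms`);
    }
    await delay(interval);
  }
}
```

Inside the product loop, a call such as await waitForCookie(page, 'cart') could replace the fixed wait, where 'cart' is only a placeholder name.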
The next step involves loading cookies from bruvi-cookies.json into another script named load-bruvi-session.js. This script aims to visit the home page of Bruvi.com and verify if the cookies successfully load the items added to the cart in the previous script.
Create another file named load-bruvi-session.js and write this code:
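const puppeteer = require('puppeteer');
const { readFileSync } = require('fs');
(async () => {
  const browser = await puppeteer.launch({ headless: "new" });
  const page = await browser.newPage();
  // Read the saved cookies and parse them into an array of cookie objects
  const cookies = JSON.parse(readFileSync("bruvi-cookies.json", "utf-8"));
  // Set the cookies before navigating so the first request already carries them
  await page.setCookie(...cookies);
  await page.goto("https://www.bruvi.com", { waitUntil: 'load' });
  await page.screenshot({ path: "bruvi.png" });
  // Wait for the browser to shut down before the script exits
  await browser.close();
})();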
Code Explanation:
- Started a new browser instance in headless mode and created a new blank page with browser.newPage().
- Read the cookies from the bruvi-cookies.json file, which returned a string, and then parsed it into an array of cookie objects using the JSON.parse() method.
- Loaded the cookies using the page.setCookie() method before visiting the home page of Bruvi.com.
- Captured a screenshot of the page with page.screenshot() after it had loaded.
- Closed the browser to complete the script.
The screenshot shows the Bruvi homepage, and in the top right corner, you can observe that three items have already been added to the cart. These items were not added by load-bruvi-session.js.
Instead, the loaded cookies carried that information over from the previous session. This illustrates how to pause and resume sessions, retaining state across different Puppeteer scripts through the use of cookies.
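In a longer-running project, the load step may run before any session has been saved, in which case readFileSync() throws. A minimal sketch of a more defensive loader follows; loadCookies is a hypothetical helper name rather than a Puppeteer API, and it only assumes that the object passed in exposes setCookie() the way a Puppeteer page does.

```javascript
const { existsSync, readFileSync } = require('fs');

// Hypothetical helper (not a Puppeteer API): read a cookie file and apply it
// to a page. Returns the number of cookies applied, or 0 when the file is
// missing or does not contain a non-empty array.
async function loadCookies(page, filePath) {
  if (!existsSync(filePath)) return 0;
  const cookies = JSON.parse(readFileSync(filePath, 'utf-8'));
  if (!Array.isArray(cookies) || cookies.length === 0) return 0;
  await page.setCookie(...cookies);
  return cookies.length;
}
```

A script could call await loadCookies(page, 'bruvi-cookies.json') before page.goto() and fall back to a fresh session when the result is 0.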
Limitations of Puppeteer in Managing Cookies
While Puppeteer is a powerful tool for automating browser interactions and managing cookies, it has some limitations and considerations when it comes to handling cookies.
Here are some limitations of Puppeteer in managing cookies:
- Absence of Cookie Events:
Puppeteer does not provide built-in events to listen for changes in cookies. Developers need to manually handle scenarios where cookies might be modified during script execution.
- Cross-Browser Context Limitations:
Cookies set in one browser context, like incognito mode, aren't automatically accessible in another context, leading to potential inconsistencies. This arises because each browser context maintains its distinct set of cookies, lacking automatic synchronization between them.
- Limited Support for Asynchronous Cookie Operations:
Puppeteer's cookie methods are asynchronous and return promises. Scripts must await them in the right order to avoid race conditions, such as reading cookies before the page has finished setting them.
- No Built-in Cookie Backup and Restore:
Puppeteer does not inherently provide functionality for easily backing up and restoring cookies. Implementing a reliable backup and restore mechanism may involve additional scripting.
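Because there is no cookie-change event to subscribe to, one workaround is to take a snapshot with page.cookies() before and after an action and compare the two. The following is a minimal sketch of that comparison; diffCookies is a hypothetical helper name, and it assumes that a cookie is identified by its name, domain, and path.

```javascript
// Hypothetical helper (not a Puppeteer API): compare two cookie snapshots,
// each an array like the one returned by page.cookies().
function diffCookies(before, after) {
  // A cookie is identified by its name, domain, and path together.
  const key = (c) => `${c.name}|${c.domain}|${c.path}`;
  const oldMap = new Map(before.map((c) => [key(c), c]));
  const newMap = new Map(after.map((c) => [key(c), c]));

  const added = after.filter((c) => !oldMap.has(key(c)));
  const removed = before.filter((c) => !newMap.has(key(c)));
  // Present in both snapshots, but with a different value.
  const changed = after.filter(
    (c) => oldMap.has(key(c)) && oldMap.get(key(c)).value !== c.value
  );
  return { added, removed, changed };
}
```

Calling diffCookies(snapshotBefore, snapshotAfter) around a click or navigation shows which cookies that step created, dropped, or updated.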
Alternatives to Puppeteer for Managing Cookies
If you're seeking a seamless approach to handle cookies alongside localStorage, sessionStorage, and IndexedDB, consider using the puppeteer-extra-plugin-session from Puppeteer-Extra.
This plugin serves as an extension to Puppeteer, enabling the export and import of session data, including cookies, localStorage, sessionStorage, and IndexedDB.
Install puppeteer-extra and puppeteer-extra-plugin-session with these commands:
npm install puppeteer-extra
npm install puppeteer-extra-plugin-session
To integrate the plugin with Puppeteer-Extra, you need to register it as follows:
const puppeteer = require('puppeteer-extra');
puppeteer.use(require('puppeteer-extra-plugin-session').default());
You can then dump and restore all cookies, localStorage, sessionStorage, and IndexedDB data using the provided functions:
const sessionData = await page.session.dump();
// Other Code
await page.session.restore(sessionData);
For more detailed information, refer to the plugin's documentation.
Conclusion
Cookies play a crucial role in preserving a website's state based on user interactions. They keep sessions alive across requests and remove the need to log in again on the same site.
Puppeteer offers methods for managing cookies on a page, allowing for easy saving, loading, and deletion.
It's important to note that the user state isn't solely recorded in cookies; other storage mechanisms such as localStorage, sessionStorage, indexedDB, and cache must also be considered when efficiently conducting web scraping tasks.
For more information, visit the official documentation of Puppeteer.
More Web Scraping Guides
If you would like to learn more about Web Scraping with Puppeteer, then be sure to check out The Puppeteer Web Scraping Playbook.
Or check out one of our more in-depth guides: