Playwright Guide: How To Find Elements by XPath
The ability to locate elements on a web page is essential when using web automation libraries. While CSS selectors are often the go-to method for targeting elements, there are situations where they fall short. This is where XPath shines: it offers a different way of navigating the complex structure of a web page.
In this guide, we will explore how to find elements by XPath in Playwright. We will go through step-by-step instructions and practical examples to help you master this valuable skill.
- TLDR: How To Find Elements by XPath
- What is XPath
- Understanding XPath Syntax
- Choosing Between XPath and CSS: What You Need to Know
- How to Find Elements with XPath
- Selecting Elements with XPath (page.locator() method)
- Waiting for XPath to be Available
- Performing Actions with XPath
- Real-world Applications: XPath in Practice
- XPath Best Practices: Maximizing Efficiency
- Conclusion
- More Web Scraping Guides
TLDR: How To Find Elements by XPath
In Playwright, the usual pattern for selecting elements by XPath is to pass the expression to the page.locator() method. The method returns an element locator that can be used to perform actions on the page or frame.
Below is a Playwright script that uses an XPath query to find an element on scrapethissite.com and print its text to the console.
Let's look at the following script:
const playwright = require("playwright");

async function getInnerText() {
  const browser = await playwright.chromium.launch();
  const context = await browser.newContext();
  const page = await context.newPage();
  // Navigate to the desired URL
  await page.goto("https://www.scrapethissite.com/pages/simple/");
  // Pass the XPath expression copied from the browser to the locator method.
  // This expression selects the first country on the website.
  // locator() returns immediately, so it does not need to be awaited.
  const element = page.locator(
    `//*[@id="countries"]/div/div[4]/div[1]/h3`
  );
  // With the element located, innerText() returns the text inside it
  const innerText = await element.innerText();
  console.log(innerText);
  await browser.close();
}

getInnerText();
Andorra
In this script:
- We launched a Chromium browser instance using Playwright.
- A new browsing context and page were created within this browser instance.
- The page.goto() method was used to navigate to the desired URL.
- Inside the page.locator() method, we inserted the XPath expression retrieved from the browser. This XPath expression selects the first country on the website.
- After locating the desired element, its inner text was extracted using the innerText() method.
- The extracted inner text was then logged to the console.
- Finally, the browser instance was closed.
What is XPath
XPath (XML Path Language) is a query language used to navigate XML documents and select nodes based on their properties. It provides a way to locate and retrieve information from XML documents by specifying the paths to elements or attributes within the document's hierarchical structure.
XPath is like a map that helps you find specific information in a document, like an XML file. Imagine you have a big book with chapters, paragraphs, and sentences. XPath is like a guide that tells you where to look in the book to find what you need.
For example, if you're looking for a specific paragraph in a chapter, XPath would help you locate it by giving you directions like "Go to chapter 3, then find the second paragraph."
Understanding XPath Syntax
In XML documents, everything is considered a node. This includes elements, attributes, text within elements, and even the document itself. XPath uses expressions to navigate through the nodes in an XML document. These expressions are like paths that guide you to specific nodes.
For instance, let's examine the HTML snippet:
<h3 class="country-name">Andorra</h3>
Accompanied by its corresponding XPath expression:
//h3[@class='country-name']
Now, let's decipher each component of this XPath expression:
- //: This double forward slash selects matching nodes anywhere in the document, at any depth below the root, rather than at one fixed position.
- h3: This denotes the target element we're searching for, specifically an h3 element.
- [@class='country-name']: Within square brackets, this is a condition (a predicate) that filters the selection based on the value of the class attribute. Here, we're looking for an h3 element with a class attribute equal to country-name.
In summary, the XPath expression //h3[@class='country-name'] instructs XPath to locate any h3 element with a class attribute set to country-name, searching from the root of the document, effectively guiding us to the specific node that represents the country name in the HTML snippet.
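To make the three parts concrete, here is a minimal sketch that assembles the same expression from its pieces. The helper name byAttribute is hypothetical (it is not part of Playwright), and it assumes the attribute value contains no single quotes.

```javascript
// Hypothetical helper, not a Playwright API: builds "//tag[@attr='value']".
// Assumes `value` contains no single quote characters.
function byAttribute(tag, attr, value) {
  const anywhere = "//"; // search at any depth in the document
  const predicate = `[@${attr}='${value}']`; // filter on the attribute value
  return `${anywhere}${tag}${predicate}`;
}

const expression = byAttribute("h3", "class", "country-name");
console.log(expression); // //h3[@class='country-name']
```

The resulting string can be passed directly to page.locator(), since Playwright treats selectors that start with // as XPath.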
Types of XPath
XPath can be classified into different types based on how they navigate the XML document structure:
Absolute XPath
Absolute XPath expressions start with the root node of the document and traverse down the hierarchy to reach the desired element. They specify the exact location of an element in the document tree, irrespective of its context.
Here's an example:
/html/body/div[1]/form/input[2]
In this expression, /html represents the root element, followed by the path to the desired input element.
Relative XPath
Relative XPath expressions, on the other hand, rely on the context of the current node to locate elements. They offer more flexibility as they don't depend on the entire document structure but rather on the element's position relative to another element.
Here's an example:
//input[@name='username']
This expression starts anywhere in the document (//) and searches for an input element with the attribute name set to username.
Absolute XPath expressions are advantageous for tasks requiring precise targeting in stable document structures. Their direct path specification ensures accuracy, making them ideal for automated testing or scraping tasks where the document is never changed.
Relative XPath expressions offer flexibility and adaptability, making them suitable for dynamic environments where document structures may change frequently. By relying on element relationships rather than specific paths, they simplify maintenance efforts and enhance reusability across different parts of the document.
Here are some frequently used Relative XPath expressions:
Action | Expression | Description |
Selecting by Tag Name | //div | Selects all 'div' elements in the document. |
Selecting by Class Name | //div[@class='container'] | Selects all 'div' elements with the class attribute equal to 'container'. |
Selecting by ID | //*[@id='header'] | Selects the element with the ID attribute equal to 'header'. |
Selecting by Attribute | //input[@type='text'] | Selects all 'input' elements with the type attribute equal to 'text'. |
Selecting by Text Content | //p[text()='Welcome'] | Selects all 'p' elements with the exact text content 'Welcome'. |
Selecting by Partial Text Content | //a[contains(text(),'Click')] | Selects all 'a' elements whose text content contains the substring 'Click'. |
Selecting by Position | (//ul/li)[1] | Selects the first 'li' child of a 'ul' in document order (a single element). Without the parentheses, //ul/li[1] selects the first 'li' of every 'ul'. |
Selecting by Parent/Child Relationships | //div[@class='parent']/child::span | Selects all 'span' elements that are direct children of 'div' elements with the class attribute equal to 'parent'. |
Selecting by Ancestor | //span/ancestor::div | Selects all 'div' elements that are ancestors of 'span' elements. |
Selecting by Following-sibling | //h2/following-sibling::p | Selects all 'p' elements that are siblings following 'h2' elements. |
Selecting by Preceding-sibling | //p[@class='info']/preceding-sibling::h2 | Selects all 'h2' elements that are siblings preceding 'p' elements with the class attribute equal to 'info'. |
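One caveat with selecting by class name: a predicate like [@class='container'] compares the whole attribute string, so it does not match an element written as class="container wide". A common XPath 1.0 idiom for matching a single class token is sketched below. The helper name hasClass is an assumption for illustration, not a Playwright function.

```javascript
// Hypothetical helper: matches elements whose class list contains `token`,
// even when other classes are present. Assumes `token` has no quotes or spaces.
function hasClass(tag, token) {
  // Pad the normalized class string with spaces so whole tokens are compared.
  return `//${tag}[contains(concat(' ', normalize-space(@class), ' '), ' ${token} ')]`;
}

console.log(hasClass("div", "container"));
```

This is more verbose than the CSS equivalent (div.container), which is one reason class-based lookups are often left to CSS selectors.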
Choosing Between XPath and CSS: What You Need to Know
When it comes to web scraping or automating interactions with web pages, selecting the right tool for targeting elements is crucial. Two primary methods for this purpose are XPath and CSS selectors.
Let's dive into the nuances of each and explore when to use them.
Understanding CSS Selectors
CSS selectors serve as patterns to pinpoint and style elements within HTML documents. They offer a concise syntax for identifying elements based on various criteria such as element type, class, ID, attributes, and hierarchical relationships.
Here's how you can leverage CSS selectors in your scripts:
const playwright = require("playwright");

async function getInnerText() {
  const browser = await playwright.chromium.launch();
  const context = await browser.newContext();
  const page = await context.newPage();
  // Navigate to the desired URL
  await page.goto("https://www.scrapethissite.com/pages/simple/");
  // The <h3> we want to scrape has the class "country-name".
  // page.$() returns the first element on the page that matches
  // the CSS selector (or null if nothing matches).
  const element = await page.$(".country-name");
  // With the element located, innerText() returns its text
  const innerText = await element.innerText();
  console.log(innerText);
  await browser.close();
}

getInnerText();
In this snippet, .country-name is the class selector used to locate the desired element on the webpage.
Advantages of CSS Selectors:
- Concise Syntax: CSS selectors provide a readable and succinct way to target elements.
- Widespread Support: They are universally supported by modern browsers.
- Ease of Use: Developers familiar with CSS find it intuitive to work with selectors.
Understanding XPath Selectors
XPath offers a broader range of capabilities for element selection, particularly in intricate HTML structures or when elements lack convenient identifiers.
XPath expressions allow for precise targeting of elements, making them invaluable for complex scraping tasks or automated testing scenarios.
Considerations with XPath:
- Reliability: While XPath can precisely target elements, expressions relying heavily on specific paths within the HTML structure may become less reliable if the page undergoes updates or redesigns.
- Complexity: XPath expressions can be intricate and challenging to comprehend, particularly for those new to XPath.
Choosing Between XPath and CSS Selectors
Both XPath and CSS selectors have their strengths and weaknesses, making them suitable for distinct scenarios:
- CSS Selectors excel in simplicity, readability, and widespread browser support, making them ideal for straightforward scraping tasks or when targeting elements with identifiable classes or IDs.
- XPath, on the other hand, shines in its precision and ability to navigate complex HTML structures. It's particularly useful when dealing with dynamic content or elements without clear identifiers.
When deciding between XPath and CSS selectors for web scraping or automation consider:
- the complexity of the task,
- the stability of the web page's structure, and
- your familiarity with each method.
By understanding the strengths and limitations of XPath and CSS selectors, you can make informed choices to streamline your development process and enhance the robustness of your scripts.
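As a rough illustration of the trade-off, the sketch below writes the same heading lookup in both syntaxes, then adds a query that walks from a heading's text to a neighbouring node, which is where XPath tends to be more convenient. The markup is an assumption: it supposes each country's h3 is followed by a sibling element containing a span with class country-capital. The function names are hypothetical, and page is assumed to be a Playwright Page that has already loaded the site.

```javascript
// Two selectors for the same heading, with the engine made explicit.
const selectors = {
  css: "css=h3.country-name",
  xpath: "xpath=//h3[contains(@class, 'country-name')]",
};

// Reads the first country heading with either engine ("css" or "xpath").
async function firstCountryName(page, engine) {
  return (await page.locator(selectors[engine]).first().innerText()).trim();
}

// Assumed markup: the capital sits in span.country-capital inside a
// sibling element that follows the <h3>. Assumes `country` has no quotes.
function capitalXPath(country) {
  return (
    `xpath=//h3[normalize-space()='${country}']` +
    `/following-sibling::*//span[@class='country-capital']`
  );
}

// Starts from the heading text and returns the text of the related node.
async function readCapital(page, country) {
  return page.locator(capitalXPath(country)).first().innerText();
}
```

If the real markup differs from the assumption above, only capitalXPath() needs to change.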
How to Find Elements with XPath
When it comes to finding elements using XPath, you have two primary approaches:
- Copying them from Dev Tools, which often yields the Absolute XPath, or
- Manually inspecting the webpage for a more adaptable Relative XPath.
Finding XPath Using Dev Tools
To determine the XPath expression for a particular element on a webpage, you can use browser developer tools.
Most modern browsers provide a feature to inspect elements and generate XPath expressions automatically.
We'll explore a step-by-step method for locating elements using XPath on the website scrapethissite.
Here's our process:
- Right-click on the element you want to locate.
- Select "Inspect" from the context menu to open the developer tools.
- Right-click on the highlighted element in the developer tools.
- Choose Copy, then either Copy XPath or Copy full XPath.
- If we select Copy XPath, we get a relative XPath for the element:
//*[@id="countries"]/div/div[4]/div[1]/h3
- If we select Copy full XPath, we get an absolute XPath for the element:
/html/body/div/section/div/div[4]/div[1]/h3
Following these steps, you can quickly obtain the XPath expression for any element on a webpage and use it in your Playwright scripts to interact with that element.
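Before relying on an expression, it helps to check how many elements it actually matches. The sketch below assumes a Playwright Page named page that has already loaded the site. The names asXPath and countMatches are hypothetical helpers. The explicit xpath= prefix matters for expressions that do not begin with // or .., which Playwright would otherwise not recognise as XPath.

```javascript
// Hypothetical helper: make the selector engine explicit.
function asXPath(expression) {
  return expression.startsWith("xpath=") ? expression : `xpath=${expression}`;
}

// Resolves to the number of elements the expression currently matches.
// `page` is assumed to be a Playwright Page that has already loaded the site.
async function countMatches(page, expression) {
  return page.locator(asXPath(expression)).count();
}
```

An expression meant to target a single element should report exactly one match, while a broad expression such as //h3 should report one match per heading. In Chrome DevTools, a similar check can be done in the console with $x("//h3").length.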
Finding XPath via Manual Inspection
Writing XPath expressions by hand, rather than copying them from developer tools, involves understanding the HTML/XML document's structure and applying XPath syntax rules. The developer tools are still useful here, but only for inspecting the markup.
Here's a general guide on manually generating XPath on the same website scrapethissite.
- Right-click on the element you want to locate.
- Select "Inspect" from the context menu to open the developer tools.
- Start from a common ancestor or the root if needed.
- Traverse to the target element, noting the structure.
- Construct a relative XPath from the closest common ancestor.
- Based on the structure you noted, you can create a relative XPath like:
//section[@id='countries']/div/div[4]/div[1]/h3
Selecting Elements with XPath (page.locator() method)
In Playwright, the page.locator() method is a powerful tool for locating elements within a web page's DOM (Document Object Model). It allows you to find elements based on various criteria, such as XPath expressions, CSS selectors, text content, and more.
The method returns an element locator that can be used to perform actions on the page or frame. The locator is resolved to the element immediately before performing an action, so a series of actions on the same locator can in fact be performed on different DOM elements.
const playwright = require("playwright");

async function getCountryNames() {
  const browser = await playwright.chromium.launch();
  const context = await browser.newContext();
  const page = await context.newPage();
  // Navigate to the desired URL
  await page.goto("https://www.scrapethissite.com/pages/simple/");
  const countryNamesLocator = page.locator("//h3"); // Relative XPath
  // Extract country names
  const countryNames = await countryNamesLocator.evaluateAll((elements) =>
    elements.map((el) => el.textContent.trim())
  );
  console.log(countryNames);
  await browser.close();
}

getCountryNames();
Let's break down the provided code step by step:
1. Import Playwright and Launch Chromium: We start by importing the Playwright library and launching a Chromium browser instance using playwright.chromium.launch().
2. Create a Browser Context: A new browser context is created using browser.newContext().
3. Create a New Page: Inside the context, a new page is created using context.newPage().
4. Navigate to a URL: The page navigates to the specified URL ("https://www.scrapethissite.com/pages/simple/") using page.goto(url).
5. Select Country Names Using XPath: The code selects country names from the page using the relative XPath selector //h3.
6. Evaluate and Extract Country Names: The evaluateAll method is used to extract the text content of all matching elements.
const countryNames = await countryNamesLocator.evaluateAll((elements) => elements.map((el) => el.textContent.trim()));
7. Print Country Names to Console: The country names are logged to the console with console.log(countryNames).
8. Close the Browser: Finally, the browser instance is closed.
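Because //h3 matches many elements, single-element methods such as innerText() would fail on that locator under Playwright's strict mode. A minimal sketch of narrowing the locator first follows. It assumes page is a Playwright Page already on the countries page, and summarizeHeadings is a hypothetical function name.

```javascript
// `page` is assumed to be a Playwright Page that has already loaded the site.
async function summarizeHeadings(page) {
  const headings = page.locator("//h3");
  return {
    total: await headings.count(), // number of matching elements
    first: (await headings.first().innerText()).trim(),
    third: (await headings.nth(2).innerText()).trim(), // nth() is zero-based
    all: await headings.allInnerTexts(), // text of every match
  };
}
```

first() and nth() return new locators, so they can be combined with any action (click(), fill(), and so on) in the same way.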
Waiting for XPath to be Available
Playwright automatically waits for elements to be ready before performing actions. However, there are scenarios where explicit waiting is necessary. The waitForSelector() method waits until an element matching the selector is present and visible before proceeding with further actions.
Locators combine waits and actions into a single atomic step, avoiding issues like stale handles and race conditions. Most of the time, you'll use specific actions (e.g., click(), type()) that inherently include waiting, so an explicit waitFor() is needed less often.
We will see this in action by scraping the books from the website Books to Scrape.
const playwright = require("playwright");

async function getBookTitles() {
  const browser = await playwright.chromium.launch();
  const context = await browser.newContext();
  const page = await context.newPage();
  // Navigate to the desired URL
  await page.goto("https://books.toscrape.com/");
  // Wait for the network to be idle so that everything is loaded
  await page.waitForLoadState("networkidle");
  // Wait for the XPath selector to be available
  await page.waitForSelector(`//*[@class="product_pod"]/h3`);
  const bookTitlesLocator = page.locator(`//*[@class="product_pod"]/h3`); // Relative XPath
  // Extract book titles
  const bookTitles = await bookTitlesLocator.evaluateAll((elements) =>
    elements.map((el) => el.textContent.trim())
  );
  console.log(bookTitles);
  await browser.close();
}

getBookTitles();
Above, we wait both for the network to be idle and for an element matching the XPath to be available.
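The same wait can also be expressed on a locator with waitFor(). The sketch below is one possible wrapper, not the only way to do it: waitForXPath is a hypothetical name, page is assumed to be a Playwright Page, and the timeout is detected by the error's name.

```javascript
// Waits until the first match of `xpath` is visible. Resolves to true on
// success and false when the timeout elapses. `page` is a Playwright Page.
async function waitForXPath(page, xpath, timeoutMs = 5000) {
  try {
    await page
      .locator(xpath)
      .first()
      .waitFor({ state: "visible", timeout: timeoutMs });
    return true;
  } catch (error) {
    // Playwright raises a TimeoutError when the element never shows up.
    if (error.name === "TimeoutError") return false;
    throw error; // anything else is unexpected, so surface it
  }
}
```

Returning a boolean lets a scraper skip an optional element instead of aborting the whole run.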
Performing Actions with XPath
XPath itself is primarily a query language for selecting nodes in an XML document, rather than directly performing actions. However, XPath is commonly used in conjunction with other functions in Playwright to perform actions such as:
Clicking Elements
Clicking is a fundamental action in web automation. With Playwright's page.click() method, we can simulate user interaction, trigger events, navigate through web applications, and select elements.
Clicking simulates user interaction with web elements such as buttons, links, or dropdown menus. Many web applications rely heavily on user input to trigger actions, navigate through different sections of the site, submit forms, or display dynamic content.
Let's dive into a practical example. Suppose we have a web page with a list of product cards, and we want to click on a specific product, say "Samsung Galaxy S6."
Here's how we can achieve this using Playwright and XPath:
const { chromium } = require("playwright");

async function click() {
  // Launch a new browser instance
  const browser = await chromium.launch();
  // Create a new page
  const page = await browser.newPage();
  // Navigate to the DemoBlaze website
  await page.goto("https://www.demoblaze.com/");
  // Define an XPath selector for the link you want to click
  const linkXPath = '//a[contains(text(), "Samsung galaxy s6")]';
  // Wait for the link to be visible
  await page.waitForSelector(linkXPath);
  // Click the link
  await page.click(linkXPath);
  // Wait for the network to be idle so that everything is loaded
  await page.waitForLoadState("networkidle");
  // Take a full-page screenshot of the product page
  await page.screenshot({ path: "screenshot-S6.png", fullPage: true });
  // Close the browser
  await browser.close();
}

click();
- We use Playwright’s Chromium module to launch a Chromium browser.
- The script creates a new page and navigates to the DemoBlaze website.
- We define an XPath selector for the link with the text “Samsung galaxy s6”.
- The script waits for the link to be visible on the page.
- We click the link.
- Wait for the page to load.
- Take a screenshot.
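When the link text comes from a variable, quotes inside it can break the XPath string. The sketch below shows one way to quote arbitrary text safely and then click through a locator. Both function names are hypothetical, and page is assumed to be a Playwright Page that is already on the target site.

```javascript
// Hypothetical helper: turn arbitrary text into a valid XPath 1.0 string literal.
function xpathLiteral(text) {
  if (!text.includes('"')) return `"${text}"`;
  if (!text.includes("'")) return `'${text}'`;
  // Both quote types are present: stitch the pieces together with concat().
  const pieces = text.split('"').map((piece) => `"${piece}"`);
  const separator = `, '"', `;
  return `concat(${pieces.join(separator)})`;
}

// Clicks the first link whose text contains `linkText` and returns the XPath used.
async function clickLinkByText(page, linkText) {
  const xpath = `//a[contains(text(), ${xpathLiteral(linkText)})]`;
  // click() waits for the element to be actionable before clicking.
  await page.locator(xpath).first().click();
  return xpath;
}
```

For example, clickLinkByText(page, "Samsung galaxy s6") builds the same expression as the script above without a separate waitForSelector() call.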
Capturing Screenshots
Screenshots provide visual confirmation of scraped data, aid in debugging and error detection, handle JavaScript-rendered content, verify navigation and pagination, and support documentation and reporting efforts.
We can use the built-in page.screenshot() method to take screenshots of websites. The screenshot API accepts many parameters for image format, clip area, quality, and more, so make sure to check them out.
const playwright = require("playwright");

(async () => {
  // Launch a browser (headless by default)
  const browser = await playwright.chromium.launch();
  // Create a new page
  const page = await browser.newPage();
  // Navigate to the target website
  await page.goto("https://webscraper.io/test-sites/e-commerce/allinone");
  // Take a screenshot (full page)
  await page.screenshot({ path: "screenshot.png", fullPage: true });
  // Close the browser
  await browser.close();
})();
If you want to learn more about taking screenshots with Playwright, see our comprehensive guide, The Node JS Playwright Guide.
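Screenshots can also be limited to a single element selected by XPath, since locators have their own screenshot() method. A minimal sketch, assuming page is a Playwright Page on a loaded site (screenshotElement is a hypothetical name):

```javascript
// Saves a screenshot of just the first element matched by `xpath`.
// `page` is assumed to be a Playwright Page. The output path is the caller's choice.
async function screenshotElement(page, xpath, path) {
  const target = page.locator(xpath).first();
  // locator.screenshot() waits for the element and scrolls it into view.
  await target.screenshot({ path });
  return path;
}
```

This is handy for confirming that an XPath expression points at the element you expect.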
Automated Scrolling
Scrolling is a crucial aspect of web automation, enabling you to capture dynamic and complete page content, perform efficient automation, and accurately select elements for further processing. Playwright's ability to simulate user interactions, including scrolling, makes it a powerful tool for automating browser-based tasks.
We often need to capture the entire content of a page, not just what's initially visible. By scrolling through the page programmatically, you ensure that all content, even the parts that are initially hidden or loaded dynamically, is accessible for scraping.
The code below uses Playwright's Chromium driver to launch a browser window and navigate to a specific URL. It then scrolls repeatedly to load more content until the page stops growing.
Finally, it extracts product titles using XPath selectors and logs them to the console.
const { chromium } = require("playwright");

(async () => {
  const browser = await chromium.launch({ headless: false });
  const context = await browser.newContext();
  const page = await context.newPage();
  // Navigate to a page that loads more products as you scroll
  await page.goto("https://www.small-shops.com/products");
  // Scroll down in a loop until no more content is loaded
  let previousHeight = 0;
  while (true) {
    // Jump to the current bottom of the page so that lazy loading is triggered
    await page.evaluate(() => {
      window.scrollTo(0, document.documentElement.scrollHeight);
    });
    // Wait for some time to allow content to load (adjust as needed)
    await page.waitForTimeout(2000); // 2 seconds
    // Get the current height of the page
    const currentHeight = await page.evaluate(() => {
      return document.documentElement.scrollHeight;
    });
    // If the height hasn't changed, we've reached the end of the page
    if (currentHeight === previousHeight) {
      break;
    }
    // Otherwise, update the previous height and continue scrolling
    previousHeight = currentHeight;
  }
  console.log("Scrolled to the end of the page!");
  // Locate the product titles using XPath
  const productTitleLocator = page.locator('//*[@id="product-title"]');
  // Evaluate the matched elements to extract their text content
  const productTitles = await productTitleLocator.evaluateAll((elements) =>
    elements.map((el) => el.textContent.trim())
  );
  console.log("Product Titles:", productTitles);
  await browser.close();
})();
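A page that never stops loading would keep an open-ended scroll loop running forever. One possible safeguard is a reusable function with a cap on the number of rounds, sketched below. The name scrollToBottom and its options are hypothetical, and page is assumed to be a Playwright Page.

```javascript
// Scrolls to the bottom repeatedly until the page height stops growing
// or `maxRounds` is reached. Returns the number of rounds performed.
async function scrollToBottom(page, { pauseMs = 2000, maxRounds = 20 } = {}) {
  let previousHeight = 0;
  for (let round = 1; round <= maxRounds; round++) {
    // Jump to the current bottom of the document.
    await page.evaluate(() =>
      window.scrollTo(0, document.documentElement.scrollHeight)
    );
    // Give lazily loaded content time to arrive.
    await page.waitForTimeout(pauseMs);
    const currentHeight = await page.evaluate(
      () => document.documentElement.scrollHeight
    );
    if (currentHeight === previousHeight) return round; // nothing new loaded
    previousHeight = currentHeight;
  }
  return maxRounds; // gave up: the page was still growing
}
```

A call such as await scrollToBottom(page, { maxRounds: 10 }) can then be placed before the XPath extraction step.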
Real-world Applications: XPath in Practice
Automated Authentication
Automated authentication is essential for accessing restricted content during web scraping. Implementing robust automated authentication mechanisms enhances the extraction of valuable data from authenticated sources.
Social media platforms often require users to log in to access personalized content, including user profiles, timelines, messages, and interactions. Automated authentication allows scraping scripts to authenticate with the platform using valid credentials, enabling access to personalized data that is tailored to specific user accounts.
Below, we will use Playwright to orchestrate a login process on a web page using XPath. Once we log in, we will take a screenshot of the website and close the browser.
const { chromium } = require("playwright");

async function login() {
  const browser = await chromium.launch();
  const context = await browser.newContext();
  const page = await context.newPage();
  // Navigate to the login page
  await page.goto("https://demo.applitools.com/");
  // Use XPath to locate the username and password fields
  const usernameField = page.locator('//input[@id="username"]');
  const passwordField = page.locator('//input[@id="password"]');
  // Fill in the username and password fields
  await usernameField.fill("username");
  await passwordField.fill("password");
  // Locate the login button (replace with the actual login button XPath)
  const loginButton = page.locator('//a[@id="log-in"]');
  // Click the login button
  await loginButton.click();
  // Wait for any necessary page load or AJAX requests
  await page.waitForLoadState("networkidle");
  // Take a screenshot
  await page.screenshot({ path: "login-screenshot.png" });
  // Close the browser
  await browser.close();
}

login();
- Launch a Chromium browser using Playwright.
- Create a new browsing context and a new page.
- Navigate to the login page using the page.goto() method.
- Locate the username and password fields using XPath expressions.
- Fill in the username and password fields using the fill() method.
- Locate and click the login button using another XPath expression.
- Wait for the page to reach a network idle state using page.waitForLoadState("networkidle").
- Capture a screenshot of the page shown after login using page.screenshot().
- Close the browser using browser.close().
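For a real account, credentials are usually kept out of the script. The sketch below reads them from environment variables and waits for the URL to change after the click. The variable names DEMO_USERNAME and DEMO_PASSWORD, both function names, and the afterLoginUrl pattern are all assumptions for illustration. page is assumed to be a Playwright Page that is already on the login page.

```javascript
// Reads credentials from an environment-style object instead of hardcoding them.
// DEMO_USERNAME and DEMO_PASSWORD are hypothetical variable names.
function readCredentials(env) {
  const { DEMO_USERNAME: username, DEMO_PASSWORD: password } = env;
  if (!username || !password) {
    throw new Error("Set DEMO_USERNAME and DEMO_PASSWORD before running");
  }
  return { username, password };
}

// Fills the login form located by XPath, clicks the button, and waits for
// the URL to match `afterLoginUrl` (a pattern supplied by the caller).
async function logIn(page, credentials, afterLoginUrl) {
  await page.locator('//input[@id="username"]').fill(credentials.username);
  await page.locator('//input[@id="password"]').fill(credentials.password);
  await page.locator('//a[@id="log-in"]').click();
  await page.waitForURL(afterLoginUrl);
}
```

A call would look like await logIn(page, readCredentials(process.env), pattern), where pattern is whatever URL the site is expected to show after a successful login.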
Extracting Data
Extracting data through web scraping provides access to structured information. We can implement robust web scraping techniques using Playwright and XPath to harness the wealth of information available on the web and use it for various applications.
We can retrieve many kinds of information, such as product details, news articles, contact information, and user reviews. This data can be used for various purposes, including market research, competitive analysis, lead generation, or content aggregation.
Below, we will see how to extract data from a fictitious book store and collect the rating, price, and name of each book.
const playwright = require("playwright");

(async () => {
  const browser = await playwright.chromium.launch({ headless: false });
  const context = await browser.newContext();
  const page = await context.newPage();
  // Navigate to the first catalogue page
  await page.goto("https://books.toscrape.com/catalogue/page-1.html");
  // XPath locator that selects every product card on the page
  const productContainers = page.locator('//*[@class="product_pod"]');
  // Run once in the browser over all matched cards and build plain objects
  const products = await productContainers.evaluateAll((containers) =>
    containers.map((container) => {
      const name = container.querySelector("h3 a").textContent.trim();
      // The rating is the second class name, e.g. "star-rating Three"
      const rating = container
        .querySelector("p.star-rating")
        .className.split(" ")[1];
      const price = container.querySelector("p.price_color").textContent.trim();
      return { Name: name, Rating: rating, Price: price };
    })
  );
  console.log(products);
  await browser.close();
})();
- Import the Playwright library for browser automation.
- Launch a Chromium browser window.
- Create a new browsing context within the browser.
- Open the Books to Scrape catalogue page.
- Define a selector that finds each book listing on the page.
- Extract the name, rating, and price from each listing.
- Print the scraped book details to the console.
- Close the browser window after finishing.
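The scraped values arrive as strings, so a small post-processing step is often useful. The sketch below assumes prices look like "£51.77" and ratings are the words One to Five, as on Books to Scrape. The function names and the sample record are made up for illustration.

```javascript
const RATING_WORDS = new Map([
  ["One", 1],
  ["Two", 2],
  ["Three", 3],
  ["Four", 4],
  ["Five", 5],
]);

// "£51.77" -> 51.77. Returns NaN when the text contains no number.
function parsePrice(text) {
  const match = text.replace(/,/g, "").match(/\d+(\.\d+)?/);
  return match ? Number(match[0]) : NaN;
}

// "Three" -> 3. Returns null for an unknown word.
function parseRating(word) {
  return RATING_WORDS.get(word) ?? null;
}

// Converts one scraped record into typed fields.
function normalizeProduct({ Name, Rating, Price }) {
  return { name: Name, rating: parseRating(Rating), price: parsePrice(Price) };
}

// Made-up sample record in the shape produced by the scraper.
console.log(
  normalizeProduct({ Name: "Example Book", Rating: "Three", Price: "£51.77" })
);
```

With numeric fields, the list can be sorted or filtered directly, for example products.map(normalizeProduct).filter((p) => p.rating >= 4).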
XPath Best Practices: Maximizing Efficiency
By adhering to best practices, you can maximize efficiency and optimize performance in your XPath implementations. Here are some key strategies to consider:
- Use Efficient XPath Expressions:
Use XPath expressions that are concise and target specific nodes. Avoid overly complex expressions that traverse the entire document unnecessarily. By focusing on the nodes you need, you reduce processing overhead and improve performance.
//div[@class='content']/p
- Utilize Predicates Judiciously:
Predicates in XPath allow you to filter nodes based on conditions. While they are powerful, excessive use can impact performance. Evaluate if predicates are truly necessary and try to keep them simple and efficient.
//book[price > 50]
- Leverage XPath Axes Wisely:
XPath axes enable navigation through the document tree in various directions. Understand the different axes available and choose the most appropriate one for your requirements. Using the correct axis can significantly enhance the efficiency of your XPath queries.
//div[@class='parent']/child::p
- Avoid Absolute XPath Paths:
While absolute XPath paths (starting from the root node) may seem convenient, they are fragile and less flexible, especially in dynamic environments where the document structure can change. Prefer relative XPath paths whenever possible, as they are more resilient to structural modifications and promote better maintainability. An absolute path such as the following is the kind of expression to avoid:
/library/section[1]/books/book[1]/title
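These guidelines can be turned into a quick self-check. The function below is a rough heuristic written for this guide, not a standard tool: it flags absolute paths and expressions with several positional indexes, both of which tend to break when a page layout changes. The name xpathWarnings and the threshold of two indexes are arbitrary choices.

```javascript
// Rough heuristic: lists reasons an XPath expression may be brittle.
function xpathWarnings(expression) {
  const warnings = [];
  // Absolute paths start with a single slash, e.g. /html/body/...
  if (expression.startsWith("/") && !expression.startsWith("//")) {
    warnings.push("absolute path");
  }
  // Count positional steps such as div[4].
  const positional = expression.match(/\[\d+\]/g) || [];
  if (positional.length >= 2) {
    warnings.push(`${positional.length} positional indexes`);
  }
  return warnings;
}

console.log(xpathWarnings("/library/section[1]/books/book[1]/title"));
// [ 'absolute path', '2 positional indexes' ]
console.log(xpathWarnings("//div[@class='content']/p"));
// []
```

An empty result does not guarantee a robust expression. It only means none of these two simple warning signs were found.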
Conclusion
In conclusion, using XPath in Playwright opens up a lot of possibilities for precise web automation. XPath's capability to navigate complex HTML structures and select elements with precision is invaluable, especially in scenarios where CSS selectors fall short.
While XPath offers advantages such as precise targeting and broad browser support, it's essential to consider its limitations, such as potential fragility with changes in HTML structure and complexity for newcomers.
Understanding when to leverage XPath's strengths and when to resort to alternative methods like CSS selectors ensures effective and robust web automation solutions.
Check out the official Playwright documentation for a deeper dive into XPath selectors.
More Playwright Web Scraping Guides
If you want more insights into the world of web scraping, we've got you covered! Check out our extensive The NodeJs Web Scraping Playbook or dive deeper into the different techniques of web scraping by exploring the following links: