Python Scrapy vs Python Pyppeteer Compared
Web scraping is an essential technique for extracting data from websites. In the Python ecosystem, two prominent libraries stand out for this purpose: Scrapy and Python Pyppeteer.
While both are powerful in their own right, they cater to different needs and use cases. This article examines both libraries in detail, comparing their strengths, weaknesses, and ideal scenarios of application. In this article, we will cover:
- TLDR Python Scrapy vs Python Pyppeteer
- What is Scrapy
- What is Pyppeteer
- Detailed Comparison of Python Pyppeteer vs Scrapy Features
- Case Study: Scraping Amazon with Pyppeteer and Scrapy
- Additional Selenium and Scrapy Resources
Need help scraping the web?
Then check out ScrapeOps, the complete toolkit for web scraping.
Scrapy vs Python Pyppeteer for Web Scraping: Comparison Summary
While both Scrapy and Python Pyppeteer serve the purpose of web scraping, they come with distinctive features and functionalities that cater to different scraping needs and scenarios. The choice between the two hinges on the specific needs of your scraping endeavor.
- Scrapy: A high-performance, open-source framework for extracting data from websites.
- Python Pyppeteer: A Python port of the Node.js library Puppeteer, which provides a high-level API over the Chrome DevTools Protocol.
Here are the situations when you should consider using each library:
Feature/Aspect | Pyppeteer | Scrapy |
---|---|---|
Library Type | Python port of the node library Puppeteer | Web scraping framework |
JavaScript Execution | Yes (Full browser environment) | Limited (Requires middleware like Splash) |
Suitable for | Simpler projects | Large-scale projects |
Community & Support | Smaller community compared to Scrapy | Large community, extensive documentation |
Learning Curve | Moderate (knowledge of async programming is beneficial) | Steeper for beginners, but powerful for advanced users |
What is Python Pyppeteer?
Python Pyppeteer is a Python port of the Puppeteer library, a Node.js library that provides a high-level API to control headless Chrome or Chromium browsers over the DevTools Protocol.
In simpler terms, Pyppeteer allows you to automate browser tasks in Python, making it an invaluable tool for web scraping, especially when dealing with JavaScript-heavy websites.
Advantages of Using Python Pyppeteer For Web Scraping
Let's delve into the unique benefits and capabilities that Python's Pyppeteer library offers to modern data enthusiasts and developers.
- Dynamic Content Handling: Pyppeteer is particularly adept at handling websites that use JavaScript to dynamically render content. Traditional scraping tools often struggle with such sites because they only fetch the initial HTML content. In contrast, Pyppeteer controls a headless version of the Chrome browser, allowing it to interact with and retrieve content as it's dynamically loaded and rendered. This ensures that even content loaded asynchronously via AJAX or other JavaScript methods can be captured accurately.
- Intuitive Browser Automation: Pyppeteer provides an intuitive API for browser automation, closely mirroring the actions a real user would take. Whether it's filling out forms, clicking buttons, or navigating through multiple pages, Pyppeteer makes these tasks straightforward. Its API is designed with clarity in mind, making it easier for developers to write scripts that emulate real user interactions, which is especially beneficial for testing and scraping complex web applications.
- Comprehensive Page Interactions: Beyond simple page navigation and content extraction, Pyppeteer offers a comprehensive set of functions for interacting with web pages. This includes capabilities like taking screenshots, generating PDFs, and even recording videos of browser sessions. Additionally, it can emulate different devices, allowing developers to see how a page looks and behaves on mobile versus desktop, for instance. Such extensive interaction capabilities make it a versatile tool not just for web scraping, but for a range of browser automation tasks (see the sketch after this list).
- Stealth Mode: Web scraping often faces challenges from anti-bot measures implemented by websites. Pyppeteer's stealth mode helps mitigate this by making the scraping activity less detectable. When enabled, stealth mode applies various techniques to make the browser controlled by Pyppeteer appear more like a regular user's browser. This includes evading certain fingerprinting techniques that websites use to detect and block automated browsers, thereby increasing the chances of successful scraping without being blocked or banned.
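To make the page-interaction point above concrete, here is a minimal sketch of capturing a screenshot and a PDF with Pyppeteer. The target URL and output file names are placeholders chosen for illustration.

import asyncio
from pyppeteer import launch

async def capture_page():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('https://example.com')  # placeholder URL
    # Save a full-page screenshot and a PDF rendering of the page
    await page.screenshot({'path': 'example.png', 'fullPage': True})
    await page.pdf({'path': 'example.pdf'})
    await browser.close()

asyncio.run(capture_page())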
Disadvantages of Using Python Pyppeteer For Web Scraping
While Python's Pyppeteer library offers a plethora of advantages for web scraping, there are also some limitations and challenges that developers might encounter in real-world scenarios.
- Resource Intensive: Pyppeteer, being a browser automation tool, controls an actual browser instance (headless Chrome) to perform its tasks. This means that for every scraping session or task, it needs to spawn and manage a browser process. As a result, it can be more resource-intensive compared to lightweight scraping libraries that only make HTTP requests and parse HTML. This can be a concern when running multiple instances or when operating in environments with limited resources.
- Slower Execution: Due to the nature of browser automation, Pyppeteer can be slower than traditional scraping tools. The time taken to launch the browser, render pages, execute JavaScript, and interact with page elements can add up, especially when scraping multiple pages or websites. While it offers the advantage of capturing dynamically loaded content, this comes at the cost of execution speed.
- Complexity: While Pyppeteer provides a comprehensive API for browser interactions, this can also introduce complexity, especially for beginners or for simple scraping tasks. Setting up and managing browser sessions, handling page interactions, and dealing with asynchronous operations can be more complex compared to using straightforward scraping libraries.
- Limited Parallelism: Given its resource-intensive nature, running multiple concurrent instances of Pyppeteer can strain system resources. This can limit the degree of parallelism you can achieve, especially when compared to lightweight, request-based scrapers that can easily run hundreds of concurrent tasks. A common mitigation is to cap concurrency explicitly, as shown in the sketch after this list.
- No Longer Maintained: One significant concern with Pyppeteer is that, as of the last update, it appears to be no longer actively maintained. This means potential bugs, compatibility issues with newer versions of Chrome, or emerging web technologies might not be addressed promptly. It's essential to consider the long-term viability and support when choosing a tool for critical projects.
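As mentioned under Limited Parallelism, the usual workaround is to cap how many pages are open at once. The sketch below does this with asyncio.Semaphore; the URLs and the limit of 3 are arbitrary choices for illustration.

import asyncio
from pyppeteer import launch

async def fetch_title(browser, url, semaphore):
    # The semaphore ensures only a handful of pages are open at any one time
    async with semaphore:
        page = await browser.newPage()
        try:
            await page.goto(url)
            return await page.title()
        finally:
            await page.close()

async def main():
    browser = await launch()
    semaphore = asyncio.Semaphore(3)  # arbitrary concurrency limit
    urls = ['https://example.com', 'https://example.org']  # placeholder URLs
    titles = await asyncio.gather(*(fetch_title(browser, u, semaphore) for u in urls))
    print(titles)
    await browser.close()

asyncio.run(main())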
When Should You Use Python Pyppeteer Over Scrapy?
When considering the choice between Python Pyppeteer and Scrapy for web scraping, certain scenarios favor the utilization of Python Pyppeteer over Scrapy.
Let's explore in a bit more detail specific scenarios and projects where leveraging capabilities of Python Pyppeteer can be most beneficial for web scraping endeavors.
- Dynamic Content Scraping: Websites that render content using JavaScript often pose challenges for traditional scraping tools. Since Pyppeteer controls a real browser, it can easily wait for JavaScript to execute and fetch the rendered content.
- Interacting with Web Elements: When scraping tasks involve intricate user interactions such as form submissions, button clicks, and scrolling, Python Pyppeteer's ability to simulate user actions within a headless browser environment can simplify the scraping process.
- Capturing Screenshots or PDFs: Pyppeteer can capture screenshots or generate PDFs of web pages, which can be useful for documentation or archiving.
- Smaller Scale Projects: For simpler scraping projects that require quick setup and minimal configuration, Python Pyppeteer can offer a more straightforward and lightweight solution compared to the more comprehensive Scrapy framework.
- Handling Single Page Applications (SPAs): Pyppeteer excels at scraping Single Page Applications (SPAs) because it is built on top of the Puppeteer library, which controls a headless Chrome browser. As SPAs rely heavily on client-side rendering, traditional scraping tools might struggle to fetch the content. However, Pyppeteer can seamlessly navigate, wait for, and capture the dynamically loaded content, making it a preferred choice for SPAs (a minimal example follows this list).
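As a minimal example of the SPA/dynamic-content case, the sketch below waits for a JavaScript-rendered element before reading its text. The #content selector and URL are placeholders that depend entirely on the target site.

import asyncio
from pyppeteer import launch

async def scrape_spa(url):
    browser = await launch()
    page = await browser.newPage()
    # Wait until network activity settles, then for the rendered element to appear
    await page.goto(url, waitUntil='networkidle2')
    await page.waitForSelector('#content', timeout=30000)  # placeholder selector
    text = await page.evaluate('() => document.querySelector("#content").innerText')
    await browser.close()
    return text

print(asyncio.run(scrape_spa('https://example.com')))  # placeholder URL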
Setting Up Python Pyppeteer
Before installing Pyppeteer, ensure you have:
- Python 3.6 or higher installed. You can verify your Python version by running python --version in your terminal.
- pip, the Python package installer. If you don't have it, here's a guide to get you started.
Installing Pyppeteer is straightforward with pip. Run the following command in your terminal:
pip install pyppeteer
One of the perks of Pyppeteer is that it automatically downloads a compatible version of Chromium (the open-source browser that Chrome is based on) the first time you launch it. However, if you wish to download it manually, you can use:
import pyppeteer
pyppeteer.chromium_downloader.download_chromium()
Basic Python Pyppeteer Example
Once installed, you can start using Pyppeteer right away. Here's a simple example to launch a browser:
import asyncio
from pyppeteer import launch
async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('https://example.com')
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())
This script launches a browser, navigates to 'https://example.com', and then closes the browser.
What is Python Scrapy?
Scrapy is an open-source and collaborative web crawling framework for Python. It's designed to extract data from websites and save it in various formats such as CSV, JSON, or XML.
Unlike simple scripts that might use Python's built-in libraries to scrape content, Scrapy provides a full-fledged solution for large-scale and complex web scraping tasks.
Advantages of Using Scrapy For Web Scraping
Let's delve into the distinct advantages that make Scrapy a top choice for web scraping enthusiasts and professionals alike.
- High Performance: Scrapy offers a perfect blend of speed and efficiency, making it a top choice for developers aiming for high-performance web scraping.
  - At its core, Scrapy is an asynchronous framework, leveraging the power of Twisted, a popular event-driven networking engine. This means Scrapy can handle multiple requests concurrently without waiting for one to finish before starting another, resulting in faster data extraction.
  - Additionally, Scrapy's built-in caching, throttling, and delay mechanisms ensure efficient bandwidth usage and respectful crawling, minimizing the chances of getting banned or throttled by websites (see the settings sketch after this list).
  - Furthermore, its architecture is designed for extensibility, allowing developers to plug in custom middlewares or pipelines to further optimize the scraping process.
- Scalability: Scrapy stands out as a highly scalable web scraping framework, designed to handle large-scale data extraction with ease.
  - Its modular architecture allows for distributed crawling, meaning you can run multiple spiders on different machines to scrape data concurrently. This is facilitated by tools like Scrapy Cluster, which manages and distributes requests across several machines.
  - Additionally, Scrapy's settings can be fine-tuned to control the rate of requests, ensuring optimal performance without overloading servers. Its built-in support for middleware and pipelines also means that as your data processing needs grow, Scrapy can grow with you, handling everything from data cleaning to storage seamlessly.
- Structured Approach: Scrapy offers a structured approach to web scraping, ensuring that developers can maintain and scale their projects with ease. Unlike ad-hoc scripts, Scrapy projects are organized into distinct components: spiders define how to navigate and extract data, items structure the scraped data, pipelines process and store this data, and middlewares handle request/response processing.
- Built-in Features: Scrapy comes packed with a plethora of built-in features that make web scraping more efficient and user-friendly. Right out of the box, Scrapy provides functionalities like handling retries, setting download delays, respecting robots.txt rules, and rotating user agents, ensuring that your scraping activities are both efficient and respectful to target websites.
- Customizability: Scrapy stands out for its remarkable customizability, allowing developers to tailor the framework to their specific needs. At its core, Scrapy is designed to be modular, meaning each component, be it the downloader, spider, or pipeline, can be easily replaced or extended.
- Community and Resources: One of Scrapy's most significant advantages is its vibrant and active community. Being an open-source framework, Scrapy has garnered a large following of developers and enthusiasts who actively contribute to its growth.
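To illustrate the throttling, caching, and politeness controls mentioned above, here is a sketch of the kind of options you might put in a project's settings.py. The values shown are arbitrary examples, not recommendations.

# settings.py (example values only)
CONCURRENT_REQUESTS = 16          # how many requests Scrapy keeps in flight
DOWNLOAD_DELAY = 0.5              # polite delay (seconds) between requests to the same site
ROBOTSTXT_OBEY = True             # respect robots.txt rules

AUTOTHROTTLE_ENABLED = True       # adapt the crawl rate to the server's responsiveness
AUTOTHROTTLE_START_DELAY = 1.0

HTTPCACHE_ENABLED = True          # cache responses locally to avoid redundant requests
HTTPCACHE_EXPIRATION_SECS = 3600

RETRY_ENABLED = True              # retry failed requests
RETRY_TIMES = 2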
Disadvantages of Using Scrapy For Web Scraping
While Scrapy is a powerful tool with numerous benefits, it's essential to understand its limitations. We'll explore some of the challenges and drawbacks that users might encounter when using Scrapy for web scraping.
- Limited JavaScript Handling: By default, Scrapy fetches web pages in their raw HTML form, which means it doesn't execute JavaScript. Many modern websites rely heavily on JavaScript to load content dynamically, so Scrapy might miss out on this content. However, this can be overcome with Scrapy headless browser integrations.
- Authentication Challenges: Websites that require login often use sessions or cookies to track authenticated users. Although Scrapy's cookies middleware carries cookies between requests, it doesn't drive a browser-style login flow, so developers need to handle login requests, tokens, and session data themselves to stay authenticated. This can be cumbersome, especially when dealing with sites that have complex session management (a simple login sketch follows this list).
- Learning Curve: While Scrapy is a powerful web scraping framework, its extensive features and flexibility can present a steep learning curve for beginners. Unlike simpler libraries, mastering Scrapy requires understanding its unique architecture, including middlewares, pipelines, and settings. This complexity can be daunting for those new to web scraping, potentially leading to longer project initiation times.
- Resource-Intensive: Scrapy, while efficient, can be resource-intensive, especially when running multiple spiders or handling large-scale scraping tasks. Its concurrent requests and handling of vast amounts of data can strain memory and CPU resources. As a result, developers might need to optimize their Scrapy projects or invest in more robust hardware for smooth operation.
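To show what the manual authentication work looks like in practice, here is a minimal login sketch using scrapy.FormRequest.from_response. The login URL, form field names, and the 'Logout' check are hypothetical and depend entirely on the target site.

import scrapy

class LoginSpiderSketch(scrapy.Spider):
    name = 'login_sketch'
    start_urls = ['https://example.com/login']  # placeholder login page

    def parse(self, response):
        # Submit the login form; the field names here are hypothetical
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'user', 'password': 'secret'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Check for content that only appears when logged in, then continue crawling
        if b'Logout' in response.body:
            yield scrapy.Request('https://example.com/account', callback=self.parse_account)

    def parse_account(self, response):
        yield {'title': response.css('title::text').get()}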
When Should You Use Scrapy Over Python Pyppeteer?
When deciding between Scrapy and Python Pyppeteer for web scraping, several scenarios call for the use of Scrapy over Python Pyppeteer.
Let's delve into the ideal use cases where this framework truly demonstrates its prowess in web scraping.
- Large-scale Web Crawling: Scrapy shines brightest when tasked with large-scale web crawling projects, such as aggregating data from e-commerce sites, news portals, or directories. Its built-in capabilities for handling concurrency, rate limiting, and request delays make it adept at navigating vast websites or multiple domains efficiently. For businesses or researchers aiming to gather extensive datasets from the web, Scrapy provides a robust and scalable solution.
- Data Mining and Aggregation: Scrapy is exceptionally suited for data mining and aggregation tasks, such as extracting structured data from forums, review sites, or social media platforms for sentiment analysis or market research. Its ability to parse and filter vast amounts of data, combined with its extensible item pipelines, allows for efficient extraction, transformation, and storage of relevant data points. For organizations looking to derive insights from diverse web sources, Scrapy offers a comprehensive toolkit for streamlined data collection and processing.
- Regular Data Extraction: Scrapy excels in scenarios requiring regular data extraction, such as monitoring product prices, stock availability, or real estate listings from e-commerce or classified websites. Its built-in scheduling and caching capabilities ensure timely and efficient data retrieval without redundant requests. For businesses aiming to stay updated with dynamic web content and make informed decisions, Scrapy provides a robust and automated solution for consistent data harvesting.
- Custom Data Pipelines: Scrapy is particularly advantageous for projects that require tailored data processing and storage workflows. For instance, a media analysis company might use Scrapy's custom data pipelines to extract news articles, process the content for sentiment analysis, and then store the results in a specialized database or analytics platform. The flexibility of Scrapy's data pipeline system allows developers to seamlessly integrate various data processing tools and storage solutions, ensuring a streamlined and customized data flow tailored to specific project needs (a small pipeline sketch follows this list).
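As a rough illustration of such a custom pipeline, the sketch below appends each scraped item to a JSON Lines file. The file name is a placeholder, and a real pipeline would typically add validation or push to a database instead.

import json

class JsonLinesExportPipeline:
    # A minimal custom pipeline sketch: write each item to a JSON Lines file

    def open_spider(self, spider):
        self.file = open('items.jl', 'w', encoding='utf-8')  # placeholder output file

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + '\n')
        return item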
Setting Up Scrapy
Before installing Scrapy, ensure you have Python installed on your system.
Recent versions of Scrapy require Python 3 (Python 2 support was dropped in Scrapy 2.0), so make sure you're running a recent Python 3 release to get the latest features and updates.
The easiest way to install Scrapy is using pip, the Python package manager. Open your terminal or command prompt and run the following command:
pip install scrapy
After the installation is complete, you can verify it by running:
scrapy version
Basic Scrapy Example
- Once Scrapy is installed, you can start a new Scrapy project by using the following command:
scrapy startproject projectname
Replace projectname with the desired name for your Scrapy project; the examples below assume the project is called quotes_scraper. This will create a new directory with the specified name, containing the basic directory structure for a Scrapy project.
A typical Scrapy project will have the following directory structure:
projectname/
|-- scrapy.cfg
|-- projectname/
    |-- __init__.py
    |-- items.py
    |-- middlewares.py
    |-- pipelines.py
    |-- settings.py
    |-- spiders/
        |-- __init__.py
- scrapy.cfg: Contains configuration details for the Scrapy project.
- items.py: Defines the data structure for scraped items.
- middlewares.py: Contains middleware classes for processing requests and responses.
- pipelines.py: Defines item pipelines for post-processing scraped items.
- settings.py: Contains settings for the Scrapy project.
- spiders/: This directory will contain all your spiders (web crawlers).
- Define Items (items.py):

import scrapy

class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
- Set up a spider (quotes_spider.py inside the quotes_scraper/spiders directory):

import scrapy
from quotes_scraper.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = QuoteItem()
            item['text'] = quote.css('span.text::text').get()
            item['author'] = quote.css('span small::text').get()
            item['tags'] = quote.css('div.tags a.tag::text').getall()
            yield item

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
- Set up an item pipeline (pipelines.py):

class QuotesPipeline:
    def process_item(self, item, spider):
        # For demonstration, we'll just print the item
        print(item)
        return item
- Enable the item pipeline (in settings.py):

ITEM_PIPELINES = {
    'quotes_scraper.pipelines.QuotesPipeline': 300,
}
- Run the Spider:
scrapy crawl quotes
This Scrapy project will scrape quotes, authors, and tags from the website and print them using the defined item pipeline. The spider follows pagination links to scrape quotes from subsequent pages.
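Instead of only printing items in the pipeline, you can also let Scrapy's built-in feed exports write the scraped data straight to a file from the command line, for example:
scrapy crawl quotes -o quotes.json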
Detailed Comparison of Python Pyppeteer vs Scrapy Features
Navigating the intricate landscape of web scraping tools, it's essential to have a clear side-by-side comparison.
Here, we present a detailed table contrasting the features of Python Pyppeteer and Scrapy to aid in your decision-making process.
Feature/Aspect | Pyppeteer | Scrapy |
---|---|---|
Library Type | Headless browser automation | Web scraping framework |
JavaScript Execution | Yes (Full browser environment) | Limited (Requires middleware like Splash) |
Concurrency Model | Asyncio | Twisted |
Middleware Support | Limited | Extensive (Custom middlewares supported) |
Built-in Data Storage | No | Yes (JSON, XML, CSV, and more) |
Request Handling | Browser-based (can handle AJAX, WebSockets) | HTTP requests (can handle AJAX with additional setup) |
Extensions | Limited | Rich (Many built-in extensions and support for custom ones) |
Headless Browsing | Yes (uses Chrome in headless mode) | No (requires middleware like Splash) |
Cookies & Session | Full browser session support | Manual cookie handling |
Captcha Handling | Can be handled with manual intervention in browser | Requires third-party services or manual handling |
Community & Support | Smaller community compared to Scrapy | Large community, extensive documentation |
Integration with Cloud | Possible but requires additional setup | Built-in support for cloud storage like S3 |
Learning Curve | Moderate (knowledge of async programming is beneficial) | Steeper for beginners, but powerful for advanced users |
To sum up, both Scrapy and Python Pyppeteer are prominent Python libraries for web scraping.
- Scrapy is optimized for large-scale, high-speed projects and boasts a robust community support.
- In contrast, Python Pyppeteer shines when dealing with dynamic, JavaScript-heavy websites by controlling a headless Chrome browser.
The choice between them hinges on the specific needs of the scraping task at hand.
Case Study: A Side-by-Side Python Pyppeteer vs Scrapy Comparison
To illustrate the differences between the two technologies let’s scrape Amazon products with Scrapy and Pyppeteer and discuss the key points.
Overall, we'll show you how to create a web scraper using Scrapy and Pyppeteer to extract specific data from the Amazon website, handling pagination, and accessing individual product pages to collect relevant information.
Please note that the specific HTML structure, element locators, and class or ID attributes used in the code samples are based on the current state of the web page as of November 2023.
Due to possible updates or modifications to the webpage's design and structure, the element locators and CSS selectors mentioned in the examples may no longer be valid or effective.
Please leave a comment on the article if you encounter any discrepancies, changes, or issues with the provided code samples.
Scrapy
import scrapy
from scrapy.crawler import CrawlerProcess
from urllib.parse import urlencode, urljoin
SCRAPEOPS_API_KEY = '' # fill out with your API key
SEARCH_WORD = 'jeans'
def scrapeops_url(url):
    payload = {'api_key': SCRAPEOPS_API_KEY, 'url': url, 'country': 'us'}
    proxy_url = 'https://proxy.scrapeops.io/v1/?' + urlencode(payload)
    return proxy_url

class AmazonEbikeSpider(scrapy.Spider):
    name = 'amazon_spider'
    '''
    Here we are setting up:
    - output file name and format
    - headers - in order not to be blocked by the server
    - minimized concurrency - to avoid overloading the server
    '''
    custom_settings = {
        'FEEDS': {
            'output_data_scrapy.json': {
                'format': 'json',
                'encoding': 'utf8',
                'store_empty': False,
                'fields': None,
                'indent': 4,
                'item_export_kwargs': {
                    'export_empty_fields': True,
                }
            }
        },
        'ROBOTSTXT_OBEY': True,
        'CONCURRENT_REQUESTS': 3,
        'DOWNLOAD_DELAY': 1,
        'DEFAULT_REQUEST_HEADERS': {
            "upgrade-insecure-requests": "1",
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Windows; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36",
            "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
            "sec-ch-ua": "\".Not/A)Brand\";v=\"99\", \"Google Chrome\";v=\"103\", \"Chromium\";v=\"103\"",
            "sec-ch-ua-mobile": "?0",
            "sec-ch-ua-platform": "\"Windows\"",
            "sec-fetch-site": "none",
            "sec-fetch-mod": "",
            "sec-fetch-user": "?1",
            "accept-encoding": "gzip, deflate, br",
            "accept-language": "bg-BG,bg;q=0.9,en-US;q=0.8,en;q=0.7"
        }
    }

    def start_requests(self):
        urls = [
            f'https://www.amazon.com/s?k={SEARCH_WORD}'
        ]
        for url in urls:
            yield scrapy.Request(url=scrapeops_url(url), callback=self.parse)

    def parse(self, response):
        url_list = []
        # get the numbers of the available search pages
        available_pages = response.xpath(
            '//a[contains(@class, "s-pagination-item")][not(contains(@class, "s-pagination-separator"))]/text()'
        ).getall()
        # create a list of search page urls
        for page in available_pages[:2]:  # for illustration purposes let's take only the first 2 pages
            search_url_paginated = f'https://www.amazon.com/s?k={SEARCH_WORD}&page={page}'
            url_list.append(search_url_paginated)
        for url in url_list:
            yield response.follow(scrapeops_url(url), self.parse_products_links)

    def parse_products_links(self, response):
        search_products = response.css("div.s-result-item[data-component-type=s-search-result]")
        for product in search_products:
            # extract the relative url from the search product
            relative_url = product.css("h2>a::attr(href)").get()
            # form a product url
            product_url = urljoin('https://www.amazon.com/', relative_url).split("?")[0]
            yield response.follow(scrapeops_url(product_url), self.parse_product)

    def parse_product(self, response):
        product_name = response.css('span#productTitle::text').get()
        if product_name is None:
            product_name = "not available"
        product_price = response.css('.a-price-whole::text').get() or response.css('span#priceblock_dealprice::text').get()
        if product_price is None:
            product_price = "not available"
        if product_name and product_price:
            yield {
                'product_name': product_name.strip(),
                'product_price': product_price.strip()
            }
# Running the spider
process = CrawlerProcess()
process.crawl(AmazonEbikeSpider)
process.start()
Comment:
- The spider first extracts the links of individual product pages from the search results.
- It then visits each product page concurrently to extract the product name and price.
- The spider follows pagination links (limited to the first two result pages in this example).
- The data is saved in output_data_scrapy.json.
Note: Adjusting concurrency settings can increase the load on the target server. Always ensure you're respectful of the server's resources and terms of use. Too many requests in a short time can lead to IP bans or other restrictions. It's always a good idea to test with lower concurrency first and then gradually increase it while monitoring the server's response.
Pyppeteer
import asyncio
import os
import shutil
import stat
import tempfile
from pyppeteer import launch
import pyppeteer.errors
import json
# Proxy setup
PROXY_SERVER = 'proxy.scrapeops.io'
PROXY_SERVER_PORT = '5353'
SCRAPEOPS_API_KEY = '' # Replace with your API key
# Search setup
SEARCH_WORD = "jeans"
def remove_readonly(func, path, _):
    """Clear the readonly bit and reattempt the removal.
    Once the script finishes working, this function will help us remove the temporary user data created by the script.
    """
    os.chmod(path, stat.S_IWRITE)
    func(path)

async def handle_request(req):
    custom_headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-US,en;q=0.9",
        "Cache-Control": "max-age=0",
        "Connection": "keep-alive",
        "DNT": "1",  # Do Not Track request header
        "Referer": "https://www.google.com/",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36"
    }
    # Add custom headers, overriding the existing ones
    headers = {**req.headers, **custom_headers}
    # Add error handling to the handle_request function
    # to catch any exceptions that might occur during request processing
    try:
        if req.resourceType == 'image':
            await req.abort()
        else:
            await req.continue_({'headers': headers})
    except Exception as e:
        print(f"Error in request handling: {e}")

async def extract_product_details(browser, product_url, semaphore):
    # use a semaphore to respect concurrency limits
    async with semaphore:
        page = None
        # in case product details are not found, their values stay "not available"
        product_name = "not available"
        product_price = "not available"
        # we have to wrap our code in try...except blocks
        # to make sure that we close browser pages gracefully in case any exceptions arise
        try:
            page = await browser.newPage()
            # Set up request interception to abort image requests and to set custom headers
            await page.setRequestInterception(True)
            page.on('request', lambda req: asyncio.create_task(handle_request(req)))
            try:
                await page.goto(product_url, waitUntil='domcontentloaded', timeout=120000)
            except pyppeteer.errors.TimeoutError:
                print(f'TimeoutError while loading {product_url}')
                return {'name': "not available", 'price': "not available"}
            # we should make sure that we don't perform any actions
            # on pages that might have been closed due to any exceptions
            if not page.isClosed():
                try:
                    # set up a custom wait for the product name and parse it
                    await page.waitForSelector("span#productTitle", timeout=60000)
                    product_name = await page.evaluate('() => document.querySelector("span#productTitle").innerText')
                except Exception as e:
                    print(f'Error fetching product name for {product_url}: {e}')
                try:
                    product_price = (await page.evaluate('() => document.querySelector(".a-price-whole").innerText')).replace('\n', '')
                except Exception as e:
                    print(f'Error fetching product price for {product_url}: {e}')
        except Exception as e:
            print(f'Error in extract_product_details for {product_url}: {e}')
        finally:
            # make sure that the page is closed if it was not
            if page and not page.isClosed():
                await page.close()
        return {'name': product_name, 'price': product_price}

async def scrape_amazon_products():
    browser = None
    temp_dir = tempfile.mkdtemp()
    # we have to wrap our code in try...except blocks
    # to make sure that we close browser pages gracefully in case any exceptions arise
    try:
        browser = await launch(
            ignoreHTTPSErrors=True,
            headless=True,
            defaultViewport={'width': 1900, 'height': 1080},
            args=['--start-fullscreen',
                  '--no-sandbox',
                  f'--proxy-server=http://scrapeops.headless_browser_mode=true:{SCRAPEOPS_API_KEY}@{PROXY_SERVER}:{PROXY_SERVER_PORT}',
                  ],
            userDataDir=temp_dir
        )
        page = await browser.newPage()
        # Set up request interception to abort image requests and to set custom headers.
        await page.setRequestInterception(True)
        page.on('request', lambda req: asyncio.create_task(handle_request(req)))
        await page.setUserAgent(
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3')

        all_product_urls = []
        # Sequentially scrape the first two search results pages for illustration purposes
        for page_num in range(1, 3):
            url = f"https://www.amazon.com/{SEARCH_WORD}/s?k={SEARCH_WORD}&page={page_num}"
            await page.goto(url)
            product_links = await page.querySelectorAll('h2 a.a-link-normal')
            product_urls = [await page.evaluate('(link) => link.href', link) for link in product_links]
            all_product_urls.extend(product_urls)
        await page.close()

        # Set up a semaphore to limit concurrency, otherwise we might overwhelm the server and get blocked
        semaphore = asyncio.Semaphore(3)
        # Concurrently scrape product details for each URL
        tasks = [extract_product_details(browser, product_url, semaphore) for product_url in all_product_urls]
        products = await asyncio.gather(*tasks)

        # Save the data as JSON
        with open('output_data_pyppeteer.json', 'w') as f:
            json.dump(products, f)
    except Exception as e:
        print(f'An error occurred: {e}')
    finally:
        if browser:
            await browser.close()
        # Introduce a small delay to ensure that all asynchronous operations have completed and that
        # all resources, especially those related to the browser and its pages, have been properly released.
        await asyncio.sleep(2)
        try:
            shutil.rmtree(temp_dir, onerror=remove_readonly)
        except Exception as e:
            print(f"Error removing temporary directory: {e}")
asyncio.run(scrape_amazon_products())
Comment:
- The extract_product_details function is responsible for extracting product details from each product page.
- We use asyncio.gather to run multiple tasks concurrently.
- This approach will speed up the scraping process by processing multiple product pages at the same time. However, be cautious not to overload the website with too many requests in a short period, as it may lead to temporary IP bans or other restrictions. Adjust the concurrency level as needed.
- Ensure that you have the right to use the proxy and that you're not violating any terms of service.
Now, with full working code for both crawlers in hand, we can compare them on the key points:
- Storing the Data:
  - When it comes to data storage, Scrapy has a clear edge. It offers out-of-the-box support for exporting your scraped data into popular formats. With its powerful pipelines and feed exports, you can effortlessly push your data to various databases and storage solutions, including cloud-based and FTP systems.
  - Contrast this with Pyppeteer, which leaves data storage entirely up to the developer. There's no built-in mechanism, so you'll need to design your own storage solution.
- Managing Concurrency:
  - Concurrency is where Scrapy shines. It comes with a robust, built-in system to manage multiple requests simultaneously, ensuring efficient data scraping.
  - While Pyppeteer can also handle concurrent crawling using asyncio, setting it up and customizing it might demand more elbow grease compared to Scrapy.
- Performance Metrics:
  - Pyppeteer, being browser-based, lags behind Scrapy in terms of speed. The overhead of loading and rendering web pages in a browser environment makes it inherently slower.
- Navigating and Crawling Pages:
  - Pyppeteer's browser-based nature means page navigation is somewhat sluggish compared to Scrapy. If you're thinking of combining Scrapy's concurrency with Pyppeteer, there's a catch. Every new link you follow with Pyppeteer necessitates initiating a new page object. This diminishes the advantage of browser-based navigation where you can click and follow links.
- Speed:
  - Setting up Pyppeteer demands more initial work compared to Scrapy. The boilerplate code required and the inherent slowness of browser-based scraping means Scrapy is the faster option in both setup and execution.
- Proxy Integration:
  - Scrapy has a built-in mechanism for integrating proxies, making it easy to rotate and manage them. Check out Scrapy's documentation on proxy middleware for more details (a minimal example follows this list).
  - Pyppeteer, on the other hand, does support proxy integration, but if you're looking to implement a proxy rotation system, be prepared for some additional work or for using a third-party proxy provider. In the example above, we used a third-party proxy provider for demonstration purposes.
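For reference, the most bare-bones way to route a Scrapy request through a proxy is the proxy request meta key handled by Scrapy's HTTP proxy middleware. The proxy address below is a placeholder, and rotation logic would normally live in a custom downloader middleware or a third-party plugin.

import scrapy

class ProxyExampleSpider(scrapy.Spider):
    name = 'proxy_example'

    def start_requests(self):
        # The proxy URL is a placeholder; credentials and host depend on your provider
        yield scrapy.Request(
            'https://example.com',
            meta={'proxy': 'http://user:password@proxy.example.com:8000'},
            callback=self.parse,
        )

    def parse(self, response):
        yield {'status': response.status}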
Which one is the winner?
In the face-off between Scrapy and Pyppeteer as web crawling tools, Scrapy emerges as the superior choice:
- Scrapy excels with its integrated data storage and efficient concurrent scraping capabilities.
- Pyppeteer, being browser-based, faces challenges in speed and navigation, requiring more developer effort for setup and data storage.
Additional Python Pyppeteer & Scrapy Resources
When diving into the realm of web scraping with Python's Pyppeteer and Scrapy, having the right resources at your fingertips is crucial.
This article provides a concise list of official documentation for both tools, ensuring you have direct access to authoritative and up-to-date information.
More Web Scraping Guides
Both Scrapy and Python Pyppeteer have their strengths. If you're looking for a mature, extensible, and high-performance framework, Scrapy might be your best bet. However, if you need to interact with modern web applications, emulate user behaviors, or require detailed control over browser sessions, Python Pyppeteer could be the way to go.
Now, you have a good understanding of Scrapy and Python Pyppeteer. If you would like to get more information about web scraping, check out our other tutorials: