Using Proxies With Python Pyppeteer
Pyppeteer is a Python port of Puppeteer, the powerful browser automation library maintained by Google that allows you to build bots and scrapers that can load and interact with web pages in the browser like a real user. Pyppeteer is a great way for Python developers to use the capabilities of Puppeteer.
In this guide for The Python Pyppeteer Web Scraping Playbook, we will look at how to integrate proxies into our Python Pyppeteer based web scraper.
There are a number of different types of proxies, each of which needs to be integrated with Pyppeteer slightly differently, so we will walk through how to integrate each type:
Using Proxies With Pyppeteer
The first and simplest type of proxy to integrate with Python Pyppeteer are simple HTTP proxies (in the form of an IP address) that don't require authentication. For example:
"11.456.448.110:8080"
To integrate this proxy IP into a Pyppeteer scraper, simply set the --proxy-server argument to the proxy URL and add it to the args list inside the launchOptions dictionary. Then launch the browser by calling the launch function with the launchOptions dictionary you just defined.
import asyncio
from pyppeteer import launch

proxy_url = '11.456.448.110:8080'

launchOptions = {
    'args': [
        f'--proxy-server={proxy_url}'
    ]
}

async def main():
    browser = await launch(launchOptions)
    page = await browser.newPage()
    await page.goto('https://httpbin.org/ip')
    page_content = await page.evaluate('() => document.body.innerText')
    print(page_content)
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())
Now when we run the script we can see that Pyppeteer is using the defined proxy IP:
{
  "origin": "11.456.448.110"
}
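If you launch browsers with different proxies in several places, it can help to wrap the option building in a small helper. The sketch below is our own convenience function, not part of Pyppeteer; the build_launch_options name and the headless default are assumptions:

```python
# Hypothetical helper (not part of Pyppeteer) that builds the
# launchOptions dictionary for a given proxy address.
def build_launch_options(proxy_url, headless=True):
    return {
        'headless': headless,
        'args': [f'--proxy-server={proxy_url}'],
    }

options = build_launch_options('11.456.448.110:8080')
print(options['args'])  # ['--proxy-server=11.456.448.110:8080']
```

You would then pass the returned dictionary straight to launch, exactly as with the hand-written launchOptions above.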
Using Authenticated Proxies With Pyppeteer
It is very common for commercial proxy providers to sell access to their proxy pools by giving you a single proxy endpoint that you send your requests to and authenticate your account with a username and password.
Using proxies that require username and password authentication isn't much different from using proxies without authentication. First, launch the browser as before by providing the proxy URL in the args param of launchOptions. Then simply call the page.authenticate method with your proxy username and password before making a request to a webpage with page.goto.
import asyncio
from pyppeteer import launch

proxy_url = '201.88.548.330:8080'
username = 'PROXY_USERNAME'
password = 'PROXY_PASSWORD'

launchOptions = {
    'args': [
        f'--proxy-server={proxy_url}'
    ]
}

async def main():
    browser = await launch(launchOptions)
    page = await browser.newPage()
    await page.authenticate({
        'username': username,
        'password': password
    })
    await page.goto('https://httpbin.org/ip')
    page_content = await page.evaluate('() => document.body.innerText')
    print(page_content)
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())
Now when we run the script we can see that Pyppeteer is using a proxy IP:
{
  "origin": "201.88.548.330"
}
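Some providers hand out credentials embedded in a single connection string such as username:password@host:port. A small helper like the one below can split that into the server address for --proxy-server and the credentials for page.authenticate. This is only a sketch under the assumption that your provider uses that format (the parse_proxy name is our own; check your provider's docs):

```python
from urllib.parse import urlparse

# Hypothetical helper: splits 'username:password@host:port' into the
# server address for --proxy-server and the credentials for
# page.authenticate. The string format is an assumption.
def parse_proxy(proxy_string):
    parsed = urlparse(f'//{proxy_string}')  # prepend '//' so urlparse treats it as a netloc
    return {
        'server': f'{parsed.hostname}:{parsed.port}',
        'username': parsed.username,
        'password': parsed.password,
    }

proxy = parse_proxy('PROXY_USERNAME:PROXY_PASSWORD@201.88.548.330:8080')
print(proxy['username'])  # PROXY_USERNAME
```

The 'server' value would go into the --proxy-server argument, and the 'username' and 'password' values into page.authenticate.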
Integrating Proxy APIs
Over the last few years there has been a huge surge in proxy providers offering smart proxy solutions that handle all the proxy rotation, header selection, ban detection and retries on their end. These smart proxies typically provide their service in an API endpoint format.
However, these proxy API endpoints don't integrate well with headless browsers when the website uses relative links, as Pyppeteer will try to attach the relative URL onto the proxy API endpoint instead of the website's root URL, resulting in some pages not loading correctly.
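You can see the problem with Python's own URL resolution rules. The proxy endpoint URL below is purely illustrative, not a real provider endpoint:

```python
from urllib.parse import urljoin

# Loaded directly, a relative link resolves against the real site:
print(urljoin('https://example.com/products/', '/about'))
# https://example.com/about

# Loaded through an API-endpoint-style proxy URL, the same relative link
# resolves against the proxy's domain instead (endpoint is illustrative):
print(urljoin('https://proxy.example.com/v1/?url=https://example.com/products/', '/about'))
# https://proxy.example.com/about
```

The second request goes to the proxy's domain rather than the target website, which is why pages with relative links break under API endpoint integrations.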
As a result, when integrating your Pyppeteer scrapers it is recommended that you use a provider's proxy port integration over the API endpoint integration when they offer one (not all providers have a proxy port integration).
For example, in the case of the ScrapeOps Proxy Aggregator we offer a proxy port integration for situations like this.
The proxy port integration is a light front-end for the API that has all the same functionality and performance as sending requests to the API endpoint, but allows you to integrate our proxy aggregator as you would any normal proxy.
The following is an example of how to integrate the ScrapeOps Proxy Aggregator into your Pyppeteer scraper:
import asyncio
from pyppeteer import launch

proxy_url = 'proxy.scrapeops.io:5353'
SCRAPEOPS_API_KEY = 'YOUR_API_KEY'

launchOptions = {
    'ignoreHTTPSErrors': True,
    'args': [
        f'--proxy-server={proxy_url}'
    ]
}

async def main():
    browser = await launch(launchOptions)
    page = await browser.newPage()
    await page.authenticate({
        'username': 'scrapeops',
        'password': SCRAPEOPS_API_KEY
    })
    await page.goto('https://httpbin.org/ip')
    page_content = await page.evaluate('() => document.body.innerText')
    print(page_content)
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())
Here we set username to 'scrapeops' and password to our SCRAPEOPS_API_KEY while calling page.authenticate.
Note: So that we can properly direct your requests through the API, your code must be configured to ignore SSL certificate verification errors by setting ignoreHTTPSErrors: True in launchOptions.
Full integration docs for Python Pyppeteer and the ScrapeOps Proxy Aggregator can be found here.
To use the ScrapeOps Proxy Aggregator, you first need an API key which you can get by signing up for a free account here which gives you 1,000 free API credits.
More Web Scraping Tutorials
So that's how you can use both authenticated and unauthenticated proxies with Pyppeteer to scrape websites without getting blocked.
If you would like to learn more about Web Scraping with Pyppeteer, then be sure to check out The Pyppeteer Web Scraping Playbook.
Or check out one of our more in-depth guides: