Python Selenium Examples

The following code examples show how to integrate the ScrapeOps Proxy Aggregator with your Python Selenium scrapers.

Authorisation - API Key

To use the ScrapeOps proxy, you first need an API key which you can get by signing up for a free account here.

Your API key must be included with every request using the api_key query parameter; otherwise, the API will return a 403 Forbidden Access status code.


Proxy Port Integration

When integrating the Proxy Aggregator with your Selenium scrapers, it is recommended that you use our proxy port integration over the API endpoint integration.

The proxy port integration is a light front-end for the API. It has all the same functionality and performance as sending requests to the API endpoint, but allows you to integrate our proxy aggregator as you would any normal proxy.

The username for the proxy is scrapeops and the password is your API key.


"http://scrapeops.headless_browser_mode=true:YOUR_API_KEY@proxy.scrapeops.io:5353"

We recommend the proxy port because using the API endpoint can create issues when loading pages and following links on websites that use relative links rather than absolute links.

We've also added the key/value headless_browser_mode=true to the username section of the proxy string, as this will optimize the proxy port for use with headless browsers like Selenium, Puppeteer, etc.

SSL Certificate Verification

Note: So that we can properly direct your requests through the API, your code must be configured to not verify SSL certificates.
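If you use Selenium Wire (covered in the integration section below), one way to meet this requirement is its verify_ssl option. The sketch below is a minimal illustration, not the official configuration: it assumes Selenium Wire's documented verify_ssl setting, which already defaults to False, and the helper name build_seleniumwire_options is hypothetical. It only builds the options dictionary and does not launch a browser.

```python
def build_seleniumwire_options(api_key: str) -> dict:
    """Build Selenium Wire options that route traffic through the proxy port
    and leave upstream SSL certificate verification switched off."""
    proxy_url = (
        f"http://scrapeops.headless_browser_mode=true:{api_key}"
        "@proxy.scrapeops.io:5353"
    )
    return {
        "proxy": {
            "http": proxy_url,
            "https": proxy_url,
            "no_proxy": "localhost,127.0.0.1",
        },
        # Assumption: Selenium Wire's `verify_ssl` option. False is its
        # default, so setting it here only makes the intent explicit.
        "verify_ssl": False,
    }


options = build_seleniumwire_options("YOUR_API_KEY")
print(options["verify_ssl"])
```

The resulting dictionary would be passed as seleniumwire_options when creating the driver.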

To enable extra/advanced functionality, you can pass parameters by adding them to the username, separated by periods.

For example, if you want to enable country geotargeting with US-based proxies, the username would be scrapeops.country=us.

Also, multiple parameters can be included by separating them with periods, for example:


"http://scrapeops.headless_browser_mode=true.country=us:YOUR_API_KEY@proxy.scrapeops.io:5353"

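Because the username is just scrapeops followed by period-separated key=value pairs, the proxy string can be assembled programmatically. The following is a minimal sketch under that assumption; the helper name build_proxy_url and the constants PROXY_HOST and PROXY_PORT are hypothetical names introduced for illustration, and no validation of parameter names is attempted.

```python
from typing import Dict, Optional

PROXY_HOST = "proxy.scrapeops.io"
PROXY_PORT = 5353


def build_proxy_url(api_key: str, params: Optional[Dict[str, str]] = None) -> str:
    """Join optional parameters onto the 'scrapeops' username with periods
    and return the full proxy URL."""
    parts = ["scrapeops"]
    for key, value in (params or {}).items():
        parts.append(f"{key}={value}")
    username = ".".join(parts)
    return f"http://{username}:{api_key}@{PROXY_HOST}:{PROXY_PORT}"


proxy_url = build_proxy_url(
    "YOUR_API_KEY",
    {"headless_browser_mode": "true", "country": "us"},
)
print(proxy_url)
# http://scrapeops.headless_browser_mode=true.country=us:YOUR_API_KEY@proxy.scrapeops.io:5353
```

Parameters appear in the username in the order they are listed in the dictionary.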

Integrating With Selenium Scrapers

To integrate our proxy with your Selenium scraper, we recommend that you use the Selenium Wire extension, which makes it very easy to use proxies with Selenium.

First, you need to install Selenium Wire using pip:


pip install selenium-wire

Then update your scraper to use seleniumwire's webdriver instead of the default selenium webdriver.


from seleniumwire import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup

SCRAPEOPS_API_KEY = 'YOUR_API_KEY'

## Define ScrapeOps Proxy Port Endpoint
proxy_options = {
    'proxy': {
        'http': f'http://scrapeops.headless_browser_mode=true:{SCRAPEOPS_API_KEY}@proxy.scrapeops.io:5353',
        'https': f'http://scrapeops.headless_browser_mode=true:{SCRAPEOPS_API_KEY}@proxy.scrapeops.io:5353',
        ## Hosts that bypass the proxy are comma-separated
        'no_proxy': 'localhost,127.0.0.1'
    }
}

## Set Up Selenium Chrome driver
## (Selenium 4 takes the driver path through a Service object)
driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()),
    seleniumwire_options=proxy_options
)

## Send Request Using ScrapeOps Proxy
driver.get('http://quotes.toscrape.com/page/1/')

## Retrieve HTML Response
html_response = driver.page_source

## Extract Data From HTML
soup = BeautifulSoup(html_response, "html.parser")
h1_text = soup.find('h1').text

print(h1_text)

## Close the browser when finished
driver.quit()

ScrapeOps will take care of the proxy selection and rotation for you, so you just need to send us the URL you want to scrape.


Response Format

After receiving a response from one of our proxy providers, the ScrapeOps Proxy API will respond with the raw HTML content of the target URL along with a response code:


<html>
<head>
...
</head>
<body>
...
</body>
</html>

The ScrapeOps Proxy API will return a 200 status code when it successfully gets a response from the website that also passes response validation, or a 404 status code if the website itself responds with a 404 status code. Both of these status codes are considered successful requests.

Here is the full list of status codes the Proxy API returns.
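In scraper code, the rule that both 200 and 404 count as successful can be captured in a small check. This is a minimal sketch; the names SUCCESSFUL_STATUS_CODES and is_successful_response are hypothetical, and any other status code (or a missing one) is simply treated as a candidate for a retry.

```python
from typing import Optional

# Status codes the Proxy API treats as successful requests.
SUCCESSFUL_STATUS_CODES = frozenset({200, 404})


def is_successful_response(status_code: Optional[int]) -> bool:
    """Return True when the status code counts as a successful request."""
    return status_code in SUCCESSFUL_STATUS_CODES


for code in (200, 404, 403, None):
    print(code, is_successful_response(code))
```

Note that a 404 is successful in the sense that it should not be retried, but it carries no page data to parse.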


Production Selenium Scraper

The above example works, but it isn't suitable for production scraping because it suffers from a number of issues:

  • Only scrapes one URL.
  • Assumes every request is successful and doesn't retry failed requests.
  • Opens a browser for every request.
  • Gives itself away as an automated browser by opening in sandbox mode, etc.
  • Loads all images, which generates extra requests and consumes more bandwidth.

The example below is a more production-ready Selenium scraper that deals with all the above issues:


import json

from seleniumwire import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup

SCRAPEOPS_API_KEY = 'YOUR_API_KEY'
NUM_RETRIES = 2

proxy_options = {
    'proxy': {
        'http': f'http://scrapeops.headless_browser_mode=true:{SCRAPEOPS_API_KEY}@proxy.scrapeops.io:5353',
        'https': f'http://scrapeops.headless_browser_mode=true:{SCRAPEOPS_API_KEY}@proxy.scrapeops.io:5353',
        ## Hosts that bypass the proxy are comma-separated
        'no_proxy': 'localhost,127.0.0.1'
    }
}


## Store The Scraped Data In This List
scraped_quotes = []

## Urls to Scrape
url_list = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
]


def status_code_first_request(performance_log):
    """
    Selenium makes it hard to get the status code of each request,
    so this function takes the Selenium performance logs as an input
    and returns the status code of the first response
    (or None if no response was logged).
    """
    for line in performance_log:
        try:
            json_log = json.loads(line['message'])
            if json_log['message']['method'] == 'Network.responseReceived':
                return json_log['message']['params']['response']['status']
        except (KeyError, TypeError, ValueError):
            ## skip log entries that are malformed or missing fields
            pass
    return None


## Optional --> define Selenium options
option = webdriver.ChromeOptions()
option.add_argument('--headless') ## --> comment out to see the browser launch.
option.add_argument('--no-sandbox')
option.add_argument('--disable-dev-shm-usage')
option.add_argument('--blink-settings=imagesEnabled=false')

## Enable Selenium logging
option.set_capability('goog:loggingPrefs', {'performance': 'ALL'})


## Set up Selenium Chrome driver
## (Selenium 4 takes the driver path through a Service object)
driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()),
    options=option,
    seleniumwire_options=proxy_options
)

try:
    for url in url_list:

        ## Reset for each URL so a failed page never reuses an old status
        status_code = None

        for _ in range(NUM_RETRIES):
            try:
                driver.get(url)
                performance_log = driver.get_log('performance')
                status_code = status_code_first_request(performance_log)
                if status_code in [200, 404]:
                    ## escape for loop if the API returns a successful response
                    break
            except WebDriverException as e:
                print("error", e)

        if status_code == 200:
            ## Feed HTML response into BeautifulSoup
            html_response = driver.page_source
            soup = BeautifulSoup(html_response, "html.parser")

            ## Find all quotes sections
            quotes_sections = soup.find_all('div', class_="quote")

            ## loop through each quotes section and extract the quote and author
            for quote_block in quotes_sections:
                quote = quote_block.find('span', class_='text').text
                author = quote_block.find('small', class_='author').text

                ## Add scraped data to "scraped_quotes" list
                scraped_quotes.append({
                    'quote': quote,
                    'author': author
                })
finally:
    ## Close the browser once, after all URLs have been processed
    driver.quit()


print(scraped_quotes)
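As an alternative to parsing the performance logs, Selenium Wire records the requests the browser makes and exposes them on driver.requests, where each entry has a url and, once answered, a response with a status_code. The sketch below swaps in that technique; it is not part of the example above. The helper name status_code_for_url is hypothetical, and the stand-in objects in the demo only mimic the url and response attributes so the function can be run without a browser.

```python
from types import SimpleNamespace
from typing import Iterable, Optional


def status_code_for_url(captured_requests: Iterable, url: str) -> Optional[int]:
    """Return the status code of the first captured request to `url`
    that has received a response, or None if there is none."""
    for request in captured_requests:
        if request.url == url and request.response is not None:
            return request.response.status_code
    return None


# Demo with stand-in objects shaped like Selenium Wire's captured requests.
fake_requests = [
    SimpleNamespace(url="http://quotes.toscrape.com/static/main.css", response=None),
    SimpleNamespace(
        url="http://quotes.toscrape.com/page/1/",
        response=SimpleNamespace(status_code=200),
    ),
]
print(status_code_for_url(fake_requests, "http://quotes.toscrape.com/page/1/"))

# With a live driver, the retry loop could use it like this:
#     del driver.requests   # clear requests captured for earlier pages
#     driver.get(url)
#     status_code = status_code_for_url(driver.requests, url)
```

Matching on the exact URL means a redirected page would return None, so a real scraper may need a looser match.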