
Python Selenium Examples

The following are code examples on how to integrate the ScrapeOps Proxy Aggregator with your Python Selenium Scrapers.

Authorisation - API Key

To use the ScrapeOps proxy, you first need an API key which you can get by signing up for a free account here.

Your API key must be included with every request using the api_key query parameter, otherwise the API will return a 403 Forbidden status code.
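
For example, a quick way to check that your key is accepted is to send a test request to the proxy API endpoint with Python's requests library (this sketch assumes the standard https://proxy.scrapeops.io/v1/ endpoint and its api_key/url query parameters; the target URL is just an example):


import requests

## Quick sanity check that your API key is accepted by the Proxy API
response = requests.get(
    'https://proxy.scrapeops.io/v1/',
    params={
        'api_key': 'YOUR_API_KEY',
        'url': 'http://quotes.toscrape.com/',
    },
)

## A 403 response here means the api_key parameter is missing or invalid
print(response.status_code)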


Proxy Port Integration

When integrating the Proxy Aggregator with your Selenium scrapers, we recommend that you use our proxy port integration over the API endpoint integration.

The proxy port integration is a light front-end for the API. It has all the same functionality and performance as sending requests to the API endpoint, but allows you to integrate our proxy aggregator as you would any normal proxy.

The username for the proxy is scrapeops and the password is your API key.


"http://scrapeops.headless_browser_mode=true:YOUR_API_KEY@proxy.scrapeops.io:5353"

Here are the individual connection details:

  • Proxy: proxy.scrapeops.io
  • Port: 5353
  • Username: scrapeops.headless_browser_mode=true
  • Password: YOUR_API_KEY
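
If you prefer to build the proxy connection string from these parts in code, here is a minimal sketch (the variable names are purely illustrative):


SCRAPEOPS_API_KEY = 'YOUR_API_KEY'

## Assemble the proxy URL from the connection details listed above
proxy_host = 'proxy.scrapeops.io'
proxy_port = 5353
proxy_username = 'scrapeops.headless_browser_mode=true'

proxy_url = f'http://{proxy_username}:{SCRAPEOPS_API_KEY}@{proxy_host}:{proxy_port}'
print(proxy_url)
## http://scrapeops.headless_browser_mode=true:YOUR_API_KEY@proxy.scrapeops.io:5353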

We recommend the proxy port integration because using the API endpoint can create issues when loading pages and following links if the website uses relative links rather than absolute links.

We've also added the key/value pair headless_browser_mode=true to the username section of the proxy string, as this optimizes the proxy port for use with headless browsers like Selenium, Puppeteer, etc.

SSL Certificate Verification

Note: So that we can properly direct your requests through the API, your code must be configured to not verify SSL certificates.
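
If you are using Selenium Wire (as in the examples below), you can disable certificate verification explicitly via its verify_ssl setting in seleniumwire_options. Here is a minimal sketch, assuming a recent Selenium Wire version (check the Selenium Wire docs for your installed version if this option behaves differently):


from seleniumwire import webdriver

## Selenium Wire option: do not verify SSL certificates for proxied requests
driver = webdriver.Chrome(seleniumwire_options={
    'verify_ssl': False,
    'proxy': {
        'http': 'http://scrapeops.headless_browser_mode=true:YOUR_API_KEY@proxy.scrapeops.io:5353',
        'https': 'http://scrapeops.headless_browser_mode=true:YOUR_API_KEY@proxy.scrapeops.io:5353',
        'no_proxy': 'localhost,127.0.0.1'
    }
})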

To enable extra/advanced functionality, you can pass parameters by adding them to the username, separated by periods.

For example, if you want to enable country geotargeting with US-based proxies, the username would be scrapeops.country=us.

Also, multiple parameters can be included by separating them with periods, for example:


"http://scrapeops.headless_browser_mode=true.country=us:YOUR_API_KEY@proxy.scrapeops.io:5353"


Integrating With Selenium Scrapers

To integrate our proxy with your Selenium scraper, we recommend that you use the Selenium Wire extension, which makes it very easy to use proxies with Selenium.

First, you need to install Selenium Wire using pip:


pip install selenium-wire

Then update your scraper to use seleniumwire's webdriver instead of the default selenium webdriver.


from seleniumwire import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup

SCRAPEOPS_API_KEY = 'YOUR_API_KEY'

## Define ScrapeOps Proxy Port Endpoint
proxy_options = {
    'proxy': {
        'http': f'http://scrapeops.headless_browser_mode=true:{SCRAPEOPS_API_KEY}@proxy.scrapeops.io:5353',
        'https': f'http://scrapeops.headless_browser_mode=true:{SCRAPEOPS_API_KEY}@proxy.scrapeops.io:5353',
        'no_proxy': 'localhost,127.0.0.1'
    }
}

## Set Up Selenium Chrome driver
driver = webdriver.Chrome(seleniumwire_options=proxy_options)


## Send Request Using ScrapeOps Proxy
driver.get('http://quotes.toscrape.com/page/1/')

## Retrieve HTML Response
html_response = driver.page_source

## Extract Data From HTML
soup = BeautifulSoup(html_response, "html.parser")
h1_text = soup.find('h1').text

print(h1_text)

ScrapeOps will take care of the proxy selection and rotation for you, so you just need to send us the URL you want to scrape.


Response Format

After receiving a response from one of our proxy providers, the ScrapeOps Proxy API will respond with the raw HTML content of the target URL along with a response code:


<html>
<head>
...
</head>
<body>
...
</body>
</html>

The ScrapeOps Proxy API will return a 200 status code when it successfully gets a response from the website that also passes response validation, or a 404 status code if the website responds with a 404 status code. Both of these status codes are considered successful requests.

Here is the full list of status codes the Proxy API returns.


Production Selenium Scraper

The above example works but isn't suitable for production scraping, as it suffers from a number of issues:

  • Only scrapes one URL.
  • Assumes every request is successful and doesn't retry failed requests.
  • Opens a browser for every request.
  • Gives itself away as an automated browser by opening in sandbox mode, etc.
  • Loads all images, which generates extra requests and consumes more bandwidth.

So in the example below we've put together a more production-ready Selenium scraper that deals with all of the above issues:


# code below tested & working as of Nov 2023

# module versions used in example below
# selenium==4.14.0
# selenium-wire==5.1.0
# webdriver-manager==4.0.1

from seleniumwire import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup

SCRAPEOPS_API_KEY = 'YOUR_API_KEY'
NUM_RETRIES = 2

proxy_options = {
    'proxy': {
        'http': f'http://scrapeops.headless_browser_mode=true:{SCRAPEOPS_API_KEY}@proxy.scrapeops.io:5353',
        'https': f'http://scrapeops.headless_browser_mode=true:{SCRAPEOPS_API_KEY}@proxy.scrapeops.io:5353',
        'no_proxy': 'localhost,127.0.0.1'
    }
}

## Store The Scraped Data In This List
scraped_quotes = []

## Urls to Scrape
url_list = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
    'http://quotes.toscrape.com/page/3/',
]

## Optional --> define Selenium options
option = webdriver.ChromeOptions()
option.add_argument('--headless') ## --> comment out to see the browser launch.
option.add_argument('--no-sandbox')
option.add_argument('--disable-dev-shm-usage')
option.add_argument('--blink-settings=imagesEnabled=false')


## Set up Selenium Chrome driver
driver = webdriver.Chrome(
    options=option,
    seleniumwire_options=proxy_options
)




### Our Helper Functions ###
def get_page_url_status_code(url, driver):
    page_url_status_code = 500

    # Access requests via the `requests` attribute
    for request in driver.requests:

        if request.response:
            # show all urls that are requested per page load
            print(
                request.url,
                request.response.status_code,
                request.response.headers['Content-Type']
            )

            if request.url == url:
                page_url_status_code = request.response.status_code

    return page_url_status_code


## customise this list with whatever your page does not need
def interceptor(request):
    # stopping images from being requested
    # in case any are not blocked by imagesEnabled=false in the webdriver options above
    if request.path.endswith(('.png', '.jpg', '.gif')):
        request.abort()

    # stopping css from being requested
    if request.path.endswith('.css'):
        request.abort()

    # stopping fonts from being requested
    if 'fonts.' in request.path:  # eg fonts.googleapis.com or fonts.gstatic.com
        request.abort()

### End Of Helper Functions ###




## looping through our list of urls
## looping through our list of urls
for url in url_list:

    ## manage retries in case we get a 500/401 response etc
    status_code = 0  ## default in case every attempt raises an exception
    for _ in range(NUM_RETRIES):
        try:
            ## add an interceptor to make sure we don't request un-needed files (css or images) - saves money!
            driver.request_interceptor = interceptor

            driver.get(url)
            status_code = get_page_url_status_code(url, driver)

            if status_code in [200, 404]:
                ## escape for loop if the API returns a successful response
                break
        except Exception as e:
            print("error", e)
            driver.close()


    if status_code == 200:
        ## Feed HTML response into BeautifulSoup
        html_response = driver.page_source
        soup = BeautifulSoup(html_response, "html.parser")

        ## Find all quotes sections
        quotes_sections = soup.find_all('div', class_="quote")

        ## loop through each quotes section and extract the quote and author
        for quote_block in quotes_sections:
            quote = quote_block.find('span', class_='text').text
            author = quote_block.find('small', class_='author').text

            ## Add scraped data to "scraped_quotes" list
            scraped_quotes.append({
                'quote': quote,
                'author': author
            })


print(scraped_quotes)