Then check out ScrapeOps, the complete toolkit for web scraping.
To run this scraper, first create a `config.json` file containing your ScrapeOps API key:

```json
{"api_key": "your-super-secret-api-key"}
```

Then run the script with `python name_of_your_python_file.py`.
Here is the full code for our Google Maps reviews scraper:

```python
import os
import re
import csv
import requests
import json
import time
import logging
from urllib.parse import urlencode
from bs4 import BeautifulSoup
import concurrent.futures
from dataclasses import dataclass, field, fields, asdict

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]


# Wrap a URL so the request is routed through the ScrapeOps Proxy Aggregator.
def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location,
        "wait": 5000,
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url


## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@dataclass
class SearchData:
    name: str = ""
    stars: float = 0
    url: str = ""
    rating_count: int = 0

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())


@dataclass
class ReviewData:
    name: str = ""
    stars: int = 0
    time_left: str = ""
    review_shortened: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())


class DataPipeline:
    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)
            if not file_exists:
                writer.writeheader()
            for item in data_to_save:
                writer.writerow(asdict(item))
        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if self.is_duplicate(scraped_data) == False:
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)
        if len(self.storage_queue) > 0:
            self.save_to_csv()


# Crawl one Google Maps search page for a given keyword and locality.
def scrape_search_results(keyword, location, locality, data_pipeline=None, retries=3):
    formatted_keyword = keyword.replace(" ", "+")
    url = f"https://www.google.com/maps/search/{formatted_keyword}/@{locality},14z/data=!3m1!4b1?entry=ttu"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            scrapeops_proxy_url = get_scrapeops_url(url, location=location)
            response = requests.get(scrapeops_proxy_url)
            logger.info(f"Received [{response.status_code}] from: {url}")
            if response.status_code != 200:
                raise Exception(f"Failed request, Status Code {response.status_code}")

            soup = BeautifulSoup(response.text, "html.parser")
            business_links = soup.select("div div a")
            excluded_words = ["Sign in"]
            for business_link in business_links:
                name = business_link.get("aria-label")
                if not name or name in excluded_words or "Visit" in name:
                    continue
                maps_link = business_link.get("href")
                full_card = business_link.parent

                rating_holder = full_card.select_one("span[role='img']")
                rating = 0.0
                rating_count = 0
                if rating_holder:
                    rating_array = rating_holder.text.split("(")
                    rating = rating_array[0]
                    rating_count = int(rating_array[1].replace(")", "").replace(",", ""))

                search_data = SearchData(
                    name=name,
                    stars=rating,
                    url=maps_link,
                    rating_count=rating_count
                )
                data_pipeline.add_data(search_data)

            success = True
            logger.info(f"Successfully parsed data from: {url}")

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")


def start_scrape(keyword, location, localities, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_search_results,
            [keyword] * len(localities),
            [location] * len(localities),
            localities,
            [data_pipeline] * len(localities),
            [retries] * len(localities)
        )


# Scrape the reviews for a single business taken from the crawl CSV.
def process_business(row, location, retries=3):
    url = row["url"]
    tries = 0
    success = False

    while tries <= retries and not success:
        response = requests.get(get_scrapeops_url(url, location=location))
        try:
            if response.status_code == 200:
                logger.info(f"Status: {response.status_code}")
                soup = BeautifulSoup(response.text, "html.parser")
                main_card = soup.select_one("div[role='main']")

                info_cards = soup.find_all("div", class_="MyEned")
                review_pipeline = DataPipeline(csv_filename=f"{row['name']}.csv")
                for card in info_cards:
                    review = card.text
                    full_card = card.parent.parent.parent.parent
                    reviewer_button = full_card.find("button")
                    name = reviewer_button.get("aria-label").replace("Photo of ", "")
                    rating_tag = full_card.select_one("span[role='img']")
                    stars = int(rating_tag.get("aria-label").replace(" stars", "").replace(" star", ""))
                    review_date = rating_tag.parent.find_all("span")[-1].text

                    review_data = ReviewData(
                        name=name,
                        stars=stars,
                        time_left=review_date,
                        review_shortened=review
                    )
                    review_pipeline.add_data(review_data)

                review_pipeline.close_pipeline()
                success = True
            else:
                logger.warning(f"Failed Response: {response.status_code}")
                raise Exception(f"Failed Request, status code: {response.status_code}")
        except Exception as e:
            logger.error(f"Exception thrown: {e}")
            logger.warning(f"Failed to process page: {row['url']}")
            logger.warning(f"Retries left: {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")
    else:
        logger.info(f"Successfully parsed: {row['url']}")


def process_results(csv_file, location, max_threads=5, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
            executor.map(
                process_business,
                reader,
                [location] * len(reader),
                [retries] * len(reader)
            )


if __name__ == "__main__":
    MAX_RETRIES = 3
    MAX_THREADS = 5
    LOCATION = "us"
    LOCALITIES = ["42.3,-83.5"]

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["restaurant"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")
        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, LOCATION, LOCALITIES, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")

    for file in aggregate_files:
        process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
```
To change your results, feel free to tweak any of the following constants inside `main`:

- `MAX_RETRIES`: the max number of retries for a parse.
- `MAX_THREADS`: how many threads you'd like to use when parsing pages simultaneously.
- `LOCATION`: the location you'd like to appear from.
- `LOCALITIES`: the areas of the map you'd like to scrape. They need to be added as latitude and longitude pairs.
- `keyword_list`: the keywords you'd like to search the map for.

During the crawl, we'll extract the `name`, `stars`, `url`, and `rating_count` for each business. Then, we'll save these to a CSV file. Then, our review scraper will go through and find reviews for each of these businesses.

Our Maps crawler will need to do the following: run a search for each locality we give it, pull these fields out of the results, and save everything to a CSV file. Here is an example of a Google Maps business URL:

```
https://www.google.com/maps/place/Leo's+Coney+Island/@42.3937072,-83.4828338,17z/data=!4m6!3m5!1s0x8824acedc1b6f397:0xaa85d06de541a352!8m2!3d42.3937072!4d-83.4828338!16s%2Fg%2F1tf299fd?authuser=0&hl=en&entry=ttu&g_ep=EgoyMDI0MDkwOC4wIKXMDSoASAFQAw%3D%3D
```

In this URL, `@42.3937072,-83.4828338` is our latitude and longitude. Each business result is held in an `a` tag with a link to the restaurant information. On a business page, each review is held in a `div` with a class of `MyEned`. Once we find this element, we can find its `parent` elements. Once we've found the correct `parent` element, we can find all of the other information we need. To control where on the map we search, we use the latitude and longitude embedded in the URL, e.g. `@42.3937072,-83.4828338`.
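As a quick illustration of that URL structure, here is a small sketch (not part of the scraper itself) that pulls the latitude and longitude out of a Maps URL with a regular expression. The pattern is an assumption based on the `@lat,lng` format shown above.

```python
import re

maps_url = ("https://www.google.com/maps/place/Leo's+Coney+Island/"
            "@42.3937072,-83.4828338,17z/data=!4m6!3m5!1s0x8824acedc1b6f397:0xaa85d06de541a352")

# Look for the "@latitude,longitude" chunk that Google embeds in Maps URLs.
match = re.search(r"@(-?\d+\.\d+),(-?\d+\.\d+)", maps_url)
if match:
    latitude, longitude = float(match.group(1)), float(match.group(2))
    print(latitude, longitude)  # 42.3937072 -83.4828338
```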
To control our geolocation, we can use the `country` param with the ScrapeOps Proxy Aggregator. For example, if we want to appear in the US, we pass `{"country": "us"}` to ScrapeOps.
Create a new project folder and move into it:

```bash
mkdir google-reviews-scraper
cd google-reviews-scraper
```

Create and activate a virtual environment:

```bash
python -m venv venv
source venv/bin/activate
```

Install your dependencies:

```bash
pip install requests
pip install beautifulsoup4
```
We'll start by building a crawler around a parsing function, `scrape_search_results()`. Pay close attention to the parsing logic going on in this script.

```python
import os
import re
import csv
import requests
import json
import logging
from urllib.parse import urlencode
from bs4 import BeautifulSoup
import concurrent.futures
from dataclasses import dataclass, field, fields, asdict

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]


## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def scrape_search_results(keyword, location, locality, retries=3):
    formatted_keyword = keyword.replace(" ", "+")
    url = f"https://www.google.com/maps/search/{formatted_keyword}/@{locality},14z/data=!3m1!4b1?entry=ttu"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            response = requests.get(url)
            logger.info(f"Received [{response.status_code}] from: {url}")
            if response.status_code != 200:
                raise Exception(f"Failed request, Status Code {response.status_code}")

            soup = BeautifulSoup(response.text, "html.parser")
            business_links = soup.select("div div a")
            excluded_words = ["Sign in"]
            for business_link in business_links:
                name = business_link.get("aria-label")
                if not name or name in excluded_words or "Visit" in name:
                    continue
                maps_link = business_link.get("href")
                full_card = business_link.parent

                rating_holder = full_card.select_one("span[role='img']")
                rating = 0.0
                rating_count = 0
                if rating_holder:
                    rating_array = rating_holder.text.split("(")
                    rating = rating_array[0]
                    rating_count = int(rating_array[1].replace(")", "").replace(",", ""))

                search_data = {
                    "name": name,
                    "stars": rating,
                    "url": maps_link,
                    "rating_count": rating_count
                }
                print(search_data)

            success = True
            logger.info(f"Successfully parsed data from: {url}")

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")


def start_scrape(keyword, location, localities, retries=3):
    for locality in localities:
        scrape_search_results(keyword, location, locality, retries=retries)


if __name__ == "__main__":
    MAX_RETRIES = 3
    MAX_THREADS = 5
    LOCATION = "us"
    LOCALITIES = ["42.3,-83.5", "42.35,-83.5", "42.4,-83.5"]

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["restaurant"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")
        start_scrape(keyword, LOCATION, LOCALITIES, retries=MAX_RETRIES)
    logger.info(f"Crawl complete.")
```
Here is how the parsing logic works:

- `business_links = soup.select("div div a")` finds all of our business link elements.
- `business_link.get("aria-label")` gives us the name of each business.
- `business_link.get("href")` gives us the link to each business.
- We find the `parent` element of the business link with `full_card = business_link.parent`.
- `full_card.select_one("span[role='img']")` finds our rating holder.

Now that we can parse results, we need proper data storage. First, we need a `dataclass` to represent our search results. Then, we need a pipeline to a CSV file. Here is our new `dataclass`. We'll call it `SearchData`.
```python
@dataclass
class SearchData:
    name: str = ""
    stars: float = 0
    url: str = ""
    rating_count: int = 0

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())
```
Next is our `DataPipeline`. This class opens a pipe to a CSV file and filters out duplicates using their `name` attribute.

```python
class DataPipeline:
    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)
            if not file_exists:
                writer.writeheader()
            for item in data_to_save:
                writer.writerow(asdict(item))
        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if self.is_duplicate(scraped_data) == False:
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)
        if len(self.storage_queue) > 0:
            self.save_to_csv()
```
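To make the flow concrete, here is a small, hypothetical usage sketch (assuming `SearchData` and `DataPipeline` are defined as above, with `time` imported for `close_pipeline()`): we open a pipeline to a CSV file, add a couple of `SearchData` objects, and close the pipeline so anything left in the queue gets written. The filename and business values below are purely illustrative.

```python
pipeline = DataPipeline(csv_filename="example.csv")

pipeline.add_data(SearchData(name="Leo's Coney Island", stars=4.4,
                             url="https://www.google.com/maps/place/Leos+Coney+Island",
                             rating_count=1200))
# Same name again: is_duplicate() catches it and the item gets dropped with a warning.
pipeline.add_data(SearchData(name="Leo's Coney Island", stars=4.4,
                             url="https://www.google.com/maps/place/Leos+Coney+Island",
                             rating_count=1200))

pipeline.close_pipeline()  # flushes whatever is still in the storage queue to example.csv
```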
In our updated code below, we open a `DataPipeline` and pass it into `start_scrape()`. It then gets passed into `scrape_search_results()`. Instead of finding and printing our data as a `dict` object, we create a `SearchData` object and pass it into our `DataPipeline`.

```python
import os
import re
import csv
import requests
import json
import time
import logging
from urllib.parse import urlencode
from bs4 import BeautifulSoup
import concurrent.futures
from dataclasses import dataclass, field, fields, asdict

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]


## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@dataclass
class SearchData:
    name: str = ""
    stars: float = 0
    url: str = ""
    rating_count: int = 0

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())


class DataPipeline:
    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)
            if not file_exists:
                writer.writeheader()
            for item in data_to_save:
                writer.writerow(asdict(item))
        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if self.is_duplicate(scraped_data) == False:
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)
        if len(self.storage_queue) > 0:
            self.save_to_csv()


def scrape_search_results(keyword, location, locality, data_pipeline=None, retries=3):
    formatted_keyword = keyword.replace(" ", "+")
    url = f"https://www.google.com/maps/search/{formatted_keyword}/@{locality},14z/data=!3m1!4b1?entry=ttu"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            response = requests.get(url)
            logger.info(f"Received [{response.status_code}] from: {url}")
            if response.status_code != 200:
                raise Exception(f"Failed request, Status Code {response.status_code}")

            soup = BeautifulSoup(response.text, "html.parser")
            business_links = soup.select("div div a")
            excluded_words = ["Sign in"]
            for business_link in business_links:
                name = business_link.get("aria-label")
                if not name or name in excluded_words or "Visit" in name:
                    continue
                maps_link = business_link.get("href")
                full_card = business_link.parent

                rating_holder = full_card.select_one("span[role='img']")
                rating = 0.0
                rating_count = 0
                if rating_holder:
                    rating_array = rating_holder.text.split("(")
                    rating = rating_array[0]
                    rating_count = int(rating_array[1].replace(")", "").replace(",", ""))

                search_data = SearchData(
                    name=name,
                    stars=rating,
                    url=maps_link,
                    rating_count=rating_count
                )
                data_pipeline.add_data(search_data)

            success = True
            logger.info(f"Successfully parsed data from: {url}")

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")


def start_scrape(keyword, location, localities, data_pipeline=None, retries=3):
    for locality in localities:
        scrape_search_results(keyword, location, locality, data_pipeline=data_pipeline, retries=retries)


if __name__ == "__main__":
    MAX_RETRIES = 3
    MAX_THREADS = 5
    LOCATION = "us"
    LOCALITIES = ["42.3,-83.5", "42.35,-83.5", "42.4,-83.5"]

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["restaurant"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")
        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, LOCATION, LOCALITIES, data_pipeline=crawl_pipeline, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")
```
In the code above:

- `SearchData` is used to represent individual search results from our crawl.
- `DataPipeline` is used to pipe all of our `SearchData` objects to a CSV file and remove the duplicates.

`start_scrape()` already allows us to crawl a list of different localities. To crawl this list concurrently, we just need to refactor `start_scrape()` and replace the `for` loop with something a little more powerful. We'll do this using `ThreadPoolExecutor`. This opens up a new pool of threads and runs our parsing function on each thread concurrently.

Here is our old version of `start_scrape()`:

```python
def start_scrape(keyword, location, localities, data_pipeline=None, retries=3):
    for locality in localities:
        scrape_search_results(keyword, location, locality, data_pipeline=data_pipeline, retries=retries)
```
And here is the new, multithreaded version:

```python
def start_scrape(keyword, location, localities, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_search_results,
            [keyword] * len(localities),
            [location] * len(localities),
            localities,
            [data_pipeline] * len(localities),
            [retries] * len(localities)
        )
```
`executor.map()` is the portion that actually replaces the `for` loop. Take a look at the args:

- `scrape_search_results`: the function we want to call.
- `[keyword] * len(localities)`: our keyword passed in as a list.
- `[location] * len(localities)`: our location passed in as a list.
- `localities`: the list of localities we'd like to crawl.
- `[data_pipeline] * len(localities)`: our `DataPipeline` object passed in as a list.
- `[retries] * len(localities)`: our retry limit passed in as a list.

`executor.map()` takes these lists and passes them into a bunch of separate instances of our parsing function, as the small sketch below illustrates.
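Here is a tiny, self-contained sketch (unrelated to Google Maps) of the same pattern, just to show how `executor.map()` lines the lists up argument by argument:

```python
import concurrent.futures

def greet(greeting, name, punctuation):
    print(f"{greeting}, {name}{punctuation}")

names = ["Alice", "Bob", "Carol"]

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    # Each call receives one element from each list: greet("Hello", "Alice", "!"), etc.
    executor.map(
        greet,
        ["Hello"] * len(names),  # the same greeting repeated for every name
        names,                   # the list we actually want to fan out over
        ["!"] * len(names)       # the same punctuation repeated for every name
    )
```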
When talking to the proxy, we also need to `wait` for our content to render. We need to tell the ScrapeOps Proxy Aggregator the following four things when making our requests:

- `"api_key"`: your ScrapeOps API key.
- `"url"`: the url we want to scrape.
- `"country"`: the country we want our request to be routed through. This parameter uses a location of our choice when we make the request.
- `"wait"`: how long to wait before sending our response. This allows the content to render on their end before we get it back.

```python
def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location,
        "wait": 5000,
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url
```
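Usage is straightforward: wrap any URL you want to fetch. A quick, hypothetical example (it assumes `requests` and `get_scrapeops_url()` from the script above, and the target URL is just an illustration):

```python
target = "https://www.google.com/maps/search/restaurant/@42.3,-83.5,14z/data=!3m1!4b1?entry=ttu"
proxied = get_scrapeops_url(target, location="us")
# "proxied" now points at https://proxy.scrapeops.io/v1/ with api_key, url,
# country, and wait passed along as URL-encoded query parameters.
response = requests.get(proxied)
```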
Here is our full crawler code with proxy integration:

```python
import os
import re
import csv
import requests
import json
import time
import logging
from urllib.parse import urlencode
from bs4 import BeautifulSoup
import concurrent.futures
from dataclasses import dataclass, field, fields, asdict

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]


def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location,
        "wait": 5000,
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url


## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@dataclass
class SearchData:
    name: str = ""
    stars: float = 0
    url: str = ""
    rating_count: int = 0

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())


class DataPipeline:
    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)
            if not file_exists:
                writer.writeheader()
            for item in data_to_save:
                writer.writerow(asdict(item))
        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if self.is_duplicate(scraped_data) == False:
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)
        if len(self.storage_queue) > 0:
            self.save_to_csv()


def scrape_search_results(keyword, location, locality, data_pipeline=None, retries=3):
    formatted_keyword = keyword.replace(" ", "+")
    url = f"https://www.google.com/maps/search/{formatted_keyword}/@{locality},14z/data=!3m1!4b1?entry=ttu"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            scrapeops_proxy_url = get_scrapeops_url(url, location=location)
            response = requests.get(scrapeops_proxy_url)
            logger.info(f"Received [{response.status_code}] from: {url}")
            if response.status_code != 200:
                raise Exception(f"Failed request, Status Code {response.status_code}")

            soup = BeautifulSoup(response.text, "html.parser")
            business_links = soup.select("div div a")
            excluded_words = ["Sign in"]
            for business_link in business_links:
                name = business_link.get("aria-label")
                if not name or name in excluded_words or "Visit" in name:
                    continue
                maps_link = business_link.get("href")
                full_card = business_link.parent

                rating_holder = full_card.select_one("span[role='img']")
                rating = 0.0
                rating_count = 0
                if rating_holder:
                    rating_array = rating_holder.text.split("(")
                    rating = rating_array[0]
                    rating_count = int(rating_array[1].replace(")", "").replace(",", ""))

                search_data = SearchData(
                    name=name,
                    stars=rating,
                    url=maps_link,
                    rating_count=rating_count
                )
                data_pipeline.add_data(search_data)

            success = True
            logger.info(f"Successfully parsed data from: {url}")

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")


def start_scrape(keyword, location, localities, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_search_results,
            [keyword] * len(localities),
            [location] * len(localities),
            localities,
            [data_pipeline] * len(localities),
            [retries] * len(localities)
        )


if __name__ == "__main__":
    MAX_RETRIES = 3
    MAX_THREADS = 5
    LOCATION = "us"
    LOCALITIES = ["42.3,-83.5", "42.35,-83.5", "42.4,-83.5"]

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["restaurant"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")
        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, LOCATION, LOCALITIES, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")
```
As before, you can tweak your results by changing the constants inside `main`:

```python
if __name__ == "__main__":
    MAX_RETRIES = 3
    MAX_THREADS = 5
    LOCATION = "us"
    LOCALITIES = ["42.3,-83.5", "42.35,-83.5", "42.4,-83.5"]

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["restaurant"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")
        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, LOCATION, LOCALITIES, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")
```
- `MAX_RETRIES`: the max number of retries for a parse.
- `MAX_THREADS`: how many threads you'd like to use when parsing pages simultaneously.
- `LOCATION`: the location you'd like to appear from.
- `LOCALITIES`: the areas of the map you'd like to scrape. They need to be added as latitude and longitude pairs.
- `keyword_list`: the keywords you'd like to search the map for.

Now it's time to build our review scraper. Like before, we'll start with a parsing function, `process_business()`.

```python
def process_business(row, location, retries=3):
    url = row["url"]
    tries = 0
    success = False

    while tries <= retries and not success:
        response = requests.get(url)
        try:
            if response.status_code == 200:
                logger.info(f"Status: {response.status_code}")
                soup = BeautifulSoup(response.text, "html.parser")
                main_card = soup.select_one("div[role='main']")

                info_cards = soup.find_all("div", class_="MyEned")
                for card in info_cards:
                    review = card.text
                    full_card = card.parent.parent.parent.parent
                    reviewer_button = full_card.find("button")
                    name = reviewer_button.get("aria-label").replace("Photo of ", "")
                    rating_tag = full_card.select_one("span[role='img']")
                    stars = int(rating_tag.get("aria-label").replace(" stars", "").replace(" star", ""))
                    review_date = rating_tag.parent.find_all("span")[-1].text

                    review_data = {
                        "name": name,
                        "stars": stars,
                        "time_left": review_date,
                        "review_shortened": review
                    }
                    print(review_data)

                success = True
            else:
                logger.warning(f"Failed Response: {response.status_code}")
                raise Exception(f"Failed Request, status code: {response.status_code}")
        except Exception as e:
            logger.error(f"Exception thrown: {e}")
            logger.warning(f"Failed to process page: {row['url']}")
            logger.warning(f"Retries left: {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")
    else:
        logger.info(f"Successfully parsed: {row['url']}")
```
Here is how the review parsing works:

- First, we find our `info_card` items: `info_cards = soup.find_all("div", class_="MyEned")`.
- We grab the text of each review with `review = card.text`.
- We use the `parent` attribute to find the full review card that includes the reviewer name and rating: `full_card = card.parent.parent.parent.parent`.
- `reviewer_button = full_card.find("button")` finds the button that holds information about our reviewer.
- We pull the reviewer's name from its `aria-label` attribute: `name = reviewer_button.get("aria-label").replace("Photo of ", "")`. We also remove `"Photo of "` from the string that includes their name; this way, the only information we're saving is the reviewer name.
- We convert the rating to a number with `int(rating_tag.get("aria-label").replace(" stars", "").replace(" star", ""))`.
- `review_date = rating_tag.parent.find_all("span")[-1].text` finds all the `span` tags descended from the `parent` of our `rating_tag`. The last element is our review date, so we pull index `-1` from the array (see the short sketch after this list).
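Here is that attribute parsing in isolation, on made-up values (the `aria-label` strings below are hypothetical but follow the format described above):

```python
# Hypothetical aria-label values in the format Google uses.
reviewer_label = "Photo of Jane Doe"
rating_label = "4 stars"

name = reviewer_label.replace("Photo of ", "")                        # "Jane Doe"
stars = int(rating_label.replace(" stars", "").replace(" star", ""))  # 4

print(name, stars)
```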
Next, we need a function similar to `start_scrape()`. This one needs to read our CSV file into an array of `dict` objects. Then, it should iterate through the array and call our parsing function on each row we read from the file.

```python
def process_results(csv_file, location, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        for row in reader:
            process_business(row, location, retries=retries)
```
Putting it all together, our code now looks like this:

```python
import os
import re
import csv
import requests
import json
import time
import logging
from urllib.parse import urlencode
from bs4 import BeautifulSoup
import concurrent.futures
from dataclasses import dataclass, field, fields, asdict

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]


def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location,
        "wait": 5000,
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url


## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@dataclass
class SearchData:
    name: str = ""
    stars: float = 0
    url: str = ""
    rating_count: int = 0

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())


class DataPipeline:
    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)
            if not file_exists:
                writer.writeheader()
            for item in data_to_save:
                writer.writerow(asdict(item))
        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if self.is_duplicate(scraped_data) == False:
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)
        if len(self.storage_queue) > 0:
            self.save_to_csv()


def scrape_search_results(keyword, location, locality, data_pipeline=None, retries=3):
    formatted_keyword = keyword.replace(" ", "+")
    url = f"https://www.google.com/maps/search/{formatted_keyword}/@{locality},14z/data=!3m1!4b1?entry=ttu"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            scrapeops_proxy_url = get_scrapeops_url(url, location=location)
            response = requests.get(scrapeops_proxy_url)
            logger.info(f"Received [{response.status_code}] from: {url}")
            if response.status_code != 200:
                raise Exception(f"Failed request, Status Code {response.status_code}")

            soup = BeautifulSoup(response.text, "html.parser")
            business_links = soup.select("div div a")
            excluded_words = ["Sign in"]
            for business_link in business_links:
                name = business_link.get("aria-label")
                if not name or name in excluded_words or "Visit" in name:
                    continue
                maps_link = business_link.get("href")
                full_card = business_link.parent

                rating_holder = full_card.select_one("span[role='img']")
                rating = 0.0
                rating_count = 0
                if rating_holder:
                    rating_array = rating_holder.text.split("(")
                    rating = rating_array[0]
                    rating_count = int(rating_array[1].replace(")", "").replace(",", ""))

                search_data = SearchData(
                    name=name,
                    stars=rating,
                    url=maps_link,
                    rating_count=rating_count
                )
                data_pipeline.add_data(search_data)

            success = True
            logger.info(f"Successfully parsed data from: {url}")

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")


def start_scrape(keyword, location, localities, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_search_results,
            [keyword] * len(localities),
            [location] * len(localities),
            localities,
            [data_pipeline] * len(localities),
            [retries] * len(localities)
        )


def process_business(row, location, retries=3):
    url = row["url"]
    tries = 0
    success = False

    while tries <= retries and not success:
        response = requests.get(url)
        try:
            if response.status_code == 200:
                logger.info(f"Status: {response.status_code}")
                soup = BeautifulSoup(response.text, "html.parser")
                main_card = soup.select_one("div[role='main']")

                info_cards = soup.find_all("div", class_="MyEned")
                for card in info_cards:
                    review = card.text
                    full_card = card.parent.parent.parent.parent
                    reviewer_button = full_card.find("button")
                    name = reviewer_button.get("aria-label").replace("Photo of ", "")
                    rating_tag = full_card.select_one("span[role='img']")
                    stars = int(rating_tag.get("aria-label").replace(" stars", "").replace(" star", ""))
                    review_date = rating_tag.parent.find_all("span")[-1].text

                    review_data = {
                        "name": name,
                        "stars": stars,
                        "time_left": review_date,
                        "review_shortened": review
                    }
                    print(review_data)

                success = True
            else:
                logger.warning(f"Failed Response: {response.status_code}")
                raise Exception(f"Failed Request, status code: {response.status_code}")
        except Exception as e:
            logger.error(f"Exception thrown: {e}")
            logger.warning(f"Failed to process page: {row['url']}")
            logger.warning(f"Retries left: {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")
    else:
        logger.info(f"Successfully parsed: {row['url']}")


def process_results(csv_file, location, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        for row in reader:
            process_business(row, location, retries=retries)


if __name__ == "__main__":
    MAX_RETRIES = 3
    MAX_THREADS = 5
    LOCATION = "us"
    LOCALITIES = ["42.3,-83.5", "42.35,-83.5", "42.4,-83.5"]

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["restaurant"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")
        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, LOCATION, LOCALITIES, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")

    for file in aggregate_files:
        process_results(file, LOCATION, retries=MAX_RETRIES)
```
We already have a `DataPipeline` class. This makes our new storage really easy to implement. We just need to pass a `dataclass` into a `DataPipeline`. This new class will be used to represent reviews from the page. Take a look at `ReviewData`; it's almost identical to `SearchData`.

```python
@dataclass
class ReviewData:
    name: str = ""
    stars: int = 0
    time_left: str = ""
    review_shortened: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())
```
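One detail worth noticing: `__post_init__()` cleans up string fields, so an empty value gets a readable placeholder instead of an empty cell in the CSV. A quick illustration with made-up values:

```python
review = ReviewData(name="", stars=5, time_left=" a month ago ", review_shortened="Great coney dogs!")
print(review.name)       # "No name"      (empty strings get a default)
print(review.time_left)  # "a month ago"  (surrounding whitespace gets stripped)
```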
In the code below, we open a new `DataPipeline` from inside our parsing function. Then, as we extract our data, we convert it into `ReviewData`. That `ReviewData` then gets passed into the `DataPipeline` as we parse it.

```python
import os
import re
import csv
import requests
import json
import time
import logging
from urllib.parse import urlencode
from bs4 import BeautifulSoup
import concurrent.futures
from dataclasses import dataclass, field, fields, asdict

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]


def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location,
        "wait": 5000,
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url


## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@dataclass
class SearchData:
    name: str = ""
    stars: float = 0
    url: str = ""
    rating_count: int = 0

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())


@dataclass
class ReviewData:
    name: str = ""
    stars: int = 0
    time_left: str = ""
    review_shortened: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())


class DataPipeline:
    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)
            if not file_exists:
                writer.writeheader()
            for item in data_to_save:
                writer.writerow(asdict(item))
        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if self.is_duplicate(scraped_data) == False:
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)
        if len(self.storage_queue) > 0:
            self.save_to_csv()


def scrape_search_results(keyword, location, locality, data_pipeline=None, retries=3):
    formatted_keyword = keyword.replace(" ", "+")
    url = f"https://www.google.com/maps/search/{formatted_keyword}/@{locality},14z/data=!3m1!4b1?entry=ttu"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            scrapeops_proxy_url = get_scrapeops_url(url, location=location)
            response = requests.get(scrapeops_proxy_url)
            logger.info(f"Received [{response.status_code}] from: {url}")
            if response.status_code != 200:
                raise Exception(f"Failed request, Status Code {response.status_code}")

            soup = BeautifulSoup(response.text, "html.parser")
            business_links = soup.select("div div a")
            excluded_words = ["Sign in"]
            for business_link in business_links:
                name = business_link.get("aria-label")
                if not name or name in excluded_words or "Visit" in name:
                    continue
                maps_link = business_link.get("href")
                full_card = business_link.parent

                rating_holder = full_card.select_one("span[role='img']")
                rating = 0.0
                rating_count = 0
                if rating_holder:
                    rating_array = rating_holder.text.split("(")
                    rating = rating_array[0]
                    rating_count = int(rating_array[1].replace(")", "").replace(",", ""))

                search_data = SearchData(
                    name=name,
                    stars=rating,
                    url=maps_link,
                    rating_count=rating_count
                )
                data_pipeline.add_data(search_data)

            success = True
            logger.info(f"Successfully parsed data from: {url}")

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")


def start_scrape(keyword, location, localities, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_search_results,
            [keyword] * len(localities),
            [location] * len(localities),
            localities,
            [data_pipeline] * len(localities),
            [retries] * len(localities)
        )


def process_business(row, location, retries=3):
    url = row["url"]
    tries = 0
    success = False

    while tries <= retries and not success:
        response = requests.get(url)
        try:
            if response.status_code == 200:
                logger.info(f"Status: {response.status_code}")
                soup = BeautifulSoup(response.text, "html.parser")
                main_card = soup.select_one("div[role='main']")

                info_cards = soup.find_all("div", class_="MyEned")
                review_pipeline = DataPipeline(csv_filename=f"{row['name']}.csv")
                for card in info_cards:
                    review = card.text
                    full_card = card.parent.parent.parent.parent
                    reviewer_button = full_card.find("button")
                    name = reviewer_button.get("aria-label").replace("Photo of ", "")
                    rating_tag = full_card.select_one("span[role='img']")
                    stars = int(rating_tag.get("aria-label").replace(" stars", "").replace(" star", ""))
                    review_date = rating_tag.parent.find_all("span")[-1].text

                    review_data = ReviewData(
                        name=name,
                        stars=stars,
                        time_left=review_date,
                        review_shortened=review
                    )
                    review_pipeline.add_data(review_data)

                review_pipeline.close_pipeline()
                success = True
            else:
                logger.warning(f"Failed Response: {response.status_code}")
                raise Exception(f"Failed Request, status code: {response.status_code}")
        except Exception as e:
            logger.error(f"Exception thrown: {e}")
            logger.warning(f"Failed to process page: {row['url']}")
            logger.warning(f"Retries left: {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")
    else:
        logger.info(f"Successfully parsed: {row['url']}")


def process_results(csv_file, location, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        for row in reader:
            process_business(row, location, retries=retries)


if __name__ == "__main__":
    MAX_RETRIES = 3
    MAX_THREADS = 5
    LOCATION = "us"
    LOCALITIES = ["42.3,-83.5", "42.35,-83.5", "42.4,-83.5"]

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["restaurant"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")
        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, LOCATION, LOCALITIES, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")

    for file in aggregate_files:
        process_results(file, LOCATION, retries=MAX_RETRIES)
```
To add concurrency to the review scraper, we'll use `ThreadPoolExecutor` just like we did before. We'll replace the `for` loop in `process_results()` with some more powerful, multithreaded code.

```python
def process_results(csv_file, location, max_threads=5, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
            executor.map(
                process_business,
                reader,
                [location] * len(reader),
                [retries] * len(reader)
            )
```
`process_business` is the function we want to call on all threads; the other arguments (our CSV rows, location, and retry limit) get passed in as lists, just like before.

Finally, to route these requests through the proxy, we change one line in `process_business()`:

```python
response = requests.get(get_scrapeops_url(url, location=location))
```

Here is our full, production-ready code:

```python
import os
import re
import csv
import requests
import json
import time
import logging
from urllib.parse import urlencode
from bs4 import BeautifulSoup
import concurrent.futures
from dataclasses import dataclass, field, fields, asdict

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]


def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location,
        "wait": 5000,
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url


## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@dataclass
class SearchData:
    name: str = ""
    stars: float = 0
    url: str = ""
    rating_count: int = 0

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())


@dataclass
class ReviewData:
    name: str = ""
    stars: int = 0
    time_left: str = ""
    review_shortened: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())


class DataPipeline:
    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)
            if not file_exists:
                writer.writeheader()
            for item in data_to_save:
                writer.writerow(asdict(item))
        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if self.is_duplicate(scraped_data) == False:
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)
        if len(self.storage_queue) > 0:
            self.save_to_csv()


def scrape_search_results(keyword, location, locality, data_pipeline=None, retries=3):
    formatted_keyword = keyword.replace(" ", "+")
    url = f"https://www.google.com/maps/search/{formatted_keyword}/@{locality},14z/data=!3m1!4b1?entry=ttu"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            scrapeops_proxy_url = get_scrapeops_url(url, location=location)
            response = requests.get(scrapeops_proxy_url)
            logger.info(f"Received [{response.status_code}] from: {url}")
            if response.status_code != 200:
                raise Exception(f"Failed request, Status Code {response.status_code}")

            soup = BeautifulSoup(response.text, "html.parser")
            business_links = soup.select("div div a")
            excluded_words = ["Sign in"]
            for business_link in business_links:
                name = business_link.get("aria-label")
                if not name or name in excluded_words or "Visit" in name:
                    continue
                maps_link = business_link.get("href")
                full_card = business_link.parent

                rating_holder = full_card.select_one("span[role='img']")
                rating = 0.0
                rating_count = 0
                if rating_holder:
                    rating_array = rating_holder.text.split("(")
                    rating = rating_array[0]
                    rating_count = int(rating_array[1].replace(")", "").replace(",", ""))

                search_data = SearchData(
                    name=name,
                    stars=rating,
                    url=maps_link,
                    rating_count=rating_count
                )
                data_pipeline.add_data(search_data)

            success = True
            logger.info(f"Successfully parsed data from: {url}")

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")


def start_scrape(keyword, location, localities, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_search_results,
            [keyword] * len(localities),
            [location] * len(localities),
            localities,
            [data_pipeline] * len(localities),
            [retries] * len(localities)
        )


def process_business(row, location, retries=3):
    url = row["url"]
    tries = 0
    success = False

    while tries <= retries and not success:
        response = requests.get(get_scrapeops_url(url, location=location))
        try:
            if response.status_code == 200:
                logger.info(f"Status: {response.status_code}")
                soup = BeautifulSoup(response.text, "html.parser")
                main_card = soup.select_one("div[role='main']")

                info_cards = soup.find_all("div", class_="MyEned")
                review_pipeline = DataPipeline(csv_filename=f"{row['name']}.csv")
                for card in info_cards:
                    review = card.text
                    full_card = card.parent.parent.parent.parent
                    reviewer_button = full_card.find("button")
                    name = reviewer_button.get("aria-label").replace("Photo of ", "")
                    rating_tag = full_card.select_one("span[role='img']")
                    stars = int(rating_tag.get("aria-label").replace(" stars", "").replace(" star", ""))
                    review_date = rating_tag.parent.find_all("span")[-1].text

                    review_data = ReviewData(
                        name=name,
                        stars=stars,
                        time_left=review_date,
                        review_shortened=review
                    )
                    review_pipeline.add_data(review_data)

                review_pipeline.close_pipeline()
                success = True
            else:
                logger.warning(f"Failed Response: {response.status_code}")
                raise Exception(f"Failed Request, status code: {response.status_code}")
        except Exception as e:
            logger.error(f"Exception thrown: {e}")
            logger.warning(f"Failed to process page: {row['url']}")
            logger.warning(f"Retries left: {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")
    else:
        logger.info(f"Successfully parsed: {row['url']}")


def process_results(csv_file, location, max_threads=5, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
            executor.map(
                process_business,
                reader,
                [location] * len(reader),
                [retries] * len(reader)
            )


if __name__ == "__main__":
    MAX_RETRIES = 3
    MAX_THREADS = 5
    LOCATION = "us"
    LOCALITIES = ["42.3,-83.5", "42.35,-83.5", "42.4,-83.5"]

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["restaurant"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")
        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, LOCATION, LOCALITIES, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")

    for file in aggregate_files:
        process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
```
You can see our `main` again below.

```python
if __name__ == "__main__":
    MAX_RETRIES = 3
    MAX_THREADS = 5
    LOCATION = "us"
    LOCALITIES = ["42.3,-83.5", "42.35,-83.5", "42.4,-83.5"]

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["restaurant"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")
        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, LOCATION, LOCALITIES, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")

    for file in aggregate_files:
        process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
```
When scraping, always pay attention to a site's terms of service and `robots.txt`. Violating these can lead to suspension and even deletion of your account. You can view these documents from Google Maps below.
If you prefer Selenium, here is a version of the same scraper built on `webdriver`. As before, create a `config.json` file with your ScrapeOps API key.

```python
import os
import csv
from selenium import webdriver
from selenium.webdriver.common.by import By
import json
import time
import logging
from urllib.parse import urlencode
import concurrent.futures
from dataclasses import dataclass, field, fields, asdict

API_KEY = ""

OPTIONS = webdriver.ChromeOptions()
OPTIONS.add_argument("--headless")
OPTIONS.add_argument("--disable-javascript")

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]


def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location,
        "wait": 5000,
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url


## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@dataclass
class SearchData:
    name: str = ""
    stars: float = 0
    url: str = ""
    rating_count: int = 0

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())


@dataclass
class ReviewData:
    name: str = ""
    stars: int = 0
    time_left: str = ""
    review_shortened: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())


class DataPipeline:
    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)
            if not file_exists:
                writer.writeheader()
            for item in data_to_save:
                writer.writerow(asdict(item))
        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if self.is_duplicate(scraped_data) == False:
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)
        if len(self.storage_queue) > 0:
            self.save_to_csv()


def scrape_search_results(keyword, location, locality, data_pipeline=None, retries=3):
    formatted_keyword = keyword.replace(" ", "+")
    url = f"https://www.google.com/maps/search/{formatted_keyword}/@{locality},14z/data=!3m1!4b1?entry=ttu"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            driver = webdriver.Chrome(options=OPTIONS)
            scrapeops_proxy_url = get_scrapeops_url(url, location=location)
            response = driver.get(scrapeops_proxy_url)

            business_links = driver.find_elements(By.CSS_SELECTOR, "div div a")
            excluded_words = ["Sign in"]
            for business_link in business_links:
                name = business_link.get_attribute("aria-label")
                if not name or name in excluded_words or "Visit" in name:
                    continue
                maps_link = business_link.get_attribute("href")
                full_card = business_link.find_element(By.XPATH, "..")

                rating_holders = full_card.find_elements(By.CSS_SELECTOR, "span[role='img'] > span")
                rating = 0.0
                rating_count = 0
                has_rating = rating_holders[0].get_attribute("innerHTML")
                if has_rating:
                    rating = has_rating
                    rating_count = rating_holders[1].get_attribute("innerHTML").replace("(", "").replace(")", "")

                search_data = SearchData(
                    name=name,
                    stars=rating,
                    url=maps_link,
                    rating_count=rating_count
                )
                data_pipeline.add_data(search_data)

            success = True
            logger.info(f"Successfully parsed data from: {url}")

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")
            tries += 1
        finally:
            driver.quit()

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")


def start_scrape(keyword, location, localities, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_search_results,
            [keyword] * len(localities),
            [location] * len(localities),
            localities,
            [data_pipeline] * len(localities),
            [retries] * len(localities)
        )


def process_business(row, location, retries=3):
    url = row["url"]
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            driver = webdriver.Chrome(options=OPTIONS)
            driver.get(get_scrapeops_url(url))

            info_cards = driver.find_elements(By.CSS_SELECTOR, "div[class='MyEned']")
            review_pipeline = DataPipeline(csv_filename=f"{row['name']}.csv")
            for card in info_cards:
                review = card.find_element(By.CSS_SELECTOR, "span").get_attribute("innerHTML")
                full_card = card.find_element(By.XPATH, "../../../..")
                reviewer_button = full_card.find_element(By.CSS_SELECTOR, "button")
                name = reviewer_button.get_attribute("aria-label").replace("Photo of ", "")
                rating_tag = full_card.find_element(By.CSS_SELECTOR, "span[role='img']")
                stars = int(rating_tag.get_attribute("aria-label").replace(" stars", "").replace(" star", ""))
                rating_parent = rating_tag.find_element(By.XPATH, "..")
                review_date = rating_parent.find_elements(By.CSS_SELECTOR, "span")[-1].get_attribute("innerHTML")

                review_data = ReviewData(
                    name=name,
                    stars=stars,
                    time_left=review_date,
                    review_shortened=review
                )
                review_pipeline.add_data(review_data)

            review_pipeline.close_pipeline()
            success = True

        except Exception as e:
            logger.error(f"Exception thrown: {e}")
            logger.warning(f"Failed to process page: {row['url']}")
            logger.warning(f"Retries left: {retries-tries}")
            tries += 1
        finally:
            driver.quit()

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")
    else:
        logger.info(f"Successfully parsed: {row['url']}")


def process_results(csv_file, location, max_threads=5, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
            executor.map(
                process_business,
                reader,
                [location] * len(reader),
                [retries] * len(reader)
            )


if __name__ == "__main__":
    MAX_RETRIES = 3
    MAX_THREADS = 5
    LOCATION = "us"
    LOCALITIES = ["42.3,-83.5"]

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["restaurant"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")
        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, LOCATION, LOCALITIES, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")

    for file in aggregate_files:
        process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
```
In the main, you can change any of the following constants to tweak your results:

- `MAX_RETRIES`: Defines the maximum number of attempts to retry a failed request during scraping or processing.
- `MAX_THREADS`: Specifies the maximum number of threads for concurrent processing of tasks.
- `LOCATION`: Determines the geographical location for the scraping requests.
- `LOCALITIES`: Provides the geographic coordinates or specific areas to focus the Google Maps search.
- `keyword_list`: Contains the list of keywords to search for on Google Maps (e.g., types of businesses such as "restaurants" or "cafes").

Here is an example of a Google Maps search URL:

`https://www.google.com/maps/search/restaurants/@42.3753166,-83.4750232,15z/data=!3m1!4b1?entry=ttu`

We're using the `search` endpoint. Our target location gets added on to the keyword endpoint. When our scraper constructs its URLs, they'll be laid out like this:

`https://www.google.com/maps/search/{formatted_keyword}/@{locality},14z/data=!3m1!4b1?entry=ttu`
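To make that layout concrete, here is a minimal sketch (the helper name is our own, not part of the scraper) that builds a search URL from a keyword and a "latitude,longitude" locality string the same way the crawler does:

```python
def build_maps_search_url(keyword: str, locality: str) -> str:
    # Spaces in the keyword become "+" signs; the locality is a "latitude,longitude" pair
    formatted_keyword = keyword.replace(" ", "+")
    return f"https://www.google.com/maps/search/{formatted_keyword}/@{locality},14z/data=!3m1!4b1?entry=ttu"


print(build_maps_search_url("mexican restaurant", "42.3,-83.5"))
# https://www.google.com/maps/search/mexican+restaurant/@42.3,-83.5,14z/data=!3m1!4b1?entry=ttu
```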
We can fetch each of these pages with a simple `driver.get()`. As you can see, the popup contains a Reviews tab. This is where our individual reviews will come from.

Our business links are `a` tags descended from two `div` objects, so our link selector is `div div a`. Each review sits inside a `div` with the class of `MyEned`. Our review selector will look like this: `div[class='MyEned']`.
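These selectors can drift whenever Google updates its markup, so it's worth a quick, standalone check before building the full scraper. A minimal sketch (no proxy; results will vary with what Maps chooses to render for your IP):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
try:
    # A search results page; business links are `a` tags nested under two `div`s
    driver.get("https://www.google.com/maps/search/restaurant/@42.3,-83.5,14z/data=!3m1!4b1?entry=ttu")
    links = driver.find_elements(By.CSS_SELECTOR, "div div a")
    named = [link.get_attribute("aria-label") for link in links if link.get_attribute("aria-label")]
    print(f"Found {len(links)} candidate links, {len(named)} with aria-labels")
    # The review selector, div[class='MyEned'], only matches on an individual business page
finally:
    driver.quit()
```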
To control our location through the ScrapeOps Proxy Aggregator, we pass a country code such as `us` into our `country` param. You can view a list of country codes below.

Country | Country Code |
---|---|
Brazil | br |
Canada | ca |
China | cn |
India | in |
Italy | it |
Japan | jp |
France | fr |
Germany | de |
Russia | ru |
Spain | es |
United States | us |
United Kingdom | uk |
Create a new project folder, then `cd` into the folder. Inside the folder, create a virtual environment, activate it, and install Selenium:

```bash
mkdir google-reviews-selenium
cd google-reviews-selenium
python -m venv venv
source venv/bin/activate
pip install selenium
```
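Selenium 4.6 and newer ships with Selenium Manager, which downloads a matching chromedriver automatically as long as Chrome itself is installed. If you'd like to confirm the install before going further, a quick check like this works:

```python
import selenium

# Expect something in the 4.x range; 4.6+ handles the chromedriver download for you
print(selenium.__version__)
```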
`start_scrape()` is used to invoke our actual parsing function, `scrape_search_results()`. We have some basic retry logic for the parser and some configuration variables inside of our `main`. The `main`
holds the actual runtime of the programimport osimport csvfrom selenium import webdriverfrom selenium.webdriver.common.by import Byimport jsonimport loggingfrom urllib.parse import urlencodeimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" OPTIONS = webdriver.ChromeOptions()OPTIONS.add_argument("--headless")OPTIONS.add_argument("--disable-javascript") with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) def scrape_search_results(keyword, location, locality, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://www.google.com/maps/search/{formatted_keyword}/@{locality},14z/data=!3m1!4b1?entry=ttu" tries = 0 success = False while tries <= retries and not success: try: driver = webdriver.Chrome(options=OPTIONS) response = driver.get(url) business_links = driver.find_elements(By.CSS_SELECTOR, "div div a") excluded_words = ["Sign in"] for business_link in business_links: name = business_link.get_attribute("aria-label") if not name or name in excluded_words or "Visit" in name: continue maps_link = business_link.get_attribute("href") full_card = business_link.find_element(By.XPATH, "..") rating_holders = full_card.find_elements(By.CSS_SELECTOR, "span[role='img'] > span") rating = 0.0 rating_count = 0 has_rating = rating_holders[0].get_attribute("innerHTML") if has_rating: rating = has_rating rating_count = rating_holders[1].get_attribute("innerHTML").replace("(", "").replace(")", "") search_data = { "name": name, "stars": rating, "url": maps_link, "rating_count": rating_count } print(search_data) success = True logger.info(f"Successfully parsed data from: {url}") except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries += 1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, location, localities, retries=3): for locality in localities: scrape_search_results(keyword, location, locality, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "us" LOCALITIES = ["42.3,-83.5"] logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["restaurant"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") start_scrape(keyword, LOCATION, LOCALITIES, retries=MAX_RETRIES) aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.")
If you take a look at `scrape_search_results()`, you'll see the initial design we talked about in the understanding section.

- `driver.find_elements(By.CSS_SELECTOR, "div div a")` finds and returns all the `a` elements descended from two `div` elements on the page. This is one of the selectors we wrote earlier.
- We use `get_attribute()` to extract each restaurant name from the `aria-label` attribute.
- We pull each business link from its `href`.
- To get the `full_card` containing all of our data, we need to find the parent element of the business link. We use its XPath to do this: `business_link.find_element(By.XPATH, "..")`.
- Our ratings are held in `span` elements descended from another `span` with the `role` of `img`. Here is our selector: `span[role='img'] > span`. If there are ratings present, we extract both the `rating` and `rating_count` using the `innerHTML` from these items. With Selenium, the `text` method only reliably shows text that's displayed on the page. Instead of scrolling and waiting for this information to populate, we simply pull it from the HTML immediately.

At the moment, each result is just a `dict` object. `dict` is great for prototyping, but it doesn't always cover edge cases due to weak typing. `dict` allows for things like missing fields that could potentially corrupt our data. To handle this, we'll create a `SearchData` class. Then, we need a pipeline to pass all of these objects into our CSV file. For this, we'll create a `DataPipeline` class.

Take a look at `SearchData`. Nothing fancy, just the fields we extracted earlier coupled with a `check_string_fields()` method to ensure that we don't have any missing fields.

```python
@dataclass
class SearchData:
    name: str = ""
    stars: float = 0
    url: str = ""
    rating_count: int = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())
```
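As a quick usage example of the cleanup above (run after the class definition), an empty string gets a default label and padded strings get stripped:

```python
example = SearchData(name="", stars=4.5, url="  https://www.google.com/maps/place/example  ", rating_count=12)
print(example.name)  # No name
print(example.url)   # https://www.google.com/maps/place/example
```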
Next comes `DataPipeline`. This class actually does most of the heavy lifting. It comes with a variety of methods, all of which are very important to our actual storage.

To summarize it, our `SearchData` gets held inside the `storage_queue`. We can add new objects to the queue using the `add_data()` method. When `close_pipeline()` is invoked, the entire queue gets saved to a CSV file with `save_to_csv()`. One thing to watch: `close_pipeline()` calls `time.sleep()`, so the script also needs `import time` at the top.

```python
class DataPipeline:

    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)

            if not file_exists:
                writer.writeheader()

            for item in data_to_save:
                writer.writerow(asdict(item))

        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if self.is_duplicate(scraped_data) == False:
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)  # requires `import time` at the top of the script
        if len(self.storage_queue) > 0:
            self.save_to_csv()
```
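Here is a minimal usage sketch (it assumes `SearchData`, `DataPipeline`, and the imports from the script above are already defined; the filename and business names are just examples):

```python
pipeline = DataPipeline(csv_filename="example-restaurants.csv", storage_queue_limit=2)

pipeline.add_data(SearchData(name="Taqueria Uno", stars=4.5, url="https://www.google.com/maps/place/1", rating_count=120))
pipeline.add_data(SearchData(name="Taqueria Uno", stars=4.5, url="https://www.google.com/maps/place/1", rating_count=120))  # duplicate name, gets dropped
pipeline.add_data(SearchData(name="Pizza Due", stars=4.0, url="https://www.google.com/maps/place/2", rating_count=85))

pipeline.close_pipeline()  # flushes whatever is left in the queue to example-restaurants.csv
```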
import osimport csvfrom selenium import webdriverfrom selenium.webdriver.common.by import Byimport jsonimport loggingfrom urllib.parse import urlencodeimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" OPTIONS = webdriver.ChromeOptions()OPTIONS.add_argument("--headless")OPTIONS.add_argument("--disable-javascript") with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" stars: float = 0 url: str = "" rating_count: int = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, locality, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://www.google.com/maps/search/{formatted_keyword}/@{locality},14z/data=!3m1!4b1?entry=ttu" tries = 0 success = False while tries <= retries and not success: try: driver = webdriver.Chrome(options=OPTIONS) response = driver.get(url) business_links = driver.find_elements(By.CSS_SELECTOR, "div div a") excluded_words = ["Sign in"] for business_link in business_links: name = business_link.get_attribute("aria-label") if not name or name in excluded_words or "Visit" in name: continue maps_link = business_link.get_attribute("href") full_card = business_link.find_element(By.XPATH, "..") rating_holders = full_card.find_elements(By.CSS_SELECTOR, "span[role='img'] > span") rating = 0.0 rating_count = 0 has_rating = rating_holders[0].get_attribute("innerHTML") if has_rating: rating = has_rating rating_count = rating_holders[1].get_attribute("innerHTML").replace("(", "").replace(")", "") search_data = SearchData( name=name, stars=rating, url=maps_link, rating_count=rating_count ) data_pipeline.add_data(search_data) success = True logger.info(f"Successfully parsed data from: {url}") except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries += 1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, location, localities, data_pipeline=None, retries=3): for locality in localities: scrape_search_results(keyword, location, locality, data_pipeline=data_pipeline, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "us" LOCALITIES = ["42.3,-83.5"] logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["restaurant"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, LOCATION, LOCALITIES, data_pipeline=crawl_pipeline, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.")
At the moment, `start_scrape()` is used to trigger our parser on an array of localities using a `for` loop. This gets the job done, but it processes these localities one at a time. Let's rewrite it using `ThreadPoolExecutor`.

Our new version opens a pool of threads and runs `scrape_search_results` on each thread simultaneously. It may look intimidating, but this function is much simpler than you might think. We pass lists of args into `ThreadPoolExecutor`, and it then takes each arg from each list and passes it into an individual instance of our target function (there is a small standalone demo of this pattern after the full code below).

```python
def start_scrape(keyword, location, localities, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_search_results,
            [keyword] * len(localities),
            [location] * len(localities),
            localities,
            [data_pipeline] * len(localities),
            [retries] * len(localities)
        )
```
- `scrape_search_results`: The function we wish to run multiple instances of.
- `[keyword] * len(localities)`, `[location] * len(localities)`, `[data_pipeline] * len(localities)`, `[retries] * len(localities)`: All of these get passed in as arrays the length of our `localities` list.
- `localities`
: The list we actually want to process.import osimport csvfrom selenium import webdriverfrom selenium.webdriver.common.by import Byimport jsonimport loggingfrom urllib.parse import urlencodeimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" OPTIONS = webdriver.ChromeOptions()OPTIONS.add_argument("--headless")OPTIONS.add_argument("--disable-javascript") with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" stars: float = 0 url: str = "" rating_count: int = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, locality, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://www.google.com/maps/search/{formatted_keyword}/@{locality},14z/data=!3m1!4b1?entry=ttu" tries = 0 success = False while tries <= retries and not success: try: driver = webdriver.Chrome(options=OPTIONS) response = driver.get(url) business_links = driver.find_elements(By.CSS_SELECTOR, "div div a") excluded_words = ["Sign in"] for business_link in business_links: name = business_link.get_attribute("aria-label") if not name or name in excluded_words or "Visit" in name: continue maps_link = business_link.get_attribute("href") full_card = business_link.find_element(By.XPATH, "..") rating_holders = full_card.find_elements(By.CSS_SELECTOR, "span[role='img'] > span") rating = 0.0 rating_count = 0 has_rating = rating_holders[0].get_attribute("innerHTML") if has_rating: rating = has_rating rating_count = rating_holders[1].get_attribute("innerHTML").replace("(", "").replace(")", "") search_data = SearchData( name=name, stars=rating, url=maps_link, rating_count=rating_count ) data_pipeline.add_data(search_data) success = True logger.info(f"Successfully parsed data from: {url}") except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries += 1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, location, localities, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, [keyword] * len(localities), [location] * len(localities), localities, [data_pipeline] * len(localities), [retries] * len(localities) ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "us" LOCALITIES = ["42.3,-83.5"] logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["restaurant"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, LOCATION, LOCALITIES, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.")
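If the repeated-list pattern in `executor.map()` looks odd, here is a tiny standalone demo of how it lines the arguments up (the function and values here are invented purely for illustration):

```python
import concurrent.futures

def greet(greeting, name):
    return f"{greeting}, {name}!"

names = ["Detroit", "Novi", "Livonia"]

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    # Calls greet("Hello", "Detroit"), greet("Hello", "Novi"), greet("Hello", "Livonia") concurrently
    results = list(executor.map(greet, ["Hello"] * len(names), names))

print(results)  # ['Hello, Detroit!', 'Hello, Novi!', 'Hello, Livonia!']
```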
Our proxy function takes in a standard set of parameters (`api_key`, `url`, and `country`). In this particular case, we need to add another parameter as well: `wait`. `wait` tells Proxy Aggregator to wait an arbitrary amount of time for content to render. We then use URL encoding to wrap all of these parameters into a ScrapeOps proxied url.

```python
def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location,
        "wait": 5000,
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url
```
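To see exactly what gets sent to the proxy, you can print one of these URLs out. The sketch below is self-contained, and the API key is a placeholder:

```python
from urllib.parse import urlencode

API_KEY = "your-super-secret-api-key"  # placeholder, loaded from config.json in the real script

def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location,
        "wait": 5000,  # ask the proxy to wait up to 5 seconds for the page to render
    }
    return "https://proxy.scrapeops.io/v1/?" + urlencode(payload)

print(get_scrapeops_url("https://www.google.com/maps/search/restaurant/@42.3,-83.5,14z/data=!3m1!4b1?entry=ttu"))
# https://proxy.scrapeops.io/v1/?api_key=...&url=https%3A%2F%2Fwww.google.com%2Fmaps%2F...&country=us&wait=5000
```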
import osimport csvfrom selenium import webdriverfrom selenium.webdriver.common.by import Byimport jsonimport loggingfrom urllib.parse import urlencodeimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" OPTIONS = webdriver.ChromeOptions()OPTIONS.add_argument("--headless")OPTIONS.add_argument("--disable-javascript") with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, "wait": 5000, } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" stars: float = 0 url: str = "" rating_count: int = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, locality, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://www.google.com/maps/search/{formatted_keyword}/@{locality},14z/data=!3m1!4b1?entry=ttu" tries = 0 success = False while tries <= retries and not success: try: driver = webdriver.Chrome(options=OPTIONS) scrapeops_proxy_url = get_scrapeops_url(url, location=location) response = driver.get(scrapeops_proxy_url) business_links = driver.find_elements(By.CSS_SELECTOR, "div div a") excluded_words = ["Sign in"] for business_link in business_links: name = business_link.get_attribute("aria-label") if not name or name in excluded_words or "Visit" in name: continue maps_link = business_link.get_attribute("href") full_card = business_link.find_element(By.XPATH, "..") rating_holders = full_card.find_elements(By.CSS_SELECTOR, "span[role='img'] > span") rating = 0.0 rating_count = 0 has_rating = rating_holders[0].get_attribute("innerHTML") if has_rating: rating = has_rating rating_count = rating_holders[1].get_attribute("innerHTML").replace("(", "").replace(")", "") search_data = SearchData( name=name, stars=rating, url=maps_link, rating_count=rating_count ) data_pipeline.add_data(search_data) success = True logger.info(f"Successfully parsed data from: {url}") except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries += 1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, location, localities, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, [keyword] * len(localities), [location] * len(localities), localities, [data_pipeline] * len(localities), [retries] * len(localities) ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "us" LOCALITIES = ["42.3,-83.5"] logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["restaurant"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, LOCATION, LOCALITIES, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.")
You can see our updated `main`
below. Feel free to change the configuration variables to better fit your results.if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "us" LOCALITIES = ["42.3,-83.5"] logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["restaurant"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, LOCATION, LOCALITIES, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.")
```python
def process_business(row, location, retries=3):
    url = row["url"]
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            driver = webdriver.Chrome(options=OPTIONS)
            driver.get(url)

            info_cards = driver.find_elements(By.CSS_SELECTOR, "div[class='MyEned']")
            for card in info_cards:
                review = card.find_element(By.CSS_SELECTOR, "span").get_attribute("innerHTML")
                full_card = card.find_element(By.XPATH, "../../../..")
                reviewer_button = full_card.find_element(By.CSS_SELECTOR, "button")
                name = reviewer_button.get_attribute("aria-label").replace("Photo of ", "")
                rating_tag = full_card.find_element(By.CSS_SELECTOR, "span[role='img']")
                stars = int(rating_tag.get_attribute("aria-label").replace(" stars", "").replace(" star", ""))
                rating_parent = rating_tag.find_element(By.XPATH, "..")
                review_date = rating_parent.find_elements(By.CSS_SELECTOR, "span")[-1].get_attribute("innerHTML")

                review_data = {
                    "name": name,
                    "stars": stars,
                    "time_left": review_date,
                    "review_shortened": review
                }
                print(review_data)

            success = True

        except Exception as e:
            logger.error(f"Exception thrown: {e}")
            logger.warning(f"Failed to process page: {row['url']}")
            logger.warning(f"Retries left: {retries-tries}")
            tries += 1
        finally:
            driver.quit()

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")
    else:
        logger.info(f"Successfully parsed: {row['url']}")
```
- `card.find_element(By.CSS_SELECTOR, "span").get_attribute("innerHTML")` finds our actual review. Since it's not visible on the page, we need to extract its `innerHTML` instead of its `text`.
- `card.find_element(By.XPATH, "../../../..")` finds the full element holding our review. It's the parent of the parent of the parent of the parent of our `card` element... the "great-great-grandparent", if you will (there's a short XPath demo after the snippet below).
- We find the `button` element and extract the reviewer's name from its `aria-label`.
- Our ratings are held in `span` elements containing the `rating` and `rating_count`.
- We pull the rating from the `aria-label` of its holder element.
- We pull the `rating_count` from the text of its holder as well. Since the element isn't visible on the page, we once again use its `innerHTML` as opposed to its actual `text`.

`process_results()` will be used to trigger our scraping function. This function opens the CSV file and reads the rows into an array of `dict` objects. Each row then gets passed into `process_business()`.

```python
def process_results(csv_file, location, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        for row in reader:
            process_business(row, location, retries=retries)
```
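Climbing ancestors with repeated `..` steps is easy to get off by one, so here is a tiny, self-contained illustration (the markup is a made-up data URL, not Google's):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("--headless")

# Three nested divs around a span, so "../../.." climbs from the span to the outermost div
page = "data:text/html,<div id='outer'><div><div><span id='start'>hi</span></div></div></div>"

driver = webdriver.Chrome(options=options)
try:
    driver.get(page)
    start = driver.find_element(By.ID, "start")
    ancestor = start.find_element(By.XPATH, "../../..")
    print(ancestor.get_attribute("id"))  # outer
finally:
    driver.quit()
```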
import osimport csvfrom selenium import webdriverfrom selenium.webdriver.common.by import Byimport jsonimport loggingfrom urllib.parse import urlencodeimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" OPTIONS = webdriver.ChromeOptions()OPTIONS.add_argument("--headless")OPTIONS.add_argument("--disable-javascript") with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, "wait": 5000, } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" stars: float = 0 url: str = "" rating_count: int = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, locality, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://www.google.com/maps/search/{formatted_keyword}/@{locality},14z/data=!3m1!4b1?entry=ttu" tries = 0 success = False while tries <= retries and not success: try: driver = webdriver.Chrome(options=OPTIONS) scrapeops_proxy_url = get_scrapeops_url(url, location=location) response = driver.get(scrapeops_proxy_url) business_links = driver.find_elements(By.CSS_SELECTOR, "div div a") excluded_words = ["Sign in"] for business_link in business_links: name = business_link.get_attribute("aria-label") if not name or name in excluded_words or "Visit" in name: continue maps_link = business_link.get_attribute("href") full_card = business_link.find_element(By.XPATH, "..") rating_holders = full_card.find_elements(By.CSS_SELECTOR, "span[role='img'] > span") rating = 0.0 rating_count = 0 has_rating = rating_holders[0].get_attribute("innerHTML") if has_rating: rating = has_rating rating_count = rating_holders[1].get_attribute("innerHTML").replace("(", "").replace(")", "") search_data = SearchData( name=name, stars=rating, url=maps_link, rating_count=rating_count ) data_pipeline.add_data(search_data) success = True logger.info(f"Successfully parsed data from: {url}") except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries += 1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, location, localities, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, [keyword] * len(localities), [location] * len(localities), localities, [data_pipeline] * len(localities), [retries] * len(localities) ) def process_business(row, location, retries=3): url = row["url"] tries = 0 success = False while tries <= retries and not success: try: driver = webdriver.Chrome(options=OPTIONS) driver.get(url) info_cards = driver.find_elements(By.CSS_SELECTOR, "div[class='MyEned']") for card in info_cards: review = card.find_element(By.CSS_SELECTOR, "span").get_attribute("innerHTML") full_card = card.find_element(By.XPATH, "../../../..") reviewer_button = full_card.find_element(By.CSS_SELECTOR, "button") name = reviewer_button.get_attribute("aria-label").replace("Photo of ", "") rating_tag = full_card.find_element(By.CSS_SELECTOR, "span[role='img']") stars = int(rating_tag.get_attribute("aria-label").replace(" stars", "").replace(" star", "")) rating_parent = rating_tag.find_element(By.XPATH, "..") review_date = rating_parent.find_elements(By.CSS_SELECTOR, "span")[-1].get_attribute("innerHTML") review_data = { "name": name, "stars": stars, "time_left": review_date, "review_shortened": review } print(review_data) success = True except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {row['url']}") logger.warning(f"Retries left: {retries-tries}") tries 
+= 1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['url']}") def process_results(csv_file, location, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) for row in reader: process_business(row, location, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "us" LOCALITIES = ["42.3,-83.5"] logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["restaurant"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, LOCATION, LOCALITIES, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") for file in aggregate_files: process_results(file, LOCATION, retries=MAX_RETRIES)
We already have a `DataPipeline` class that takes in `dataclass` objects. Once again, we need to convert our extracted `dict` into a more strongly typed object.

In the snippet below, we create a `ReviewData` object. It uses the same methods for dealing with bad data; we just have some different fields: `name`, `stars`, `time_left`, and `review_shortened`.

- `time_left`: The time the review was left. Google doesn't always give us specific dates; sometimes they just give us a general time frame such as "2 months ago".
- `review_shortened`: This is the shortened version of the review that gets displayed when we look at the business popup.

```python
@dataclass
class ReviewData:
    name: str = ""
    stars: int = 0
    time_left: str = ""
    review_shortened: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())
```
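As a quick example of the cleanup this buys us (assuming `ReviewData` is defined as above, with sample values standing in for scraped ones):

```python
label = "4 stars"  # sample aria-label text from a rating element

review = ReviewData(
    name="Photo of Jane Doe".replace("Photo of ", ""),
    stars=int(label.replace(" stars", "").replace(" star", "")),
    time_left="2 months ago",
    review_shortened=""
)

print(review.name)              # Jane Doe
print(review.stars)             # 4
print(review.review_shortened)  # No review_shortened
```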
process_results()
. We then use our extracted data to create a ReviewData
object. This object then gets passed into the pipeline. Once we've parsed the reviews, we close the pipeline.import osimport csvfrom selenium import webdriverfrom selenium.webdriver.common.by import Byimport jsonimport loggingfrom urllib.parse import urlencodeimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" OPTIONS = webdriver.ChromeOptions()OPTIONS.add_argument("--headless")OPTIONS.add_argument("--disable-javascript") with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, "wait": 5000, } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" stars: float = 0 url: str = "" rating_count: int = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) @dataclassclass ReviewData: name: str = "" stars: int = 0 time_left: str = "" review_shortened: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, locality, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://www.google.com/maps/search/{formatted_keyword}/@{locality},14z/data=!3m1!4b1?entry=ttu" tries = 0 success = False while tries <= retries and not success: try: driver = webdriver.Chrome(options=OPTIONS) scrapeops_proxy_url = get_scrapeops_url(url, location=location) response = driver.get(scrapeops_proxy_url) business_links = driver.find_elements(By.CSS_SELECTOR, "div div a") excluded_words = ["Sign in"] for business_link in business_links: name = business_link.get_attribute("aria-label") if not name or name in excluded_words or "Visit" in name: continue maps_link = business_link.get_attribute("href") full_card = business_link.find_element(By.XPATH, "..") rating_holders = full_card.find_elements(By.CSS_SELECTOR, "span[role='img'] > span") rating = 0.0 rating_count = 0 has_rating = rating_holders[0].get_attribute("innerHTML") if has_rating: rating = has_rating rating_count = rating_holders[1].get_attribute("innerHTML").replace("(", "").replace(")", "") search_data = SearchData( name=name, stars=rating, url=maps_link, rating_count=rating_count ) data_pipeline.add_data(search_data) success = True logger.info(f"Successfully parsed data from: {url}") except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries += 1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, location, localities, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, [keyword] * len(localities), [location] * len(localities), localities, [data_pipeline] * len(localities), [retries] * len(localities) ) def process_business(row, location, retries=3): url = row["url"] tries = 0 success = False while tries <= retries and not success: try: driver = webdriver.Chrome(options=OPTIONS) driver.get(url) info_cards = driver.find_elements(By.CSS_SELECTOR, "div[class='MyEned']") review_pipeline = DataPipeline(csv_filename=f"{row['name']}.csv") for card in info_cards: review = card.find_element(By.CSS_SELECTOR, "span").get_attribute("innerHTML") full_card = card.find_element(By.XPATH, "../../../..") reviewer_button = full_card.find_element(By.CSS_SELECTOR, "button") name = reviewer_button.get_attribute("aria-label").replace("Photo of ", "") rating_tag = full_card.find_element(By.CSS_SELECTOR, "span[role='img']") stars = int(rating_tag.get_attribute("aria-label").replace(" stars", "").replace(" star", "")) rating_parent = rating_tag.find_element(By.XPATH, "..") review_date = rating_parent.find_elements(By.CSS_SELECTOR, "span")[-1].get_attribute("innerHTML") review_data = ReviewData( name=name, stars=stars, time_left=review_date, review_shortened=review ) review_pipeline.add_data(review_data) review_pipeline.close_pipeline() success = True except Exception as e: logger.error(f"Exception thrown: 
{e}") logger.warning(f"Failed to process page: {row['url']}") logger.warning(f"Retries left: {retries-tries}") tries += 1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['url']}") def process_results(csv_file, location, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) for row in reader: process_business(row, location, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "us" LOCALITIES = ["42.3,-83.5"] logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["restaurant"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, LOCATION, LOCALITIES, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") for file in aggregate_files: process_results(file, LOCATION, retries=MAX_RETRIES)
Just like we did with the crawler, we'll now use `ThreadPoolExecutor` to run our parsing function on multiple threads.

Take a look at our rewritten version below. Just like earlier, our first argument is the function we want to call: `process_business`. Next, we pass in our CSV file rows: `reader`. All other args once again get passed in as arrays the length of the list we want to process.

```python
def process_results(csv_file, location, max_threads=5, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
            executor.map(
                process_business,
                reader,
                [location] * len(reader),
                [retries] * len(reader)
            )
```
To add proxy support, we only need to change the `driver.get()` line of the parsing function. This single change finishes all of our coding for this project.

```python
driver.get(get_scrapeops_url(url))
```
import osimport csvfrom selenium import webdriverfrom selenium.webdriver.common.by import Byimport jsonimport loggingfrom urllib.parse import urlencodeimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" OPTIONS = webdriver.ChromeOptions()OPTIONS.add_argument("--headless")OPTIONS.add_argument("--disable-javascript") with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, "wait": 5000, } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" stars: float = 0 url: str = "" rating_count: int = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) @dataclassclass ReviewData: name: str = "" stars: int = 0 time_left: str = "" review_shortened: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, locality, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://www.google.com/maps/search/{formatted_keyword}/@{locality},14z/data=!3m1!4b1?entry=ttu" tries = 0 success = False while tries <= retries and not success: try: driver = webdriver.Chrome(options=OPTIONS) scrapeops_proxy_url = get_scrapeops_url(url, location=location) response = driver.get(scrapeops_proxy_url) business_links = driver.find_elements(By.CSS_SELECTOR, "div div a") excluded_words = ["Sign in"] for business_link in business_links: name = business_link.get_attribute("aria-label") if not name or name in excluded_words or "Visit" in name: continue maps_link = business_link.get_attribute("href") full_card = business_link.find_element(By.XPATH, "..") rating_holders = full_card.find_elements(By.CSS_SELECTOR, "span[role='img'] > span") rating = 0.0 rating_count = 0 has_rating = rating_holders[0].get_attribute("innerHTML") if has_rating: rating = has_rating rating_count = rating_holders[1].get_attribute("innerHTML").replace("(", "").replace(")", "") search_data = SearchData( name=name, stars=rating, url=maps_link, rating_count=rating_count ) data_pipeline.add_data(search_data) success = True logger.info(f"Successfully parsed data from: {url}") except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries += 1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, location, localities, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, [keyword] * len(localities), [location] * len(localities), localities, [data_pipeline] * len(localities), [retries] * len(localities) ) def process_business(row, location, retries=3): url = row["url"] tries = 0 success = False while tries <= retries and not success: try: driver = webdriver.Chrome(options=OPTIONS) driver.get(get_scrapeops_url(url)) info_cards = driver.find_elements(By.CSS_SELECTOR, "div[class='MyEned']") review_pipeline = DataPipeline(csv_filename=f"{row['name']}.csv") for card in info_cards: review = card.find_element(By.CSS_SELECTOR, "span").get_attribute("innerHTML") full_card = card.find_element(By.XPATH, "../../../..") reviewer_button = full_card.find_element(By.CSS_SELECTOR, "button") name = reviewer_button.get_attribute("aria-label").replace("Photo of ", "") rating_tag = full_card.find_element(By.CSS_SELECTOR, "span[role='img']") stars = int(rating_tag.get_attribute("aria-label").replace(" stars", "").replace(" star", "")) rating_parent = rating_tag.find_element(By.XPATH, "..") review_date = rating_parent.find_elements(By.CSS_SELECTOR, "span")[-1].get_attribute("innerHTML") review_data = ReviewData( name=name, stars=stars, time_left=review_date, review_shortened=review ) review_pipeline.add_data(review_data) review_pipeline.close_pipeline() success = True except Exception as e: 
logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {row['url']}") logger.warning(f"Retries left: {retries-tries}") tries += 1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['url']}") def process_results(csv_file, location, max_threads=5, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( process_business, reader, [location] * len(reader), [retries] * len(reader) ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "us" LOCALITIES = ["42.3,-83.5"] logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["restaurant"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, LOCATION, LOCALITIES, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") for file in aggregate_files: process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
Here is our finished `main`
. As always, feel free to change any of the config variables.if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "us" LOCALITIES = ["42.3,-83.5"] logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["restaurant"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, LOCATION, LOCALITIES, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") for file in aggregate_files: process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
Because of our `wait` parameter, this is just about as fast as it can get.