Then check out ScrapeOps, the complete toolkit for web scraping.
To run the scraper below, create a new project folder with a config.json
file inside. Add your ScrapeOps API key to the config file:

{"api_key": "your-super-secret-api-key"}

Then copy the code into a Python file and run it with:

python name_of_your_file.py
import os
import csv
import requests
import json
import logging
import time
from urllib.parse import urlencode
from bs4 import BeautifulSoup
import concurrent.futures
from dataclasses import dataclass, field, fields, asdict

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]


def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location,
        "wait": 5000,
        "residential": True
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url


## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@dataclass
class SearchData:
    name: str = ""
    stars: float = 0
    url: str = ""
    publisher: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())


@dataclass
class ReviewData:
    name: str = ""
    date: str = ""
    stars: int = 0
    description: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())


class DataPipeline:
    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)

            if not file_exists:
                writer.writeheader()

            for item in data_to_save:
                writer.writerow(asdict(item))

        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if self.is_duplicate(scraped_data) == False:
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)
        if len(self.storage_queue) > 0:
            self.save_to_csv()


def scrape_search_results(keyword, location, data_pipeline=None, retries=3):
    formatted_keyword = keyword.replace(" ", "+")
    url = f"https://play.google.com/store/search?q={formatted_keyword}&c=apps"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            scrapeops_proxy_url = get_scrapeops_url(url, location=location)
            response = requests.get(scrapeops_proxy_url)
            logger.info(f"Received [{response.status_code}] from: {url}")
            if response.status_code != 200:
                raise Exception(f"Failed request, Status Code {response.status_code}")

            ## Extract Data
            soup = BeautifulSoup(response.text, "html.parser")
            div_cards = soup.select("div[role='listitem']")

            Excluded_words = ["Apps & games", "Movies & TV", "Books"]
            for div_card in div_cards:
                if div_card.text in Excluded_words:
                    continue

                info_rows = div_card.select("div div span")

                name = info_rows[1].text
                publisher = info_rows[2].text
                href = div_card.find("a").get("href")
                link = f"https://play.google.com{href}"
                rating = 0.0
                if info_rows[3].text != None:
                    rating = info_rows[3].text

                search_data = SearchData(
                    name=name,
                    stars=rating,
                    url=link,
                    publisher=publisher
                )

                data_pipeline.add_data(search_data)

            logger.info(f"Successfully parsed data from: {url}")
            success = True

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")


def start_scrape(keywords, location, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_search_results,
            keywords,
            [location] * len(keywords),
            [data_pipeline] * len(keywords),
            [retries] * len(keywords)
        )


def process_app(row, location, retries=3):
    url = row["url"]
    tries = 0
    success = False

    while tries <= retries and not success:
        response = requests.get(get_scrapeops_url(url, location=location))
        try:
            if response.status_code == 200:
                logger.info(f"Status: {response.status_code}")

                soup = BeautifulSoup(response.text, "html.parser")
                review_container = soup.select_one("div[data-g-id='reviews']")
                review_headers = review_container.find_all("header")

                review_pipeline = DataPipeline(csv_filename=f"{row['name'].replace(' ', '-')}.csv")

                for review in review_headers:
                    stars = len(review.find_all("svg"))
                    card = review.parent
                    divs = card.select("div div div div div")
                    name = divs[1].text
                    date = divs[10].text
                    description = divs[12].text

                    review_data = ReviewData(
                        name=name,
                        date=date,
                        stars=stars,
                        description=description
                    )
                    review_pipeline.add_data(review_data)

                review_pipeline.close_pipeline()
                success = True

            else:
                logger.warning(f"Failed Response: {response.status_code}")
                raise Exception(f"Failed Request, status code: {response.status_code}")

        except Exception as e:
            logger.error(f"Exception thrown: {e}")
            logger.warning(f"Failed to process page: {row['url']}")
            logger.warning(f"Retries left: {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")
    else:
        logger.info(f"Successfully parsed: {row['url']}")


def process_results(csv_file, location, max_threads=5, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
            executor.map(
                process_app,
                reader,
                [location] * len(reader),
                [retries] * len(reader)
            )


if __name__ == "__main__":

    MAX_RETRIES = 3
    MAX_THREADS = 5
    LOCATION = "us"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["crypto wallet", "web3 wallet"]
    aggregate_files = []

    ## Job Processes
    filename = "report.csv"
    crawl_pipeline = DataPipeline(csv_filename=filename)
    start_scrape(keyword_list, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
    crawl_pipeline.close_pipeline()
    logger.info(f"Crawl complete.")

    logger.info("Starting scrape...")
    process_results(filename, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
    logger.info("Scrape Complete")
To change your results, you can alter any of the following constants from main:

MAX_RETRIES: Sets the maximum number of retry attempts the script will make if a request fails.
MAX_THREADS: Sets the maximum number of threads (or concurrent tasks) that the script will use when scraping data.
LOCATION: Specifies the country code for the location from which you want to simulate the scraping requests.
keyword_list: A list of keywords or phrases that the script will use to search for listings on the store.

When we search the Google Play Store, our URLs are laid out like this:

https://play.google.com/store/search?q={keyword}&c=apps
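For example, here is a quick sketch of how a keyword from our list turns into one of these search URLs (the keyword is just an example value):

keyword = "crypto wallet"
formatted_keyword = keyword.replace(" ", "+")  # spaces become "+"
url = f"https://play.google.com/store/search?q={formatted_keyword}&c=apps"
print(url)
# https://play.google.com/store/search?q=crypto+wallet&c=apps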
Each item in the search results sits in a div with a role of listitem. When we search for this item, we'll be using the CSS selector div[role='listitem']. From there, we can pull all of the data we need.

Reviews are held inside a div container with a data-g-id of reviews. From within this container, we're going to go through and pull all of our reviews.

With the ScrapeOps Proxy API, we can control our geolocation using the country param. When we choose a country, ScrapeOps will route us through a server within that country. To appear in the US, for example, we pass "country": "us".

We also use another parameter, residential. This one is a boolean. If we set "residential": True, ScrapeOps will assign us a residential IP address, which greatly decreases our likelihood of getting blocked.

Start by creating a new project folder and moving into it:

mkdir google-play-scraper
cd google-play-scraper
python -m venv venv
source venv/bin/activate
pip install requests
pip install beautifulsoup4
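If you'd like to confirm the environment is ready before going further, a quick smoke test like the one below should run without errors (the URL is just a placeholder; any reachable page will do):

import requests
from bs4 import BeautifulSoup

# Fetch a page and parse it to confirm both packages are installed and working.
response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")
print(response.status_code, soup.title.text)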
Our first iteration is built around a single parsing function, scrape_search_results(). Here is our starter script.

import os
import csv
import requests
import json
import logging
import time
from urllib.parse import urlencode
from bs4 import BeautifulSoup
import concurrent.futures
from dataclasses import dataclass, field, fields, asdict

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]


## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@dataclass
class SearchData:
    name: str = ""
    stars: float = 0
    url: str = ""
    publisher: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())


class DataPipeline:
    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)

            if not file_exists:
                writer.writeheader()

            for item in data_to_save:
                writer.writerow(asdict(item))

        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if self.is_duplicate(scraped_data) == False:
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)
        if len(self.storage_queue) > 0:
            self.save_to_csv()


def scrape_search_results(keyword, location, retries=3):
    formatted_keyword = keyword.replace(" ", "+")
    url = f"https://play.google.com/store/search?q={formatted_keyword}&c=apps"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            response = requests.get(url)
            logger.info(f"Received [{response.status_code}] from: {url}")
            if response.status_code != 200:
                raise Exception(f"Failed request, Status Code {response.status_code}")

            ## Extract Data
            soup = BeautifulSoup(response.text, "html.parser")
            div_cards = soup.select("div[role='listitem']")

            Excluded_words = ["Apps & games", "Movies & TV", "Books"]
            for div_card in div_cards:
                if div_card.text in Excluded_words:
                    continue

                info_rows = div_card.select("div div span")

                name = info_rows[1].text
                publisher = info_rows[2].text
                href = div_card.find("a").get("href")
                link = f"https://play.google.com{href}"
                rating = 0.0
                if info_rows[3].text != None:
                    rating = info_rows[3].text

                search_data = {
                    "name": name,
                    "stars": rating,
                    "url": link,
                    "publisher": publisher
                }
                print(search_data)

            logger.info(f"Successfully parsed data from: {url}")
            success = True

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")


def start_scrape(keywords, location, retries=3):
    for keyword in keywords:
        scrape_search_results(keyword, location, retries=retries)


if __name__ == "__main__":

    MAX_RETRIES = 3
    MAX_THREADS = 5
    LOCATION = "us"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["crypto wallet"]
    aggregate_files = []

    ## Job Processes
    filename = "report.csv"
    start_scrape(keyword_list, LOCATION, retries=MAX_RETRIES)
    logger.info(f"Crawl complete.")
Inside of scrape_search_results(), while the operation hasn't succeeded, we do the following:

div_cards = soup.select("div[role='listitem']") finds all div tags with the role, listitem.
We then iterate through the div cards.
info_rows = div_card.select("div div span") finds all of the rows inside each result card.
We pull the name, publisher and rating from our info_rows.
We find the href element with href = div_card.find("a").get("href") and use some basic string formatting to reconstruct the full link.
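If you want to experiment with these selectors outside the full script, here is a minimal sketch of the same extraction steps, assuming you've saved a Google Play search page locally as search.html (a hypothetical file name):

from bs4 import BeautifulSoup

# Parse a locally saved search results page and print each result card.
with open("search.html", "r", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

for div_card in soup.select("div[role='listitem']"):
    info_rows = div_card.select("div div span")
    a_tag = div_card.find("a")
    if len(info_rows) < 4 or a_tag is None:
        continue
    print({
        "name": info_rows[1].text,
        "publisher": info_rows[2].text,
        "stars": info_rows[3].text,
        "url": f"https://play.google.com{a_tag.get('href')}"
    })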
To store and process this data, we add two classes: SearchData and DataPipeline. Here is SearchData; we'll use it to hold data for individual search items.

@dataclass
class SearchData:
    name: str = ""
    stars: float = 0
    url: str = ""
    publisher: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())
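To see what __post_init__() does in practice, here is a tiny, hypothetical usage demo (the values are made up):

# Hypothetical values, just to demonstrate the clean-up behavior.
item = SearchData(name="  My Wallet App  ", stars=4.5, url="", publisher="Some Publisher")
print(item.name)  # "My Wallet App" -- surrounding whitespace is stripped
print(item.url)   # "No url" -- empty strings get default text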
Our other new class is DataPipeline. The DataPipeline will be used to open a pipeline to a CSV file. This pipeline takes in dataclass objects and pipes them to the CSV file while removing duplicate ones.

class DataPipeline:
    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)

            if not file_exists:
                writer.writeheader()

            for item in data_to_save:
                writer.writerow(asdict(item))

        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if self.is_duplicate(scraped_data) == False:
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            # requires "import time" at the top of the script
            time.sleep(3)
        if len(self.storage_queue) > 0:
            self.save_to_csv()
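As a rough usage sketch (not part of the tutorial code), the pipeline gets used like this: construct it with a filename, feed it dataclass objects, then close it so anything left in the queue gets flushed. The values below are made up:

# Rough usage sketch of the DataPipeline class above.
pipeline = DataPipeline(csv_filename="example.csv")
pipeline.add_data(SearchData(name="Some App", stars=4.2, url="https://play.google.com/store/apps/details?id=example", publisher="Some Dev"))
pipeline.add_data(SearchData(name="Some App", stars=4.2, url="https://play.google.com/store/apps/details?id=example", publisher="Some Dev"))  # duplicate name, gets dropped
pipeline.close_pipeline()  # flushes whatever is left in the queue to example.csv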
In the updated code below, we open a DataPipeline and pass it into start_scrape(), which in turn passes it into scrape_search_results(). From within scrape_search_results(), instead of printing our data to the terminal, we use it to create a SearchData object. That object then gets passed into our DataPipeline
.import osimport csvimport requestsimport jsonimport loggingfrom urllib.parse import urlencodefrom bs4 import BeautifulSoupimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" stars: float = 0 url: str = "" publisher: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://play.google.com/store/search?q={formatted_keyword}&c=apps" tries = 0 success = False while tries <= retries and not success: try: response = requests.get(url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code != 200: raise Exception(f"Failed request, Status Code {response.status_code}") ## Extract Data soup = BeautifulSoup(response.text, "html.parser") div_cards = soup.select("div[role='listitem']") Excluded_words = ["Apps & games", "Movies & TV", "Books"] for div_card in div_cards: if div_card.text in Excluded_words: continue info_rows = div_card.select("div div span") name = info_rows[1].text publisher = info_rows[2].text href = div_card.find("a").get("href") link = f"https://play.google.com{href}" rating = 0.0 if info_rows[3].text != None: rating = info_rows[3].text search_data = SearchData( name=name, stars=rating, url=link, publisher=publisher ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keywords, location, data_pipeline=None, retries=3): for keyword in keywords: scrape_search_results(keyword, location, data_pipeline=data_pipeline, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "us" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["crypto wallet"] aggregate_files = [] ## Job Processes filename = "report.csv" crawl_pipeline = DataPipeline(csv_filename=filename) start_scrape(keyword_list, LOCATION, data_pipeline=crawl_pipeline, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() logger.info(f"Crawl complete.")
In the code above, crawl_pipeline = DataPipeline(csv_filename=filename) creates a DataPipeline. Inside scrape_search_results(), we turn our parsed data into a SearchData object and pass it into the pipeline. Once the crawl has finished, we close the pipeline with crawl_pipeline.close_pipeline().
Up to this point, we've been using a for loop to iterate through our keyword_list. In this section, we're going to replace that for loop with ThreadPoolExecutor, which gives us the power of multithreading. Here is our refactored start_scrape() function.

def start_scrape(keywords, location, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_search_results,
            keywords,
            [location] * len(keywords),
            [data_pipeline] * len(keywords),
            [retries] * len(keywords)
        )
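If executor.map() is new to you, this tiny standalone sketch (a toy function and made-up values, unrelated to the scraper) shows how it pairs the lists up, taking one element from each list per call:

from concurrent.futures import ThreadPoolExecutor

def fake_job(keyword, location, retries):
    return f"{keyword} | {location} | {retries}"

keywords = ["crypto wallet", "web3 wallet"]
with ThreadPoolExecutor(max_workers=2) as executor:
    # Calls fake_job("crypto wallet", "us", 3) and fake_job("web3 wallet", "us", 3) on separate threads.
    for result in executor.map(fake_job, keywords, ["us"] * len(keywords), [3] * len(keywords)):
        print(result)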
We pass the following arguments into executor.map():

scrape_search_results is the function we'd like to call on our available threads.
keywords is an array of keywords we want to search.
All of our other arguments get passed in as arrays, which then get fed into scrape_search_results
on each call.import osimport csvimport requestsimport jsonimport loggingfrom urllib.parse import urlencodefrom bs4 import BeautifulSoupimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" stars: float = 0 url: str = "" publisher: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://play.google.com/store/search?q={formatted_keyword}&c=apps" tries = 0 success = False while tries <= retries and not success: try: response = requests.get(url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code != 200: raise Exception(f"Failed request, Status Code {response.status_code}") ## Extract Data soup = BeautifulSoup(response.text, "html.parser") div_cards = soup.select("div[role='listitem']") Excluded_words = ["Apps & games", "Movies & TV", "Books"] for div_card in div_cards: if div_card.text in Excluded_words: continue info_rows = div_card.select("div div span") name = info_rows[1].text publisher = info_rows[2].text href = div_card.find("a").get("href") link = f"https://play.google.com{href}" rating = 0.0 if info_rows[3].text != None: rating = info_rows[3].text search_data = SearchData( name=name, stars=rating, url=link, publisher=publisher ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keywords, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, keywords, [location] * len(keywords), [data_pipeline] * len(keywords), [retries] * len(keywords) ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "us" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["crypto wallet"] aggregate_files = [] ## Job Processes filename = "report.csv" crawl_pipeline = DataPipeline(csv_filename=filename) start_scrape(keyword_list, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() logger.info(f"Crawl complete.")
The big change here: we replaced the for loop inside start_scrape() with ThreadPoolExecutor, and executor.map() is the function we use to call scrape_search_results() across our available threads.

To keep Google from blocking us, we route our requests through the ScrapeOps Proxy API. get_scrapeops_url() takes in a regular URL and converts it into a proxied one.

def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location,
        "wait": 5000,
        "residential": True
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url
"api_key"
: holds our ScrapeOps API key."url"
: is the url that we'd like to scrape."country"
: is the country we'd like to appear in."wait"
: is how long we want the ScrapeOps server to wait before sending our response back."residential"
: is a boolean that lets ScrapeOps know if we want a residential IP. If we set it to True
, we get a residnetial IP instead of a datacenter IP address. This greatly decreases our likelihood of getting blocked.import osimport csvimport requestsimport jsonimport loggingfrom urllib.parse import urlencodefrom bs4 import BeautifulSoupimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, "wait": 5000, "residential": True } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" stars: float = 0 url: str = "" publisher: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://play.google.com/store/search?q={formatted_keyword}&c=apps" tries = 0 success = False while tries <= retries and not success: try: scrapeops_proxy_url = get_scrapeops_url(url, location=location) response = requests.get(scrapeops_proxy_url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code != 200: raise Exception(f"Failed request, Status Code {response.status_code}") ## Extract Data soup = BeautifulSoup(response.text, "html.parser") div_cards = soup.select("div[role='listitem']") Excluded_words = ["Apps & games", "Movies & TV", "Books"] for div_card in div_cards: if div_card.text in Excluded_words: continue info_rows = div_card.select("div div span") name = info_rows[1].text publisher = info_rows[2].text href = div_card.find("a").get("href") link = f"https://play.google.com{href}" rating = 0.0 if info_rows[3].text != None: rating = info_rows[3].text search_data = SearchData( name=name, stars=rating, url=link, publisher=publisher ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keywords, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, keywords, [location] * len(keywords), [data_pipeline] * len(keywords), [retries] * len(keywords) ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "us" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["crypto wallet"] aggregate_files = [] ## Job Processes filename = "report.csv" crawl_pipeline = DataPipeline(csv_filename=filename) start_scrape(keyword_list, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() logger.info(f"Crawl complete.")
If you'd like to tweak your results, you can change the constants inside of main. Aside from that, everything else will stay the same. Take a look below.

if __name__ == "__main__":

    MAX_RETRIES = 3
    MAX_THREADS = 5
    LOCATION = "us"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["crypto wallet", "bitcoin wallet"]
    aggregate_files = []

    ## Job Processes
    filename = "report.csv"
    crawl_pipeline = DataPipeline(csv_filename=filename)
    start_scrape(keyword_list, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
    crawl_pipeline.close_pipeline()
    logger.info(f"Crawl complete.")
Feel free to change any of the following:

MAX_THREADS
MAX_RETRIES
LOCATION
keyword_list
With the crawler producing a report of apps, we now need a scraper that reads that report and pulls the reviews for each app. Here is the first version of process_app().

def process_app(row, location, retries=3):
    url = row["url"]
    tries = 0
    success = False

    while tries <= retries and not success:
        response = requests.get(url)
        try:
            if response.status_code == 200:
                logger.info(f"Status: {response.status_code}")

                soup = BeautifulSoup(response.text, "html.parser")
                review_container = soup.select_one("div[data-g-id='reviews']")
                review_headers = review_container.find_all("header")

                for review in review_headers:
                    stars = len(review.find_all("svg"))
                    card = review.parent
                    divs = card.select("div div div div div")
                    name = divs[1].text
                    date = divs[10].text
                    description = divs[12].text

                    review_data = {
                        "name": name,
                        "date": date,
                        "stars": stars,
                        "description": description
                    }
                    print(review_data)

                success = True

            else:
                logger.warning(f"Failed Response: {response.status_code}")
                raise Exception(f"Failed Request, status code: {response.status_code}")

        except Exception as e:
            logger.error(f"Exception thrown: {e}")
            logger.warning(f"Failed to process page: {row['url']}")
            logger.warning(f"Retries left: {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")
    else:
        logger.info(f"Successfully parsed: {row['url']}")
Inside process_app(), we do the following:

We find our review_container with soup.select_one("div[data-g-id='reviews']").
We find the header elements for each review, review_container.find_all("header").
We then iterate through the review_headers.
From each header, we pull the following information:
The stars: stars = len(review.find_all("svg")).
The divs holding the review details, card.select("div div div div div").
The name, date and description from the list of divs.

We also need a function that reads the CSV report and runs process_app() on each row, similar to start_scrape(). We'll call this one process_results(). Here is process_results(). Later on, we'll replace the for loop with multithreading like we did before.

def process_results(csv_file, location, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        for row in reader:
            process_app(row, location, retries=retries)
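Since csv.DictReader drives this step, each row handed to process_app() is just a dict keyed by the columns our crawler wrote, roughly like this (the values are illustrative):

# Illustrative shape of a single row from report.csv.
row = {
    "name": "Some Crypto Wallet",
    "stars": "4.6",
    "url": "https://play.google.com/store/apps/details?id=example",
    "publisher": "Some Publisher"
}
print(row["url"])  # this is the page process_app() will request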
import osimport csvimport requestsimport jsonimport loggingfrom urllib.parse import urlencodefrom bs4 import BeautifulSoupimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, "wait": 5000, "residential": True } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" stars: float = 0 url: str = "" publisher: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://play.google.com/store/search?q={formatted_keyword}&c=apps" tries = 0 success = False while tries <= retries and not success: try: scrapeops_proxy_url = get_scrapeops_url(url, location=location) response = requests.get(scrapeops_proxy_url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code != 200: raise Exception(f"Failed request, Status Code {response.status_code}") ## Extract Data soup = BeautifulSoup(response.text, "html.parser") div_cards = soup.select("div[role='listitem']") Excluded_words = ["Apps & games", "Movies & TV", "Books"] for div_card in div_cards: if div_card.text in Excluded_words: continue info_rows = div_card.select("div div span") name = info_rows[1].text publisher = info_rows[2].text href = div_card.find("a").get("href") link = f"https://play.google.com{href}" rating = 0.0 if info_rows[3].text != None: rating = info_rows[3].text search_data = SearchData( name=name, stars=rating, url=link, publisher=publisher ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keywords, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, keywords, [location] * len(keywords), [data_pipeline] * len(keywords), [retries] * len(keywords) ) def process_app(row, location, retries=3): url = row["url"] tries = 0 success = False while tries <= retries and not success: response = requests.get(url) try: if response.status_code == 200: logger.info(f"Status: {response.status_code}") soup = BeautifulSoup(response.text, "html.parser") review_container = soup.select_one("div[data-g-id='reviews']") review_headers = review_container.find_all("header") for review in review_headers: stars = len(review.find_all("svg")) card = review.parent divs = card.select("div div div div div") name = divs[1].text date = divs[10].text description = divs[12].text review_data = { "name": name, "date": date, "stars": stars, "description": description } print(review_data) success = True else: logger.warning(f"Failed Response: {response.status_code}") raise Exception(f"Failed Request, status code: {response.status_code}") except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {row['url']}") logger.warning(f"Retries left: {retries-tries}") tries += 1 if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['url']}") def process_results(csv_file, location, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) for row in reader: process_app(row, 
location, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "us" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["crypto wallet"] aggregate_files = [] ## Job Processes filename = "report.csv" crawl_pipeline = DataPipeline(csv_filename=filename) start_scrape(keyword_list, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() logger.info(f"Crawl complete.") logger.info("Starting scrape...") process_results(filename, LOCATION, retries=MAX_RETRIES) logger.info("Scrape Complete")
To represent our reviews properly, we need one more dataclass. We'll call this one ReviewData. It will hold the following traits:

name
date
stars
description

@dataclass
class ReviewData:
    name: str = ""
    date: str = ""
    stars: int = 0
    description: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())
From within process_app(), we open a DataPipeline for each app and pass our ReviewData objects into it
. You can see this in our fully updated code below.import osimport csvimport requestsimport jsonimport loggingfrom urllib.parse import urlencodefrom bs4 import BeautifulSoupimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, "wait": 5000, "residential": True } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" stars: float = 0 url: str = "" publisher: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) @dataclassclass ReviewData: name: str = "" date: str = "" stars: int = 0 description: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://play.google.com/store/search?q={formatted_keyword}&c=apps" tries = 0 success = False while tries <= retries and not success: try: scrapeops_proxy_url = get_scrapeops_url(url, location=location) response = requests.get(scrapeops_proxy_url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code != 200: raise Exception(f"Failed request, Status Code {response.status_code}") ## Extract Data soup = BeautifulSoup(response.text, "html.parser") div_cards = soup.select("div[role='listitem']") Excluded_words = ["Apps & games", "Movies & TV", "Books"] for div_card in div_cards: if div_card.text in Excluded_words: continue info_rows = div_card.select("div div span") name = info_rows[1].text publisher = info_rows[2].text href = div_card.find("a").get("href") link = f"https://play.google.com{href}" rating = 0.0 if info_rows[3].text != None: rating = info_rows[3].text search_data = SearchData( name=name, stars=rating, url=link, publisher=publisher ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keywords, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, keywords, [location] * len(keywords), [data_pipeline] * len(keywords), [retries] * len(keywords) ) def process_app(row, location, retries=3): url = row["url"] tries = 0 success = False while tries <= retries and not success: response = requests.get(url) try: if response.status_code == 200: logger.info(f"Status: {response.status_code}") soup = BeautifulSoup(response.text, "html.parser") review_container = soup.select_one("div[data-g-id='reviews']") review_headers = review_container.find_all("header") review_pipeline = DataPipeline(csv_filename=f"{row['name'].replace(' ', '-')}.csv") for review in review_headers: stars = len(review.find_all("svg")) card = review.parent divs = card.select("div div div div div") name = divs[1].text date = divs[10].text description = divs[12].text review_data = ReviewData( name=name, date=date, stars=stars, description=description ) review_pipeline.add_data(review_data) review_pipeline.close_pipeline() success = True else: logger.warning(f"Failed Response: {response.status_code}") raise Exception(f"Failed Request, status code: {response.status_code}") except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {row['url']}") logger.warning(f"Retries left: {retries-tries}") tries += 1 if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['url']}") def process_results(csv_file, location, retries=3): 
logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) for row in reader: process_app(row, location, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "us" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["crypto wallet"] aggregate_files = [] ## Job Processes filename = "report.csv" crawl_pipeline = DataPipeline(csv_filename=filename) start_scrape(keyword_list, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() logger.info(f"Crawl complete.") logger.info("Starting scrape...") process_results(filename, LOCATION, retries=MAX_RETRIES) logger.info("Scrape Complete")
ReviewData represents an individual review in our software. DataPipeline saves our ReviewData to a CSV file.

All that's left is to replace the for loop with ThreadPoolExecutor. Here is our refactored process_results() function.

def process_results(csv_file, location, max_threads=5, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
            executor.map(
                process_app,
                reader,
                [location] * len(reader),
                [retries] * len(reader)
            )
Once again, we pass our arguments into executor.map():

process_app is the function we want to call on multiple threads this time.
reader is the array of rows from our CSV file.
location and retries also get passed in as arrays, just like before.

get_scrapeops_url() was already defined earlier. Now, we just need to use it again. We'll change one line of our parsing function to unleash the proxy.

response = requests.get(get_scrapeops_url(url, location=location))
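To make that one-liner concrete, here is roughly what get_scrapeops_url() hands to requests (the API key and app URL are placeholders, and the printed URL is only an approximation):

# Rough illustration; never log your real API key in production.
proxied_url = get_scrapeops_url("https://play.google.com/store/apps/details?id=example", location="us")
print(proxied_url)
# e.g. https://proxy.scrapeops.io/v1/?api_key=...&url=https%3A%2F%2Fplay.google.com%2F...&country=us&wait=5000&residential=True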
import osimport csvimport requestsimport jsonimport loggingfrom urllib.parse import urlencodefrom bs4 import BeautifulSoupimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, "wait": 5000, "residential": True } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" stars: float = 0 url: str = "" publisher: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) @dataclassclass ReviewData: name: str = "" date: str = "" stars: int = 0 description: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://play.google.com/store/search?q={formatted_keyword}&c=apps" tries = 0 success = False while tries <= retries and not success: try: scrapeops_proxy_url = get_scrapeops_url(url, location=location) response = requests.get(scrapeops_proxy_url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code != 200: raise Exception(f"Failed request, Status Code {response.status_code}") ## Extract Data soup = BeautifulSoup(response.text, "html.parser") div_cards = soup.select("div[role='listitem']") Excluded_words = ["Apps & games", "Movies & TV", "Books"] for div_card in div_cards: if div_card.text in Excluded_words: continue info_rows = div_card.select("div div span") name = info_rows[1].text publisher = info_rows[2].text href = div_card.find("a").get("href") link = f"https://play.google.com{href}" rating = 0.0 if info_rows[3].text != None: rating = info_rows[3].text search_data = SearchData( name=name, stars=rating, url=link, publisher=publisher ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keywords, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, keywords, [location] * len(keywords), [data_pipeline] * len(keywords), [retries] * len(keywords) ) def process_app(row, location, retries=3): url = row["url"] tries = 0 success = False while tries <= retries and not success: response = requests.get(get_scrapeops_url(url, location=location)) try: if response.status_code == 200: logger.info(f"Status: {response.status_code}") soup = BeautifulSoup(response.text, "html.parser") review_container = soup.select_one("div[data-g-id='reviews']") review_headers = review_container.find_all("header") review_pipeline = DataPipeline(csv_filename=f"{row['name'].replace(' ', '-')}.csv") for review in review_headers: stars = len(review.find_all("svg")) card = review.parent divs = card.select("div div div div div") name = divs[1].text date = divs[10].text description = divs[12].text review_data = ReviewData( name=name, date=date, stars=stars, description=description ) review_pipeline.add_data(review_data) review_pipeline.close_pipeline() success = True else: logger.warning(f"Failed Response: {response.status_code}") raise Exception(f"Failed Request, status code: {response.status_code}") except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {row['url']}") logger.warning(f"Retries left: {retries-tries}") tries += 1 if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['url']}") def process_results(csv_file, 
location, max_threads=5, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( process_app, reader, [location] * len(reader), [retries] * len(reader) ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "us" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["crypto wallet", "web3 wallet"] aggregate_files = [] ## Job Processes filename = "report.csv" crawl_pipeline = DataPipeline(csv_filename=filename) start_scrape(keyword_list, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() logger.info(f"Crawl complete.") logger.info("Starting scrape...") process_results(filename, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES) logger.info("Scrape Complete")
"bitcoin wallet"
to "web3 wallet"
. Otherwise, everything else is the same. Go ahead and take a look at our main
.if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "us" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["crypto wallet", "web3 wallet"] aggregate_files = [] ## Job Processes filename = "report.csv" crawl_pipeline = DataPipeline(csv_filename=filename) start_scrape(keyword_list, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() logger.info(f"Crawl complete.") logger.info("Starting scrape...") process_results(filename, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES) logger.info("Scrape Complete")
When you scrape Google Play, you are subject to both their terms of service and their robots.txt. Violating these terms can result in suspension or even permanent removal of your account. Their terms of service are available here. You can view their robots.txt here.

Public data is generally alright to scrape. When data is public (not gated behind a login), it is public knowledge and public property. When accessing data behind a login, you are accessing private data and therefore subject to their terms. If you're not sure whether your scraper is legal, consult an attorney.

The Selenium-based version of the scraper below still reads your API key from the same config.json
To follow along, you'll need a config.json file in your project folder containing your ScrapeOps API key. If you just want to run the finished Google Play scraper, here is the full code:

import os
import csv
import json
import time
import logging
from urllib.parse import urlencode
import concurrent.futures
from selenium import webdriver
from selenium.webdriver.common.by import By
from dataclasses import dataclass, field, fields, asdict

API_KEY = ""

OPTIONS = webdriver.ChromeOptions()
OPTIONS.add_argument("--headless")
OPTIONS.add_argument("--disable-javascript")

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]


def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location,
        "wait": 5000,
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url


## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@dataclass
class SearchData:
    name: str = ""
    stars: float = 0
    url: str = ""
    publisher: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())


@dataclass
class ReviewData:
    name: str = ""
    date: str = ""
    stars: int = 0
    description: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())


class DataPipeline:
    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)
            if not file_exists:
                writer.writeheader()
            for item in data_to_save:
                writer.writerow(asdict(item))

        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if self.is_duplicate(scraped_data) == False:
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)
        if len(self.storage_queue) > 0:
            self.save_to_csv()


def scrape_search_results(keyword, location, data_pipeline=None, retries=3):
    formatted_keyword = keyword.replace(" ", "+")
    url = f"https://play.google.com/store/search?q={formatted_keyword}&c=apps"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            driver = webdriver.Chrome(options=OPTIONS)
            scrapeops_proxy_url = get_scrapeops_url(url, location=location)
            driver.get(scrapeops_proxy_url)

            div_cards = driver.find_elements(By.CSS_SELECTOR, "div[role='listitem']")
            excluded_words = ["Apps & games", "Movies & TV", "Books"]
            for div_card in div_cards:
                if div_card.text in excluded_words:
                    continue
                info_rows = div_card.find_elements(By.CSS_SELECTOR, "div div span")
                name = info_rows[1].text
                publisher = info_rows[2].text
                href = div_card.find_element(By.CSS_SELECTOR, "a").get_attribute("href")
                # Default the rating to 0.0 when no rating text is present
                rating = 0.0
                if info_rows[3].text:
                    rating = info_rows[3].text

                search_data = SearchData(
                    name=name,
                    stars=rating,
                    url=href,
                    publisher=publisher
                )
                data_pipeline.add_data(search_data)

            logger.info(f"Successfully parsed data from: {url}")
            success = True

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")
            tries += 1
        finally:
            driver.quit()

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")


def start_scrape(keywords, location, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_search_results,
            keywords,
            [location] * len(keywords),
            [data_pipeline] * len(keywords),
            [retries] * len(keywords)
        )


def process_app(row, location, retries=3):
    url = row["url"]
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            driver = webdriver.Chrome(options=OPTIONS)
            driver.get(get_scrapeops_url(url, location=location))

            review_container = driver.find_element(By.CSS_SELECTOR, "div[data-g-id='reviews']")
            review_headers = review_container.find_elements(By.CSS_SELECTOR, "header[data-review-id]")
            review_pipeline = DataPipeline(csv_filename=f"{row['name'].replace(' ', '-')}.csv")

            for review in review_headers:
                header_text = review.text.split("\n")
                stars = review.find_element(By.CSS_SELECTOR, "div[role='img']").get_attribute("aria-label").split(" ")[1]
                name = header_text[0]
                date = header_text[2]
                description = review.find_element(By.XPATH, "..").text.split("\n")[3]

                review_data = ReviewData(
                    name=name,
                    date=date,
                    stars=stars,
                    description=description
                )
                review_pipeline.add_data(review_data)

            review_pipeline.close_pipeline()
            success = True

        except Exception as e:
            logger.error(f"Exception thrown: {e}")
            logger.warning(f"Failed to process page: {row['url']}")
            logger.warning(f"Retries left: {retries-tries}")
            tries += 1
        finally:
            driver.quit()

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")
    else:
        logger.info(f"Successfully parsed: {row['url']}")


def process_results(csv_file, location, max_threads=5, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))
        with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
            executor.map(
                process_app,
                reader,
                [location] * len(reader),
                [retries] * len(reader)
            )


if __name__ == "__main__":
    MAX_RETRIES = 3
    MAX_THREADS = 5
    LOCATION = "us"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["crypto wallet", "web3 wallet"]
    aggregate_files = []

    ## Job Processes
    filename = "report.csv"
    crawl_pipeline = DataPipeline(csv_filename=filename)
    start_scrape(keyword_list, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
    crawl_pipeline.close_pipeline()
    logger.info(f"Crawl complete.")

    logger.info("Starting scrape...")
    process_results(filename, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
    logger.info("Scrape Complete")
You can change any of the following constants inside main in order to change your scraping results:

- MAX_RETRIES: Specifies the maximum number of retry attempts for failed operations.
- MAX_THREADS: Determines the number of threads to use for concurrent execution.
- LOCATION: Specifies the geographic location for proxy requests.
- keyword_list: Contains a list of search terms to scrape from the Google Play Store.

Here is an example of a Google Play search URL:
https://play.google.com/store/search?q=crypto%20wallet&c=apps
It splits into a base URL, https://play.google.com/store/search, and a query string, q=crypto%20wallet&c=apps. q=crypto%20wallet tells Google Play that we want to search the term crypto wallet. c=apps is used to tell Google Play that we're looking at apps, not music, books, etc.
Each search result on the page is embedded in a div with the role, listitem. We can use the following CSS selector to find it: div[role='listitem']. The short sketch below puts the URL format and this selector together.
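To make the URL format and the result-card selector concrete, here is a minimal standalone sketch. It builds a search URL from a keyword and counts the matching div[role='listitem'] cards; the build_search_url helper name is our own and isn't part of the finished scraper.

from urllib.parse import quote_plus

from selenium import webdriver
from selenium.webdriver.common.by import By


def build_search_url(keyword):
    # q holds the URL-encoded search term; c=apps restricts results to apps
    return f"https://play.google.com/store/search?q={quote_plus(keyword)}&c=apps"


if __name__ == "__main__":
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(build_search_url("crypto wallet"))
        # Each search result is wrapped in a div with role='listitem'
        cards = driver.find_elements(By.CSS_SELECTOR, "div[role='listitem']")
        print(f"Found {len(cards)} result cards")
    finally:
        driver.quit()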
When we're actually crawling the page, we'll use Selenium to find all of the elements matching this selector.
Each review on an app's page is held in a header element with a data-review-id attribute. However, this data-review-id changes with each review on the page. To select these items, we'll use header[data-review-id]. This finds all header elements where a data-review-id is present.
You might have noticed that the review body isn't highlighted in the screenshot. This is because it's actually a separate element from the one we just mentioned. To find the body of the review, we'll need to find the parent element of the header[data-review-id] item.
To control our geolocation, we pass the country param to Proxy Aggregator. Countries are represented by two-letter country codes. You can view a list of our supported countries below.
Country | Country Code |
---|---|
Brazil | br |
Canada | ca |
China | cn |
India | in |
Italy | it |
Japan | jp |
France | fr |
Germany | de |
Russia | ru |
Spain | es |
United States | us |
United Kingdom | uk |
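If you want results routed through a different country, the only thing that changes is the country code in the proxy payload. Here's a small self-contained sketch that reuses the article's get_scrapeops_url() helper; the API key shown is just a placeholder.

from urllib.parse import urlencode

API_KEY = "your-scrapeops-api-key"  # placeholder value


def get_scrapeops_url(url, location="us"):
    # Same helper used throughout this article; "country" takes a two-letter code from the table above
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location,
        "wait": 5000,
    }
    return "https://proxy.scrapeops.io/v1/?" + urlencode(payload)


# Route a request through the UK instead of the default "us"
print(get_scrapeops_url("https://play.google.com/store/search?q=crypto+wallet&c=apps", location="uk"))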
Create a new project folder, then cd into the folder:

mkdir google-play-selenium
cd google-play-selenium

Next, create a virtual environment, activate it, and install Selenium:

python -m venv venv
source venv/bin/activate
pip install selenium
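One note on drivers: recent versions of Selenium (4.6+) ship with Selenium Manager, which usually downloads a matching chromedriver for your installed Chrome automatically; on older setups you may need to install chromedriver yourself. A quick smoke test like the sketch below confirms your environment can launch headless Chrome before you start scraping.

from selenium import webdriver

# Launch a headless Chrome session and fetch a page to verify the setup
options = webdriver.ChromeOptions()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://play.google.com")
    print("Loaded:", driver.title)
finally:
    driver.quit()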
Our first script contains two functions: our parsing function, scrape_search_results(), and the function that triggers the actual scrape, start_scrape(). Everything is important, but you should pay special attention to the parsing function. This is where our actual extraction takes place.

import os
import csv
import json
import logging
from urllib.parse import urlencode
import concurrent.futures
from selenium import webdriver
from selenium.webdriver.common.by import By
from dataclasses import dataclass, field, fields, asdict

API_KEY = ""

OPTIONS = webdriver.ChromeOptions()
OPTIONS.add_argument("--headless")
OPTIONS.add_argument("--disable-javascript")

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]

## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def scrape_search_results(keyword, location, retries=3):
    formatted_keyword = keyword.replace(" ", "+")
    url = f"https://play.google.com/store/search?q={formatted_keyword}&c=apps"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            driver = webdriver.Chrome(options=OPTIONS)
            driver.get(url)

            div_cards = driver.find_elements(By.CSS_SELECTOR, "div[role='listitem']")
            excluded_words = ["Apps & games", "Movies & TV", "Books"]
            for div_card in div_cards:
                if div_card.text in excluded_words:
                    continue
                info_rows = div_card.find_elements(By.CSS_SELECTOR, "div div span")
                name = info_rows[1].text
                publisher = info_rows[2].text
                href = div_card.find_element(By.CSS_SELECTOR, "a").get_attribute("href")
                # Default the rating to 0.0 when no rating text is present
                rating = 0.0
                if info_rows[3].text:
                    rating = info_rows[3].text

                search_data = {
                    "name": name,
                    "stars": rating,
                    "url": href,
                    "publisher": publisher
                }
                print(search_data)

            logger.info(f"Successfully parsed data from: {url}")
            success = True

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")
            tries += 1
        finally:
            driver.quit()

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")


def start_scrape(keywords, location, retries=3):
    for keyword in keywords:
        scrape_search_results(keyword, location, retries=retries)


if __name__ == "__main__":
    MAX_RETRIES = 3
    MAX_THREADS = 5
    LOCATION = "us"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["crypto wallet"]
    aggregate_files = []

    ## Job Processes
    filename = "report.csv"
    start_scrape(keyword_list, LOCATION, retries=MAX_RETRIES)
    logger.info(f"Crawl complete.")
- driver.find_elements(By.CSS_SELECTOR, "div[role='listitem']") finds all of our search results.
- div_card.find_elements(By.CSS_SELECTOR, "div div span") gets all of our info_rows. Each of these rows holds an individual chunk of relevant data.
- We pull the name and publisher from the info_rows.
- div_card.find_element(By.CSS_SELECTOR, "a").get_attribute("href") gives us the link to the app's page on Google Play.
- Our rating defaults to 0.0. If there is a rating present, we reassign it to the rating variable.
- Each result gets stored in a dict, and we print that dict to the console.

This is acceptable for prototyping, but we need stronger datatypes for production and we need to pipe our data to a CSV file. In this section, we're going to replace our dict
with a more strongly typed SearchData
object. We'll also create a DataPipeline
class for the sole purpose of storing these objects.Take a look at SearchData
. It has a __post_init__()
method for instantiation and a check_string_fields()
method to ensure that no field is left empty. Its actual fields are pretty simple: name
, stars, url, and publisher.

@dataclass
class SearchData:
    name: str = ""
    stars: float = 0
    url: str = ""
    publisher: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())
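To see what __post_init__() and check_string_fields() actually do, here's a tiny usage sketch. It assumes the SearchData dataclass above is in scope, and the values are made up.

# Construct a SearchData object by hand to observe the field cleanup
item = SearchData(name="  Some Wallet App  ", stars=4.6, url="", publisher="Some Publisher")

print(item.name)   # "Some Wallet App"  -- leading/trailing whitespace stripped
print(item.url)    # "No url"           -- empty strings are replaced with a default
print(item.stars)  # 4.6                -- non-string fields are skipped by the check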
Next, we need a place to store our SearchData. Our DataPipeline has a constructor, __init__(), along with several other methods for handling the storage_queue and actually saving the data.

class DataPipeline:
    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)
            if not file_exists:
                writer.writeheader()
            for item in data_to_save:
                writer.writerow(asdict(item))

        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if self.is_duplicate(scraped_data) == False:
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)
        if len(self.storage_queue) > 0:
            self.save_to_csv()
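Here's a quick sketch of how the pipeline gets used on its own. It assumes the SearchData and DataPipeline classes above are in scope; the filename, app names, and URLs are made up for illustration.

# Open a pipeline that appends rows to example.csv
pipeline = DataPipeline(csv_filename="example.csv")

# Duplicate names get filtered out, so only two rows reach the CSV
pipeline.add_data(SearchData(name="Wallet One", stars="4.5", url="https://play.google.com/store/apps/details?id=example.one", publisher="Example Dev"))
pipeline.add_data(SearchData(name="Wallet Two", stars="4.1", url="https://play.google.com/store/apps/details?id=example.two", publisher="Example Dev"))
pipeline.add_data(SearchData(name="Wallet One", stars="4.5", url="https://play.google.com/store/apps/details?id=example.one", publisher="Example Dev"))

# Flush anything still sitting in the storage queue
pipeline.close_pipeline()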
Here is a quick breakdown of each method:

- save_to_csv(): Saves our queue to a CSV file.
- is_duplicate(): Tells us whether an item is a duplicate. We use this to filter duplicates out of the pipeline.
- add_data(): Adds data to our storage_queue.
- close_pipeline()
: Sleep for 3 seconds to allow any file operations to complete. Then, we save the queue to our output file.import osimport csvimport jsonimport loggingfrom urllib.parse import urlencodeimport concurrent.futuresfrom selenium import webdriverfrom selenium.webdriver.common.by import Byfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" OPTIONS = webdriver.ChromeOptions()OPTIONS.add_argument("--headless")OPTIONS.add_argument("--disable-javascript") with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" stars: float = 0 url: str = "" publisher: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://play.google.com/store/search?q={formatted_keyword}&c=apps" tries = 0 success = False while tries <= retries and not success: try: driver = webdriver.Chrome(options=OPTIONS) response = driver.get(url) div_cards = driver.find_elements(By.CSS_SELECTOR, "div[role='listitem']") Excluded_words = ["Apps & games", "Movies & TV", "Books"] for div_card in div_cards: if div_card.text in Excluded_words: continue info_rows = div_card.find_elements(By.CSS_SELECTOR, "div div span") name = info_rows[1].text publisher = info_rows[2].text href = div_card.find_element(By.CSS_SELECTOR, "a").get_attribute("href") rating = 0.0 if info_rows[3].text != None: rating = info_rows[3].text search_data = SearchData( name=name, stars=rating, url=href, publisher=publisher ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keywords, location, data_pipeline=None, retries=3): for keyword in keywords: scrape_search_results(keyword, location, data_pipeline=data_pipeline, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "us" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["crypto wallet"] aggregate_files = [] ## Job Processes filename = "report.csv" crawl_pipeline = DataPipeline(csv_filename=filename) start_scrape(keyword_list, LOCATION, data_pipeline=crawl_pipeline, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() logger.info(f"Crawl complete.")
Our main reflects these new design changes. We open a DataPipeline and pass it into start_scrape(), which passes it into scrape_search_results(), where we add all of our SearchData to the pipeline.

Next, we rewrite start_scrape() to take advantage of ThreadPoolExecutor. It now takes a max_threads argument. This allows us to choose how many threads we'd like to use. Instead of iterating through our keywords, we now pass them into ThreadPoolExecutor.

def start_scrape(keywords, location, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_search_results,
            keywords,
            [location] * len(keywords),
            [data_pipeline] * len(keywords),
            [retries] * len(keywords)
        )
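If the repeated argument lists look odd, here's a tiny standalone sketch, with a made-up greet function that has nothing to do with the scraper, showing how executor.map() pairs each element of the first list with the matching elements of the other lists:

import concurrent.futures


def greet(name, greeting, punctuation):
    print(f"{greeting}, {name}{punctuation}")


names = ["alice", "bob", "carol"]

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    # Each call receives one name plus the matching elements of the other lists.
    # Broadcasting a single value is done by repeating it len(names) times.
    executor.map(
        greet,
        names,
        ["Hello"] * len(names),
        ["!"] * len(names)
    )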
executor.map() holds the key to our concurrency. It takes the following arguments:

- scrape_search_results: The function we want to call on each thread.
- keywords: The array of keywords we want to search.
- location, data_pipeline, and retries all get passed in as arrays the length of our keywords list. Our executor then passes them into each separate thread that scrape_search_results
is running on.import osimport csvimport jsonimport loggingfrom urllib.parse import urlencodeimport concurrent.futuresfrom selenium import webdriverfrom selenium.webdriver.common.by import Byfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" OPTIONS = webdriver.ChromeOptions()OPTIONS.add_argument("--headless")OPTIONS.add_argument("--disable-javascript") with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" stars: float = 0 url: str = "" publisher: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://play.google.com/store/search?q={formatted_keyword}&c=apps" tries = 0 success = False while tries <= retries and not success: try: driver = webdriver.Chrome(options=OPTIONS) response = driver.get(url) div_cards = driver.find_elements(By.CSS_SELECTOR, "div[role='listitem']") Excluded_words = ["Apps & games", "Movies & TV", "Books"] for div_card in div_cards: if div_card.text in Excluded_words: continue info_rows = div_card.find_elements(By.CSS_SELECTOR, "div div span") name = info_rows[1].text publisher = info_rows[2].text href = div_card.find_element(By.CSS_SELECTOR, "a").get_attribute("href") rating = 0.0 if info_rows[3].text != None: rating = info_rows[3].text search_data = SearchData( name=name, stars=rating, url=href, publisher=publisher ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keywords, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, keywords, [location] * len(keywords), [data_pipeline] * len(keywords), [retries] * len(keywords) ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "us" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["crypto wallet"] aggregate_files = [] ## Job Processes filename = "report.csv" crawl_pipeline = DataPipeline(csv_filename=filename) start_scrape(keyword_list, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() logger.info(f"Crawl complete.")
MAX_THREADS now gets passed into start_scrape(), and start_scrape() now runs multiple instances of our parsing function concurrently.

To use Proxy Aggregator, we take our api_key, url, and location and use URL encoding to wrap them all into a ScrapeOps proxied URL. We also add a wait parameter so our dynamic content can be rendered on the page before Proxy Aggregator sends it back to us.

def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location,
        "wait": 5000,
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url
Inside our parsing function, instead of driver.get(url), we now use driver.get(scrapeops_proxy_url)
. Our finalized crawler is available below.import osimport csvimport jsonimport loggingfrom urllib.parse import urlencodeimport concurrent.futuresfrom selenium import webdriverfrom selenium.webdriver.common.by import Byfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" OPTIONS = webdriver.ChromeOptions()OPTIONS.add_argument("--headless")OPTIONS.add_argument("--disable-javascript") with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, "wait": 5000, } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" stars: float = 0 url: str = "" publisher: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://play.google.com/store/search?q={formatted_keyword}&c=apps" tries = 0 success = False while tries <= retries and not success: try: driver = webdriver.Chrome(options=OPTIONS) scrapeops_proxy_url = get_scrapeops_url(url, location=location) response = driver.get(scrapeops_proxy_url) div_cards = driver.find_elements(By.CSS_SELECTOR, "div[role='listitem']") Excluded_words = ["Apps & games", "Movies & TV", "Books"] for div_card in div_cards: if div_card.text in Excluded_words: continue info_rows = div_card.find_elements(By.CSS_SELECTOR, "div div span") name = info_rows[1].text publisher = info_rows[2].text href = div_card.find_element(By.CSS_SELECTOR, "a").get_attribute("href") rating = 0.0 if info_rows[3].text != None: rating = info_rows[3].text search_data = SearchData( name=name, stars=rating, url=href, publisher=publisher ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keywords, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, keywords, [location] * len(keywords), [data_pipeline] * len(keywords), [retries] * len(keywords) ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "us" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["crypto wallet"] aggregate_files = [] ## Job Processes filename = "report.csv" crawl_pipeline = DataPipeline(csv_filename=filename) start_scrape(keyword_list, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() logger.info(f"Crawl complete.")
In our main below, we update our keyword list to hold two searches now. This allows us to see how the crawler performs when handling multiple searches concurrently. Everything else in our main stays the same.

if __name__ == "__main__":
    MAX_RETRIES = 3
    MAX_THREADS = 5
    LOCATION = "us"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["crypto wallet", "web3 wallet"]
    aggregate_files = []

    ## Job Processes
    filename = "report.csv"
    crawl_pipeline = DataPipeline(csv_filename=filename)
    start_scrape(keyword_list, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
    crawl_pipeline.close_pipeline()
    logger.info(f"Crawl complete.")
Next, we'll start parsing reviews from each app page. Just like before, we'll begin by printing a dict to the console like we did earlier. Our new parsing function is called process_app().

def process_app(row, location, retries=3):
    url = row["url"]
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            driver = webdriver.Chrome(options=OPTIONS)
            driver.get(url)

            review_container = driver.find_element(By.CSS_SELECTOR, "div[data-g-id='reviews']")
            review_headers = review_container.find_elements(By.CSS_SELECTOR, "header[data-review-id]")

            for review in review_headers:
                header_text = review.text.split("\n")
                stars = review.find_element(By.CSS_SELECTOR, "div[role='img']").get_attribute("aria-label").split(" ")[1]
                name = header_text[0]
                date = header_text[2]
                description = review.find_element(By.XPATH, "..").text.split("\n")[3]

                review_data = {
                    "name": name,
                    "date": date,
                    "stars": stars,
                    "description": description
                }
                print(review_data)

            success = True

        except Exception as e:
            logger.error(f"Exception thrown: {e}")
            logger.warning(f"Failed to process page: {row['url']}")
            logger.warning(f"Retries left: {retries-tries}")
            tries += 1
        finally:
            driver.quit()

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")
    else:
        logger.info(f"Successfully parsed: {row['url']}")
- driver.find_element(By.CSS_SELECTOR, "div[data-g-id='reviews']") finds the container holding all of the reviews.
- review_container.find_elements(By.CSS_SELECTOR, "header[data-review-id]") is used to find the header elements we found earlier when inspecting the page.
- review.find_element(By.CSS_SELECTOR, "div[role='img']").get_attribute("aria-label").split(" ")[1] finds our stars element and extracts the value from its aria-label. We then split this value and grab element 1, which contains the actual rating.
- We pull the name and date from the header_text.
- We use XPATH to find the parent element of the header: review.find_element(By.XPATH, "..").text.split("\n")[3]. The fourth element in the array (index number 3) is the actual description given in the review.

Next, we need a function (similar to start_scrape()) called process_results(). This function will read all the rows of our CSV report and then pass each row into the parsing function we just wrote.

def process_results(csv_file, location, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))
        for row in reader:
            process_app(row, location, retries=retries)
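One detail from the bullets above worth isolating is the review-field extraction. Here's a hedged sketch of that logic as a standalone helper; parse_review_header is our own name, it takes a Selenium header[data-review-id] element, and it assumes the aria-label reads something like "Rated 4 stars out of five stars" and that the review body sits at index 3 of the parent element's text, as observed when inspecting the page.

from selenium.webdriver.common.by import By
from selenium.webdriver.remote.webelement import WebElement


def parse_review_header(review: WebElement) -> dict:
    """Pull one review's fields out of a header[data-review-id] element (sketch only)."""
    header_text = review.text.split("\n")
    # Splitting the aria-label on spaces and taking index 1 grabs the numeric rating
    stars = review.find_element(By.CSS_SELECTOR, "div[role='img']").get_attribute("aria-label").split(" ")[1]
    # ".." walks up to the header's parent, whose text also contains the review body
    body_lines = review.find_element(By.XPATH, "..").text.split("\n")
    return {
        "name": header_text[0],
        "date": header_text[2],
        "stars": stars,
        "description": body_lines[3],
    }

With that in place, the full code for this iteration follows below.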
import osimport csvimport jsonimport loggingfrom urllib.parse import urlencodeimport concurrent.futuresfrom selenium import webdriverfrom selenium.webdriver.common.by import Byfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" OPTIONS = webdriver.ChromeOptions()OPTIONS.add_argument("--headless")OPTIONS.add_argument("--disable-javascript") with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, "wait": 5000, } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" stars: float = 0 url: str = "" publisher: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://play.google.com/store/search?q={formatted_keyword}&c=apps" tries = 0 success = False while tries <= retries and not success: try: driver = webdriver.Chrome(options=OPTIONS) scrapeops_proxy_url = get_scrapeops_url(url, location=location) response = driver.get(scrapeops_proxy_url) div_cards = driver.find_elements(By.CSS_SELECTOR, "div[role='listitem']") Excluded_words = ["Apps & games", "Movies & TV", "Books"] for div_card in div_cards: if div_card.text in Excluded_words: continue info_rows = div_card.find_elements(By.CSS_SELECTOR, "div div span") name = info_rows[1].text publisher = info_rows[2].text href = div_card.find_element(By.CSS_SELECTOR, "a").get_attribute("href") rating = 0.0 if info_rows[3].text != None: rating = info_rows[3].text search_data = SearchData( name=name, stars=rating, url=href, publisher=publisher ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keywords, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, keywords, [location] * len(keywords), [data_pipeline] * len(keywords), [retries] * len(keywords) ) def process_app(row, location, retries=3): url = row["url"] tries = 0 success = False while tries <= retries and not success: try: driver = webdriver.Chrome(options=OPTIONS) driver.get(url, location=location) review_container = driver.find_element(By.CSS_SELECTOR, "div[data-g-id='reviews']") review_headers = review_container.find_elements(By.CSS_SELECTOR, "header[data-review-id]") for review in review_headers: header_text = review.text.split("\n") stars = review.find_element(By.CSS_SELECTOR, "div[role='img']").get_attribute("aria-label").split(" ")[1] name = header_text[0] date = header_text[2] description = review.find_element(By.XPATH, "..").text.split("\n")[3] review_data = { "name": name, "date": date, "stars": stars, "description": description } print(review_data) success = True except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {row['url']}") logger.warning(f"Retries left: {retries-tries}") tries += 1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['url']}") def process_results(csv_file, location, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) for row in reader: process_app(row, location, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "us" logger.info(f"Crawl starting...") ## INPUT ---> List of 
keywords to scrape keyword_list = ["crypto wallet"] aggregate_files = [] ## Job Processes filename = "report.csv" crawl_pipeline = DataPipeline(csv_filename=filename) start_scrape(keyword_list, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() logger.info(f"Crawl complete.") logger.info("Starting scrape...") process_results(filename, LOCATION, retries=MAX_RETRIES) logger.info("Scrape Complete")
Our search results already get saved through a DataPipeline. However, we're once again extracting our review data into a dict. To fix this, we need to create another strongly typed dataclass to ensure that our data is formatted properly when we store it. This next one is called ReviewData. It contains the same methods as SearchData, but the fields are a bit different. Our new fields are name, date, stars, and description.

@dataclass
class ReviewData:
    name: str = ""
    date: str = ""
    stars: int = 0
    description: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())
Now, we open a new DataPipeline from process_app(). We then pass all of these ReviewData
objects into the pipeline as we find them.import osimport csvimport jsonimport loggingfrom urllib.parse import urlencodeimport concurrent.futuresfrom selenium import webdriverfrom selenium.webdriver.common.by import Byfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" OPTIONS = webdriver.ChromeOptions()OPTIONS.add_argument("--headless")OPTIONS.add_argument("--disable-javascript") with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, "wait": 5000, } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" stars: float = 0 url: str = "" publisher: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) @dataclassclass ReviewData: name: str = "" date: str = "" stars: int = 0 description: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://play.google.com/store/search?q={formatted_keyword}&c=apps" tries = 0 success = False while tries <= retries and not success: try: driver = webdriver.Chrome(options=OPTIONS) scrapeops_proxy_url = get_scrapeops_url(url, location=location) response = driver.get(scrapeops_proxy_url) div_cards = driver.find_elements(By.CSS_SELECTOR, "div[role='listitem']") Excluded_words = ["Apps & games", "Movies & TV", "Books"] for div_card in div_cards: if div_card.text in Excluded_words: continue info_rows = div_card.find_elements(By.CSS_SELECTOR, "div div span") name = info_rows[1].text publisher = info_rows[2].text href = div_card.find_element(By.CSS_SELECTOR, "a").get_attribute("href") rating = 0.0 if info_rows[3].text != None: rating = info_rows[3].text search_data = SearchData( name=name, stars=rating, url=href, publisher=publisher ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keywords, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, keywords, [location] * len(keywords), [data_pipeline] * len(keywords), [retries] * len(keywords) ) def process_app(row, location, retries=3): url = row["url"] tries = 0 success = False while tries <= retries and not success: try: driver = webdriver.Chrome(options=OPTIONS) driver.get(url, location=location) review_container = driver.find_element(By.CSS_SELECTOR, "div[data-g-id='reviews']") review_headers = review_container.find_elements(By.CSS_SELECTOR, "header[data-review-id]") review_pipeline = DataPipeline(csv_filename=f"{row['name'].replace(' ', '-')}.csv") for review in review_headers: header_text = review.text.split("\n") stars = review.find_element(By.CSS_SELECTOR, "div[role='img']").get_attribute("aria-label").split(" ")[1] name = header_text[0] date = header_text[2] description = review.find_element(By.XPATH, "..").text.split("\n")[3] review_data = ReviewData( name=name, date=date, stars=stars, description=description ) review_pipeline.add_data(review_data) review_pipeline.close_pipeline() success = True except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {row['url']}") logger.warning(f"Retries left: {retries-tries}") tries += 1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['url']}") def process_results(csv_file, location, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) for row in reader: process_app(row, location, retries=retries) if 
__name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "us" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["crypto wallet"] aggregate_files = [] ## Job Processes filename = "report.csv" crawl_pipeline = DataPipeline(csv_filename=filename) start_scrape(keyword_list, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() logger.info(f"Crawl complete.") logger.info("Starting scrape...") process_results(filename, LOCATION, retries=MAX_RETRIES) logger.info("Scrape Complete")
To add concurrency to our review scraper, we rewrite process_results(). Like before, we utilize ThreadPoolExecutor and pass our parser into it. Instead of our keywords list, it takes our reader object, which contains all the rows from our CSV file.

def process_results(csv_file, location, max_threads=5, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))
        with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
            executor.map(
                process_app,
                reader,
                [location] * len(reader),
                [retries] * len(reader)
            )
The only other change comes inside process_app(). When we fetch the page, we swap our regular url for a proxied one:

driver.get(get_scrapeops_url(url, location=location))

Our finalized code is available below.
import osimport csvimport jsonimport loggingfrom urllib.parse import urlencodeimport concurrent.futuresfrom selenium import webdriverfrom selenium.webdriver.common.by import Byfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" OPTIONS = webdriver.ChromeOptions()OPTIONS.add_argument("--headless")OPTIONS.add_argument("--disable-javascript") with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, "wait": 5000, } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" stars: float = 0 url: str = "" publisher: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) @dataclassclass ReviewData: name: str = "" date: str = "" stars: int = 0 description: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://play.google.com/store/search?q={formatted_keyword}&c=apps" tries = 0 success = False while tries <= retries and not success: try: driver = webdriver.Chrome(options=OPTIONS) scrapeops_proxy_url = get_scrapeops_url(url, location=location) response = driver.get(scrapeops_proxy_url) div_cards = driver.find_elements(By.CSS_SELECTOR, "div[role='listitem']") Excluded_words = ["Apps & games", "Movies & TV", "Books"] for div_card in div_cards: if div_card.text in Excluded_words: continue info_rows = div_card.find_elements(By.CSS_SELECTOR, "div div span") name = info_rows[1].text publisher = info_rows[2].text href = div_card.find_element(By.CSS_SELECTOR, "a").get_attribute("href") rating = 0.0 if info_rows[3].text != None: rating = info_rows[3].text search_data = SearchData( name=name, stars=rating, url=href, publisher=publisher ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keywords, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, keywords, [location] * len(keywords), [data_pipeline] * len(keywords), [retries] * len(keywords) ) def process_app(row, location, retries=3): url = row["url"] tries = 0 success = False while tries <= retries and not success: try: driver = webdriver.Chrome(options=OPTIONS) driver.get(get_scrapeops_url(url, location=location)) review_container = driver.find_element(By.CSS_SELECTOR, "div[data-g-id='reviews']") review_headers = review_container.find_elements(By.CSS_SELECTOR, "header[data-review-id]") review_pipeline = DataPipeline(csv_filename=f"{row['name'].replace(' ', '-')}.csv") for review in review_headers: header_text = review.text.split("\n") stars = review.find_element(By.CSS_SELECTOR, "div[role='img']").get_attribute("aria-label").split(" ")[1] name = header_text[0] date = header_text[2] description = review.find_element(By.XPATH, "..").text.split("\n")[3] review_data = ReviewData( name=name, date=date, stars=stars, description=description ) review_pipeline.add_data(review_data) review_pipeline.close_pipeline() success = True except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {row['url']}") logger.warning(f"Retries left: {retries-tries}") tries += 1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['url']}") def process_results(csv_file, location, max_threads=5, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) with 
concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( process_app, reader, [location] * len(reader), [retries] * len(reader) ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "us" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["crypto wallet", "web3 wallet"] aggregate_files = [] ## Job Processes filename = "report.csv" crawl_pipeline = DataPipeline(csv_filename=filename) start_scrape(keyword_list, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() logger.info(f"Crawl complete.") logger.info("Starting scrape...") process_results(filename, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES) logger.info("Scrape Complete")
To test everything out, we'll use the same main that we used earlier. Feel free to change MAX_RETRIES, MAX_THREADS, LOCATION, or keyword_list.

if __name__ == "__main__":
    MAX_RETRIES = 3
    MAX_THREADS = 5
    LOCATION = "us"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["crypto wallet", "web3 wallet"]
    aggregate_files = []

    ## Job Processes
    filename = "report.csv"
    crawl_pipeline = DataPipeline(csv_filename=filename)
    start_scrape(keyword_list, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
    crawl_pipeline.close_pipeline()
    logger.info(f"Crawl complete.")

    logger.info("Starting scrape...")
    process_results(filename, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
    logger.info("Scrape Complete")
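As one example of tuning the run, you might lower the thread count and switch the proxy location. The values below are just one possible configuration; the keyword is made up, and "uk" is simply one of the codes from the country table earlier.

if __name__ == "__main__":
    MAX_RETRIES = 2
    MAX_THREADS = 3
    LOCATION = "uk"

    ## A different search, routed through the UK with fewer threads
    keyword_list = ["password manager"]

    filename = "report.csv"
    crawl_pipeline = DataPipeline(csv_filename=filename)
    start_scrape(keyword_list, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
    crawl_pipeline.close_pipeline()

    process_results(filename, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)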