Then check out ScrapeOps, the complete toolkit for web scraping.
import os
import csv
import requests
import json
import logging
import time
from urllib.parse import urlencode
from bs4 import BeautifulSoup
import concurrent.futures
from dataclasses import dataclass, field, fields, asdict

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]

def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url

## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@dataclass
class SearchData:
    name: str = ""
    stars: float = 0
    rating: float = 0
    num_reviews: int = 0
    website: str = ""
    trustpilot_url: str = ""
    location: str = ""
    category: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())

@dataclass
class ReviewData:
    name: str = ""
    rating: float = 0
    text: str = ""
    title: str = ""
    date: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())

class DataPipeline:

    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)
            if not file_exists:
                writer.writeheader()
            for item in data_to_save:
                writer.writerow(asdict(item))

        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if self.is_duplicate(scraped_data) == False:
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)
        if len(self.storage_queue) > 0:
            self.save_to_csv()

def scrape_search_results(keyword, location, page_number, data_pipeline=None, retries=3):
    formatted_keyword = keyword.replace(" ", "+")
    url = f"https://www.trustpilot.com/search?query={formatted_keyword}&page={page_number+1}"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            scrapeops_proxy_url = get_scrapeops_url(url, location=location)
            response = requests.get(scrapeops_proxy_url)
            logger.info(f"Received [{response.status_code}] from: {url}")
            if response.status_code == 200:
                success = True
            else:
                raise Exception(f"Failed request, Status Code {response.status_code}")

            ## Extract Data
            soup = BeautifulSoup(response.text, "html.parser")
            script_tag = soup.find("script", id="__NEXT_DATA__")
            if script_tag:
                json_data = json.loads(script_tag.contents[0])
                business_units = json_data["props"]["pageProps"]["businessUnits"]

                for business in business_units:
                    name = business.get("displayName").lower().replace(" ", "").replace("'", "")
                    trustpilot_formatted = business.get("contact")["website"].split("://")[1]
                    location = business.get("location")
                    category_list = business.get("categories")
                    category = category_list[0]["categoryId"] if len(category_list) > 0 else "n/a"

                    ## Extract Data
                    search_data = SearchData(
                        name=business.get("displayName", ""),
                        stars=business.get("stars", 0),
                        rating=business.get("trustScore", 0),
                        num_reviews=business.get("numberOfReviews", 0),
                        website=business.get("contact")["website"],
                        trustpilot_url=f"https://www.trustpilot.com/review/{trustpilot_formatted}",
                        location=location.get("country", "n/a"),
                        category=category
                    )
                    data_pipeline.add_data(search_data)

            logger.info(f"Successfully parsed data from: {url}")
            success = True

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")

def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_search_results,
            [keyword] * pages,
            [location] * pages,
            range(pages),
            [data_pipeline] * pages,
            [retries] * pages
        )

def process_business(row, location, retries=3):
    url = row["trustpilot_url"]
    tries = 0
    success = False

    while tries <= retries and not success:
        response = requests.get(get_scrapeops_url(url, location=location))
        try:
            if response.status_code == 200:
                logger.info(f"Status: {response.status_code}")
                soup = BeautifulSoup(response.text, "html.parser")
                script = soup.find("script", id="__NEXT_DATA__")
                json_data = json.loads(script.contents[0])
                business_info = json_data["props"]["pageProps"]
                reviews = business_info["reviews"]

                review_pipeline = DataPipeline(csv_filename=f"{row['name'].replace(' ', '-')}.csv")
                for review in reviews:
                    review_data = ReviewData(
                        name=review["consumer"]["displayName"],
                        rating=review["rating"],
                        text=review["text"],
                        title=review["title"],
                        date=review["dates"]["publishedDate"]
                    )
                    review_pipeline.add_data(review_data)
                review_pipeline.close_pipeline()
                success = True

            else:
                logger.warning(f"Failed Response: {response.status_code}")
                raise Exception(f"Failed Request, status code: {response.status_code}")

        except Exception as e:
            logger.error(f"Exception thrown: {e}")
            logger.warning(f"Failed to process page: {row['trustpilot_url']}")
            logger.warning(f"Retries left: {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")
    else:
        logger.info(f"Successfully parsed: {row['trustpilot_url']}")

def process_results(csv_file, location, max_threads=5, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
            executor.map(
                process_business,
                reader,
                [location] * len(reader),
                [retries] * len(reader)
            )

if __name__ == "__main__":

    MAX_RETRIES = 3
    MAX_THREADS = 5
    PAGES = 1
    LOCATION = "us"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["online bank"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")

        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")

    for file in aggregate_files:
        process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
To change your results, change the keyword_list. You can adjust any of the following constants as well: MAX_RETRIES, MAX_THREADS, PAGES, LOCATION.

Trustpilot search URLs are laid out like this:
https://www.trustpilot.com/search?query=word1+word2
Individual business pages are built from the company's domain name:
https://www.trustpilot.com/review/actual_website_domain_name
For example, if a business's website is good-bank.de, the Trustpilot URL would be:
https://www.trustpilot.com/review/good-bank.de
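Since a business page URL is just the company's domain appended to a fixed prefix, you can build one in plain Python. A quick illustrative snippet (the domain is only an example):

domain = "good-bank.de"
trustpilot_url = f"https://www.trustpilot.com/review/{domain}"
print(trustpilot_url)
# https://www.trustpilot.com/review/good-bank.de

This mirrors what the scraper does later with its trustpilot_formatted variable.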
Each of these pages embeds its data inside a script tag. The script holds JavaScript, and the JavaScript holds our JSON. Here is the JSON blob from good-bank.de. On both our search results and our business pages, all the information we want is saved in a script tag with an id of "__NEXT_DATA__".

Our search URLs are formatted like this:
https://www.trustpilot.com/search?query={formatted_keyword}&page={page_number+1}
https://www.trustpilot.com/search?query=online+bank&page=1
https://www.trustpilot.com/review/actual_website_domain_name
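Before building the full scraper, here is a minimal sketch of that extraction step in isolation: fetch a page, locate the __NEXT_DATA__ script tag, and parse its contents as JSON. It is a simplified illustration of the approach used throughout this article, not the finished code (and without a proxy the request may get blocked):

import json
import requests
from bs4 import BeautifulSoup

url = "https://www.trustpilot.com/search?query=online+bank&page=1"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
script_tag = soup.find("script", id="__NEXT_DATA__")
if script_tag:
    json_data = json.loads(script_tag.contents[0])
    # On search pages, the businesses live under props -> pageProps -> businessUnits
    print(len(json_data["props"]["pageProps"]["businessUnits"]))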
To control our geolocation, we can set the country parameter to "uk", "us", or another country code. When we pass a country into the ScrapeOps API, ScrapeOps will actually route our requests through a server in that country, so even if the site checks our geolocation, our geolocation will show up correctly!

Let's get started. First, create a new project folder:

mkdir trustpilot-scraper
cd trustpilot-scraper
Then create a virtual environment and install our dependencies:

python -m venv venv
source venv/bin/activate
pip install requests
pip install beautifulsoup4

While we still have retries left and the operation hasn't succeeded, we get the page and find the script tag with the id "__NEXT_DATA__".

import os
import csv
import requests
import json
import logging
from urllib.parse import urlencode
from bs4 import BeautifulSoup
import concurrent.futures
from dataclasses import dataclass, field, fields, asdict

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]

## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def scrape_search_results(keyword, location, page_number=0, retries=3):
    formatted_keyword = keyword.replace(" ", "+")
    url = f"https://www.trustpilot.com/search?query={formatted_keyword}&page={page_number+1}"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            response = requests.get(url)
            logger.info(f"Received [{response.status_code}] from: {url}")
            if response.status_code == 200:
                success = True
            else:
                raise Exception(f"Failed request, Status Code {response.status_code}")

            soup = BeautifulSoup(response.text, "html.parser")
            script_tag = soup.find("script", id="__NEXT_DATA__")
            if script_tag:
                json_data = json.loads(script_tag.contents[0])
                business_units = json_data["props"]["pageProps"]["businessUnits"]

                for business in business_units:
                    name = business.get("displayName").lower().replace(" ", "").replace("'", "")
                    trustpilot_formatted = business.get("contact")["website"].split("://")[1]
                    location = business.get("location")
                    category_list = business.get("categories")
                    category = category_list[0]["categoryId"] if len(category_list) > 0 else "n/a"

                    ## Extract Data
                    search_data = {
                        "name": business.get("displayName", ""),
                        "stars": business.get("stars", 0),
                        "rating": business.get("trustScore", 0),
                        "num_reviews": business.get("numberOfReviews", 0),
                        "website": business.get("contact")["website"],
                        "trustpilot_url": f"https://www.trustpilot.com/review/{trustpilot_formatted}",
                        "location": location.get("country", "n/a"),
                        "category": category
                    }

            logger.info(f"Successfully parsed data from: {url}")
            success = True

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")

def start_scrape(keyword, pages, location, retries=3):
    for page in range(pages):
        scrape_search_results(keyword, location, page, retries)

if __name__ == "__main__":

    MAX_RETRIES = 3
    MAX_THREADS = 5
    PAGES = 1
    LOCATION = "uk"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["online bank"]

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")
        scrape_search_results(keyword, LOCATION, retries=MAX_RETRIES)

    logger.info(f"Crawl complete.")
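All of the scripts in this article load the ScrapeOps API key from a config.json file in the project folder. The article doesn't show the file itself, but a minimal version that satisfies config["api_key"] would look something like this (the key is a placeholder):

{
    "api_key": "YOUR-SCRAPEOPS-API-KEY"
}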
https://www.trustpilot.com/search?query={formatted_keyword}&page={page_number+1}
import os
import csv
import requests
import json
import logging
from urllib.parse import urlencode
from bs4 import BeautifulSoup
import concurrent.futures
from dataclasses import dataclass, field, fields, asdict

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]

## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def scrape_search_results(keyword, location, page_number, retries=3):
    formatted_keyword = keyword.replace(" ", "+")
    url = f"https://www.trustpilot.com/search?query={formatted_keyword}&page={page_number+1}"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            response = requests.get(url)
            logger.info(f"Received [{response.status_code}] from: {url}")
            if response.status_code == 200:
                success = True
            else:
                raise Exception(f"Failed request, Status Code {response.status_code}")

            ## Extract Data
            soup = BeautifulSoup(response.text, "html.parser")
            script_tag = soup.find("script", id="__NEXT_DATA__")
            if script_tag:
                json_data = json.loads(script_tag.contents[0])
                business_units = json_data["props"]["pageProps"]["businessUnits"]

                for business in business_units:
                    name = business.get("displayName").lower().replace(" ", "").replace("'", "")
                    trustpilot_formatted = business.get("contact")["website"].split("://")[1]
                    location = business.get("location")
                    category_list = business.get("categories")
                    category = category_list[0]["categoryId"] if len(category_list) > 0 else "n/a"

                    ## Extract Data
                    search_data = {
                        "name": business.get("displayName", ""),
                        "stars": business.get("stars", 0),
                        "rating": business.get("trustScore", 0),
                        "num_reviews": business.get("numberOfReviews", 0),
                        "website": business.get("contact")["website"],
                        "trustpilot_url": f"https://www.trustpilot.com/review/{trustpilot_formatted}",
                        "location": location.get("country", "n/a"),
                        "category": category
                    }

            logger.info(f"Successfully parsed data from: {url}")
            success = True

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")

def start_scrape(keyword, pages, location, retries=3):
    for page in range(pages):
        scrape_search_results(keyword, location, page, retries)

if __name__ == "__main__":

    MAX_RETRIES = 3
    MAX_THREADS = 5
    PAGES = 1
    LOCATION = "uk"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["online bank"]

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")
        start_scrape(keyword, PAGES, LOCATION, retries=MAX_RETRIES)

    logger.info(f"Crawl complete.")
The big change here is the start_scrape() function, which gives us the ability to scrape multiple pages. Later on, we'll add concurrency to this function, but for now, we're just going to use a for loop as a placeholder.

Next we need proper storage, so we'll add a SearchData class and a DataPipeline class. SearchData is a dataclass and its purpose is simply to hold our data. Once we've instantiated a SearchData object, we can pass it into our DataPipeline. Take a look at the updated code.

import os
import csv
import requests
import json
import logging
import time
from urllib.parse import urlencode
from bs4 import BeautifulSoup
import concurrent.futures
from dataclasses import dataclass, field, fields, asdict

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]

## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@dataclass
class SearchData:
    name: str = ""
    stars: float = 0
    rating: float = 0
    num_reviews: int = 0
    website: str = ""
    trustpilot_url: str = ""
    location: str = ""
    category: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())

class DataPipeline:

    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)
            if not file_exists:
                writer.writeheader()
            for item in data_to_save:
                writer.writerow(asdict(item))

        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if self.is_duplicate(scraped_data) == False:
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)
        if len(self.storage_queue) > 0:
            self.save_to_csv()

def scrape_search_results(keyword, location, page_number, data_pipeline=None, retries=3):
    formatted_keyword = keyword.replace(" ", "+")
    url = f"https://www.trustpilot.com/search?query={formatted_keyword}&page={page_number+1}"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            response = requests.get(url)
            logger.info(f"Received [{response.status_code}] from: {url}")
            if response.status_code == 200:
                success = True
            else:
                raise Exception(f"Failed request, Status Code {response.status_code}")

            ## Extract Data
            soup = BeautifulSoup(response.text, "html.parser")
            script_tag = soup.find("script", id="__NEXT_DATA__")
            if script_tag:
                json_data = json.loads(script_tag.contents[0])
                business_units = json_data["props"]["pageProps"]["businessUnits"]

                for business in business_units:
                    name = business.get("displayName").lower().replace(" ", "").replace("'", "")
                    trustpilot_formatted = business.get("contact")["website"].split("://")[1]
                    location = business.get("location")
                    category_list = business.get("categories")
                    category = category_list[0]["categoryId"] if len(category_list) > 0 else "n/a"

                    ## Extract Data
                    search_data = SearchData(
                        name=business.get("displayName", ""),
                        stars=business.get("stars", 0),
                        rating=business.get("trustScore", 0),
                        num_reviews=business.get("numberOfReviews", 0),
                        website=business.get("contact")["website"],
                        trustpilot_url=f"https://www.trustpilot.com/review/{trustpilot_formatted}",
                        location=location.get("country", "n/a"),
                        category=category
                    )
                    data_pipeline.add_data(search_data)

            logger.info(f"Successfully parsed data from: {url}")
            success = True

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")

def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3):
    for page in range(pages):
        scrape_search_results(keyword, location, page, data_pipeline, retries)

if __name__ == "__main__":

    MAX_RETRIES = 3
    MAX_THREADS = 5
    PAGES = 1
    LOCATION = "uk"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["online bank"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")

        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")
The DataPipeline creates a pipeline to a CSV file. If the file already exists, we append to it; if it doesn't exist, we create it. When a SearchData object gets passed into our DataPipeline, the DataPipeline filters out duplicates and stores the rest of our relevant data in the CSV file.

Next, we'll add concurrency with ThreadPoolExecutor for multithreading. Our only major difference here is the start_scrape() function. Here is what it looks like now:

def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_search_results,
            [keyword] * pages,
            [location] * pages,
            range(pages),
            [data_pipeline] * pages,
            [retries] * pages
        )
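If executor.map() with several lists looks odd: it takes the function first, then one iterable per argument of that function, and calls the function once per position across those iterables. A small self-contained sketch of the same pattern (with a stand-in function, not the real scraper):

import concurrent.futures

def show_page(keyword, location, page_number):
    # Stand-in for scrape_search_results(): just report the arguments it received
    print(f"{keyword} | {location} | page {page_number + 1}")

pages = 3
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    executor.map(
        show_page,
        ["online bank"] * pages,   # same keyword for every call
        ["us"] * pages,            # same location for every call
        range(pages)               # a different page number for each call
    )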
Here is our full code with concurrency added:

import os
import csv
import requests
import json
import logging
import time
from urllib.parse import urlencode
from bs4 import BeautifulSoup
import concurrent.futures
from dataclasses import dataclass, field, fields, asdict

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]

## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@dataclass
class SearchData:
    name: str = ""
    stars: float = 0
    rating: float = 0
    num_reviews: int = 0
    website: str = ""
    trustpilot_url: str = ""
    location: str = ""
    category: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())

class DataPipeline:

    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)
            if not file_exists:
                writer.writeheader()
            for item in data_to_save:
                writer.writerow(asdict(item))

        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if self.is_duplicate(scraped_data) == False:
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)
        if len(self.storage_queue) > 0:
            self.save_to_csv()

def scrape_search_results(keyword, location, page_number, data_pipeline=None, retries=3):
    formatted_keyword = keyword.replace(" ", "+")
    url = f"https://www.trustpilot.com/search?query={formatted_keyword}&page={page_number+1}"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            response = requests.get(url)
            logger.info(f"Received [{response.status_code}] from: {url}")
            if response.status_code == 200:
                success = True
            else:
                raise Exception(f"Failed request, Status Code {response.status_code}")

            ## Extract Data
            soup = BeautifulSoup(response.text, "html.parser")
            script_tag = soup.find("script", id="__NEXT_DATA__")
            if script_tag:
                json_data = json.loads(script_tag.contents[0])
                business_units = json_data["props"]["pageProps"]["businessUnits"]

                for business in business_units:
                    name = business.get("displayName").lower().replace(" ", "").replace("'", "")
                    trustpilot_formatted = business.get("contact")["website"].split("://")[1]
                    location = business.get("location")
                    category_list = business.get("categories")
                    category = category_list[0]["categoryId"] if len(category_list) > 0 else "n/a"

                    ## Extract Data
                    search_data = SearchData(
                        name=business.get("displayName", ""),
                        stars=business.get("stars", 0),
                        rating=business.get("trustScore", 0),
                        num_reviews=business.get("numberOfReviews", 0),
                        website=business.get("contact")["website"],
                        trustpilot_url=f"https://www.trustpilot.com/review/{trustpilot_formatted}",
                        location=location.get("country", "n/a"),
                        category=category
                    )
                    data_pipeline.add_data(search_data)

            logger.info(f"Successfully parsed data from: {url}")
            success = True

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")

def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_search_results,
            [keyword] * pages,
            [location] * pages,
            range(pages),
            [data_pipeline] * pages,
            [retries] * pages
        )

if __name__ == "__main__":

    MAX_RETRIES = 3
    MAX_THREADS = 5
    PAGES = 1
    LOCATION = "uk"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["online bank"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")

        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")
def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url
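As a quick illustration of what this wrapper produces (the API key is abbreviated and the target URL is just an example):

target = "https://www.trustpilot.com/search?query=online+bank&page=1"
print(get_scrapeops_url(target, location="us"))
# https://proxy.scrapeops.io/v1/?api_key=...&url=https%3A%2F%2Fwww.trustpilot.com%2Fsearch%3Fquery%3Donline%2Bbank%26page%3D1&country=us

All the parameters get URL-encoded into the query string, so the proxy receives the original page URL intact.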
Here is our full code with the proxy function added:

import os
import csv
import requests
import json
import logging
import time
from urllib.parse import urlencode
from bs4 import BeautifulSoup
import concurrent.futures
from dataclasses import dataclass, field, fields, asdict

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]

def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url

## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@dataclass
class SearchData:
    name: str = ""
    stars: float = 0
    rating: float = 0
    num_reviews: int = 0
    website: str = ""
    trustpilot_url: str = ""
    location: str = ""
    category: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())

class DataPipeline:

    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)
            if not file_exists:
                writer.writeheader()
            for item in data_to_save:
                writer.writerow(asdict(item))

        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if self.is_duplicate(scraped_data) == False:
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)
        if len(self.storage_queue) > 0:
            self.save_to_csv()

def scrape_search_results(keyword, location, page_number, data_pipeline=None, retries=3):
    formatted_keyword = keyword.replace(" ", "+")
    url = f"https://www.trustpilot.com/search?query={formatted_keyword}&page={page_number+1}"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            scrapeops_proxy_url = get_scrapeops_url(url, location=location)
            response = requests.get(scrapeops_proxy_url)
            logger.info(f"Received [{response.status_code}] from: {url}")
            if response.status_code == 200:
                success = True
            else:
                raise Exception(f"Failed request, Status Code {response.status_code}")

            ## Extract Data
            soup = BeautifulSoup(response.text, "html.parser")
            script_tag = soup.find("script", id="__NEXT_DATA__")
            if script_tag:
                json_data = json.loads(script_tag.contents[0])
                business_units = json_data["props"]["pageProps"]["businessUnits"]

                for business in business_units:
                    name = business.get("displayName").lower().replace(" ", "").replace("'", "")
                    trustpilot_formatted = business.get("contact")["website"].split("://")[1]
                    location = business.get("location")
                    category_list = business.get("categories")
                    category = category_list[0]["categoryId"] if len(category_list) > 0 else "n/a"

                    ## Extract Data
                    search_data = SearchData(
                        name=business.get("displayName", ""),
                        stars=business.get("stars", 0),
                        rating=business.get("trustScore", 0),
                        num_reviews=business.get("numberOfReviews", 0),
                        website=business.get("contact")["website"],
                        trustpilot_url=f"https://www.trustpilot.com/review/{trustpilot_formatted}",
                        location=location.get("country", "n/a"),
                        category=category
                    )
                    data_pipeline.add_data(search_data)

            logger.info(f"Successfully parsed data from: {url}")
            success = True

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")

def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_search_results,
            [keyword] * pages,
            [location] * pages,
            range(pages),
            [data_pipeline] * pages,
            [retries] * pages
        )

if __name__ == "__main__":

    MAX_RETRIES = 3
    MAX_THREADS = 5
    PAGES = 1
    LOCATION = "uk"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["online bank"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")

        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")
Now let's test it out from our main block. I'm changing a few constants here.

if __name__ == "__main__":

    MAX_RETRIES = 3
    MAX_THREADS = 5
    PAGES = 10
    LOCATION = "us"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["online bank"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")

        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")
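The timings below come from watching the log output. If you'd like to measure a run yourself, one simple (unofficial) way is to wrap the crawl in time.perf_counter():

import time

start = time.perf_counter()
# run start_scrape(...) for each keyword here, exactly as in main above
elapsed = time.perf_counter() - start
print(f"Crawl finished in {elapsed:.2f} seconds")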
PAGES has been set to 10 and LOCATION has been set to "us". Now let's see how long it takes to process 10 pages of data. Here are the results: we processed 10 pages in just over 4 seconds!

With the crawl working, the next step is scraping each individual business page. The process_business() function below handles that:

def process_business(row, location, retries=3):
    url = row["trustpilot_url"]
    tries = 0
    success = False

    while tries <= retries and not success:
        response = requests.get(url)
        try:
            if response.status_code == 200:
                logger.info(f"Status: {response.status_code}")
                soup = BeautifulSoup(response.text, "html.parser")
                script = soup.find("script", id="__NEXT_DATA__")
                json_data = json.loads(script.contents[0])
                business_info = json_data["props"]["pageProps"]
                reviews = business_info["reviews"]

                for review in reviews:
                    review_data = {
                        "name": review["consumer"]["displayName"],
                        "rating": review["rating"],
                        "text": review["text"],
                        "title": review["title"],
                        "date": review["dates"]["publishedDate"]
                    }
                    print(review_data)
                success = True

            else:
                logger.warning(f"Failed Response: {response.status_code}")
                raise Exception(f"Failed Request, status code: {response.status_code}")

        except Exception as e:
            logger.error(f"Exception thrown: {e}")
            logger.warning(f"Failed to process page: {row['trustpilot_url']}")
            logger.warning(f"Retries left: {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")
    else:
        logger.info(f"Successfully parsed: {row['trustpilot_url']}")
This function takes a row from our CSV file and then fetches the trustpilot_url of the business. Just like before, it looks for the script tag with the id of "__NEXT_DATA__" to find our JSON blob. To actually use our process_business() function, we need to be able to read the rows from our CSV file. Now we're going to fully update our code.

import os
import csv
import requests
import json
import logging
import time
from urllib.parse import urlencode
from bs4 import BeautifulSoup
import concurrent.futures
from dataclasses import dataclass, field, fields, asdict

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]

def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url

## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@dataclass
class SearchData:
    name: str = ""
    stars: float = 0
    rating: float = 0
    num_reviews: int = 0
    website: str = ""
    trustpilot_url: str = ""
    location: str = ""
    category: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())

class DataPipeline:

    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)
            if not file_exists:
                writer.writeheader()
            for item in data_to_save:
                writer.writerow(asdict(item))

        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if self.is_duplicate(scraped_data) == False:
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)
        if len(self.storage_queue) > 0:
            self.save_to_csv()

def scrape_search_results(keyword, location, page_number, data_pipeline=None, retries=3):
    formatted_keyword = keyword.replace(" ", "+")
    url = f"https://www.trustpilot.com/search?query={formatted_keyword}&page={page_number+1}"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            scrapeops_proxy_url = get_scrapeops_url(url, location=location)
            response = requests.get(scrapeops_proxy_url)
            logger.info(f"Received [{response.status_code}] from: {url}")
            if response.status_code == 200:
                success = True
            else:
                raise Exception(f"Failed request, Status Code {response.status_code}")

            ## Extract Data
            soup = BeautifulSoup(response.text, "html.parser")
            script_tag = soup.find("script", id="__NEXT_DATA__")
            if script_tag:
                json_data = json.loads(script_tag.contents[0])
                business_units = json_data["props"]["pageProps"]["businessUnits"]

                for business in business_units:
                    name = business.get("displayName").lower().replace(" ", "").replace("'", "")
                    trustpilot_formatted = business.get("contact")["website"].split("://")[1]
                    location = business.get("location")
                    category_list = business.get("categories")
                    category = category_list[0]["categoryId"] if len(category_list) > 0 else "n/a"

                    ## Extract Data
                    search_data = SearchData(
                        name=business.get("displayName", ""),
                        stars=business.get("stars", 0),
                        rating=business.get("trustScore", 0),
                        num_reviews=business.get("numberOfReviews", 0),
                        website=business.get("contact")["website"],
                        trustpilot_url=f"https://www.trustpilot.com/review/{trustpilot_formatted}",
                        location=location.get("country", "n/a"),
                        category=category
                    )
                    data_pipeline.add_data(search_data)

            logger.info(f"Successfully parsed data from: {url}")
            success = True

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")

def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_search_results,
            [keyword] * pages,
            [location] * pages,
            range(pages),
            [data_pipeline] * pages,
            [retries] * pages
        )

def process_business(row, location, retries=3):
    url = row["trustpilot_url"]
    tries = 0
    success = False

    while tries <= retries and not success:
        response = requests.get(url)
        try:
            if response.status_code == 200:
                logger.info(f"Status: {response.status_code}")
                soup = BeautifulSoup(response.text, "html.parser")
                script = soup.find("script", id="__NEXT_DATA__")
                json_data = json.loads(script.contents[0])
                business_info = json_data["props"]["pageProps"]
                reviews = business_info["reviews"]

                for review in reviews:
                    review_data = {
                        "name": review["consumer"]["displayName"],
                        "rating": review["rating"],
                        "text": review["text"],
                        "title": review["title"],
                        "date": review["dates"]["publishedDate"]
                    }
                    print(review_data)
                success = True

            else:
                logger.warning(f"Failed Response: {response.status_code}")
                raise Exception(f"Failed Request, status code: {response.status_code}")

        except Exception as e:
            logger.error(f"Exception thrown: {e}")
            logger.warning(f"Failed to process page: {row['trustpilot_url']}")
            logger.warning(f"Retries left: {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")
    else:
        logger.info(f"Successfully parsed: {row['trustpilot_url']}")

def process_results(csv_file, location, max_threads=5, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))
        for row in reader:
            process_business(row, location, retries)

if __name__ == "__main__":

    MAX_RETRIES = 3
    MAX_THREADS = 5
    PAGES = 1
    LOCATION = "uk"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["online bank"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")

        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")

    for file in aggregate_files:
        process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
Our new process_results() function reads the rows from our CSV file and passes each of them into process_business(). process_business() then pulls our information and prints it to the terminal.

To store this data properly, we'll add a ReviewData class. This class simply holds data, just like our SearchData. We then pass our ReviewData into a DataPipeline, just like we did earlier. Here is the updated code.

import os
import csv
import requests
import json
import logging
import time
from urllib.parse import urlencode
from bs4 import BeautifulSoup
import concurrent.futures
from dataclasses import dataclass, field, fields, asdict

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]

def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url

## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@dataclass
class SearchData:
    name: str = ""
    stars: float = 0
    rating: float = 0
    num_reviews: int = 0
    website: str = ""
    trustpilot_url: str = ""
    location: str = ""
    category: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())

@dataclass
class ReviewData:
    name: str = ""
    rating: float = 0
    text: str = ""
    title: str = ""
    date: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())

class DataPipeline:

    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)
            if not file_exists:
                writer.writeheader()
            for item in data_to_save:
                writer.writerow(asdict(item))

        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if self.is_duplicate(scraped_data) == False:
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)
        if len(self.storage_queue) > 0:
            self.save_to_csv()

def scrape_search_results(keyword, location, page_number, data_pipeline=None, retries=3):
    formatted_keyword = keyword.replace(" ", "+")
    url = f"https://www.trustpilot.com/search?query={formatted_keyword}&page={page_number+1}"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            scrapeops_proxy_url = get_scrapeops_url(url, location=location)
            response = requests.get(scrapeops_proxy_url)
            logger.info(f"Received [{response.status_code}] from: {url}")
            if response.status_code == 200:
                success = True
            else:
                raise Exception(f"Failed request, Status Code {response.status_code}")

            ## Extract Data
            soup = BeautifulSoup(response.text, "html.parser")
            script_tag = soup.find("script", id="__NEXT_DATA__")
            if script_tag:
                json_data = json.loads(script_tag.contents[0])
                business_units = json_data["props"]["pageProps"]["businessUnits"]

                for business in business_units:
                    name = business.get("displayName").lower().replace(" ", "").replace("'", "")
                    trustpilot_formatted = business.get("contact")["website"].split("://")[1]
                    location = business.get("location")
                    category_list = business.get("categories")
                    category = category_list[0]["categoryId"] if len(category_list) > 0 else "n/a"

                    ## Extract Data
                    search_data = SearchData(
                        name=business.get("displayName", ""),
                        stars=business.get("stars", 0),
                        rating=business.get("trustScore", 0),
                        num_reviews=business.get("numberOfReviews", 0),
                        website=business.get("contact")["website"],
                        trustpilot_url=f"https://www.trustpilot.com/review/{trustpilot_formatted}",
                        location=location.get("country", "n/a"),
                        category=category
                    )
                    data_pipeline.add_data(search_data)

            logger.info(f"Successfully parsed data from: {url}")
            success = True

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")

def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_search_results,
            [keyword] * pages,
            [location] * pages,
            range(pages),
            [data_pipeline] * pages,
            [retries] * pages
        )

def process_business(row, location, retries=3):
    url = row["trustpilot_url"]
    tries = 0
    success = False

    while tries <= retries and not success:
        response = requests.get(url)
        try:
            if response.status_code == 200:
                logger.info(f"Status: {response.status_code}")
                soup = BeautifulSoup(response.text, "html.parser")
                script = soup.find("script", id="__NEXT_DATA__")
                json_data = json.loads(script.contents[0])
                business_info = json_data["props"]["pageProps"]
                reviews = business_info["reviews"]

                review_pipeline = DataPipeline(csv_filename=f"{row['name'].replace(' ', '-')}.csv")
                for review in reviews:
                    review_data = ReviewData(
                        name=review["consumer"]["displayName"],
                        rating=review["rating"],
                        text=review["text"],
                        title=review["title"],
                        date=review["dates"]["publishedDate"]
                    )
                    review_pipeline.add_data(review_data)
                review_pipeline.close_pipeline()
                success = True

            else:
                logger.warning(f"Failed Response: {response.status_code}")
                raise Exception(f"Failed Request, status code: {response.status_code}")

        except Exception as e:
            logger.error(f"Exception thrown: {e}")
            logger.warning(f"Failed to process page: {row['trustpilot_url']}")
            logger.warning(f"Retries left: {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")
    else:
        logger.info(f"Successfully parsed: {row['trustpilot_url']}")

def process_results(csv_file, location, max_threads=5, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))
        for row in reader:
            process_business(row, location, retries)

if __name__ == "__main__":

    MAX_RETRIES = 3
    MAX_THREADS = 5
    PAGES = 1
    LOCATION = "uk"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["online bank"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")

        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")

    for file in aggregate_files:
        process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
Here is our process_results() function refactored for concurrency.

def process_results(csv_file, location, max_threads=5, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
            executor.map(
                process_business,
                reader,
                [location] * len(reader),
                [retries] * len(reader)
            )
To finish up, we once again route our requests through the proxy. Only one line of process_business() needs to change:

response = requests.get(get_scrapeops_url(url, location=location))

Here is our final code:

import os
import csv
import requests
import json
import logging
import time
from urllib.parse import urlencode
from bs4 import BeautifulSoup
import concurrent.futures
from dataclasses import dataclass, field, fields, asdict

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]

def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url

## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@dataclass
class SearchData:
    name: str = ""
    stars: float = 0
    rating: float = 0
    num_reviews: int = 0
    website: str = ""
    trustpilot_url: str = ""
    location: str = ""
    category: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())

@dataclass
class ReviewData:
    name: str = ""
    rating: float = 0
    text: str = ""
    title: str = ""
    date: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())

class DataPipeline:

    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)
            if not file_exists:
                writer.writeheader()
            for item in data_to_save:
                writer.writerow(asdict(item))

        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if self.is_duplicate(scraped_data) == False:
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)
        if len(self.storage_queue) > 0:
            self.save_to_csv()

def scrape_search_results(keyword, location, page_number, data_pipeline=None, retries=3):
    formatted_keyword = keyword.replace(" ", "+")
    url = f"https://www.trustpilot.com/search?query={formatted_keyword}&page={page_number+1}"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            scrapeops_proxy_url = get_scrapeops_url(url, location=location)
            response = requests.get(scrapeops_proxy_url)
            logger.info(f"Received [{response.status_code}] from: {url}")
            if response.status_code == 200:
                success = True
            else:
                raise Exception(f"Failed request, Status Code {response.status_code}")

            ## Extract Data
            soup = BeautifulSoup(response.text, "html.parser")
            script_tag = soup.find("script", id="__NEXT_DATA__")
            if script_tag:
                json_data = json.loads(script_tag.contents[0])
                business_units = json_data["props"]["pageProps"]["businessUnits"]

                for business in business_units:
                    name = business.get("displayName").lower().replace(" ", "").replace("'", "")
                    trustpilot_formatted = business.get("contact")["website"].split("://")[1]
                    location = business.get("location")
                    category_list = business.get("categories")
                    category = category_list[0]["categoryId"] if len(category_list) > 0 else "n/a"

                    ## Extract Data
                    search_data = SearchData(
                        name=business.get("displayName", ""),
                        stars=business.get("stars", 0),
                        rating=business.get("trustScore", 0),
                        num_reviews=business.get("numberOfReviews", 0),
                        website=business.get("contact")["website"],
                        trustpilot_url=f"https://www.trustpilot.com/review/{trustpilot_formatted}",
                        location=location.get("country", "n/a"),
                        category=category
                    )
                    data_pipeline.add_data(search_data)

            logger.info(f"Successfully parsed data from: {url}")
            success = True

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")

def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_search_results,
            [keyword] * pages,
            [location] * pages,
            range(pages),
            [data_pipeline] * pages,
            [retries] * pages
        )

def process_business(row, location, retries=3):
    url = row["trustpilot_url"]
    tries = 0
    success = False

    while tries <= retries and not success:
        response = requests.get(get_scrapeops_url(url, location=location))
        try:
            if response.status_code == 200:
                logger.info(f"Status: {response.status_code}")
                soup = BeautifulSoup(response.text, "html.parser")
                script = soup.find("script", id="__NEXT_DATA__")
                json_data = json.loads(script.contents[0])
                business_info = json_data["props"]["pageProps"]
                reviews = business_info["reviews"]

                review_pipeline = DataPipeline(csv_filename=f"{row['name'].replace(' ', '-')}.csv")
                for review in reviews:
                    review_data = ReviewData(
                        name=review["consumer"]["displayName"],
                        rating=review["rating"],
                        text=review["text"],
                        title=review["title"],
                        date=review["dates"]["publishedDate"]
                    )
                    review_pipeline.add_data(review_data)
                review_pipeline.close_pipeline()
                success = True

            else:
                logger.warning(f"Failed Response: {response.status_code}")
                raise Exception(f"Failed Request, status code: {response.status_code}")

        except Exception as e:
            logger.error(f"Exception thrown: {e}")
            logger.warning(f"Failed to process page: {row['trustpilot_url']}")
            logger.warning(f"Retries left: {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")
    else:
        logger.info(f"Successfully parsed: {row['trustpilot_url']}")

def process_results(csv_file, location, max_threads=5, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
            executor.map(
                process_business,
                reader,
                [location] * len(reader),
                [retries] * len(reader)
            )

if __name__ == "__main__":

    MAX_RETRIES = 3
    MAX_THREADS = 5
    PAGES = 1
    LOCATION = "uk"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["online bank"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")

        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")

    for file in aggregate_files:
        process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
Once again, let's update our main.

if __name__ == "__main__":

    MAX_RETRIES = 3
    MAX_THREADS = 5
    PAGES = 10
    LOCATION = "us"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["online bank"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")

        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")

    for file in aggregate_files:
        process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
We've set PAGES to 10 and our LOCATION to "us". Here are the results.

It took just over 100 seconds (including the time it took to create our initial report) to generate a full report and process all the results (86 rows). This comes out to a speed of about 1.17 seconds per business.

When scraping Trustpilot, pay attention to their robots.txt. You can view their robots.txt file here. Always be careful about the information you extract and don't scrape private or confidential data.

Then check out ScrapeOps, the complete toolkit for web scraping.
import osimport csvimport jsonimport loggingfrom urllib.parse import urlencodeimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict from selenium import webdriverfrom selenium.webdriver.common.by import By OPTIONS = webdriver.ChromeOptions()OPTIONS.add_argument("--headless") API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": "us" } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" stars: float = 0 rating: float = 0 num_reviews: int = 0 website: str = "" trustpilot_url: str = "" location: str = "" category: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) @dataclassclass ReviewData: name: str = "" rating: float = 0 text: str = "" title: str = "" date: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, page_number, data_pipeline=None, retries=3): driver = webdriver.Chrome(options=OPTIONS) formatted_keyword = keyword.replace(" ", "+") url = f"https://www.trustpilot.com/search?query={formatted_keyword}&page={page_number+1}" tries = 0 success = False while tries <= retries and not success: try: scrapeops_proxy_url = get_scrapeops_url(url, location=location) driver.get(scrapeops_proxy_url) logger.info(f"{keyword}: Fetched page {page_number}") ## Extract Data script_tag = driver.find_element(By.CSS_SELECTOR, "script[id='__NEXT_DATA__'") if script_tag: json_data = json.loads(script_tag.get_attribute("innerHTML")) business_units = json_data["props"]["pageProps"]["businessUnits"] for business in business_units: name = business.get("displayName").lower().replace(" ", "").replace("'", "") trustpilot_formatted = business.get("contact")["website"].split("://")[1] location = business.get("location") category_list = business.get("categories") category = category_list[0]["categoryId"] if len(category_list) > 0 else "n/a" ## Extract Data search_data = SearchData( name = business.get("displayName", ""), stars = business.get("stars", 0), rating = business.get("trustScore", 0), num_reviews = business.get("numberOfReviews", 0), website = business.get("contact")["website"], trustpilot_url = f"https://www.trustpilot.com/review/{trustpilot_formatted}", location = location.get("country", "n/a"), category = category ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") driver.quit() success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, [keyword] * pages, [location] * pages, range(pages), [data_pipeline] * pages, [retries] * pages ) def process_business(row, location, retries=3): url = row["trustpilot_url"] tries = 0 success = False while tries <= retries and not success: driver = webdriver.Chrome(options=OPTIONS) try: driver.get(get_scrapeops_url(url, location=location)) script = driver.find_element(By.CSS_SELECTOR, "script[id='__NEXT_DATA__'") json_data = json.loads(script.get_attribute("innerHTML")) business_info = json_data["props"]["pageProps"] reviews = business_info["reviews"] review_pipeline = DataPipeline(csv_filename=f"{row['name'].replace(' ', '-')}.csv") for review in reviews: review_data = ReviewData( name= review["consumer"]["displayName"], rating= review["rating"], text= review["text"], title= review["title"], date= review["dates"]["publishedDate"] ) review_pipeline.add_data(review_data) review_pipeline.close_pipeline() success = True except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {row['trustpilot_url']}") logger.warning(f"Retries left: 
{retries-tries}") tries += 1 if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['trustpilot_url']}") def process_results(csv_file, location, max_threads=5, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( process_business, reader, [location] * len(reader), [retries] * len(reader) ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 PAGES = 1 LOCATION = "us" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["online bank"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") for file in aggregate_files: process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
keyword_list: This array contains the search keywords you want to use on Trustpilot to find businesses. Each keyword in the array will be used to perform a separate search and gather data on matching businesses.

MAX_RETRIES: This value sets the number of retry attempts for each scraping task if it fails. More retries increase the chances of successful scraping despite intermittent errors or temporary issues, but also prolong the total scraping time.

MAX_THREADS: This number controls how many scraping tasks are run concurrently. A higher limit can speed up the scraping process but may increase the load on your system and the target website, potentially leading to rate limiting or bans.

PAGES: This value represents the number of pages of search results you want to scrape for each keyword. Each page typically contains a set number of business listings.

LOCATION: This string specifies the geographical location to use for the proxy service. It determines the country from which the scraping requests appear to originate. The location might affect the results due to regional differences in business listings and reviews.

To perform a search on Trustpilot, we use a URL in this format:

https://www.trustpilot.com/search?query=word1+word2
https://www.trustpilot.com/review/actual_website_domain_name
These review URLs are built from the business's domain name. For example, for good-bank.de, the Trustpilot URL would be:

https://www.trustpilot.com/review/good-bank.de
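As a quick illustration, both URL patterns can be built with a couple of string operations; the keyword and domain below are just example values:

# Example values only -- any keyword or business domain works the same way.
keyword = "online bank"
domain = "good-bank.de"

search_url = f"https://www.trustpilot.com/search?query={keyword.replace(' ', '+')}"
review_url = f"https://www.trustpilot.com/review/{domain}"

print(search_url)  # https://www.trustpilot.com/search?query=online+bank
print(review_url)  # https://www.trustpilot.com/review/good-bank.de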
Both page types embed their data inside a script tag. The script holds JavaScript, and the JavaScript holds our JSON. Here is the JSON blob from good-bank.de. On both our search results and our business pages, all the information we want is saved in a script tag with an id of "__NEXT_DATA__" (see the short sketch below).

When paginating search results, we request URLs in this format:

https://www.trustpilot.com/search?query={formatted_keyword}&page={page_number+1}
https://www.trustpilot.com/search?query=online+bank&page=1
https://www.trustpilot.com/review/actual_website_domain_name
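As mentioned above, everything we need is tucked inside the __NEXT_DATA__ script tag. Here is a minimal, standalone sketch of how that blob can be pulled out with Selenium; it assumes the same headless Chrome setup used throughout this article, and without a proxy the request may get blocked:

import json
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
driver.get("https://www.trustpilot.com/search?query=online+bank")

# The page embeds all of its data as JSON inside this script tag
script_tag = driver.find_element(By.CSS_SELECTOR, "script[id='__NEXT_DATA__']")
json_data = json.loads(script_tag.get_attribute("innerHTML"))

# On search pages, the business listings live under these keys
business_units = json_data["props"]["pageProps"]["businessUnits"]
print(f"Found {len(business_units)} businesses")

driver.quit()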
If we want to appear in the UK, we set the country parameter to "uk"; for the US, we set it to "us". When we pass a country into the ScrapeOps API, ScrapeOps will actually route our requests through a server in that country, so even if the site checks our geolocation, our geolocation will show up correctly!

Start by creating a new project folder:

mkdir trustpilot-scraper
cd trustpilot-scraper
python -m venv venvsource venv/bin/activatepip install seleniumwhile we still have retries left and the operation hasn't succeeded, we get the page and find the script tag with the id, "__NEXT_DATA__".import osimport csvimport jsonimport loggingfrom urllib.parse import urlencodeimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict from selenium import webdriverfrom selenium.webdriver.common.by import By OPTIONS = webdriver.ChromeOptions()OPTIONS.add_argument("--headless") API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) def scrape_search_results(keyword, location, page_number, retries=3): driver = webdriver.Chrome(options=OPTIONS) formatted_keyword = keyword.replace(" ", "+") url = f"https://www.trustpilot.com/search?query={formatted_keyword}" tries = 0 success = False while tries <= retries and not success: try: driver.get(url) logger.info(f"{keyword}: Fetched page {page_number}") ## Extract Data script_tag = driver.find_element(By.CSS_SELECTOR, "script[id='__NEXT_DATA__'") if script_tag: json_data = json.loads(script_tag.get_attribute("innerHTML")) business_units = json_data["props"]["pageProps"]["businessUnits"] for business in business_units: name = business.get("displayName").lower().replace(" ", "").replace("'", "") trustpilot_formatted = business.get("contact")["website"].split("://")[1] location = business.get("location") category_list = business.get("categories") category = category_list[0]["categoryId"] if len(category_list) > 0 else "n/a" ## Extract Data search_data = { "name": business.get("displayName", ""), "stars": business.get("stars", 0), "rating": business.get("trustScore", 0), "num_reviews": business.get("numberOfReviews", 0), "website": business.get("contact")["website"], "trustpilot_url": f"https://www.trustpilot.com/review/{trustpilot_formatted}", "location": location.get("country", "n/a"), "category": category } print(search_data) logger.info(f"Successfully parsed data from: {url}") driver.quit() success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, pages, location, data_pipeline=None, retries=3): for page in range(pages): scrape_search_results(keyword, page, location, data_pipeline, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 PAGES = 1 LOCATION = "us" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["online bank"] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() logger.info(f"Crawl complete.")
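Every version of the script above reads its ScrapeOps API key from a config.json file in the project folder. Assuming you keep that layout, the file only needs a single field (the key below is a placeholder):

{
    "api_key": "YOUR-SCRAPEOPS-API-KEY"
}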
https://www.trustpilot.com/search?query={formatted_keyword}&page={page_number+1}
import osimport csvimport jsonimport loggingfrom urllib.parse import urlencodeimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict from selenium import webdriverfrom selenium.webdriver.common.by import By OPTIONS = webdriver.ChromeOptions()OPTIONS.add_argument("--headless") API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) def scrape_search_results(keyword, location, page_number, data_pipeline=None, retries=3): driver = webdriver.Chrome(options=OPTIONS) formatted_keyword = keyword.replace(" ", "+") url = f"https://www.trustpilot.com/search?query={formatted_keyword}&page={page_number+1}" tries = 0 success = False while tries <= retries and not success: try: driver.get(url) logger.info(f"{keyword}: Fetched page {page_number}") ## Extract Data script_tag = driver.find_element(By.CSS_SELECTOR, "script[id='__NEXT_DATA__'") if script_tag: json_data = json.loads(script_tag.get_attribute("innerHTML")) business_units = json_data["props"]["pageProps"]["businessUnits"] for business in business_units: name = business.get("displayName").lower().replace(" ", "").replace("'", "") trustpilot_formatted = business.get("contact")["website"].split("://")[1] location = business.get("location") category_list = business.get("categories") category = category_list[0]["categoryId"] if len(category_list) > 0 else "n/a" ## Extract Data search_data = { "name": business.get("displayName", ""), "stars": business.get("stars", 0), "rating": business.get("trustScore", 0), "num_reviews": business.get("numberOfReviews", 0), "website": business.get("contact")["website"], "trustpilot_url": f"https://www.trustpilot.com/review/{trustpilot_formatted}", "location": location.get("country", "n/a"), "category": category } print(search_data) logger.info(f"Successfully parsed data from: {url}") driver.quit() success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, pages, location, data_pipeline=None, retries=3): for page in range(pages): scrape_search_results(keyword, page, location, data_pipeline, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 PAGES = 1 LOCATION = "us" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["online bank"] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() logger.info(f"Crawl complete.")
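Trustpilot's result pages are 1-indexed while our range() starts at 0, so page_number+1 maps to the real page. As a quick sketch, here are the URLs a three-page crawl of "online bank" would request:

# Illustration only: the URLs generated for PAGES = 3.
formatted_keyword = "online bank".replace(" ", "+")

for page_number in range(3):
    print(f"https://www.trustpilot.com/search?query={formatted_keyword}&page={page_number+1}")

# https://www.trustpilot.com/search?query=online+bank&page=1
# https://www.trustpilot.com/search?query=online+bank&page=2
# https://www.trustpilot.com/search?query=online+bank&page=3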
startScrape() function which gives us the ability to scrape multiple pages. Later on, we'll add concurrency to this function, but for now, we're just going to use a for loop as a placeholder.We take in a range() of pages and then we go though and run scrape_search_results() on each page.SearchData class and a DataPipeline class.SearchData is a dataclass and the purpose of it is to simply hold our data. Once we've instantiated the SearchData, we can pass it into our DataPipeline.Take a look at the updated code.import osimport csvimport jsonimport loggingfrom urllib.parse import urlencodeimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict from selenium import webdriverfrom selenium.webdriver.common.by import By OPTIONS = webdriver.ChromeOptions()OPTIONS.add_argument("--headless") API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" stars: float = 0 rating: float = 0 num_reviews: int = 0 website: str = "" trustpilot_url: str = "" location: str = "" category: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, page_number, data_pipeline=None, retries=3): driver = webdriver.Chrome(options=OPTIONS) formatted_keyword = keyword.replace(" ", "+") url = f"https://www.trustpilot.com/search?query={formatted_keyword}&page={page_number+1}" tries = 0 success = False while tries <= retries and not success: try: driver.get(url) logger.info(f"{keyword}: Fetched page {page_number}") ## Extract Data script_tag = driver.find_element(By.CSS_SELECTOR, "script[id='__NEXT_DATA__'") if script_tag: json_data = json.loads(script_tag.get_attribute("innerHTML")) business_units = json_data["props"]["pageProps"]["businessUnits"] for business in business_units: name = business.get("displayName").lower().replace(" ", "").replace("'", "") trustpilot_formatted = business.get("contact")["website"].split("://")[1] location = business.get("location") category_list = business.get("categories") category = category_list[0]["categoryId"] if len(category_list) > 0 else "n/a" ## Extract Data search_data = SearchData( name = business.get("displayName", ""), stars = business.get("stars", 0), rating = business.get("trustScore", 0), num_reviews = business.get("numberOfReviews", 0), website = business.get("contact")["website"], trustpilot_url = f"https://www.trustpilot.com/review/{trustpilot_formatted}", location = location.get("country", "n/a"), category = category ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") driver.quit() success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, pages, location, data_pipeline=None, retries=3): for page in range(pages): scrape_search_results(keyword, page, location, data_pipeline, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 PAGES = 1 LOCATION = "us" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["online bank"] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() logger.info(f"Crawl complete.")
DataPipeline creates a pipeline to a CSV file. If the file already exists, we append to it. If it doesn't exist, we create it. SearchData gets passed into our DataPipeline; the DataPipeline filters out duplicates and stores the rest of our relevant data in a CSV file.

Next, we'll add ThreadPoolExecutor for multithreading. Our only major difference here is the start_scrape() function. Here is what it looks like now:

def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_search_results,
            [keyword] * pages,
            [location] * pages,
            range(pages),
            [data_pipeline] * pages,
            [retries] * pages
        )
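For clarity, executor.map() takes one element from each of those lists per call, so the threaded start_scrape() behaves roughly like the serial loop below (the function name here is just for illustration), except that up to max_threads pages are fetched at once:

def start_scrape_serial(keyword, pages, location, data_pipeline=None, retries=3):
    # Rough serial equivalent of the threaded version above, for illustration.
    for page in range(pages):
        scrape_search_results(keyword, location, page, data_pipeline=data_pipeline, retries=retries)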
import osimport csvimport jsonimport loggingfrom urllib.parse import urlencodeimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict from selenium import webdriverfrom selenium.webdriver.common.by import By OPTIONS = webdriver.ChromeOptions()OPTIONS.add_argument("--headless") API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" stars: float = 0 rating: float = 0 num_reviews: int = 0 website: str = "" trustpilot_url: str = "" location: str = "" category: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, page_number, data_pipeline=None, retries=3): driver = webdriver.Chrome(options=OPTIONS) formatted_keyword = keyword.replace(" ", "+") url = f"https://www.trustpilot.com/search?query={formatted_keyword}&page={page_number+1}" tries = 0 success = False while tries <= retries and not success: try: driver.get(url) logger.info(f"{keyword}: Fetched page {page_number}") ## Extract Data script_tag = driver.find_element(By.CSS_SELECTOR, "script[id='__NEXT_DATA__'") if script_tag: json_data = json.loads(script_tag.get_attribute("innerHTML")) business_units = json_data["props"]["pageProps"]["businessUnits"] for business in business_units: name = business.get("displayName").lower().replace(" ", "").replace("'", "") trustpilot_formatted = business.get("contact")["website"].split("://")[1] location = business.get("location") category_list = business.get("categories") category = category_list[0]["categoryId"] if len(category_list) > 0 else "n/a" ## Extract Data search_data = SearchData( name = business.get("displayName", ""), stars = business.get("stars", 0), rating = business.get("trustScore", 0), num_reviews = business.get("numberOfReviews", 0), website = business.get("contact")["website"], trustpilot_url = f"https://www.trustpilot.com/review/{trustpilot_formatted}", location = location.get("country", "n/a"), category = category ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") driver.quit() success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, [keyword] * pages, [location] * pages, range(pages), [data_pipeline] * pages, [retries] * pages ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 PAGES = 1 LOCATION = "us" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["online bank"] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() logger.info(f"Crawl complete.")
def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        # Use the location argument so the proxy routes through the requested country
        "country": location
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url
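Here is roughly what the wrapped URL ends up looking like; the key below is a placeholder and the exact encoding comes from urlencode():

proxied = get_scrapeops_url("https://www.trustpilot.com/search?query=online+bank", location="us")
print(proxied)
# https://proxy.scrapeops.io/v1/?api_key=YOUR-KEY&url=https%3A%2F%2Fwww.trustpilot.com%2Fsearch%3Fquery%3Donline%2Bbank&country=us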
import osimport csvimport jsonimport loggingfrom urllib.parse import urlencodeimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict from selenium import webdriverfrom selenium.webdriver.common.by import By OPTIONS = webdriver.ChromeOptions()OPTIONS.add_argument("--headless") API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": "us" } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" stars: float = 0 rating: float = 0 num_reviews: int = 0 website: str = "" trustpilot_url: str = "" location: str = "" category: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, page_number, data_pipeline=None, retries=3): driver = webdriver.Chrome(options=OPTIONS) formatted_keyword = keyword.replace(" ", "+") url = f"https://www.trustpilot.com/search?query={formatted_keyword}&page={page_number+1}" tries = 0 success = False while tries <= retries and not success: try: driver.get(url) logger.info(f"{keyword}: Fetched page {page_number}") ## Extract Data script_tag = driver.find_element(By.CSS_SELECTOR, "script[id='__NEXT_DATA__'") if script_tag: json_data = json.loads(script_tag.get_attribute("innerHTML")) business_units = json_data["props"]["pageProps"]["businessUnits"] for business in business_units: name = business.get("displayName").lower().replace(" ", "").replace("'", "") trustpilot_formatted = business.get("contact")["website"].split("://")[1] location = business.get("location") category_list = business.get("categories") category = category_list[0]["categoryId"] if len(category_list) > 0 else "n/a" ## Extract Data search_data = SearchData( name = business.get("displayName", ""), stars = business.get("stars", 0), rating = business.get("trustScore", 0), num_reviews = business.get("numberOfReviews", 0), website = business.get("contact")["website"], trustpilot_url = f"https://www.trustpilot.com/review/{trustpilot_formatted}", location = location.get("country", "n/a"), category = category ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") driver.quit() success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, [keyword] * pages, [location] * pages, range(pages), [data_pipeline] * pages, [retries] * pages ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 PAGES = 1 LOCATION = "us" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["online bank"] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() logger.info(f"Crawl complete.")
Now we'll update our main. I'm changing a few constants here.

if __name__ == "__main__":
    MAX_RETRIES = 3
    MAX_THREADS = 5
    PAGES = 10
    LOCATION = "us"
    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["online bank"]

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")

        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
    logger.info(f"Crawl complete.")
PAGES has been set to 10 and LOCATION has been set to "us". Now let's see how long it takes to process 10 pages of data. Here are the results:

We processed 10 pages in roughly 21 seconds. All in all, that comes out to about 2.1 seconds per page!

Next, we add a process_business() function that scrapes the reviews for an individual business.

def process_business(row, location, retries=3):
    url = row["trustpilot_url"]
    tries = 0
    success = False

    while tries <= retries and not success:
        driver = webdriver.Chrome(options=OPTIONS)
        try:
            driver.get(url)
            script = driver.find_element(By.CSS_SELECTOR, "script[id='__NEXT_DATA__']")
            json_data = json.loads(script.get_attribute("innerHTML"))

            business_info = json_data["props"]["pageProps"]
            reviews = business_info["reviews"]

            for review in reviews:
                review_data = {
                    "name": review["consumer"]["displayName"],
                    "rating": review["rating"],
                    "text": review["text"],
                    "title": review["title"],
                    "date": review["dates"]["publishedDate"]
                }
                print(review_data)
            success = True

        except Exception as e:
            logger.error(f"Exception thrown: {e}")
            logger.warning(f"Failed to process page: {row['trustpilot_url']}")
            logger.warning(f"Retries left: {retries-tries}")
            tries += 1
        finally:
            # Always shut the browser down, even when the request fails
            driver.quit()

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")
    else:
        logger.info(f"Successfully parsed: {row['trustpilot_url']}")
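process_business() expects a dict shaped like one row of the crawl CSV (csv.DictReader gives us exactly that). A hypothetical row, with invented values, looks like this; at this stage only the trustpilot_url field is actually used:

# A made-up row matching the columns our crawler writes.
row = {
    "name": "Example Bank",
    "stars": "4",
    "rating": "4.3",
    "num_reviews": "1200",
    "website": "https://example-bank.com",
    "trustpilot_url": "https://www.trustpilot.com/review/example-bank.com",
    "location": "US",
    "category": "bank"
}

process_business(row, "us", retries=3)  # prints each review as a dict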
row from our CSV file and then fetches the trustpilot_url of the business.script tag with the id of "__NEXT_DATA__" to find our JSON blob.process_business() function, we need to be able to read the rows from our CSV file. Now we're going to fully update our code so we can actually read information from the CSV.import osimport csvimport jsonimport loggingfrom urllib.parse import urlencodeimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict from selenium import webdriverfrom selenium.webdriver.common.by import By OPTIONS = webdriver.ChromeOptions()OPTIONS.add_argument("--headless") API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": "us" } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" stars: float = 0 rating: float = 0 num_reviews: int = 0 website: str = "" trustpilot_url: str = "" location: str = "" category: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, page_number, data_pipeline=None, retries=3): driver = webdriver.Chrome(options=OPTIONS) formatted_keyword = keyword.replace(" ", "+") url = f"https://www.trustpilot.com/search?query={formatted_keyword}&page={page_number+1}" tries = 0 success = False while tries <= retries and not success: try: driver.get(url) logger.info(f"{keyword}: Fetched page {page_number}") ## Extract Data script_tag = driver.find_element(By.CSS_SELECTOR, "script[id='__NEXT_DATA__'") if script_tag: json_data = json.loads(script_tag.get_attribute("innerHTML")) business_units = json_data["props"]["pageProps"]["businessUnits"] for business in business_units: name = business.get("displayName").lower().replace(" ", "").replace("'", "") trustpilot_formatted = business.get("contact")["website"].split("://")[1] location = business.get("location") category_list = business.get("categories") category = category_list[0]["categoryId"] if len(category_list) > 0 else "n/a" ## Extract Data search_data = SearchData( name = business.get("displayName", ""), stars = business.get("stars", 0), rating = business.get("trustScore", 0), num_reviews = business.get("numberOfReviews", 0), website = business.get("contact")["website"], trustpilot_url = f"https://www.trustpilot.com/review/{trustpilot_formatted}", location = location.get("country", "n/a"), category = category ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") driver.quit() success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, [keyword] * pages, [location] * pages, range(pages), [data_pipeline] * pages, [retries] * pages ) def process_business(row, location, retries=3): url = row["trustpilot_url"] tries = 0 success = False while tries <= retries and not success: driver = webdriver.Chrome(options=OPTIONS) try: driver.get(url) script = driver.find_element(By.CSS_SELECTOR, "script[id='__NEXT_DATA__'") json_data = json.loads(script.get_attribute("innerHTML")) business_info = json_data["props"]["pageProps"] reviews = business_info["reviews"] for review in reviews: review_data = { "name": review["consumer"]["displayName"], "rating": review["rating"], "text": review["text"], "title": review["title"], "date": review["dates"]["publishedDate"] } print(review_data) success = True except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {row['trustpilot_url']}") logger.warning(f"Retries left: {retries-tries}") tries += 1 if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['trustpilot_url']}") def process_results(csv_file, location, max_threads=5, retries=3): logger.info(f"processing {csv_file}") 
with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) for row in reader: process_business(row, location, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 PAGES = 1 LOCATION = "us" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["online bank"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") for file in aggregate_files: process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
process_results() function reads the rows from our CSV file and passes each of them into process_business(). process_business() then pulls our information and prints it to the terminal.ReviewData class. This class is going to simply hold data, just like our SearchData.We then pass our ReviewData into a DataPipeline just like we did earlier.import osimport csvimport jsonimport loggingfrom urllib.parse import urlencodeimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict from selenium import webdriverfrom selenium.webdriver.common.by import By OPTIONS = webdriver.ChromeOptions()OPTIONS.add_argument("--headless") API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": "us" } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" stars: float = 0 rating: float = 0 num_reviews: int = 0 website: str = "" trustpilot_url: str = "" location: str = "" category: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) @dataclassclass ReviewData: name: str = "" rating: float = 0 text: str = "" title: str = "" date: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, page_number, data_pipeline=None, retries=3): driver = webdriver.Chrome(options=OPTIONS) formatted_keyword = keyword.replace(" ", "+") url = f"https://www.trustpilot.com/search?query={formatted_keyword}&page={page_number+1}" tries = 0 success = False while tries <= retries and not success: try: driver.get(get_scrapeops_url(url, location=location)) logger.info(f"{keyword}: Fetched page {page_number}") ## Extract Data script_tag = driver.find_element(By.CSS_SELECTOR, "script[id='__NEXT_DATA__'") if script_tag: json_data = json.loads(script_tag.get_attribute("innerHTML")) business_units = json_data["props"]["pageProps"]["businessUnits"] for business in business_units: name = business.get("displayName").lower().replace(" ", "").replace("'", "") trustpilot_formatted = business.get("contact")["website"].split("://")[1] location = business.get("location") category_list = business.get("categories") category = category_list[0]["categoryId"] if len(category_list) > 0 else "n/a" ## Extract Data search_data = SearchData( name = business.get("displayName", ""), stars = business.get("stars", 0), rating = business.get("trustScore", 0), num_reviews = business.get("numberOfReviews", 0), website = business.get("contact")["website"], trustpilot_url = f"https://www.trustpilot.com/review/{trustpilot_formatted}", location = location.get("country", "n/a"), category = category ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") driver.quit() success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, [keyword] * pages, [location] * pages, range(pages), [data_pipeline] * pages, [retries] * pages ) def process_business(row, location, retries=3): url = row["trustpilot_url"] tries = 0 success = False while tries <= retries and not success: driver = webdriver.Chrome(options=OPTIONS) try: driver.get(url, location=location) script = driver.find_element(By.CSS_SELECTOR, "script[id='__NEXT_DATA__'") json_data = json.loads(script.get_attribute("innerHTML")) business_info = json_data["props"]["pageProps"] reviews = business_info["reviews"] review_pipeline = DataPipeline(csv_filename=f"{row['name'].replace(' ', '-')}.csv") for review in reviews: review_data = ReviewData( name= review["consumer"]["displayName"], rating= review["rating"], text= review["text"], title= review["title"], date= review["dates"]["publishedDate"] ) review_pipeline.add_data(review_data) review_pipeline.close_pipeline() success = True except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {row['trustpilot_url']}") logger.warning(f"Retries left: {retries-tries}") tries += 1 if not success: raise Exception(f"Max Retries 
exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['trustpilot_url']}") def process_results(csv_file, location, max_threads=5, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) for row in reader: process_business(row, location, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 PAGES = 1 LOCATION = "us" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["online bank"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") for file in aggregate_files: process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
Here is our process_results() function refactored for concurrency.

def process_results(csv_file, location, max_threads=5, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
            executor.map(
                process_business,
                reader,
                [location] * len(reader),
                [retries] * len(reader)
            )
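Calling it on the crawl report fans the rows out across the thread pool; for example, assuming the crawl for "online bank" produced online-bank.csv:

# Each row (business) is handed to its own worker thread.
process_results("online-bank.csv", "us", max_threads=5, retries=3)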
scrapeops_proxy_url = get_scrapeops_url(url, location=location)
driver.get(scrapeops_proxy_url)
import osimport csvimport jsonimport loggingfrom urllib.parse import urlencodeimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict from selenium import webdriverfrom selenium.webdriver.common.by import By OPTIONS = webdriver.ChromeOptions()OPTIONS.add_argument("--headless") API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": "us" } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" stars: float = 0 rating: float = 0 num_reviews: int = 0 website: str = "" trustpilot_url: str = "" location: str = "" category: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) @dataclassclass ReviewData: name: str = "" rating: float = 0 text: str = "" title: str = "" date: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, page_number, data_pipeline=None, retries=3): driver = webdriver.Chrome(options=OPTIONS) formatted_keyword = keyword.replace(" ", "+") url = f"https://www.trustpilot.com/search?query={formatted_keyword}&page={page_number+1}" tries = 0 success = False while tries <= retries and not success: try: scrapeops_proxy_url = get_scrapeops_url(url, location=location) driver.get(scrapeops_proxy_url) logger.info(f"{keyword}: Fetched page {page_number}") ## Extract Data script_tag = driver.find_element(By.CSS_SELECTOR, "script[id='__NEXT_DATA__'") if script_tag: json_data = json.loads(script_tag.get_attribute("innerHTML")) business_units = json_data["props"]["pageProps"]["businessUnits"] for business in business_units: name = business.get("displayName").lower().replace(" ", "").replace("'", "") trustpilot_formatted = business.get("contact")["website"].split("://")[1] location = business.get("location") category_list = business.get("categories") category = category_list[0]["categoryId"] if len(category_list) > 0 else "n/a" ## Extract Data search_data = SearchData( name = business.get("displayName", ""), stars = business.get("stars", 0), rating = business.get("trustScore", 0), num_reviews = business.get("numberOfReviews", 0), website = business.get("contact")["website"], trustpilot_url = f"https://www.trustpilot.com/review/{trustpilot_formatted}", location = location.get("country", "n/a"), category = category ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") driver.quit() success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, [keyword] * pages, [location] * pages, range(pages), [data_pipeline] * pages, [retries] * pages ) def process_business(row, location, retries=3): url = row["trustpilot_url"] tries = 0 success = False while tries <= retries and not success: driver = webdriver.Chrome(options=OPTIONS) try: driver.get(get_scrapeops_url(url, location=location)) script = driver.find_element(By.CSS_SELECTOR, "script[id='__NEXT_DATA__'") json_data = json.loads(script.get_attribute("innerHTML")) business_info = json_data["props"]["pageProps"] reviews = business_info["reviews"] review_pipeline = DataPipeline(csv_filename=f"{row['name'].replace(' ', '-')}.csv") for review in reviews: review_data = ReviewData( name= review["consumer"]["displayName"], rating= review["rating"], text= review["text"], title= review["title"], date= review["dates"]["publishedDate"] ) review_pipeline.add_data(review_data) review_pipeline.close_pipeline() success = True except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {row['trustpilot_url']}") logger.warning(f"Retries left: 
{retries-tries}") tries += 1 if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['trustpilot_url']}") def process_results(csv_file, location, max_threads=5, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( process_business, reader, [location] * len(reader), [retries] * len(reader) ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 PAGES = 1 LOCATION = "us" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["online bank"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") for file in aggregate_files: process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
To run this scraper in production, we only need to update the constants in our main.

if __name__ == "__main__":
    MAX_RETRIES = 3
    MAX_THREADS = 5
    PAGES = 10
    LOCATION = "us"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["online bank"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")
        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")

    for file in aggregate_files:
        process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
We set our PAGES to 10 and our LOCATION to "us". Here are the results.

It took just over 247 seconds (including the time it took to create our initial report) to generate a full report and process all the results (86 rows). This comes out to a speed of about 2.87 seconds per business.

When scraping Trustpilot, pay attention to their robots.txt. You can view their robots.txt file here.

Always be careful about the information you extract and don't scrape private or confidential data. If a website is hidden behind a login, that is generally considered private data. If your data does not require a login, it is generally considered to be public data. If you have questions about the legality of your scraping job, it is best to consult an attorney familiar with the laws and localities you're dealing with.
const puppeteer = require('puppeteer');const createCsvWriter = require('csv-writer').createObjectCsvWriter;const csvParse = require('csv-parse');const fs = require('fs'); const API_KEY = JSON.parse(fs.readFileSync('config.json')).api_key; async function writeToCsv(data, outputFile) { if (!data || data.length === 0) { throw new Error('No data to write!'); } const fileExists = fs.existsSync(outputFile); const headers = Object.keys(data[0]).map((key) => ({ id: key, title: key })); const csvWriter = createCsvWriter({ path: outputFile, header: headers, append: fileExists, }); try { await csvWriter.writeRecords(data); } catch (e) { throw new Error('Failed to write to csv'); }} async function readCsv(inputFile) { const results = []; const parser = fs.createReadStream(inputFile).pipe( csvParse.parse({ columns: true, delimiter: ',', trim: true, skip_empty_lines: true, }) ); for await (const record of parser) { results.push(record); } return results;} function range(start, end) { const array = []; for (let i = start; i < end; i++) { array.push(i); } return array;} function getScrapeOpsUrl(url, location = 'us') { const params = new URLSearchParams({ api_key: API_KEY, url: url, country: location, }); return `https://proxy.scrapeops.io/v1/?${params.toString()}`;} async function scrapeSearchResults( browser, keyword, pageNumber, location = 'us', retries = 3) { let tries = 0; let success = false; while (tries <= retries && !success) { const formattedKeyword = keyword.replace(' ', '+'); const page = await browser.newPage(); try { const url = `https://www.trustpilot.com/search?query=${formattedKeyword}&page=${pageNumber + 1}`; const proxyUrl = getScrapeOpsUrl(url, location); await page.goto(proxyUrl); console.log(`Successfully fetched: ${url}`); const script = await page.$("script[id='__NEXT_DATA__']"); const innerHTML = await page.evaluate( (element) => element.innerHTML, script ); const jsonData = JSON.parse(innerHTML); const businessUnits = jsonData.props.pageProps.businessUnits; for (const business of businessUnits) { let category = 'n/a'; if ('categories' in business && business.categories.length > 0) { category = business.categories[0].categoryId; } let location = 'n/a'; if ('location' in business && 'country' in business.location) { location = business.location.country; } const trustpilotFormatted = business.contact.website.split('://')[1]; const businessInfo = { name: business.displayName .toLowerCase() .replace(' ', '') .replace("'", ''), stars: business.stars, rating: business.trustScore, num_reviews: business.numberOfReviews, website: business.contact.website, trustpilot_url: `https://www.trustpilot.com/review/${trustpilotFormatted}`, location: location, category: category, }; await writeToCsv([businessInfo], `${keyword.replace(' ', '-')}.csv`); } success = true; } catch (err) { console.log(`Error: ${err}, tries left ${retries - tries}`); tries++; } finally { await page.close(); } }} async function startScrape( keyword, pages, location, concurrencyLimit, retries) { const pageList = range(0, pages); const browser = await puppeteer.launch(); while (pageList.length > 0) { const currentBatch = pageList.splice(0, concurrencyLimit); const tasks = currentBatch.map((page) => scrapeSearchResults(browser, keyword, page, location, retries) ); try { await Promise.all(tasks); } catch (err) { console.log(`Failed to process batch: ${err}`); } } await browser.close();} async function processBusiness(browser, row, location, retries = 3) { const url = row.trustpilot_url; let tries = 0; let success = false; while 
(tries <= retries && !success) { const page = await browser.newPage(); try { await page.goto(getScrapeOpsUrl(url, location)); const script = await page.$("script[id='__NEXT_DATA__']"); const innerHTML = await page.evaluate( (element) => element.innerHTML, script ); const jsonData = JSON.parse(innerHTML); const businessInfo = jsonData.props.pageProps; const reviews = businessInfo.reviews; for (const review of reviews) { const reviewData = { name: review.consumer.displayName, rating: review.rating, text: review.text, title: review.title, date: review.dates.publishedDate, }; await writeToCsv([reviewData], `${row.name}.csv`); } success = true; } catch (err) { console.log(`Error: ${err}, tries left: ${retries - tries}`); tries++; } finally { await page.close(); } }} async function processResults(csvFile, location, concurrencyLimit, retries) { const businesses = await readCsv(csvFile); const browser = await puppeteer.launch(); while (businesses.length > 0) { const currentBatch = businesses.splice(0, concurrencyLimit); const tasks = currentBatch.map((business) => processBusiness(browser, business, location, retries) ); try { await Promise.all(tasks); } catch (err) { console.log(`Failed to process batch: ${err}`); } } await browser.close();} async function main() { const keywords = ['online bank']; const concurrencyLimit = 5; const pages = 1; const location = 'us'; const retries = 3; const aggregateFiles = []; for (const keyword of keywords) { await startScrape(keyword, pages, location, concurrencyLimit, retries); aggregateFiles.push(`${keyword.replace(' ', '-')}.csv`); } for (const file of aggregateFiles) { await processResults(file, location, concurrencyLimit, retries); }} main();
keywords: This array contains the search keywords you want to use on Trustpilot to find businesses. Each keyword in the array will be used to perform a separate search and gather data on matching businesses.

concurrencyLimit: This number controls how many scraping tasks are run concurrently. A higher limit can speed up the scraping process but may increase the load on your system and the target website, potentially leading to rate limiting or bans.

pages: This value represents the number of pages of search results you want to scrape for each keyword. Each page typically contains a set number of business listings.

location: This string specifies the geographical location to use for the proxy service. It determines the country from which the scraping requests appear to originate. The location might affect the results due to regional differences in business listings and reviews.

retries: This value sets the number of retry attempts for each scraping task if it fails. More retries increase the chances of successful scraping despite intermittent errors or temporary issues, but they also prolong the total scraping time. (An example configuration is sketched below.)

Trustpilot search URLs are laid out like this:

https://www.trustpilot.com/search?query=word1+word2
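As a quick illustration of the options above (the values here are arbitrary), a deeper, multi-keyword run might set them like this:

// Illustrative values only — drop these into the main() shown earlier.
const keywords = ['online bank', 'credit union']; // two separate Trustpilot searches
const concurrencyLimit = 3; // at most 3 pages/businesses processed at once
const pages = 5;            // crawl 5 pages of search results per keyword
const location = 'uk';      // route requests through a UK proxy server
const retries = 2;          // retry each failed task up to 2 times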
https://www.trustpilot.com/review/actual_website_domain_name
For example, if a business's website is good-bank.de, the Trustpilot URL would be:

https://www.trustpilot.com/review/good-bank.de
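Since a review page URL is just the bare domain appended to https://www.trustpilot.com/review/, you could make that explicit with a tiny helper. This is a hypothetical sketch, mirroring the inline business.contact.website.split('://')[1] used in the listings:

// Hypothetical helper: derive a Trustpilot review URL from a business website URL.
function buildReviewUrl(websiteUrl) {
  // Strip the protocol ("https://" or "http://") to get the bare domain
  const domain = websiteUrl.split('://')[1];
  return `https://www.trustpilot.com/review/${domain}`;
}

// Example usage:
console.log(buildReviewUrl('https://good-bank.de'));
// -> https://www.trustpilot.com/review/good-bank.de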
All of this data comes embedded in a script tag. The script holds JavaScript, and the JavaScript holds our JSON. Here is the JSON blob from good-bank.de.

On both our search results and our business pages, all the information we want is saved in a script tag with an id of "__NEXT_DATA__". All we have to do is pull this JSON from the page and then parse it.

Our paginated search URLs follow this format:

https://www.trustpilot.com/search?query={formatted_keyword}&page={page_number+1}
https://www.trustpilot.com/search?query=online+bank&page=1
https://www.trustpilot.com/review/actual_website_domain_name
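As noted above, everything we need lives in the __NEXT_DATA__ script tag. Here is a minimal, standalone Puppeteer sketch (assuming Puppeteer is installed) that pulls that JSON out of a page and prints its top-level pageProps keys, just to confirm the blob is there:

const puppeteer = require('puppeteer');

// Minimal sketch: fetch a page and inspect its __NEXT_DATA__ JSON blob.
// The URL below is only an example; any Trustpilot search or review page works the same way.
async function inspectNextData(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  try {
    await page.goto(url);
    const script = await page.$("script[id='__NEXT_DATA__']");
    const innerHTML = await page.evaluate((element) => element.innerHTML, script);
    const jsonData = JSON.parse(innerHTML);
    // pageProps is where businessUnits (search pages) and reviews (business pages) live
    console.log(Object.keys(jsonData.props.pageProps));
  } finally {
    await page.close();
    await browser.close();
  }
}

inspectNextData('https://www.trustpilot.com/search?query=online+bank');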
To control our geolocation, we can set the country parameter to "uk", "us", or another supported country code. When we pass a country into the ScrapeOps API, ScrapeOps will actually route our requests through a server in that country, so even if the site checks our geolocation, our geolocation will show up correctly!

To get started, create a new project folder and move into it:

mkdir trustpilot-scraper
cd trustpilot-scraper
npm init --ynpm install puppeteernpm install csv-writernpm install csv-parsenpm install fswhile we still have retries left and the operation hasn't succeeded, we get the page and find the script tag with the id, "__NEXT_DATA__".const puppeteer = require('puppeteer');const createCsvWriter = require('csv-writer').createObjectCsvWriter;const csvParse = require('csv-parse');const fs = require('fs'); const API_KEY = JSON.parse(fs.readFileSync('config.json')).api_key; async function scrapeSearchResults( browser, keyword, location = 'us', retries = 3) { let tries = 0; let success = false; while (tries <= retries && !success) { const formattedKeyword = keyword.replace(' ', '+'); const page = await browser.newPage(); try { const url = `https://www.trustpilot.com/search?query=${formattedKeyword}`; await page.goto(url); console.log(`Successfully fetched: ${url}`); const script = await page.$("script[id='__NEXT_DATA__']"); const innerHTML = await page.evaluate( (element) => element.innerHTML, script ); const jsonData = JSON.parse(innerHTML); const businessUnits = jsonData.props.pageProps.businessUnits; for (const business of businessUnits) { let category = 'n/a'; if ('categories' in business && business.categories.length > 0) { category = business.categories[0].categoryId; } let location = 'n/a'; if ('location' in business && 'country' in business.location) { location = business.location.country; } const trustpilotFormatted = business.contact.website.split('://')[1]; const businessInfo = { name: business.displayName .toLowerCase() .replace(' ', '') .replace("'", ''), stars: business.stars, rating: business.trustScore, num_reviews: business.numberOfReviews, website: business.contact.website, trustpilot_url: `https://www.trustpilot.com/review/${trustpilotFormatted}`, location: location, category: category, }; console.log(businessInfo); } success = true; } catch (err) { console.log(`Error: ${err}, tries left ${retries - tries}`); tries++; } finally { await page.close(); } }} async function main() { const keywords = ['online bank']; const location = 'us'; const retries = 3; for (const keyword of keywords) { const browser = await puppeteer.launch(); await scrapeSearchResults(browser, keyword, location, retries); await browser.close(); }} main();
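One caveat about the listings worth flagging: String.prototype.replace() with a string pattern only replaces the first occurrence, so keyword.replace(' ', '+') only converts the first space (the single-space keyword 'online bank' used here is unaffected, and the same applies to the displayName cleanup). For longer phrases, a global replacement is safer:

// keyword.replace(' ', '+') only swaps the FIRST space:
console.log('best online bank'.replace(' ', '+'));    // best+online bank
// A global replacement handles every space:
console.log('best online bank'.replaceAll(' ', '+')); // best+online+bank
// Or, on older Node versions without replaceAll, use a regex with the g flag:
console.log('best online bank'.replace(/ /g, '+'));   // best+online+bank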
https://www.trustpilot.com/search?query={formatted_keyword}&page={page_number+1}
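The page parameter in the URL is 1-indexed while our page numbers start at 0, which is why the code adds 1. As a small, hypothetical helper (not part of the article's listings), the construction looks like this:

// Hypothetical helper: build a Trustpilot search URL for a zero-based page number.
function buildSearchUrl(keyword, pageNumber) {
  const formattedKeyword = keyword.replace(' ', '+');
  // Trustpilot pages are 1-indexed, our page numbers are 0-indexed
  return `https://www.trustpilot.com/search?query=${formattedKeyword}&page=${pageNumber + 1}`;
}

console.log(buildSearchUrl('online bank', 0));
// -> https://www.trustpilot.com/search?query=online+bank&page=1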
const puppeteer = require('puppeteer');const createCsvWriter = require('csv-writer').createObjectCsvWriter;const csvParse = require('csv-parse');const fs = require('fs'); const API_KEY = JSON.parse(fs.readFileSync('config.json')).api_key; function range(start, end) { const array = []; for (let i = start; i < end; i++) { array.push(i); } return array;} async function scrapeSearchResults( browser, keyword, pageNumber, location = 'us', retries = 3) { let tries = 0; let success = false; while (tries <= retries && !success) { const formattedKeyword = keyword.replace(' ', '+'); const page = await browser.newPage(); try { const url = `https://www.trustpilot.com/search?query=${formattedKeyword}&page=${pageNumber + 1}`; await page.goto(url); console.log(`Successfully fetched: ${url}`); const script = await page.$("script[id='__NEXT_DATA__']"); const innerHTML = await page.evaluate( (element) => element.innerHTML, script ); const jsonData = JSON.parse(innerHTML); const businessUnits = jsonData.props.pageProps.businessUnits; for (const business of businessUnits) { let category = 'n/a'; if ('categories' in business && business.categories.length > 0) { category = business.categories[0].categoryId; } let location = 'n/a'; if ('location' in business && 'country' in business.location) { location = business.location.country; } const trustpilotFormatted = business.contact.website.split('://')[1]; const businessInfo = { name: business.displayName .toLowerCase() .replace(' ', '') .replace("'", ''), stars: business.stars, rating: business.trustScore, num_reviews: business.numberOfReviews, website: business.contact.website, trustpilot_url: `https://www.trustpilot.com/review/${trustpilotFormatted}`, location: location, category: category, }; console.log(businessInfo); } success = true; } catch (err) { console.log(`Error: ${err}, tries left ${retries - tries}`); tries++; } finally { await page.close(); } }} async function startScrape(keyword, pages, location, retries) { const pageList = range(0, pages); const browser = await puppeteer.launch(); for (const page of pageList) { await scrapeSearchResults(browser, keyword, page, location, retries); } await browser.close();} async function main() { const keywords = ['online bank']; const concurrencyLimit = 5; const pages = 1; const location = 'us'; const retries = 3; for (const keyword of keywords) { await startScrape(keyword, pages, location, concurrencyLimit, retries); }} main();
startScrape() function which gives us the ability to scrape multiple pages. Later on, we'll add concurrency to this function, but for now, we're just going to use a for loop as a placeholder.writeToCsv() function. This function takes data (an array of JSON objects) and an outputFile.outputFile exists, we append it.const puppeteer = require('puppeteer');const createCsvWriter = require('csv-writer').createObjectCsvWriter;const csvParse = require('csv-parse');const fs = require('fs'); const API_KEY = JSON.parse(fs.readFileSync('config.json')).api_key; async function writeToCsv(data, outputFile) { if (!data || data.length === 0) { throw new Error('No data to write!'); } const fileExists = fs.existsSync(outputFile); const headers = Object.keys(data[0]).map((key) => ({ id: key, title: key })); const csvWriter = createCsvWriter({ path: outputFile, header: headers, append: fileExists, }); try { await csvWriter.writeRecords(data); } catch (e) { throw new Error('Failed to write to csv'); }} function range(start, end) { const array = []; for (let i = start; i < end; i++) { array.push(i); } return array;} async function scrapeSearchResults( browser, keyword, pageNumber, location = 'us', retries = 3) { let tries = 0; let success = false; while (tries <= retries && !success) { const formattedKeyword = keyword.replace(' ', '+'); const page = await browser.newPage(); try { const url = `https://www.trustpilot.com/search?query=${formattedKeyword}&page=${pageNumber + 1}`; await page.goto(url); console.log(`Successfully fetched: ${url}`); const script = await page.$("script[id='__NEXT_DATA__']"); const innerHTML = await page.evaluate( (element) => element.innerHTML, script ); const jsonData = JSON.parse(innerHTML); const businessUnits = jsonData.props.pageProps.businessUnits; for (const business of businessUnits) { let category = 'n/a'; if ('categories' in business && business.categories.length > 0) { category = business.categories[0].categoryId; } let location = 'n/a'; if ('location' in business && 'country' in business.location) { location = business.location.country; } const trustpilotFormatted = business.contact.website.split('://')[1]; const businessInfo = { name: business.displayName .toLowerCase() .replace(' ', '') .replace("'", ''), stars: business.stars, rating: business.trustScore, num_reviews: business.numberOfReviews, website: business.contact.website, trustpilot_url: `https://www.trustpilot.com/review/${trustpilotFormatted}`, location: location, category: category, }; await writeToCsv([businessInfo], `${keyword.replace(' ', '-')}.csv`); } success = true; } catch (err) { console.log(`Error: ${err}, tries left ${retries - tries}`); tries++; } finally { await page.close(); } }} async function startScrape(keyword, pages, location, retries) { const pageList = range(0, pages); const browser = await puppeteer.launch(); for (const page of pageList) { await scrapeSearchResults(browser, keyword, page, location, retries); } await browser.close();} async function main() { const keywords = ['online bank']; const concurrencyLimit = 5; const pages = 1; const location = 'us'; const retries = 3; for (const keyword of keywords) { await startScrape(keyword, pages, location, concurrencyLimit, retries); }} main();
businessInfo holds all the information that we scraped. Once we have a businessInfo, we simply write it to CSV with await writeToCsv([businessInfo], `${keyword.replace(" ", "-")}.csv`).

Next, we use async programming and batching in order to process batches of multiple results simultaneously. We use a concurrencyLimit to determine our batch size. While we still have pages to scrape, we splice() out a batch and process it. Once that batch has finished, we move onto the next one.

Our only major difference here is the startScrape() function. Here is what it looks like now:

async function startScrape(
  keyword,
  pages,
  location,
  concurrencyLimit,
  retries
) {
  const pageList = range(0, pages);
  const browser = await puppeteer.launch();

  while (pageList.length > 0) {
    const currentBatch = pageList.splice(0, concurrencyLimit);
    const tasks = currentBatch.map((page) =>
      scrapeSearchResults(browser, keyword, page, location, retries)
    );

    try {
      await Promise.all(tasks);
    } catch (err) {
      console.log(`Failed to process batch: ${err}`);
    }
  }

  await browser.close();
}
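The splice-and-Promise.all pattern above is a general way to cap concurrency without pulling in an extra library. Here is a stripped-down, hypothetical sketch of the same idea (error handling omitted for brevity):

// Generic "process in batches of N" helper (hypothetical, for illustration).
// items: array of inputs, worker: async function, limit: max tasks in flight per batch.
async function processInBatches(items, worker, limit) {
  const queue = [...items]; // copy so we don't mutate the caller's array
  while (queue.length > 0) {
    const batch = queue.splice(0, limit); // take up to `limit` items
    await Promise.all(batch.map((item) => worker(item))); // run the batch concurrently
  }
}

// Example: log numbers 0-9, at most 3 at a time.
processInBatches(
  [...Array(10).keys()],
  async (n) => console.log(`processing ${n}`),
  3
);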
const puppeteer = require('puppeteer');const createCsvWriter = require('csv-writer').createObjectCsvWriter;const csvParse = require('csv-parse');const fs = require('fs'); const API_KEY = JSON.parse(fs.readFileSync('config.json')).api_key; async function writeToCsv(data, outputFile) { if (!data || data.length === 0) { throw new Error('No data to write!'); } const fileExists = fs.existsSync(outputFile); const headers = Object.keys(data[0]).map((key) => ({ id: key, title: key })); const csvWriter = createCsvWriter({ path: outputFile, header: headers, append: fileExists, }); try { await csvWriter.writeRecords(data); } catch (e) { throw new Error('Failed to write to csv'); }} function range(start, end) { const array = []; for (let i = start; i < end; i++) { array.push(i); } return array;} async function scrapeSearchResults( browser, keyword, pageNumber, location = 'us', retries = 3) { let tries = 0; let success = false; while (tries <= retries && !success) { const formattedKeyword = keyword.replace(' ', '+'); const page = await browser.newPage(); try { const url = `https://www.trustpilot.com/search?query=${formattedKeyword}&page=${pageNumber + 1}`; await page.goto(url); console.log(`Successfully fetched: ${url}`); const script = await page.$("script[id='__NEXT_DATA__']"); const innerHTML = await page.evaluate( (element) => element.innerHTML, script ); const jsonData = JSON.parse(innerHTML); const businessUnits = jsonData.props.pageProps.businessUnits; for (const business of businessUnits) { let category = 'n/a'; if ('categories' in business && business.categories.length > 0) { category = business.categories[0].categoryId; } let location = 'n/a'; if ('location' in business && 'country' in business.location) { location = business.location.country; } const trustpilotFormatted = business.contact.website.split('://')[1]; const businessInfo = { name: business.displayName .toLowerCase() .replace(' ', '') .replace("'", ''), stars: business.stars, rating: business.trustScore, num_reviews: business.numberOfReviews, website: business.contact.website, trustpilot_url: `https://www.trustpilot.com/review/${trustpilotFormatted}`, location: location, category: category, }; await writeToCsv([businessInfo], `${keyword.replace(' ', '-')}.csv`); } success = true; } catch (err) { console.log(`Error: ${err}, tries left ${retries - tries}`); tries++; } finally { await page.close(); } }} async function startScrape( keyword, pages, location, concurrencyLimit, retries) { const pageList = range(0, pages); const browser = await puppeteer.launch(); while (pageList.length > 0) { const currentBatch = pageList.splice(0, concurrencyLimit); const tasks = currentBatch.map((page) => scrapeSearchResults(browser, keyword, page, location, retries) ); try { await Promise.all(tasks); } catch (err) { console.log(`Failed to process batch: ${err}`); } } await browser.close();} async function main() { const keywords = ['online bank']; const concurrencyLimit = 5; const pages = 1; const location = 'us'; const retries = 3; for (const keyword of keywords) { await startScrape(keyword, pages, location, concurrencyLimit, retries); }} main();
function getScrapeOpsUrl(url, location = 'us') {
  const params = new URLSearchParams({
    api_key: API_KEY,
    url: url,
    country: location,
  });
  return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}
getScrapeOpsUrl() takes in all of our parameters and uses simple string formatting to return the proxyUrl that we're going to be using.In this example, our code barely changes at all, but it brings us to a production ready level. Take a look at the full code example below.const puppeteer = require('puppeteer');const createCsvWriter = require('csv-writer').createObjectCsvWriter;const csvParse = require('csv-parse');const fs = require('fs'); const API_KEY = JSON.parse(fs.readFileSync('config.json')).api_key; async function writeToCsv(data, outputFile) { if (!data || data.length === 0) { throw new Error('No data to write!'); } const fileExists = fs.existsSync(outputFile); const headers = Object.keys(data[0]).map((key) => ({ id: key, title: key })); const csvWriter = createCsvWriter({ path: outputFile, header: headers, append: fileExists, }); try { await csvWriter.writeRecords(data); } catch (e) { throw new Error('Failed to write to csv'); }} function range(start, end) { const array = []; for (let i = start; i < end; i++) { array.push(i); } return array;} function getScrapeOpsUrl(url, location = 'us') { const params = new URLSearchParams({ api_key: API_KEY, url: url, country: location, }); return `https://proxy.scrapeops.io/v1/?${params.toString()}`;} async function scrapeSearchResults( browser, keyword, pageNumber, location = 'us', retries = 3) { let tries = 0; let success = false; while (tries <= retries && !success) { const formattedKeyword = keyword.replace(' ', '+'); const page = await browser.newPage(); try { const url = `https://www.trustpilot.com/search?query=${formattedKeyword}&page=${pageNumber + 1}`; const proxyUrl = getScrapeOpsUrl(url, location); await page.goto(proxyUrl); console.log(`Successfully fetched: ${url}`); const script = await page.$("script[id='__NEXT_DATA__']"); const innerHTML = await page.evaluate( (element) => element.innerHTML, script ); const jsonData = JSON.parse(innerHTML); const businessUnits = jsonData.props.pageProps.businessUnits; for (const business of businessUnits) { let category = 'n/a'; if ('categories' in business && business.categories.length > 0) { category = business.categories[0].categoryId; } let location = 'n/a'; if ('location' in business && 'country' in business.location) { location = business.location.country; } const trustpilotFormatted = business.contact.website.split('://')[1]; const businessInfo = { name: business.displayName .toLowerCase() .replace(' ', '') .replace("'", ''), stars: business.stars, rating: business.trustScore, num_reviews: business.numberOfReviews, website: business.contact.website, trustpilot_url: `https://www.trustpilot.com/review/${trustpilotFormatted}`, location: location, category: category, }; await writeToCsv([businessInfo], `${keyword.replace(' ', '-')}.csv`); } success = true; } catch (err) { console.log(`Error: ${err}, tries left ${retries - tries}`); tries++; } finally { await page.close(); } }} async function startScrape( keyword, pages, location, concurrencyLimit, retries) { const pageList = range(0, pages); const browser = await puppeteer.launch(); while (pageList.length > 0) { const currentBatch = pageList.splice(0, concurrencyLimit); const tasks = currentBatch.map((page) => scrapeSearchResults(browser, keyword, page, location, retries) ); try { await Promise.all(tasks); } catch (err) { console.log(`Failed to process batch: ${err}`); } } await browser.close();} async function main() { const keywords = ['online bank']; const concurrencyLimit = 5; const pages = 1; const location = 'us'; const retries = 3; 
for (const keyword of keywords) { await startScrape(keyword, pages, location, concurrencyLimit, retries); }} main();
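To see what getScrapeOpsUrl() actually produces, and how the country routing described earlier comes into play, you can log the proxied URL for a couple of locations. A minimal sketch, assuming the same config.json with your api_key sits alongside the script:

const fs = require('fs');

const API_KEY = JSON.parse(fs.readFileSync('config.json')).api_key;

function getScrapeOpsUrl(url, location = 'us') {
  const params = new URLSearchParams({
    api_key: API_KEY,
    url: url,
    country: location,
  });
  return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}

const target = 'https://www.trustpilot.com/search?query=online+bank&page=1';

// Same target URL, routed through different countries:
console.log(getScrapeOpsUrl(target, 'us'));
console.log(getScrapeOpsUrl(target, 'uk'));
// e.g. https://proxy.scrapeops.io/v1/?api_key=YOUR_KEY&url=https%3A%2F%2Fwww.trustpilot.com%2F...&country=uk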
As before, production runs are configured in main. I'm changing a few constants here.

async function main() {
  const keywords = ['online bank'];
  const concurrencyLimit = 5;
  const pages = 10;
  const location = 'us';
  const retries = 3;

  for (const keyword of keywords) {
    await startScrape(keyword, pages, location, concurrencyLimit, retries);
  }
}
pages has been set to 10 and location has been set to "us". Now let's see how long it takes to process 10 pages of data. Here are the results:

We processed 10 pages in roughly 9.3 seconds. All in all, it costs us less than a second per page!

Next, we scrape the individual business pages with a processBusiness() function.

async function processBusiness(browser, row, location, retries = 3) {
  const url = row.trustpilot_url;
  let tries = 0;
  let success = false;

  while (tries <= retries && !success) {
    const page = await browser.newPage();
    try {
      await page.goto(url);
      const script = await page.$("script[id='__NEXT_DATA__']");
      const innerHTML = await page.evaluate(
        (element) => element.innerHTML,
        script
      );
      const jsonData = JSON.parse(innerHTML);
      const businessInfo = jsonData.props.pageProps;
      const reviews = businessInfo.reviews;

      for (const review of reviews) {
        const reviewData = {
          name: review.consumer.displayName,
          rating: review.rating,
          text: review.text,
          title: review.title,
          date: review.dates.publishedDate,
        };
        console.log(reviewData);
      }
      success = true;
    } catch (err) {
      console.log(`Error: ${err}, tries left: ${retries - tries}`);
      tries++;
    } finally {
      await page.close();
    }
  }
}
row from our CSV file and then fetches the trustpilot_url of the business.script tag with the id of "__NEXT_DATA__" to find our JSON blob.processBusiness() function, we need to be able to read the rows from our CSV file. Now we're going to fully update our code.In the example below, we also add a processResults() function. processResults() reads the CSV report from our crawler and then processes each business from the report.const puppeteer = require('puppeteer');const createCsvWriter = require('csv-writer').createObjectCsvWriter;const csvParse = require('csv-parse');const fs = require('fs'); const API_KEY = JSON.parse(fs.readFileSync('config.json')).api_key; async function writeToCsv(data, outputFile) { if (!data || data.length === 0) { throw new Error('No data to write!'); } const fileExists = fs.existsSync(outputFile); const headers = Object.keys(data[0]).map((key) => ({ id: key, title: key })); const csvWriter = createCsvWriter({ path: outputFile, header: headers, append: fileExists, }); try { await csvWriter.writeRecords(data); } catch (e) { throw new Error('Failed to write to csv'); }} async function readCsv(inputFile) { const results = []; const parser = fs.createReadStream(inputFile).pipe( csvParse.parse({ columns: true, delimiter: ',', trim: true, skip_empty_lines: true, }) ); for await (const record of parser) { results.push(record); } return results;} function range(start, end) { const array = []; for (let i = start; i < end; i++) { array.push(i); } return array;} function getScrapeOpsUrl(url, location = 'us') { const params = new URLSearchParams({ api_key: API_KEY, url: url, country: location, }); return `https://proxy.scrapeops.io/v1/?${params.toString()}`;} async function scrapeSearchResults( browser, keyword, pageNumber, location = 'us', retries = 3) { let tries = 0; let success = false; while (tries <= retries && !success) { const formattedKeyword = keyword.replace(' ', '+'); const page = await browser.newPage(); try { const url = `https://www.trustpilot.com/search?query=${formattedKeyword}&page=${pageNumber + 1}`; const proxyUrl = getScrapeOpsUrl(url, location); await page.goto(proxyUrl); console.log(`Successfully fetched: ${url}`); const script = await page.$("script[id='__NEXT_DATA__']"); const innerHTML = await page.evaluate( (element) => element.innerHTML, script ); const jsonData = JSON.parse(innerHTML); const businessUnits = jsonData.props.pageProps.businessUnits; for (const business of businessUnits) { let category = 'n/a'; if ('categories' in business && business.categories.length > 0) { category = business.categories[0].categoryId; } let location = 'n/a'; if ('location' in business && 'country' in business.location) { location = business.location.country; } const trustpilotFormatted = business.contact.website.split('://')[1]; const businessInfo = { name: business.displayName .toLowerCase() .replace(' ', '') .replace("'", ''), stars: business.stars, rating: business.trustScore, num_reviews: business.numberOfReviews, website: business.contact.website, trustpilot_url: `https://www.trustpilot.com/review/${trustpilotFormatted}`, location: location, category: category, }; await writeToCsv([businessInfo], `${keyword.replace(' ', '-')}.csv`); } success = true; } catch (err) { console.log(`Error: ${err}, tries left ${retries - tries}`); tries++; } finally { await page.close(); } }} async function startScrape( keyword, pages, location, concurrencyLimit, retries) { const pageList = range(0, pages); const browser = await puppeteer.launch(); while (pageList.length > 0) { const 
currentBatch = pageList.splice(0, concurrencyLimit); const tasks = currentBatch.map((page) => scrapeSearchResults(browser, keyword, page, location, retries) ); try { await Promise.all(tasks); } catch (err) { console.log(`Failed to process batch: ${err}`); } } await browser.close();} async function processBusiness(browser, row, location, retries = 3) { const url = row.trustpilot_url; let tries = 0; let success = false; while (tries <= retries && !success) { const page = await browser.newPage(); try { await page.goto(url, location); const script = await page.$("script[id='__NEXT_DATA__']"); const innerHTML = await page.evaluate( (element) => element.innerHTML, script ); const jsonData = JSON.parse(innerHTML); const businessInfo = jsonData.props.pageProps; const reviews = businessInfo.reviews; for (const review of reviews) { const reviewData = { name: review.consumer.displayName, rating: review.rating, text: review.text, title: review.title, date: review.dates.publishedDate, }; console.log(reviewData); } success = true; } catch (err) { console.log(`Error: ${err}, tries left: ${retries - tries}`); tries++; } finally { await page.close(); } }} async function processResults(csvFile, location, retries) { const businesses = await readCsv(csvFile); const browser = await puppeteer.launch(); for (const business of businesses) { await processBusiness(browser, business, location, retries); } await browser.close();} async function main() { const keywords = ['online bank']; const concurrencyLimit = 5; const pages = 1; const location = 'us'; const retries = 3; const aggregateFiles = []; for (const keyword of keywords) { await startScrape(keyword, pages, location, concurrencyLimit, retries); aggregateFiles.push(`${keyword.replace(' ', '-')}.csv`); } for (const file of aggregateFiles) { await processResults(file, location, concurrencyLimit, retries); }} main();
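readCsv() parses each row of the crawler's report into a plain object keyed by the CSV headers, and processBusiness() navigates to its trustpilot_url field. For reference, a row passed into processBusiness() looks roughly like this (the values are illustrative):

// Illustrative only: the shape of one `row` object handed to processBusiness().
// Keys mirror the businessInfo object written by scrapeSearchResults();
// all values come back from the CSV as strings.
const exampleRow = {
  name: 'good-bank',
  stars: '5',
  rating: '4.8',
  num_reviews: '1200',
  website: 'https://good-bank.de',
  trustpilot_url: 'https://www.trustpilot.com/review/good-bank.de',
  location: 'DE',
  category: 'bank',
};

console.log(exampleRow.trustpilot_url); // what processBusiness() navigates to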
process_results() function reads the rows from our CSV file and passes each of them into process_business().process_business() then pulls our information and prints it to the terminal.businessInfo object we used earlier, we now use a reviewData object and then we pass it into the writeToCsv() function again.const puppeteer = require('puppeteer');const createCsvWriter = require('csv-writer').createObjectCsvWriter;const csvParse = require('csv-parse');const fs = require('fs'); const API_KEY = JSON.parse(fs.readFileSync('config.json')).api_key; async function writeToCsv(data, outputFile) { if (!data || data.length === 0) { throw new Error('No data to write!'); } const fileExists = fs.existsSync(outputFile); const headers = Object.keys(data[0]).map((key) => ({ id: key, title: key })); const csvWriter = createCsvWriter({ path: outputFile, header: headers, append: fileExists, }); try { await csvWriter.writeRecords(data); } catch (e) { throw new Error('Failed to write to csv'); }} async function readCsv(inputFile) { const results = []; const parser = fs.createReadStream(inputFile).pipe( csvParse.parse({ columns: true, delimiter: ',', trim: true, skip_empty_lines: true, }) ); for await (const record of parser) { results.push(record); } return results;} function range(start, end) { const array = []; for (let i = start; i < end; i++) { array.push(i); } return array;} function getScrapeOpsUrl(url, location = 'us') { const params = new URLSearchParams({ api_key: API_KEY, url: url, country: location, }); return `https://proxy.scrapeops.io/v1/?${params.toString()}`;} async function scrapeSearchResults( browser, keyword, pageNumber, location = 'us', retries = 3) { let tries = 0; let success = false; while (tries <= retries && !success) { const formattedKeyword = keyword.replace(' ', '+'); const page = await browser.newPage(); try { const url = `https://www.trustpilot.com/search?query=${formattedKeyword}&page=${pageNumber + 1}`; const proxyUrl = getScrapeOpsUrl(url, location); await page.goto(proxyUrl); console.log(`Successfully fetched: ${url}`); const script = await page.$("script[id='__NEXT_DATA__']"); const innerHTML = await page.evaluate( (element) => element.innerHTML, script ); const jsonData = JSON.parse(innerHTML); const businessUnits = jsonData.props.pageProps.businessUnits; for (const business of businessUnits) { let category = 'n/a'; if ('categories' in business && business.categories.length > 0) { category = business.categories[0].categoryId; } let location = 'n/a'; if ('location' in business && 'country' in business.location) { location = business.location.country; } const trustpilotFormatted = business.contact.website.split('://')[1]; const businessInfo = { name: business.displayName .toLowerCase() .replace(' ', '') .replace("'", ''), stars: business.stars, rating: business.trustScore, num_reviews: business.numberOfReviews, website: business.contact.website, trustpilot_url: `https://www.trustpilot.com/review/${trustpilotFormatted}`, location: location, category: category, }; await writeToCsv([businessInfo], `${keyword.replace(' ', '-')}.csv`); } success = true; } catch (err) { console.log(`Error: ${err}, tries left ${retries - tries}`); tries++; } finally { await page.close(); } }} async function startScrape( keyword, pages, location, concurrencyLimit, retries) { const pageList = range(0, pages); const browser = await puppeteer.launch(); while (pageList.length > 0) { const currentBatch = pageList.splice(0, concurrencyLimit); const tasks = currentBatch.map((page) => 
scrapeSearchResults(browser, keyword, page, location, retries) ); try { await Promise.all(tasks); } catch (err) { console.log(`Failed to process batch: ${err}`); } } await browser.close();} async function processBusiness(browser, row, location, retries = 3) { const url = row.trustpilot_url; let tries = 0; let success = false; while (tries <= retries && !success) { const page = await browser.newPage(); try { await page.goto(url, location); const script = await page.$("script[id='__NEXT_DATA__']"); const innerHTML = await page.evaluate( (element) => element.innerHTML, script ); const jsonData = JSON.parse(innerHTML); const businessInfo = jsonData.props.pageProps; const reviews = businessInfo.reviews; for (const review of reviews) { const reviewData = { name: review.consumer.displayName, rating: review.rating, text: review.text, title: review.title, date: review.dates.publishedDate, }; await writeToCsv([reviewData], `${row.name}.csv`); } success = true; } catch (err) { console.log(`Error: ${err}, tries left: ${retries - tries}`); tries++; } finally { await page.close(); } }} async function processResults(csvFile, location, retries) { const businesses = await readCsv(csvFile); const browser = await puppeteer.launch(); for (const business of businesses) { await processBusiness(browser, business, location, retries); } await browser.close();} async function main() { const keywords = ['online bank']; const concurrencyLimit = 5; const pages = 1; const location = 'us'; const retries = 3; const aggregateFiles = []; for (const keyword of keywords) { await startScrape(keyword, pages, location, concurrencyLimit, retries); aggregateFiles.push(`${keyword.replace(' ', '-')}.csv`); } for (const file of aggregateFiles) { await processResults(file, location, concurrencyLimit, retries); }} main();
Each review gets passed to writeToCsv() as soon as it has been processed. This allows us to store our data efficiently, but also write the absolute most possible data in the event of a crash.

Here is the processResults() function refactored for concurrency.

async function processResults(csvFile, location, concurrencyLimit, retries) {
  const businesses = await readCsv(csvFile);
  const browser = await puppeteer.launch();

  while (businesses.length > 0) {
    const currentBatch = businesses.splice(0, concurrencyLimit);
    const tasks = currentBatch.map((business) =>
      processBusiness(browser, business, location, retries)
    );

    try {
      await Promise.all(tasks);
    } catch (err) {
      console.log(`Failed to process batch: ${err}`);
    }
  }

  await browser.close();
}
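This crash resilience comes from writeToCsv() being constructed with append: fileExists, so each call adds rows to the existing file rather than overwriting it. A small, self-contained sketch of that behavior (the filename and records are illustrative):

const fs = require('fs');
const createCsvWriter = require('csv-writer').createObjectCsvWriter;

// Append one record at a time; earlier rows survive even if a later step crashes.
async function appendRecord(record, outputFile) {
  const fileExists = fs.existsSync(outputFile);
  const csvWriter = createCsvWriter({
    path: outputFile,
    header: Object.keys(record).map((key) => ({ id: key, title: key })),
    append: fileExists, // only write the header on the first call
  });
  await csvWriter.writeRecords([record]);
}

(async () => {
  await appendRecord({ name: 'review one', rating: 5 }, 'demo-reviews.csv');
  await appendRecord({ name: 'review two', rating: 4 }, 'demo-reviews.csv');
  // demo-reviews.csv now holds a header row plus both records.
})();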
await page.goto(getScrapeOpsUrl(url, location));const puppeteer = require('puppeteer');const createCsvWriter = require('csv-writer').createObjectCsvWriter;const csvParse = require('csv-parse');const fs = require('fs'); const API_KEY = JSON.parse(fs.readFileSync('config.json')).api_key; async function writeToCsv(data, outputFile) { if (!data || data.length === 0) { throw new Error('No data to write!'); } const fileExists = fs.existsSync(outputFile); const headers = Object.keys(data[0]).map((key) => ({ id: key, title: key })); const csvWriter = createCsvWriter({ path: outputFile, header: headers, append: fileExists, }); try { await csvWriter.writeRecords(data); } catch (e) { throw new Error('Failed to write to csv'); }} async function readCsv(inputFile) { const results = []; const parser = fs.createReadStream(inputFile).pipe( csvParse.parse({ columns: true, delimiter: ',', trim: true, skip_empty_lines: true, }) ); for await (const record of parser) { results.push(record); } return results;} function range(start, end) { const array = []; for (let i = start; i < end; i++) { array.push(i); } return array;} function getScrapeOpsUrl(url, location = 'us') { const params = new URLSearchParams({ api_key: API_KEY, url: url, country: location, }); return `https://proxy.scrapeops.io/v1/?${params.toString()}`;} async function scrapeSearchResults( browser, keyword, pageNumber, location = 'us', retries = 3) { let tries = 0; let success = false; while (tries <= retries && !success) { const formattedKeyword = keyword.replace(' ', '+'); const page = await browser.newPage(); try { const url = `https://www.trustpilot.com/search?query=${formattedKeyword}&page=${pageNumber + 1}`; const proxyUrl = getScrapeOpsUrl(url, location); await page.goto(proxyUrl); console.log(`Successfully fetched: ${url}`); const script = await page.$("script[id='__NEXT_DATA__']"); const innerHTML = await page.evaluate( (element) => element.innerHTML, script ); const jsonData = JSON.parse(innerHTML); const businessUnits = jsonData.props.pageProps.businessUnits; for (const business of businessUnits) { let category = 'n/a'; if ('categories' in business && business.categories.length > 0) { category = business.categories[0].categoryId; } let location = 'n/a'; if ('location' in business && 'country' in business.location) { location = business.location.country; } const trustpilotFormatted = business.contact.website.split('://')[1]; const businessInfo = { name: business.displayName .toLowerCase() .replace(' ', '') .replace("'", ''), stars: business.stars, rating: business.trustScore, num_reviews: business.numberOfReviews, website: business.contact.website, trustpilot_url: `https://www.trustpilot.com/review/${trustpilotFormatted}`, location: location, category: category, }; await writeToCsv([businessInfo], `${keyword.replace(' ', '-')}.csv`); } success = true; } catch (err) { console.log(`Error: ${err}, tries left ${retries - tries}`); tries++; } finally { await page.close(); } }} async function startScrape( keyword, pages, location, concurrencyLimit, retries) { const pageList = range(0, pages); const browser = await puppeteer.launch(); while (pageList.length > 0) { const currentBatch = pageList.splice(0, concurrencyLimit); const tasks = currentBatch.map((page) => scrapeSearchResults(browser, keyword, page, location, retries) ); try { await Promise.all(tasks); } catch (err) { console.log(`Failed to process batch: ${err}`); } } await browser.close();} async function processBusiness(browser, row, location, retries = 3) { const url = 
row.trustpilot_url; let tries = 0; let success = false; while (tries <= retries && !success) { const page = await browser.newPage(); try { await page.goto(getScrapeOpsUrl(url, location)); const script = await page.$("script[id='__NEXT_DATA__']"); const innerHTML = await page.evaluate( (element) => element.innerHTML, script ); const jsonData = JSON.parse(innerHTML); const businessInfo = jsonData.props.pageProps; const reviews = businessInfo.reviews; for (const review of reviews) { const reviewData = { name: review.consumer.displayName, rating: review.rating, text: review.text, title: review.title, date: review.dates.publishedDate, }; await writeToCsv([reviewData], `${row.name}.csv`); } success = true; } catch (err) { console.log(`Error: ${err}, tries left: ${retries - tries}`); tries++; } finally { await page.close(); } }} async function processResults(csvFile, location, concurrencyLimit, retries) { const businesses = await readCsv(csvFile); const browser = await puppeteer.launch(); while (businesses.length > 0) { const currentBatch = businesses.splice(0, concurrencyLimit); const tasks = currentBatch.map((business) => processBusiness(browser, business, location, retries) ); try { await Promise.all(tasks); } catch (err) { console.log(`Failed to process batch: ${err}`); } } await browser.close();} async function main() { const keywords = ['online bank']; const concurrencyLimit = 5; const pages = 1; const location = 'us'; const retries = 3; const aggregateFiles = []; for (const keyword of keywords) { await startScrape(keyword, pages, location, concurrencyLimit, retries); aggregateFiles.push(`${keyword.replace(' ', '-')}.csv`); } for (const file of aggregateFiles) { await processResults(file, location, concurrencyLimit, retries); }} main();
Once again, the production run is configured in main.

async function main() {
  const keywords = ['online bank'];
  const concurrencyLimit = 5;
  const pages = 10;
  const location = 'us';
  const retries = 3;
  const aggregateFiles = [];

  for (const keyword of keywords) {
    await startScrape(keyword, pages, location, concurrencyLimit, retries);
    aggregateFiles.push(`${keyword.replace(' ', '-')}.csv`);
  }

  for (const file of aggregateFiles) {
    await processResults(file, location, concurrencyLimit, retries);
  }
}
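If you want to reproduce the timing numbers below, one simple option is to wrap the run with console.time; a minimal sketch:

// Minimal timing sketch: replace the bare `main();` call at the bottom of the
// script with `timedRun();` so the crawl only runs once.
async function timedRun() {
  console.time('full-run');
  await main();                // the main() defined above
  console.timeEnd('full-run'); // logs the total elapsed time
}

timedRun();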
We set pages to 10 and location to "us". Here are the results.

It took just over 121 seconds (including the time it took to create our initial report) to generate a full report and process all the results (86 rows). This comes out to a speed of about 1.41 seconds per business.

When scraping Trustpilot, pay attention to their robots.txt. You can view their robots.txt file here.

Always be careful about the information you extract and don't scrape private or confidential data.