rank
, category
, rank_change
, average_vist
, pages_per_visit
, and bounce_rate
. Each of these metrics can provide critical data and insight into what users are doing when they access the site.
Then check out ScrapeOps, the complete toolkit for web scraping.
config.json
file.{"api_key": "your-super-secret-api-key"}
.import osimport csvimport requestsimport jsonimport loggingfrom urllib.parse import urlencodefrom bs4 import BeautifulSoupimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url): payload = { "api_key": API_KEY, "url": url, "wait": 3000 } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" url: str = "" rank: int = 0 rank_change: int = 0 average_visit: str = "" pages_per_visit: float = 0.0 bounce_rate: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) @dataclassclass CompetitorData: name: str = "" url: str = "" affinity: str = "" monthly_visits: str = "" category: str = "" category_rank: int = None def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, data_pipeline=None, retries=3): url = f"https://www.similarweb.com/top-websites/{keyword['category']}/{keyword['subcategory']}/" tries = 0 success = False while tries <= retries and not success: try: scrapeops_proxy_url = get_scrapeops_url(url) response = requests.get(scrapeops_proxy_url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code != 200: raise Exception(f"Failed request, Status Code {response.status_code}") soup = BeautifulSoup(response.text, "html.parser") rows = soup.find_all("tr", class_="top-table__row") rank = 1 for row in rows: link_holder = row.find("a", class_="tw-table__compare") site_name = link_holder.text link = f"https://www.similarweb.com/website/{site_name}/" rank_change_holder = row.find("td", class_="top-table__column top-table__column--rank-change") rank_change = 0 up_or_down = rank_change_holder.find("span").get("class")[1] if "change--up" in up_or_down: rank_change += int(rank_change_holder.text) elif "change--down" in up_or_down: rank_change -= int(rank_change_holder.text) average_visit = row.find("span", class_="tw-table__avg-visit-duration").text pages_per_visit = float(row.find("span", class_="tw-table__pages-per-visit").text) bounce_rate = row.find("span", class_="tw-table__bounce-rate").text search_data = SearchData( name=site_name, url=link, rank=rank, rank_change=rank_change, average_visit=average_visit, pages_per_visit=pages_per_visit, bounce_rate=bounce_rate ) rank+=1 data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}, retries left {retries-tries}") tries+=1 if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keywords, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, keywords, [data_pipeline] * len(keywords), [retries] * len(keywords) ) def process_website(row, retries=3): url = row["url"] tries = 0 success = False while tries <= retries and not success: response = requests.get(url) try: if response.status_code != 200: raise Exception(f"Failed request, Status Code {response.status_code}") else: logger.info(f"Status: {response.status_code}") soup = BeautifulSoup(response.text, "html.parser") blocked = soup.find("div", class_="wa-limit-modal") if blocked: raise Exception(f"Blocked") competitors = soup.find_all("div", class_="wa-competitors__list-item") competitor_pipeline = DataPipeline(csv_filename=f"{row['name'].replace(' ', '-')}.csv") for competitor in competitors: site_name = competitor.find("span", class_="wa-competitors__list-item-title").text link = f"https://www.similarweb.com/website/{site_name}/" affinity = competitor.find("span", class_="app-progress__value").text target_spans = competitor.find_all("span", "wa-competitors__list-column") monthly_visits = target_spans[2].text category = target_spans[3].text category_rank = int(target_spans[4].text.replace("#", "").replace(",", 
"").replace("--", "0")) competitor_data = CompetitorData( name=site_name, url=link, affinity=affinity, monthly_visits=monthly_visits, category=category, category_rank=category_rank ) competitor_pipeline.add_data(competitor_data) competitor_pipeline.close_pipeline() success = True except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {row['url']}, Retries left: {retries-tries}") tries += 1 if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['url']}") def process_results(csv_file, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) for row in reader: process_website(row, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = [{"category": "arts-and-entertainment", "subcategory": "humor"}, {"category": "arts-and-entertainment", "subcategory": "animation-and-comics"}] aggregate_files = [] ## Job Processes filename = "arts-and-entertainment" crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword_list, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") for file in aggregate_files: process_results(file, retries=MAX_RETRIES)
MAX_THREADS
: Defines the number of concurrent threads used during the scraping and processing tasks.MAX_RETRIES
: Determines the maximum number of retries the script will attempt if a request fails (e.g., due to a network issue or a non-200 status code).keyword_list
: A list of dictionaries where each dictionary contains a "category" and "subcategory" that specify the type of websites to scrape from SimilarWeb.filename
: The base name used to create the CSV file where the scraped data will be saved.https://www.similarweb.com/top-websites/arts-and-entertainment/humor/
https://www.similarweb.com/top-websites/{CATEGORY}/{SUBCATEGORY}/
category
and a subcategory
. In this case, our category
is "arts-and-entertainment"
while our subcategory is "humor"
.You can view a shot of the page below.https://www.similarweb.com/website/pikabu.ru/
https://www.similarweb.com/website/{NAME_OF_SITE}/
wait
parameter when talking to ScrapeOps. After we have our loaded page, we just need to find the information using its CSS class.For the results pages, each row has a class of top-table__row
. We can find all these rows and easily extract their data from there.div
elements with the class
of wa-competitors__list-item
. Each of these div
tags holds all the data for each competitor.div
with a class
of wa-limit-modal
.country
parameter. However, with SimilarWeb we don't want to control our geolocation.Instead of controlling our location, we want as many IP addresses as possible to reduce our likelihood of getting blocked and asked to sign in/ sign up like you saw in the previous section.By not controlling our location, this gives us a much larger pool of IP addresses to use.mkdir similarweb-scraper cd similarweb-scraper
python -m venv venv
source venv/bin/activate
pip install requests
pip install beautifulsoup4
scrape_search_results()
.import osimport csvimport requestsimport jsonimport loggingfrom urllib.parse import urlencodefrom bs4 import BeautifulSoupimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) def scrape_search_results(keyword, retries=3): url = f"https://www.similarweb.com/top-websites/{keyword['category']}/{keyword['subcategory']}/" tries = 0 success = False while tries <= retries and not success: try: response = requests.get(url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code != 200: raise Exception(f"Failed request, Status Code {response.status_code}") soup = BeautifulSoup(response.text, "html.parser") rows = soup.find_all("tr", class_="top-table__row") rank = 1 for row in rows: link_holder = row.find("a", class_="tw-table__compare") site_name = link_holder.text link = f"https://www.similarweb.com/website/{site_name}/" rank_change_holder = row.find("td", class_="top-table__column top-table__column--rank-change") rank_change = 0 up_or_down = rank_change_holder.find("span").get("class")[1] if "change--up" in up_or_down: rank_change += int(rank_change_holder.text) elif "change--down" in up_or_down: rank_change -= int(rank_change_holder.text) average_visit = row.find("span", class_="tw-table__avg-visit-duration").text pages_per_visit = float(row.find("span", class_="tw-table__pages-per-visit").text) bounce_rate = row.find("span", class_="tw-table__bounce-rate").text search_data = { "name": site_name, "url": link, "rank": rank, "rank_change": rank_change, "average_visit": average_visit, "pages_per_visit": pages_per_visit, "bounce_rate": bounce_rate } rank+=1 print(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}, retries left {retries-tries}") tries+=1 if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keywords, retries=3): for keyword in keywords: scrape_search_results(keyword, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = [{"category": "arts-and-entertainment", "subcategory": "humor"}, {"category": "arts-and-entertainment", "subcategory": "animation-and-comics"}] aggregate_files = [] ## Job Processes filename = "arts-and-entertainment" start_scrape(keyword_list, retries=MAX_RETRIES) logger.info(f"Crawl complete.")
rows = soup.find_all("tr", class_="top-table__row")
.link_holder
with row.find("a", class_="tw-table__compare")
.link_holder
object, we extract our site_name
and construct our link
.rank_change_holder.find("span").get("class")[1]
is used to find whether the rank went up or down.row.find("span", class_="tw-table__avg-visit-duration").text
.float(row.find("span", class_="tw-table__pages-per-visit").text)
finds our pages_per_visit
.bounce_rate
with row.find("span", class_="tw-table__bounce-rate").text
.dataclass
, SearchData
.SearchData
will be used to represent individual objects from our search results. Once we have a SearchData
object, we need to pass it into a DataPipeline
.Our DataPipeline
is used to open a pipe to a CSV file. The pipeline filters out duplicates by name
and then saves all non-duplicate objects to a CSV file.Here is our SearchData
class. We use this to represent individual ranking results.@dataclassclass SearchData: name: str = "" url: str = "" rank: int = 0 rank_change: int = 0 average_visit: str = "" pages_per_visit: float = 0.0 bounce_rate: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip())
DataPipeline
.class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv()
DataPipeline
and pass it into start_scrape()
. start_scrape()
then passes the pipeline into our parsing function.Instead of printing our parsed data, we now pass that into the pipeline. Once we're finished parsing the results, we go ahead and close the DataPipeline
.import osimport csvimport requestsimport jsonimport loggingfrom urllib.parse import urlencodefrom bs4 import BeautifulSoupimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" url: str = "" rank: int = 0 rank_change: int = 0 average_visit: str = "" pages_per_visit: float = 0.0 bounce_rate: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, data_pipeline=None, retries=3): url = f"https://www.similarweb.com/top-websites/{keyword['category']}/{keyword['subcategory']}/" tries = 0 success = False while tries <= retries and not success: try: response = requests.get(url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code != 200: raise Exception(f"Failed request, Status Code {response.status_code}") soup = BeautifulSoup(response.text, "html.parser") rows = soup.find_all("tr", class_="top-table__row") rank = 1 for row in rows: link_holder = row.find("a", class_="tw-table__compare") site_name = link_holder.text link = f"https://www.similarweb.com/website/{site_name}/" rank_change_holder = row.find("td", class_="top-table__column top-table__column--rank-change") rank_change = 0 up_or_down = rank_change_holder.find("span").get("class")[1] if "change--up" in up_or_down: rank_change += int(rank_change_holder.text) elif "change--down" in up_or_down: rank_change -= int(rank_change_holder.text) average_visit = row.find("span", class_="tw-table__avg-visit-duration").text pages_per_visit = float(row.find("span", class_="tw-table__pages-per-visit").text) bounce_rate = row.find("span", class_="tw-table__bounce-rate").text search_data = SearchData( name=site_name, url=link, rank=rank, rank_change=rank_change, average_visit=average_visit, pages_per_visit=pages_per_visit, bounce_rate=bounce_rate ) rank+=1 data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}, retries left {retries-tries}") tries+=1 if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keywords, data_pipeline=None, retries=3): for keyword in keywords: scrape_search_results(keyword, data_pipeline=data_pipeline, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = [{"category": "arts-and-entertainment", "subcategory": "humor"}, {"category": "arts-and-entertainment", "subcategory": "animation-and-comics"}] aggregate_files = [] ## Job Processes filename = "arts-and-entertainment" crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword_list, data_pipeline=crawl_pipeline, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.")
SearchData
.SearchData
objects then get passed into our DataPipeline
and saved to a CSV file.ThreadPoolExecutor
to add support for multithreading. Once we can open multiple threads, we can use those threads to run our parsing function on multiple pages concurrently.Here is our start_scrape()
function adjusted for concurrency.def start_scrape(keywords, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, keywords, [data_pipeline] * len(keywords), [retries] * len(keywords) )
scrape_search_results
is the function we'd like to call using multiple threads.keywords
is the array of things we'd like to search.scrape_search_results
get passed in as arrays.import osimport csvimport requestsimport jsonimport loggingfrom urllib.parse import urlencodefrom bs4 import BeautifulSoupimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" url: str = "" rank: int = 0 rank_change: int = 0 average_visit: str = "" pages_per_visit: float = 0.0 bounce_rate: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, data_pipeline=None, retries=3): url = f"https://www.similarweb.com/top-websites/{keyword['category']}/{keyword['subcategory']}/" tries = 0 success = False while tries <= retries and not success: try: response = requests.get(url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code != 200: raise Exception(f"Failed request, Status Code {response.status_code}") soup = BeautifulSoup(response.text, "html.parser") rows = soup.find_all("tr", class_="top-table__row") rank = 1 for row in rows: link_holder = row.find("a", class_="tw-table__compare") site_name = link_holder.text link = f"https://www.similarweb.com/website/{site_name}/" rank_change_holder = row.find("td", class_="top-table__column top-table__column--rank-change") rank_change = 0 up_or_down = rank_change_holder.find("span").get("class")[1] if "change--up" in up_or_down: rank_change += int(rank_change_holder.text) elif "change--down" in up_or_down: rank_change -= int(rank_change_holder.text) average_visit = row.find("span", class_="tw-table__avg-visit-duration").text pages_per_visit = float(row.find("span", class_="tw-table__pages-per-visit").text) bounce_rate = row.find("span", class_="tw-table__bounce-rate").text search_data = SearchData( name=site_name, url=link, rank=rank, rank_change=rank_change, average_visit=average_visit, pages_per_visit=pages_per_visit, bounce_rate=bounce_rate ) rank+=1 data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}, retries left {retries-tries}") tries+=1 if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keywords, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, keywords, [data_pipeline] * len(keywords), [retries] * len(keywords) ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = [{"category": "arts-and-entertainment", "subcategory": "humor"}, {"category": "arts-and-entertainment", "subcategory": "animation-and-comics"}] aggregate_files = [] ## Job Processes filename = "arts-and-entertainment" crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword_list, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.")
API_KEY
, url
and wait
. This tells ScrapeOps that we want to wait
3 seconds for content to render and we don't care which country we're routed through.This gives us the largest pool of potential IP addresses because we can be routed through any server that ScrapeOps supports.def get_scrapeops_url(url): payload = { "api_key": API_KEY, "url": url, "wait": 3000 } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url
import osimport csvimport requestsimport jsonimport loggingfrom urllib.parse import urlencodefrom bs4 import BeautifulSoupimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url): payload = { "api_key": API_KEY, "url": url, "wait": 3000 } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" url: str = "" rank: int = 0 rank_change: int = 0 average_visit: str = "" pages_per_visit: float = 0.0 bounce_rate: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, data_pipeline=None, retries=3): url = f"https://www.similarweb.com/top-websites/{keyword['category']}/{keyword['subcategory']}/" tries = 0 success = False while tries <= retries and not success: try: scrapeops_proxy_url = get_scrapeops_url(url) response = requests.get(scrapeops_proxy_url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code != 200: raise Exception(f"Failed request, Status Code {response.status_code}") soup = BeautifulSoup(response.text, "html.parser") rows = soup.find_all("tr", class_="top-table__row") rank = 1 for row in rows: link_holder = row.find("a", class_="tw-table__compare") site_name = link_holder.text link = f"https://www.similarweb.com/website/{site_name}/" rank_change_holder = row.find("td", class_="top-table__column top-table__column--rank-change") rank_change = 0 up_or_down = rank_change_holder.find("span").get("class")[1] if "change--up" in up_or_down: rank_change += int(rank_change_holder.text) elif "change--down" in up_or_down: rank_change -= int(rank_change_holder.text) average_visit = row.find("span", class_="tw-table__avg-visit-duration").text pages_per_visit = float(row.find("span", class_="tw-table__pages-per-visit").text) bounce_rate = row.find("span", class_="tw-table__bounce-rate").text search_data = SearchData( name=site_name, url=link, rank=rank, rank_change=rank_change, average_visit=average_visit, pages_per_visit=pages_per_visit, bounce_rate=bounce_rate ) rank+=1 data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}, retries left {retries-tries}") tries+=1 if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keywords, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, keywords, [data_pipeline] * len(keywords), [retries] * len(keywords) ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = [{"category": "arts-and-entertainment", "subcategory": "humor"}, {"category": "arts-and-entertainment", "subcategory": "animation-and-comics"}] aggregate_files = [] ## Job Processes filename = "arts-and-entertainment" crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword_list, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.")
MAX_THREADS
set to 5. We're only searching 2 categories, so ThreadPoolExecutor
will run this on 2 threads and finish it out.In the next half of our article, when we write the scraper, we'll take advantage of all 5 threads.Here is our main
.if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = [{"category": "arts-and-entertainment", "subcategory": "humor"}, {"category": "arts-and-entertainment", "subcategory": "animation-and-comics"}] aggregate_files = [] ## Job Processes filename = "arts-and-entertainment" crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword_list, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.")
def process_website(row, retries=3): url = row["url"] tries = 0 success = False while tries <= retries and not success: response = requests.get(url) try: if response.status_code != 200: raise Exception(f"Failed request, Status Code {response.status_code}") else: logger.info(f"Status: {response.status_code}") soup = BeautifulSoup(response.text, "html.parser") blocked = soup.find("div", class_="wa-limit-modal") if blocked: raise Exception(f"Blocked") competitors = soup.find_all("div", class_="wa-competitors__list-item") for competitor in competitors: site_name = competitor.find("span", class_="wa-competitors__list-item-title").text link = f"https://www.similarweb.com/website/{site_name}/" affinity = competitor.find("span", class_="app-progress__value").text target_spans = competitor.find_all("span", "wa-competitors__list-column") monthly_visits = target_spans[2].text category = target_spans[3].text category_rank = int(target_spans[4].text.replace("#", "").replace(",", "").replace("--", "0")) competitor_data = { "name": site_name, "url": link, "affinity": affinity, "monthly_visits": monthly_visits, "category": category, "category_rank": category_rank } print(competitor_data) success = True except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {row['url']}, Retries left: {retries-tries}") tries += 1 if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['url']}")
soup.find_all("div", class_="wa-competitors__list-item")
.site_name
affinity
monthly_visits
category
category_link
site_name
.process_website()
on each row from the file.Here is our process_results()
function.def process_results(csv_file, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) for row in reader: process_website(row, retries=retries)
import osimport csvimport requestsimport jsonimport loggingfrom urllib.parse import urlencodefrom bs4 import BeautifulSoupimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url): payload = { "api_key": API_KEY, "url": url, "wait": 3000 } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" url: str = "" rank: int = 0 rank_change: int = 0 average_visit: str = "" pages_per_visit: float = 0.0 bounce_rate: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, data_pipeline=None, retries=3): url = f"https://www.similarweb.com/top-websites/{keyword['category']}/{keyword['subcategory']}/" tries = 0 success = False while tries <= retries and not success: try: scrapeops_proxy_url = get_scrapeops_url(url) response = requests.get(scrapeops_proxy_url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code != 200: raise Exception(f"Failed request, Status Code {response.status_code}") soup = BeautifulSoup(response.text, "html.parser") rows = soup.find_all("tr", class_="top-table__row") rank = 1 for row in rows: link_holder = row.find("a", class_="tw-table__compare") site_name = link_holder.text link = f"https://www.similarweb.com/website/{site_name}/" rank_change_holder = row.find("td", class_="top-table__column top-table__column--rank-change") rank_change = 0 up_or_down = rank_change_holder.find("span").get("class")[1] if "change--up" in up_or_down: rank_change += int(rank_change_holder.text) elif "change--down" in up_or_down: rank_change -= int(rank_change_holder.text) average_visit = row.find("span", class_="tw-table__avg-visit-duration").text pages_per_visit = float(row.find("span", class_="tw-table__pages-per-visit").text) bounce_rate = row.find("span", class_="tw-table__bounce-rate").text search_data = SearchData( name=site_name, url=link, rank=rank, rank_change=rank_change, average_visit=average_visit, pages_per_visit=pages_per_visit, bounce_rate=bounce_rate ) rank+=1 data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}, retries left {retries-tries}") tries+=1 if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keywords, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, keywords, [data_pipeline] * len(keywords), [retries] * len(keywords) ) def process_website(row, retries=3): url = row["url"] tries = 0 success = False while tries <= retries and not success: response = requests.get(url) try: if response.status_code != 200: raise Exception(f"Failed request, Status Code {response.status_code}") else: logger.info(f"Status: {response.status_code}") soup = BeautifulSoup(response.text, "html.parser") blocked = soup.find("div", class_="wa-limit-modal") if blocked: raise Exception(f"Blocked") competitors = soup.find_all("div", class_="wa-competitors__list-item") for competitor in competitors: site_name = competitor.find("span", class_="wa-competitors__list-item-title").text link = f"https://www.similarweb.com/website/{site_name}/" affinity = competitor.find("span", class_="app-progress__value").text target_spans = competitor.find_all("span", "wa-competitors__list-column") monthly_visits = target_spans[2].text category = target_spans[3].text category_rank = int(target_spans[4].text.replace("#", "").replace(",", "").replace("--", "0")) competitor_data = { "name": site_name, "url": link, "affinity": 
affinity, "monthly_visits": monthly_visits, "category": category, "category_rank": category_rank } print(competitor_data) success = True except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {row['url']}, Retries left: {retries-tries}") tries += 1 if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['url']}") def process_results(csv_file, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) for row in reader: process_website(row, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = [{"category": "arts-and-entertainment", "subcategory": "humor"}, {"category": "arts-and-entertainment", "subcategory": "animation-and-comics"}] aggregate_files = [] ## Job Processes filename = "arts-and-entertainment" crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword_list, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") for file in aggregate_files: process_results(file, retries=MAX_RETRIES)
process_results()
reads our CSV into an array.process_website()
on the row.DataPipeline
, we just need a dataclass
to feed into it. We're going to create a new one called CompetitorData
. It's very much like our SearchData
.Here is our CompetitorData
class.@dataclassclass CompetitorData: name: str = "" url: str = "" affinity: str = "" monthly_visits: str = "" category: str = "" category_rank: int = None def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip())
DataPipeline
inside our parsing function and we pass CompetitorData
into it.import osimport csvimport requestsimport jsonimport loggingfrom urllib.parse import urlencodefrom bs4 import BeautifulSoupimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url): payload = { "api_key": API_KEY, "url": url, "wait": 3000 } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" url: str = "" rank: int = 0 rank_change: int = 0 average_visit: str = "" pages_per_visit: float = 0.0 bounce_rate: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) @dataclassclass CompetitorData: name: str = "" url: str = "" affinity: str = "" monthly_visits: str = "" category: str = "" category_rank: int = None def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, data_pipeline=None, retries=3): url = f"https://www.similarweb.com/top-websites/{keyword['category']}/{keyword['subcategory']}/" tries = 0 success = False while tries <= retries and not success: try: scrapeops_proxy_url = get_scrapeops_url(url) response = requests.get(scrapeops_proxy_url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code != 200: raise Exception(f"Failed request, Status Code {response.status_code}") soup = BeautifulSoup(response.text, "html.parser") rows = soup.find_all("tr", class_="top-table__row") rank = 1 for row in rows: link_holder = row.find("a", class_="tw-table__compare") site_name = link_holder.text link = f"https://www.similarweb.com/website/{site_name}/" rank_change_holder = row.find("td", class_="top-table__column top-table__column--rank-change") rank_change = 0 up_or_down = rank_change_holder.find("span").get("class")[1] if "change--up" in up_or_down: rank_change += int(rank_change_holder.text) elif "change--down" in up_or_down: rank_change -= int(rank_change_holder.text) average_visit = row.find("span", class_="tw-table__avg-visit-duration").text pages_per_visit = float(row.find("span", class_="tw-table__pages-per-visit").text) bounce_rate = row.find("span", class_="tw-table__bounce-rate").text search_data = SearchData( name=site_name, url=link, rank=rank, rank_change=rank_change, average_visit=average_visit, pages_per_visit=pages_per_visit, bounce_rate=bounce_rate ) rank+=1 data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}, retries left {retries-tries}") tries+=1 if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keywords, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, keywords, [data_pipeline] * len(keywords), [retries] * len(keywords) ) def process_website(row, retries=3): url = row["url"] tries = 0 success = False while tries <= retries and not success: response = requests.get(url) try: if response.status_code != 200: raise Exception(f"Failed request, Status Code {response.status_code}") else: logger.info(f"Status: {response.status_code}") soup = BeautifulSoup(response.text, "html.parser") blocked = soup.find("div", class_="wa-limit-modal") if blocked: raise Exception(f"Blocked") competitors = soup.find_all("div", class_="wa-competitors__list-item") competitor_pipeline = DataPipeline(csv_filename=f"{row['name'].replace(' ', '-')}.csv") for competitor in competitors: site_name = competitor.find("span", class_="wa-competitors__list-item-title").text link = f"https://www.similarweb.com/website/{site_name}/" affinity = competitor.find("span", class_="app-progress__value").text target_spans = competitor.find_all("span", "wa-competitors__list-column") monthly_visits = target_spans[2].text category = target_spans[3].text category_rank = int(target_spans[4].text.replace("#", "").replace(",", 
"").replace("--", "0")) competitor_data = CompetitorData( name=site_name, url=link, affinity=affinity, monthly_visits=monthly_visits, category=category, category_rank=category_rank ) competitor_pipeline.add_data(competitor_data) competitor_pipeline.close_pipeline() success = True except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {row['url']}, Retries left: {retries-tries}") tries += 1 if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['url']}") def process_results(csv_file, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) for row in reader: process_website(row, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = [{"category": "arts-and-entertainment", "subcategory": "humor"}, {"category": "arts-and-entertainment", "subcategory": "animation-and-comics"}] aggregate_files = [] ## Job Processes filename = "arts-and-entertainment" crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword_list, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") for file in aggregate_files: process_results(file, retries=MAX_RETRIES)
CompetitorData
is used to represent the competitors we extract from the page.DataPipeline
inside of our parsing function and pass these CompetitorData
objects into the pipeline.process_results()
to take advantage of multiple threads using ThreadPoolExecutor
.Here is our multithreaded process_results()
.def process_results(csv_file, max_threads=5, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( process_website, reader, [retries] * len(reader) )
process_website
is the function we want to call on multiple threads.reader
is the array of objects that we want to process with multiple threads.retries
gets passed in as an array the length of reader
as well.process_website
get passed into executor.map()
as arrays. These then get passed into process_website
.Here is our full code up to this point.import osimport csvimport requestsimport jsonimport loggingfrom urllib.parse import urlencodefrom bs4 import BeautifulSoupimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url): payload = { "api_key": API_KEY, "url": url, "wait": 3000 } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" url: str = "" rank: int = 0 rank_change: int = 0 average_visit: str = "" pages_per_visit: float = 0.0 bounce_rate: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) @dataclassclass CompetitorData: name: str = "" url: str = "" affinity: str = "" monthly_visits: str = "" category: str = "" category_rank: int = None def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, data_pipeline=None, retries=3): url = f"https://www.similarweb.com/top-websites/{keyword['category']}/{keyword['subcategory']}/" tries = 0 success = False while tries <= retries and not success: try: scrapeops_proxy_url = get_scrapeops_url(url) response = requests.get(scrapeops_proxy_url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code != 200: raise Exception(f"Failed request, Status Code {response.status_code}") soup = BeautifulSoup(response.text, "html.parser") rows = soup.find_all("tr", class_="top-table__row") rank = 1 for row in rows: link_holder = row.find("a", class_="tw-table__compare") site_name = link_holder.text link = f"https://www.similarweb.com/website/{site_name}/" rank_change_holder = row.find("td", class_="top-table__column top-table__column--rank-change") rank_change = 0 up_or_down = rank_change_holder.find("span").get("class")[1] if "change--up" in up_or_down: rank_change += int(rank_change_holder.text) elif "change--down" in up_or_down: rank_change -= int(rank_change_holder.text) average_visit = row.find("span", class_="tw-table__avg-visit-duration").text pages_per_visit = float(row.find("span", class_="tw-table__pages-per-visit").text) bounce_rate = row.find("span", class_="tw-table__bounce-rate").text search_data = SearchData( name=site_name, url=link, rank=rank, rank_change=rank_change, average_visit=average_visit, pages_per_visit=pages_per_visit, bounce_rate=bounce_rate ) rank+=1 data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}, retries left {retries-tries}") tries+=1 if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keywords, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, keywords, [data_pipeline] * len(keywords), [retries] * len(keywords) ) def process_website(row, retries=3): url = row["url"] tries = 0 success = False while tries <= retries and not success: response = requests.get(url) try: if response.status_code != 200: raise Exception(f"Failed request, Status Code {response.status_code}") else: logger.info(f"Status: {response.status_code}") soup = BeautifulSoup(response.text, "html.parser") blocked = soup.find("div", class_="wa-limit-modal") if blocked: raise Exception(f"Blocked") competitors = soup.find_all("div", class_="wa-competitors__list-item") competitor_pipeline = DataPipeline(csv_filename=f"{row['name'].replace(' ', '-')}.csv") for competitor in competitors: site_name = competitor.find("span", class_="wa-competitors__list-item-title").text link = f"https://www.similarweb.com/website/{site_name}/" affinity = competitor.find("span", class_="app-progress__value").text target_spans = competitor.find_all("span", "wa-competitors__list-column") monthly_visits = target_spans[2].text category = target_spans[3].text category_rank = int(target_spans[4].text.replace("#", "").replace(",", 
"").replace("--", "0")) competitor_data = CompetitorData( name=site_name, url=link, affinity=affinity, monthly_visits=monthly_visits, category=category, category_rank=category_rank ) competitor_pipeline.add_data(competitor_data) competitor_pipeline.close_pipeline() success = True except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {row['url']}, Retries left: {retries-tries}") tries += 1 if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['url']}") def process_results(csv_file, max_threads=5, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( process_website, reader, [retries] * len(reader) ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = [{"category": "arts-and-entertainment", "subcategory": "humor"}, {"category": "arts-and-entertainment", "subcategory": "animation-and-comics"}] aggregate_files = [] ## Job Processes filename = "arts-and-entertainment" crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword_list, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") for file in aggregate_files: process_results(file, max_threads=MAX_THREADS, retries=MAX_RETRIES)
response = requests.get(get_scrapeops_url(url))
import osimport csvimport requestsimport jsonimport loggingfrom urllib.parse import urlencodefrom bs4 import BeautifulSoupimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"]© def get_scrapeops_url(url): payload = { "api_key": API_KEY, "url": url, "wait": 3000 } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" url: str = "" rank: int = 0 rank_change: int = 0 average_visit: str = "" pages_per_visit: float = 0.0 bounce_rate: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) @dataclassclass CompetitorData: name: str = "" url: str = "" affinity: str = "" monthly_visits: str = "" category: str = "" category_rank: int = None def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, data_pipeline=None, retries=3): url = f"https://www.similarweb.com/top-websites/{keyword['category']}/{keyword['subcategory']}/" tries = 0 success = False while tries <= retries and not success: try: scrapeops_proxy_url = get_scrapeops_url(url) response = requests.get(scrapeops_proxy_url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code != 200: raise Exception(f"Failed request, Status Code {response.status_code}") soup = BeautifulSoup(response.text, "html.parser") rows = soup.find_all("tr", class_="top-table__row") rank = 1 for row in rows: link_holder = row.find("a", class_="tw-table__compare") site_name = link_holder.text link = f"https://www.similarweb.com/website/{site_name}/" rank_change_holder = row.find("td", class_="top-table__column top-table__column--rank-change") rank_change = 0 up_or_down = rank_change_holder.find("span").get("class")[1] if "change--up" in up_or_down: rank_change += int(rank_change_holder.text) elif "change--down" in up_or_down: rank_change -= int(rank_change_holder.text) average_visit = row.find("span", class_="tw-table__avg-visit-duration").text pages_per_visit = float(row.find("span", class_="tw-table__pages-per-visit").text) bounce_rate = row.find("span", class_="tw-table__bounce-rate").text search_data = SearchData( name=site_name, url=link, rank=rank, rank_change=rank_change, average_visit=average_visit, pages_per_visit=pages_per_visit, bounce_rate=bounce_rate ) rank+=1 data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}, retries left {retries-tries}") tries+=1 if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keywords, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, keywords, [data_pipeline] * len(keywords), [retries] * len(keywords) ) def process_website(row, retries=3): url = row["url"] tries = 0 success = False while tries <= retries and not success: response = requests.get(get_scrapeops_url(url)) try: if response.status_code != 200: raise Exception(f"Failed request, Status Code {response.status_code}") else: logger.info(f"Status: {response.status_code}") soup = BeautifulSoup(response.text, "html.parser") blocked = soup.find("div", class_="wa-limit-modal") if blocked: raise Exception(f"Blocked") competitors = soup.find_all("div", class_="wa-competitors__list-item") competitor_pipeline = DataPipeline(csv_filename=f"{row['name'].replace(' ', '-')}.csv") for competitor in competitors: site_name = competitor.find("span", class_="wa-competitors__list-item-title").text link = f"https://www.similarweb.com/website/{site_name}/" affinity = competitor.find("span", class_="app-progress__value").text target_spans = competitor.find_all("span", "wa-competitors__list-column") monthly_visits = target_spans[2].text category = target_spans[3].text category_rank = 
int(target_spans[4].text.replace("#", "").replace(",", "").replace("--", "0")) competitor_data = CompetitorData( name=site_name, url=link, affinity=affinity, monthly_visits=monthly_visits, category=category, category_rank=category_rank ) competitor_pipeline.add_data(competitor_data) competitor_pipeline.close_pipeline() success = True except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {row['url']}, Retries left: {retries-tries}") tries += 1 if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['url']}") def process_results(csv_file, max_threads=5, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( process_website, reader, [retries] * len(reader) ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = [{"category": "arts-and-entertainment", "subcategory": "humor"}, {"category": "arts-and-entertainment", "subcategory": "animation-and-comics"}] aggregate_files = [] ## Job Processes filename = "arts-and-entertainment" crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword_list, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") for file in aggregate_files: process_results(file, max_threads=MAX_THREADS, retries=MAX_RETRIES)
main
if you need a refresher. Since there was such a spread in our crawl times, we'll estimate the crawl at 30 seconds.

if __name__ == "__main__":
    MAX_RETRIES = 3
    MAX_THREADS = 5

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = [{"category": "arts-and-entertainment", "subcategory": "humor"}, {"category": "arts-and-entertainment", "subcategory": "animation-and-comics"}]
    aggregate_files = []

    ## Job Processes
    filename = "arts-and-entertainment"
    crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
    start_scrape(keyword_list, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
    crawl_pipeline.close_pipeline()
    aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")

    for file in aggregate_files:
        process_results(file, max_threads=MAX_THREADS, retries=MAX_RETRIES)
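If you'd rather measure the crawl on your own machine than rely on the 30 second estimate, you can wrap the crawl portion of the main block in a simple timer. This is only a sketch that reuses the names from the script above (keyword_list, crawl_pipeline, MAX_THREADS, and so on):

import time

crawl_start = time.time()

# Same crawl calls as in main, just bracketed by a timer
start_scrape(keyword_list, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
crawl_pipeline.close_pipeline()

elapsed = time.time() - crawl_start
logger.info(f"Crawl finished in {elapsed:.2f} seconds")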
robots.txt
file as well. Violating these could lead to suspension of your account or even a permanent ban. You can view them for SimilarWeb by checking the links below. If you're unsure whether your scraper is legal, you should talk to an attorney. Then check out ScrapeOps, the complete toolkit for web scraping.
config.json
file.{"api_key": "your-super-secret-api-key"}
.import os import csv import json import time import logging from urllib.parse import urlencode from selenium import webdriver from selenium.webdriver.chrome.service import Service as ChromeService from selenium.webdriver.chrome.options import Options from selenium.webdriver.common.by import By from webdriver_manager.chrome import ChromeDriverManager import concurrent.futures from dataclasses import dataclass, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url): payload = { "api_key": API_KEY, "url": url, "wait": 3000 } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url # Set up logging logging.basicConfig(level=logging.INFO) logger = logging.getLogger(__name__) # Setup Selenium WebDriver def setup_driver(): options = Options() options.add_argument("--headless") # Run in headless mode for efficiency options.add_argument("--no-sandbox") options.add_argument("--disable-dev-shm-usage") return webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options) @dataclass class SearchData: name: str = "" url: str = "" rank: int = 0 rank_change: int = 0 average_visit: str = "" pages_per_visit: float = 0.0 bounce_rate: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): if isinstance(getattr(self, field.name), str): if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") else: value = getattr(self, field.name) setattr(self, field.name, value.strip()) @dataclass class CompetitorData: name: str = "" url: str = "" affinity: str = "" monthly_visits: str = "" category: str = "" category_rank: int = None def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): if isinstance(getattr(self, field.name), str): if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") else: value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if not self.is_duplicate(scraped_data): self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and not self.csv_file_open: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if self.storage_queue: self.save_to_csv() # Function to scrape search results (fully Selenium-based) def scrape_search_results(keyword, data_pipeline=None, retries=3): url = f"https://www.similarweb.com/top-websites/{keyword['category']}/{keyword['subcategory']}/" tries = 0 success = False while tries <= retries and not success: try: # Initialize WebDriver and load page driver = setup_driver() scrapeops_proxy_url = get_scrapeops_url(url) driver.get(scrapeops_proxy_url) time.sleep(3) # Allow page to load logger.info(f"Opened URL: {url}") # Find all rows of the search results table rows = driver.find_elements(By.CSS_SELECTOR, "tr.top-table__row") rank = 1 for row in rows: site_name = row.find_element(By.CSS_SELECTOR, "a.tw-table__compare").text.strip() link = f"https://www.similarweb.com/website/{site_name}/" # Rank change processing rank_change_holder = row.find_element(By.CSS_SELECTOR, "td.top-table__column--rank-change") rank_change = 0 up_or_down = rank_change_holder.find_element(By.TAG_NAME, "span").get_attribute("class").split()[-1] if "change--up" in up_or_down: rank_change += int(rank_change_holder.text.strip()) elif "change--down" in up_or_down: rank_change -= int(rank_change_holder.text.strip()) average_visit = row.find_element(By.CSS_SELECTOR, "span.tw-table__avg-visit-duration").text.strip() pages_per_visit = float(row.find_element(By.CSS_SELECTOR, "span.tw-table__pages-per-visit").text.strip()) bounce_rate = row.find_element(By.CSS_SELECTOR, "span.tw-table__bounce-rate").text.strip() # Create data object search_data = SearchData( name=site_name, url=link, rank=rank, rank_change=rank_change, average_visit=average_visit, pages_per_visit=pages_per_visit, bounce_rate=bounce_rate ) rank += 1 data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}, retries left {retries-tries}") tries += 1 finally: driver.quit() if not success: raise Exception(f"Max retries exceeded: {retries}") # Function to process and scrape all search results concurrently def start_scrape(keywords, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, keywords, [data_pipeline] * len(keywords), [retries] * len(keywords) ) # Function to process websites (Selenium-based) and extract competitor data def process_website(row, retries=3): url = row["url"] tries = 0 success = False while tries <= retries and not success: try: driver = setup_driver() scrapeops_proxy_url = get_scrapeops_url(url) driver.get(scrapeops_proxy_url) time.sleep(3) # Allow page to load # Check if blocked by a modal or warning try: blocked_modal = driver.find_element(By.CSS_SELECTOR, "div.wa-limit-modal") if blocked_modal: raise Exception("Blocked by modal") except: pass # No blocking modal # Extract competitor data competitors = driver.find_elements(By.CSS_SELECTOR, "div.wa-competitors__list-item") competitor_pipeline = DataPipeline(csv_filename=f"{row['name'].replace(' ', '-')}_competitors.csv") for competitor in competitors: site_name = 
competitor.find_element(By.CSS_SELECTOR, "span.wa-competitors__list-item-title").text.strip() link = f"https://www.similarweb.com/website/{site_name}/" affinity = competitor.find_element(By.CSS_SELECTOR, "span.app-progress__value").text.strip() target_spans = competitor.find_elements(By.CSS_SELECTOR, "span.wa-competitors__list-column") monthly_visits = target_spans[2].text.strip() category = target_spans[3].text.strip() category_rank = int(target_spans[4].text.replace("#", "").replace(",", "").replace("--", "0").strip()) competitor_data = CompetitorData( name=site_name, url=link, affinity=affinity, monthly_visits=monthly_visits, category=category, category_rank=category_rank ) competitor_pipeline.add_data(competitor_data) competitor_pipeline.close_pipeline() success = True except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {url}, Retries left: {retries-tries}") tries += 1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['url']}") def process_results(csv_file, max_threads=5, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( process_website, reader, [retries] * len(reader) ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 logger.info(f"Crawl starting...") # Example keywords to scrape keyword_list = [ {"category": "arts-and-entertainment", "subcategory": "humor"}, {"category": "arts-and-entertainment", "subcategory": "animation-and-comics"} ] aggregate_files = [] # Crawl and save results filename = "arts-and-entertainment" crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword_list, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") # Process each CSV file for file in aggregate_files: process_results(file,max_threads=MAX_THREADS, retries=MAX_RETRIES)
MAX_THREADS
: Defines how many concurrent threads are used for processing and scraping tasks.MAX_RETRIES
Determines the number of retries the script will make if a request fails, such as due to a non-200 status code or network issues.keyword_list
A list of dictionaries, each containing a "category" and "subcategory," which specify the type of websites to be scraped from SimilarWeb.filename
The base name that is used to generate the CSV file where the data obtained from scraping will be saved.https://www.similarweb.com/top-websites/arts-and-entertainment/humor/
https://www.similarweb.com/top-websites/{CATEGORY}/{SUBCATEGORY}/
https://www.similarweb.com/website/pikabu.ru/
https://www.similarweb.com/website/{NAME_OF_SITE}/
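To make these formats concrete, here is a small sketch (not part of the scraper itself) that builds both kinds of URL from the same keyword dictionaries we use in keyword_list:

# Sketch: constructing SimilarWeb URLs from a keyword dict
keyword = {"category": "arts-and-entertainment", "subcategory": "humor"}

top_websites_url = f"https://www.similarweb.com/top-websites/{keyword['category']}/{keyword['subcategory']}/"
website_report_url = "https://www.similarweb.com/website/pikabu.ru/"

print(top_websites_url)    # the ranking page for the humor subcategory
print(website_report_url)  # an individual website report page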
wait
parameter to load our dynamic content. Once the page is loaded, we can simply locate the information by using its CSS class.For the results pages, each row has a class of top-table__row
. From there, we can locate all these rows and extract their data with ease.wa-competitors__list-item
. These div tags contain all the information for each individual competitor. If SimilarWeb blocks us, the page instead shows a modal with the class wa-limit-modal.
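As a quick illustration of those three classes in action, the sketch below loads one ranking page with Selenium and counts what it finds. It assumes Chrome is installed and, unlike the full scraper, it goes straight to the site without the proxy, so it may well get blocked:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

driver.get("https://www.similarweb.com/top-websites/arts-and-entertainment/humor/")

rows = driver.find_elements(By.CSS_SELECTOR, "tr.top-table__row")                     # ranking rows
competitors = driver.find_elements(By.CSS_SELECTOR, "div.wa-competitors__list-item")  # competitor cards (website report pages)
blocked = driver.find_elements(By.CSS_SELECTOR, "div.wa-limit-modal")                 # anti-bot modal

print(f"rows: {len(rows)}, competitors: {len(competitors)}, blocked: {bool(blocked)}")
driver.quit()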
mkdir similarweb-scraper
cd similarweb-scraper
python -m venv venv
source venv/bin/activate
pip install selenium
pip install webdriver-manager
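Optionally, you can confirm that both packages installed into the virtual environment before moving on; this quick check only uses the standard library:

from importlib.metadata import version

# Prints the installed version of each package, or raises if it's missing
for package in ("selenium", "webdriver-manager"):
    print(package, version(package))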
scrape_search_results()
.import os import json import logging from selenium import webdriver from selenium.webdriver.chrome.service import Service as ChromeService from selenium.webdriver.chrome.options import Options from selenium.webdriver.common.by import By from webdriver_manager.chrome import ChromeDriverManager import time API_KEY = "" # Load the API key from the config file with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] # Logging configuration logging.basicConfig(level=logging.INFO) logger = logging.getLogger(__name__) # Function to set up the Selenium WebDriver with necessary options def setup_driver(): options = Options() options.add_argument("--headless") # Run in headless mode options.add_argument("--no-sandbox") options.add_argument("--disable-dev-shm-usage") return webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options) # Main scraping function using Selenium def scrape_search_results(keyword, retries=3): url = f"https://www.similarweb.com/top-websites/{keyword['category']}/{keyword['subcategory']}/" tries = 0 success = False while tries <= retries and not success: try: # Set up and start the WebDriver driver = setup_driver() driver.get(url) logger.info(f"Received page from: {url}") # Wait for the page to load fully time.sleep(3) # Find all rows for the top websites table rows = driver.find_elements(By.CSS_SELECTOR, "tr.top-table__row") rank = 1 for row in rows: link_holder = row.find_element(By.CSS_SELECTOR, "a.tw-table__compare") site_name = link_holder.text link = f"https://www.similarweb.com/website/{site_name}/" rank_change_holder = row.find_element(By.CSS_SELECTOR, "td.top-table__column.top-table__column--rank-change") rank_change = 0 up_or_down = rank_change_holder.find_element(By.CSS_SELECTOR, "span").get_attribute("class").split(" ")[1] if "change--up" in up_or_down: rank_change += int(rank_change_holder.text) elif "change--down" in up_or_down: rank_change -= int(rank_change_holder.text) average_visit = row.find_element(By.CSS_SELECTOR, "span.tw-table__avg-visit-duration").text pages_per_visit = float(row.find_element(By.CSS_SELECTOR, "span.tw-table__pages-per-visit").text) bounce_rate = row.find_element(By.CSS_SELECTOR, "span.tw-table__bounce-rate").text # Collecting scraped data search_data = { "name": site_name, "url": link, "rank": rank, "rank_change": rank_change, "average_visit": average_visit, "pages_per_visit": pages_per_visit, "bounce_rate": bounce_rate } rank += 1 print("search data: ",search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}, retries left {retries-tries}") tries += 1 finally: # Close the WebDriver after each attempt driver.quit() if not success: raise Exception(f"Max retries exceeded for: {url}") # Function to start the scraping process for a list of keywords def start_scrape(keywords, retries=3): for keyword in keywords: scrape_search_results(keyword, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 logger.info(f"Crawl starting...") # Input list of keywords to scrape keyword_list = [ {"category": "arts-and-entertainment", "subcategory": "humor"}, {"category": "arts-and-entertainment", "subcategory": "animation-and-comics"} ] # Start scraping process start_scrape(keyword_list, retries=MAX_RETRIES) logger.info(f"Crawl complete.")
rows = driver.find_elements(By.CSS_SELECTOR, "tr.top-table__row")
.link_holder
with link_holder = row.find_element(By.CSS_SELECTOR, "a.tw-table__compare")
.link_holder
, we extract the site_name
and construct our link.rank_change_holder.find_element(By.CSS_SELECTOR, "span").get_attribute("class").split(" ")[1]
.row.find_element(By.CSS_SELECTOR, "span.tw-table__avg-visit-duration").text
.pages_per_visit
is retrieved with float(row.find_element(By.CSS_SELECTOR, "span.tw-table__pages-per-visit").text)
.row.find_element(By.CSS_SELECTOR, "span.tw-table__bounce-rate").text
.SearchData
is required. This class will represent individual objects from the search results.Once the SearchData
object is created, it needs to be passed into a DataPipeline. The DataPipeline is responsible for opening a pipe to a CSV file. It removes duplicates by name and then saves all the non-duplicate objects to the CSV file.Below is our SearchData
class, which we use to represent individual ranking results.

@dataclass
class SearchData:
    name: str = ""
    url: str = ""
    rank: int = 0
    rank_change: int = 0
    average_visit: str = ""
    pages_per_visit: float = 0.0
    bounce_rate: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            if isinstance(getattr(self, field.name), str):
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                else:
                    value = getattr(self, field.name).strip()
                    setattr(self, field.name, value)
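To show what __post_init__ actually does, here is a tiny usage sketch (it assumes the dataclass above is defined in the same file, and the values are made up for the example): empty strings get a default label and other strings are stripped.

row = SearchData(
    name="  pikabu.ru  ",
    url="",
    rank=1,
    rank_change=2,
    average_visit="00:09:16",
    pages_per_visit=10.2,
    bounce_rate="41%"
)

print(row.name)  # "pikabu.ru" (whitespace stripped)
print(row.url)   # "No url" (empty string replaced with a default)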
DataPipeline.

class DataPipeline:
    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0

        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)
            if not file_exists:
                writer.writeheader()
            for item in data_to_save:
                writer.writerow(asdict(item))

        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if not self.is_duplicate(scraped_data):
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and not self.csv_file_open:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)
        if self.storage_queue:
            self.save_to_csv()
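Here is a minimal sketch of how the pipeline gets used, mirroring what our main block does: construct it with a filename, feed it SearchData objects, and close it so anything still in the queue is flushed to the CSV. The filename and values are just examples.

pipeline = DataPipeline(csv_filename="example-output.csv")

pipeline.add_data(SearchData(name="pikabu.ru", url="https://www.similarweb.com/website/pikabu.ru/", rank=1))
pipeline.add_data(SearchData(name="pikabu.ru", url="https://www.similarweb.com/website/pikabu.ru/", rank=1))  # duplicate: logged and dropped

pipeline.close_pipeline()  # flushes the remaining item to example-output.csv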
start_scrape()
when we put everything together.start_scrape()
then sends the pipeline to our parsing function.import os import csv import json import logging import time from dataclasses import dataclass, fields, asdict from selenium import webdriver from selenium.webdriver.chrome.service import Service as ChromeService from selenium.webdriver.chrome.options import Options from selenium.webdriver.common.by import By from webdriver_manager.chrome import ChromeDriverManager API_KEY = "" # Load API key from config with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] # Logging configuration logging.basicConfig(level=logging.INFO) logger = logging.getLogger(__name__) # Dataclass representing individual search results @dataclass class SearchData: name: str = "" url: str = "" rank: int = 0 rank_change: int = 0 average_visit: str = "" pages_per_visit: float = 0.0 bounce_rate: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): if isinstance(getattr(self, field.name), str): if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") else: value = getattr(self, field.name).strip() setattr(self, field.name, value) # Class for handling data storage to CSV class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if not self.is_duplicate(scraped_data): self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and not self.csv_file_open: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if self.storage_queue: self.save_to_csv() # Function to set up Selenium WebDriver def setup_driver(): options = Options() options.add_argument("--headless") options.add_argument("--no-sandbox") options.add_argument("--disable-dev-shm-usage") return webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options) # Function to scrape search results using Selenium def scrape_search_results(keyword, data_pipeline=None, retries=3): url = f"https://www.similarweb.com/top-websites/{keyword['category']}/{keyword['subcategory']}/" tries = 0 success = False while tries <= retries and not success: try: # Setup and start Selenium WebDriver driver = setup_driver() driver.get(url) logger.info(f"Received page from: {url}") time.sleep(3) # Wait for the page to load # Find rows in the search results rows = driver.find_elements(By.CSS_SELECTOR, "tr.top-table__row") rank = 1 for row in rows: link_holder = row.find_element(By.CSS_SELECTOR, "a.tw-table__compare") site_name = link_holder.text.strip() link = f"https://www.similarweb.com/website/{site_name}/" rank_change_holder = row.find_element(By.CSS_SELECTOR, "td.top-table__column.top-table__column--rank-change") rank_change = 0 up_or_down = rank_change_holder.find_element(By.CSS_SELECTOR, "span").get_attribute("class").split(" ")[1] if "change--up" in up_or_down: rank_change += int(rank_change_holder.text) elif "change--down" in up_or_down: rank_change -= int(rank_change_holder.text) average_visit = row.find_element(By.CSS_SELECTOR, "span.tw-table__avg-visit-duration").text.strip() pages_per_visit = float(row.find_element(By.CSS_SELECTOR, "span.tw-table__pages-per-visit").text) bounce_rate = row.find_element(By.CSS_SELECTOR, "span.tw-table__bounce-rate").text.strip() # Create a SearchData object search_data = SearchData( name=site_name, url=link, rank=rank, rank_change=rank_change, average_visit=average_visit, pages_per_visit=pages_per_visit, bounce_rate=bounce_rate ) # Add data to the pipeline data_pipeline.add_data(search_data) rank += 1 logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}, retries left {retries - tries}") tries += 1 finally: driver.quit() if not success: raise Exception(f"Max retries exceeded for: {url}") # Function to start the scraping process for a list of keywords def start_scrape(keywords, data_pipeline=None, retries=3): for keyword in keywords: scrape_search_results(keyword, data_pipeline=data_pipeline, retries=retries) # Main execution if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 logger.info(f"Crawl starting...") # Input list of keywords to scrape keyword_list = [ {"category": "arts-and-entertainment", "subcategory": "humor"}, {"category": "arts-and-entertainment", "subcategory": "animation-and-comics"} ] # Initialize DataPipeline filename = "arts-and-entertainment" crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") # Start the scraping process start_scrape(keyword_list, data_pipeline=crawl_pipeline, retries=MAX_RETRIES) # Close the pipeline after scraping crawl_pipeline.close_pipeline() logger.info(f"Crawl complete.")
SearchData
. These SearchData objects are then passed into our DataPipeline and stored in a CSV file.ThreadPoolExecutor
.Once we have the ability to open several threads, we can employ those threads to run our parsing function on multiple pages simultaneously.Below is our start_scrape()
function modified for concurrency.

def start_scrape(keywords, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_search_results,
            keywords,
            [data_pipeline] * len(keywords),
            [retries] * len(keywords)
        )
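If the repeated-list arguments look odd, this toy example (completely unrelated to scraping) shows how executor.map() lines them up: the i-th element of each iterable becomes the arguments of the i-th call.

import concurrent.futures

def demo_task(keyword, pipeline, retries):
    return f"{keyword} -> pipeline={pipeline}, retries={retries}"

keywords = ["humor", "animation-and-comics"]

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    results = executor.map(
        demo_task,
        keywords,
        ["shared-pipeline"] * len(keywords),  # same second argument for every call
        [3] * len(keywords)                   # same retry count for every call
    )
    for line in results:
        print(line)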
executor.map() runs scrape_search_results
by utilizing multiple threads. The array keywords
contains the items we wish to search for. All additional arguments to scrape_search_results
are passed in as arrays.import os import csv import json import logging import time from dataclasses import dataclass, fields, asdict from selenium import webdriver from selenium.webdriver.chrome.service import Service as ChromeService from selenium.webdriver.chrome.options import Options from selenium.webdriver.common.by import By from webdriver_manager.chrome import ChromeDriverManager import concurrent.futures API_KEY = "" # Load API key from config with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] # Logging configuration logging.basicConfig(level=logging.INFO) logger = logging.getLogger(__name__) # Dataclass representing individual search results @dataclass class SearchData: name: str = "" url: str = "" rank: int = 0 rank_change: int = 0 average_visit: str = "" pages_per_visit: float = 0.0 bounce_rate: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): if isinstance(getattr(self, field.name), str): if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") else: value = getattr(self, field.name).strip() setattr(self, field.name, value) # Class for handling data storage to CSV class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if not self.is_duplicate(scraped_data): self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and not self.csv_file_open: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if self.storage_queue: self.save_to_csv() # Function to set up Selenium WebDriver def setup_driver(): options = Options() options.add_argument("--headless") options.add_argument("--no-sandbox") options.add_argument("--disable-dev-shm-usage") return webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options) # Function to scrape search results using Selenium def scrape_search_results(keyword, data_pipeline=None, retries=3): url = f"https://www.similarweb.com/top-websites/{keyword['category']}/{keyword['subcategory']}/" tries = 0 success = False while tries <= retries and not success: try: # Setup and start Selenium WebDriver driver = setup_driver() driver.get(url) logger.info(f"Received page from: {url}") time.sleep(3) # Wait for the page to load # Find rows in the search results rows = driver.find_elements(By.CSS_SELECTOR, "tr.top-table__row") rank = 1 for row in rows: link_holder = row.find_element(By.CSS_SELECTOR, "a.tw-table__compare") site_name = link_holder.text.strip() link = f"https://www.similarweb.com/website/{site_name}/" rank_change_holder = row.find_element(By.CSS_SELECTOR, "td.top-table__column.top-table__column--rank-change") rank_change = 0 up_or_down = rank_change_holder.find_element(By.CSS_SELECTOR, "span").get_attribute("class").split(" ")[1] if "change--up" in up_or_down: rank_change += int(rank_change_holder.text) elif "change--down" in up_or_down: rank_change -= int(rank_change_holder.text) average_visit = row.find_element(By.CSS_SELECTOR, "span.tw-table__avg-visit-duration").text.strip() pages_per_visit = float(row.find_element(By.CSS_SELECTOR, "span.tw-table__pages-per-visit").text) bounce_rate = row.find_element(By.CSS_SELECTOR, "span.tw-table__bounce-rate").text.strip() # Create a SearchData object search_data = SearchData( name=site_name, url=link, rank=rank, rank_change=rank_change, average_visit=average_visit, pages_per_visit=pages_per_visit, bounce_rate=bounce_rate ) # Add data to the pipeline data_pipeline.add_data(search_data) rank += 1 logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}, retries left {retries - tries}") tries += 1 finally: driver.quit() if not success: raise Exception(f"Max retries exceeded for: {url}") # Function to start the scraping process for a list of keywords def start_scrape(keywords, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, keywords, [data_pipeline] * len(keywords), [retries] * len(keywords) ) # Main execution if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 logger.info(f"Crawl starting...") # Input list of keywords to scrape keyword_list = [ {"category": "arts-and-entertainment", "subcategory": "humor"}, {"category": "arts-and-entertainment", "subcategory": "animation-and-comics"} ] # Initialize DataPipeline filename = "arts-and-entertainment" crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") # Start the scraping process start_scrape(keyword_list, data_pipeline=crawl_pipeline,max_threads=MAX_THREADS, 
retries=MAX_RETRIES) # Close the pipeline after scraping crawl_pipeline.close_pipeline() logger.info(f"Crawl complete.")
API_KEY
, url
, and wait
— we can obtain as many addresses as possible. This tells ScrapeOps that we're willing to wait 3 seconds for the content to load, without concern for the country through which we're routed. This approach provides us with the largest possible pool of IP addresses since routing can happen through any server that ScrapeOps supports.

def get_scrapeops_url(url):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "wait": 3000,
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url
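To see exactly what gets requested, you can print the wrapped URL. The sketch below uses a placeholder API key; in the real scripts the key comes from config.json.

from urllib.parse import urlencode

API_KEY = "your-super-secret-api-key"  # placeholder

def get_scrapeops_url(url):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "wait": 3000,
    }
    return "https://proxy.scrapeops.io/v1/?" + urlencode(payload)

target = "https://www.similarweb.com/top-websites/arts-and-entertainment/humor/"
print(get_scrapeops_url(target))
# The target URL is percent-encoded into the api_key/url/wait query string.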
import os import csv import json import logging import time from urllib.parse import urlencode from dataclasses import dataclass, fields, asdict from selenium import webdriver from selenium.webdriver.chrome.service import Service as ChromeService from selenium.webdriver.chrome.options import Options from selenium.webdriver.common.by import By from webdriver_manager.chrome import ChromeDriverManager import concurrent.futures API_KEY = "" # Load API key from config with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url): payload = { "api_key": API_KEY, "url": url, "wait": 3000, } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url # Logging configuration logging.basicConfig(level=logging.INFO) logger = logging.getLogger(__name__) # Dataclass representing individual search results @dataclass class SearchData: name: str = "" url: str = "" rank: int = 0 rank_change: int = 0 average_visit: str = "" pages_per_visit: float = 0.0 bounce_rate: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): if isinstance(getattr(self, field.name), str): if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") else: value = getattr(self, field.name).strip() setattr(self, field.name, value) # Class for handling data storage to CSV class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if not self.is_duplicate(scraped_data): self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and not self.csv_file_open: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if self.storage_queue: self.save_to_csv() # Function to set up Selenium WebDriver def setup_driver(): options = Options() options.add_argument("--headless") options.add_argument("--no-sandbox") options.add_argument("--disable-dev-shm-usage") return webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options) # Function to scrape search results using Selenium def scrape_search_results(keyword, data_pipeline=None, retries=3): url = f"https://www.similarweb.com/top-websites/{keyword['category']}/{keyword['subcategory']}/" tries = 0 success = False while tries <= retries and not success: try: # Setup and start Selenium WebDriver driver = setup_driver() scrapeops_proxy_url = get_scrapeops_url(url) driver.get(scrapeops_proxy_url) logger.info(f"Received page from: {url}") time.sleep(3) # Wait for the page to load # Find rows in the search results rows = driver.find_elements(By.CSS_SELECTOR, "tr.top-table__row") rank = 1 for row in rows: link_holder = row.find_element(By.CSS_SELECTOR, "a.tw-table__compare") site_name = link_holder.text.strip() link = f"https://www.similarweb.com/website/{site_name}/" rank_change_holder = row.find_element(By.CSS_SELECTOR, "td.top-table__column.top-table__column--rank-change") rank_change = 0 up_or_down = rank_change_holder.find_element(By.CSS_SELECTOR, "span").get_attribute("class").split(" ")[1] if "change--up" in up_or_down: rank_change += int(rank_change_holder.text) elif "change--down" in up_or_down: rank_change -= int(rank_change_holder.text) average_visit = row.find_element(By.CSS_SELECTOR, "span.tw-table__avg-visit-duration").text.strip() pages_per_visit = float(row.find_element(By.CSS_SELECTOR, "span.tw-table__pages-per-visit").text) bounce_rate = row.find_element(By.CSS_SELECTOR, "span.tw-table__bounce-rate").text.strip() # Create a SearchData object search_data = SearchData( name=site_name, url=link, rank=rank, rank_change=rank_change, average_visit=average_visit, pages_per_visit=pages_per_visit, bounce_rate=bounce_rate ) # Add data to the pipeline data_pipeline.add_data(search_data) rank += 1 logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}, retries left {retries - tries}") tries += 1 finally: driver.quit() if not success: raise Exception(f"Max retries exceeded for: {url}") # Function to start the scraping process for a list of keywords def start_scrape(keywords, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, keywords, [data_pipeline] * len(keywords), [retries] * len(keywords) ) # Main execution if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 logger.info(f"Crawl starting...") # Input list of keywords to scrape keyword_list = [ {"category": "arts-and-entertainment", "subcategory": "humor"}, {"category": "arts-and-entertainment", "subcategory": "animation-and-comics"} ] # Initialize DataPipeline filename = "arts-and-entertainment" crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") # Start the scraping process 
start_scrape(keyword_list, data_pipeline=crawl_pipeline,max_threads=MAX_THREADS, retries=MAX_RETRIES) # Close the pipeline after scraping crawl_pipeline.close_pipeline() logger.info(f"Crawl complete.")
MAX_THREADS
is set to 5. Since we're only searching 2 categories, ThreadPoolExecutor
will use 2 threads to run this and finish it. In the second half of our article, we'll make use of all 5 threads when writing the scraper. Here is our main:

# Main execution
if __name__ == "__main__":
    MAX_RETRIES = 3
    MAX_THREADS = 5

    logger.info(f"Crawl starting...")

    # Input list of keywords to scrape
    keyword_list = [
        {"category": "arts-and-entertainment", "subcategory": "humor"},
        {"category": "arts-and-entertainment", "subcategory": "animation-and-comics"}
    ]

    # Initialize DataPipeline
    filename = "arts-and-entertainment"
    crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")

    # Start the scraping process
    start_scrape(keyword_list, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)

    # Close the pipeline after scraping
    crawl_pipeline.close_pipeline()
    logger.info(f"Crawl complete.")
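Before moving on to the scraper, it's worth a quick sanity check that the crawl actually produced rows. The sketch below just prints the first few records from the crawl CSV; the column names come from the SearchData fields.

import csv

with open("arts-and-entertainment.csv", newline="") as file:
    reader = csv.DictReader(file)
    for i, row in enumerate(reader):
        print(row["rank"], row["name"], row["average_visit"], row["bounce_rate"])
        if i >= 4:  # only show the first five rows
            break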
def process_website(row, retries=3):
    url = row["url"]
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            driver = setup_driver()
            driver.get(url)
            time.sleep(3)  # Allow page to load

            # Check if blocked by a modal or warning
            try:
                blocked_modal = driver.find_element(By.CSS_SELECTOR, "div.wa-limit-modal")
            except:
                blocked_modal = None  # No blocking modal
            if blocked_modal:
                raise Exception("Blocked by modal")

            # Extract competitor data
            competitors = driver.find_elements(By.CSS_SELECTOR, "div.wa-competitors__list-item")
            for competitor in competitors:
                site_name = competitor.find_element(By.CSS_SELECTOR, "span.wa-competitors__list-item-title").text.strip()
                link = f"https://www.similarweb.com/website/{site_name}/"
                affinity = competitor.find_element(By.CSS_SELECTOR, "span.app-progress__value").text.strip()
                target_spans = competitor.find_elements(By.CSS_SELECTOR, "span.wa-competitors__list-column")

                monthly_visits = target_spans[2].text.strip()
                category = target_spans[3].text.strip()
                category_rank = int(target_spans[4].text.replace("#", "").replace(",", "").replace("--", "0").strip())

                competitor_data = {
                    "name": site_name,
                    "url": link,
                    "affinity": affinity,
                    "monthly_visits": monthly_visits,
                    "category": category,
                    "category_rank": category_rank
                }
                print(competitor_data)  # Replace with actual storage mechanism

            success = True
        except Exception as e:
            logger.error(f"Exception thrown: {e}")
            logger.warning(f"Failed to process page: {url}, Retries left: {retries-tries}")
            tries += 1
        finally:
            driver.quit()

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")
    else:
        logger.info(f"Successfully parsed: {row['url']}")
driver.find_elements(By.CSS_SELECTOR, "div.wa-competitors__list-item")
site_name
affinity
monthly_visits
category
category_rank
site_name
.process_website()
to every row in the file.Below is our process_results()
function.

def process_results(csv_file, retries=3):
    logger.info(f"Processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

    for row in reader:
        process_website(row, retries=retries)
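Using it mirrors the main block from the crawl: collect the CSV filenames the crawl produced and hand each one to process_results(). A short sketch, assuming the functions above live in the same file:

aggregate_files = ["arts-and-entertainment.csv"]  # produced by the crawl

for file in aggregate_files:
    process_results(file, retries=3)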
import os import csv import json import time import logging from urllib.parse import urlencode from selenium import webdriver from selenium.webdriver.chrome.service import Service as ChromeService from selenium.webdriver.chrome.options import Options from selenium.webdriver.common.by import By from webdriver_manager.chrome import ChromeDriverManager import concurrent.futures from dataclasses import dataclass, field, fields, asdict # ScrapeOps API Key (if you're using a proxy service like ScrapeOps) API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url): payload = { "api_key": API_KEY, "url": url, "wait": 3000 } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url # Set up logging logging.basicConfig(level=logging.INFO) logger = logging.getLogger(__name__) # Setup Selenium WebDriver def setup_driver(): options = Options() options.add_argument("--headless") # Run in headless mode for efficiency options.add_argument("--no-sandbox") options.add_argument("--disable-dev-shm-usage") return webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options) @dataclass class SearchData: name: str = "" url: str = "" rank: int = 0 rank_change: int = 0 average_visit: str = "" pages_per_visit: float = 0.0 bounce_rate: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): if isinstance(getattr(self, field.name), str): if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") else: value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if not self.is_duplicate(scraped_data): self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and not self.csv_file_open: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if self.storage_queue: self.save_to_csv() # Function to scrape search results (fully Selenium-based) def scrape_search_results(keyword, data_pipeline=None, retries=3): url = f"https://www.similarweb.com/top-websites/{keyword['category']}/{keyword['subcategory']}/" tries = 0 success = False while tries <= retries and not success: try: # Initialize WebDriver and load page driver = setup_driver() scrapeops_proxy_url = get_scrapeops_url(url) driver.get(scrapeops_proxy_url) time.sleep(3) # Allow page to load logger.info(f"Opened URL: {url}") # Find all rows of the search results table rows = driver.find_elements(By.CSS_SELECTOR, "tr.top-table__row") rank = 1 for row in rows: site_name = row.find_element(By.CSS_SELECTOR, "a.tw-table__compare").text.strip() link = f"https://www.similarweb.com/website/{site_name}/" # Rank change processing rank_change_holder = row.find_element(By.CSS_SELECTOR, "td.top-table__column--rank-change") rank_change = 0 up_or_down = rank_change_holder.find_element(By.TAG_NAME, "span").get_attribute("class").split()[-1] if "change--up" in up_or_down: rank_change += int(rank_change_holder.text.strip()) elif "change--down" in up_or_down: rank_change -= int(rank_change_holder.text.strip()) average_visit = row.find_element(By.CSS_SELECTOR, "span.tw-table__avg-visit-duration").text.strip() pages_per_visit = float(row.find_element(By.CSS_SELECTOR, "span.tw-table__pages-per-visit").text.strip()) bounce_rate = row.find_element(By.CSS_SELECTOR, "span.tw-table__bounce-rate").text.strip() # Create data object search_data = SearchData( name=site_name, url=link, rank=rank, rank_change=rank_change, average_visit=average_visit, pages_per_visit=pages_per_visit, bounce_rate=bounce_rate ) rank += 1 data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}, retries left {retries-tries}") tries += 1 finally: driver.quit() if not success: raise Exception(f"Max retries exceeded: {retries}") # Function to process and scrape all search results concurrently def start_scrape(keywords, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, keywords, [data_pipeline] * len(keywords), [retries] * len(keywords) ) # Function to process websites (Selenium-based) def process_website(row, retries=3): url = row["url"] tries = 0 success = False while tries <= retries and not success: try: driver = setup_driver() driver.get(url) time.sleep(3) # Allow page to load # Check if blocked by a modal or warning try: blocked_modal = driver.find_element(By.CSS_SELECTOR, "div.wa-limit-modal") if blocked_modal: raise Exception("Blocked by modal") except: pass # No blocking modal # Extract competitor data competitors = driver.find_elements(By.CSS_SELECTOR, "div.wa-competitors__list-item") for competitor in competitors: site_name = competitor.find_element(By.CSS_SELECTOR, "span.wa-competitors__list-item-title").text.strip() link = f"https://www.similarweb.com/website/{site_name}/" affinity = 
competitor.find_element(By.CSS_SELECTOR, "span.app-progress__value").text.strip() target_spans = competitor.find_elements(By.CSS_SELECTOR, "span.wa-competitors__list-column") monthly_visits = target_spans[2].text.strip() category = target_spans[3].text.strip() category_rank = int(target_spans[4].text.replace("#", "").replace(",", "").replace("--", "0").strip()) competitor_data = { "name": site_name, "url": link, "affinity": affinity, "monthly_visits": monthly_visits, "category": category, "category_rank": category_rank } print(competitor_data) # Replace with actual storage mechanism success = True except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {url}, Retries left: {retries-tries}") tries += 1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['url']}") # Function to load and process CSV results def process_results(csv_file, retries=3): logger.info(f"Processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) for row in reader: process_website(row, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 logger.info(f"Crawl starting...") # Example keywords to scrape keyword_list = [ {"category": "arts-and-entertainment", "subcategory": "humor"}, {"category": "arts-and-entertainment", "subcategory": "animation-and-comics"} ] aggregate_files = [] # Crawl and save results filename = "arts-and-entertainment" crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword_list, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") # Process each CSV file for file in aggregate_files: process_results(file, retries=MAX_RETRIES)
process_results() loads our CSV into an array. We then apply process_website() to each row of the file, as in the single-threaded excerpt below.
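For reference, here is the single-threaded process_results() on its own, exactly as it appears in the full listing further down:

def process_results(csv_file, retries=3):
    logger.info(f"Processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        for row in reader:
            process_website(row, retries=retries)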
Each competitor we extract gets stored in a CompetitorData object, which is quite similar to our SearchData. Below is our CompetitorData class.
@dataclass
class CompetitorData:
    name: str = ""
    url: str = ""
    affinity: str = ""
    monthly_visits: str = ""
    category: str = ""
    category_rank: int = None

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            if isinstance(getattr(self, field.name), str):
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                else:
                    value = getattr(self, field.name)
                    setattr(self, field.name, value.strip())
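As a quick sanity check (a hypothetical snippet, not part of the scraper), __post_init__ strips whitespace from string fields and replaces empty strings with placeholder text:

# Hypothetical illustration of CompetitorData's string cleanup
example = CompetitorData(name="  example.com  ")
print(example.name)      # "example.com" -- whitespace stripped
print(example.affinity)  # "No affinity" -- empty string replaced with a placeholder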
Inside our parsing function, we then open a new DataPipeline and pass CompetitorData
into it.import os import csv import json import time import logging from urllib.parse import urlencode from selenium import webdriver from selenium.webdriver.chrome.service import Service as ChromeService from selenium.webdriver.chrome.options import Options from selenium.webdriver.common.by import By from webdriver_manager.chrome import ChromeDriverManager import concurrent.futures from dataclasses import dataclass, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url): payload = { "api_key": API_KEY, "url": url, "wait": 3000 } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url # Set up logging logging.basicConfig(level=logging.INFO) logger = logging.getLogger(__name__) # Setup Selenium WebDriver def setup_driver(): options = Options() options.add_argument("--headless") # Run in headless mode for efficiency options.add_argument("--no-sandbox") options.add_argument("--disable-dev-shm-usage") return webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options) @dataclass class SearchData: name: str = "" url: str = "" rank: int = 0 rank_change: int = 0 average_visit: str = "" pages_per_visit: float = 0.0 bounce_rate: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): if isinstance(getattr(self, field.name), str): if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") else: value = getattr(self, field.name) setattr(self, field.name, value.strip()) @dataclass class CompetitorData: name: str = "" url: str = "" affinity: str = "" monthly_visits: str = "" category: str = "" category_rank: int = None def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): if isinstance(getattr(self, field.name), str): if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") else: value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if not self.is_duplicate(scraped_data): self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and not self.csv_file_open: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if self.storage_queue: self.save_to_csv() # Function to scrape search results (fully Selenium-based) def scrape_search_results(keyword, data_pipeline=None, retries=3): url = f"https://www.similarweb.com/top-websites/{keyword['category']}/{keyword['subcategory']}/" tries = 0 success = False while tries <= retries and not success: try: # Initialize WebDriver and load page driver = setup_driver() scrapeops_proxy_url = get_scrapeops_url(url) driver.get(scrapeops_proxy_url) time.sleep(3) # Allow page to load logger.info(f"Opened URL: {url}") # Find all rows of the search results table rows = driver.find_elements(By.CSS_SELECTOR, "tr.top-table__row") rank = 1 for row in rows: site_name = row.find_element(By.CSS_SELECTOR, "a.tw-table__compare").text.strip() link = f"https://www.similarweb.com/website/{site_name}/" # Rank change processing rank_change_holder = row.find_element(By.CSS_SELECTOR, "td.top-table__column--rank-change") rank_change = 0 up_or_down = rank_change_holder.find_element(By.TAG_NAME, "span").get_attribute("class").split()[-1] if "change--up" in up_or_down: rank_change += int(rank_change_holder.text.strip()) elif "change--down" in up_or_down: rank_change -= int(rank_change_holder.text.strip()) average_visit = row.find_element(By.CSS_SELECTOR, "span.tw-table__avg-visit-duration").text.strip() pages_per_visit = float(row.find_element(By.CSS_SELECTOR, "span.tw-table__pages-per-visit").text.strip()) bounce_rate = row.find_element(By.CSS_SELECTOR, "span.tw-table__bounce-rate").text.strip() # Create data object search_data = SearchData( name=site_name, url=link, rank=rank, rank_change=rank_change, average_visit=average_visit, pages_per_visit=pages_per_visit, bounce_rate=bounce_rate ) rank += 1 data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}, retries left {retries-tries}") tries += 1 finally: driver.quit() if not success: raise Exception(f"Max retries exceeded: {retries}") # Function to process and scrape all search results concurrently def start_scrape(keywords, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, keywords, [data_pipeline] * len(keywords), [retries] * len(keywords) ) # Function to process websites (Selenium-based) and extract competitor data def process_website(row, retries=3): url = row["url"] tries = 0 success = False while tries <= retries and not success: try: driver = setup_driver() driver.get(url) time.sleep(3) # Allow page to load # Check if blocked by a modal or warning try: blocked_modal = driver.find_element(By.CSS_SELECTOR, "div.wa-limit-modal") if blocked_modal: raise Exception("Blocked by modal") except: pass # No blocking modal # Extract competitor data competitors = driver.find_elements(By.CSS_SELECTOR, "div.wa-competitors__list-item") competitor_pipeline = DataPipeline(csv_filename=f"{row['name'].replace(' ', '-')}_competitors.csv") for competitor in competitors: site_name = competitor.find_element(By.CSS_SELECTOR, 
"span.wa-competitors__list-item-title").text.strip() link = f"https://www.similarweb.com/website/{site_name}/" affinity = competitor.find_element(By.CSS_SELECTOR, "span.app-progress__value").text.strip() target_spans = competitor.find_elements(By.CSS_SELECTOR, "span.wa-competitors__list-column") monthly_visits = target_spans[2].text.strip() category = target_spans[3].text.strip() category_rank = int(target_spans[4].text.replace("#", "").replace(",", "").replace("--", "0").strip()) competitor_data = CompetitorData( name=site_name, url=link, affinity=affinity, monthly_visits=monthly_visits, category=category, category_rank=category_rank ) competitor_pipeline.add_data(competitor_data) competitor_pipeline.close_pipeline() success = True except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {url}, Retries left: {retries-tries}") tries += 1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['url']}") # Function to load and process CSV results def process_results(csv_file, retries=3): logger.info(f"Processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) for row in reader: process_website(row, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 logger.info(f"Crawl starting...") # Example keywords to scrape keyword_list = [ {"category": "arts-and-entertainment", "subcategory": "humor"}, {"category": "arts-and-entertainment", "subcategory": "animation-and-comics"} ] aggregate_files = [] # Crawl and save results filename = "arts-and-entertainment" crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword_list, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") # Process each CSV file for file in aggregate_files: process_results(file, retries=MAX_RETRIES)
CompetitorData is used to represent the competitors we extract from the page. Inside our parsing function, we open a new DataPipeline and pass these CompetitorData objects into it.
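Here is the relevant excerpt from process_website(); each row (target site) gets its own pipeline writing to its own CSV file:

competitor_pipeline = DataPipeline(csv_filename=f"{row['name'].replace(' ', '-')}_competitors.csv")

for competitor in competitors:
    # ...parse site_name, link, affinity, monthly_visits, category and category_rank...
    competitor_data = CompetitorData(
        name=site_name,
        url=link,
        affinity=affinity,
        monthly_visits=monthly_visits,
        category=category,
        category_rank=category_rank
    )
    competitor_pipeline.add_data(competitor_data)

competitor_pipeline.close_pipeline()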
Next, we rewrite process_results() to take advantage of multiple threads using ThreadPoolExecutor. Below is our multithreaded version of process_results().
def process_results(csv_file, max_threads=5, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
            executor.map(
                process_website,
                reader,
                [retries] * len(reader)
            )
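If the way executor.map() consumes those parallel lists is unfamiliar, here is a tiny standalone illustration (not part of the scraper, only the standard library):

import concurrent.futures

def greet(name, punctuation):
    print(f"Hello, {name}{punctuation}")

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    # Arguments are zipped positionally: greet("Alice", "!") and greet("Bob", "?")
    executor.map(greet, ["Alice", "Bob"], ["!", "?"])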
executor.map() runs our process_website function across multiple threads. reader is the array of row objects we aim to process with several threads, and retries is also passed in as an array matching the length of the reader array. All of the arguments are handed to executor.map() in array form and are then forwarded, one set per call, into process_website
.Below is the full code we've written so far.import os import csv import json import time import logging from urllib.parse import urlencode from selenium import webdriver from selenium.webdriver.chrome.service import Service as ChromeService from selenium.webdriver.chrome.options import Options from selenium.webdriver.common.by import By from webdriver_manager.chrome import ChromeDriverManager import concurrent.futures from dataclasses import dataclass, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url): payload = { "api_key": API_KEY, "url": url, "wait": 3000 } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url # Set up logging logging.basicConfig(level=logging.INFO) logger = logging.getLogger(__name__) # Setup Selenium WebDriver def setup_driver(): options = Options() options.add_argument("--headless") # Run in headless mode for efficiency options.add_argument("--no-sandbox") options.add_argument("--disable-dev-shm-usage") return webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options) @dataclass class SearchData: name: str = "" url: str = "" rank: int = 0 rank_change: int = 0 average_visit: str = "" pages_per_visit: float = 0.0 bounce_rate: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): if isinstance(getattr(self, field.name), str): if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") else: value = getattr(self, field.name) setattr(self, field.name, value.strip()) @dataclass class CompetitorData: name: str = "" url: str = "" affinity: str = "" monthly_visits: str = "" category: str = "" category_rank: int = None def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): if isinstance(getattr(self, field.name), str): if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") else: value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if not self.is_duplicate(scraped_data): self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and not self.csv_file_open: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if self.storage_queue: self.save_to_csv() # Function to scrape search results (fully Selenium-based) def scrape_search_results(keyword, data_pipeline=None, retries=3): url = f"https://www.similarweb.com/top-websites/{keyword['category']}/{keyword['subcategory']}/" tries = 0 success = False while tries <= retries and not success: try: # Initialize WebDriver and load page driver = setup_driver() scrapeops_proxy_url = get_scrapeops_url(url) driver.get(scrapeops_proxy_url) time.sleep(3) # Allow page to load logger.info(f"Opened URL: {url}") # Find all rows of the search results table rows = driver.find_elements(By.CSS_SELECTOR, "tr.top-table__row") rank = 1 for row in rows: site_name = row.find_element(By.CSS_SELECTOR, "a.tw-table__compare").text.strip() link = f"https://www.similarweb.com/website/{site_name}/" # Rank change processing rank_change_holder = row.find_element(By.CSS_SELECTOR, "td.top-table__column--rank-change") rank_change = 0 up_or_down = rank_change_holder.find_element(By.TAG_NAME, "span").get_attribute("class").split()[-1] if "change--up" in up_or_down: rank_change += int(rank_change_holder.text.strip()) elif "change--down" in up_or_down: rank_change -= int(rank_change_holder.text.strip()) average_visit = row.find_element(By.CSS_SELECTOR, "span.tw-table__avg-visit-duration").text.strip() pages_per_visit = float(row.find_element(By.CSS_SELECTOR, "span.tw-table__pages-per-visit").text.strip()) bounce_rate = row.find_element(By.CSS_SELECTOR, "span.tw-table__bounce-rate").text.strip() # Create data object search_data = SearchData( name=site_name, url=link, rank=rank, rank_change=rank_change, average_visit=average_visit, pages_per_visit=pages_per_visit, bounce_rate=bounce_rate ) rank += 1 data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}, retries left {retries-tries}") tries += 1 finally: driver.quit() if not success: raise Exception(f"Max retries exceeded: {retries}") # Function to process and scrape all search results concurrently def start_scrape(keywords, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, keywords, [data_pipeline] * len(keywords), [retries] * len(keywords) ) # Function to process websites (Selenium-based) and extract competitor data def process_website(row, retries=3): url = row["url"] tries = 0 success = False while tries <= retries and not success: try: driver = setup_driver() driver.get(url) time.sleep(3) # Allow page to load # Check if blocked by a modal or warning try: blocked_modal = driver.find_element(By.CSS_SELECTOR, "div.wa-limit-modal") if blocked_modal: raise Exception("Blocked by modal") except: pass # No blocking modal # Extract competitor data competitors = driver.find_elements(By.CSS_SELECTOR, "div.wa-competitors__list-item") competitor_pipeline = DataPipeline(csv_filename=f"{row['name'].replace(' ', '-')}_competitors.csv") for competitor in competitors: site_name = competitor.find_element(By.CSS_SELECTOR, 
"span.wa-competitors__list-item-title").text.strip() link = f"https://www.similarweb.com/website/{site_name}/" affinity = competitor.find_element(By.CSS_SELECTOR, "span.app-progress__value").text.strip() target_spans = competitor.find_elements(By.CSS_SELECTOR, "span.wa-competitors__list-column") monthly_visits = target_spans[2].text.strip() category = target_spans[3].text.strip() category_rank = int(target_spans[4].text.replace("#", "").replace(",", "").replace("--", "0").strip()) competitor_data = CompetitorData( name=site_name, url=link, affinity=affinity, monthly_visits=monthly_visits, category=category, category_rank=category_rank ) competitor_pipeline.add_data(competitor_data) competitor_pipeline.close_pipeline() success = True except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {url}, Retries left: {retries-tries}") tries += 1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['url']}") def process_results(csv_file, max_threads=5, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( process_website, reader, [retries] * len(reader) ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 logger.info(f"Crawl starting...") # Example keywords to scrape keyword_list = [ {"category": "arts-and-entertainment", "subcategory": "humor"}, {"category": "arts-and-entertainment", "subcategory": "animation-and-comics"} ] aggregate_files = [] # Crawl and save results filename = "arts-and-entertainment" crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword_list, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") # Process each CSV file for file in aggregate_files: process_results(file,max_threads=MAX_THREADS, retries=MAX_RETRIES)
To run process_website() through the ScrapeOps proxy as well, we wrap the target URL with get_scrapeops_url() before handing it to the driver:

proxy_url = get_scrapeops_url(url)
driver.get(proxy_url)

With that one change in place, here is our finished, production-ready code.
import os
import csv
import json
import time
import logging
from urllib.parse import urlencode
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
import concurrent.futures
from dataclasses import dataclass, fields, asdict

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]


def get_scrapeops_url(url):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "wait": 3000
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url


# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


# Setup Selenium WebDriver
def setup_driver():
    options = Options()
    options.add_argument("--headless")  # Run in headless mode for efficiency
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    return webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options)


@dataclass
class SearchData:
    name: str = ""
    url: str = ""
    rank: int = 0
    rank_change: int = 0
    average_visit: str = ""
    pages_per_visit: float = 0.0
    bounce_rate: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            if isinstance(getattr(self, field.name), str):
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                else:
                    value = getattr(self, field.name)
                    setattr(self, field.name, value.strip())


@dataclass
class CompetitorData:
    name: str = ""
    url: str = ""
    affinity: str = ""
    monthly_visits: str = ""
    category: str = ""
    category_rank: int = None

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            if isinstance(getattr(self, field.name), str):
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                else:
                    value = getattr(self, field.name)
                    setattr(self, field.name, value.strip())


class DataPipeline:
    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0

        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)
            if not file_exists:
                writer.writeheader()
            for item in data_to_save:
                writer.writerow(asdict(item))

        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if not self.is_duplicate(scraped_data):
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and not self.csv_file_open:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)
        if self.storage_queue:
            self.save_to_csv()


# Function to scrape search results (fully Selenium-based)
def scrape_search_results(keyword, data_pipeline=None, retries=3):
    url = f"https://www.similarweb.com/top-websites/{keyword['category']}/{keyword['subcategory']}/"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            # Initialize WebDriver and load page
            driver = setup_driver()
            scrapeops_proxy_url = get_scrapeops_url(url)
            driver.get(scrapeops_proxy_url)
            time.sleep(3)  # Allow page to load
            logger.info(f"Opened URL: {url}")

            # Find all rows of the search results table
            rows = driver.find_elements(By.CSS_SELECTOR, "tr.top-table__row")

            rank = 1
            for row in rows:
                site_name = row.find_element(By.CSS_SELECTOR, "a.tw-table__compare").text.strip()
                link = f"https://www.similarweb.com/website/{site_name}/"

                # Rank change processing
                rank_change_holder = row.find_element(By.CSS_SELECTOR, "td.top-table__column--rank-change")
                rank_change = 0
                up_or_down = rank_change_holder.find_element(By.TAG_NAME, "span").get_attribute("class").split()[-1]
                if "change--up" in up_or_down:
                    rank_change += int(rank_change_holder.text.strip())
                elif "change--down" in up_or_down:
                    rank_change -= int(rank_change_holder.text.strip())

                average_visit = row.find_element(By.CSS_SELECTOR, "span.tw-table__avg-visit-duration").text.strip()
                pages_per_visit = float(row.find_element(By.CSS_SELECTOR, "span.tw-table__pages-per-visit").text.strip())
                bounce_rate = row.find_element(By.CSS_SELECTOR, "span.tw-table__bounce-rate").text.strip()

                # Create data object
                search_data = SearchData(
                    name=site_name,
                    url=link,
                    rank=rank,
                    rank_change=rank_change,
                    average_visit=average_visit,
                    pages_per_visit=pages_per_visit,
                    bounce_rate=bounce_rate
                )
                rank += 1
                data_pipeline.add_data(search_data)

            logger.info(f"Successfully parsed data from: {url}")
            success = True

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}, retries left {retries-tries}")
            tries += 1
        finally:
            driver.quit()

    if not success:
        raise Exception(f"Max retries exceeded: {retries}")


# Function to process and scrape all search results concurrently
def start_scrape(keywords, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_search_results,
            keywords,
            [data_pipeline] * len(keywords),
            [retries] * len(keywords)
        )


# Function to process websites (Selenium-based) and extract competitor data
def process_website(row, retries=3):
    url = row["url"]
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            driver = setup_driver()
            scrapeops_proxy_url = get_scrapeops_url(url)
            driver.get(scrapeops_proxy_url)
            time.sleep(3)  # Allow page to load

            # Check if blocked by a modal or warning
            try:
                blocked_modal = driver.find_element(By.CSS_SELECTOR, "div.wa-limit-modal")
                if blocked_modal:
                    raise Exception("Blocked by modal")
            except:
                pass  # No blocking modal

            # Extract competitor data
            competitors = driver.find_elements(By.CSS_SELECTOR, "div.wa-competitors__list-item")
            competitor_pipeline = DataPipeline(csv_filename=f"{row['name'].replace(' ', '-')}_competitors.csv")

            for competitor in competitors:
                site_name = competitor.find_element(By.CSS_SELECTOR, "span.wa-competitors__list-item-title").text.strip()
                link = f"https://www.similarweb.com/website/{site_name}/"
                affinity = competitor.find_element(By.CSS_SELECTOR, "span.app-progress__value").text.strip()
                target_spans = competitor.find_elements(By.CSS_SELECTOR, "span.wa-competitors__list-column")

                monthly_visits = target_spans[2].text.strip()
                category = target_spans[3].text.strip()
                category_rank = int(target_spans[4].text.replace("#", "").replace(",", "").replace("--", "0").strip())

                competitor_data = CompetitorData(
                    name=site_name,
                    url=link,
                    affinity=affinity,
                    monthly_visits=monthly_visits,
                    category=category,
                    category_rank=category_rank
                )
                competitor_pipeline.add_data(competitor_data)

            competitor_pipeline.close_pipeline()
            success = True

        except Exception as e:
            logger.error(f"Exception thrown: {e}")
            logger.warning(f"Failed to process page: {url}, Retries left: {retries-tries}")
            tries += 1
        finally:
            driver.quit()

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")
    else:
        logger.info(f"Successfully parsed: {row['url']}")


def process_results(csv_file, max_threads=5, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
            executor.map(
                process_website,
                reader,
                [retries] * len(reader)
            )


if __name__ == "__main__":
    MAX_RETRIES = 3
    MAX_THREADS = 5

    logger.info(f"Crawl starting...")

    # Example keywords to scrape
    keyword_list = [
        {"category": "arts-and-entertainment", "subcategory": "humor"},
        {"category": "arts-and-entertainment", "subcategory": "animation-and-comics"}
    ]

    aggregate_files = []

    # Crawl and save results
    filename = "arts-and-entertainment"
    crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
    start_scrape(keyword_list, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
    crawl_pipeline.close_pipeline()
    aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")

    # Process each CSV file
    for file in aggregate_files:
        process_results(file, max_threads=MAX_THREADS, retries=MAX_RETRIES)
To tune your results, you can change MAX_RETRIES, MAX_THREADS, the keyword_list, or the output filename inside main. Take another look at main below if you need a refresher.

if __name__ == "__main__":
    MAX_RETRIES = 3
    MAX_THREADS = 5

    logger.info(f"Crawl starting...")

    # Example keywords to scrape
    keyword_list = [
        {"category": "arts-and-entertainment", "subcategory": "humor"},
        {"category": "arts-and-entertainment", "subcategory": "animation-and-comics"}
    ]

    aggregate_files = []

    # Crawl and save results
    filename = "arts-and-entertainment"
    crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
    start_scrape(keyword_list, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
    crawl_pipeline.close_pipeline()
    aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")

    # Process each CSV file
    for file in aggregate_files:
        process_results(file, max_threads=MAX_THREADS, retries=MAX_RETRIES)
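If you want to crawl other categories, note that each dict in keyword_list simply fills in the URL template used by scrape_search_results(). For example:

# Each keyword dict maps to a SimilarWeb "top websites" URL
keyword = {"category": "arts-and-entertainment", "subcategory": "humor"}
url = f"https://www.similarweb.com/top-websites/{keyword['category']}/{keyword['subcategory']}/"
print(url)  # https://www.similarweb.com/top-websites/arts-and-entertainment/humor/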
Whenever you scrape a site, pay attention to its Terms and Conditions and its robots.txt file. Ignoring these rules could result in account suspension or even a permanent ban. You can view these for SimilarWeb by checking the links below. If you're unsure about your scraper, you should talk to an attorney.