How to Scrape Quora With Selenium
Quora is a popular question-and-answer platform, housing valuable information across various topics. Since its launch in 2009, Quora has served as a go-to destination for people looking to ask questions and get insightful answers from a global community of experts. This data can be scraped and analyzed to understand trends, explore customer pain points, and gain insights into market opportunities.
In this tutorial, we will learn how to build a Quora scraper using Selenium.
- TLDR - How to Scrape Quora
- How To Architect Our Quora Scraper
- Understanding How To Scrape Quora
- Setting Up Our Quora Scraper Project
- Build A Quora Search Crawler
- Build A Quora Scraper
- Legal and Ethical Considerations
- Conclusion
- More Python Web Scraping Guides
Need help scraping the web?
Then check out ScrapeOps, the complete toolkit for web scraping.
TLDR - How to Scrape Quora
Need to scrape Quora but don't have time to code? Use the scraper below!
To quickly scrape Quora without coding from scratch, follow the steps below:
- Set up a virtual environment and install Selenium (see the project setup section below).
- Copy the code provided below into a Python file in your project.
- Configure the parameters at the bottom of the script:
  - MAX_THREADS: Controls the number of concurrent threads.
  - MAX_RETRIES: Defines the retry attempts for failed requests.
  - PAGES: Number of Google search result pages to scrape.
  - keyword_list: List of search terms to scrape.
- Run the scraper; the data will be saved to CSV files.
import os
import csv
import json
import logging
import time
import string
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, fields, asdict
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# Selenium configuration
# Set the path to your ChromeDriver
CHROMEDRIVER_PATH = 'chromedriver.exe' # Adjust this to the actual path if necessary
# Configure the service to use the specified driver
service = Service(CHROMEDRIVER_PATH)
# Setup Chrome options for headless browsing
options = Options()
options.add_argument("--headless")
options.add_argument("--disable-gpu") # Required for headless mode in some environments
options.add_argument("--no-sandbox") # Especially useful for Linux environments
options.add_argument("--disable-dev-shm-usage") # Helps with resource issues on some systems
@dataclass
class SearchData:
name: str = ""
url: str = ""
rank: int = 0
def __post_init__(self):
self.check_string_fields()
def check_string_fields(self):
for field in fields(self):
value = getattr(self, field.name)
if isinstance(value, str):
if not value:
setattr(self, field.name, f"No {field.name}")
else:
setattr(self, field.name, value.strip())
@dataclass
class ReplyData:
name: str = ""
reply: str = ""
def __post_init__(self):
self.check_string_fields()
def check_string_fields(self):
for field in fields(self):
value = getattr(self, field.name)
if isinstance(value, str):
if not value:
setattr(self, field.name, f"No {field.name}")
else:
setattr(self, field.name, value.strip())
class DataPipeline:
def __init__(self, csv_filename="", storage_queue_limit=50):
self.names_seen = []
self.storage_queue = []
self.storage_queue_limit = storage_queue_limit
self.csv_filename = csv_filename
self.csv_file_open = False
def save_to_csv(self):
try:
self.csv_file_open = True
data_to_save = self.storage_queue.copy()
self.storage_queue.clear()
if not data_to_save:
return
keys = [field.name for field in fields(data_to_save[0])]
logger.info("saving file data")
logger.info(keys)
# Filter out invalid characters from the filename
valid_chars = "-_.() %s%s" % (string.ascii_letters, string.digits)
valid_filename = ''.join(c for c in self.csv_filename if c in valid_chars)
logger.info(valid_filename)
file_exists = (
os.path.isfile(valid_filename) and os.path.getsize(valid_filename) > 0
)
if not file_exists:
with open(valid_filename, 'w', newline='') as output_file:
writer = csv.DictWriter(output_file, fieldnames=keys)
writer.writeheader()
with open(
valid_filename, mode="a", newline="", encoding="utf-8"
) as output_file:
writer = csv.DictWriter(output_file, fieldnames=keys)
for item in data_to_save:
writer.writerow(asdict(item))
self.csv_file_open = False
except Exception as e:
logger.error(f"Error saving csv {e}")
def is_duplicate(self, input_data):
if input_data.name in self.names_seen:
logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
return True
self.names_seen.append(input_data.name)
return False
def add_data(self, scraped_data):
logger.info("adding data")
logger.info(scraped_data)
if not self.is_duplicate(scraped_data):
self.storage_queue.append(scraped_data)
if (
len(self.storage_queue) >= self.storage_queue_limit
and not self.csv_file_open
):
self.save_to_csv()
def close_pipeline(self):
if self.csv_file_open:
time.sleep(3)
if self.storage_queue:
self.save_to_csv()
def scrape_search_results(keyword, page_number, data_pipeline=None, retries=3):
# Use a context manager to ensure the driver is properly closed
with webdriver.Chrome(service=service, options=options) as driver:
formatted_keyword = keyword.replace(" ", "+")
result_number = page_number * 10
logger.info(f"page {page_number}")
url = f"https://www.google.com/search?q={formatted_keyword}%20site%3Aquora.com&start={result_number}"
success = False
tries = 0
while tries < retries and not success:
try:
driver.get(url)
logger.info(f"Accessing {url}")
# Use explicit wait to ensure elements are loaded
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.ID, "rso")))
# Extract search result cards
for i in range(1, 11):
try:
# Attempt primary XPath
name = driver.find_element(By.XPATH, f"//*[@id='rso']/div[{i}]/div/div/div[1]/div/div/span/a/h3").text
link = driver.find_element(By.XPATH, f"//*[@id='rso']/div[{i}]/div/div/div[1]/div/div/span/a").get_attribute("href")
except:
try:
# Fallback XPath
name = driver.find_element(By.XPATH, f"//*[@id='rso']/div[{i}]/div/div/div/div[1]/div/div/span/a/h3").text
link = driver.find_element(By.XPATH, f"//*[@id='rso']/div[{i}]/div/div/div/div[1]/div/div/span/a").get_attribute("href")
except Exception as e:
continue
search_data = SearchData(
name=name,
url=link,
rank=result_number + i # Increment rank per result
)
data_pipeline.add_data(search_data)
logger.info(f"Successfully parsed data from: {url}")
success = True
except Exception as e:
logger.error(f"An error occurred while processing {url}: {e}")
tries += 1
if tries >= retries:
logger.error(f"Max retries exceeded for {url}")
else:
logger.info(f"Retrying {url} ({tries}/{retries})")
time.sleep(2)
logger.info(f"Storage queue length after page {page_number}: {len(data_pipeline.storage_queue)}")
def start_scrape(
keyword, pages, data_pipeline=None, max_threads=5, retries=3
):
with ThreadPoolExecutor(max_workers=max_threads) as executor:
futures = []
for page in range(pages):
# No need to pass the driver anymore, each thread will create its own
futures.append(
executor.submit(
scrape_search_results,
keyword,
page,
data_pipeline,
retries,
)
)
# Ensure all threads complete
for future in futures:
future.result() # This blocks until the thread finishes
def process_post(row, retries=3):
with webdriver.Chrome(service=service, options=options) as driver:
logger.info(f"Processing row: {row}")
url = row.get("url")
if not url:
logger.error(f"No URL found in row: {row}")
return
logger.info(f"Processing URL: {url}")
success = False
tries = 0
while tries < retries and not success:
try:
driver.get(url)
logger.info(f"Accessing {url}")
# Use explicit wait to ensure main content is loaded
wait = WebDriverWait(driver, 10)
main_content = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "div[id='mainContent']")))
# Extract answer cards
answer_cards = main_content.find_elements(By.CSS_SELECTOR, "div.q-click-wrapper")
if not answer_cards:
logger.warning(f"No answer cards found at {url}")
# Initialize a new DataPipeline for replies
if 'name' not in row:
logger.error(f"'name' key missing in row: {row}")
break
answer_pipeline = DataPipeline(
csv_filename=f"{row['name'].replace(' ', '-')}.csv"
)
last_seen_name = ""
for answer_card in answer_cards:
try:
name_element = answer_card.find_element(By.CSS_SELECTOR, "div.q-relative")
name = name_element.text.replace("\n", "").strip()
reply_element = answer_card.find_element(By.CSS_SELECTOR, "div.spacing_log_answer_content")
reply = reply_element.text.strip()
if "Sponsored" in name:
continue
if "Related questions" in name:
break
if name == last_seen_name:
continue
last_seen_name = name
reply_data = ReplyData(name=name, reply=reply)
answer_pipeline.add_data(reply_data)
except Exception as e:
continue
answer_pipeline.close_pipeline()
success = True
except Exception as e:
logger.error(f"Exception thrown while processing {url}: {e}")
tries += 1
if tries >= retries:
logger.error(f"Max retries exceeded for {url}")
else:
logger.info(f"Retrying {url} ({tries}/{retries})")
time.sleep(2)
def process_results(csv_file, max_threads=5, retries=3):
with open(csv_file, newline="") as file:
reader = list(csv.DictReader(file))
logger.info(f"file opened")
with ThreadPoolExecutor(max_workers=max_threads) as executor:
for row in reader:
executor.submit(process_post, row, retries)
if __name__ == "__main__":
MAX_RETRIES = 3
MAX_THREADS = 5
PAGES = 5
logger.info(f"Crawl starting...")
# INPUT ---> List of keywords to scrape
keyword_list = ["learn rust"]
aggregate_files = []
# Job Processes
for keyword in keyword_list:
filename = keyword.replace(" ", "-")
crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
start_scrape(keyword, PAGES, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
crawl_pipeline.close_pipeline()
aggregate_files.append(f"{filename}.csv")
logger.info(f"Crawl complete.")
for file in aggregate_files:
process_results(file, max_threads=MAX_THREADS, retries=MAX_RETRIES)
How To Architect Our Quora Scraper
Our project consists of two main components: a crawler and a scraper.
- Crawler: It searches Quora through Google, extracts posts, and saves relevant data.
- Scraper: It reads the saved data, visits individual Quora posts, and scrapes detailed content.
Our crawler needs to perform the following actions:
- Perform a search on Quora through Google.
- Parse and extract search results, including pagination.
- Save data (post titles and links) efficiently.
- Execute concurrent searches on multiple result pages.
After the crawl, our scraper will execute these actions:
- Read the saved data from the CSV.
- Visit each individual Quora post and extract relevant information.
- Store extracted data in a structured format.
- Use concurrency to speed up scraping multiple posts.
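The sketch below shows how these two phases connect end to end. It is only an outline; the DataPipeline, start_scrape, and process_results pieces it refers to are built step by step in the rest of this guide.
# Outline of the two-phase job (the referenced functions are built later in this guide)
def run_job(keyword, pages):
    crawl_csv = f"{keyword.replace(' ', '-')}.csv"

    # Phase 1: crawl Google for Quora posts and write titles/URLs to a CSV
    crawl_pipeline = DataPipeline(csv_filename=crawl_csv)
    start_scrape(keyword, pages, data_pipeline=crawl_pipeline)
    crawl_pipeline.close_pipeline()

    # Phase 2: read that CSV and scrape each Quora post into its own CSV of answers
    process_results(crawl_csv)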
Understanding How To Scrape Quora
Scraping Quora is unique compared to scraping other websites due to its heavy use of dynamic content, its anti-bot mechanisms, and its requirement that users be logged in for most interactions.
However, by leveraging Google search to find Quora posts and extracting the content using Selenium, we can bypass the need for an account and scrape publicly available Quora posts indirectly.
Here’s a breakdown of how we scrape Quora:
Step 1: How To Request Quora Pages
Directly scraping Quora pages can be difficult because accessing them usually requires logging in. To circumvent this, we scrape Quora indirectly by querying Google search results for Quora pages.
If we simply search for Quora content on Google, the search results give us a way to find Quora posts without going to the site itself.
Here we query Google with the following structure to find relevant Quora pages:
https://www.google.com/search?q={formatted_keyword}%20site%3Aquora.com
Where {formatted_keyword} is the term you're searching for on Quora.
For example, searching for "learn Rust" on Quora via Google would look like this:
https://www.google.com/search?q=learn+rust+site%3Aquora.com
This URL returns search results where the website is limited to quora.com.
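As a quick illustration, here is a small helper that builds this kind of URL from any keyword (a sketch; the crawler later in this guide builds the same URL inline):
# Build a Google search URL restricted to quora.com for a given keyword
def build_search_url(keyword):
    formatted_keyword = keyword.replace(" ", "+")
    return f"https://www.google.com/search?q={formatted_keyword}%20site%3Aquora.com"

print(build_search_url("learn rust"))
# https://www.google.com/search?q=learn+rust%20site%3Aquora.com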
Step 2: How To Extract Data From Quora Results and Pages
Once you have retrieved the Google search results, you will need to extract URLs from the results and scrape the actual Quora pages for detailed answers.
Finding the Correct XPaths or CSS Selectors
You can use Chrome DevTools (right-click on a webpage, then click “Inspect”) to find the correct XPaths or CSS selectors for the elements you want to extract.
1. Find XPath for Quora Search Results:
- For each search result in Google, you’ll need to locate the post titles and URLs.
- To locate the post title in a Google search result, right-click on the title element in Chrome DevTools and select “Copy XPath”.
- This might give you an XPath like:
//*[@id='rso']/div[1]/div/div[1]/a/h3
- Use similar techniques to extract the URL.
2. Find the Main Content and Replies on Quora:
Once you have the Quora post URLs, you’ll need to scrape the actual content and replies on each Quora page. In Quora posts, the answers are often deeply nested within div tags.
Use the following steps to locate elements:
- Inspect the page, and right-click the area containing the answers.
- Copy the CSS selector or XPath of the answer's container. For example, Quora uses a class like q-click-wrapper for answer cards (a short usage sketch follows below).
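To see these selectors in action, here is a minimal sketch that opens a single Quora post and prints a preview of each answer. The mainContent, q-click-wrapper, and spacing_log_answer_content selectors are the ones identified above and may change if Quora updates its markup.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def preview_answers(driver, post_url):
    # Load the post and wait for the main content container to appear
    driver.get(post_url)
    main_content = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div[id='mainContent']"))
    )
    # Each answer card sits inside a q-click-wrapper div
    for card in main_content.find_elements(By.CSS_SELECTOR, "div.q-click-wrapper"):
        try:
            reply = card.find_element(By.CSS_SELECTOR, "div.spacing_log_answer_content").text
            print(reply[:80])  # short preview of the answer text
        except Exception:
            continue  # skip cards without answer content (ads, related questions, etc.)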
Step 3: How To Control Pagination
When scraping multiple Google search result pages, you’ll need to control pagination by updating the search URL’s start parameter.
For example:
- Page 1: Returns the first 10 results.
https://www.google.com/search?q=learn+rust+site%3Aquora.com&start=0
- Page 2: Returns the next 10 results.
https://www.google.com/search?q=learn+rust+site%3Aquora.com&start=10
By incrementing the start value by 10, you can paginate through all the results.
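In code, pagination is just a loop that bumps the start value by 10 for each page, for example:
# Generate the search URLs for the first three result pages
formatted_keyword = "learn rust".replace(" ", "+")
for page_number in range(3):
    start = page_number * 10
    print(f"https://www.google.com/search?q={formatted_keyword}%20site%3Aquora.com&start={start}")
# start=0 -> results 1-10, start=10 -> results 11-20, start=20 -> results 21-30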
Setting Up Our Quora Scraper Project
To get started with scraping Quora using Selenium, follow the steps below to set up the project environment, install dependencies, and configure your WebDriver.
Create a New Project Folder
mkdir quora-scraper
cd quora-scraper
Set Up a Virtual Environment
It's a good practice to isolate your project dependencies using a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows use: venv\Scripts\activate
Install Dependencies
You'll need Selenium for browser automation. WebDriverWait, which we use for explicit waits until an element appears on the page, ships with Selenium, so no extra package is required. Install Selenium using pip:
pip install selenium
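To confirm the install worked, you can print the installed Selenium version (just a quick sanity check, not part of the scraper):
python -c "import selenium; print(selenium.__version__)"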
Download and Set Up ChromeDriver
Selenium requires a WebDriver to interact with the browser. For this project, we are using ChromeDriver.
- Download ChromeDriver:
  - Go to the ChromeDriver download page.
  - Make sure to download the version that matches your installed version of Google Chrome.
- Move ChromeDriver to Project Path:
  - Once downloaded, place chromedriver.exe in your project folder or somewhere accessible in your system’s PATH.
Configure ChromeDriver Path in Code
You’ll need to specify the path to the ChromeDriver in your Python code. Here’s how you can configure the CHROMEDRIVER_PATH and set up the Service:
from selenium.webdriver.chrome.service import Service

CHROMEDRIVER_PATH = 'chromedriver.exe' # Adjust this if the chromedriver file is not in the current directory
service = Service(CHROMEDRIVER_PATH)
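Before moving on, it can be worth running a quick headless smoke test with the same configuration to confirm that Chrome and ChromeDriver versions match (a minimal check, not part of the scraper itself):
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

CHROMEDRIVER_PATH = 'chromedriver.exe'  # adjust to wherever you placed ChromeDriver
service = Service(CHROMEDRIVER_PATH)
options = Options()
options.add_argument("--headless")

# Open a page headlessly and print its title; if this works, the driver is set up correctly
with webdriver.Chrome(service=service, options=options) as driver:
    driver.get("https://www.google.com")
    print(driver.title)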
With this setup complete, you’re ready to move on to the next step: building the Quora search crawler.
Build A Quora Search Crawler
Step 1: Create Simple Search Data Parser
Create a parser that extracts Quora post titles and links from Google search results. Use Selenium to select the HTML elements (h3 for titles and a for links).
import os
import csv
import json
import logging
import time
import string
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, fields, asdict
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# Selenium configuration
# Set the path to your ChromeDriver
CHROMEDRIVER_PATH = 'chromedriver.exe' # Adjust this to the actual path if necessary
# Configure the service to use the specified driver
service = Service(CHROMEDRIVER_PATH)
# Setup Chrome options for headless browsing
options = Options()
options.add_argument("--headless")
options.add_argument("--disable-gpu") # Required for headless mode in some environments
options.add_argument("--no-sandbox") # Especially useful for Linux environments
options.add_argument("--disable-dev-shm-usage") # Helps with resource issues on some systems
def scrape_search_results(keyword, data_pipeline=None, retries=3):
# Use a context manager to ensure the driver is properly closed
with webdriver.Chrome(service=service, options=options) as driver:
formatted_keyword = keyword.replace(" ", "+")
result_number = 0
logger.info(f"page {page_number}")
url = f"https://www.google.com/search?q={formatted_keyword}%20site%3Aquora.com&start={result_number}"
success = False
tries = 0
while tries < retries and not success:
try:
driver.get(url)
logger.info(f"Accessing {url}")
# Use explicit wait to ensure elements are loaded
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.ID, "rso")))
# Extract search result cards
for i in range(1, 11):
try:
# Attempt primary XPath
name = driver.find_element(By.XPATH, f"//*[@id='rso']/div[{i}]/div/div/div[1]/div/div/span/a/h3").text
link = driver.find_element(By.XPATH, f"//*[@id='rso']/div[{i}]/div/div/div[1]/div/div/span/a").get_attribute("href")
except:
try:
# Fallback XPath
name = driver.find_element(By.XPATH, f"//*[@id='rso']/div[{i}]/div/div/div/div[1]/div/div/span/a/h3").text
link = driver.find_element(By.XPATH, f"//*[@id='rso']/div[{i}]/div/div/div/div[1]/div/div/span/a").get_attribute("href")
except Exception as e:
continue
# SearchData and DataPipeline are introduced in Step 3; for now, just print the parsed fields
print("name:", name)
print("link:", link)
print("rank:", result_number + i)
logger.info(f"Successfully parsed data from: {url}")
success = True
except Exception as e:
logger.error(f"An error occurred while processing {url}: {e}")
tries += 1
if tries >= retries:
logger.error(f"Max retries exceeded for {url}")
else:
logger.info(f"Retrying {url} ({tries}/{retries})")
time.sleep(2)
logger.info(f"Storage queue length after page {page_number}: {len(data_pipeline.storage_queue)}")
if __name__ == "__main__":
MAX_RETRIES = 3
MAX_THREADS = 5
PAGES = 5
logger.info(f"Crawl starting...")
# INPUT ---> List of keywords to scrape
keyword_list = ["learn rust"]
aggregate_files = []
# Job Processes
for keyword in keyword_list:
filename = keyword.replace(" ", "-")
scrape_search_results(keyword, retries=MAX_RETRIES)
Step 2: Add Pagination
Modify the search URL to paginate through results by adjusting the start
parameter:
result_number = page_number * 10
url = f"https://www.google.com/search?q={formatted_keyword}%20site%3Aquora.com&start={result_number}"
After adding pagination, the code becomes:
import os
import csv
import json
import logging
import time
import string
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, fields, asdict
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# Selenium configuration
# Set the path to your ChromeDriver
CHROMEDRIVER_PATH = 'chromedriver.exe' # Adjust this to the actual path if necessary
# Configure the service to use the specified driver
service = Service(CHROMEDRIVER_PATH)
# Setup Chrome options for headless browsing
options = Options()
options.add_argument("--headless")
options.add_argument("--disable-gpu") # Required for headless mode in some environments
options.add_argument("--no-sandbox") # Especially useful for Linux environments
options.add_argument("--disable-dev-shm-usage") # Helps with resource issues on some systems
def scrape_search_results(keyword, page_number, data_pipeline=None, retries=3):
# Use a context manager to ensure the driver is properly closed
with webdriver.Chrome(service=service, options=options) as driver:
formatted_keyword = keyword.replace(" ", "+")
result_number = page_number * 10
logger.info(f"page {page_number}")
url = f"https://www.google.com/search?q={formatted_keyword}%20site%3Aquora.com&start={result_number}"
success = False
tries = 0
while tries < retries and not success:
try:
driver.get(url)
logger.info(f"Accessing {url}")
# Use explicit wait to ensure elements are loaded
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.ID, "rso")))
# Extract search result cards
for i in range(1, 11):
try:
# Attempt primary XPath
name = driver.find_element(By.XPATH, f"//*[@id='rso']/div[{i}]/div/div/div[1]/div/div/span/a/h3").text
link = driver.find_element(By.XPATH, f"//*[@id='rso']/div[{i}]/div/div/div[1]/div/div/span/a").get_attribute("href")
except:
try:
# Fallback XPath
name = driver.find_element(By.XPATH, f"//*[@id='rso']/div[{i}]/div/div/div/div[1]/div/div/span/a/h3").text
link = driver.find_element(By.XPATH, f"//*[@id='rso']/div[{i}]/div/div/div/div[1]/div/div/span/a").get_attribute("href")
except Exception as e:
continue
# Data storage is added in Step 3; for now, just print the parsed fields
print("name:", name)
print("link:", link)
print("rank:", result_number + i)
logger.info(f"Successfully parsed data from: {url}")
success = True
except Exception as e:
logger.error(f"An error occurred while processing {url}: {e}")
tries += 1
if tries >= retries:
logger.error(f"Max retries exceeded for {url}")
else:
logger.info(f"Retrying {url} ({tries}/{retries})")
time.sleep(2)
logger.info(f"Storage queue length after page {page_number}: {len(data_pipeline.storage_queue)}")
def start_scrape(keyword, pages, data_pipeline=None, retries=3):
for page in range(pages):
scrape_search_results(keyword, page, data_pipeline=data_pipeline, retries=retries)
if __name__ == "__main__":
MAX_RETRIES = 3
MAX_THREADS = 5
PAGES = 5
logger.info(f"Crawl starting...")
# INPUT ---> List of keywords to scrape
keyword_list = ["learn rust"]
aggregate_files = []
# Job Processes
for keyword in keyword_list:
filename = keyword.replace(" ", "-")
start_scrape(keyword, pages=PAGES, retries=MAX_RETRIES)
Step 3: Storing the Scraped Data
Once the data is extracted from Quora or the Google search results, it is essential to store it efficiently and avoid any duplicates. The storing process involves two classes: SearchData and DataPipeline.
These two classes work together to manage the data, ensure no duplicates are stored, and handle writing the data to CSV files.
Let’s dive into how these classes work and how they facilitate the storage of scraped data.
SearchData Class
The SearchData class represents a single scraped search result from Quora. Each instance of this class stores the name (title), URL, and rank of a Quora search result. Using this class ensures that the scraped data is structured and can be processed systematically.
Here is the structure of the SearchData class:
@dataclass
class SearchData:
name: str = ""
url: str = ""
rank: int = 0
def __post_init__(self):
self.check_string_fields()
def check_string_fields(self):
for field in fields(self):
value = getattr(self, field.name)
if isinstance(value, str):
# If the field is a string and is empty, give it a default value
if not value:
setattr(self, field.name, f"No {field.name}")
else:
# Strip leading/trailing whitespace
setattr(self, field.name, value.strip())
- Data Validation (check_string_fields): After initializing the object, the __post_init__ method checks the string fields (name and url) to ensure they are not empty. If a field is empty, it assigns a default value (No {field.name}), making sure that empty data doesn’t enter the pipeline (see the short example below).
- Rank: The rank of each search result is tracked. This is useful for sorting or prioritizing data during analysis.
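For example, constructing a SearchData object with an empty title shows the default-value behaviour (a small illustration, assuming the class definition above):
item = SearchData(name="", url="  https://www.quora.com/example-post  ", rank=3)
print(item.name)  # "No name" -- the empty string was replaced with a default
print(item.url)   # "https://www.quora.com/example-post" -- whitespace stripped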
When scraping search results from Google, each search result is parsed and stored as an instance of SearchData:
search_data = SearchData(
name=name,
url=link,
rank=result_number + i # Increment rank per result
)
DataPipeline Class
The DataPipeline class is responsible for managing the collected data and writing it to a CSV file. It performs the following tasks:
- Managing a storage queue: Holds the scraped data temporarily before writing it to the CSV file.
- Checking for duplicates: Prevents duplicate entries based on the name field.
- Saving to a CSV file: Writes the data to a CSV file once the storage queue reaches the defined limit or when the process ends.
Here’s a detailed breakdown of the DataPipeline class:
class DataPipeline:
def __init__(self, csv_filename="", storage_queue_limit=50):
self.names_seen = [] # Track names to avoid duplicates
self.storage_queue = [] # Temporary storage for scraped data
self.storage_queue_limit = storage_queue_limit # Limit before writing to CSV
self.csv_filename = csv_filename # Name of the CSV file
self.csv_file_open = False # Check if the file is open
def save_to_csv(self):
try:
self.csv_file_open = True
data_to_save = self.storage_queue.copy()
self.storage_queue.clear() # Clear the queue after copying
if not data_to_save:
return
keys = [field.name for field in fields(data_to_save[0])]
# Ensure the CSV filename is valid
valid_chars = "-_.() %s%s" % (string.ascii_letters, string.digits)
valid_filename = ''.join(c for c in self.csv_filename if c in valid_chars)
logger.info(valid_filename)
file_exists = os.path.isfile(valid_filename) and os.path.getsize(valid_filename) > 0
# Write the header if the file does not exist
if not file_exists:
with open(valid_filename, 'w', newline='') as output_file:
writer = csv.DictWriter(output_file, fieldnames=keys)
writer.writeheader()
# Append the data to the CSV
with open(valid_filename, mode="a", newline="", encoding="utf-8") as output_file:
writer = csv.DictWriter(output_file, fieldnames=keys)
for item in data_to_save:
writer.writerow(asdict(item))
self.csv_file_open = False
except Exception as e:
logger.error(f"Error saving CSV: {e}")
def is_duplicate(self, input_data):
if input_data.name in self.names_seen:
logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
return True
self.names_seen.append(input_data.name)
return False
def add_data(self, scraped_data):
if not self.is_duplicate(scraped_data):
self.storage_queue.append(scraped_data)
if len(self.storage_queue) >= self.storage_queue_limit and not self.csv_file_open:
self.save_to_csv()
def close_pipeline(self):
if self.csv_file_open:
time.sleep(3)
if self.storage_queue:
self.save_to_csv()
- Queue-based Storage (storage_queue): Scraped data is first stored in a temporary queue. Once the queue reaches the defined storage_queue_limit (e.g., 50 entries), the data is saved to a CSV file. This avoids frequent I/O operations and optimizes performance.
- Duplicate Handling (is_duplicate): Before adding new data to the queue, the is_duplicate method checks whether the data already exists by comparing the name field. If a duplicate is found, it logs a warning and skips the entry.
- CSV File Writing (save_to_csv): When the queue is full or when the scraping process is complete, the save_to_csv method is called to write the collected data to a CSV file. It also ensures that the filename is valid and does not contain any illegal characters.
- Closing the Pipeline (close_pipeline): When scraping is finished, the close_pipeline method ensures that any remaining data in the queue is written to the CSV file.
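Putting the two classes together, a minimal usage example looks like this (illustrative values only; the crawler below wires this up for you):
pipeline = DataPipeline(csv_filename="learn-rust.csv", storage_queue_limit=50)

# Duplicate names are dropped; everything else is queued and flushed to the CSV
pipeline.add_data(SearchData(name="How do I learn Rust?", url="https://www.quora.com/example", rank=1))
pipeline.add_data(SearchData(name="How do I learn Rust?", url="https://www.quora.com/example", rank=2))  # dropped as duplicate

pipeline.close_pipeline()  # writes any remaining queued rows to learn-rust.csv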
After adding these classes and creating a DataPipeline in main, the code becomes:
import os
import csv
import json
import logging
import time
import string
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, fields, asdict
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# Selenium configuration
# Set the path to your ChromeDriver
CHROMEDRIVER_PATH = 'chromedriver.exe' # Adjust this to the actual path if necessary
# Configure the service to use the specified driver
service = Service(CHROMEDRIVER_PATH)
# Setup Chrome options for headless browsing
options = Options()
options.add_argument("--headless")
options.add_argument("--disable-gpu") # Required for headless mode in some environments
options.add_argument("--no-sandbox") # Especially useful for Linux environments
options.add_argument("--disable-dev-shm-usage") # Helps with resource issues on some systems
@dataclass
class SearchData:
name: str = ""
url: str = ""
rank: int = 0
def __post_init__(self):
self.check_string_fields()
def check_string_fields(self):
for field in fields(self):
value = getattr(self, field.name)
if isinstance(value, str):
if not value:
setattr(self, field.name, f"No {field.name}")
else:
setattr(self, field.name, value.strip())
class DataPipeline:
def __init__(self, csv_filename="", storage_queue_limit=50):
self.names_seen = []
self.storage_queue = []
self.storage_queue_limit = storage_queue_limit
self.csv_filename = csv_filename
self.csv_file_open = False
def save_to_csv(self):
try:
self.csv_file_open = True
data_to_save = self.storage_queue.copy()
self.storage_queue.clear()
if not data_to_save:
return
keys = [field.name for field in fields(data_to_save[0])]
# Filter out invalid characters from the filename
valid_chars = "-_.() %s%s" % (string.ascii_letters, string.digits)
valid_filename = ''.join(c for c in self.csv_filename if c in valid_chars)
logger.info(valid_filename)
file_exists = (
os.path.isfile(valid_filename) and os.path.getsize(valid_filename) > 0
)
if not file_exists:
with open(valid_filename, 'w', newline='') as output_file:
writer = csv.DictWriter(output_file, fieldnames=keys)
writer.writeheader()
with open(
valid_filename, mode="a", newline="", encoding="utf-8"
) as output_file:
writer = csv.DictWriter(output_file, fieldnames=keys)
for item in data_to_save:
writer.writerow(asdict(item))
self.csv_file_open = False
except Exception as e:
logger.error(f"Error saving csv {e}")
def is_duplicate(self, input_data):
if input_data.name in self.names_seen:
logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
return True
self.names_seen.append(input_data.name)
return False
def add_data(self, scraped_data):
logger.info("adding data")
logger.info(scraped_data)
if not self.is_duplicate(scraped_data):
self.storage_queue.append(scraped_data)
if (
len(self.storage_queue) >= self.storage_queue_limit
and not self.csv_file_open
):
self.save_to_csv()
def close_pipeline(self):
if self.csv_file_open:
time.sleep(3)
if self.storage_queue:
self.save_to_csv()
def scrape_search_results(keyword, page_number, data_pipeline=None, retries=3):
# Use a context manager to ensure the driver is properly closed
with webdriver.Chrome(service=service, options=options) as driver:
formatted_keyword = keyword.replace(" ", "+")
result_number = page_number * 10
logger.info(f"page {page_number}")
url = f"https://www.google.com/search?q={formatted_keyword}%20site%3Aquora.com&start={result_number}"
success = False
tries = 0
while tries < retries and not success:
try:
driver.get(url)
logger.info(f"Accessing {url}")
# Use explicit wait to ensure elements are loaded
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.ID, "rso")))
# Extract search result cards
for i in range(1, 11):
try:
# Attempt primary XPath
name = driver.find_element(By.XPATH, f"//*[@id='rso']/div[{i}]/div/div/div[1]/div/div/span/a/h3").text
link = driver.find_element(By.XPATH, f"//*[@id='rso']/div[{i}]/div/div/div[1]/div/div/span/a").get_attribute("href")
except:
try:
# Fallback XPath
name = driver.find_element(By.XPATH, f"//*[@id='rso']/div[{i}]/div/div/div/div[1]/div/div/span/a/h3").text
link = driver.find_element(By.XPATH, f"//*[@id='rso']/div[{i}]/div/div/div/div[1]/div/div/span/a").get_attribute("href")
except Exception as e:
continue
search_data = SearchData(
name=name,
url=link,
rank=result_number + i # Increment rank per result
)
data_pipeline.add_data(search_data)
logger.info(f"Successfully parsed data from: {url}")
success = True
except Exception as e:
logger.error(f"An error occurred while processing {url}: {e}")
tries += 1
if tries >= retries:
logger.error(f"Max retries exceeded for {url}")
else:
logger.info(f"Retrying {url} ({tries}/{retries})")
time.sleep(2)
logger.info(f"Storage queue length after page {page_number}: {len(data_pipeline.storage_queue)}")
def start_scrape(keyword, pages, data_pipeline=None, retries=3):
for page in range(pages):
scrape_search_results(keyword, page, data_pipeline=data_pipeline, retries=retries)
if __name__ == "__main__":
MAX_RETRIES = 3
MAX_THREADS = 5
PAGES = 5
logger.info(f"Crawl starting...")
# INPUT ---> List of keywords to scrape
keyword_list = ["learn rust"]
aggregate_files = []
# Job Processes
for keyword in keyword_list:
filename = keyword.replace(" ", "-")
crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
start_scrape(keyword, PAGES, data_pipeline=crawl_pipeline, retries=MAX_RETRIES)
crawl_pipeline.close_pipeline()
aggregate_files.append(f"{filename}.csv")
Step 4: Adding Concurrency
Concurrency is essential when scraping large amounts of data. If every page is processed one after another, a single slow or failed page holds up everything behind it.
Running the page scrapes concurrently distributes the work across multiple threads, saving time and improving efficiency.
Use ThreadPoolExecutor to run concurrent scraping on multiple pages:
with ThreadPoolExecutor(max_workers=5) as executor:
executor.submit(scrape_search_results, keyword, page_number)
The start_scrape function would become:
def start_scrape(
keyword, pages, data_pipeline=None, max_threads=5, retries=3
):
with ThreadPoolExecutor(max_workers=max_threads) as executor:
futures = []
for page in range(pages):
# No need to pass the driver anymore, each thread will create its own
futures.append(
executor.submit(
scrape_search_results,
keyword,
page,
data_pipeline,
retries,
)
)
# Ensure all threads complete
for future in futures:
future.result() # This blocks until the thread finishes
The full code would be:
import os
import csv
import json
import logging
import time
import string
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, fields, asdict
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# Selenium configuration
# Set the path to your ChromeDriver
CHROMEDRIVER_PATH = 'chromedriver.exe' # Adjust this to the actual path if necessary
# Configure the service to use the specified driver
service = Service(CHROMEDRIVER_PATH)
# Setup Chrome options for headless browsing
options = Options()
options.add_argument("--headless")
options.add_argument("--disable-gpu") # Required for headless mode in some environments
options.add_argument("--no-sandbox") # Especially useful for Linux environments
options.add_argument("--disable-dev-shm-usage") # Helps with resource issues on some systems
@dataclass
class SearchData:
name: str = ""
url: str = ""
rank: int = 0
def __post_init__(self):
self.check_string_fields()
def check_string_fields(self):
for field in fields(self):
value = getattr(self, field.name)
if isinstance(value, str):
if not value:
setattr(self, field.name, f"No {field.name}")
else:
setattr(self, field.name, value.strip())
class DataPipeline:
def __init__(self, csv_filename="", storage_queue_limit=50):
self.names_seen = []
self.storage_queue = []
self.storage_queue_limit = storage_queue_limit
self.csv_filename = csv_filename
self.csv_file_open = False
def save_to_csv(self):
try:
self.csv_file_open = True
data_to_save = self.storage_queue.copy()
self.storage_queue.clear()
if not data_to_save:
return
keys = [field.name for field in fields(data_to_save[0])]
# Filter out invalid characters from the filename
valid_chars = "-_.() %s%s" % (string.ascii_letters, string.digits)
valid_filename = ''.join(c for c in self.csv_filename if c in valid_chars)
logger.info(valid_filename)
file_exists = (
os.path.isfile(valid_filename) and os.path.getsize(valid_filename) > 0
)
if not file_exists:
with open(valid_filename, 'w', newline='') as output_file:
writer = csv.DictWriter(output_file, fieldnames=keys)
writer.writeheader()
with open(
valid_filename, mode="a", newline="", encoding="utf-8"
) as output_file:
writer = csv.DictWriter(output_file, fieldnames=keys)
for item in data_to_save:
writer.writerow(asdict(item))
self.csv_file_open = False
except Exception as e:
logger.error(f"Error saving csv {e}")
def is_duplicate(self, input_data):
if input_data.name in self.names_seen:
logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
return True
self.names_seen.append(input_data.name)
return False
def add_data(self, scraped_data):
logger.info("adding data")
logger.info(scraped_data)
if not self.is_duplicate(scraped_data):
self.storage_queue.append(scraped_data)
if (
len(self.storage_queue) >= self.storage_queue_limit
and not self.csv_file_open
):
self.save_to_csv()
def close_pipeline(self):
if self.csv_file_open:
time.sleep(3)
if self.storage_queue:
self.save_to_csv()
def scrape_search_results(keyword, page_number, data_pipeline=None, retries=3):
# Use a context manager to ensure the driver is properly closed
with webdriver.Chrome(service=service, options=options) as driver:
formatted_keyword = keyword.replace(" ", "+")
result_number = page_number * 10
logger.info(f"page {page_number}")
url = f"https://www.google.com/search?q={formatted_keyword}%20site%3Aquora.com&start={result_number}"
success = False
tries = 0
while tries < retries and not success:
try:
driver.get(url)
logger.info(f"Accessing {url}")
# Use explicit wait to ensure elements are loaded
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.ID, "rso")))
# Extract search result cards
for i in range(1, 11):
try:
# Attempt primary XPath
name = driver.find_element(By.XPATH, f"//*[@id='rso']/div[{i}]/div/div/div[1]/div/div/span/a/h3").text
link = driver.find_element(By.XPATH, f"//*[@id='rso']/div[{i}]/div/div/div[1]/div/div/span/a").get_attribute("href")
except:
try:
# Fallback XPath
name = driver.find_element(By.XPATH, f"//*[@id='rso']/div[{i}]/div/div/div/div[1]/div/div/span/a/h3").text
link = driver.find_element(By.XPATH, f"//*[@id='rso']/div[{i}]/div/div/div/div[1]/div/div/span/a").get_attribute("href")
except Exception as e:
continue
search_data = SearchData(
name=name,
url=link,
rank=result_number + i # Increment rank per result
)
data_pipeline.add_data(search_data)
logger.info(f"Successfully parsed data from: {url}")
success = True
except Exception as e:
logger.error(f"An error occurred while processing {url}: {e}")
tries += 1
if tries >= retries:
logger.error(f"Max retries exceeded for {url}")
else:
logger.info(f"Retrying {url} ({tries}/{retries})")
time.sleep(2)
logger.info(f"Storage queue length after page {page_number}: {len(data_pipeline.storage_queue)}")
def start_scrape(
keyword, pages, data_pipeline=None, max_threads=5, retries=3
):
with ThreadPoolExecutor(max_workers=max_threads) as executor:
futures = []
for page in range(pages):
# No need to pass the driver anymore, each thread will create its own
futures.append(
executor.submit(
scrape_search_results,
keyword,
page,
data_pipeline,
retries,
)
)
# Ensure all threads complete
for future in futures:
future.result() # This blocks until the thread finishes
if __name__ == "__main__":
MAX_RETRIES = 3
MAX_THREADS = 5
PAGES = 5
logger.info(f"Crawl starting...")
# INPUT ---> List of keywords to scrape
keyword_list = ["learn rust"]
aggregate_files = []
# Job Processes
for keyword in keyword_list:
filename = keyword.replace(" ", "-")
crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
start_scrape(keyword, PAGES, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
crawl_pipeline.close_pipeline()
aggregate_files.append(f"{filename}.csv")
Step 5: Production Run
Once the crawler is complete, set PAGES to the desired number and initiate a production run. Tweak the following constants as needed:
MAX_THREADS = 5
PAGES = 5
The main function would be:
if __name__ == "__main__":
MAX_RETRIES = 3
MAX_THREADS = 5
PAGES = 5
logger.info(f"Crawl starting...")
## INPUT ---> List of keywords to scrape
keyword_list = ["learn rust"]
aggregate_files = []
## Job Processes
for keyword in keyword_list:
filename = keyword.replace(" ", "-")
crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
start_scrape(keyword, PAGES, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
crawl_pipeline.close_pipeline()
aggregate_files.append(f"{filename}.csv")
logger.info(f"Crawl complete.")
If everything goes well, the final result should be like this:
[Screenshot: CSV output produced by the crawl]
Crawling the results from Google took 16.469 seconds for 5 pages, i.e. 16.469 / 5 = 3.2938 seconds per page.
Build A Quora Scraper
Now, to scrape the answers from Quora, we will take the following steps:
- Read the CSV file generated from the Google search results.
- Open each post listed in the CSV file and parse the answer data from the post pages.
- Store the data.
- Add concurrency to steps 2 and 3 so multiple posts are processed at the same time.
- Run the scraper.
Step 1: Create Simple Answer Data Parser
The goal of this step is to scrape each Quora post and extract the main content, specifically the answers and relevant replies, while filtering out non-relevant data such as promoted or related responses.
The process_post function is responsible for visiting a Quora post, waiting for the content to load, and then extracting the answers. It uses Selenium to interact with the dynamically loaded elements on the Quora page.
Here's how it works:
def process_post(row, retries=3):
with webdriver.Chrome(service=service, options=options) as driver:
logger.info(f"Processing row: {row}")
url = row.get("url")
if not url:
logger.error(f"No URL found in row: {row}")
return
success = False
tries = 0
while tries < retries and not success:
try:
# Step 1: Open the URL and wait for the main content to load
driver.get(url)
logger.info(f"Accessing {url}")
wait = WebDriverWait(driver, 10)
main_content = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "div[id='mainContent']")))
# Step 2: Extract answer cards
answer_cards = main_content.find_elements(By.CSS_SELECTOR, "div.q-click-wrapper")
if not answer_cards:
logger.warning(f"No answer cards found at {url}")
# Step 3: Initialize a DataPipeline to store replies
if 'name' not in row:
logger.error(f"'name' key missing in row: {row}")
break
last_seen_name = ""
# Step 4: Loop through each answer card and extract name and reply
for answer_card in answer_cards:
try:
name_element = answer_card.find_element(By.CSS_SELECTOR, "div.q-relative")
name = name_element.text.replace("\n", "").strip()
reply_element = answer_card.find_element(By.CSS_SELECTOR, "div.spacing_log_answer_content")
reply = reply_element.text.strip()
# Filter out promoted content and related questions
if "Sponsored" in name:
continue
if "Related questions" in name:
break
if name == last_seen_name:
continue
last_seen_name = name
print("name:", name)
print("reply:", reply)
except Exception as e:
continue
success = True
except Exception as e:
logger.error(f"Exception thrown while processing {url}: {e}")
tries += 1
if tries >= retries:
logger.error(f"Max retries exceeded for {url}")
else:
logger.info(f"Retrying {url} ({tries}/{retries})")
time.sleep(2)
- Extracting Answers: Uses Selenium to wait for and locate the main content of a Quora post. It then extracts each individual answer (skipping promoted or irrelevant content) and stores it.
- Retries: Includes a retry mechanism to handle temporary failures, such as page load errors (a stripped-down version of this pattern is shown below).
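The retry logic follows a simple pattern that is worth knowing on its own; a stripped-down version, independent of the Quora-specific code, might look like this:
import time

def with_retries(action, retries=3, wait_seconds=2):
    # Run `action` until it succeeds or `retries` attempts are exhausted
    tries = 0
    while tries < retries:
        try:
            return action()
        except Exception:
            tries += 1
            if tries >= retries:
                raise  # give up after the final attempt
            time.sleep(wait_seconds)  # brief pause before the next attempt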
Step 2: Loading URLs To Scrape
Once you have scraped URLs from Google search results, you need to load these URLs into the scraper to process each Quora post.
This is handled by the process_results function, which loads URLs from a CSV file and calls process_post to scrape each post.
def process_results(csv_file, max_threads=5, retries=3):
with open(csv_file, newline="") as file:
reader = list(csv.DictReader(file))
logger.info(f"file opened")
for row in reader:
process_post(row, retries)
- Reading URLs: This function opens the CSV file generated by the search crawler, reads the URLs of the Quora posts, and processes them sequentially.
- Processing Each Post: For each URL, it calls the process_post function to scrape the content.
The full code would be:
import os
import csv
import json
import logging
import time
import string
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, fields, asdict
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# Selenium configuration
# Set the path to your ChromeDriver
CHROMEDRIVER_PATH = 'chromedriver.exe' # Adjust this to the actual path if necessary
# Configure the service to use the specified driver
service = Service(CHROMEDRIVER_PATH)
# Setup Chrome options for headless browsing
options = Options()
options.add_argument("--headless")
options.add_argument("--disable-gpu") # Required for headless mode in some environments
options.add_argument("--no-sandbox") # Especially useful for Linux environments
options.add_argument("--disable-dev-shm-usage") # Helps with resource issues on some systems
@dataclass
class SearchData:
name: str = ""
url: str = ""
rank: int = 0
def __post_init__(self):
self.check_string_fields()
def check_string_fields(self):
for field in fields(self):
value = getattr(self, field.name)
if isinstance(value, str):
if not value:
setattr(self, field.name, f"No {field.name}")
else:
setattr(self, field.name, value.strip())
class DataPipeline:
def __init__(self, csv_filename="", storage_queue_limit=50):
self.names_seen = []
self.storage_queue = []
self.storage_queue_limit = storage_queue_limit
self.csv_filename = csv_filename
self.csv_file_open = False
def save_to_csv(self):
try:
self.csv_file_open = True
data_to_save = self.storage_queue.copy()
self.storage_queue.clear()
if not data_to_save:
return
keys = [field.name for field in fields(data_to_save[0])]
# Filter out invalid characters from the filename
valid_chars = "-_.() %s%s" % (string.ascii_letters, string.digits)
valid_filename = ''.join(c for c in self.csv_filename if c in valid_chars)
logger.info(valid_filename)
file_exists = (
os.path.isfile(valid_filename) and os.path.getsize(valid_filename) > 0
)
if not file_exists:
with open(valid_filename, 'w', newline='') as output_file:
writer = csv.DictWriter(output_file, fieldnames=keys)
writer.writeheader()
with open(
valid_filename, mode="a", newline="", encoding="utf-8"
) as output_file:
writer = csv.DictWriter(output_file, fieldnames=keys)
for item in data_to_save:
writer.writerow(asdict(item))
self.csv_file_open = False
except Exception as e:
logger.error(f"Error saving csv {e}")
def is_duplicate(self, input_data):
if input_data.name in self.names_seen:
logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
return True
self.names_seen.append(input_data.name)
return False
def add_data(self, scraped_data):
logger.info("adding data")
logger.info(scraped_data)
if not self.is_duplicate(scraped_data):
self.storage_queue.append(scraped_data)
if (
len(self.storage_queue) >= self.storage_queue_limit
and not self.csv_file_open
):
self.save_to_csv()
def close_pipeline(self):
if self.csv_file_open:
time.sleep(3)
if self.storage_queue:
self.save_to_csv()
def scrape_search_results(keyword, page_number, data_pipeline=None, retries=3):
# Use a context manager to ensure the driver is properly closed
with webdriver.Chrome(service=service, options=options) as driver:
formatted_keyword = keyword.replace(" ", "+")
result_number = page_number * 10
logger.info(f"page {page_number}")
url = f"https://www.google.com/search?q={formatted_keyword}%20site%3Aquora.com&start={result_number}"
success = False
tries = 0
while tries < retries and not success:
try:
driver.get(url)
logger.info(f"Accessing {url}")
# Use explicit wait to ensure elements are loaded
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.ID, "rso")))
# Extract search result cards
for i in range(1, 11):
try:
# Attempt primary XPath
name = driver.find_element(By.XPATH, f"//*[@id='rso']/div[{i}]/div/div/div[1]/div/div/span/a/h3").text
link = driver.find_element(By.XPATH, f"//*[@id='rso']/div[{i}]/div/div/div[1]/div/div/span/a").get_attribute("href")
except:
try:
# Fallback XPath
name = driver.find_element(By.XPATH, f"//*[@id='rso']/div[{i}]/div/div/div/div[1]/div/div/span/a/h3").text
link = driver.find_element(By.XPATH, f"//*[@id='rso']/div[{i}]/div/div/div/div[1]/div/div/span/a").get_attribute("href")
except Exception as e:
continue
search_data = SearchData(
name=name,
url=link,
rank=result_number + i # Increment rank per result
)
data_pipeline.add_data(search_data)
logger.info(f"Successfully parsed data from: {url}")
success = True
except Exception as e:
logger.error(f"An error occurred while processing {url}: {e}")
tries += 1
if tries >= retries:
logger.error(f"Max retries exceeded for {url}")
else:
logger.info(f"Retrying {url} ({tries}/{retries})")
time.sleep(2)
logger.info(f"Storage queue length after page {page_number}: {len(data_pipeline.storage_queue)}")
def start_scrape(
keyword, pages, data_pipeline=None, max_threads=5, retries=3
):
with ThreadPoolExecutor(max_workers=max_threads) as executor:
futures = []
for page in range(pages):
# No need to pass the driver anymore, each thread will create its own
futures.append(
executor.submit(
scrape_search_results,
keyword,
page,
data_pipeline,
retries,
)
)
# Ensure all threads complete
for future in futures:
future.result() # This blocks until the thread finishes
def process_post(row, retries=3):
with webdriver.Chrome(service=service, options=options) as driver:
logger.info(f"Processing row: {row}")
url = row.get("url")
if not url:
logger.error(f"No URL found in row: {row}")
return
success = False
tries = 0
while tries < retries and not success:
try:
# Step 1: Open the URL and wait for the main content to load
driver.get(url)
logger.info(f"Accessing {url}")
wait = WebDriverWait(driver, 10)
main_content = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "div[id='mainContent']")))
# Step 2: Extract answer cards
answer_cards = main_content.find_elements(By.CSS_SELECTOR, "div.q-click-wrapper")
if not answer_cards:
logger.warning(f"No answer cards found at {url}")
# Step 3: Initialize a DataPipeline to store replies
if 'name' not in row:
logger.error(f"'name' key missing in row: {row}")
break
last_seen_name = ""
# Step 4: Loop through each answer card and extract name and reply
for answer_card in answer_cards:
try:
name_element = answer_card.find_element(By.CSS_SELECTOR, "div.q-relative")
name = name_element.text.replace("\n", "").strip()
reply_element = answer_card.find_element(By.CSS_SELECTOR, "div.spacing_log_answer_content")
reply = reply_element.text.strip()
# Filter out promoted content and related questions
if "Sponsored" in name:
continue
if "Related questions" in name:
break
if name == last_seen_name:
continue
last_seen_name = name
print("name:", name)
print("reply:", reply)
except Exception as e:
continue
success = True
except Exception as e:
logger.error(f"Exception thrown while processing {url}: {e}")
tries += 1
if tries >= retries:
logger.error(f"Max retries exceeded for {url}")
else:
logger.info(f"Retrying {url} ({tries}/{retries})")
time.sleep(2)
def process_results(csv_file, max_threads=5, retries=3):
with open(csv_file, newline="") as file:
reader = list(csv.DictReader(file))
logger.info(f"file opened")
for row in reader:
process_post(row, retries)
if __name__ == "__main__":
MAX_RETRIES = 3
MAX_THREADS = 5
PAGES = 5
logger.info(f"Crawl starting...")
# INPUT ---> List of keywords to scrape
keyword_list = ["learn rust"]
aggregate_files = []
# Job Processes
for keyword in keyword_list:
filename = keyword.replace(" ", "-")
crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
start_scrape(keyword, PAGES, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
crawl_pipeline.close_pipeline()
aggregate_files.append(f"{filename}.csv")
logger.info(f"Crawl complete.")
for file in aggregate_files:
process_results(file, max_threads=MAX_THREADS, retries=MAX_RETRIES)
Step 3: Storing the Scraped Data
The extracted data (such as answers and user names) is stored in a CSV file using the ReplyData class and the DataPipeline class.
Each scraped answer is stored as an instance of ReplyData, ensuring that the scraped content is well-structured.
ReplyData Class:
@dataclass
class ReplyData:
name: str = ""
reply: str = ""
def __post_init__(self):
self.check_string_fields()
def check_string_fields(self):
for field in fields(self):
value = getattr(self, field.name)
if isinstance(value, str):
if not value:
setattr(self, field.name, f"No {field.name}")
else:
setattr(self, field.name, value.strip())
- Data Structuring: The ReplyData class is used to store the name of the user and the content of their reply in a structured format.
- Field Validation: The check_string_fields method ensures that empty or malformed strings are handled by assigning a default value or removing unnecessary whitespace.
Each instance of ReplyData is passed to the DataPipeline class for storage in a CSV file, as shown in the short example below.
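Inside the answer loop, each extracted answer ends up in a per-post pipeline roughly like this (a condensed illustration of the full code below):
answer_pipeline = DataPipeline(csv_filename="How-do-I-learn-Rust.csv")
answer_pipeline.add_data(ReplyData(name="Example User", reply="Start with the official Rust book..."))
answer_pipeline.close_pipeline()  # flushes the replies to the per-post CSV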
The full code would be:
import os
import csv
import json
import logging
import time
import string
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, fields, asdict
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# Selenium configuration
# Set the path to your ChromeDriver
CHROMEDRIVER_PATH = 'chromedriver.exe' # Adjust this to the actual path if necessary
# Configure the service to use the specified driver
service = Service(CHROMEDRIVER_PATH)
# Setup Chrome options for headless browsing
options = Options()
options.add_argument("--headless")
options.add_argument("--disable-gpu") # Required for headless mode in some environments
options.add_argument("--no-sandbox") # Especially useful for Linux environments
options.add_argument("--disable-dev-shm-usage") # Helps with resource issues on some systems
options.headless = True # Runs Chrome in headless mode (without GUI)
@dataclass
class SearchData:
name: str = ""
url: str = ""
rank: int = 0
def __post_init__(self):
self.check_string_fields()
def check_string_fields(self):
for field in fields(self):
value = getattr(self, field.name)
if isinstance(value, str):
if not value:
setattr(self, field.name, f"No {field.name}")
else:
setattr(self, field.name, value.strip())
@dataclass
class ReplyData:
name: str = ""
reply: str = ""
def __post_init__(self):
self.check_string_fields()
def check_string_fields(self):
for field in fields(self):
value = getattr(self, field.name)
if isinstance(value, str):
if not value:
setattr(self, field.name, f"No {field.name}")
else:
setattr(self, field.name, value.strip())
class DataPipeline:
def __init__(self, csv_filename="", storage_queue_limit=50):
self.names_seen = []
self.storage_queue = []
self.storage_queue_limit = storage_queue_limit
self.csv_filename = csv_filename
self.csv_file_open = False
def save_to_csv(self):
try:
self.csv_file_open = True
data_to_save = self.storage_queue.copy()
self.storage_queue.clear()
if not data_to_save:
return
keys = [field.name for field in fields(data_to_save[0])]
# Filter out invalid characters from the filename
valid_chars = "-_.() %s%s" % (string.ascii_letters, string.digits)
valid_filename = ''.join(c for c in self.csv_filename if c in valid_chars)
logger.info(valid_filename)
file_exists = (
os.path.isfile(valid_filename) and os.path.getsize(valid_filename) > 0
)
if not file_exists:
with open(valid_filename, 'w', newline='') as output_file:
writer = csv.DictWriter(output_file, fieldnames=keys)
writer.writeheader()
with open(
valid_filename, mode="a", newline="", encoding="utf-8"
) as output_file:
writer = csv.DictWriter(output_file, fieldnames=keys)
for item in data_to_save:
writer.writerow(asdict(item))
self.csv_file_open = False
except Exception as e:
logger.error(f"Error saving csv {e}")
def is_duplicate(self, input_data):
if input_data.name in self.names_seen:
logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
return True
self.names_seen.append(input_data.name)
return False
def add_data(self, scraped_data):
logger.info("adding data")
logger.info(scraped_data)
if not self.is_duplicate(scraped_data):
self.storage_queue.append(scraped_data)
if (
len(self.storage_queue) >= self.storage_queue_limit
and not self.csv_file_open
):
self.save_to_csv()
def close_pipeline(self):
if self.csv_file_open:
time.sleep(3)
if self.storage_queue:
self.save_to_csv()
def scrape_search_results(keyword, page_number, data_pipeline=None, retries=3):
# Use a context manager to ensure the driver is properly closed
with webdriver.Chrome(service=service, options=options) as driver:
formatted_keyword = keyword.replace(" ", "+")
result_number = page_number * 10
logger.info(f"page {page_number}")
url = f"https://www.google.com/search?q={formatted_keyword}%20site%3Aquora.com&start={result_number}"
success = False
tries = 0
while tries < retries and not success:
try:
driver.get(url)
logger.info(f"Accessing {url}")
# Use explicit wait to ensure elements are loaded
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.ID, "rso")))
# Extract search result cards
for i in range(1, 11):
try:
# Attempt primary XPath
name = driver.find_element(By.XPATH, f"//*[@id='rso']/div[{i}]/div/div/div[1]/div/div/span/a/h3").text
link = driver.find_element(By.XPATH, f"//*[@id='rso']/div[{i}]/div/div/div[1]/div/div/span/a").get_attribute("href")
except:
try:
# Fallback XPath
name = driver.find_element(By.XPATH, f"//*[@id='rso']/div[{i}]/div/div/div/div[1]/div/div/span/a/h3").text
link = driver.find_element(By.XPATH, f"//*[@id='rso']/div[{i}]/div/div/div/div[1]/div/div/span/a").get_attribute("href")
except Exception as e:
continue
search_data = SearchData(
name=name,
url=link,
rank=result_number + i # Increment rank per result
)
data_pipeline.add_data(search_data)
logger.info(f"Successfully parsed data from: {url}")
success = True
except Exception as e:
logger.error(f"An error occurred while processing {url}: {e}")
tries += 1
if tries >= retries:
logger.error(f"Max retries exceeded for {url}")
else:
logger.info(f"Retrying {url} ({tries}/{retries})")
time.sleep(2)
logger.info(f"Storage queue length after page {page_number}: {len(data_pipeline.storage_queue)}")
def start_scrape(
keyword, pages, data_pipeline=None, max_threads=5, retries=3
):
with ThreadPoolExecutor(max_workers=max_threads) as executor:
futures = []
for page in range(pages):
# No need to pass the driver anymore, each thread will create its own
futures.append(
executor.submit(
scrape_search_results,
keyword,
page,
data_pipeline,
retries,
)
)
# Ensure all threads complete
for future in futures:
future.result() # This blocks until the thread finishes
def process_post(row, retries=3):
with webdriver.Chrome(service=service, options=options) as driver:
logger.info(f"Processing row: {row}")
url = row.get("url")
if not url:
logger.error(f"No URL found in row: {row}")
return
logger.info(f"Processing URL: {url}")
success = False
tries = 0
while tries < retries and not success:
try:
driver.get(url)
logger.info(f"Accessing {url}")
# Use explicit wait to ensure main content is loaded
wait = WebDriverWait(driver, 10)
main_content = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "div[id='mainContent']")))
# Extract answer cards
answer_cards = main_content.find_elements(By.CSS_SELECTOR, "div.q-click-wrapper")
if not answer_cards:
logger.warning(f"No answer cards found at {url}")
# Initialize a new DataPipeline for replies
if 'name' not in row:
logger.error(f"'name' key missing in row: {row}")
break
answer_pipeline = DataPipeline(
csv_filename=f"{row['name'].replace(' ', '-')}.csv"
)
last_seen_name = ""
for answer_card in answer_cards:
try:
name_element = answer_card.find_element(By.CSS_SELECTOR, "div.q-relative")
name = name_element.text.replace("\n", "").strip()
reply_element = answer_card.find_element(By.CSS_SELECTOR, "div.spacing_log_answer_content")
reply = reply_element.text.strip()
if "Sponsored" in name:
continue
if "Related questions" in name:
break
if name == last_seen_name:
continue
last_seen_name = name
reply_data = ReplyData(name=name, reply=reply)
answer_pipeline.add_data(reply_data)
except Exception as e:
continue
answer_pipeline.close_pipeline()
success = True
except Exception as e:
logger.error(f"Exception thrown while processing {url}: {e}")
tries += 1
if tries >= retries:
logger.error(f"Max retries exceeded for {url}")
else:
logger.info(f"Retrying {url} ({tries}/{retries})")
time.sleep(2)
def process_results(csv_file, max_threads=5, retries=3):
with open(csv_file, newline="") as file:
reader = list(csv.DictReader(file))
logger.info(f"file opened")
for row in reader:
process_post(row, retries)
if __name__ == "__main__":
MAX_RETRIES = 3
MAX_THREADS = 5
PAGES = 5
logger.info(f"Crawl starting...")
# INPUT ---> List of keywords to scrape
keyword_list = ["learn rust"]
aggregate_files = []
# Job Processes
for keyword in keyword_list:
filename = keyword.replace(" ", "-")
crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
start_scrape(keyword, PAGES, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
crawl_pipeline.close_pipeline()
aggregate_files.append(f"{filename}.csv")
logger.info(f"Crawl complete.")
for file in aggregate_files:
process_results(file, max_threads=MAX_THREADS, retries=MAX_RETRIES)
Step 4: Adding Concurrency
To scrape multiple Quora posts concurrently and improve efficiency, you can modify the `process_results` function to use `ThreadPoolExecutor`. This allows the scraper to handle multiple posts at once, significantly speeding up the process.
from concurrent.futures import ThreadPoolExecutor
def process_results(csv_file, max_threads=5, retries=3):
with open(csv_file, newline="") as file:
reader = list(csv.DictReader(file))
logger.info(f"file opened")
with ThreadPoolExecutor(max_workers=max_threads) as executor:
for row in reader:
executor.submit(process_post, row, retries)
- Threading: `ThreadPoolExecutor` runs a pool of worker threads, allowing the scraper to process several Quora posts simultaneously.
- Concurrency: The `max_workers` parameter defines the number of threads running concurrently. Each thread calls the `process_post` function to handle a single Quora post (see the optional variant sketched below for surfacing worker errors).
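One caveat with `executor.submit` is that any exception raised inside `process_post` is swallowed unless you inspect the returned futures. If you want failures to show up in your logs, a slightly more defensive variant (an optional tweak, not required by the rest of the tutorial) collects the futures and calls `result()` on each:

from concurrent.futures import ThreadPoolExecutor, as_completed

def process_results(csv_file, max_threads=5, retries=3):
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))
    logger.info(f"Opened {csv_file} with {len(reader)} rows")

    with ThreadPoolExecutor(max_workers=max_threads) as executor:
        futures = [executor.submit(process_post, row, retries) for row in reader]
        for future in as_completed(futures):
            try:
                # result() re-raises any exception thrown inside process_post,
                # so failures are logged instead of disappearing silently.
                future.result()
            except Exception as e:
                logger.error(f"Worker failed: {e}")

The full code below keeps the simpler `submit`-only version.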
The full code would be:
import os
import csv
import json
import logging
import time
import string
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, fields, asdict
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# Selenium configuration
# Set the path to your ChromeDriver
CHROMEDRIVER_PATH = 'chromedriver.exe' # Adjust this to the actual path if necessary
# Configure the service to use the specified driver
service = Service(CHROMEDRIVER_PATH)
# Setup Chrome options for headless browsing
options = Options()
options.add_argument("--headless")
options.add_argument("--disable-gpu") # Required for headless mode in some environments
options.add_argument("--no-sandbox") # Especially useful for Linux environments
options.add_argument("--disable-dev-shm-usage") # Helps with resource issues on some systems
options.headless = True # Runs Chrome in headless mode (without GUI)
@dataclass
class SearchData:
name: str = ""
url: str = ""
rank: int = 0
def __post_init__(self):
self.check_string_fields()
def check_string_fields(self):
for field in fields(self):
value = getattr(self, field.name)
if isinstance(value, str):
if not value:
setattr(self, field.name, f"No {field.name}")
else:
setattr(self, field.name, value.strip())
@dataclass
class ReplyData:
name: str = ""
reply: str = ""
def __post_init__(self):
self.check_string_fields()
def check_string_fields(self):
for field in fields(self):
value = getattr(self, field.name)
if isinstance(value, str):
if not value:
setattr(self, field.name, f"No {field.name}")
else:
setattr(self, field.name, value.strip())
class DataPipeline:
def __init__(self, csv_filename="", storage_queue_limit=50):
self.names_seen = []
self.storage_queue = []
self.storage_queue_limit = storage_queue_limit
self.csv_filename = csv_filename
self.csv_file_open = False
def save_to_csv(self):
try:
self.csv_file_open = True
data_to_save = self.storage_queue.copy()
self.storage_queue.clear()
if not data_to_save:
return
keys = [field.name for field in fields(data_to_save[0])]
# Filter out invalid characters from the filename
valid_chars = "-_.() %s%s" % (string.ascii_letters, string.digits)
valid_filename = ''.join(c for c in self.csv_filename if c in valid_chars)
logger.info(valid_filename)
file_exists = (
os.path.isfile(valid_filename) and os.path.getsize(valid_filename) > 0
)
if not file_exists:
with open(valid_filename, 'w', newline='') as output_file:
writer = csv.DictWriter(output_file, fieldnames=keys)
writer.writeheader()
with open(
valid_filename, mode="a", newline="", encoding="utf-8"
) as output_file:
writer = csv.DictWriter(output_file, fieldnames=keys)
for item in data_to_save:
writer.writerow(asdict(item))
self.csv_file_open = False
except Exception as e:
logger.error(f"Error saving csv {e}")
def is_duplicate(self, input_data):
if input_data.name in self.names_seen:
logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
return True
self.names_seen.append(input_data.name)
return False
def add_data(self, scraped_data):
logger.info("adding data")
logger.info(scraped_data)
if not self.is_duplicate(scraped_data):
self.storage_queue.append(scraped_data)
if (
len(self.storage_queue) >= self.storage_queue_limit
and not self.csv_file_open
):
self.save_to_csv()
def close_pipeline(self):
if self.csv_file_open:
time.sleep(3)
if self.storage_queue:
self.save_to_csv()
def scrape_search_results(keyword, page_number, data_pipeline=None, retries=3):
# Use a context manager to ensure the driver is properly closed
with webdriver.Chrome(service=service, options=options) as driver:
formatted_keyword = keyword.replace(" ", "+")
result_number = page_number * 10
logger.info(f"page {page_number}")
url = f"https://www.google.com/search?q={formatted_keyword}%20site%3Aquora.com&start={result_number}"
success = False
tries = 0
while tries < retries and not success:
try:
driver.get(url)
logger.info(f"Accessing {url}")
# Use explicit wait to ensure elements are loaded
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.ID, "rso")))
# Extract search result cards
for i in range(1, 11):
try:
# Attempt primary XPath
name = driver.find_element(By.XPATH, f"//*[@id='rso']/div[{i}]/div/div/div[1]/div/div/span/a/h3").text
link = driver.find_element(By.XPATH, f"//*[@id='rso']/div[{i}]/div/div/div[1]/div/div/span/a").get_attribute("href")
except:
try:
# Fallback XPath
name = driver.find_element(By.XPATH, f"//*[@id='rso']/div[{i}]/div/div/div/div[1]/div/div/span/a/h3").text
link = driver.find_element(By.XPATH, f"//*[@id='rso']/div[{i}]/div/div/div/div[1]/div/div/span/a").get_attribute("href")
except Exception as e:
continue
search_data = SearchData(
name=name,
url=link,
rank=result_number + i # Increment rank per result
)
data_pipeline.add_data(search_data)
logger.info(f"Successfully parsed data from: {url}")
success = True
except Exception as e:
logger.error(f"An error occurred while processing {url}: {e}")
tries += 1
if tries >= retries:
logger.error(f"Max retries exceeded for {url}")
else:
logger.info(f"Retrying {url} ({tries}/{retries})")
time.sleep(2)
logger.info(f"Storage queue length after page {page_number}: {len(data_pipeline.storage_queue)}")
def start_scrape(
keyword, pages, data_pipeline=None, max_threads=5, retries=3
):
with ThreadPoolExecutor(max_workers=max_threads) as executor:
futures = []
for page in range(pages):
# No need to pass the driver anymore, each thread will create its own
futures.append(
executor.submit(
scrape_search_results,
keyword,
page,
data_pipeline,
retries,
)
)
# Ensure all threads complete
for future in futures:
future.result() # This blocks until the thread finishes
def process_post(row, retries=3):
with webdriver.Chrome(service=service, options=options) as driver:
logger.info(f"Processing row: {row}")
url = row.get("url")
if not url:
logger.error(f"No URL found in row: {row}")
return
logger.info(f"Processing URL: {url}")
success = False
tries = 0
while tries < retries and not success:
try:
driver.get(url)
logger.info(f"Accessing {url}")
# Use explicit wait to ensure main content is loaded
wait = WebDriverWait(driver, 10)
main_content = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "div[id='mainContent']")))
# Extract answer cards
answer_cards = main_content.find_elements(By.CSS_SELECTOR, "div.q-click-wrapper")
if not answer_cards:
logger.warning(f"No answer cards found at {url}")
# Initialize a new DataPipeline for replies
if 'name' not in row:
logger.error(f"'name' key missing in row: {row}")
break
answer_pipeline = DataPipeline(
csv_filename=f"{row['name'].replace(' ', '-')}.csv"
)
last_seen_name = ""
for answer_card in answer_cards:
try:
name_element = answer_card.find_element(By.CSS_SELECTOR, "div.q-relative")
name = name_element.text.replace("\n", "").strip()
reply_element = answer_card.find_element(By.CSS_SELECTOR, "div.spacing_log_answer_content")
reply = reply_element.text.strip()
if "Sponsored" in name:
continue
if "Related questions" in name:
break
if name == last_seen_name:
continue
last_seen_name = name
reply_data = ReplyData(name=name, reply=reply)
answer_pipeline.add_data(reply_data)
except Exception as e:
continue
answer_pipeline.close_pipeline()
success = True
except Exception as e:
logger.error(f"Exception thrown while processing {url}: {e}")
tries += 1
if tries >= retries:
logger.error(f"Max retries exceeded for {url}")
else:
logger.info(f"Retrying {url} ({tries}/{retries})")
time.sleep(2)
def process_results(csv_file, max_threads=5, retries=3):
with open(csv_file, newline="") as file:
reader = list(csv.DictReader(file))
logger.info(f"file opened")
with ThreadPoolExecutor(max_workers=max_threads) as executor:
for row in reader:
executor.submit(process_post, row, retries)
if __name__ == "__main__":
MAX_RETRIES = 3
MAX_THREADS = 5
PAGES = 5
logger.info(f"Crawl starting...")
# INPUT ---> List of keywords to scrape
keyword_list = ["learn rust"]
aggregate_files = []
# Job Processes
for keyword in keyword_list:
filename = keyword.replace(" ", "-")
crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
start_scrape(keyword, PAGES, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
crawl_pipeline.close_pipeline()
aggregate_files.append(f"{filename}.csv")
logger.info(f"Crawl complete.")
for file in aggregate_files:
process_results(file, max_threads=MAX_THREADS, retries=MAX_RETRIES)
Step 5: Production Run
Finally, when your scraper is ready to run on a larger dataset, you can execute the full scraping process. This includes scraping search results for multiple keywords, storing the data in CSV files, and processing the scraped URLs concurrently.
The `__main__` block would be:
if __name__ == "__main__":
MAX_RETRIES = 3
MAX_THREADS = 5
PAGES = 5
logger.info(f"Crawl starting...")
# INPUT ---> List of keywords to scrape
keyword_list = ["learn rust"]
aggregate_files = []
# Job Processes: Scraping Search Results
for keyword in keyword_list:
filename = keyword.replace(" ", "-")
crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
start_scrape(keyword, PAGES, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
crawl_pipeline.close_pipeline()
aggregate_files.append(f"{filename}.csv")
logger.info(f"Crawl complete.")
# Processing Scraped Quora Posts
for file in aggregate_files:
process_results(file, max_threads=MAX_THREADS, retries=MAX_RETRIES)
- Scraping Search Results: For each keyword in the `keyword_list`, the scraper collects search results from Google and stores them in a CSV file.
- Concurrent Processing: After collecting the URLs, the `process_results` function processes each Quora post concurrently, using multiple threads for efficiency.
- Parameters: You can adjust `MAX_RETRIES`, `MAX_THREADS`, and `PAGES` to fine-tune the scraper's performance. More threads will increase speed, but be mindful of server load and anti-bot measures. An example of tuned settings is shown below.
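For example, a bigger run might use settings like these (the values and extra keywords are illustrative, not recommendations from the guide):

# Illustrative production settings -- tune to your machine and proxy setup.
MAX_RETRIES = 5          # flaky pages get a few more attempts
MAX_THREADS = 4          # fewer threads is gentler on Google and Quora
PAGES = 10               # roughly 100 Google results per keyword
keyword_list = ["learn rust", "rust vs go", "rust web frameworks"]
# The crawl and processing loops from the main block above stay exactly the same.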
After running the code, if everything runs fine, you will get the following results:
[Screenshot 8: production run results]
The full run took 651.258 seconds, of which 16.469 seconds were spent crawling the Google results pages. That leaves 651.258 - 16.469 = 634.789 seconds for scraping the Quora posts themselves. Since we scraped 50 posts, that works out to 634.789 / 50 = 12.695 seconds per post.
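If you want to reproduce this kind of measurement yourself, one way (a sketch of ours, not something the scraper requires) is to wrap each phase of the main block with `time.time()`:

import time

crawl_start = time.time()
for keyword in keyword_list:
    filename = keyword.replace(" ", "-")
    crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
    start_scrape(keyword, PAGES, data_pipeline=crawl_pipeline,
                 max_threads=MAX_THREADS, retries=MAX_RETRIES)
    crawl_pipeline.close_pipeline()
    aggregate_files.append(f"{filename}.csv")
crawl_time = time.time() - crawl_start       # time spent on Google results

scrape_start = time.time()
for file in aggregate_files:
    process_results(file, max_threads=MAX_THREADS, retries=MAX_RETRIES)
scrape_time = time.time() - scrape_start     # time spent on the Quora posts

posts_scraped = 50  # set this to the number of rows in your crawl CSVs
logger.info(f"Crawl: {crawl_time:.3f}s, scrape: {scrape_time:.3f}s, "
            f"average {scrape_time / posts_scraped:.3f}s per post")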
Legal and Ethical Considerations
When scraping the web, you need to pay attention to your target site's Terms of Service and its `robots.txt` file. Legal or not, when you violate a site's terms, you can get suspended or even permanently banned.
Public data is typically free to scrape, but be cautious when dealing with private or gated content.
When scraping Quora, be mindful of their Terms of Service and review their `robots.txt` file. Ensure that your scraping activities do not violate legal or ethical guidelines.
When scraping private data, you are subject to the site's terms and the privacy laws of the site's jurisdiction. If you're not sure whether your scraper is legal, consult an attorney.
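If you want a quick programmatic sanity check (not legal advice), Python's standard-library `urllib.robotparser` can read Quora's `robots.txt` and tell you whether a given path is disallowed for your user agent; the example URL below is just a placeholder:

from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://www.quora.com/robots.txt")
robots.read()

url = "https://www.quora.com/How-do-I-learn-Rust"
# can_fetch() reports whether the given user agent may crawl the URL
if robots.can_fetch("*", url):
    print("Allowed by robots.txt")
else:
    print("Disallowed by robots.txt -- skip this URL")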
Conclusion
This guide walked you through building a robust Quora scraper using Python and Selenium. With scraping logic, pagination, and concurrency in place, you're now equipped to scrape Quora effectively. Be sure to follow ethical guidelines and monitor your scraper's performance.
If you'd like to learn more about the tech stack used in this article, check out the links below.
More Python Web Scraping Guides
Here at ScrapeOps, we've got a ton of learning resources. Whether you're brand new or a seasoned web developer, we've got something for you.
Check out our extensive Selenium Web Scraping Playbook and build something!
If you'd like to learn more from our "How To Scrape" series, take a look at the links below.