Then check out ScrapeOps, the complete toolkit for web scraping.

To run the full scraper, create a config.json file with your API key and then add the script below. When you run it with the keyword "pr" (Puerto Rico), it spits out a file called pr.csv. It then reads this file and creates an individual report on each house from pr.csv.
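If you haven't set up your config.json yet, here is a minimal sketch of what the script expects. The only key it reads is api_key; the placeholder value below is just an example.

import json

# Write a bare-bones config.json containing the api_key field the scraper reads.
# Replace the placeholder with your real ScrapeOps API key.
with open("config.json", "w") as config_file:
    json.dump({"api_key": "YOUR-SCRAPEOPS-API-KEY"}, config_file, indent=2)

With the config in place, here is the full scraper.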
import os
import csv
import requests
import json
import logging
import time
from urllib.parse import urlencode
from bs4 import BeautifulSoup
import concurrent.futures
from dataclasses import dataclass, field, fields, asdict

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]


def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location,
        "residential": True
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url


## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@dataclass
class SearchData:
    name: str = ""
    property_type: str = ""
    street_address: str = ""
    locality: str = ""
    region: str = ""
    postal_code: str = ""
    url: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())


@dataclass
class PropertyData:
    name: str = ""
    price: int = 0
    time_on_zillow: str = ""
    views: int = 0
    saves: int = 0

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())


class DataPipeline:
    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)
            if not file_exists:
                writer.writeheader()
            for item in data_to_save:
                writer.writerow(asdict(item))
        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if self.is_duplicate(scraped_data) == False:
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)
        if len(self.storage_queue) > 0:
            self.save_to_csv()


def scrape_search_results(keyword, location, page_number, data_pipeline=None, retries=3):
    formatted_keyword = keyword.replace(" ", "+")
    url = f"https://www.zillow.com/{keyword}/{page_number+1}_p/"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            scrapeops_proxy_url = get_scrapeops_url(url, location=location)
            response = requests.get(scrapeops_proxy_url)
            logger.info(f"Received [{response.status_code}] from: {url}")
            if response.status_code != 200:
                raise Exception(f"Failed request, Status Code {response.status_code}")

            ## Extract Data
            soup = BeautifulSoup(response.text, "html.parser")
            script_tags = soup.select("script[type='application/ld+json']")
            for script_tag in script_tags:
                json_data = json.loads(script_tag.text)
                if json_data["@type"] != "BreadcrumbList":
                    search_data = SearchData(
                        name=json_data["name"],
                        property_type=json_data["@type"],
                        street_address=json_data["address"]["streetAddress"],
                        locality=json_data["address"]["addressLocality"],
                        region=json_data["address"]["addressRegion"],
                        postal_code=json_data["address"]["postalCode"],
                        url=json_data["url"]
                    )
                    data_pipeline.add_data(search_data)

            logger.info(f"Successfully parsed data from: {url}")
            success = True

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")


def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_search_results,
            [keyword] * pages,
            [location] * pages,
            range(pages),
            [data_pipeline] * pages,
            [retries] * pages
        )


def process_property(row, location, retries=3):
    url = row["url"]
    tries = 0
    success = False

    while tries <= retries and not success:
        response = requests.get(get_scrapeops_url(url, location=location))
        try:
            if response.status_code == 200:
                logger.info(f"Status: {response.status_code}")
                soup = BeautifulSoup(response.text, "html.parser")
                price_holder = soup.select_one("span[data-testid='price']")
                price = int(price_holder.text.replace("$", "").replace(",", ""))
                info_holders = soup.select("dt")
                time_listed = info_holders[0].text
                views = int(info_holders[2].text.replace(",", ""))
                saves = info_holders[4].text

                property_pipeline = DataPipeline(csv_filename=f"{row['name']}.csv")
                property_data = PropertyData(
                    name=row["name"],
                    price=price,
                    time_on_zillow=time_listed,
                    views=views,
                    saves=saves
                )
                property_pipeline.add_data(property_data)
                property_pipeline.close_pipeline()
                success = True
            else:
                logger.warning(f"Failed Response: {response.status_code}")
                raise Exception(f"Failed Request, status code: {response.status_code}")

        except Exception as e:
            logger.error(f"Exception thrown: {e}")
            logger.warning(f"Failed to process page: {row['url']}")
            logger.warning(f"Retries left: {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")
    else:
        logger.info(f"Successfully parsed: {row['url']}")


def process_results(csv_file, location, max_threads=5, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
            executor.map(
                process_property,
                reader,
                [location] * len(reader),
                [retries] * len(reader)
            )


if __name__ == "__main__":
    MAX_RETRIES = 3
    MAX_THREADS = 5
    PAGES = 1
    LOCATION = "uk"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["pr"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")

        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")

    for file in aggregate_files:
        process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
To change your results, you can tweak any of the following constants inside main:

MAX_THREADS: determines the maximum number of threads used for concurrent scraping and processing.
MAX_RETRIES: sets the maximum number of retries for each request in case of failure (e.g., network issues, server errors).
PAGES: specifies the number of pages to scrape for each keyword. Each page contains multiple property listings.
LOCATION: defines the geographical location for the scraping. This parameter is used to adjust the proxy location to simulate requests from a specific country.
keyword_list: a list of keywords representing different geographical areas or search terms on Zillow. Each keyword triggers a separate scraping job ("pr" is Puerto Rico; if you want to do Michigan, add "mi").
Take a look at this Zillow search URL:

https://www.zillow.com/pr/2_p/

Here, pr is our location. Individual property pages use URLs like this one:

https://www.zillow.com/homedetails/459-Carr-Km-7-2-Int-Bo-Arenales-Aguadilla-PR-00603/363559698_zpid/

Back in the search URL, https://www.zillow.com/pr/2_p/, 2_p actually denotes our page number, 2. If we want to search for page 1, our URL is https://www.zillow.com/pr/1_p/, page 3 would use 3_p, and so on.

Our searches go into a keyword_list. In the keyword_list, we'll hold the locations we'd like to scrape.

When interacting with the ScrapeOps API, we'll pass in a country param as well. country will not have any effect on our actual search results; instead, it routes us through a server in whichever country we specify. For instance, if we want to appear in the US, we'd pass us in as our country.

To get started, create a new project folder and a virtual environment:

mkdir zillow-scraper
cd zillow-scraper

python -m venv venv
source venv/bin/activate

Then install our dependencies:

pip install requests
pip install beautifulsoup4
scrape_search_results()
function.Take a look at the script so far.import osimport csvimport requestsimport jsonimport loggingfrom urllib.parse import urlencodefrom bs4 import BeautifulSoupimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) def scrape_search_results(keyword, location, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://www.zillow.com/{keyword}/" tries = 0 success = False while tries <= retries and not success: try: response = requests.get(url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code != 200: raise Exception(f"Failed request, Status Code {response.status_code}") ## Extract Data soup = BeautifulSoup(response.text, "html.parser") script_tags = soup.select("script[type='application/ld+json']") for script_tag in script_tags: json_data = json.loads(script_tag.text) if json_data["@type"] != "BreadcrumbList": search_data = { "name": json_data["name"], "property_type": json_data["@type"], "street_address": json_data["address"]["streetAddress"], "locality": json_data["address"]["addressLocality"], "region": json_data["address"]["addressRegion"], "postal_code": json_data["address"]["postalCode"], "url": json_data["url"] } print(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 if not success: raise Exception(f"Max Retries exceeded: {retries}") if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 PAGES = 1 LOCATION = "uk" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["pr"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") scrape_search_results(keyword, LOCATION, retries=retries) logger.info(f"Crawl complete.")
We find our JSON blobs with this line:

script_tags = soup.select("script[type='application/ld+json']")

If a blob does not have a "@type" of "BreadcrumbList", we parse its data and pull out the following fields (the sketch after this list shows roughly how that lookup works):

name
property_type
street_address
locality
region
postal_code
url
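To make the field mapping concrete, here is an illustrative parse of one of those blobs. The JSON below is a hypothetical stand-in, not a real Zillow response; only the key names mirror what our scraper actually reads.

import json

# Hypothetical example of a listing blob found inside a script[type='application/ld+json'] tag.
sample_blob = """
{
  "@type": "SingleFamilyResidence",
  "name": "459 Carr Km 7.2 Int Bo Arenales",
  "address": {
    "streetAddress": "459 Carr Km 7.2 Int Bo Arenales",
    "addressLocality": "Aguadilla",
    "addressRegion": "PR",
    "postalCode": "00603"
  },
  "url": "https://www.zillow.com/homedetails/363559698_zpid/"
}
"""

json_data = json.loads(sample_blob)
# Skip breadcrumb metadata, keep actual listings.
if json_data["@type"] != "BreadcrumbList":
    print(json_data["name"], json_data["address"]["postalCode"], json_data["url"])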
Our updated URL format looks like this:

https://www.zillow.com/{keyword}/{page_number+1}_p/

keyword is the location we'd like to search. {page_number+1}_p denotes our page number. We use page_number+1 because we'll be using Python's range() function to create our page list. range() starts counting at zero and Zillow starts its pages at 1, so we add 1 to our page number when we pass it into the URL.

We also update our start_scrape() function to support the pagination we just added.

def start_scrape(keyword, pages, location, max_threads=5, retries=3):
    for page in range(pages):
        scrape_search_results(keyword, location, page, retries=retries)
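As a quick sanity check, here is how range() maps onto Zillow's 1-indexed page URLs. This is just a throwaway snippet, not part of the scraper.

# range(3) yields 0, 1, 2 -> pages 1_p, 2_p, 3_p
keyword = "pr"
for page_number in range(3):
    print(f"https://www.zillow.com/{keyword}/{page_number+1}_p/")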
start_scrape()
function to support multiple pages, but all in all, our code isn't all that different.import osimport csvimport requestsimport jsonimport loggingfrom urllib.parse import urlencodefrom bs4 import BeautifulSoupimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) def scrape_search_results(keyword, location, page_number, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://www.zillow.com/{keyword}/{page_number+1}_p/" tries = 0 success = False while tries <= retries and not success: try: response = requests.get(url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code != 200: raise Exception(f"Failed request, Status Code {response.status_code}") ## Extract Data soup = BeautifulSoup(response.text, "html.parser") script_tags = soup.select("script[type='application/ld+json']") for script_tag in script_tags: json_data = json.loads(script_tag.text) if json_data["@type"] != "BreadcrumbList": search_data = { "name": json_data["name"], "property_type": json_data["@type"], "street_address": json_data["address"]["streetAddress"], "locality": json_data["address"]["addressLocality"], "region": json_data["address"]["addressRegion"], "postal_code": json_data["address"]["postalCode"], "url": json_data["url"] } print(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, pages, location, max_threads=5, retries=3): for page in range(pages): scrape_search_results(keyword, location, page, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 PAGES = 1 LOCATION = "uk" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["pr"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") start_scrape(keyword, PAGES, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES) logger.info(f"Crawl complete.")
To store our data properly, we first need a SearchData class. This class simply holds data.

@dataclass
class SearchData:
    name: str = ""
    property_type: str = ""
    street_address: str = ""
    locality: str = ""
    region: str = ""
    postal_code: str = ""
    url: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())
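Because of __post_init__(), empty string fields get replaced with placeholder text and stray whitespace gets stripped. A quick illustration, assuming the class above is already defined in your script or session:

# Empty fields become "No <field_name>"; padded strings get stripped.
item = SearchData(name="  123 Example St  ", region="PR")
print(item.name)         # "123 Example St"
print(item.postal_code)  # "No postal_code"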
Next, we need somewhere to send that data: our DataPipeline. It takes in a dataclass (such as SearchData) and pipes it to a CSV file. This pipeline filters out our duplicates and then saves the data to a CSV file. Additionally, our pipeline writes the file safely: if the CSV already exists, we append to it; otherwise, the pipeline creates it.

class DataPipeline:
    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)
            if not file_exists:
                writer.writeheader()
            for item in data_to_save:
                writer.writerow(asdict(item))
        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if self.is_duplicate(scraped_data) == False:
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)
        if len(self.storage_queue) > 0:
            self.save_to_csv()
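Here is a small, hypothetical usage sketch showing how the pipeline is meant to be driven; the filename is arbitrary and the SearchData class above is assumed to be defined.

# Feed a couple of items into a pipeline and flush them to disk.
pipeline = DataPipeline(csv_filename="example.csv")
pipeline.add_data(SearchData(name="123 Example St", region="PR"))
pipeline.add_data(SearchData(name="123 Example St", region="PR"))  # duplicate name, gets dropped
pipeline.add_data(SearchData(name="456 Sample Ave", region="PR"))
pipeline.close_pipeline()  # writes any queued rows to example.csv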
import osimport csvimport requestsimport jsonimport loggingfrom urllib.parse import urlencodefrom bs4 import BeautifulSoupimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" property_type: str = "" street_address: str = "" locality: str = "" region: str = "" postal_code: str = "" url: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, page_number, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://www.zillow.com/{keyword}/{page_number+1}_p/" tries = 0 success = False while tries <= retries and not success: try: response = requests.get(url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code != 200: raise Exception(f"Failed request, Status Code {response.status_code}") ## Extract Data soup = BeautifulSoup(response.text, "html.parser") script_tags = soup.select("script[type='application/ld+json']") for script_tag in script_tags: json_data = json.loads(script_tag.text) if json_data["@type"] != "BreadcrumbList": search_data = SearchData( name=json_data["name"], property_type=json_data["@type"], street_address=json_data["address"]["streetAddress"], locality=json_data["address"]["addressLocality"], region=json_data["address"]["addressRegion"], postal_code=json_data["address"]["postalCode"], url=json_data["url"] ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3): for page in range(pages): scrape_search_results(keyword, location, page, data_pipeline=data_pipeline, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 PAGES = 1 LOCATION = "uk" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["pr"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.")
To crawl multiple pages concurrently, we'll replace our for loop with ThreadPoolExecutor. We'll also add a max_threads argument to start_scrape(). Take a look at the snippet below.

def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_search_results,
            [keyword] * pages,
            [location] * pages,
            range(pages),
            [data_pipeline] * pages,
            [retries] * pages
        )
Pay attention to the arguments we pass into executor.map():

scrape_search_results is the function we'd like to run on each thread.
All of our other arguments get passed in as lists, which executor.map() then passes into scrape_search_results()
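If executor.map() with multiple iterables looks unfamiliar, it simply zips the iterables together and hands one element from each to every call. A minimal, unrelated example:

import concurrent.futures

def show(page, keyword, retries):
    print(f"page={page}, keyword={keyword}, retries={retries}")

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    # Equivalent to calling show(0, "pr", 3), show(1, "pr", 3), show(2, "pr", 3) across threads.
    executor.map(show, range(3), ["pr"] * 3, [3] * 3)

With that in mind, here is our full code up to this point.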
import osimport csvimport requestsimport jsonimport loggingfrom urllib.parse import urlencodefrom bs4 import BeautifulSoupimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" property_type: str = "" street_address: str = "" locality: str = "" region: str = "" postal_code: str = "" url: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, page_number, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://www.zillow.com/{keyword}/{page_number+1}_p/" tries = 0 success = False while tries <= retries and not success: try: response = requests.get(url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code != 200: raise Exception(f"Failed request, Status Code {response.status_code}") ## Extract Data soup = BeautifulSoup(response.text, "html.parser") script_tags = soup.select("script[type='application/ld+json']") for script_tag in script_tags: json_data = json.loads(script_tag.text) if json_data["@type"] != "BreadcrumbList": search_data = SearchData( name=json_data["name"], property_type=json_data["@type"], street_address=json_data["address"]["streetAddress"], locality=json_data["address"]["addressLocality"], region=json_data["address"]["addressRegion"], postal_code=json_data["address"]["postalCode"], url=json_data["url"] ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, [keyword] * pages, [location] * pages, range(pages), [data_pipeline] * pages, [retries] * pages ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 PAGES = 1 LOCATION = "uk" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["pr"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.")
To get past Zillow's anti-bot measures, we route our requests through the ScrapeOps Proxy. The function below takes any URL and returns a ScrapeOps proxied version of it.

def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location,
        "residential": True
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url

Pay attention to the payload we send to ScrapeOps:

"api_key": our ScrapeOps API key.
"url": the url of the site we'd like to scrape.
"country": the country we'd like to be routed through.
"residential": a boolean value. If we set this to True
, we're telling ScrapeOps to give us a residential IP address which decreases our likelihood of getting blocked.import osimport csvimport requestsimport jsonimport loggingfrom urllib.parse import urlencodefrom bs4 import BeautifulSoupimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, "residential": True } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" property_type: str = "" street_address: str = "" locality: str = "" region: str = "" postal_code: str = "" url: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, page_number, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://www.zillow.com/{keyword}/{page_number+1}_p/" tries = 0 success = False while tries <= retries and not success: try: scrapeops_proxy_url = get_scrapeops_url(url, location=location) response = requests.get(scrapeops_proxy_url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code != 200: raise Exception(f"Failed request, Status Code {response.status_code}") ## Extract Data soup = BeautifulSoup(response.text, "html.parser") script_tags = soup.select("script[type='application/ld+json']") excluded_types = ["BreadcrumbList", "Event"] for script_tag in script_tags: json_data = json.loads(script_tag.text) if json_data["@type"] not in excluded_types: search_data = SearchData( name=json_data["name"], property_type=json_data["@type"], street_address=json_data["address"]["streetAddress"], locality=json_data["address"]["addressLocality"], region=json_data["address"]["addressRegion"], postal_code=json_data["address"]["postalCode"], url=json_data["url"] ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, [keyword] * pages, [location] * pages, range(pages), [data_pipeline] * pages, [retries] * pages ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 PAGES = 5 LOCATION = "us" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["pr"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.")
Now let's run our crawler in production. All we need to do is adjust the settings inside main.

if __name__ == "__main__":
    MAX_RETRIES = 3
    MAX_THREADS = 5
    PAGES = 5
    LOCATION = "us"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["pr"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")

        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")
This time, we set PAGES to 5 and LOCATION to "us". Feel free to change any of these constants in main to tweak your results. Here are our results.

Next, we need a function that parses an individual property page: process_property(). For the moment, it prints its results rather than saving them.

def process_property(row, location, retries=3):
    url = row["url"]
    tries = 0
    success = False

    while tries <= retries and not success:
        response = requests.get(url)
        try:
            if response.status_code == 200:
                logger.info(f"Status: {response.status_code}")
                soup = BeautifulSoup(response.text, "html.parser")
                price_holder = soup.select_one("span[data-testid='price']")
                price = int(price_holder.text.replace("$", "").replace(",", ""))
                info_holders = soup.select("dt")
                time_listed = info_holders[0].text
                views = int(info_holders[2].text.replace(",", ""))
                saves = info_holders[4].text

                property_data = {
                    "name": row["name"],
                    "price": price,
                    "time_on_zillow": time_listed,
                    "views": views,
                    "saves": saves
                }
                print(property_data)
                success = True
            else:
                logger.warning(f"Failed Response: {response.status_code}")
                raise Exception(f"Failed Request, status code: {response.status_code}")

        except Exception as e:
            logger.error(f"Exception thrown: {e}")
            logger.warning(f"Failed to process page: {row['url']}")
            logger.warning(f"Retries left: {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")
    else:
        logger.info(f"Successfully parsed: {row['url']}")
"span[data-testid='price']"
is the CSS selector of our price_holder
.int(price_holder.text.replace("$", "").replace(",", ""))
gives us our actual price and converts it to an integer.info_holders
with soup.select("dt")
time_listed
, views
, and saves
from the info_holders
array.for
each row in the file, we run process_property()
on that row. Later, we'll add concurrency to this function just like we did eariler.def process_results(csv_file, location, max_threads=5, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) for row in reader: process_property(row, location, retries=retries)
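Each row that csv.DictReader() yields is just a dict keyed by the crawler's column names, which is why process_property() can read row["name"] and row["url"]. A quick peek, assuming pr.csv already exists from the crawl:

import csv

# Print the name and url of each property the crawler saved.
with open("pr.csv", newline="") as file:
    for row in csv.DictReader(file):
        print(row["name"], row["url"])

Here is our full code up to this point.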
import osimport csvimport requestsimport jsonimport loggingfrom urllib.parse import urlencodefrom bs4 import BeautifulSoupimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, "residential": True } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" property_type: str = "" street_address: str = "" locality: str = "" region: str = "" postal_code: str = "" url: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, page_number, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://www.zillow.com/{keyword}/{page_number+1}_p/" tries = 0 success = False while tries <= retries and not success: try: scrapeops_proxy_url = get_scrapeops_url(url, location=location) response = requests.get(scrapeops_proxy_url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code != 200: raise Exception(f"Failed request, Status Code {response.status_code}") ## Extract Data soup = BeautifulSoup(response.text, "html.parser") script_tags = soup.select("script[type='application/ld+json']") excluded_types = ["BreadcrumbList", "Event"] for script_tag in script_tags: json_data = json.loads(script_tag.text) if json_data["@type"] not in excluded_types: search_data = SearchData( name=json_data["name"], property_type=json_data["@type"], street_address=json_data["address"]["streetAddress"], locality=json_data["address"]["addressLocality"], region=json_data["address"]["addressRegion"], postal_code=json_data["address"]["postalCode"], url=json_data["url"] ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, [keyword] * pages, [location] * pages, range(pages), [data_pipeline] * pages, [retries] * pages ) def process_property(row, location, retries=3): url = row["url"] tries = 0 success = False while tries <= retries and not success: response = requests.get(url) try: if response.status_code == 200: logger.info(f"Status: {response.status_code}") soup = BeautifulSoup(response.text, "html.parser") price_holder = soup.select_one("span[data-testid='price']") price = int(price_holder.text.replace("$", "").replace(",", "")) info_holders = soup.select("dt") time_listed = info_holders[0].text views = int(info_holders[2].text.replace(",", "")) saves = info_holders[4].text property_data = { "name": row["name"], "price": price, "time_on_zillow": time_listed, "views": views, "saves": saves } print(property_data) success = True else: logger.warning(f"Failed Response: {response.status_code}") raise Exception(f"Failed Request, status code: {response.status_code}") except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {row['url']}") logger.warning(f"Retries left: {retries-tries}") tries += 1 if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['url']}") def process_results(csv_file, location, max_threads=5, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = 
list(csv.DictReader(file)) for row in reader: process_property(row, location, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 PAGES = 1 LOCATION = "uk" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["pr"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") for file in aggregate_files: process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
In the code above, we open the CSV file and call process_property() on each row from the file.

To store this data properly, we need a PropertyData class. This class acts much like the SearchData class from before, and it also gets passed into a DataPipeline. Here is our PropertyData class.

@dataclass
class PropertyData:
    name: str = ""
    price: int = 0
    time_on_zillow: str = ""
    views: int = 0
    saves: int = 0

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())
Inside process_property(), we now create a PropertyData object and then pass it into a DataPipeline
.import osimport csvimport requestsimport jsonimport loggingfrom urllib.parse import urlencodefrom bs4 import BeautifulSoupimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, "residential": True } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" property_type: str = "" street_address: str = "" locality: str = "" region: str = "" postal_code: str = "" url: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) @dataclassclass PropertyData: name: str = "" price: int = 0 time_on_zillow: str = "" views: int = 0 saves: int = 0 def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, page_number, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://www.zillow.com/{keyword}/{page_number+1}_p/" tries = 0 success = False while tries <= retries and not success: try: scrapeops_proxy_url = get_scrapeops_url(url, location=location) response = requests.get(scrapeops_proxy_url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code != 200: raise Exception(f"Failed request, Status Code {response.status_code}") ## Extract Data soup = BeautifulSoup(response.text, "html.parser") script_tags = soup.select("script[type='application/ld+json']") excluded_types = ["BreadcrumbList", "Event"] for script_tag in script_tags: json_data = json.loads(script_tag.text) if json_data["@type"] not in excluded_types: search_data = SearchData( name=json_data["name"], property_type=json_data["@type"], street_address=json_data["address"]["streetAddress"], locality=json_data["address"]["addressLocality"], region=json_data["address"]["addressRegion"], postal_code=json_data["address"]["postalCode"], url=json_data["url"] ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, [keyword] * pages, [location] * pages, range(pages), [data_pipeline] * pages, [retries] * pages ) def process_property(row, location, retries=3): url = row["url"] tries = 0 success = False while tries <= retries and not success: response = requests.get(url) try: if response.status_code == 200: logger.info(f"Status: {response.status_code}") soup = BeautifulSoup(response.text, "html.parser") price_holder = soup.select_one("span[data-testid='price']") price = int(price_holder.text.replace("$", "").replace(",", "")) info_holders = soup.select("dt") time_listed = info_holders[0].text views = int(info_holders[2].text.replace(",", "")) saves = info_holders[4].text property_pipeline = DataPipeline(csv_filename=f"{row['name']}.csv") property_data = PropertyData( name=row["name"], price=price, time_on_zillow=time_listed, views=views, saves=saves ) property_pipeline.add_data(property_data) property_pipeline.close_pipeline() success = True else: logger.warning(f"Failed Response: {response.status_code}") raise Exception(f"Failed Request, status code: {response.status_code}") except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {row['url']}") logger.warning(f"Retries left: {retries-tries}") tries += 1 if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['url']}") def process_results(csv_file, location, max_threads=5, 
retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) for row in reader: process_property(row, location, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 PAGES = 1 LOCATION = "uk" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["pr"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") for file in aggregate_files: process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
To add concurrency, we once again replace a for loop with ThreadPoolExecutor. Take a look at the new function.

def process_results(csv_file, location, max_threads=5, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
            executor.map(
                process_property,
                reader,
                [location] * len(reader),
                [retries] * len(reader)
            )
This time, executor.map() takes the following arguments:

process_property is the function we want to run on each thread.
reader is the array of properties from our CSV file.
location and retries get passed in as arrays the same length as reader.

Finally, to route these property requests through the ScrapeOps Proxy, we change just one line of process_property():

response = requests.get(get_scrapeops_url(url, location=location))
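For reference, the wrapped URL is just the ScrapeOps endpoint with our parameters URL-encoded onto the query string. A rough, self-contained sketch of what get_scrapeops_url() builds (API key shortened to a placeholder):

from urllib.parse import urlencode

# Roughly what the proxied URL looks like for one of our search pages.
payload = {"api_key": "YOUR-KEY", "url": "https://www.zillow.com/pr/1_p/", "country": "us", "residential": True}
print("https://proxy.scrapeops.io/v1/?" + urlencode(payload))
# https://proxy.scrapeops.io/v1/?api_key=YOUR-KEY&url=https%3A%2F%2Fwww.zillow.com%2Fpr%2F1_p%2F&country=us&residential=True

With the proxy in place, here is our full code.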
import osimport csvimport requestsimport jsonimport loggingfrom urllib.parse import urlencodefrom bs4 import BeautifulSoupimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, "residential": True } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" property_type: str = "" street_address: str = "" locality: str = "" region: str = "" postal_code: str = "" url: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) @dataclassclass PropertyData: name: str = "" price: int = 0 time_on_zillow: str = "" views: int = 0 saves: int = 0 def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, page_number, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://www.zillow.com/{keyword}/{page_number+1}_p/" tries = 0 success = False while tries <= retries and not success: try: scrapeops_proxy_url = get_scrapeops_url(url, location=location) response = requests.get(scrapeops_proxy_url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code != 200: raise Exception(f"Failed request, Status Code {response.status_code}") ## Extract Data soup = BeautifulSoup(response.text, "html.parser") script_tags = soup.select("script[type='application/ld+json']") excluded_types = ["BreadcrumbList", "Events"] for script_tag in script_tags: json_data = json.loads(script_tag.text) if json_data["@type"] not in excluded_types: search_data = SearchData( name=json_data["name"], property_type=json_data["@type"], street_address=json_data["address"]["streetAddress"], locality=json_data["address"]["addressLocality"], region=json_data["address"]["addressRegion"], postal_code=json_data["address"]["postalCode"], url=json_data["url"] ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, [keyword] * pages, [location] * pages, range(pages), [data_pipeline] * pages, [retries] * pages ) def process_property(row, location, retries=3): url = row["url"] tries = 0 success = False while tries <= retries and not success: response = requests.get(get_scrapeops_url(url, location=location)) try: if response.status_code == 200: logger.info(f"Status: {response.status_code}") soup = BeautifulSoup(response.text, "html.parser") price_holder = soup.select_one("span[data-testid='price']") price = int(price_holder.text.replace("$", "").replace(",", "")) info_holders = soup.select("dt") time_listed = info_holders[0].text views = int(info_holders[2].text.replace(",", "")) saves = info_holders[4].text property_pipeline = DataPipeline(csv_filename=f"{row['name']}.csv") property_data = PropertyData( name=row["name"], price=price, time_on_zillow=time_listed, views=views, saves=saves ) property_pipeline.add_data(property_data) property_pipeline.close_pipeline() success = True else: logger.warning(f"Failed Response: {response.status_code}") raise Exception(f"Failed Request, status code: {response.status_code}") except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {row['url']}") logger.warning(f"Retries left: {retries-tries}") tries += 1 if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['url']}") def 
process_results(csv_file, location, max_threads=5, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( process_property, reader, [location] * len(reader), [retries] * len(reader) ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 PAGES = 1 LOCATION = "uk" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["pr"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") for file in aggregate_files: process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
Once again, we set our PAGES to 5 and our LOCATION to "us". Feel free to change any of the constants within main to tweak your results.

if __name__ == "__main__":
    MAX_RETRIES = 3
    MAX_THREADS = 5
    PAGES = 5
    LOCATION = "us"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["pr"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")

        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")

    for file in aggregate_files:
        process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
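As mentioned earlier, adding more locations is just a matter of extending keyword_list. For example, to crawl both Puerto Rico and Michigan:

## INPUT ---> List of keywords to scrape
# Each keyword gets its own crawl CSV: pr.csv and mi.csv in this case.
keyword_list = ["pr", "mi"]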
Zillow's Terms of Service and robots.txt, which you can view here, govern automated access to the site. It's important to note that violating these Terms could result in your account getting blocked or even permanently removed from the site.

When scraping, public data is generally considered legal throughout the world. Private data is any data that is gated behind a login or some other form of authentication. If you're not sure your scraper is legal, it's best to consult with an attorney who handles the jurisdiction of the site you're scraping.

Then check out ScrapeOps, the complete toolkit for web scraping.
Create a .env file and add your API key in this format:

SCRAPEOPS_API_KEY=your_api_key_here

Then create a new file called main.py
and insert this code into it:import osimport csvimport jsonimport loggingfrom urllib.parse import urlencodeimport concurrent.futuresfrom dataclasses import dataclass, fields, asdictimport timefrom dotenv import load_dotenvfrom selenium import webdriverfrom selenium.webdriver.common.by import Byfrom selenium.webdriver.support.ui import WebDriverWaitfrom selenium.webdriver.support import expected_conditions as ECfrom selenium.common.exceptions import TimeoutException, WebDriverException, NoSuchElementException load_dotenv() API_KEY = os.getenv("SCRAPEOPS_API_KEY") def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, "residential": True } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" property_type: str = "" street_address: str = "" locality: str = "" region: str = "" postal_code: str = "" url: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) @dataclassclass PropertyData: name: str = "" price: int = 0 time_on_zillow: str = "" views: int = 0 saves: int = 0 def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, page_number, data_pipeline=None, retries=3, timeout=10): url = f"https://www.zillow.com/{keyword}/{page_number+1}_p/" scrapeops_proxy_url = get_scrapeops_url(url, location=location) options = webdriver.ChromeOptions() options.add_argument('--headless') # Run in headless mode for attempt in range(retries): try: with webdriver.Chrome(options=options) as driver: driver.get(scrapeops_proxy_url) # Wait for the body to ensure page has started loading WebDriverWait(driver, timeout).until( EC.presence_of_element_located((By.TAG_NAME, "body")) ) # Wait for and find script elements script_elements = WebDriverWait(driver, timeout).until( EC.presence_of_all_elements_located((By.XPATH, "//script[@type='application/ld+json']")) ) for script in script_elements: json_data = json.loads(script.get_attribute('innerHTML')) if json_data["@type"] != "BreadcrumbList": search_data = SearchData( name=json_data["name"], property_type=json_data["@type"], street_address=json_data["address"]["streetAddress"], locality=json_data["address"]["addressLocality"], region=json_data["address"]["addressRegion"], postal_code=json_data["address"]["postalCode"], url=json_data["url"] ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") return # Success, exit the function except (TimeoutException, WebDriverException) as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, attempts left: {retries-attempt-1}") raise Exception(f"Max retries ({retries}) exceeded for URL: {url}") def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, [keyword] * pages, [location] * pages, range(pages), [data_pipeline] * pages, [retries] * pages ) def process_property(row, location, retries=3, timeout=10): url = row["url"] scrapeops_proxy_url = get_scrapeops_url(url, location=location) options = webdriver.ChromeOptions() options.add_argument('--headless') for attempt in range(retries): try: with webdriver.Chrome(options=options) as driver: driver.get(scrapeops_proxy_url) # Wait for the body to ensure page has started loading WebDriverWait(driver, timeout).until( EC.presence_of_element_located((By.TAG_NAME, "body")) ) # Extract price price_element = WebDriverWait(driver, timeout).until( EC.presence_of_element_located((By.CSS_SELECTOR, "span[data-testid='price']")) ) price = int(price_element.text.replace("$", "").replace(",", "")) # Extract other information info_elements = driver.find_elements(By.TAG_NAME, "dt") time_listed = info_elements[0].text if len(info_elements) > 0 else "No time listed" views = int(info_elements[2].text.replace(",", "")) if len(info_elements) > 2 else 0 saves = info_elements[4].text if len(info_elements) > 4 else "No saves" property_pipeline = DataPipeline(csv_filename=f"{row['name']}.csv") property_data = PropertyData( name=row["name"], price=price, time_on_zillow=time_listed, views=views, saves=saves ) 
property_pipeline.add_data(property_data) property_pipeline.close_pipeline() logger.info(f"Successfully parsed: {url}") return # Success, exit the function except (TimeoutException, WebDriverException, NoSuchElementException) as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, attempts left: {retries-attempt-1}") raise Exception(f"Max retries ({retries}) exceeded for URL: {url}") def process_results(csv_file, location, max_threads=5, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( process_property, reader, [location] * len(reader), [retries] * len(reader) ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 PAGES = 1 LOCATION = "uk" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["pr"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") for file in aggregate_files: process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
When you run the script, the crawl writes its results to pr.csv. The constants at the top of main control the job:

- MAX_THREADS: Determines the maximum number of threads used for concurrent scraping and processing.
- MAX_RETRIES: Sets the maximum number of retries for each request in case of failure (e.g., network issues, server errors).
- PAGES: Specifies the number of pages to scrape for each keyword. Each page contains multiple property listings.
- LOCATION: Defines the geographical location for the scraping. This parameter is used to adjust the proxy location to simulate requests from a specific country.
- keyword_list: A list of keywords representing different geographical areas or search terms on Zillow. Each keyword triggers a separate scraping job. ("pr" is Puerto Rico; if you want to do Michigan, add "mi".)

Each keyword's results are aggregated into a CSV file named after it, such as pr.csv.

While a simple HTTP client (like requests) can fetch a page's HTML, Zillow's anti-bot protections make that approach unreliable. Instead, we'll use Selenium to simulate a human browsing experience.

In the code, the following URL structure represents a search result page for Puerto Rico, with pagination handled by the number at the end:

https://www.zillow.com/pr/2_p/
Here, pr refers to the location (Puerto Rico), and 2_p specifies that we are on the second page of results.

Individual property pages, which the scraper visits later, look like this:

https://www.zillow.com/homedetails/459-Carr-Km-7-2-Int-Bo-Arenales-Aguadilla-PR-00603/363559698_zpid/
The search result pages embed their listing data in script elements of type application/ld+json, which contain the data we need. Here's how we would approach this in our scraper:

# Extract JSON data from search results
script_elements = WebDriverWait(driver, timeout).until(
    EC.presence_of_all_elements_located((By.XPATH, "//script[@type='application/ld+json']"))
)

for script in script_elements:
    json_data = json.loads(script.get_attribute('innerHTML'))
    if json_data["@type"] != "BreadcrumbList":
        # Extract relevant fields from JSON
        search_data = SearchData(
            name=json_data["name"],
            property_type=json_data["@type"],
            street_address=json_data["address"]["streetAddress"],
            locality=json_data["address"]["addressLocality"],
            region=json_data["address"]["addressRegion"],
            postal_code=json_data["address"]["postalCode"],
            url=json_data["url"]
        )
        data_pipeline.add_data(search_data)
On individual property pages, we read the price and engagement stats from the rendered page instead:

# Wait for the price element and extract its value
price_element = WebDriverWait(driver, timeout).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "span[data-testid='price']"))
)
price = int(price_element.text.replace("$", "").replace(",", ""))

# Extract other details such as time listed, views, and saves
info_elements = driver.find_elements(By.TAG_NAME, "dt")
time_listed = info_elements[0].text if len(info_elements) > 0 else "No time listed"
views = int(info_elements[2].text.replace(",", "")) if len(info_elements) > 2 else 0
saves = info_elements[4].text if len(info_elements) > 4 else "No saves"
Pagination follows a simple pattern; each page of results just increments the number before _p/:

https://www.zillow.com/pr/1_p/
https://www.zillow.com/pr/2_p/
https://www.zillow.com/pr/3_p/
Our scrape_search_results function handles pagination by building the URL from the page number it receives:

def scrape_search_results(keyword, location, page_number, data_pipeline=None, retries=3, timeout=10):
    url = f"https://www.zillow.com/{keyword}/{page_number+1}_p/"
    # The rest of the code follows to scrape data from this page
ScrapeOps also gives us a country param. country will not have any effect on our actual search results; instead it will route us through a server in whichever country we specify. For instance, if we want to appear in the US, we'd pass us in as our country. This helps bypass Zillow's geolocation blocks and improves the chances of successful scraping.

The get_scrapeops_url() function integrates this proxy service:

def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location,
        "residential": True
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url
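For instance, wrapping a search URL looks roughly like this (a sketch; the api_key value depends on your own key):

target = "https://www.zillow.com/pr/1_p/"
print(get_scrapeops_url(target, location="us"))
# https://proxy.scrapeops.io/v1/?api_key=...&url=https%3A%2F%2Fwww.zillow.com%2Fpr%2F1_p%2F&country=us&residential=True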
The location parameter determines which country we appear to be browsing from (e.g., us for the US). This doesn't change the search results but helps us avoid being blocked by Zillow's anti-scraping mechanisms.

To set up the project, start with a new folder:

mkdir <your_directory_name>
cd <your_directory_name>
python -m venv venv
source venv/bin/activate  # Linux
# OR
venv\Scripts\activate  # Windows
pip install selenium python-dotenv
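Before writing any scraping code, it's worth a quick smoke test of the Selenium install. This is just a sketch; it assumes Chrome is installed locally (recent Selenium releases resolve the matching driver automatically via Selenium Manager):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")

# Open a simple page and confirm the browser can be driven
with webdriver.Chrome(options=options) as driver:
    driver.get("https://www.example.com")
    print(driver.title)  # expected: "Example Domain"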
We'll use selenium to automate web browser interactions, and python-dotenv to securely manage sensitive information like login credentials or API keys by storing them in a separate .env file, which helps keep our main code clean and our secrets safe from accidental exposure.

The first function we'll write is scrape_search_results()
. This function handles the search result page scraping, extracting necessary details (like property URLs, addresses, and prices) and storing them in a CSV file.Here’s an outline:def scrape_search_results(keyword, location, retries=3, timeout=10): url = f"https://www.zillow.com/{keyword}/" scrapeops_proxy_url = get_scrapeops_url(url, location=location) options = webdriver.ChromeOptions() options.add_argument('--headless') for attempt in range(retries): try: with webdriver.Chrome(options=options) as driver: driver.get(scrapeops_proxy_url) # Wait for the body to ensure page has started loading WebDriverWait(driver, timeout).until( EC.presence_of_element_located((By.TAG_NAME, "body")) ) # Wait for and find script elements script_elements = WebDriverWait(driver, timeout).until( EC.presence_of_all_elements_located((By.XPATH, "//script[@type='application/ld+json']")) ) for script in script_elements: json_data = json.loads(script.get_attribute('innerHTML')) if json_data["@type"] != "BreadcrumbList": search_data = { "name": json_data["name"], "property_type": json_data["@type"], "street_address": json_data["address"]["streetAddress"], "locality": json_data["address"]["addressLocality"], "region": json_data["address"]["addressRegion"], "postal_code": json_data["address"]["postalCode"], "url": json_data["url"] } print(search_data) logger.info(f"Successfully parsed data from: {url}") return # Success, exit the function except (TimeoutException, WebDriverException) as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, attempts left: {retries-attempt-1}") raise Exception(f"Max retries ({retries}) exceeded for URL: {url}")
The function takes a search term (keyword) and, once we wire up pagination below, a page_number. It sends a GET request to fetch the page, extracts the data, and prints the data. We've also included a retry mechanism to handle potential errors during scraping.

Zillow uses a page suffix (_p/
) to navigate between search results. As discussed earlier, we need to increment the page number in our URL. Here's how:

def start_scrape(keyword, pages, location, retries=3):
    for page in range(pages):
        scrape_search_results(keyword, location, page, retries=retries)
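As a quick sanity check (a standalone sketch, not part of the scraper), you can print the URLs this produces for the first few pages and confirm the zero-based indices map to Zillow's one-based page numbers:

keyword = "pr"
for page in range(3):
    print(f"https://www.zillow.com/{keyword}/{page+1}_p/")
# https://www.zillow.com/pr/1_p/
# https://www.zillow.com/pr/2_p/
# https://www.zillow.com/pr/3_p/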
range(): We use Python's range() to handle multiple pages of results. Since Zillow pages start at 1 and range() starts at 0, we add +1 to the page number in the URL.

Next, we need a SearchData
class to structure the extracted information, like property address, price, and more.This ensures that we store data consistently in our CSV file.@dataclassclass SearchData: name: str = "" property_type: str = "" street_address: str = "" locality: str = "" region: str = "" postal_code: str = "" url: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip())
The class uses the @dataclass decorator for automatic method generation.

Next comes the DataPipeline
class that handles saving data to a csv file:class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv()
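Here's a minimal sketch (not from the tutorial) of how the pipeline is typically driven, using the SearchData class defined above:

# Queue records, dedupe by name, and flush to CSV
pipeline = DataPipeline(csv_filename="example.csv", storage_queue_limit=50)
pipeline.add_data(SearchData(name="123 Example St", region="PR"))
pipeline.add_data(SearchData(name="123 Example St", region="PR"))  # logged and dropped as a duplicate
pipeline.close_pipeline()  # writes anything still queued to example.csv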
We use the DataPipeline class to handle data efficiently and store it in CSV format:

- The save_to_csv method writes data to a CSV file. It creates new files or appends to existing ones as needed.
- It uses DictWriter for flexible field handling, writing headers for new files.
- The csv_file_open flag helps avoid concurrent CSV writes.
- The close_pipeline method saves any leftover data before shutdown.

To speed up the crawl, we run start_scrape with ThreadPoolExecutor
. This allows us to scrape several pages simultaneously, reducing overall runtime.

import concurrent.futures

def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_search_results,
            [keyword] * pages,
            [location] * pages,
            range(pages),
            [data_pipeline] * pages,
            [retries] * pages
        )
Key points:

- We use ThreadPoolExecutor to run multiple scraping tasks simultaneously.
- We map the scrape_search_results function across multiple threads.
- The level of concurrency is controlled by the max_threads parameter.

Every request is routed through the get_scrapeops_url()
function. This function generates a proxy URL to route our requests through servers in different regions, making it more difficult for Zillow to block our scrapers.

import os
from dotenv import load_dotenv
from urllib.parse import urlencode

load_dotenv()
API_KEY = os.getenv("SCRAPEOPS_API_KEY")

def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location,
        "residential": True
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url

print(get_scrapeops_url('https://zillow.com'))
By passing in a url and a location parameter (e.g., us), we ensure our requests are routed through a residential proxy, minimizing the risk of getting blocked by Zillow's anti-bot system.

Finally, we add a main
method to initiate the scraping process.

if __name__ == "__main__":

    MAX_RETRIES = 3
    MAX_THREADS = 5
    PAGES = 1
    LOCATION = "uk"

    logger.info(f"Crawl starting...")

    # INPUT ---> List of keywords to scrape
    keyword_list = ["pr"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")

        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")
In main, we:

- Create a DataPipeline for each keyword.
- Call start_scrape
to begin the concurrent scraping processimport osimport csvimport jsonimport loggingfrom urllib.parse import urlencodeimport concurrent.futuresfrom dataclasses import dataclass, fields, asdictimport timefrom dotenv import load_dotenvfrom selenium import webdriverfrom selenium.webdriver.common.by import Byfrom selenium.webdriver.support.ui import WebDriverWaitfrom selenium.webdriver.support import expected_conditions as ECfrom selenium.common.exceptions import TimeoutException, WebDriverException load_dotenv() API_KEY = os.getenv("SCRAPEOPS_API_KEY") def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, "residential": True } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" property_type: str = "" street_address: str = "" locality: str = "" region: str = "" postal_code: str = "" url: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, page_number, data_pipeline=None, retries=3, timeout=10): url = f"https://www.zillow.com/{keyword}/{page_number+1}_p/" scrapeops_proxy_url = get_scrapeops_url(url, location=location) options = webdriver.ChromeOptions() options.add_argument('--headless') # Run in headless mode for attempt in range(retries): try: with webdriver.Chrome(options=options) as driver: driver.get(scrapeops_proxy_url) # Wait for the body to ensure page has started loading WebDriverWait(driver, timeout).until( EC.presence_of_element_located((By.TAG_NAME, "body")) ) # Wait for and find script elements script_elements = WebDriverWait(driver, timeout).until( EC.presence_of_all_elements_located((By.XPATH, "//script[@type='application/ld+json']")) ) for script in script_elements: json_data = json.loads(script.get_attribute('innerHTML')) if json_data["@type"] != "BreadcrumbList": search_data = SearchData( name=json_data["name"], property_type=json_data["@type"], street_address=json_data["address"]["streetAddress"], locality=json_data["address"]["addressLocality"], region=json_data["address"]["addressRegion"], postal_code=json_data["address"]["postalCode"], url=json_data["url"] ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") return # Success, exit the function except (TimeoutException, WebDriverException) as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, attempts left: {retries-attempt-1}") raise Exception(f"Max retries ({retries}) exceeded for URL: {url}") def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, [keyword] * pages, [location] * pages, range(pages), [data_pipeline] * pages, [retries] * pages ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 PAGES = 1 LOCATION = "uk" logger.info(f"Crawl starting...") # INPUT ---> List of keywords to scrape keyword_list = ["pr"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.")
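At this point the crawler runs on its own and writes one CSV per keyword. The header row of that file simply mirrors the SearchData fields, which you can confirm with a one-liner (a throwaway sketch, not part of the scraper):

from dataclasses import fields

print(",".join(f.name for f in fields(SearchData)))
# name,property_type,street_address,locality,region,postal_code,url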
So far, we have built:

- get_scrapeops_url to use the ScrapeOps proxy, allowing us to bypass anti-bot measures.
- scrape_search_results to extract property data from Zillow pages using Selenium.
- start_scrape to handle multiple pages of search results (initially without concurrency).
- A SearchData class to structure our scraped information and a DataPipeline class to manage data storage and CSV writing.
- An updated start_scrape function with ThreadPoolExecutor to scrape multiple pages simultaneously, improving efficiency.

Every page we fetch, whether a search page or an individual listing, is still routed through the get_scrapeops_url() function:

scrapeops_proxy_url = get_scrapeops_url(url, location=location)

Next, we scrape the individual property pages with process_property():
def process_property(row, location, retries=3, timeout=10): url = row["url"] scrapeops_proxy_url = get_scrapeops_url(url, location=location) options = webdriver.ChromeOptions() options.add_argument('--headless') for attempt in range(retries): try: with webdriver.Chrome(options=options) as driver: driver.get(scrapeops_proxy_url) # Wait for the body to ensure page has started loading WebDriverWait(driver, timeout).until( EC.presence_of_element_located((By.TAG_NAME, "body")) ) # Extract price price_element = WebDriverWait(driver, timeout).until( EC.presence_of_element_located((By.CSS_SELECTOR, "span[data-testid='price']")) ) price = int(price_element.text.replace("$", "").replace(",", "")) # Extract other information info_elements = driver.find_elements(By.TAG_NAME, "dt") time_listed = info_elements[0].text if len(info_elements) > 0 else "No time listed" views = int(info_elements[2].text.replace(",", "")) if len(info_elements) > 2 else 0 saves = info_elements[4].text if len(info_elements) > 4 else "No saves" property_data = { 'name': row["name"], 'price': price, 'time_on_zillow': time_listed, 'views': views, 'saves': saves } print(property_data) logger.info(f"Successfully parsed: {url}") return # Success, exit the function except (TimeoutException, WebDriverException, NoSuchElementException) as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, attempts left: {retries-attempt-1}") raise Exception(f"Max retries ({retries}) exceeded for URL: {url}")
"span[data-testid='price']"
is the CSS selector for the price, and int(price_element.text.replace("$", "").replace(",", "")) cleans up and converts the price into an integer. We then pull time_listed, views, and saves out of the info_elements list by index.

To run this on every listing we crawled, we read the crawl's CSV file and call process_property()
. Later, we'll add concurrency to speed things up.

def process_results(csv_file, location, retries=3):
    with open(csv_file, newline="") as file:
        reader = csv.DictReader(file)
        for row in reader:
            # process_property expects the full CSV row (it reads row["url"] and row["name"])
            process_property(row, location, retries=retries)
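For example, assuming the crawl already produced pr.csv, this intermediate version would be invoked as:

process_results("pr.csv", location="us")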
Each row from the CSV gets passed into the process_property() function.

To store what we scrape from each listing, we add a PropertyData class. This class will be similar to the SearchData
class we used earlier but specific to the details scraped from individual property pages.

@dataclass
class PropertyData:
    name: str = ""
    price: int = 0
    time_on_zillow: str = ""
    views: int = 0
    saves: int = 0

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())
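A tiny illustration (hypothetical values, not from the article) of what the __post_init__ hook does to a half-filled record:

record = PropertyData(name="  123 Example St  ", price=250000)
print(record.name)            # "123 Example St" -- whitespace stripped
print(record.time_on_zillow)  # "No time_on_zillow" -- empty strings get default text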
To write these records to disk, we reuse the DataPipeline
we created in the crawler section as follows:class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() @dataclassclass PropertyData: name: str = "" price: int = 0 time_on_zillow: str = "" views: int = 0 saves: int = 0 def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. 
value = getattr(self, field.name) setattr(self, field.name, value.strip()) def process_property(row, location, retries=3, timeout=10): url = row["url"] scrapeops_proxy_url = get_scrapeops_url(url, location=location) options = webdriver.ChromeOptions() options.add_argument('--headless') for attempt in range(retries): try: with webdriver.Chrome(options=options) as driver: driver.get(scrapeops_proxy_url) # Wait for the body to ensure page has started loading WebDriverWait(driver, timeout).until( EC.presence_of_element_located((By.TAG_NAME, "body")) ) # Extract price price_element = WebDriverWait(driver, timeout).until( EC.presence_of_element_located((By.CSS_SELECTOR, "span[data-testid='price']")) ) price = int(price_element.text.replace("$", "").replace(",", "")) # Extract other information info_elements = driver.find_elements(By.TAG_NAME, "dt") time_listed = info_elements[0].text if len(info_elements) > 0 else "No time listed" views = int(info_elements[2].text.replace(",", "")) if len(info_elements) > 2 else 0 saves = info_elements[4].text if len(info_elements) > 4 else "No saves" property_pipeline = DataPipeline(csv_filename=f"{row['name']}.csv") property_data = PropertyData( name=row["name"], price=price, time_on_zillow=time_listed, views=views, saves=saves ) property_pipeline.add_data(property_data) property_pipeline.close_pipeline() logger.info(f"Successfully parsed: {url}") return # Success, exit the function except (TimeoutException, WebDriverException, NoSuchElementException) as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, attempts left: {retries-attempt-1}") raise Exception(f"Max retries ({retries}) exceeded for URL: {url}")
Now we update process_results to use ThreadPoolExecutor. This helps us speed up the process by running multiple process_property() calls concurrently.

def process_results(csv_file, location, max_threads=5, retries=3):
    logger.info(f"processing {csv_file}")

    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
            executor.map(
                process_property,
                reader,
                [location] * len(reader),
                [retries] * len(reader)
            )
executor.map() handles the parallel processing of the property rows. The process_property() function is called on each one, and the results are saved concurrently.

Time to run the finished scraper in production. In the snippet below, PAGES is set to 1 and LOCATION to "uk"; feel free to change any of the constants within main to tweak your results.

if __name__ == "__main__":

    MAX_RETRIES = 3
    MAX_THREADS = 5
    PAGES = 1
    LOCATION = "uk"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["pr"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")

        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")

    for file in aggregate_files:
        process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)

Here is the final script in full:
from dotenv import load_dotenvimport osfrom urllib.parse import urlencodefrom selenium import webdriverfrom selenium.webdriver.support.ui import WebDriverWaitfrom selenium.webdriver.support import expected_conditions as ECfrom selenium.webdriver.common.by import Byfrom selenium.common.exceptions import TimeoutException, WebDriverException, NoSuchElementExceptionfrom dataclasses import fields, asdict, dataclassimport csv import loggingimport timeimport concurrent.futuresimport json ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) # Load environment variablesload_dotenv()API_KEY = os.getenv("SCRAPEOPS_API_KEY") def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, "residential": True } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url @dataclassclass SearchData: name: str = "" property_type: str = "" street_address: str = "" locality: str = "" region: str = "" postal_code: str = "" url: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, page_number, data_pipeline=None, retries=3, timeout=10): url = f"https://www.zillow.com/{keyword}/{page_number+1}_p/" scrapeops_proxy_url = get_scrapeops_url(url, location=location) options = webdriver.ChromeOptions() options.add_argument('--headless') # Run in headless mode for attempt in range(retries): try: with webdriver.Chrome(options=options) as driver: driver.get(scrapeops_proxy_url) # Wait for the body to ensure page has started loading WebDriverWait(driver, timeout).until( EC.presence_of_element_located((By.TAG_NAME, "body")) ) # Wait for and find script elements script_elements = WebDriverWait(driver, timeout).until( EC.presence_of_all_elements_located((By.XPATH, "//script[@type='application/ld+json']")) ) for script in script_elements: json_data = json.loads(script.get_attribute('innerHTML')) if json_data["@type"] != "BreadcrumbList": search_data = SearchData( name=json_data["name"], property_type=json_data["@type"], street_address=json_data["address"]["streetAddress"], locality=json_data["address"]["addressLocality"], region=json_data["address"]["addressRegion"], postal_code=json_data["address"]["postalCode"], url=json_data["url"] ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") return # Success, exit the function except (TimeoutException, WebDriverException) as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, attempts left: {retries-attempt-1}") raise Exception(f"Max retries ({retries}) exceeded for URL: {url}") def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, [keyword] * pages, [location] * pages, range(pages), [data_pipeline] * pages, [retries] * pages ) @dataclassclass PropertyData: name: str = "" price: int = 0 time_on_zillow: str = "" views: int = 0 saves: int = 0 def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. 
value = getattr(self, field.name) setattr(self, field.name, value.strip()) def process_property(row, location, retries=3, timeout=10): url = row["url"] scrapeops_proxy_url = get_scrapeops_url(url, location=location) options = webdriver.ChromeOptions() options.add_argument('--headless') for attempt in range(retries): try: with webdriver.Chrome(options=options) as driver: driver.get(scrapeops_proxy_url) # Wait for the body to ensure page has started loading WebDriverWait(driver, timeout).until( EC.presence_of_element_located((By.TAG_NAME, "body")) ) # Extract price price_element = WebDriverWait(driver, timeout).until( EC.presence_of_element_located((By.CSS_SELECTOR, "span[data-testid='price']")) ) price = int(price_element.text.replace("$", "").replace(",", "")) # Extract other information info_elements = driver.find_elements(By.TAG_NAME, "dt") time_listed = info_elements[0].text if len(info_elements) > 0 else "No time listed" views = int(info_elements[2].text.replace(",", "")) if len(info_elements) > 2 else 0 saves = info_elements[4].text if len(info_elements) > 4 else "No saves" property_pipeline = DataPipeline(csv_filename=f"{row['name']}.csv") property_data = PropertyData( name=row["name"], price=price, time_on_zillow=time_listed, views=views, saves=saves ) property_pipeline.add_data(property_data) property_pipeline.close_pipeline() logger.info(f"Successfully parsed: {url}") return # Success, exit the function except (TimeoutException, WebDriverException, NoSuchElementException) as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, attempts left: {retries-attempt-1}") raise Exception(f"Max retries ({retries}) exceeded for URL: {url}") def process_results(csv_file, location, max_threads=5, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( process_property, reader, [location] * len(reader), [retries] * len(reader) ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 PAGES = 1 LOCATION = "uk" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["pr"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") logger.info(f"Scrape starting...") for file in aggregate_files: process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES) logger.info(f"Scrape complete.")
Run the finished script with:

python <your_script_name>.py
The crawl first produces pr.csv. The script then reads this file and creates an individual report on each house.

You can review Zillow's robots.txt file here, which outlines rules for automated access. Not following these guidelines could lead to account suspension or a permanent ban. Generally, scraping publicly available data is legal in many regions, but accessing private data, such as anything that requires a login or other authentication, requires permission. If you're unsure about the legal aspects of your scraping activities, it's advisable to seek legal advice from an attorney who is familiar with the laws in your area.