A Walmart search URL looks like this:

'https://www.walmart.com/search?q=ipad&sort=best_seller&page=1&affinityOverride=default'

- q stands for the search query. In our case, q=ipad. Note: if you want to search for a keyword that contains spaces or special characters, remember that you need to encode this value.
- sort stands for the sorting order of the query. In our case we used sort=best_seller, however other options are best_match, price_low and price_high.
- page stands for the page number. In our case, we've requested page=1.

To retrieve more products for the same keyword, you can also run the query with the other sort orders (sort=price_low and sort=price_high) and then filter the combined results for the unique values.
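As a rough illustration of that idea (a minimal sketch, not from the article), you could build one search URL per sort order with urllib.parse.urlencode and deduplicate whatever product URLs the pages return:

```python
from urllib.parse import urlencode

keyword = 'ipad'
sort_options = ['best_seller', 'best_match', 'price_low', 'price_high']

# Build one search URL per sort order; urlencode also handles keywords
# that contain spaces or special characters.
search_urls = [
    'https://www.walmart.com/search?' + urlencode({'q': keyword, 'sort': sort, 'page': 1})
    for sort in sort_options
]

# After scraping each URL, a set() is an easy way to keep only the unique product URLs.
unique_product_urls = set()
```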
Walmart returns its search results as JSON embedded inside the <script id="__NEXT_DATA__" type="application/json"> tag, so all we need to do is find this tag and parse it into JSON.

```html
<script id="__NEXT_DATA__" type="application/json" nonce="">"{ ...DATA... }"</script>
```
```python
product_list = json_blob["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["items"]
```
```python
import json
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlencode

def create_walmart_product_url(product):
    return 'https://www.walmart.com' + product.get('canonicalUrl', '').split('?')[0]

headers = {"User-Agent": "Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148"}
product_url_list = []

## Walmart Search Keyword
keyword = 'ipad'

## Loop Through Walmart Pages Until No More Products
for page in range(1, 5):
    try:
        payload = {'q': keyword, 'sort': 'best_seller', 'page': page, 'affinityOverride': 'default'}
        walmart_search_url = 'https://www.walmart.com/search?' + urlencode(payload)
        response = requests.get(walmart_search_url, headers=headers)

        if response.status_code == 200:
            html_response = response.text
            soup = BeautifulSoup(html_response, "html.parser")
            script_tag = soup.find("script", {"id": "__NEXT_DATA__"})

            if script_tag is not None:
                json_blob = json.loads(script_tag.get_text())
                product_list = json_blob["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["items"]
                product_urls = [create_walmart_product_url(product) for product in product_list]
                product_url_list.extend(product_urls)

                if len(product_urls) == 0:
                    break
    except Exception as e:
        print('Error', e)

print(product_url_list)
```
[ "https://www.walmart.com/ip/2021-Apple-10-2-inch-iPad-Wi-Fi-64GB-Space-Gray-9th-Generation/483978365", "https://www.walmart.com/ip/2021-Apple-iPad-Mini-Wi-Fi-64GB-Purple-6th-Generation/996045822", "https://www.walmart.com/ip/2022-Apple-10-9-inch-iPad-Air-Wi-Fi-64GB-Purple-5th-Generation/860872590", "https://www.walmart.com/ip/2021-Apple-11-inch-iPad-Pro-Wi-Fi-128GB-Space-Gray-3rd-Generation/354993710", "https://www.walmart.com/ip/2021-Apple-12-9-inch-iPad-Pro-Wi-Fi-128GB-Space-Gray-5th-Generation/774697337", "https://www.walmart.com/ip/2020-Apple-10-9-inch-iPad-Air-Wi-Fi-64GB-Sky-Blue-4th-Generation/462727496", "https://www.walmart.com/ip/2021-Apple-iPad-Mini-Wi-Fi-Cellular-64GB-Starlight-6th-Generation/406091219", "https://www.walmart.com/ip/2020-Apple-10-9-inch-iPad-Air-Wi-Fi-Cellular-64GB-Silver-4th-Generation/470306039", "https://www.walmart.com/ip/2022-Apple-10-9-inch-iPad-Air-Wi-Fi-Cellular-64GB-Blue-5th-Generation/234669711", "https://www.walmart.com/ip/2021-Apple-10-2-inch-iPad-Wi-Fi-Cellular-64GB-Space-Gray-9th-Generation/414515010", "https://www.walmart.com/ip/2021-Apple-11-inch-iPad-Pro-Wi-Fi-Cellular-128GB-Space-Gray-3rd-Generation/851470965", "https://www.walmart.com/ip/2021-Apple-12-9-inch-iPad-Pro-Wi-Fi-Cellular-256GB-Space-Gray-5th-Generation/169993514" ]
```python
def extract_product_data(product):
    return {
        'url': create_walmart_product_url(product),
        'name': product.get('name', ''),
        'description': product.get('description', ''),
        'image_url': product.get('image', ''),
        'average_rating': product['rating'].get('averageRating'),
        'number_reviews': product['rating'].get('numberOfReviews'),
    }

product_data_list = [extract_product_data(product) for product in product_list]
```
<script id="__NEXT_DATA__" type="application/json">
tag in the HTML response it is pretty easy to extract the data.<script id="__NEXT_DATA__" type="application/json" nonce="">"{ ...DATA... }"</script>
```python
product_data = json_blob["props"]["pageProps"]["initialData"]["data"]["product"]
product_reviews = json_blob["props"]["pageProps"]["initialData"]["data"]["reviews"]
```
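As a rough sketch (assuming json_blob has already been parsed from the __NEXT_DATA__ script tag, and that the keys match the paths above), you could pull both objects like this:

```python
# Hypothetical illustration: json_blob is the parsed __NEXT_DATA__ payload.
product_data = json_blob["props"]["pageProps"]["initialData"]["data"]["product"]
product_reviews = json_blob["props"]["pageProps"]["initialData"]["data"]["reviews"]

print(product_data.get("name"))

# The reviews object typically contains a customerReviews list; the exact
# fields can vary, so .get() calls are used defensively here.
for review in product_reviews.get("customerReviews", []):
    print(review.get("userNickname"), review.get("rating"), review.get("reviewText"))
```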
```python
import json
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlencode

headers = {"User-Agent": "Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148"}

product_data_list = []

## Loop Through Walmart Product URL List
for url in product_url_list:
    try:
        response = requests.get(url, headers=headers)

        if response.status_code == 200:
            html_response = response.text
            soup = BeautifulSoup(html_response, "html.parser")
            script_tag = soup.find("script", {"id": "__NEXT_DATA__"})

            if script_tag is not None:
                json_blob = json.loads(script_tag.get_text())
                raw_product_data = json_blob["props"]["pageProps"]["initialData"]["data"]["product"]
                product_data_list.append({
                    'id': raw_product_data.get('id'),
                    'type': raw_product_data.get('type'),
                    'name': raw_product_data.get('name'),
                    'brand': raw_product_data.get('brand'),
                    'averageRating': raw_product_data.get('averageRating'),
                    'manufacturerName': raw_product_data.get('manufacturerName'),
                    'shortDescription': raw_product_data.get('shortDescription'),
                    'thumbnailUrl': raw_product_data['imageInfo'].get('thumbnailUrl'),
                    'price': raw_product_data['priceInfo']['currentPrice'].get('price'),
                    'currencyUnit': raw_product_data['priceInfo']['currentPrice'].get('currencyUnit'),
                })
    except Exception as e:
        print('Error', e)

print(product_data_list)
```
[ { "id": "4SR8VU90LQ0P", "type": "Tablet Computers", "name": "2021 Apple 10.2-inch iPad Wi-Fi 64GB - Space Gray (9th Generation)", "brand": "Apple", "averageRating": 4.7, "manufacturerName": "Apple", "shortDescription": "Powerful. Easy to use. Versatile. The new iPad has a beautiful 10.2-inch Retina display, powerful A13 Bionic chip, an Ultra Wide front camera with Center Stage, and works with Apple Pencil and the Smart Keyboard. iPad lets you do more, more easily. All for an incredible value.<p></p>", "thumbnailUrl": "https://i5.walmartimages.com/asr/86cda84e-4f55-4ffa-954e-9ca5ae27b723.8a72a9690e1951f535eed412cc9e5fc3.jpeg", "price": 299, "currencyUnit": "USD" },]
```python
SCRAPEOPS_API_KEY = 'YOUR_API_KEY'

def scrapeops_url(url):
    payload = {'api_key': SCRAPEOPS_API_KEY, 'url': url, 'country': 'us'}
    proxy_url = 'https://proxy.scrapeops.io/v1/?' + urlencode(payload)
    return proxy_url

walmart_url = 'https://www.walmart.com/search?q=ipad&sort=best_seller&page=1&affinityOverride=default'

## Send URL To ScrapeOps Instead of Walmart
response = requests.get(scrapeops_url(walmart_url))
```
```python
import json
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlencode

SCRAPEOPS_API_KEY = 'YOUR_API_KEY'

def scrapeops_url(url):
    payload = {'api_key': SCRAPEOPS_API_KEY, 'url': url, 'country': 'us'}
    proxy_url = 'https://proxy.scrapeops.io/v1/?' + urlencode(payload)
    return proxy_url

product_data_list = []

## Loop Through Walmart Product URL List
for url in product_url_list:
    try:
        response = requests.get(scrapeops_url(url))

        if response.status_code == 200:
            html_response = response.text
            soup = BeautifulSoup(html_response, "html.parser")
            script_tag = soup.find("script", {"id": "__NEXT_DATA__"})

            if script_tag is not None:
                json_blob = json.loads(script_tag.get_text())
                raw_product_data = json_blob["props"]["pageProps"]["initialData"]["data"]["product"]
                product_data_list.append({
                    'id': raw_product_data.get('id'),
                    'type': raw_product_data.get('type'),
                    'name': raw_product_data.get('name'),
                    'brand': raw_product_data.get('brand'),
                    'averageRating': raw_product_data.get('averageRating'),
                    'manufacturerName': raw_product_data.get('manufacturerName'),
                    'shortDescription': raw_product_data.get('shortDescription'),
                    'thumbnailUrl': raw_product_data['imageInfo'].get('thumbnailUrl'),
                    'price': raw_product_data['priceInfo']['currentPrice'].get('price'),
                    'currencyUnit': raw_product_data['priceInfo']['currentPrice'].get('currencyUnit'),
                })
    except Exception as e:
        print('Error', e)

print(product_data_list)
```
To run our scraper, first create a config.json file with your ScrapeOps API key. Then run the script with python name_of_your_file.py.
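The scripts below read the key from that file, so config.json only needs an api_key entry. Here is a minimal sketch of one way to create it (the filename and key name match what the code expects; the key value is a placeholder):

```python
import json

# Write a placeholder ScrapeOps API key into config.json.
# Replace "YOUR-SCRAPEOPS-API-KEY" with your real key.
with open("config.json", "w") as config_file:
    json.dump({"api_key": "YOUR-SCRAPEOPS-API-KEY"}, config_file)
```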
Here is the full production code:

```python
import os
import csv
import requests
import json
import logging
import time
from urllib.parse import urlencode
from bs4 import BeautifulSoup

import concurrent.futures
from dataclasses import dataclass, field, fields, asdict

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]

def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location,
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url

## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@dataclass
class SearchData:
    name: str = ""
    stars: float = 0
    url: str = ""
    sponsored: bool = False
    price: float = 0.0
    product_id: int = 0

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())

@dataclass
class ReviewData:
    name: str = ""
    author_id: str = ""
    rating: int = 0
    date: str = ""
    review: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())

class DataPipeline:
    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)
            if not file_exists:
                writer.writeheader()
            for item in data_to_save:
                writer.writerow(asdict(item))

        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if self.is_duplicate(scraped_data) == False:
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)
        if len(self.storage_queue) > 0:
            self.save_to_csv()

def scrape_search_results(keyword, location, page_number, data_pipeline=None, retries=3):
    formatted_keyword = keyword.replace(" ", "+")
    url = f"https://www.walmart.com/search?q={formatted_keyword}&page={page_number+1}"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            scrapeops_proxy_url = get_scrapeops_url(url, location=location)
            response = requests.get(scrapeops_proxy_url)
            logger.info(f"Recieved [{response.status_code}] from: {url}")
            if response.status_code != 200:
                raise Exception(f"Failed to get page {page_number}, status code {response.status_code}")

            soup = BeautifulSoup(response.text, "html.parser")
            script_tag = soup.select_one("script[id='__NEXT_DATA__'][type='application/json']")
            json_data = json.loads(script_tag.text)
            item_list = json_data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["items"]

            for item in item_list:
                if item["__typename"] != "Product":
                    continue
                name = item.get("name")
                product_id = item["usItemId"]
                if not name:
                    continue
                link = f"https://www.walmart.com/reviews/product/{product_id}"
                price = item["price"]
                sponsored = item["isSponsoredFlag"]
                rating = item["averageRating"]

                search_data = SearchData(
                    name=name,
                    stars=rating,
                    url=link,
                    sponsored=sponsored,
                    price=price,
                    product_id=product_id
                )
                data_pipeline.add_data(search_data)

            logger.info(f"Successfully parsed data from: {url}")
            success = True

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")

def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_search_results,
            [keyword] * pages,
            [location] * pages,
            range(pages),
            [data_pipeline] * pages,
            [retries] * pages
        )

def process_item(row, location, retries=3):
    url = row["url"]
    tries = 0
    success = False

    while tries <= retries and not success:
        response = requests.get(get_scrapeops_url(url, location=location))
        try:
            if response.status_code == 200:
                logger.info(f"Status: {response.status_code}")
                soup = BeautifulSoup(response.text, "html.parser")

                review_pipeline = DataPipeline(csv_filename=f"{row['name'].replace(' ', '-').replace('/', '')}.csv")

                script_tag = soup.select_one("script[id='__NEXT_DATA__'][type='application/json']")
                json_data = json.loads(script_tag.text)
                review_list = json_data["props"]["pageProps"]["initialData"]["data"]["reviews"]["customerReviews"]

                for review in review_list:
                    name = review["userNickname"]
                    author_id = review["authorId"]
                    rating = review["rating"]
                    date = review["reviewSubmissionTime"]
                    review = review["reviewText"]

                    review_data = ReviewData(
                        name=name,
                        author_id=author_id,
                        rating=rating,
                        date=date,
                        review=review
                    )
                    review_pipeline.add_data(review_data)

                review_pipeline.close_pipeline()
                success = True

            else:
                logger.warning(f"Failed Response: {response.status_code}")
                raise Exception(f"Failed Request, status code: {response.status_code}")

        except Exception as e:
            logger.error(f"Exception thrown: {e}")
            logger.warning(f"Failed to process page: {row['url']}")
            logger.warning(f"Retries left: {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")
    else:
        logger.info(f"Successfully parsed: {row['url']}")

def process_results(csv_file, location, max_threads=5, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
            executor.map(
                process_item,
                reader,
                [location] * len(reader),
                [retries] * len(reader)
            )

if __name__ == "__main__":
    MAX_RETRIES = 3
    MAX_THREADS = 3
    PAGES = 4
    LOCATION = "us"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["laptop"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")

        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")

    for file in aggregate_files:
        process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
```
You can tweak the crawl by changing any of the following constants in main:

- MAX_RETRIES: Defines the maximum number of times the script will retry fetching a webpage if a request fails due to issues such as network timeouts or non-200 HTTP responses.
- MAX_THREADS: Sets the maximum number of threads that will be used concurrently while scraping.
- PAGES: The number of search result pages to scrape for each keyword.
- LOCATION: The location or country code where the products or reviews will be scraped from.
- keyword_list: A list of product keywords to search for on Walmart's website (e.g., ["laptop"]).

Our search URLs are built in this format:

https://www.walmart.com/search?q={formatted_keyword}

And each product's reviews are fetched from:

https://www.walmart.com/reviews/product/{product_id}
To paginate our results, we use the page parameter. If we want to view page 1, our URL would contain page=1. Our fully paginated URL would be:

https://www.walmart.com/search?q={formatted_keyword}&page={page_number+1}

We use page_number+1 because Python's range() function begins counting at 0.

For geolocation, we pass the country param when talking to the ScrapeOps Proxy API. When we pass this parameter, ScrapeOps will route us through the country of our choosing. To appear in the US, we pass "country": "us". To appear in the UK, we pass "country": "uk".
To set up the project, create a new folder, add a virtual environment, and install the dependencies:

```bash
mkdir walmart-scraper
cd walmart-scraper

python -m venv venv
source venv/bin/activate

pip install requests
pip install beautifulsoup4
```
We'll start by building a basic parsing function, scrape_search_results(). Here's the code we'll start with.

```python
import os
import csv
import requests
import json
import logging
from urllib.parse import urlencode
from bs4 import BeautifulSoup

import concurrent.futures
from dataclasses import dataclass, field, fields, asdict

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]

## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def scrape_search_results(keyword, location, retries=3):
    formatted_keyword = keyword.replace(" ", "+")
    url = f"https://www.walmart.com/search?q={formatted_keyword}"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            response = requests.get(url)
            logger.info(f"Recieved [{response.status_code}] from: {url}")
            if response.status_code != 200:
                raise Exception(f"Failed to get page, status code {response.status_code}")

            soup = BeautifulSoup(response.text, "html.parser")
            script_tag = soup.select_one("script[id='__NEXT_DATA__'][type='application/json']")
            json_data = json.loads(script_tag.text)
            item_list = json_data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["items"]

            for item in item_list:
                if item["__typename"] != "Product":
                    continue
                name = item.get("name")
                product_id = item["usItemId"]
                if not name:
                    continue
                link = f"https://www.walmart.com/reviews/product/{product_id}"
                price = item["price"]
                sponsored = item["isSponsoredFlag"]
                rating = item["averageRating"]

                search_data = {
                    "name": name,
                    "stars": rating,
                    "url": link,
                    "sponsored": sponsored,
                    "price": price,
                    "product_id": product_id
                }
                print(search_data)

            logger.info(f"Successfully parsed data from: {url}")
            success = True

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")

if __name__ == "__main__":
    MAX_RETRIES = 3
    MAX_THREADS = 3
    PAGES = 1
    LOCATION = "us"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["laptop"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")
        scrape_search_results(keyword, LOCATION, retries=MAX_RETRIES)
    logger.info(f"Crawl complete.")
```
soup.select_one("script[id='__NEXT_DATA__'][type='application/json']")
finds the JSON.json.loads()
to convert the text into a JSON object.json_data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["items"]
is used to access our item list."name"
: the name of the product."stars"
: the overall rating for the product."url"
: the link to the product's reviews."sponsored"
: whether or not it is a sponsored item, basically an ad."price"
: the price of the item."product_id"
: the unique number assigned to each product on the site.page
To paginate our results, we add a page parameter to our URL. Our URLs will now look like this. We use page_number+1 because our Walmart pages start at 1, but Python's range() function begins counting at 0.

"https://www.walmart.com/search?q={formatted_keyword}&page={page_number+1}"
Next, we add a start_scrape() function. This one takes in a page count and runs scrape_search_results() on each page in the range.

```python
def start_scrape(keyword, pages, location, retries=3):
    for page in range(pages):
        scrape_search_results(keyword, location, page, retries=retries)
```
import osimport csvimport requestsimport jsonimport loggingfrom urllib.parse import urlencodefrom bs4 import BeautifulSoup import concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) def scrape_search_results(keyword, location, page_number, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://www.walmart.com/search?q={formatted_keyword}&page={page_number+1}" tries = 0 success = False while tries <= retries and not success: try: response = requests.get(url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code != 200: raise Exception(f"Failed to get page {page_number}, status code {response.status_code}") soup = BeautifulSoup(response.text, "html.parser") script_tag = soup.select_one("script[id='__NEXT_DATA__'][type='application/json']") json_data = json.loads(script_tag.text) item_list = json_data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["items"] for item in item_list: if item["__typename"] != "Product": continue name = item.get("name") product_id = item["usItemId"] if not name: continue link = f"https://www.walmart.com/reviews/product/{product_id}" price = item["price"] sponsored = item["isSponsoredFlag"] rating = item["averageRating"] search_data = { "name": name, "stars": rating, "url": link, "sponsored": sponsored, "price": price, "product_id": product_id } print(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries += 1 if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, pages, location, retries=3): for page in range(pages): scrape_search_results(keyword, location, page, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 3 PAGES = 1 LOCATION = "us" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["laptop"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") start_scrape(keyword, PAGES, LOCATION, retries=MAX_RETRIES) logger.info(f"Crawl complete.")
We now pass a page argument to both our parsing function and our URL, and start_scrape() allows us to scrape a list of pages.

Next, we need a dataclass. We're going to call this one SearchData, because we'll use it to represent objects from our search results.

```python
@dataclass
class SearchData:
    name: str = ""
    stars: float = 0
    url: str = ""
    sponsored: bool = False
    price: float = 0.0
    product_id: int = 0

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())
```
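A quick, illustrative sketch (assuming the SearchData class above is in scope) of what the __post_init__ cleanup does:

```python
# Assumes the SearchData dataclass above has been defined.
item = SearchData(name="  Acer Aspire 5  ", stars=4.5, url="", price=399.0, product_id=123)
print(item.name)  # "Acer Aspire 5" -- leading/trailing whitespace is stripped
print(item.url)   # "No url" -- empty strings get a default value
```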
We also need a DataPipeline. This class opens up a pipe to a CSV file. On top of that, it does a couple of other important things: if the file doesn't exist, our class creates it; if the CSV does exist, it appends to it instead. Our DataPipeline also uses the name attribute to filter out duplicate data.

```python
class DataPipeline:
    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)
            if not file_exists:
                writer.writeheader()
            for item in data_to_save:
                writer.writerow(asdict(item))

        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if self.is_duplicate(scraped_data) == False:
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)
        if len(self.storage_queue) > 0:
            self.save_to_csv()
```
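Here's a small, illustrative usage sketch (assuming the SearchData and DataPipeline classes above are in scope, plus import time, which close_pipeline relies on):

```python
# Assumes SearchData, DataPipeline and their imports (including `import time`) are defined above.
pipeline = DataPipeline(csv_filename="laptop.csv")

item = SearchData(name="Acer Aspire 5", stars=4.5,
                  url="https://www.walmart.com/reviews/product/123",
                  price=399.0, product_id=123)
pipeline.add_data(item)
pipeline.add_data(item)      # duplicate name -> logged and dropped

pipeline.close_pipeline()    # flushes anything still queued to laptop.csv
```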
From our main section, we now open a DataPipeline and pass it into our crawling functions. Then, instead of printing our data, we turn it into SearchData and pass it into the DataPipeline
.import osimport csvimport requestsimport jsonimport loggingfrom urllib.parse import urlencodefrom bs4 import BeautifulSoup import concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" stars: float = 0 url: str = "" sponsored: bool = False price: float = 0.0 product_id: int = 0 def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, page_number, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://www.walmart.com/search?q={formatted_keyword}&page={page_number+1}" tries = 0 success = False while tries <= retries and not success: try: response = requests.get(url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code != 200: raise Exception(f"Failed to get page {page_number}, status code {response.status_code}") soup = BeautifulSoup(response.text, "html.parser") script_tag = soup.select_one("script[id='__NEXT_DATA__'][type='application/json']") json_data = json.loads(script_tag.text) item_list = json_data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["items"] for item in item_list: if item["__typename"] != "Product": continue name = item.get("name") product_id = item["usItemId"] if not name: continue link = f"https://www.walmart.com/reviews/product/{product_id}" price = item["price"] sponsored = item["isSponsoredFlag"] rating = item["averageRating"] search_data = SearchData( name=name, stars=rating, url=link, sponsored=sponsored, price=price, product_id=product_id ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries += 1 if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, pages, location, data_pipeline=None, retries=3): for page in range(pages): scrape_search_results(keyword, location, page, data_pipeline=data_pipeline, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 3 PAGES = 1 LOCATION = "us" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["laptop"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.")
In the code above:

- SearchData represents a search results object.
- DataPipeline is used to pipe dataclass (in this case, SearchData) objects to a CSV.

Next, we bring in ThreadPoolExecutor. This gives us the power of multithreading. We'll open up a new threadpool and then parse an individual page on each available thread. Take a look below: we've rewritten start_scrape() and removed the for loop.

```python
def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_search_results,
            [keyword] * pages,
            [location] * pages,
            range(pages),
            [data_pipeline] * pages,
            [retries] * pages
        )
```

Here is the full code with concurrency added.
import osimport csvimport requestsimport jsonimport loggingfrom urllib.parse import urlencodefrom bs4 import BeautifulSoup import concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" stars: float = 0 url: str = "" sponsored: bool = False price: float = 0.0 product_id: int = 0 def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, page_number, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://www.walmart.com/search?q={formatted_keyword}&page={page_number+1}" tries = 0 success = False while tries <= retries and not success: try: scrapeops_proxy_url = get_scrapeops_url(url, location=location) response = requests.get(scrapeops_proxy_url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code != 200: raise Exception(f"Failed to get page {page_number}, status code {response.status_code}") soup = BeautifulSoup(response.text, "html.parser") script_tag = soup.select_one("script[id='__NEXT_DATA__'][type='application/json']") json_data = json.loads(script_tag.text) item_list = json_data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["items"] for item in item_list: if item["__typename"] != "Product": continue name = item.get("name") product_id = item["usItemId"] if not name: continue link = f"https://www.walmart.com/reviews/product/{product_id}" price = item["price"] sponsored = item["isSponsoredFlag"] rating = item["averageRating"] search_data = SearchData( name=name, stars=rating, url=link, sponsored=sponsored, price=price, product_id=product_id ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries += 1 if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, [keyword] * pages, [location] * pages, range(pages), [data_pipeline] * pages, [retries] * pages ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 3 PAGES = 1 LOCATION = "us" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["laptop"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.")
We pass the following into executor.map():

- scrape_search_results: our parsing function.
- Arrays of the arguments we want to use on each call.

executor then passes the args from these arrays into each call of scrape_search_results.
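If the repeated-list pattern looks odd, here's a tiny standalone illustration (unrelated to Walmart) of how executor.map() zips those lists together:

```python
import concurrent.futures

def describe(keyword, location, page):
    return f"{keyword} / {location} / page {page}"

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    results = executor.map(
        describe,
        ["laptop"] * 3,   # same keyword for every call
        ["us"] * 3,       # same location for every call
        range(3)          # a different page number per call
    )
    print(list(results))  # ['laptop / us / page 0', 'laptop / us / page 1', 'laptop / us / page 2']
```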
To get past Walmart's anti-bot measures, we route our requests through the ScrapeOps Proxy API with a small helper function:

```python
def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location,
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url
```
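For instance (using a placeholder API key), the helper simply URL-encodes the payload onto the proxy endpoint, producing something like this:

```python
from urllib.parse import urlencode

# Illustration only: what the proxied URL looks like for a placeholder key.
payload = {
    "api_key": "YOUR-SCRAPEOPS-API-KEY",
    "url": "https://www.walmart.com/search?q=laptop&page=1",
    "country": "us",
}
print("https://proxy.scrapeops.io/v1/?" + urlencode(payload))
# https://proxy.scrapeops.io/v1/?api_key=YOUR-SCRAPEOPS-API-KEY&url=https%3A%2F%2Fwww.walmart.com%2Fsearch%3Fq%3Dlaptop%26page%3D1&country=us
```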
This function wraps any URL we give it in a payload holding our proxy parameters:

- "api_key": our ScrapeOps API key.
- "url": the URL we want to scrape.
- "country": the location we wish to appear in.

Here is the full crawler with the proxy integrated.

```python
import os
import csv
import requests
import json
import logging
import time
from urllib.parse import urlencode
from bs4 import BeautifulSoup

import concurrent.futures
from dataclasses import dataclass, field, fields, asdict

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]

def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location,
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url

## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@dataclass
class SearchData:
    name: str = ""
    stars: float = 0
    url: str = ""
    sponsored: bool = False
    price: float = 0.0
    product_id: int = 0

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())

class DataPipeline:
    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)
            if not file_exists:
                writer.writeheader()
            for item in data_to_save:
                writer.writerow(asdict(item))

        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if self.is_duplicate(scraped_data) == False:
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)
        if len(self.storage_queue) > 0:
            self.save_to_csv()

def scrape_search_results(keyword, location, page_number, data_pipeline=None, retries=3):
    formatted_keyword = keyword.replace(" ", "+")
    url = f"https://www.walmart.com/search?q={formatted_keyword}&page={page_number+1}"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            scrapeops_proxy_url = get_scrapeops_url(url, location=location)
            response = requests.get(scrapeops_proxy_url)
            logger.info(f"Recieved [{response.status_code}] from: {url}")
            if response.status_code != 200:
                raise Exception(f"Failed to get page {page_number}, status code {response.status_code}")

            soup = BeautifulSoup(response.text, "html.parser")
            script_tag = soup.select_one("script[id='__NEXT_DATA__'][type='application/json']")
            json_data = json.loads(script_tag.text)
            item_list = json_data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["items"]

            for item in item_list:
                if item["__typename"] != "Product":
                    continue
                name = item.get("name")
                product_id = item["usItemId"]
                if not name:
                    continue
                link = f"https://www.walmart.com/reviews/product/{product_id}"
                price = item["price"]
                sponsored = item["isSponsoredFlag"]
                rating = item["averageRating"]

                search_data = SearchData(
                    name=name,
                    stars=rating,
                    url=link,
                    sponsored=sponsored,
                    price=price,
                    product_id=product_id
                )
                data_pipeline.add_data(search_data)

            logger.info(f"Successfully parsed data from: {url}")
            success = True

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")

def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_search_results,
            [keyword] * pages,
            [location] * pages,
            range(pages),
            [data_pipeline] * pages,
            [retries] * pages
        )

if __name__ == "__main__":
    MAX_RETRIES = 3
    MAX_THREADS = 3
    PAGES = 1
    LOCATION = "us"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["laptop"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")

        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")
```
Here is the main block we'll be working with. We're now crawling 4 pages on 3 threads. Feel free to change any of these to tweak your results:

- MAX_RETRIES: Defines the maximum number of times the script will retry fetching a webpage if a request fails due to issues such as network timeouts or non-200 HTTP responses.
- MAX_THREADS: Sets the maximum number of threads that will be used concurrently while scraping.
- PAGES: The number of search result pages to scrape for each keyword.
- LOCATION: The location or country code where the products or reviews will be scraped from.
- keyword_list: A list of product keywords to search for on Walmart's website (e.g., ["laptop"]).

```python
if __name__ == "__main__":
    MAX_RETRIES = 3
    MAX_THREADS = 3
    PAGES = 4
    LOCATION = "us"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["laptop"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")

        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")
```
Next up is our review parser, process_item(). It pulls the customer reviews from the review page we saved for each product.

```python
def process_item(row, location, retries=3):
    url = row["url"]
    tries = 0
    success = False

    while tries <= retries and not success:
        response = requests.get(url)
        try:
            if response.status_code == 200:
                logger.info(f"Status: {response.status_code}")
                soup = BeautifulSoup(response.text, "html.parser")

                review_pipeline = DataPipeline(csv_filename=f"{row['name'].replace(' ', '-')}.csv")

                script_tag = soup.select_one("script[id='__NEXT_DATA__'][type='application/json']")
                json_data = json.loads(script_tag.text)
                review_list = json_data["props"]["pageProps"]["initialData"]["data"]["reviews"]["customerReviews"]

                for review in review_list:
                    name = review["userNickname"]
                    author_id = review["authorId"]
                    rating = review["rating"]
                    date = review["reviewSubmissionTime"]
                    review = review["reviewText"]

                    review_data = {
                        "name": name,
                        "author_id": author_id,
                        "rating": rating,
                        "date": date,
                        "review": review
                    }
                    print(review_data)

                success = True

            else:
                logger.warning(f"Failed Response: {response.status_code}")
                raise Exception(f"Failed Request, status code: {response.status_code}")

        except Exception as e:
            logger.error(f"Exception thrown: {e}")
            logger.warning(f"Failed to process page: {row['url']}")
            logger.warning(f"Retries left: {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")
    else:
        logger.info(f"Successfully parsed: {row['url']}")
```
soup.select_one("script[id='__NEXT_DATA__'][type='application/json']")
.json_data["props"]["pageProps"]["initialData"]["data"]["reviews"]["customerReviews"]
finds our list of customer reviews.name
: the name of the reviewer.author_id
: a unique identifier for the reviewer, much like our product_id
from earlier.rating
: the rating left by the reviewer.date
: the date that the review was left.review
: the actual text of the review, for instance "It was good. I really like [x] about this laptop."start_scrape()
from earlier.def process_results(csv_file, location, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) for row in reader: process_item(row, location, retries=retries)
This function reads our CSV file and runs process_item() on each row from the file using a for loop. We'll remove the for
loop a little later on when we add concurrency later on. You can see our fully updated code below.import osimport csvimport requestsimport jsonimport loggingfrom urllib.parse import urlencodefrom bs4 import BeautifulSoup import concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" stars: float = 0 url: str = "" sponsored: bool = False price: float = 0.0 product_id: int = 0 def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, page_number, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://www.walmart.com/search?q={formatted_keyword}&page={page_number+1}" tries = 0 success = False while tries <= retries and not success: try: scrapeops_proxy_url = get_scrapeops_url(url, location=location) response = requests.get(scrapeops_proxy_url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code != 200: raise Exception(f"Failed to get page {page_number}, status code {response.status_code}") soup = BeautifulSoup(response.text, "html.parser") script_tag = soup.select_one("script[id='__NEXT_DATA__'][type='application/json']") json_data = json.loads(script_tag.text) item_list = json_data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["items"] for item in item_list: if item["__typename"] != "Product": continue name = item.get("name") product_id = item["usItemId"] if not name: continue link = f"https://www.walmart.com/reviews/product/{product_id}" price = item["price"] sponsored = item["isSponsoredFlag"] rating = item["averageRating"] search_data = SearchData( name=name, stars=rating, url=link, sponsored=sponsored, price=price, product_id=product_id ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries += 1 if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, [keyword] * pages, [location] * pages, range(pages), [data_pipeline] * pages, [retries] * pages ) def process_item(row, location, retries=3): url = row["url"] tries = 0 success = False while tries <= retries and not success: response = requests.get(url) try: if response.status_code == 200: logger.info(f"Status: {response.status_code}") soup = BeautifulSoup(response.text, "html.parser") review_pipeline = DataPipeline(csv_filename=f"{row['name'].replace(' ', '-')}.csv") script_tag = soup.select_one("script[id='__NEXT_DATA__'][type='application/json']") json_data = json.loads(script_tag.text) review_list = json_data["props"]["pageProps"]["initialData"]["data"]["reviews"]["customerReviews"] for review in review_list: name = review["userNickname"] author_id = review["authorId"] rating = review["rating"] date = review["reviewSubmissionTime"] review = review["reviewText"] review_data = { "name": name, "author_id": author_id, "rating": rating, "date": date, "review": review } print(review_data) success = True else: logger.warning(f"Failed Response: {response.status_code}") raise Exception(f"Failed Request, status code: {response.status_code}") except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {row['url']}") logger.warning(f"Retries 
left: {retries-tries}") tries += 1 if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['url']}") def process_results(csv_file, location, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) for row in reader: process_item(row, location, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 3 PAGES = 1 LOCATION = "us" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["laptop"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") for file in aggregate_files: process_results(file, LOCATION, retries=MAX_RETRIES)
Here's what's happening:

- process_results() reads our CSV into an array and iterates through it.
- It then runs process_item() on each item from the array to scrape its reviews.

To store this data, we need a DataPipeline that can take in a dataclass, but the only class we have is SearchData. To address this, we're going to add another one called ReviewData. Then, from within our parsing function, we'll create a new DataPipeline and pass ReviewData objects into it while we parse. Take a look at ReviewData.

```python
@dataclass
class ReviewData:
    name: str = ""
    author_id: str = ""
    rating: int = 0
    date: str = ""
    review: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())
```
ReviewData
holds all of the data we extracted during the parse. You can see how everything works in the full code below.import osimport csvimport requestsimport jsonimport loggingfrom urllib.parse import urlencodefrom bs4 import BeautifulSoup import concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" stars: float = 0 url: str = "" sponsored: bool = False price: float = 0.0 product_id: int = 0 def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) @dataclassclass ReviewData: name: str = "" author_id: str = "" rating: int = 0 date: str = "" review: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, page_number, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://www.walmart.com/search?q={formatted_keyword}&page={page_number+1}" tries = 0 success = False while tries <= retries and not success: try: scrapeops_proxy_url = get_scrapeops_url(url, location=location) response = requests.get(scrapeops_proxy_url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code != 200: raise Exception(f"Failed to get page {page_number}, status code {response.status_code}") soup = BeautifulSoup(response.text, "html.parser") script_tag = soup.select_one("script[id='__NEXT_DATA__'][type='application/json']") json_data = json.loads(script_tag.text) item_list = json_data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["items"] for item in item_list: if item["__typename"] != "Product": continue name = item.get("name") product_id = item["usItemId"] if not name: continue link = f"https://www.walmart.com/reviews/product/{product_id}" price = item["price"] sponsored = item["isSponsoredFlag"] rating = item["averageRating"] search_data = SearchData( name=name, stars=rating, url=link, sponsored=sponsored, price=price, product_id=product_id ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries += 1 if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, [keyword] * pages, [location] * pages, range(pages), [data_pipeline] * pages, [retries] * pages ) def process_item(row, location, retries=3): url = row["url"] tries = 0 success = False while tries <= retries and not success: response = requests.get(url) try: if response.status_code == 200: logger.info(f"Status: {response.status_code}") soup = BeautifulSoup(response.text, "html.parser") review_pipeline = DataPipeline(csv_filename=f"{row['name'].replace(' ', '-')}.csv") script_tag = soup.select_one("script[id='__NEXT_DATA__'][type='application/json']") json_data = json.loads(script_tag.text) review_list = json_data["props"]["pageProps"]["initialData"]["data"]["reviews"]["customerReviews"] for review in review_list: name = review["userNickname"] author_id = review["authorId"] rating = review["rating"] date = review["reviewSubmissionTime"] review = review["reviewText"] review_data = ReviewData( name=name, author_id=author_id, rating=rating, date=date, review=review ) review_pipeline.add_data(review_data) review_pipeline.close_pipeline() success = True else: logger.warning(f"Failed Response: {response.status_code}") raise Exception(f"Failed Request, status code: {response.status_code}") except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process 
page: {row['url']}") logger.warning(f"Retries left: {retries-tries}") tries += 1 if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['url']}") def process_results(csv_file, location, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) for row in reader: process_item(row, location, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 3 PAGES = 1 LOCATION = "us" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["laptop"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") for file in aggregate_files: process_results(file, LOCATION, retries=MAX_RETRIES)
Inside our parsing function, `process_item()`, we open a new `DataPipeline` and pass `ReviewData` objects into the pipeline as we parse them.

To run the review scrape concurrently, `ThreadPoolExecutor` will open up a new pool of threads. We pass our parsing function in as the first argument; every other argument is passed in as an array, and the executor feeds those array elements, one per call, into our parser. Here is the finished `process_results()`.

```python
def process_results(csv_file, location, max_threads=5, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
            executor.map(
                process_item,
                reader,
                [location] * len(reader),
                [retries] * len(reader)
            )
```
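If the `executor.map()` calling convention looks unusual, here is a tiny standalone sketch (toy function and values, not part of the scraper) showing how the argument arrays line up with each call:

```python
import concurrent.futures

def show(row, location, retries):
    # Each call receives one element from each iterable passed to executor.map()
    print(f"row={row}, location={location}, retries={retries}")

rows = ["row-a", "row-b", "row-c"]

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    executor.map(
        show,
        rows,                 # first positional argument, one element per call
        ["us"] * len(rows),   # second positional argument, repeated for every call
        [3] * len(rows)       # third positional argument, repeated for every call
    )
```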
We call `get_scrapeops_url()` from within our parser, and now we have a custom proxy for each request we make during our review scrape.

```python
scrapeops_proxy_url = get_scrapeops_url(url, location=location)
```
import osimport csvimport requestsimport jsonimport loggingfrom urllib.parse import urlencodefrom bs4 import BeautifulSoup import concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" stars: float = 0 url: str = "" sponsored: bool = False price: float = 0.0 product_id: int = 0 def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) @dataclassclass ReviewData: name: str = "" author_id: str = "" rating: int = 0 date: str = "" review: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, page_number, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://www.walmart.com/search?q={formatted_keyword}&page={page_number+1}" tries = 0 success = False while tries <= retries and not success: try: scrapeops_proxy_url = get_scrapeops_url(url, location=location) response = requests.get(scrapeops_proxy_url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code != 200: raise Exception(f"Failed to get page {page_number}, status code {response.status_code}") soup = BeautifulSoup(response.text, "html.parser") script_tag = soup.select_one("script[id='__NEXT_DATA__'][type='application/json']") json_data = json.loads(script_tag.text) item_list = json_data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["items"] for item in item_list: if item["__typename"] != "Product": continue name = item.get("name") product_id = item["usItemId"] if not name: continue link = f"https://www.walmart.com/reviews/product/{product_id}" price = item["price"] sponsored = item["isSponsoredFlag"] rating = item["averageRating"] search_data = SearchData( name=name, stars=rating, url=link, sponsored=sponsored, price=price, product_id=product_id ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries += 1 if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, [keyword] * pages, [location] * pages, range(pages), [data_pipeline] * pages, [retries] * pages ) def process_item(row, location, retries=3): url = row["url"] tries = 0 success = False while tries <= retries and not success: response = requests.get(get_scrapeops_url(url, location=location)) try: if response.status_code == 200: logger.info(f"Status: {response.status_code}") soup = BeautifulSoup(response.text, "html.parser") review_pipeline = DataPipeline(csv_filename=f"{row['name'].replace(' ', '-').replace('/', '')}.csv") script_tag = soup.select_one("script[id='__NEXT_DATA__'][type='application/json']") json_data = json.loads(script_tag.text) review_list = json_data["props"]["pageProps"]["initialData"]["data"]["reviews"]["customerReviews"] for review in review_list: name = review["userNickname"] author_id = review["authorId"] rating = review["rating"] date = review["reviewSubmissionTime"] review = review["reviewText"] review_data = ReviewData( name=name, author_id=author_id, rating=rating, date=date, review=review ) review_pipeline.add_data(review_data) review_pipeline.close_pipeline() success = True else: logger.warning(f"Failed Response: {response.status_code}") raise Exception(f"Failed Request, status code: {response.status_code}") except Exception as e: 
logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {row['url']}") logger.warning(f"Retries left: {retries-tries}") tries += 1 if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['url']}") def process_results(csv_file, location, max_threads=5, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( process_item, reader, [location] * len(reader), [retries] * len(reader) ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 3 PAGES = 1 LOCATION = "us" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["laptop"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") for file in aggregate_files: process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
```python
if __name__ == "__main__":

    MAX_RETRIES = 3
    MAX_THREADS = 3
    PAGES = 4
    LOCATION = "us"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["laptop"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")

        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")

    logger.info(f"Crawl complete.")

    for file in aggregate_files:
        process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
```
When you scrape Walmart, you are subject to their terms of service and their robots.txt. Violating their policies can lead to suspension and even deletion of your account. You can view their terms here. Walmart's robots.txt is available here.

Then check out ScrapeOps, the complete toolkit for web scraping.
Before running the code, you'll need a `config.json` file containing your ScrapeOps API key. Then paste the script below into a Python file of your choice; we'll refer to it as `name_of_your_file.py`.
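The config file is just a small JSON object. Here is a minimal sketch that confirms it loads; the "api_key" field name matches what the script reads:

```python
import json

# config.json should sit next to your script and look like:
# {"api_key": "YOUR-SCRAPEOPS-API-KEY"}
with open("config.json", "r") as config_file:
    config = json.load(config_file)

assert "api_key" in config, "config.json must contain an 'api_key' entry"
print("API key loaded, length:", len(config["api_key"]))
```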
.import os import csv import json import logging import time import concurrent.futures from dataclasses import dataclass, field, fields, asdict from urllib.parse import urlencode from selenium import webdriver from selenium.webdriver.chrome.service import Service as ChromeService from webdriver_manager.chrome import ChromeDriverManager from selenium.webdriver.chrome.options import Options from selenium.webdriver.common.by import By API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url # Logging logging.basicConfig(level=logging.INFO) logger = logging.getLogger(__name__) @dataclass class SearchData: name: str = "" stars: float = 0 url: str = "" sponsored: bool = False price: float = 0.0 product_id: int = 0 def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) @dataclass class ReviewData: name: str = "" author_id: int = 0 rating: float = 0.0 date: str = "" review: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, page_number, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://www.walmart.com/search?q={formatted_keyword}&page={page_number+1}" tries = 0 success = False while tries <= retries and not success: try: scrapeops_proxy_url = get_scrapeops_url(url, location=location) # Set up Selenium WebDriver options = Options() options.add_argument("--headless") options.add_argument("--no-sandbox") options.add_argument("--disable-dev-shm-usage") driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options) driver.get(scrapeops_proxy_url) logger.info(f"Received page: {url}") # Wait for the page to load and get the script tag time.sleep(3) # Adjust the sleep time as needed script_tag = driver.find_element(By.CSS_SELECTOR, "script[id='__NEXT_DATA__'][type='application/json']") json_data = json.loads(script_tag.get_attribute("innerHTML")) item_list = json_data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["items"] for item in item_list: if item["__typename"] != "Product": continue name = item.get("name") product_id = item["usItemId"] if not name: continue link = f"https://www.walmart.com/reviews/product/{product_id}" price = item["price"] sponsored = item["isSponsoredFlag"] rating = item["averageRating"] search_data = SearchData( name=name, stars=rating, url=link, sponsored=sponsored, price=price, product_id=product_id ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True driver.quit() # Close the WebDriver except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries += 1 driver.quit() # Close the WebDriver on error if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, [keyword] * pages, [location] * pages, range(pages), [data_pipeline] * pages, [retries] * pages ) def process_item(row, location, retries=3): url = row.get("url") tries = 0 success = False options = Options() options.add_argument("--headless") options.add_argument("--no-sandbox") options.add_argument("--disable-dev-shm-usage") driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options) while tries <= retries and not success: try: scrapeops_proxy_url = get_scrapeops_url(url, location=location) logger.info(f"Attempting to access URL: {scrapeops_proxy_url}") driver.get(scrapeops_proxy_url) logger.info(f"Status: {driver.title}") # Wait for the page to load and get the script tag time.sleep(3) # Adjust the sleep time as needed script_tag = driver.find_element(By.CSS_SELECTOR, "script[id='__NEXT_DATA__'][type='application/json']") json_data = json.loads(script_tag.get_attribute("innerHTML")) review_list = 
json_data["props"]["pageProps"]["initialData"]["data"]["reviews"]["customerReviews"] review_pipeline = DataPipeline(csv_filename=f"{row['name'].replace(' ', '-')}.csv") for review in review_list: name = review["userNickname"] author_id = review["authorId"] rating = review["rating"] date = review["reviewSubmissionTime"] review_text = review["reviewText"] review_data = ReviewData( name=name, author_id=author_id, rating=rating, date=date, review=review_text ) review_pipeline.add_data(review_data) review_pipeline.close_pipeline() success = True except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {url}") logger.warning(f"Retries left: {retries-tries}") tries += 1 driver.quit() # Close the WebDriver if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['url']}") def process_results(csv_file, location, max_threads, retries=3): logger.info(f"Processing {csv_file}") with open(csv_file, newline="", encoding="utf-8") as file: reader = list(csv.DictReader(file)) with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( process_item, reader, # Pass each row in the CSV as the first argument to process_item [location] * len(reader), # Location as second argument for all rows [retries] * len(reader) # Retries as third argument for all rows ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 PAGES = 1 LOCATION = "us" logger.info(f"Crawl starting...") # INPUT ---> List of keywords to scrape keyword_list = ["laptop"] aggregate_files = [] # Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") for file in aggregate_files: process_results(file, LOCATION,max_threads=MAX_THREADS, retries=MAX_RETRIES)
- `MAX_RETRIES`: Sets the maximum number of retries if a request fails. This can happen due to network timeouts or non-200 HTTP responses.
- `MAX_THREADS`: Defines how many threads will run at the same time during scraping.
- `PAGES`: The number of search result pages to scrape for each keyword.
- `LOCATION`: The location or country code for scraping products or reviews.
- `keyword_list`: A list of product keywords to search on Walmart's website (e.g., ["laptop"]).

Search results come from `https://www.walmart.com/search?q={formatted_keyword}`, and each product's reviews come from `https://www.walmart.com/reviews/product/{product_id}`.

Pagination is controlled with the `page` parameter. To view page 1, the URL will have `page=1`. The full paginated URL will be: `https://www.walmart.com/search?q={formatted_keyword}&page={page_number+1}`. We use `page_number+1` because Python's `range()` function starts counting at 0 (see the short sketch after this list of settings).

To control our geolocation, we use the `country` parameter with the ScrapeOps Proxy API. This parameter allows us to choose the country. To appear in the US, use `"country": "us"`; to appear in the UK, use `"country": "uk"`.
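Here is a quick sketch (hypothetical keyword) of the URLs a four-page crawl actually requests, showing how `range()` maps onto Walmart's 1-based pages:

```python
# range(PAGES) yields 0..PAGES-1, so we add 1 to get Walmart's 1-based page numbers.
formatted_keyword = "gaming laptop".replace(" ", "+")

for page_number in range(4):
    print(f"https://www.walmart.com/search?q={formatted_keyword}&page={page_number+1}")

# https://www.walmart.com/search?q=gaming+laptop&page=1
# https://www.walmart.com/search?q=gaming+laptop&page=2
# https://www.walmart.com/search?q=gaming+laptop&page=3
# https://www.walmart.com/search?q=gaming+laptop&page=4
```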
.mkdir walmart-scraper
cd walmart-scraper
python -m venv venv
source venv/bin/activate
pip install selenium
pip install webdriver-manager
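Optionally, you can confirm the install with a quick headless run. This is a minimal sketch; it assumes Chrome is installed locally, and Walmart may serve a bot-check page to an unproxied headless browser:

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
options.add_argument("--headless")

# webdriver-manager downloads a matching chromedriver automatically
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options)
driver.get("https://www.walmart.com")
print(driver.title)  # may show a bot-check title if Walmart challenges the request
driver.quit()
```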
We'll start by building a parsing function, `scrape_search_results()`
.Here’s the code we’ll start with:import os import json import logging import time from selenium import webdriver from selenium.webdriver.chrome.service import Service from webdriver_manager.chrome import ChromeDriverManager from selenium.webdriver.common.by import By from selenium.webdriver.chrome.options import Options import concurrent.futures from dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] # Logging logging.basicConfig(level=logging.INFO) logger = logging.getLogger(__name__) def scrape_search_results(keyword, location, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://www.walmart.com/search?q={formatted_keyword}" tries = 0 success = False # Set up Selenium WebDriver options = Options() options.add_argument("--headless") options.add_argument("--no-sandbox") options.add_argument("--disable-dev-shm-usage") driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options) while tries < retries and not success: try: driver.get(url) logger.info(f"Received page from: {url}") # Wait for the necessary elements to load time.sleep(3) # Adjust sleep time as necessary # Retrieve the JSON data from the page script_tag = driver.find_element(By.CSS_SELECTOR, "script[id='__NEXT_DATA__'][type='application/json']") json_data = json.loads(script_tag.get_attribute('innerText')) item_list = json_data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["items"] for item in item_list: if item["__typename"] != "Product": continue name = item.get("name") product_id = item["usItemId"] if not name: continue link = f"https://www.walmart.com/reviews/product/{product_id}" price = item["price"] sponsored = item["isSponsoredFlag"] rating = item["averageRating"] search_data = { "name": name, "stars": rating, "url": link, "sponsored": sponsored, "price": price, "product_id": product_id } print(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries - tries - 1}") tries += 1 driver.quit() # Ensure to close the browser if not success: raise Exception(f"Max Retries exceeded: {retries}") if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 3 PAGES = 1 LOCATION = "us" logger.info(f"Crawl starting...") # INPUT ---> List of keywords to scrape keyword_list = ["laptop"] aggregate_files = [] # Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") scrape_search_results(keyword, LOCATION, retries=MAX_RETRIES) logger.info(f"Crawl complete.")
We use `driver.find_element(By.CSS_SELECTOR, "script[id='__NEXT_DATA__'][type='application/json']")` to locate the JSON and `json.loads()` to turn its text into a JSON object. Our product list sits at `json_data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["items"]`.

To paginate our results, we request `https://www.walmart.com/search?q={formatted_keyword}&page={page_number+1}`. We use `page_number+1` because Walmart pages start at 1, while Python's `range()` starts from 0.

Next, we create a `start_scrape()` function. This function takes a page count and runs `scrape_search_results()` on each page.

```python
def start_scrape(keyword, pages, location, retries=3):
    for page in range(pages):
        scrape_search_results(keyword, location, page, retries=retries)
```
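If the long key path looks opaque, here is a tiny, made-up `__NEXT_DATA__` payload (illustrative only; the real blob is far larger) showing why that chain of keys reaches the product list:

```python
import json

sample_blob = """
{
  "props": {
    "pageProps": {
      "initialData": {
        "searchResult": {
          "itemStacks": [
            {"items": [{"__typename": "Product", "name": "Example Laptop", "usItemId": "123456789"}]}
          ]
        }
      }
    }
  }
}
"""

json_data = json.loads(sample_blob)
items = json_data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["items"]
print(items[0]["name"])  # Example Laptop
```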
import os import json import logging import time from selenium import webdriver from selenium.webdriver.chrome.service import Service from webdriver_manager.chrome import ChromeDriverManager from selenium.webdriver.common.by import By from selenium.webdriver.chrome.options import Options import concurrent.futures from dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] # Logging logging.basicConfig(level=logging.INFO) logger = logging.getLogger(__name__) def scrape_search_results(keyword, location, page_number, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://www.walmart.com/search?q={formatted_keyword}&page={page_number + 1}" tries = 0 success = False # Set up Selenium WebDriver options = Options() options.add_argument("--headless") options.add_argument("--no-sandbox") options.add_argument("--disable-dev-shm-usage") driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options) while tries < retries and not success: try: driver.get(url) logger.info(f"Received page from: {url}") # Wait for the necessary elements to load time.sleep(3) # Adjust sleep time as necessary # Retrieve the JSON data from the page script_tag = driver.find_element(By.CSS_SELECTOR, "script[id='__NEXT_DATA__'][type='application/json']") json_data = json.loads(script_tag.get_attribute('innerText')) item_list = json_data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["items"] for item in item_list: if item["__typename"] != "Product": continue name = item.get("name") product_id = item["usItemId"] if not name: continue link = f"https://www.walmart.com/reviews/product/{product_id}" price = item["price"] sponsored = item["isSponsoredFlag"] rating = item["averageRating"] search_data = { "name": name, "stars": rating, "url": link, "sponsored": sponsored, "price": price, "product_id": product_id } print(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries - tries - 1}") tries += 1 driver.quit() # Ensure to close the browser if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, pages, location, retries=3): for page in range(pages): scrape_search_results(keyword, location, page, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 3 PAGES = 2 LOCATION = "us" logger.info(f"Crawl starting...") # INPUT ---> List of keywords to scrape keyword_list = ["laptop"] aggregate_files = [] # Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") start_scrape(keyword, PAGES, LOCATION, retries=MAX_RETRIES) logger.info(f"Crawl complete.")
To store our data properly, we first create a `dataclass`. We'll name it `SearchData`, as it will represent objects from our search results.

```python
@dataclass
class SearchData:
    name: str = ""
    stars: float = 0
    url: str = ""
    sponsored: bool = False
    price: float = 0.0
    product_id: int = 0

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            if isinstance(getattr(self, field.name), str):
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())
```
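To see what the `__post_init__()` cleanup does, here is a quick illustration with made-up values: empty strings get a readable default and stray whitespace is stripped.

```python
product = SearchData(
    name="  Acme 15.6 inch Laptop  ",
    stars=4.5,
    url="",
    sponsored=False,
    price=499.0,
    product_id=123456789
)

print(product.name)  # "Acme 15.6 inch Laptop" (whitespace stripped)
print(product.url)   # "No url" (empty string replaced with a default)
```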
Next comes the `DataPipeline`. This class establishes a connection to a CSV file. In addition, it performs a few other crucial tasks. If the file isn't already present, the class creates it; if the CSV file exists, it appends to it instead. The `DataPipeline` also utilizes the `name` attribute to eliminate duplicate entries.

```python
class DataPipeline:
    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)
            if not file_exists:
                writer.writeheader()
            for item in data_to_save:
                writer.writerow(asdict(item))

        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if not self.is_duplicate(scraped_data):
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and not self.csv_file_open:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)
        if len(self.storage_queue) > 0:
            self.save_to_csv()
```
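Here is a small usage sketch (made-up products) showing the de-duplication and the CSV flush in action:

```python
pipeline = DataPipeline(csv_filename="example-products.csv", storage_queue_limit=50)

pipeline.add_data(SearchData(name="Example Laptop A", stars=4.7, url="https://www.walmart.com/reviews/product/111", price=399.0, product_id=111))
pipeline.add_data(SearchData(name="Example Laptop A", stars=4.7, url="https://www.walmart.com/reviews/product/111", price=399.0, product_id=111))  # logged and dropped as a duplicate
pipeline.add_data(SearchData(name="Example Laptop B", stars=4.2, url="https://www.walmart.com/reviews/product/222", price=549.0, product_id=222))

pipeline.close_pipeline()  # flushes the two unique rows to example-products.csv
```

The fully updated code is shown below.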
import os import csv import json import logging import time from selenium import webdriver from selenium.webdriver.chrome.service import Service from webdriver_manager.chrome import ChromeDriverManager from selenium.webdriver.common.by import By from selenium.webdriver.chrome.options import Options from dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] # Logging logging.basicConfig(level=logging.INFO) logger = logging.getLogger(__name__) @dataclass class SearchData: name: str = "" stars: float = 0 url: str = "" sponsored: bool = False price: float = 0.0 product_id: int = 0 def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): if isinstance(getattr(self, field.name), str): if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if not self.is_duplicate(scraped_data): self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and not self.csv_file_open: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, page_number, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://www.walmart.com/search?q={formatted_keyword}&page={page_number + 1}" tries = 0 success = False # Set up Selenium WebDriver options = Options() options.add_argument("--headless") options.add_argument("--no-sandbox") options.add_argument("--disable-dev-shm-usage") driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options) while tries < retries and not success: try: driver.get(url) logger.info(f"Received page from: {url}") # Wait for the necessary elements to load time.sleep(3) # Adjust sleep time as necessary # Retrieve the JSON data from the page script_tag = driver.find_element(By.CSS_SELECTOR, "script[id='__NEXT_DATA__'][type='application/json']") json_data = json.loads(script_tag.get_attribute('innerText')) item_list = json_data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["items"] for item in item_list: if item["__typename"] != "Product": continue name = item.get("name") product_id = item["usItemId"] if not name: continue link = f"https://www.walmart.com/reviews/product/{product_id}" price = item["price"] sponsored = item["isSponsoredFlag"] rating = item["averageRating"] search_data = SearchData( name=name, stars=rating, url=link, sponsored=sponsored, price=price, product_id=product_id ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries - tries - 1}") tries += 1 driver.quit() # Ensure to close the browser if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, pages, location, data_pipeline=None, retries=3): for page in range(pages): scrape_search_results(keyword, location, page, data_pipeline=data_pipeline, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 3 PAGES = 1 LOCATION = "us" logger.info(f"Crawl starting...") # INPUT ---> List of keywords to scrape keyword_list = ["laptop"] aggregate_files = [] # Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.")
We use `ThreadPoolExecutor` to enable multithreading. By opening a new threadpool, we can parse a single page on each available thread. Below is a rewritten version of `start_scrape()`, where the for loop has been eliminated.

```python
def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_search_results,
            [keyword] * pages,
            [location] * pages,
            range(pages),
            [data_pipeline] * pages,
            [retries] * pages
        )
```
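As a usage sketch (assuming the `DataPipeline` defined earlier in this section), a three-page threaded crawl for one keyword would look like this:

```python
pipeline = DataPipeline(csv_filename="laptop.csv")
start_scrape("laptop", 3, "us", data_pipeline=pipeline, max_threads=3, retries=3)
pipeline.close_pipeline()
```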
import os import csv import json import logging import time import concurrent.futures from dataclasses import dataclass, field, fields, asdict from selenium import webdriver from selenium.webdriver.chrome.service import Service as ChromeService from webdriver_manager.chrome import ChromeDriverManager from selenium.webdriver.common.by import By from selenium.webdriver.chrome.options import Options API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] # Set up logging logging.basicConfig(level=logging.INFO) logger = logging.getLogger(__name__) @dataclass class SearchData: name: str = "" stars: float = 0 url: str = "" sponsored: bool = False price: float = 0.0 product_id: int = 0 def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): if isinstance(getattr(self, field.name), str): if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if not self.is_duplicate(scraped_data): self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and not self.csv_file_open: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, page_number, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://www.walmart.com/search?q={formatted_keyword}&page={page_number + 1}" tries = 0 success = False # Set up Selenium WebDriver options = Options() options.add_argument("--headless") options.add_argument("--no-sandbox") options.add_argument("--disable-dev-shm-usage") driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options) while tries <= retries and not success: try: driver.get(url) time.sleep(3) # Allow time for the page to load # Find and parse the script tag with the JSON data script_tag = driver.find_element(By.CSS_SELECTOR, "script[id='__NEXT_DATA__'][type='application/json']") json_data = json.loads(script_tag.get_attribute("innerHTML")) item_list = json_data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["items"] for item in item_list: if item["__typename"] != "Product": continue name = item.get("name") product_id = item["usItemId"] if not name: continue link = f"https://www.walmart.com/reviews/product/{product_id}" price = item["price"] sponsored = item["isSponsoredFlag"] rating = item["averageRating"] search_data = SearchData( name=name, stars=rating, url=link, sponsored=sponsored, price=price, product_id=product_id ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries - tries}") tries += 1 if not success: raise Exception(f"Max Retries exceeded: {retries}") driver.quit() # Close the WebDriver def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, [keyword] * pages, [location] * pages, range(pages), [data_pipeline] * pages, [retries] * pages ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 3 PAGES = 1 LOCATION = "us" logger.info(f"Crawl starting...") keyword_list = ["laptop"] aggregate_files = [] for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.")
When we call `executor.map()`, pay attention to our arguments: `scrape_search_results` is the function for parsing. All other arguments are passed as arrays, and the executor feeds their elements, one per call, into `scrape_search_results`.

Next, we add proxy support. `get_scrapeops_url()` takes a regular URL and converts it into a ScrapeOps proxied URL.

```python
def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location,
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url
```
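For example (illustrative values, not part of the scraper), wrapping a Walmart search URL percent-encodes it into the proxy endpoint's query string:

```python
target = "https://www.walmart.com/search?q=laptop&page=1"
print(get_scrapeops_url(target, location="uk"))
# https://proxy.scrapeops.io/v1/?api_key=...&url=https%3A%2F%2Fwww.walmart.com%2Fsearch%3Fq%3Dlaptop%26page%3D1&country=uk
```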
import os import csv import json import logging import time from urllib.parse import urlencode import concurrent.futures from dataclasses import dataclass, field, fields, asdict from selenium import webdriver from selenium.webdriver.chrome.service import Service as ChromeService from webdriver_manager.chrome import ChromeDriverManager from selenium.webdriver.common.by import By from selenium.webdriver.chrome.options import Options API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url # Set up logging logging.basicConfig(level=logging.INFO) logger = logging.getLogger(__name__) @dataclass class SearchData: name: str = "" stars: float = 0 url: str = "" sponsored: bool = False price: float = 0.0 product_id: int = 0 def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): if isinstance(getattr(self, field.name), str): if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if not self.is_duplicate(scraped_data): self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and not self.csv_file_open: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, page_number, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://www.walmart.com/search?q={formatted_keyword}&page={page_number + 1}" tries = 0 success = False # Set up Selenium WebDriver options = Options() options.add_argument("--headless") options.add_argument("--no-sandbox") options.add_argument("--disable-dev-shm-usage") driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options) while tries <= retries and not success: try: scrapeops_proxy_url = get_scrapeops_url(url, location=location) driver.get(scrapeops_proxy_url) # Use Selenium to load the page time.sleep(3) # Allow time for the page to load # Find and parse the script tag with the JSON data script_tag = driver.find_element(By.CSS_SELECTOR, "script[id='__NEXT_DATA__'][type='application/json']") json_data = json.loads(script_tag.get_attribute("innerHTML")) item_list = json_data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["items"] for item in item_list: if item["__typename"] != "Product": continue name = item.get("name") product_id = item["usItemId"] if not name: continue link = f"https://www.walmart.com/reviews/product/{product_id}" price = item["price"] sponsored = item["isSponsoredFlag"] rating = item["averageRating"] search_data = SearchData( name=name, stars=rating, url=link, sponsored=sponsored, price=price, product_id=product_id ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries - tries}") tries += 1 if not success: raise Exception(f"Max Retries exceeded: {retries}") driver.quit() # Close the WebDriver def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, [keyword] * pages, [location] * pages, range(pages), [data_pipeline] * pages, [retries] * pages ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 3 PAGES = 1 LOCATION = "us" logger.info(f"Crawl starting...") keyword_list = ["laptop"] aggregate_files = [] for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.")
- `MAX_RETRIES`: Specifies the maximum attempts the script will make to retrieve a webpage if a request fails due to network issues or non-200 HTTP status codes.
- `MAX_THREADS`: Determines the highest number of threads that will run simultaneously during the scraping process.
- `PAGES`: The total number of result pages to scrape for every keyword.
- `LOCATION`: Indicates the region or country code from which products or reviews will be scraped.
- `keyword_list`: Contains a list of product keywords to search on Walmart's website (for example, ["laptop"]).

```python
if __name__ == "__main__":

    MAX_RETRIES = 3
    MAX_THREADS = 3
    PAGES = 4
    LOCATION = "us"

    logger.info(f"Crawl starting...")

    keyword_list = ["laptop"]
    aggregate_files = []

    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")

        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")

    logger.info(f"Crawl complete.")
```
def process_item(row,location, retries=3): url = row["url"] tries = 0 success = False # Set up Selenium WebDriver options = Options() options.add_argument("--headless") options.add_argument("--no-sandbox") options.add_argument("--disable-dev-shm-usage") driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options) while tries <= retries and not success: try: driver.get(url) time.sleep(3) # Allow time for the page to load # Find and parse the script tag with the JSON data script_tag = driver.find_element(By.CSS_SELECTOR, "script[id='__NEXT_DATA__'][type='application/json']") json_data = json.loads(script_tag.get_attribute("innerHTML")) review_list = json_data["props"]["pageProps"]["initialData"]["data"]["reviews"]["customerReviews"] review_pipeline = DataPipeline(csv_filename=f"{row['name'].replace(' ', '-')}.csv") for review in review_list: name = review["userNickname"] author_id = review["authorId"] rating = review["rating"] date = review["reviewSubmissionTime"] review_text = review["reviewText"] review_data = ReviewData( name=name, author_id=author_id, rating=rating, date=date, review=review_text ) review_pipeline.add_data(review_data) review_pipeline.close_pipeline() success = True logger.info(f"Successfully parsed: {url}") except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {row['url']}") logger.warning(f"Retries left: {retries - tries}") tries += 1 driver.quit() # Close the WebDriver if not success: raise Exception(f"Max Retries exceeded: {retries}")
We find our JSON with `driver.find_element(By.CSS_SELECTOR, "script[id='__NEXT_DATA__'][type='application/json']")`. The reviews are located at `json_data["props"]["pageProps"]["initialData"]["data"]["reviews"]["customerReviews"]`. From each review we pull:

- `name`: the reviewer's name.
- `author_id`: a unique ID for the reviewer, similar to the product_id mentioned previously.
- `rating`: the score given by the reviewer.
- `date`: the date on which the review was posted.
- `review`: the review text, such as "It was good. I really liked [x] about this laptop."

Our `process_results()` function is similar to `start_scrape()` from before.

```python
def process_results(csv_file, location, retries=3):
    logger.info(f"Processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        for row in reader:
            process_item(row, location, retries=retries)
```

It runs `process_item()`
on every row from the file. We will remove the for loop later when we incorporate concurrency.The fully updated code is shown below.import os import csv import json import logging import time import concurrent.futures from dataclasses import dataclass, field, fields, asdict from urllib.parse import urlencode from selenium import webdriver from selenium.webdriver.chrome.service import Service as ChromeService from webdriver_manager.chrome import ChromeDriverManager from selenium.webdriver.chrome.options import Options from selenium.webdriver.common.by import By API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url # Logging logging.basicConfig(level=logging.INFO) logger = logging.getLogger(__name__) @dataclass class SearchData: name: str = "" stars: float = 0 url: str = "" sponsored: bool = False price: float = 0.0 product_id: int = 0 def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, page_number, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://www.walmart.com/search?q={formatted_keyword}&page={page_number+1}" tries = 0 success = False while tries <= retries and not success: try: scrapeops_proxy_url = get_scrapeops_url(url, location=location) # Set up Selenium WebDriver options = Options() options.add_argument("--headless") options.add_argument("--no-sandbox") options.add_argument("--disable-dev-shm-usage") driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options) driver.get(scrapeops_proxy_url) logger.info(f"Received page: {url}") # Wait for the page to load and get the script tag time.sleep(3) # Adjust the sleep time as needed script_tag = driver.find_element(By.CSS_SELECTOR, "script[id='__NEXT_DATA__'][type='application/json']") json_data = json.loads(script_tag.get_attribute("innerHTML")) item_list = json_data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["items"] for item in item_list: if item["__typename"] != "Product": continue name = item.get("name") product_id = item["usItemId"] if not name: continue link = f"https://www.walmart.com/reviews/product/{product_id}" price = item["price"] sponsored = item["isSponsoredFlag"] rating = item["averageRating"] search_data = SearchData( name=name, stars=rating, url=link, sponsored=sponsored, price=price, product_id=product_id ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True driver.quit() # Close the WebDriver except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries += 1 driver.quit() # Close the WebDriver on error if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, [keyword] * pages, [location] * pages, range(pages), [data_pipeline] * pages, [retries] * pages ) def process_item(row, location, retries=3): url = row.get("url") tries = 0 success = False options = Options() options.add_argument("--headless") options.add_argument("--no-sandbox") options.add_argument("--disable-dev-shm-usage") driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options) while tries <= retries and not success: try: driver.get(url) logger.info(f"Status: {driver.title}") # Wait for the page to load and get the script tag time.sleep(3) # Adjust the sleep time as needed script_tag = driver.find_element(By.CSS_SELECTOR, "script[id='__NEXT_DATA__'][type='application/json']") json_data = json.loads(script_tag.get_attribute("innerHTML")) review_list = json_data["props"]["pageProps"]["initialData"]["data"]["reviews"]["customerReviews"] review_pipeline = DataPipeline(csv_filename=f"{row['name'].replace(' ', '-')}.csv") for review in review_list: name = 
review["userNickname"] author_id = review["authorId"] rating = review["rating"] date = review["reviewSubmissionTime"] review = review["reviewText"] review_data = { "name": name, "author_id": author_id, "rating": rating, "date": date, "review": review } print(review_data) success = True except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {url}") logger.warning(f"Retries left: {retries-tries}") tries += 1 driver.quit() # Close the WebDriver if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['url']}") def process_results(csv_file, location, retries=3): logger.info(f"Processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) for row in reader: process_item(row, location, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 3 PAGES = 1 LOCATION = "us" logger.info(f"Crawl starting...") # INPUT ---> List of keywords to scrape keyword_list = ["laptop"] aggregate_files = [] # Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") for file in aggregate_files: process_results(file, LOCATION, retries=MAX_RETRIES)
`process_results()` reads our CSV file into an array and iterates over it. During each iteration, we call `process_item()` on the individual items from the array to scrape their reviews.

So far, our only dataclass is `SearchData`, and it doesn't describe a review. To solve this, we will introduce another one called `ReviewData`. Then, inside our parsing function, we will create a new `DataPipeline` and send `ReviewData` objects to it as we parse. Here's a look at `ReviewData`.

```python
@dataclass
class ReviewData:
    name: str = ""
    author_id: int = 0
    rating: float = 0.0
    date: str = ""
    review: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())
```

Inside `process_item()`, we now open a new `DataPipeline` and store each review as `ReviewData`
. The full code below shows how everything functions.import os import csv import json import logging import time import concurrent.futures from dataclasses import dataclass, field, fields, asdict from urllib.parse import urlencode from selenium import webdriver from selenium.webdriver.chrome.service import Service as ChromeService from webdriver_manager.chrome import ChromeDriverManager from selenium.webdriver.chrome.options import Options from selenium.webdriver.common.by import By API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url # Logging logging.basicConfig(level=logging.INFO) logger = logging.getLogger(__name__) @dataclass class SearchData: name: str = "" stars: float = 0 url: str = "" sponsored: bool = False price: float = 0.0 product_id: int = 0 def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) @dataclass class ReviewData: name: str = "" author_id: int = 0 rating: float = 0.0 date: str = "" review: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, page_number, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://www.walmart.com/search?q={formatted_keyword}&page={page_number+1}" tries = 0 success = False while tries <= retries and not success: try: scrapeops_proxy_url = get_scrapeops_url(url, location=location) # Set up Selenium WebDriver options = Options() options.add_argument("--headless") options.add_argument("--no-sandbox") options.add_argument("--disable-dev-shm-usage") driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options) driver.get(scrapeops_proxy_url) logger.info(f"Received page: {url}") # Wait for the page to load and get the script tag time.sleep(3) # Adjust the sleep time as needed script_tag = driver.find_element(By.CSS_SELECTOR, "script[id='__NEXT_DATA__'][type='application/json']") json_data = json.loads(script_tag.get_attribute("innerHTML")) item_list = json_data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["items"] for item in item_list: if item["__typename"] != "Product": continue name = item.get("name") product_id = item["usItemId"] if not name: continue link = f"https://www.walmart.com/reviews/product/{product_id}" price = item["price"] sponsored = item["isSponsoredFlag"] rating = item["averageRating"] search_data = SearchData( name=name, stars=rating, url=link, sponsored=sponsored, price=price, product_id=product_id ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True driver.quit() # Close the WebDriver except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries += 1 driver.quit() # Close the WebDriver on error if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_search_results, [keyword] * pages, [location] * pages, range(pages), [data_pipeline] * pages, [retries] * pages ) def process_item(row, location, retries=3): url = row.get("url") tries = 0 success = False options = Options() options.add_argument("--headless") options.add_argument("--no-sandbox") options.add_argument("--disable-dev-shm-usage") driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options) while tries <= retries and not success: try: driver.get(url) logger.info(f"Status: {driver.title}") # Wait for the page to load and get the script tag time.sleep(3) # Adjust the sleep time as needed script_tag = driver.find_element(By.CSS_SELECTOR, "script[id='__NEXT_DATA__'][type='application/json']") json_data = json.loads(script_tag.get_attribute("innerHTML")) review_list = json_data["props"]["pageProps"]["initialData"]["data"]["reviews"]["customerReviews"] review_pipeline = DataPipeline(csv_filename=f"{row['name'].replace(' ', '-')}.csv") for review in review_list: name = 
review["userNickname"] author_id = review["authorId"] rating = review["rating"] date = review["reviewSubmissionTime"] review = review["reviewText"] review_data = ReviewData( name=name, author_id=author_id, rating=rating, date=date, review=review ) review_pipeline.add_data(review_data) review_pipeline.close_pipeline() success = True except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {url}") logger.warning(f"Retries left: {retries-tries}") tries += 1 driver.quit() # Close the WebDriver if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['url']}") def process_results(csv_file, location, retries=3): logger.info(f"Processing {csv_file}") with open(csv_file, newline="", encoding="utf-8") as file: reader = list(csv.DictReader(file)) for row in reader: process_item(row, location, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 3 PAGES = 1 LOCATION = "us" logger.info(f"Crawl starting...") # INPUT ---> List of keywords to scrape keyword_list = ["laptop"] aggregate_files = [] # Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") for file in aggregate_files: process_results(file, LOCATION, retries=MAX_RETRIES)
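One small detail from the listing above worth calling out: each product gets its own review CSV, named after the product pulled from the crawl file. Here's a quick sketch of that filename logic, using a made-up product name:

```python
# Made-up row, standing in for one line of the crawl CSV.
row = {"name": "2021 Apple 10.2-inch iPad Wi-Fi 64GB"}

# Same expression process_item() uses to name the per-product review file.
csv_filename = f"{row['name'].replace(' ', '-')}.csv"
print(csv_filename)  # 2021-Apple-10.2-inch-iPad-Wi-Fi-64GB.csv
```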
To add concurrency to our review scraper, we once again use ThreadPoolExecutor. Our parsing function is passed in as the first argument, and everything else is passed in as arrays (one element per row) for executor.map() to send into our parser.

Here is the final process_results():

```python
def process_results(csv_file, location, max_threads, retries=3):
    logger.info(f"Processing {csv_file}")
    with open(csv_file, newline="", encoding="utf-8") as file:
        reader = list(csv.DictReader(file))

        with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
            executor.map(
                process_item,
                reader,                    # Pass each row in the CSV as the first argument to process_item
                [location] * len(reader),  # Location as second argument for all rows
                [retries] * len(reader)    # Retries as third argument for all rows
            )
```
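If the way executor.map() fans those arguments out isn't obvious, here is a small, self-contained sketch with made-up values. It mirrors what process_results() does, but swaps process_item() for a stand-in so nothing touches Walmart:

```python
import concurrent.futures

def pretend_process_item(row, location, retries=3):
    # Stand-in for process_item(): just report which arguments it received.
    print(f"row={row['name']}, location={location}, retries={retries}")

# Made-up rows, standing in for the CSV produced by the crawl.
reader = [{"name": "laptop-1"}, {"name": "laptop-2"}, {"name": "laptop-3"}]

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    executor.map(
        pretend_process_item,     # called once per element below
        reader,                   # first argument: one row per call
        ["us"] * len(reader),     # second argument: same location for every call
        [3] * len(reader)         # third argument: same retry count for every call
    )
```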
To add proxy support, we simply wrap our URL with get_scrapeops_url(), and now, for each request made during our review scrape, we have a custom proxy.

```python
scrapeops_proxy_url = get_scrapeops_url(url, location=location)
driver.get(scrapeops_proxy_url)
```
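For clarity, here's roughly what that wrapped URL looks like. The snippet repeats the get_scrapeops_url() helper so it runs on its own, the API key is a placeholder, and the product id is made up:

```python
from urllib.parse import urlencode

API_KEY = "YOUR-API-KEY"  # placeholder, not a real key

def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location,
    }
    return "https://proxy.scrapeops.io/v1/?" + urlencode(payload)

# Made-up product id, just to show the encoding.
print(get_scrapeops_url("https://www.walmart.com/reviews/product/123456789"))
# https://proxy.scrapeops.io/v1/?api_key=YOUR-API-KEY&url=https%3A%2F%2Fwww.walmart.com%2Freviews%2Fproduct%2F123456789&country=us
```

With the proxy wired into process_item(), the full production-ready scraper looks like the listing below.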
```python
import os
import csv
import json
import logging
import time
import concurrent.futures
from dataclasses import dataclass, field, fields, asdict
from urllib.parse import urlencode

from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]


def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location,
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url


# Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@dataclass
class SearchData:
    name: str = ""
    stars: float = 0
    url: str = ""
    sponsored: bool = False
    price: float = 0.0
    product_id: int = 0

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())


@dataclass
class ReviewData:
    name: str = ""
    author_id: int = 0
    rating: float = 0.0
    date: str = ""
    review: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())


class DataPipeline:
    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)
            if not file_exists:
                writer.writeheader()
            for item in data_to_save:
                writer.writerow(asdict(item))

        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if self.is_duplicate(scraped_data) == False:
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)
        if len(self.storage_queue) > 0:
            self.save_to_csv()


def scrape_search_results(keyword, location, page_number, data_pipeline=None, retries=3):
    formatted_keyword = keyword.replace(" ", "+")
    url = f"https://www.walmart.com/search?q={formatted_keyword}&page={page_number+1}"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            scrapeops_proxy_url = get_scrapeops_url(url, location=location)

            # Set up Selenium WebDriver
            options = Options()
            options.add_argument("--headless")
            options.add_argument("--no-sandbox")
            options.add_argument("--disable-dev-shm-usage")

            driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options)
            driver.get(scrapeops_proxy_url)
            logger.info(f"Received page: {url}")

            # Wait for the page to load and get the script tag
            time.sleep(3)  # Adjust the sleep time as needed
            script_tag = driver.find_element(By.CSS_SELECTOR, "script[id='__NEXT_DATA__'][type='application/json']")
            json_data = json.loads(script_tag.get_attribute("innerHTML"))

            item_list = json_data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["items"]

            for item in item_list:
                if item["__typename"] != "Product":
                    continue
                name = item.get("name")
                product_id = item["usItemId"]
                if not name:
                    continue
                link = f"https://www.walmart.com/reviews/product/{product_id}"
                price = item["price"]
                sponsored = item["isSponsoredFlag"]
                rating = item["averageRating"]

                search_data = SearchData(
                    name=name,
                    stars=rating,
                    url=link,
                    sponsored=sponsored,
                    price=price,
                    product_id=product_id
                )

                data_pipeline.add_data(search_data)

            logger.info(f"Successfully parsed data from: {url}")
            success = True
            driver.quit()  # Close the WebDriver

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")
            tries += 1
            driver.quit()  # Close the WebDriver on error

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")


def start_scrape(keyword, pages, location, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_search_results,
            [keyword] * pages,
            [location] * pages,
            range(pages),
            [data_pipeline] * pages,
            [retries] * pages
        )


def process_item(row, location, retries=3):
    url = row.get("url")
    tries = 0
    success = False

    options = Options()
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")

    driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options)

    while tries <= retries and not success:
        try:
            scrapeops_proxy_url = get_scrapeops_url(url, location=location)
            logger.info(f"Attempting to access URL: {scrapeops_proxy_url}")
            driver.get(scrapeops_proxy_url)
            logger.info(f"Status: {driver.title}")

            # Wait for the page to load and get the script tag
            time.sleep(3)  # Adjust the sleep time as needed
            script_tag = driver.find_element(By.CSS_SELECTOR, "script[id='__NEXT_DATA__'][type='application/json']")
            json_data = json.loads(script_tag.get_attribute("innerHTML"))

            review_list = json_data["props"]["pageProps"]["initialData"]["data"]["reviews"]["customerReviews"]

            review_pipeline = DataPipeline(csv_filename=f"{row['name'].replace(' ', '-')}.csv")

            for review in review_list:
                name = review["userNickname"]
                author_id = review["authorId"]
                rating = review["rating"]
                date = review["reviewSubmissionTime"]
                review_text = review["reviewText"]

                review_data = ReviewData(
                    name=name,
                    author_id=author_id,
                    rating=rating,
                    date=date,
                    review=review_text
                )

                review_pipeline.add_data(review_data)

            review_pipeline.close_pipeline()
            success = True

        except Exception as e:
            logger.error(f"Exception thrown: {e}")
            logger.warning(f"Failed to process page: {url}")
            logger.warning(f"Retries left: {retries-tries}")
            tries += 1

    driver.quit()  # Close the WebDriver

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")
    else:
        logger.info(f"Successfully parsed: {row['url']}")


def process_results(csv_file, location, max_threads, retries=3):
    logger.info(f"Processing {csv_file}")
    with open(csv_file, newline="", encoding="utf-8") as file:
        reader = list(csv.DictReader(file))

        with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
            executor.map(
                process_item,
                reader,                    # Pass each row in the CSV as the first argument to process_item
                [location] * len(reader),  # Location as second argument for all rows
                [retries] * len(reader)    # Retries as third argument for all rows
            )


if __name__ == "__main__":
    MAX_RETRIES = 3
    MAX_THREADS = 5
    PAGES = 1
    LOCATION = "us"

    logger.info(f"Crawl starting...")

    # INPUT ---> List of keywords to scrape
    keyword_list = ["laptop"]
    aggregate_files = []

    # Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")

        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")

    logger.info(f"Crawl complete.")

    for file in aggregate_files:
        process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
```
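To run a larger job, you only need to tweak the constants in the main block. The run below keeps the same setup but bumps PAGES up to 3.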
```python
if __name__ == "__main__":
    MAX_RETRIES = 3
    MAX_THREADS = 5
    PAGES = 3
    LOCATION = "us"

    logger.info(f"Crawl starting...")

    # INPUT ---> List of keywords to scrape
    keyword_list = ["laptop"]
    aggregate_files = []

    # Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")

        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        start_scrape(keyword, PAGES, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")

    logger.info(f"Crawl complete.")

    for file in aggregate_files:
        process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
```
When you scrape Walmart, you are subject to their terms of service and their robots.txt. Violating their policies can lead to suspension and even deletion of your account. You can view their terms here. Walmart's robots.txt is available here.