Then check out ScrapeOps, the complete toolkit for web scraping. To use the scraper below, create a
config.json
file with your "api_key"
and place it in the same folder as this scraper. At that point, it's ready to go!

import os
import csv
import requests
import json
import logging
import time  # needed for DataPipeline.close_pipeline()
from urllib.parse import urlencode
from bs4 import BeautifulSoup
import concurrent.futures
from dataclasses import dataclass, field, fields, asdict

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]


def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location,
        "wait": 2000
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url

## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@dataclass
class SearchData:
    name: str = ""
    url: str = ""
    image: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())


class DataPipeline:

    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)

            if not file_exists:
                writer.writeheader()

            for item in data_to_save:
                writer.writerow(asdict(item))

        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if self.is_duplicate(scraped_data) == False:
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)
        if len(self.storage_queue) > 0:
            self.save_to_csv()


def scrape_search_results(keyword, location, data_pipeline=None, retries=3):
    formatted_keyword = keyword.replace(" ", "+")
    url = f"https://www.pinterest.com/search/pins/?q={formatted_keyword}&rs=typed"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            scrapeops_proxy_url = get_scrapeops_url(url, location=location)
            print(scrapeops_proxy_url)
            response = requests.get(scrapeops_proxy_url)
            logger.info(f"Recieved [{response.status_code}] from: {url}")
            if response.status_code != 200:
                raise Exception(f"Failed request, Status Code {response.status_code}")

            ## Extract Data
            soup = BeautifulSoup(response.text, "html.parser")
            div_cards = soup.find_all("div")

            result_count = 0

            for div_card in div_cards:
                if div_card.get("data-grid-item"):
                    result_count += 1

                    title = div_card.text
                    a_element = div_card.find("a")
                    url = f"https://pinterest.com{a_element['href']}"
                    img = div_card.find("img")
                    img_url = img["src"]

                    search_data = SearchData(
                        name=title,
                        url=url,
                        image=img_url
                    )
                    data_pipeline.add_data(search_data)

            success = True

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")


def process_pin(row, location, retries=3):
    url = row["url"]
    tries = 0
    success = False

    while tries <= retries and not success:
        response = requests.get(url)
        try:
            if response.status_code == 200:
                logger.info(f"Status: {response.status_code}")
                soup = BeautifulSoup(response.text, "html.parser")

                main_card = soup.select_one("div[data-test-id='CloseupDetails']")

                website = "n/a"
                has_website = main_card.select_one("span[style='text-decoration: underline;']")
                if has_website:
                    website = f"https://{has_website.text}"

                star_divs = main_card.select("div[data-test-id='rating-star-full']")
                stars = len(star_divs)

                profile_info = main_card.select_one("div[data-test-id='follower-count']")
                account_name_div = profile_info.select_one("div[data-test-id='creator-profile-name']")
                nested_divs = account_name_div.find_all("div")
                account_name = nested_divs[0].get("title")
                follower_count = profile_info.text.replace(account_name, "").replace(" followers", "")

                img_container = soup.select_one("div[data-test-id='pin-closeup-image']")
                img = img_container.find("img").get("src")

                pin_data = {
                    "name": account_name,
                    "website": website,
                    "stars": stars,
                    "follower_count": follower_count,
                    "image": img
                }

                print(pin_data)
                success = True

            else:
                logger.warning(f"Failed Response: {response.status_code}")
                raise Exception(f"Failed Request, status code: {response.status_code}")

        except Exception as e:
            logger.error(f"Exception thrown: {e}")
            logger.warning(f"Failed to process page: {row['url']}")
            logger.warning(f"Retries left: {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")
    else:
        logger.info(f"Successfully parsed: {row['url']}")


def process_results(csv_file, location, max_threads=5, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        for row in reader:
            process_pin(row, location, retries=retries)


if __name__ == "__main__":

    MAX_RETRIES = 3
    MAX_THREADS = 5
    LOCATION = "uk"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["grilling"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")

        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        scrape_search_results(keyword, LOCATION, data_pipeline=crawl_pipeline, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")

    for file in aggregate_files:
        process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
Feel free to change any of the following constants to fit your needs:
MAX_RETRIES: This parameter sets the maximum number of attempts the script will make to fetch data from a URL if the initial request fails.
MAX_THREADS: This parameter sets the maximum number of threads to use for processing results concurrently. This can speed up the processing of multiple pins or search results.
LOCATION: This parameter sets the geographical location from which the requests are made. It can affect the content returned by the website due to region-specific restrictions or differences.
keyword_list: This list contains the keywords for which you want to scrape Pinterest search results.
When we perform a search on Pinterest, our URL looks like this:
https://www.pinterest.com/search/pins/?q=grilling&rs=typed
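In the crawler itself, this URL gets built from the keyword with a simple f-string. Here is that construction in isolation (the keyword below is just an example):

# How the search URL gets built from a keyword.
keyword = "grilled chicken"
formatted_keyword = keyword.replace(" ", "+")
url = f"https://www.pinterest.com/search/pins/?q={formatted_keyword}&rs=typed"
print(url)  # https://www.pinterest.com/search/pins/?q=grilled+chicken&rs=typed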
Our base URL is https://www.pinterest.com/search/pins/ and our query parameters are q=grilling&rs=typed. rs=typed is a standard param that gets added to the URL when you perform a search on Pinterest. q=grilling contains the actual keywords we're searching for (in this case, "grilling").
Individual pins have URLs like this one: https://www.pinterest.com/pin/45176802505307132/. Here, https://www.pinterest.com/pin/ tells the server that we want a pin, and 45176802505307132 represents the number of the pin.
Because Pinterest generates its content dynamically, we also pass the wait argument into the ScrapeOps API. The wait param tells the ScrapeOps server to wait a certain amount of time for the content to render and then send the page results back to us.
Most of the important elements on the pin page come with a data-test-id. When scraping the pin page, we'll be using data-test-id to find most of our relevant information.
We can pass a country param to the ScrapeOps API as well. This parameter allows us to be routed through a server in whichever country we choose. If we want to appear in the US, we pass "us". If we want to appear in the UK, we pass "uk".
Let's get started. First, make a new project folder and move into it:
mkdir pinterest-scraper
cd pinterest-scraper
python -m venv venv
source venv/bin/activate
pip install requests
pip install beautifulsoup4
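Before writing any code, create a config.json in the same folder. It uses the same format shown later in this guide; just swap in your own ScrapeOps API key:

{
    "api_key": "YOUR-SUPER-SECRET-API-KEY"
}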
API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]
After reading config.json, we assign the key from the file to our API_KEY variable. Next comes scrape_search_results(), which does the parsing. Inside this function, we try to get the page and then pull the information from it.
import os
import csv
import requests
import json
import logging
from urllib.parse import urlencode
from bs4 import BeautifulSoup
import concurrent.futures
from dataclasses import dataclass, field, fields, asdict

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]

## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def scrape_search_results(keyword, location, retries=3):
    formatted_keyword = keyword.replace(" ", "+")
    url = f"https://www.pinterest.com/search/pins/?q={formatted_keyword}&rs=typed"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            response = requests.get(url)
            logger.info(f"Recieved [{response.status_code}] from: {url}")
            if response.status_code != 200:
                raise Exception(f"Failed request, Status Code {response.status_code}")

            ## Extract Data
            soup = BeautifulSoup(response.text, "html.parser")
            div_cards = soup.find_all("div")

            result_count = 0

            for div_card in div_cards:
                if div_card.get("data-grid-item"):
                    result_count += 1

                    title = div_card.text
                    a_element = div_card.find("a")
                    url = f"https://pinterest.com{a_element['href']}"
                    img = div_card.find("img")
                    img_url = img["src"]

                    search_data = {
                        "name": title,
                        "url": url,
                        "image": img_url
                    }
                    print(search_data)

            success = True

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")


if __name__ == "__main__":

    MAX_RETRIES = 3
    MAX_THREADS = 5
    LOCATION = "uk"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["grilling"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        scrape_search_results(keyword, LOCATION, retries=MAX_RETRIES)
    logger.info(f"Crawl complete.")
Once we've found all of our divs, we check each one with div_card.get("data-grid-item"). Each result in our search is a data-grid-item. We find the link element with div_card.find("a") and we extract it with url = f"https://pinterest.com{a_element['href']}". We find the image with img = div_card.find("img") and we then pull the link to the image with img_url = img["src"].
To store this data properly, we add two new classes: SearchData and DataPipeline. SearchData is a class built specifically to hold our data. DataPipeline
is a pipeline to a CSV file. This class filters out duplicates from hitting our CSV and then stores the CSV safely.import osimport csvimport requestsimport jsonimport loggingfrom urllib.parse import urlencodefrom bs4 import BeautifulSoupimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" url: str = "" image: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://www.pinterest.com/search/pins/?q={formatted_keyword}&rs=typed" tries = 0 success = False while tries <= retries and not success: try: response = requests.get(url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code != 200: raise Exception(f"Failed request, Status Code {response.status_code}") ## Extract Data soup = BeautifulSoup(response.text, "html.parser") div_cards = soup.find_all("div") result_count = 0 for div_card in div_cards: if div_card.get("data-grid-item"): result_count += 1 title = div_card.text a_element = div_card.find("a") url = f"https://pinterest.com{a_element['href']}" img = div_card.find("img") img_url = img["src"] search_data = SearchData( name=title, url=url, image=img_url ) data_pipeline.add_data(search_data) success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") if not success: raise Exception(f"Max Retries exceeded: {retries}") if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "uk" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["grilling"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") scrape_search_results(keyword, LOCATION, data_pipeline=crawl_pipeline, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.")
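If you'd like to see how these two classes fit together before wiring them into the crawler, here is a minimal sketch. The filename and values below are made up purely for illustration:

# Hypothetical usage of SearchData and DataPipeline on their own.
pipeline = DataPipeline(csv_filename="example.csv")
pipeline.add_data(SearchData(name="Grilled Salmon", url="https://pinterest.com/pin/1/", image="https://i.pinimg.com/a.jpg"))
# The second item has the same name, so is_duplicate() drops it.
pipeline.add_data(SearchData(name="Grilled Salmon", url="https://pinterest.com/pin/2/", image="https://i.pinimg.com/b.jpg"))
pipeline.close_pipeline()  # flushes anything left in the queue to example.csv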
Instead of printing our data, we now create a SearchData object and pass the resulting search_data to the pipeline with data_pipeline.add_data(search_data).
You may not always need the wait parameter in the code below, but on Pinterest, all of our content is dynamically generated, so "wait": 2000 tells the ScrapeOps server to wait 2 seconds for our content to render and then it sends us the page.

def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location,
        "wait": 2000
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url
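If you're curious what the proxied URL actually looks like, you can call the function directly. The output below is abbreviated and depends on the key in your config:

# Wrap a Pinterest search URL with the ScrapeOps proxy.
target = "https://www.pinterest.com/search/pins/?q=grilling&rs=typed"
print(get_scrapeops_url(target, location="us"))
# https://proxy.scrapeops.io/v1/?api_key=...&url=https%3A%2F%2Fwww.pinterest.com%2Fsearch%2Fpins%2F%3Fq%3Dgrilling%26rs%3Dtyped&country=us&wait=2000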
import osimport csvimport requestsimport jsonimport loggingfrom urllib.parse import urlencodefrom bs4 import BeautifulSoupimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, "wait": 2000 } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" url: str = "" image: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://www.pinterest.com/search/pins/?q={formatted_keyword}&rs=typed" tries = 0 success = False while tries <= retries and not success: try: scrapeops_proxy_url = get_scrapeops_url(url, location=location) response = requests.get(scrapeops_proxy_url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code != 200: raise Exception(f"Failed request, Status Code {response.status_code}") ## Extract Data soup = BeautifulSoup(response.text, "html.parser") div_cards = soup.find_all("div") result_count = 0 for div_card in div_cards: if div_card.get("data-grid-item"): result_count += 1 title = div_card.text a_element = div_card.find("a") url = f"https://pinterest.com{a_element['href']}" img = div_card.find("img") img_url = img["src"] search_data = SearchData( name=title, url=url, image=img_url ) data_pipeline.add_data(search_data) success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") if not success: raise Exception(f"Max Retries exceeded: {retries}") if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "uk" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["grilling"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") scrape_search_results(keyword, LOCATION, data_pipeline=crawl_pipeline, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.")
In addition to routing us through a proxy, ScrapeOps will now wait 2 seconds for the page to render. Time to test everything out in main.

if __name__ == "__main__":

    MAX_RETRIES = 3
    MAX_THREADS = 5
    LOCATION = "uk"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["grilling"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")

        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        scrape_search_results(keyword, LOCATION, data_pipeline=crawl_pipeline, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")

In this production run, we crawl "grilling".
Feel free to change any of the constants yourself and tweak the code; just remember, we don't have actual concurrency yet. That will be added when we're scraping the individual posts that we find with the crawler.
Here are the results from our crawler: the "grilling" search finished in 7.331 seconds. Results may vary based on the location of your server and the quality of your internet connection.
Next, we parse the individual pins with a process_pin() function. Similar to our crawler, we use the retries and success model. While we still have retries left and the operation hasn't succeeded, we find the main card and pull relevant information from it.

def process_pin(row, location, retries=3):
    url = row["url"]
    tries = 0
    success = False

    while tries <= retries and not success:
        response = requests.get(url)
        try:
            if response.status_code == 200:
                logger.info(f"Status: {response.status_code}")
                soup = BeautifulSoup(response.text, "html.parser")

                main_card = soup.select_one("div[data-test-id='CloseupDetails']")

                website = "n/a"
                has_website = main_card.select_one("span[style='text-decoration: underline;']")
                if has_website:
                    website = f"https://{has_website.text}"

                star_divs = main_card.select("div[data-test-id='rating-star-full']")
                stars = len(star_divs)

                profile_info = main_card.select_one("div[data-test-id='follower-count']")
                account_name_div = profile_info.select_one("div[data-test-id='creator-profile-name']")
                nested_divs = account_name_div.find_all("div")
                account_name = nested_divs[0].get("title")
                follower_count = profile_info.text.replace(account_name, "").replace(" followers", "")

                img_container = soup.select_one("div[data-test-id='pin-closeup-image']")
                img = img_container.find("img").get("src")

                pin_data = {
                    "name": account_name,
                    "website": website,
                    "stars": stars,
                    "follower_count": follower_count,
                    "image": img
                }

                print(pin_data)
                success = True

            else:
                logger.warning(f"Failed Response: {response.status_code}")
                raise Exception(f"Failed Request, status code: {response.status_code}")

        except Exception as e:
            logger.error(f"Exception thrown: {e}")
            logger.warning(f"Failed to process page: {row['url']}")
            logger.warning(f"Retries left: {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")
    else:
        logger.info(f"Successfully parsed: {row['url']}")
First, we find the main_card using its CSS selector: main_card = soup.select_one("div[data-test-id='CloseupDetails']"). main_card.select("div[data-test-id='rating-star-full']") finds all of the star elements on the page. We then count the stars with stars = len(star_divs). The holder for the account name comes from account_name_div = profile_info.select_one("div[data-test-id='creator-profile-name']"), and nested_divs[0].get("title") finds our account name. To get the follower count, we remove the account_name and other irrelevant text with profile_info.text.replace(account_name, "").replace(" followers", "") (there is a small worked example of this cleanup after the next function).
To read the rows from our CSV file and run process_pin() on each one, we use a process_results() function:

def process_results(csv_file, location, max_threads=5, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        for row in reader:
            process_pin(row, location, retries=retries)
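To make that follower-count cleanup concrete, here is a tiny standalone example with made-up text that mimics what profile_info.text might contain:

# Hypothetical text cleanup, mirroring the replace() chain above.
account_name = "Grill Masters"
profile_text = "Grill Masters 12.5k followers"
follower_count = profile_text.replace(account_name, "").replace(" followers", "")
print(follower_count)  # " 12.5k" (leading space and all; the real page text may differ)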
import osimport csvimport requestsimport jsonimport loggingfrom urllib.parse import urlencodefrom bs4 import BeautifulSoupimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, "wait": 2000 } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" url: str = "" image: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://www.pinterest.com/search/pins/?q={formatted_keyword}&rs=typed" tries = 0 success = False while tries <= retries and not success: try: scrapeops_proxy_url = get_scrapeops_url(url, location=location) print(scrapeops_proxy_url) response = requests.get(scrapeops_proxy_url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code != 200: raise Exception(f"Failed request, Status Code {response.status_code}") ## Extract Data soup = BeautifulSoup(response.text, "html.parser") div_cards = soup.find_all("div") result_count = 0 for div_card in div_cards: if div_card.get("data-grid-item"): result_count += 1 title = div_card.text a_element = div_card.find("a") url = f"https://pinterest.com{a_element['href']}" img = div_card.find("img") img_url = img["src"] search_data = SearchData( name=title, url=url, image=img_url ) data_pipeline.add_data(search_data) success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") if not success: raise Exception(f"Max Retries exceeded: {retries}") def process_pin(row, location, retries=3): url = row["url"] tries = 0 success = False while tries <= retries and not success: response = requests.get(url) try: if response.status_code == 200: logger.info(f"Status: {response.status_code}") soup = BeautifulSoup(response.text, "html.parser") main_card = soup.select_one("div[data-test-id='CloseupDetails']") website = "n/a" has_website = main_card.select_one("span[style='text-decoration: underline;']") if has_website: website = f"https://{has_website.text}" star_divs = main_card.select("div[data-test-id='rating-star-full']") stars = len(star_divs) profile_info = main_card.select_one("div[data-test-id='follower-count']") account_name_div = profile_info.select_one("div[data-test-id='creator-profile-name']") nested_divs = account_name_div.find_all("div") account_name = nested_divs[0].get("title") follower_count = profile_info.text.replace(account_name, "").replace(" followers", "") img_container = soup.select_one("div[data-test-id='pin-closeup-image']") img = img_container.find("img").get("src") pin_data = { "name": account_name, "website": website, "stars": stars, "follower_count": follower_count, "image": img } print(pin_data) success = True else: logger.warning(f"Failed Response: {response.status_code}") raise Exception(f"Failed Request, status code: {response.status_code}") except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {row['url']}") logger.warning(f"Retries left: {retries-tries}") tries += 1 if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['url']}") def process_results(csv_file, location, max_threads=5, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) for row in reader: 
process_pin(row, location, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "uk" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["grilling"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") scrape_search_results(keyword, LOCATION, data_pipeline=crawl_pipeline, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") for file in aggregate_files: process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
To hold the data we scrape from each pin, we add another dataclass, PinData. Just like SearchData, the job of PinData is simply to hold data. We then go ahead and pass it into a DataPipeline. Take a look, it's almost identical to SearchData.

@dataclass
class PinData:
    name: str = ""
    website: str = ""
    stars: int = 0
    follower_count: str = ""
    image: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())
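A quick sanity check of how those string fields behave (the values here are invented):

# PinData fills empty strings with defaults and strips stray whitespace.
pin = PinData(name="  Grill Masters  ", website="", stars=4, follower_count="12.5k", image="")
print(pin.name)     # "Grill Masters"
print(pin.website)  # "No website"
print(pin.image)    # "No image"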
import osimport csvimport requestsimport jsonimport loggingfrom urllib.parse import urlencodefrom bs4 import BeautifulSoupimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, "wait": 2000 } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" url: str = "" image: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) @dataclassclass PinData: name: str = "" website: str = "" stars: int = 0 follower_count: str = "" image: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://www.pinterest.com/search/pins/?q={formatted_keyword}&rs=typed" tries = 0 success = False while tries <= retries and not success: try: scrapeops_proxy_url = get_scrapeops_url(url, location=location) print(scrapeops_proxy_url) response = requests.get(scrapeops_proxy_url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code != 200: raise Exception(f"Failed request, Status Code {response.status_code}") ## Extract Data soup = BeautifulSoup(response.text, "html.parser") div_cards = soup.find_all("div") result_count = 0 for div_card in div_cards: if div_card.get("data-grid-item"): result_count += 1 title = div_card.text a_element = div_card.find("a") url = f"https://pinterest.com{a_element['href']}" img = div_card.find("img") img_url = img["src"] search_data = SearchData( name=title, url=url, image=img_url ) data_pipeline.add_data(search_data) success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") if not success: raise Exception(f"Max Retries exceeded: {retries}") def process_pin(row, location, retries=3): url = row["url"] tries = 0 success = False while tries <= retries and not success: response = requests.get(url) try: if response.status_code == 200: logger.info(f"Status: {response.status_code}") soup = BeautifulSoup(response.text, "html.parser") main_card = soup.select_one("div[data-test-id='CloseupDetails']") pin_pipeline = DataPipeline(csv_filename=f"{row['name'].replace(' ', '-')}.csv") website = "n/a" has_website = main_card.select_one("span[style='text-decoration: underline;']") if has_website: website = f"https://{has_website.text}" star_divs = main_card.select("div[data-test-id='rating-star-full']") stars = len(star_divs) profile_info = main_card.select_one("div[data-test-id='follower-count']") account_name_div = profile_info.select_one("div[data-test-id='creator-profile-name']") nested_divs = account_name_div.find_all("div") account_name = nested_divs[0].get("title") follower_count = profile_info.text.replace(account_name, "").replace(" followers", "") img_container = soup.select_one("div[data-test-id='pin-closeup-image']") img = img_container.find("img").get("src") pin_data = PinData( name=account_name, website=website, stars=stars, follower_count=follower_count, image=img ) pin_pipeline.add_data(pin_data) pin_pipeline.close_pipeline() success = True else: logger.warning(f"Failed Response: {response.status_code}") raise Exception(f"Failed Request, status code: {response.status_code}") except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {row['url']}") logger.warning(f"Retries left: {retries-tries}") tries += 1 if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['url']}") def process_results(csv_file, location, max_threads=5, retries=3): logger.info(f"processing 
{csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) for row in reader: process_pin(row, location, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "uk" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["grilling"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") scrape_search_results(keyword, LOCATION, data_pipeline=crawl_pipeline, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") for file in aggregate_files: process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
Inside process_pin(), we now open a new DataPipeline for our PinData: pin_pipeline = DataPipeline(csv_filename=f"{row['name'].replace(' ', '-')}.csv"). Instead of printing a dict, we create a PinData object out of it. We then add the PinData into our pipeline and then close the pipeline.
Next, we use ThreadPoolExecutor to add multithreading support to our scraper. Our MAX_THREADS constant will finally get used now.

def process_results(csv_file, location, max_threads=5, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
            executor.map(
                process_pin,
                reader,
                [location] * len(reader),
                [retries] * len(reader)
            )
Note the arguments we pass to executor.map():
process_pin is the function that we're calling to run on multiple threads.
reader is an array of dict objects that we read from the CSV file.
The remaining arguments, location and retries, get passed in as arrays the same length as reader.
To use the proxy inside process_pin(), we change the following line.

response = requests.get(get_scrapeops_url(url, location=location))
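Going back to executor.map() for a moment: if the argument broadcasting looks unfamiliar, this tiny standalone sketch (with toy data, nothing Pinterest-specific) shows the same pattern:

import concurrent.futures

def greet(row, location, retries):
    return f"{row['name']} / {location} / {retries}"

rows = [{"name": "a"}, {"name": "b"}, {"name": "c"}]
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    results = executor.map(greet, rows, ["us"] * len(rows), [3] * len(rows))
print(list(results))  # ['a / us / 3', 'b / us / 3', 'c / us / 3']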
import osimport csvimport requestsimport jsonimport loggingfrom urllib.parse import urlencodefrom bs4 import BeautifulSoupimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, "wait": 2000 } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" url: str = "" image: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) @dataclassclass PinData: name: str = "" website: str = "" stars: int = 0 follower_count: str = "" image: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") url = f"https://www.pinterest.com/search/pins/?q={formatted_keyword}&rs=typed" tries = 0 success = False while tries <= retries and not success: try: scrapeops_proxy_url = get_scrapeops_url(url, location=location) print(scrapeops_proxy_url) response = requests.get(scrapeops_proxy_url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code != 200: raise Exception(f"Failed request, Status Code {response.status_code}") ## Extract Data soup = BeautifulSoup(response.text, "html.parser") div_cards = soup.find_all("div") result_count = 0 for div_card in div_cards: if div_card.get("data-grid-item"): result_count += 1 title = div_card.text a_element = div_card.find("a") url = f"https://pinterest.com{a_element['href']}" img = div_card.find("img") img_url = img["src"] search_data = SearchData( name=title, url=url, image=img_url ) data_pipeline.add_data(search_data) success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") if not success: raise Exception(f"Max Retries exceeded: {retries}") def process_pin(row, location, retries=3): url = row["url"] tries = 0 success = False while tries <= retries and not success: response = requests.get(get_scrapeops_url(url, location=location)) try: if response.status_code == 200: logger.info(f"Status: {response.status_code}") soup = BeautifulSoup(response.text, "html.parser") main_card = soup.select_one("div[data-test-id='CloseupDetails']") pin_pipeline = DataPipeline(csv_filename=f"{row['name'][0:20].replace(' ', '-')}.csv") website = "n/a" has_website = main_card.select_one("span[style='text-decoration: underline;']") if has_website: website = f"https://{has_website.text}" star_divs = main_card.select("div[data-test-id='rating-star-full']") stars = len(star_divs) profile_info = main_card.select_one("div[data-test-id='follower-count']") account_name_div = profile_info.select_one("div[data-test-id='creator-profile-name']") nested_divs = account_name_div.find_all("div") account_name = nested_divs[0].get("title") follower_count = profile_info.text.replace(account_name, "").replace(" followers", "") img_container = soup.select_one("div[data-test-id='pin-closeup-image']") img = img_container.find("img").get("src") pin_data = PinData( name=account_name, website=website, stars=stars, follower_count=follower_count, image=img ) pin_pipeline.add_data(pin_data) pin_pipeline.close_pipeline() success = True else: logger.warning(f"Failed Response: {response.status_code}") raise Exception(f"Failed Request, status code: {response.status_code}") except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {row['url']}") logger.warning(f"Retries left: {retries-tries}") tries += 1 if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['url']}") def process_results(csv_file, location, 
max_threads=5, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( process_pin, reader, [location] * len(reader), [retries] * len(reader) ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "uk" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["grilling"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") scrape_search_results(keyword, LOCATION, data_pipeline=crawl_pipeline, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") for file in aggregate_files: process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
Once again, everything gets run from main, and feel free to change any constant you want.

if __name__ == "__main__":

    MAX_RETRIES = 3
    MAX_THREADS = 5
    LOCATION = "uk"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["grilling"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")

        crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv")
        scrape_search_results(keyword, LOCATION, data_pipeline=crawl_pipeline, retries=MAX_RETRIES)
        crawl_pipeline.close_pipeline()
        aggregate_files.append(f"{filename}.csv")
    logger.info(f"Crawl complete.")

    for file in aggregate_files:
        process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
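Once a run finishes, you can spot-check the crawl file straight from Python. This assumes the grilling.csv produced by the run above:

import csv

# Print what the crawler stored for each result.
with open("grilling.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(row["name"], row["url"], row["image"])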
When you scrape Pinterest, you are subject to their Terms of Service and robots.txt. You can view Pinterest's terms here. If you access private data on their site in a way that violates these terms, you can even lose your Pinterest account! You can view their robots.txt here.
Also, keep in mind whether you are scraping public data. Private data (data behind a login) can often be illegal to scrape. Generally, public data (data not behind a login) is public information and therefore fair game when scraping. If you are unsure of the legality of your scraper, it is best to consult an attorney based in your jurisdiction.
You now know how to crawl and scrape Pinterest with requests and beautifulsoup.
Then check out ScrapeOps, the complete toolkit for web scraping. To run the Selenium version of this scraper, create a
config.json
file with your API key and place it in the same folder as this file and you're ready to go!import osimport csvimport jsonimport loggingfrom urllib.parse import urlencodeimport concurrent.futuresfrom selenium import webdriverfrom selenium.webdriver.common.by import Byfrom dataclasses import dataclass, field, fields, asdictfrom time import sleep OPTIONS = webdriver.ChromeOptions() prefs = { "profile.managed_default_content_settings.javascript": 2}OPTIONS.add_experimental_option("prefs", prefs) user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"OPTIONS.add_argument(f"useragent={user_agent}") API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, "wait": 2000 } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" url: str = "" image: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) @dataclassclass PinData: name: str = "" website: str = "" stars: int = 0 follower_count: str = "" image: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") tries = 0 success = False while tries <= retries and not success: url = f"https://www.pinterest.com/search/pins/?q={formatted_keyword}&rs=typed" driver = webdriver.Chrome(options=OPTIONS) driver.set_page_load_timeout(30) driver.implicitly_wait(10) try: scrapeops_proxy_url = get_scrapeops_url(url, location=location) driver.get(scrapeops_proxy_url) logger.info(f"Fetched {url}") ## Extract Data div_cards = driver.find_elements(By.CSS_SELECTOR, "div") print("found div cards:", len(div_cards)) for div_card in div_cards: is_card = div_card.get_attribute("data-grid-item") if is_card: a_element = div_card.find_element(By.CSS_SELECTOR, "a") title = a_element.get_attribute("aria-label") href = a_element.get_attribute("href").replace("https://proxy.scrapeops.io", "") url = f"https://pinterest.com{href}" img = div_card.find_element(By.CSS_SELECTOR, "img") img_url = img.get_attribute("src") search_data = SearchData( name=title, url=url, image=img_url ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") def process_pin(row, location, retries=3): url = row["url"] tries = 0 success = False while tries <= retries and not success: driver = webdriver.Chrome(options=OPTIONS) driver.get(get_scrapeops_url(url, location=location)) try: main_card = driver.find_element(By.CSS_SELECTOR, "div[data-test-id='CloseupDetails']") pin_pipeline = DataPipeline(csv_filename=f"{row['name'][0:20].replace(' ', '-')}.csv") website = "n/a" website_holder = main_card.find_elements(By.CSS_SELECTOR, "span[style='text-decoration: underline;']") has_website = len(website_holder) > 0 if has_website: website = f"https://{website_holder[0].text}" star_divs = main_card.find_elements(By.CSS_SELECTOR, "div[data-test-id='rating-star-full']") stars = len(star_divs) profile_info = main_card.find_element(By.CSS_SELECTOR, "div[data-test-id='follower-count']") account_name_div = profile_info.find_element(By.CSS_SELECTOR, "div[data-test-id='creator-profile-name']") nested_divs = account_name_div.find_elements(By.CSS_SELECTOR, "div") account_name = nested_divs[0].get_attribute("title") follower_count = profile_info.text.replace(account_name, "").replace(" followers", "") img = "n/a" img_container = driver.find_elements(By.CSS_SELECTOR, "div[data-test-id='pin-closeup-image']") if len(img_container) > 0: img = img_container[0].find_element(By.CSS_SELECTOR, "img").get_attribute("src") pin_data = PinData( name=account_name, website=website, stars=stars, follower_count=follower_count, image=img ) pin_pipeline.add_data(pin_data) pin_pipeline.close_pipeline() success = True except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {row['url']}") logger.warning(f"Retries 
left: {retries-tries}") tries += 1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['url']}") def process_results(csv_file, location, max_threads=5, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( process_pin, reader, [location] * len(reader), [retries] * len(reader) ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "us" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["grilling"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") scrape_search_results(keyword, LOCATION, data_pipeline=crawl_pipeline, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") for file in aggregate_files: process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
Feel free to change any of the following constants to fit your needs:
MAX_RETRIES: This parameter sets the maximum number of attempts the script will make to fetch data from a URL if the initial request fails.
MAX_THREADS: This parameter sets the maximum number of threads to use for processing results concurrently. This can speed up the processing of multiple pins or search results.
LOCATION: This parameter sets the geographical location from which the requests are made. It can affect the content returned by the website due to region-specific restrictions or differences.
keyword_list: This list contains the keywords for which you want to scrape Pinterest search results.
When we search for "grilling", our URL looks like this:
https://www.pinterest.com/search/pins/?q=grilling&rs=typed
Our base URL is https://www.pinterest.com/search/pins/. The ? character denotes our queries, which get separated by the & if we have multiple queries. In this case, our query values are grilling and typed. The full query string is ?q=grilling&rs=typed.
Individual pins have URLs like this one: https://www.pinterest.com/pin/45176802505307132/. The pin number here is 45176802505307132. For any pin on Pinterest, the URL gets laid out like this:
https://www.pinterest.com/pin/PIN-NUMBER-GOES-HERE/
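If you ever need to rebuild a pin's URL from its number, it's just string formatting; the number below is the one from the example above:

# Build a pin URL from a pin number.
pin_number = "45176802505307132"
pin_url = f"https://www.pinterest.com/pin/{pin_number}/"
print(pin_url)  # https://www.pinterest.com/pin/45176802505307132/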
To control our location, we pass the country parameter into the ScrapeOps API. If you pass "us" in as your country, you'll be routed through a server in the US. To appear in the UK, you would pass "uk".
Now, let's get the project set up. Create a new project folder and move into it:
mkdir pinterest-scraper
cd pinterest-scraper
python -m venv venv
source venv/bin/activate
pip install selenium
You'll also need a config.json. Simply create this file and add your API key to it. The entire config file should look like this:

{
    "api_key": "YOUR-SUPER-SECRET-API-KEY"
}
import os
import csv
import json
import logging
from urllib.parse import urlencode
import concurrent.futures
from selenium import webdriver
from selenium.webdriver.common.by import By
from dataclasses import dataclass, field, fields, asdict
from time import sleep

OPTIONS = webdriver.ChromeOptions()

prefs = {
    "profile.managed_default_content_settings.javascript": 2
}
OPTIONS.add_experimental_option("prefs", prefs)

user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
OPTIONS.add_argument(f"useragent={user_agent}")

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]

## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def scrape_search_results(keyword, location, data_pipeline=None, retries=3):
    formatted_keyword = keyword.replace(" ", "+")
    tries = 0
    success = False

    while tries <= retries and not success:
        url = f"https://www.pinterest.com/search/pins/?q={formatted_keyword}&rs=typed"
        driver = webdriver.Chrome(options=OPTIONS)
        driver.set_page_load_timeout(30)
        driver.implicitly_wait(10)
        try:
            driver.get(url)
            logger.info(f"Fetched {url}")

            ## Extract Data
            div_cards = driver.find_elements(By.CSS_SELECTOR, "div")
            print("found div cards:", len(div_cards))

            for div_card in div_cards:
                is_card = div_card.get_attribute("data-grid-item")
                if is_card:
                    a_element = div_card.find_element(By.CSS_SELECTOR, "a")
                    title = a_element.get_attribute("aria-label")
                    href = a_element.get_attribute("href").replace("https://proxy.scrapeops.io", "")
                    url = f"https://pinterest.com{href}"
                    img = div_card.find_element(By.CSS_SELECTOR, "img")
                    img_url = img.get_attribute("src")

                    search_data = {
                        "name": title,
                        "url": url,
                        "image": img_url
                    }
                    print(search_data)

            logger.info(f"Successfully parsed data from: {url}")
            success = True

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")
        finally:
            driver.quit()

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")


if __name__ == "__main__":

    MAX_RETRIES = 3
    MAX_THREADS = 5
    LOCATION = "us"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    keyword_list = ["grilling"]
    aggregate_files = []

    ## Job Processes
    for keyword in keyword_list:
        filename = keyword.replace(" ", "-")

        scrape_search_results(keyword, LOCATION, retries=MAX_RETRIES)
    logger.info(f"Crawl complete.")
is_card = div_card.get_attribute("data-grid-item")
determines whether or not each div is a search result. All search results contain the attribute data-grid-item
. We pull the pin title from a_element.get_attribute("aria-label") and the link from a_element.get_attribute("href").replace("https://proxy.scrapeops.io", ""). We then replace the ScrapeOps URL with Pinterest's URL. Once everything has been extracted, we set success to True and exit the function.
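As an aside, since every result card carries the data-grid-item attribute, you could likely target the cards directly with a CSS attribute selector instead of scanning every div on the page. Here's a rough sketch of that idea (it reuses the OPTIONS object from the script above and skips the proxy purely for brevity):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome(options=OPTIONS)  # OPTIONS configured as in the script above
driver.set_page_load_timeout(30)
driver.implicitly_wait(10)
driver.get("https://www.pinterest.com/search/pins/?q=grilling&rs=typed")

# Match only the result cards instead of every div on the page
for div_card in driver.find_elements(By.CSS_SELECTOR, "div[data-grid-item]"):
    a_element = div_card.find_element(By.CSS_SELECTOR, "a")
    print(a_element.get_attribute("aria-label"))

driver.quit()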
Next, we add a SearchData class and a DataPipeline class. SearchData simply takes our data and turns it into a uniform object that holds it. Once we've created a SearchData object, we can then pass it into the DataPipeline
which filters out our duplicates and saves all of our relevant information to a CSV file.import osimport csvimport jsonimport loggingfrom urllib.parse import urlencodeimport concurrent.futuresfrom selenium import webdriverfrom selenium.webdriver.common.by import Byfrom dataclasses import dataclass, field, fields, asdictfrom time import sleep OPTIONS = webdriver.ChromeOptions() prefs = { "profile.managed_default_content_settings.javascript": 2}OPTIONS.add_experimental_option("prefs", prefs) user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"OPTIONS.add_argument(f"useragent={user_agent}") API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" url: str = "" image: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") tries = 0 success = False while tries <= retries and not success: url = f"https://www.pinterest.com/search/pins/?q={formatted_keyword}&rs=typed" driver = webdriver.Chrome(options=OPTIONS) driver.set_page_load_timeout(30) driver.implicitly_wait(10) try: driver.get(url) logger.info(f"Fetched {url}") ## Extract Data div_cards = driver.find_elements(By.CSS_SELECTOR, "div") print("found div cards:", len(div_cards)) for div_card in div_cards: is_card = div_card.get_attribute("data-grid-item") if is_card: a_element = div_card.find_element(By.CSS_SELECTOR, "a") title = a_element.get_attribute("aria-label") href = a_element.get_attribute("href").replace("https://proxy.scrapeops.io", "") url = f"https://pinterest.com{href}" img = div_card.find_element(By.CSS_SELECTOR, "img") img_url = img.get_attribute("src") search_data = SearchData( name=title, url=url, image=img_url ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "us" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["grilling"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") scrape_search_results(keyword, LOCATION, data_pipeline=crawl_pipeline, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.")
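As a quick aside, if you ever want to sanity check SearchData and the DataPipeline on their own before running a full crawl, a minimal sketch with dummy values (assuming both classes from the script above are in scope) looks like this:

# Assumes SearchData and DataPipeline from the script above are in scope
pipeline = DataPipeline(csv_filename="test-output.csv")

item = SearchData(name="Test Pin", url="https://pinterest.com/pin/123/", image="")
# Empty strings get replaced with "No <field>" by check_string_fields()
print(item.image)  # No image

pipeline.add_data(item)
pipeline.add_data(item)   # second add is logged as a duplicate name and dropped
pipeline.close_pipeline() # flushes whatever is left in the queue out to the CSV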
Instead of turning search_data into a dict, we use it to build a SearchData object. After building search_data, we pass it into the data_pipeline. At the end of scrape_search_results(), we close the pipeline. Also pay attention to our ChromeOptions
. In the prefs, you should see "profile.managed_default_content_settings.javascript": 2. This turns off JavaScript support.

OPTIONS = webdriver.ChromeOptions()

prefs = {
    "profile.managed_default_content_settings.javascript": 2
}
OPTIONS.add_experimental_option("prefs", prefs)

user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
OPTIONS.add_argument(f"useragent={user_agent}")
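One thing worth double-checking on your end: Chrome's documented command-line switch for overriding the user agent is --user-agent=..., so the useragent= argument above may simply be ignored by some Chrome/ChromeDriver builds. If you want to be explicit about it, a variant with only the flag name changed might look like this:

from selenium import webdriver

OPTIONS = webdriver.ChromeOptions()
OPTIONS.add_experimental_option("prefs", {
    "profile.managed_default_content_settings.javascript": 2  # disable JavaScript
})
OPTIONS.add_argument(
    "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
)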
Our proxy function stays the same:

def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location,
        "wait": 2000
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url
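For reference, here's roughly what a call to this function produces; the api_key value is pulled from config.json, so it's abbreviated here:

proxied_url = get_scrapeops_url(
    "https://www.pinterest.com/search/pins/?q=grilling&rs=typed",
    location="uk"
)
print(proxied_url)
# https://proxy.scrapeops.io/v1/?api_key=...&url=https%3A%2F%2Fwww.pinterest.com%2Fsearch%2Fpins%2F%3Fq%3Dgrilling%26rs%3Dtyped&country=uk&wait=2000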
import osimport csvimport jsonimport loggingfrom urllib.parse import urlencodeimport concurrent.futuresfrom selenium import webdriverfrom selenium.webdriver.common.by import Byfrom dataclasses import dataclass, field, fields, asdictfrom time import sleep OPTIONS = webdriver.ChromeOptions() prefs = { "profile.managed_default_content_settings.javascript": 2}OPTIONS.add_experimental_option("prefs", prefs) user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"OPTIONS.add_argument(f"useragent={user_agent}") API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, "wait": 2000 } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" url: str = "" image: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") tries = 0 success = False while tries <= retries and not success: url = f"https://www.pinterest.com/search/pins/?q={formatted_keyword}&rs=typed" driver = webdriver.Chrome(options=OPTIONS) driver.set_page_load_timeout(30) driver.implicitly_wait(10) try: scrapeops_proxy_url = get_scrapeops_url(url, location=location) driver.get(scrapeops_proxy_url) logger.info(f"Fetched {url}") ## Extract Data div_cards = driver.find_elements(By.CSS_SELECTOR, "div") print("found div cards:", len(div_cards)) for div_card in div_cards: is_card = div_card.get_attribute("data-grid-item") if is_card: a_element = div_card.find_element(By.CSS_SELECTOR, "a") title = a_element.get_attribute("aria-label") href = a_element.get_attribute("href").replace("https://proxy.scrapeops.io", "") url = f"https://pinterest.com{href}" img = div_card.find_element(By.CSS_SELECTOR, "img") img_url = img.get_attribute("src") search_data = SearchData( name=title, url=url, image=img_url ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "us" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["grilling"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") scrape_search_results(keyword, LOCATION, data_pipeline=crawl_pipeline, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.")
main
.if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "us" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["grilling"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") scrape_search_results(keyword, LOCATION, data_pipeline=crawl_pipeline, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.")
def process_pin(row, location, retries=3): url = row["url"] tries = 0 success = False while tries <= retries and not success: driver = webdriver.Chrome(options=OPTIONS) driver.get(url) try: main_card = driver.find_element(By.CSS_SELECTOR, "div[data-test-id='CloseupDetails']") website = "n/a" website_holder = main_card.find_elements(By.CSS_SELECTOR, "span[style='text-decoration: underline;']") has_website = len(website_holder) > 0 if has_website: website = f"https://{website_holder[0].text}" star_divs = main_card.find_elements(By.CSS_SELECTOR, "div[data-test-id='rating-star-full']") stars = len(star_divs) profile_info = main_card.find_element(By.CSS_SELECTOR, "div[data-test-id='follower-count']") account_name_div = profile_info.find_element(By.CSS_SELECTOR, "div[data-test-id='creator-profile-name']") nested_divs = account_name_div.find_elements(By.CSS_SELECTOR, "div") account_name = nested_divs[0].get_attribute("title") follower_count = profile_info.text.replace(account_name, "").replace(" followers", "") img = "n/a" img_container = driver.find_elements(By.CSS_SELECTOR, "div[data-test-id='pin-closeup-image']") if len(img_container) > 0: img = img_container[0].find_element(By.CSS_SELECTOR, "img").get_attribute("src") pin_data = { "name": account_name, "website": website, "stars": stars, "follower_count": follower_count, "image": img } print(pin_data) success = True except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {row['url']}") logger.warning(f"Retries left: {retries-tries}") tries += 1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['url']}")
We use find_elements() to look for website_holder. If the length of website_holder is greater than zero, there is a website present, so we reassign the website from "n/a" to the actual website. We then collect star_divs. Each star is a unique element on the page, so we can obtain the rating by counting these elements. We also pull the follower_count, account_name and the image of the pin.
Next, we use csv.DictReader() to read the file into an array. Each row from the file then gets passed into process_pin(). For the moment, we do this with a simple for loop as a placeholder.

def process_results(csv_file, location, max_threads=5, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))
        for row in reader:
            process_pin(row, location, retries=retries)
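Each row that csv.DictReader() hands back is just a dict keyed by the CSV header, so the rows read from the crawl file look roughly like the comment below (the values shown are made up for illustration):

import csv

with open("grilling.csv", newline="") as file:
    reader = list(csv.DictReader(file))

# Each row is a dict with the SearchData fields as keys, e.g.
# {"name": "...", "url": "https://pinterest.com/pin/...", "image": "https://i.pinimg.com/..."}
for row in reader:
    print(row["url"])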
import osimport csvimport jsonimport loggingfrom urllib.parse import urlencodeimport concurrent.futuresfrom selenium import webdriverfrom selenium.webdriver.common.by import Byfrom dataclasses import dataclass, field, fields, asdictfrom time import sleep OPTIONS = webdriver.ChromeOptions() prefs = { "profile.managed_default_content_settings.javascript": 2}OPTIONS.add_experimental_option("prefs", prefs) user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"OPTIONS.add_argument(f"useragent={user_agent}") API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, "wait": 2000 } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) print(proxy_url) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" url: str = "" image: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") tries = 0 success = False while tries <= retries and not success: url = f"https://www.pinterest.com/search/pins/?q={formatted_keyword}&rs=typed" driver = webdriver.Chrome(options=OPTIONS) driver.set_page_load_timeout(30) driver.implicitly_wait(10) try: scrapeops_proxy_url = get_scrapeops_url(url, location=location) driver.get(scrapeops_proxy_url) logger.info(f"Fetched {url}") ## Extract Data div_cards = driver.find_elements(By.CSS_SELECTOR, "div") print("found div cards:", len(div_cards)) for div_card in div_cards: is_card = div_card.get_attribute("data-grid-item") if is_card: a_element = div_card.find_element(By.CSS_SELECTOR, "a") title = a_element.get_attribute("aria-label") href = a_element.get_attribute("href").replace("https://proxy.scrapeops.io", "") url = f"https://pinterest.com{href}" img = div_card.find_element(By.CSS_SELECTOR, "img") img_url = img.get_attribute("src") search_data = SearchData( name=title, url=url, image=img_url ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") def process_pin(row, location, retries=3): url = row["url"] tries = 0 success = False while tries <= retries and not success: driver = webdriver.Chrome(options=OPTIONS) driver.get(url) try: main_card = driver.find_element(By.CSS_SELECTOR, "div[data-test-id='CloseupDetails']") website = "n/a" website_holder = main_card.find_elements(By.CSS_SELECTOR, "span[style='text-decoration: underline;']") has_website = len(website_holder) > 0 if has_website: website = f"https://{website_holder[0].text}" star_divs = main_card.find_elements(By.CSS_SELECTOR, "div[data-test-id='rating-star-full']") stars = len(star_divs) profile_info = main_card.find_element(By.CSS_SELECTOR, "div[data-test-id='follower-count']") account_name_div = profile_info.find_element(By.CSS_SELECTOR, "div[data-test-id='creator-profile-name']") nested_divs = account_name_div.find_elements(By.CSS_SELECTOR, "div") account_name = nested_divs[0].get_attribute("title") follower_count = profile_info.text.replace(account_name, "").replace(" followers", "") img = "n/a" img_container = driver.find_elements(By.CSS_SELECTOR, "div[data-test-id='pin-closeup-image']") if len(img_container) > 0: img = img_container[0].find_element(By.CSS_SELECTOR, "img").get_attribute("src") pin_data = { "name": account_name, "website": website, "stars": stars, "follower_count": follower_count, "image": img } print(pin_data) success = True except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {row['url']}") logger.warning(f"Retries left: {retries-tries}") tries += 1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully 
parsed: {row['url']}") def process_results(csv_file, location, max_threads=5, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) for row in reader: process_pin(row, location, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "us" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["grilling"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") scrape_search_results(keyword, LOCATION, data_pipeline=crawl_pipeline, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") for file in aggregate_files: process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
To store the data we scrape from each pin, we add another dataclass, PinData. Take a look below; our PinData is actually very similar to SearchData. This object will even be passed into the DataPipeline the same way.

@dataclass
class PinData:
    name: str = ""
    website: str = ""
    stars: int = 0
    follower_count: str = ""
    image: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())
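In the full script below, each pin is written to its own small CSV named after the first 20 characters of the pin's name. If you want to see exactly what that filename expression does, here's a tiny sketch with a made-up row:

# Hypothetical row, shaped like one line of the crawl CSV
row = {"name": "Grilled chicken skewers with lemon", "url": "https://pinterest.com/pin/...", "image": "..."}

# The same truncate-and-replace logic used when building the per-pin pipeline
csv_filename = f"{row['name'][0:20].replace(' ', '-')}.csv"
print(csv_filename)  # Grilled-chicken-skew.csv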
import osimport csvimport jsonimport loggingfrom urllib.parse import urlencodeimport concurrent.futuresfrom selenium import webdriverfrom selenium.webdriver.common.by import Byfrom dataclasses import dataclass, field, fields, asdictfrom time import sleep OPTIONS = webdriver.ChromeOptions() prefs = { "profile.managed_default_content_settings.javascript": 2}OPTIONS.add_experimental_option("prefs", prefs) user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"OPTIONS.add_argument(f"useragent={user_agent}") API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, "wait": 2000 } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) print(proxy_url) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" url: str = "" image: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) @dataclassclass PinData: name: str = "" website: str = "" stars: int = 0 follower_count: str = "" image: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") tries = 0 success = False while tries <= retries and not success: url = f"https://www.pinterest.com/search/pins/?q={formatted_keyword}&rs=typed" driver = webdriver.Chrome(options=OPTIONS) driver.set_page_load_timeout(30) driver.implicitly_wait(10) try: scrapeops_proxy_url = get_scrapeops_url(url, location=location) driver.get(scrapeops_proxy_url) logger.info(f"Fetched {url}") ## Extract Data div_cards = driver.find_elements(By.CSS_SELECTOR, "div") print("found div cards:", len(div_cards)) for div_card in div_cards: is_card = div_card.get_attribute("data-grid-item") if is_card: a_element = div_card.find_element(By.CSS_SELECTOR, "a") title = a_element.get_attribute("aria-label") href = a_element.get_attribute("href").replace("https://proxy.scrapeops.io", "") url = f"https://pinterest.com{href}" img = div_card.find_element(By.CSS_SELECTOR, "img") img_url = img.get_attribute("src") search_data = SearchData( name=title, url=url, image=img_url ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") def process_pin(row, location, retries=3): url = row["url"] tries = 0 success = False while tries <= retries and not success: driver = webdriver.Chrome(options=OPTIONS) driver.get(url) try: main_card = driver.find_element(By.CSS_SELECTOR, "div[data-test-id='CloseupDetails']") pin_pipeline = DataPipeline(csv_filename=f"{row['name'][0:20].replace(' ', '-')}.csv") website = "n/a" website_holder = main_card.find_elements(By.CSS_SELECTOR, "span[style='text-decoration: underline;']") has_website = len(website_holder) > 0 if has_website: website = f"https://{website_holder[0].text}" star_divs = main_card.find_elements(By.CSS_SELECTOR, "div[data-test-id='rating-star-full']") stars = len(star_divs) profile_info = main_card.find_element(By.CSS_SELECTOR, "div[data-test-id='follower-count']") account_name_div = profile_info.find_element(By.CSS_SELECTOR, "div[data-test-id='creator-profile-name']") nested_divs = account_name_div.find_elements(By.CSS_SELECTOR, "div") account_name = nested_divs[0].get_attribute("title") follower_count = profile_info.text.replace(account_name, "").replace(" followers", "") img = "n/a" img_container = driver.find_elements(By.CSS_SELECTOR, "div[data-test-id='pin-closeup-image']") if len(img_container) > 0: img = img_container[0].find_element(By.CSS_SELECTOR, "img").get_attribute("src") pin_data = PinData( name=account_name, website=website, stars=stars, follower_count=follower_count, image=img ) pin_pipeline.add_data(pin_data) pin_pipeline.close_pipeline() success = True except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {row['url']}") logger.warning(f"Retries left: {retries-tries}") tries += 1 
finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['url']}") def process_results(csv_file, location, max_threads=5, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) for row in reader: process_pin(row, location, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "us" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["grilling"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") scrape_search_results(keyword, LOCATION, data_pipeline=crawl_pipeline, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") for file in aggregate_files: process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
Inside process_pin(), we now instantiate a DataPipeline object. Instead of printing a dict, we build a PinData object and pass that pin_data variable into the data_pipeline. To finish things off, we add concurrency with ThreadPoolExecutor. Here we make a simple, but big change to process_results().

def process_results(csv_file, location, max_threads=5, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
            executor.map(
                process_pin,
                reader,
                [location] * len(reader),
                [retries] * len(reader)
            )
Pay attention to the arguments we pass into executor.map():
process_pin is the function we wish to run on multiple threads.
reader is the array of objects we want to pass into the function.
location and our retries get passed in as arrays.
To plug our get_scrapeops_url() function back in, we just need to add it into one line:
driver.get(get_scrapeops_url(url, location=location))
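If the [location] * len(reader) pattern looks strange, remember that executor.map() walks all of its iterables in lockstep, handing one element from each iterable to every call. Here's a toy example of that behavior, unrelated to the scraper itself:

import concurrent.futures

def process(row, location, retries):
    return f"{row} / {location} / {retries}"

rows = ["pin-1", "pin-2", "pin-3"]

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    results = executor.map(
        process,
        rows,
        ["us"] * len(rows),  # the same location, repeated once per row
        [3] * len(rows)      # the same retry count, repeated once per row
    )
    for result in results:
        print(result)
# pin-1 / us / 3
# pin-2 / us / 3
# pin-3 / us / 3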
import osimport csvimport jsonimport loggingfrom urllib.parse import urlencodeimport concurrent.futuresfrom selenium import webdriverfrom selenium.webdriver.common.by import Byfrom dataclasses import dataclass, field, fields, asdictfrom time import sleep OPTIONS = webdriver.ChromeOptions() prefs = { "profile.managed_default_content_settings.javascript": 2}OPTIONS.add_experimental_option("prefs", prefs) user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"OPTIONS.add_argument(f"useragent={user_agent}") API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, "wait": 2000 } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str = "" url: str = "" image: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) @dataclassclass PinData: name: str = "" website: str = "" stars: int = 0 follower_count: str = "" image: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_search_results(keyword, location, data_pipeline=None, retries=3): formatted_keyword = keyword.replace(" ", "+") tries = 0 success = False while tries <= retries and not success: url = f"https://www.pinterest.com/search/pins/?q={formatted_keyword}&rs=typed" driver = webdriver.Chrome(options=OPTIONS) driver.set_page_load_timeout(30) driver.implicitly_wait(10) try: scrapeops_proxy_url = get_scrapeops_url(url, location=location) driver.get(scrapeops_proxy_url) logger.info(f"Fetched {url}") ## Extract Data div_cards = driver.find_elements(By.CSS_SELECTOR, "div") print("found div cards:", len(div_cards)) for div_card in div_cards: is_card = div_card.get_attribute("data-grid-item") if is_card: a_element = div_card.find_element(By.CSS_SELECTOR, "a") title = a_element.get_attribute("aria-label") href = a_element.get_attribute("href").replace("https://proxy.scrapeops.io", "") url = f"https://pinterest.com{href}" img = div_card.find_element(By.CSS_SELECTOR, "img") img_url = img.get_attribute("src") search_data = SearchData( name=title, url=url, image=img_url ) data_pipeline.add_data(search_data) logger.info(f"Successfully parsed data from: {url}") success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") def process_pin(row, location, retries=3): url = row["url"] tries = 0 success = False while tries <= retries and not success: driver = webdriver.Chrome(options=OPTIONS) driver.get(get_scrapeops_url(url, location=location)) try: main_card = driver.find_element(By.CSS_SELECTOR, "div[data-test-id='CloseupDetails']") pin_pipeline = DataPipeline(csv_filename=f"{row['name'][0:20].replace(' ', '-')}.csv") website = "n/a" website_holder = main_card.find_elements(By.CSS_SELECTOR, "span[style='text-decoration: underline;']") has_website = len(website_holder) > 0 if has_website: website = f"https://{website_holder[0].text}" star_divs = main_card.find_elements(By.CSS_SELECTOR, "div[data-test-id='rating-star-full']") stars = len(star_divs) profile_info = main_card.find_element(By.CSS_SELECTOR, "div[data-test-id='follower-count']") account_name_div = profile_info.find_element(By.CSS_SELECTOR, "div[data-test-id='creator-profile-name']") nested_divs = account_name_div.find_elements(By.CSS_SELECTOR, "div") account_name = nested_divs[0].get_attribute("title") follower_count = profile_info.text.replace(account_name, "").replace(" followers", "") img = "n/a" img_container = driver.find_elements(By.CSS_SELECTOR, "div[data-test-id='pin-closeup-image']") if len(img_container) > 0: img = img_container[0].find_element(By.CSS_SELECTOR, "img").get_attribute("src") pin_data = PinData( name=account_name, website=website, stars=stars, follower_count=follower_count, image=img ) pin_pipeline.add_data(pin_data) pin_pipeline.close_pipeline() success = True except Exception as e: logger.error(f"Exception thrown: {e}") logger.warning(f"Failed to process page: {row['url']}") logger.warning(f"Retries 
left: {retries-tries}") tries += 1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") else: logger.info(f"Successfully parsed: {row['url']}") def process_results(csv_file, location, max_threads=5, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( process_pin, reader, [location] * len(reader), [retries] * len(reader) ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "us" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["grilling"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") scrape_search_results(keyword, LOCATION, data_pipeline=crawl_pipeline, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") for file in aggregate_files: process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
main
again and tweak whatever constants you'd like.if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "us" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape keyword_list = ["grilling"] aggregate_files = [] ## Job Processes for keyword in keyword_list: filename = keyword.replace(" ", "-") crawl_pipeline = DataPipeline(csv_filename=f"{filename}.csv") scrape_search_results(keyword, LOCATION, data_pipeline=crawl_pipeline, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() aggregate_files.append(f"{filename}.csv") logger.info(f"Crawl complete.") for file in aggregate_files: process_results(file, LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
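For example, if you wanted to crawl a couple of searches through a UK-based server, you might adjust only the constants ("smoked brisket" here is just an example keyword):

MAX_RETRIES = 3
MAX_THREADS = 5
LOCATION = "uk"

## INPUT ---> List of keywords to scrape
keyword_list = ["grilling", "smoked brisket"]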
When scraping Pinterest, you need to pay attention to their Terms of Service and their robots.txt. You can view Pinterest's terms here. If you access private data on their site in a way that violates these terms, you can even lose your Pinterest account! You can view their robots.txt here.
Also, keep in mind whether you are scraping public data. Private data (data behind a login) can often be illegal to scrape. Generally, public data (data not behind a login) is public information and therefore fair game when scraping. If you are unsure of the legality of your scraper, it is best to consult an attorney based in your jurisdiction.
Then check out ScrapeOps, the complete toolkit for web scraping.
To follow along, you'll need a config.json
file with your API key.const puppeteer = require('puppeteer');const createCsvWriter = require('csv-writer').createObjectCsvWriter;const csvParse = require('csv-parse');const fs = require('fs'); const API_KEY = JSON.parse(fs.readFileSync('config.json')).api_key; async function writeToCsv(data, outputFile) { if (!data || data.length === 0) { throw new Error('No data to write!'); } const fileExists = fs.existsSync(outputFile); const headers = Object.keys(data[0]).map((key) => ({ id: key, title: key })); const csvWriter = createCsvWriter({ path: outputFile, header: headers, append: fileExists, }); try { await csvWriter.writeRecords(data); } catch (e) { throw new Error('Failed to write to csv'); }} async function readCsv(inputFile) { const results = []; const parser = fs.createReadStream(inputFile).pipe( csvParse.parse({ columns: true, delimiter: ',', trim: true, skip_empty_lines: true, }) ); for await (const record of parser) { results.push(record); } return results;} function getScrapeOpsUrl(url, location = 'us') { const params = new URLSearchParams({ api_key: API_KEY, url: url, country: location, wait: 3000, residential: true, }); return `https://proxy.scrapeops.io/v1/?${params.toString()}`;} async function scrapeSearchResults( browser, keyword, location = 'us', retries = 3) { let tries = 0; let success = false; while (tries <= retries && !success) { const formattedKeyword = keyword.replace(' ', '+'); const page = await browser.newPage(); await page.setJavaScriptEnabled(false); try { const url = `https://www.pinterest.com/search/pins/?q=${formattedKeyword}&rs=typed`; const proxyUrl = getScrapeOpsUrl(url, location); await page.goto(proxyUrl); console.log(`Successfully fetched: ${url}`); const divCards = await page.$$("div[data-grid-item='true']"); for (const divCard of divCards) { const aElement = await divCard.$('a'); const name = await page.evaluate( (element) => element.getAttribute('aria-label'), aElement ); const href = await page.evaluate( (element) => element.getAttribute('href'), aElement ); const imgElement = await divCard.$('img'); const imgLink = await page.evaluate( (element) => element.getAttribute('src'), imgElement ); const searchData = { name: name, url: `https://www.pinterest.com${href.replace('https://proxy.scrapeops.io', '')}`, image: imgLink, }; await writeToCsv([searchData], `${keyword.replace(' ', '-')}.csv`); } success = true; } catch (err) { console.log(`Error: ${err}, tries left ${retries - tries}`); tries++; } finally { await page.close(); } }} async function startScrape(keyword, location, retries) { const browser = await puppeteer.launch(); await scrapeSearchResults(browser, keyword, location, retries); await browser.close();} async function processPin(browser, row, location, retries = 3) { const url = row.url; let tries = 0; let success = false; while (tries <= retries && !success) { const page = await browser.newPage(); await page.setExtraHTTPHeaders({ 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36', }); try { await page.goto(getScrapeOpsUrl(url, location), { timeout: 60000 }); const mainCard = await page.$("div[data-test-id='CloseupDetails']"); if (!mainCard) { throw new Error('Failed to load the page!'); } let website = 'n/a'; const websiteHolder = await page.$( "span[style='text-decoration: underline;']" ); if (websiteHolder) { website = await page.evaluate( (element) => element.textContent, websiteHolder ); } const starDivs = await page.$$("div[data-test-id='rating-star-full']"); 
const stars = starDivs.length; const profileInfoDiv = await mainCard.$( "div[data-test-id='follower-count']" ); if (profileInfoDiv === null) { throw new Error('Page failed to loaded, most likely blocked!'); } const profileText = await page.evaluate( (element) => element.textContent, profileInfoDiv ); const accountNameDiv = await profileInfoDiv.$( "div[data-test-id='creator-profile-name']" ); const nestedDiv = await accountNameDiv.$('div'); const accountName = await page.evaluate( (element) => element.getAttribute('title'), nestedDiv ); const followerCount = profileText .replace(accountName, '') .replace(' followers', ''); const pinData = { name: accountName, website: website, stars: stars, follower_count: followerCount, image: row.image, }; await writeToCsv([pinData], `${row.name.replace(' ', '-')}.csv`); success = true; } catch (err) { await page.screenshot({ path: 'ERROR.png' }); console.log(`Error: ${err}, tries left: ${retries - tries}, url: ${url}`); tries++; } finally { await page.close(); } }} async function processResults(csvFile, location, concurrencyLimit, retries) { const pins = await readCsv(csvFile); const browser = await puppeteer.launch(); while (pins.length > 0) { const currentBatch = pins.splice(0, concurrencyLimit); const tasks = currentBatch.map((pin) => processPin(browser, pin, location, retries) ); try { await Promise.all(tasks); } catch (err) { console.log(`Failed to process batch: ${err}`); } } await browser.close();} async function main() { const keywords = ['grilling']; const concurrencyLimit = 4; const location = 'us'; const retries = 3; const aggregateFiles = []; for (const keyword of keywords) { console.log('Crawl starting'); await startScrape(keyword, location, retries); console.log('Crawl complete'); aggregateFiles.push(`${keyword.replace(' ', '-')}.csv`); } console.log('Starting scrape'); for (const file of aggregateFiles) { await processResults(file, location, concurrencyLimit, retries); } console.log('Scrape complete');} main();
const
variables inside of main
if you'd like to tweak your results. Try changing the following:keywords
: This list contains the keywords for which you want to scrape Pinterest search results.concurrencyLimit
: This parameter sets the number of concurrent tasks (or browser pages) that the script will process at the same time.location
: This parameter sets the geographical location from which the requests are made. It can affect the content returned by the website due to region-specific restrictions or differences.retries
: This parameter sets the maximum number of attempts the script will make to fetch data from a URL if the initial request fails. If you run into trouble, consider changing your country and maybe lowering your concurrencyLimit.
Once again, our search URL looks like this: https://www.pinterest.com/search/pins/?q=grilling&rs=typed
.https://www.pinterest.com/search/pins/?q=grilling&rs=typed
https://www.pinterest.com/search/pins/
.?
tells the server that we'd like to perform a query.&
.?q=grilling&rs=typed
.typed
is a standard query when we perform a Pinterest search on our computer.grilling
is the search we actually want to perform.https://www.pinterest.com/pin/PIN-NUMBER-GOES-HERE/
country
param which will actually route us through a server in that country.mkdir pinterest-scraper cd pinterest-scraper
npm init --y
npm install puppeteer
npm install csv-writer
npm install csv-parse
npm install fs
scrapeSearchResults()
is our parsing function.Take a look at the code below.const puppeteer = require('puppeteer');const createCsvWriter = require('csv-writer').createObjectCsvWriter;const csvParse = require('csv-parse');const fs = require('fs'); const API_KEY = JSON.parse(fs.readFileSync('config.json')).api_key; async function scrapeSearchResults( browser, keyword, pageNumber, location = 'us', retries = 3) { let tries = 0; let success = false; while (tries <= retries && !success) { const formattedKeyword = keyword.replace(' ', '+'); const page = await browser.newPage(); await page.setJavaScriptEnabled(false); try { const url = `https://www.pinterest.com/search/pins/?q=${formattedKeyword}&rs=typed`; await page.goto(url); console.log(`Successfully fetched: ${url}`); const divCards = await page.$$("div[data-grid-item='true']"); for (const divCard of divCards) { const aElement = await divCard.$('a'); const name = await page.evaluate( (element) => element.getAttribute('aria-label'), aElement ); const href = await page.evaluate( (element) => element.getAttribute('href'), aElement ); const imgElement = await divCard.$('img'); const imgLink = await page.evaluate( (element) => element.getAttribute('src'), imgElement ); const searchData = { name: name, url: `https://www.pinterest.com${href.replace('https://proxy.scrapeops.io', '')}`, image: imgLink, }; console.log(searchData); } success = true; } catch (err) { console.log(`Error: ${err}, tries left ${retries - tries}`); tries++; } finally { await page.close(); } }} async function startScrape(keyword, location, concurrencyLimit, retries) { const browser = await puppeteer.launch(); await scrapeSearchResults(browser, keyword, location, retries); await browser.close();} async function main() { const keywords = ['grilling']; const concurrencyLimit = 4; const location = 'uk'; const retries = 3; const aggregateFiles = []; for (const keyword of keywords) { console.log('Crawl starting'); await startScrape(keyword, location, retries); console.log('Crawl complete'); aggregateFiles.push(`${keyword.replace(' ', '-')}.csv`); }} main();
await page.$$("div[data-grid-item='true']")
finds all the result items on the page. On Pinterest, data-grid-item='true'
denotes an individual search result.await divCard.$("a")
pulls the link or <a>
element from the search result.await page.evaluate(element => element.getAttribute("aria-label"), aElement)
.await page.evaluate(element => element.getAttribute("href"), aElement)
await page.evaluate(element => element.getAttribute("src"), imgElement)
async function writeToCsv(data, outputFile) { if (!data || data.length === 0) { throw new Error('No data to write!'); } const fileExists = fs.existsSync(outputFile); const headers = Object.keys(data[0]).map((key) => ({ id: key, title: key })); const csvWriter = createCsvWriter({ path: outputFile, header: headers, append: fileExists, }); try { await csvWriter.writeRecords(data); } catch (e) { throw new Error('Failed to write to csv'); }}
writeToCsv()
takes an array of JSON objects and a filename. First, it checks if our outputFile
exists. If it doesn't exist, we create it. If the file does exist, we append it.This approach allows us to always write the maximum possible data to a file without overwriting existing data.In our updated code below, we adjust it to write the object to a CSV file instead of printing it to the console.const puppeteer = require('puppeteer');const createCsvWriter = require('csv-writer').createObjectCsvWriter;const csvParse = require('csv-parse');const fs = require('fs'); const API_KEY = JSON.parse(fs.readFileSync('config.json')).api_key; async function writeToCsv(data, outputFile) { if (!data || data.length === 0) { throw new Error('No data to write!'); } const fileExists = fs.existsSync(outputFile); const headers = Object.keys(data[0]).map((key) => ({ id: key, title: key })); const csvWriter = createCsvWriter({ path: outputFile, header: headers, append: fileExists, }); try { await csvWriter.writeRecords(data); } catch (e) { throw new Error('Failed to write to csv'); }} async function scrapeSearchResults( browser, keyword, pageNumber, location = 'us', retries = 3) { let tries = 0; let success = false; while (tries <= retries && !success) { const formattedKeyword = keyword.replace(' ', '+'); const page = await browser.newPage(); await page.setJavaScriptEnabled(false); try { const url = `https://www.pinterest.com/search/pins/?q=${formattedKeyword}&rs=typed`; await page.goto(url); console.log(`Successfully fetched: ${url}`); const divCards = await page.$$("div[data-grid-item='true']"); for (const divCard of divCards) { const aElement = await divCard.$('a'); const name = await page.evaluate( (element) => element.getAttribute('aria-label'), aElement ); const href = await page.evaluate( (element) => element.getAttribute('href'), aElement ); const imgElement = await divCard.$('img'); const imgLink = await page.evaluate( (element) => element.getAttribute('src'), imgElement ); const searchData = { name: name, url: `https://www.pinterest.com${href.replace('https://proxy.scrapeops.io', '')}`, image: imgLink, }; await writeToCsv([searchData], `${keyword.replace(' ', '-')}.csv`); } success = true; } catch (err) { console.log(`Error: ${err}, tries left ${retries - tries}`); tries++; } finally { await page.close(); } }} async function startScrape(keyword, location, concurrencyLimit, retries) { const browser = await puppeteer.launch(); await scrapeSearchResults(browser, keyword, location, retries); await browser.close();} async function main() { const keywords = ['grilling']; const concurrencyLimit = 4; const location = 'uk'; const retries = 3; const aggregateFiles = []; for (const keyword of keywords) { console.log('Crawl starting'); await startScrape(keyword, location, retries); console.log('Crawl complete'); aggregateFiles.push(`${keyword.replace(' ', '-')}.csv`); }} main();
getScrapeOpsUrl()
function. While it's only a small amount of code, this function converts any regular URL into a ScrapeOps proxied URL.Another important point in our case today is the wait
parameter. If you remember from our earlier examples, we actually disable JavaScript from running inside Puppeteer. wait: 2000
tells the ScrapeOps server to wait two seconds for our content to render before sending the page back to us. We're then able to read the static page without getting blocked or redirected by the JavaScript code that Pinterest tries to execute.

function getScrapeOpsUrl(url, location = 'us') {
  const params = new URLSearchParams({
    api_key: API_KEY,
    url: url,
    country: location,
    wait: 2000,
  });
  return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}
const puppeteer = require('puppeteer');const createCsvWriter = require('csv-writer').createObjectCsvWriter;const csvParse = require('csv-parse');const fs = require('fs'); const API_KEY = JSON.parse(fs.readFileSync('config.json')).api_key; async function writeToCsv(data, outputFile) { if (!data || data.length === 0) { throw new Error('No data to write!'); } const fileExists = fs.existsSync(outputFile); const headers = Object.keys(data[0]).map((key) => ({ id: key, title: key })); const csvWriter = createCsvWriter({ path: outputFile, header: headers, append: fileExists, }); try { await csvWriter.writeRecords(data); } catch (e) { throw new Error('Failed to write to csv'); }} function getScrapeOpsUrl(url, location = 'us') { const params = new URLSearchParams({ api_key: API_KEY, url: url, country: location, wait: 2000, }); return `https://proxy.scrapeops.io/v1/?${params.toString()}`;} async function scrapeSearchResults( browser, keyword, pageNumber, location = 'us', retries = 3) { let tries = 0; let success = false; while (tries <= retries && !success) { const formattedKeyword = keyword.replace(' ', '+'); const page = await browser.newPage(); await page.setJavaScriptEnabled(false); try { const url = `https://www.pinterest.com/search/pins/?q=${formattedKeyword}&rs=typed`; const proxyUrl = getScrapeOpsUrl(url, location); await page.goto(proxyUrl); console.log(`Successfully fetched: ${url}`); const divCards = await page.$$("div[data-grid-item='true']"); for (const divCard of divCards) { const aElement = await divCard.$('a'); const name = await page.evaluate( (element) => element.getAttribute('aria-label'), aElement ); const href = await page.evaluate( (element) => element.getAttribute('href'), aElement ); const imgElement = await divCard.$('img'); const imgLink = await page.evaluate( (element) => element.getAttribute('src'), imgElement ); const searchData = { name: name, url: `https://www.pinterest.com${href.replace('https://proxy.scrapeops.io', '')}`, image: imgLink, }; await writeToCsv([searchData], `${keyword.replace(' ', '-')}.csv`); } success = true; } catch (err) { console.log(`Error: ${err}, tries left ${retries - tries}`); tries++; } finally { await page.close(); } }} async function startScrape(keyword, location, concurrencyLimit, retries) { const browser = await puppeteer.launch(); await scrapeSearchResults(browser, keyword, location, retries); await browser.close();} async function main() { const keywords = ['grilling']; const concurrencyLimit = 4; const location = 'uk'; const retries = 3; const aggregateFiles = []; for (const keyword of keywords) { console.log('Crawl starting'); await startScrape(keyword, location, retries); console.log('Crawl complete'); aggregateFiles.push(`${keyword.replace(' ', '-')}.csv`); }} main();
main
below.async function main() { const keywords = ['grilling']; const concurrencyLimit = 4; const location = 'us'; const retries = 3; const aggregateFiles = []; for (const keyword of keywords) { console.log('Crawl starting'); await startScrape(keyword, location, retries); console.log('Crawl complete'); aggregateFiles.push(`${keyword.replace(' ', '-')}.csv`); }}
keywords
location
retries
concurrencyLimit
yet because we're not using it. It will come into play when we build our pin scraper. Here are our results. If you check the output, you should see content matching the country
you passed into the API.async function processPin(browser, row, location, retries = 3) { const url = row.url; let tries = 0; let success = false; while (tries <= retries && !success) { const page = await browser.newPage(); try { await page.goto(url, { timeout: 60000 }); const mainCard = await page.$("div[data-test-id='CloseupDetails']"); let website = 'n/a'; const websiteHolder = await page.$( "span[style='text-decoration: underline;']" ); if (websiteHolder) { website = await page.evaluate( (element) => element.textContent, websiteHolder ); } const starDivs = await page.$$("div[data-test-id='rating-star-full']"); const stars = starDivs.length; const profileInfoDiv = await mainCard.$( "div[data-test-id='follower-count']" ); if (profileInfoDiv === null) { throw new Error('Page failed to loaded, most likely blocked!'); } const profileText = await page.evaluate( (element) => element.textContent, profileInfoDiv ); const accountNameDiv = await profileInfoDiv.$( "div[data-test-id='creator-profile-name']" ); const nestedDiv = await accountNameDiv.$('div'); const accountName = await page.evaluate( (element) => element.getAttribute('title'), nestedDiv ); const followerCount = profileText .replace(accountName, '') .replace(' followers', ''); const pinData = { name: accountName, website: website, stars: stars, follower_count: followerCount, image: row.image, }; console.log(pinData); success = true; } catch (err) { await page.screenshot({ path: 'ERROR.png' }); console.log(`Error: ${err}, tries left: ${retries - tries}, url: ${url}`); tries++; } finally { await page.close(); } }}
Here's a breakdown of `processPin()`:

- `await page.$("div[data-test-id='CloseupDetails']")` finds the main card on the page.
- We find the `websiteHolder` with `await page.$("span[style='text-decoration: underline;']")`. If the `websiteHolder` is present, we use `await page.evaluate(element => element.textContent, websiteHolder)` to extract the `textContent` from it.
- `await mainCard.$("div[data-test-id='follower-count']")` looks for the profile section on the page. If this item isn't present, we throw an error because the page didn't load correctly.
- `await page.evaluate(element => element.getAttribute("title"), nestedDiv)` pulls the account name from our `nestedDiv`.
- We use `.replace()` to remove unneeded text and retrieve our follower count.
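To make that last step concrete, here's a quick sketch (with made-up values) of how the two `.replace()` calls peel the account name and the " followers" suffix off of the combined profile text:

```js
// Hypothetical values for illustration only -- the real text comes from page.evaluate().
const accountName = 'BBQ Pros';
const profileText = 'BBQ Pros128.4k followers';

// Strip the account name, then strip the " followers" suffix.
const followerCount = profileText.replace(accountName, '').replace(' followers', '');

console.log(followerCount); // "128.4k"
```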
Our `processPin()` function isn't very useful if it doesn't know what to scrape. We need to read the CSV file created by our crawler and then pass all the rows from the crawler into `processPin()`.

The function below takes a CSV file and reads it into an array of JSON objects.

```js
async function readCsv(inputFile) {
  const results = [];
  const parser = fs.createReadStream(inputFile).pipe(
    csvParse.parse({ columns: true, delimiter: ',', trim: true, skip_empty_lines: true })
  );

  for await (const record of parser) {
    results.push(record);
  }
  return results;
}
```
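As a quick sanity check, you could load the crawler's output and log each row before wiring it into the scraper. This is just a sketch; it assumes the crawl produced `grilling.csv` in the current folder:

```js
// Hypothetical usage of readCsv() -- not part of the scraper itself.
async function preview() {
  const rows = await readCsv('grilling.csv');
  console.log(`Loaded ${rows.length} pins`);
  for (const row of rows) {
    // Each row is an object keyed by the CSV headers: name, url, image.
    console.log(row.name, row.url);
  }
}

preview();
```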
After putting everything together, our full code up to this point looks like this.

```js
const puppeteer = require('puppeteer');
const createCsvWriter = require('csv-writer').createObjectCsvWriter;
const csvParse = require('csv-parse');
const fs = require('fs');

const API_KEY = JSON.parse(fs.readFileSync('config.json')).api_key;

async function writeToCsv(data, outputFile) {
  if (!data || data.length === 0) {
    throw new Error('No data to write!');
  }
  const fileExists = fs.existsSync(outputFile);
  const headers = Object.keys(data[0]).map((key) => ({ id: key, title: key }));

  const csvWriter = createCsvWriter({ path: outputFile, header: headers, append: fileExists });

  try {
    await csvWriter.writeRecords(data);
  } catch (e) {
    throw new Error('Failed to write to csv');
  }
}

async function readCsv(inputFile) {
  const results = [];
  const parser = fs.createReadStream(inputFile).pipe(
    csvParse.parse({ columns: true, delimiter: ',', trim: true, skip_empty_lines: true })
  );

  for await (const record of parser) {
    results.push(record);
  }
  return results;
}

function getScrapeOpsUrl(url, location = 'us') {
  const params = new URLSearchParams({
    api_key: API_KEY,
    url: url,
    country: location,
    wait: 2000,
  });
  return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}

async function scrapeSearchResults(browser, keyword, location = 'us', retries = 3) {
  let tries = 0;
  let success = false;

  while (tries <= retries && !success) {
    const formattedKeyword = keyword.replace(' ', '+');
    const page = await browser.newPage();
    await page.setJavaScriptEnabled(false);

    try {
      const url = `https://www.pinterest.com/search/pins/?q=${formattedKeyword}&rs=typed`;
      const proxyUrl = getScrapeOpsUrl(url, location);
      await page.goto(proxyUrl);
      console.log(`Successfully fetched: ${url}`);

      const divCards = await page.$$("div[data-grid-item='true']");

      for (const divCard of divCards) {
        const aElement = await divCard.$('a');
        const name = await page.evaluate((element) => element.getAttribute('aria-label'), aElement);
        const href = await page.evaluate((element) => element.getAttribute('href'), aElement);
        const imgElement = await divCard.$('img');
        const imgLink = await page.evaluate((element) => element.getAttribute('src'), imgElement);

        const searchData = {
          name: name,
          url: `https://www.pinterest.com${href.replace('https://proxy.scrapeops.io', '')}`,
          image: imgLink,
        };

        await writeToCsv([searchData], `${keyword.replace(' ', '-')}.csv`);
      }
      success = true;
    } catch (err) {
      console.log(`Error: ${err}, tries left ${retries - tries}`);
      tries++;
    } finally {
      await page.close();
    }
  }
}

async function startScrape(keyword, location, retries) {
  const browser = await puppeteer.launch();
  await scrapeSearchResults(browser, keyword, location, retries);
  await browser.close();
}

async function processPin(browser, row, location, retries = 3) {
  const url = row.url;
  let tries = 0;
  let success = false;

  while (tries <= retries && !success) {
    const page = await browser.newPage();
    try {
      await page.goto(url, { timeout: 60000 });

      const mainCard = await page.$("div[data-test-id='CloseupDetails']");

      let website = 'n/a';
      const websiteHolder = await page.$("span[style='text-decoration: underline;']");
      if (websiteHolder) {
        website = await page.evaluate((element) => element.textContent, websiteHolder);
      }

      const starDivs = await page.$$("div[data-test-id='rating-star-full']");
      const stars = starDivs.length;

      const profileInfoDiv = await mainCard.$("div[data-test-id='follower-count']");
      if (profileInfoDiv === null) {
        throw new Error('Page failed to load, most likely blocked!');
      }
      const profileText = await page.evaluate((element) => element.textContent, profileInfoDiv);
      const accountNameDiv = await profileInfoDiv.$("div[data-test-id='creator-profile-name']");
      const nestedDiv = await accountNameDiv.$('div');
      const accountName = await page.evaluate((element) => element.getAttribute('title'), nestedDiv);
      const followerCount = profileText.replace(accountName, '').replace(' followers', '');

      const pinData = {
        name: accountName,
        website: website,
        stars: stars,
        follower_count: followerCount,
        image: row.image,
      };
      console.log(pinData);
      success = true;
    } catch (err) {
      await page.screenshot({ path: 'ERROR.png' });
      console.log(`Error: ${err}, tries left: ${retries - tries}, url: ${url}`);
      tries++;
    } finally {
      await page.close();
    }
  }
}

async function processResults(csvFile, location, concurrencyLimit, retries) {
  const pins = await readCsv(csvFile);
  const browser = await puppeteer.launch();

  for (const pin of pins) {
    await processPin(browser, pin, location, retries);
  }

  await browser.close();
}

async function main() {
  const keywords = ['grilling'];
  const concurrencyLimit = 4;
  const location = 'uk';
  const retries = 3;
  const aggregateFiles = [];

  for (const keyword of keywords) {
    console.log('Crawl starting');
    await startScrape(keyword, location, retries);
    console.log('Crawl complete');
    aggregateFiles.push(`${keyword.replace(' ', '-')}.csv`);
  }

  console.log('Starting scrape');
  for (const file of aggregateFiles) {
    await processResults(file, location, concurrencyLimit, retries);
  }
  console.log('Scrape complete');
}

main();
```
We already have our `writeToCsv()` function from earlier; we just need to put it in the right place. Instead of logging each pin item to the console, we're going to do this:

```js
await writeToCsv([pinData], `${row.name.replace(' ', '-')}.csv`);
```
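Because `writeToCsv()` sets `append` based on whether the file already exists, calling it repeatedly with single-item arrays keeps adding rows to the same report and only writes the header once. Here's a small sketch of that behavior with made-up data:

```js
// Hypothetical demo of the append behavior -- not part of the scraper itself.
async function demoAppend() {
  const file = 'demo-report.csv';
  // First call: the file doesn't exist yet, so the header row gets written.
  await writeToCsv([{ name: 'first pin', stars: 5 }], file);
  // Second call: the file exists, so this row is appended without a second header.
  await writeToCsv([{ name: 'second pin', stars: 4 }], file);
}

demoAppend();
```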
With that one change in place, here is our full code with data storage added to the scraper.

```js
const puppeteer = require('puppeteer');
const createCsvWriter = require('csv-writer').createObjectCsvWriter;
const csvParse = require('csv-parse');
const fs = require('fs');

const API_KEY = JSON.parse(fs.readFileSync('config.json')).api_key;

async function writeToCsv(data, outputFile) {
  if (!data || data.length === 0) {
    throw new Error('No data to write!');
  }
  const fileExists = fs.existsSync(outputFile);
  const headers = Object.keys(data[0]).map((key) => ({ id: key, title: key }));

  const csvWriter = createCsvWriter({ path: outputFile, header: headers, append: fileExists });

  try {
    await csvWriter.writeRecords(data);
  } catch (e) {
    throw new Error('Failed to write to csv');
  }
}

async function readCsv(inputFile) {
  const results = [];
  const parser = fs.createReadStream(inputFile).pipe(
    csvParse.parse({ columns: true, delimiter: ',', trim: true, skip_empty_lines: true })
  );

  for await (const record of parser) {
    results.push(record);
  }
  return results;
}

function getScrapeOpsUrl(url, location = 'us') {
  const params = new URLSearchParams({
    api_key: API_KEY,
    url: url,
    country: location,
    wait: 2000,
  });
  return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}

async function scrapeSearchResults(browser, keyword, location = 'us', retries = 3) {
  let tries = 0;
  let success = false;

  while (tries <= retries && !success) {
    const formattedKeyword = keyword.replace(' ', '+');
    const page = await browser.newPage();
    await page.setJavaScriptEnabled(false);

    try {
      const url = `https://www.pinterest.com/search/pins/?q=${formattedKeyword}&rs=typed`;
      const proxyUrl = getScrapeOpsUrl(url, location);
      await page.goto(proxyUrl);
      console.log(`Successfully fetched: ${url}`);

      const divCards = await page.$$("div[data-grid-item='true']");

      for (const divCard of divCards) {
        const aElement = await divCard.$('a');
        const name = await page.evaluate((element) => element.getAttribute('aria-label'), aElement);
        const href = await page.evaluate((element) => element.getAttribute('href'), aElement);
        const imgElement = await divCard.$('img');
        const imgLink = await page.evaluate((element) => element.getAttribute('src'), imgElement);

        const searchData = {
          name: name,
          url: `https://www.pinterest.com${href.replace('https://proxy.scrapeops.io', '')}`,
          image: imgLink,
        };

        await writeToCsv([searchData], `${keyword.replace(' ', '-')}.csv`);
      }
      success = true;
    } catch (err) {
      console.log(`Error: ${err}, tries left ${retries - tries}`);
      tries++;
    } finally {
      await page.close();
    }
  }
}

async function startScrape(keyword, location, retries) {
  const browser = await puppeteer.launch();
  await scrapeSearchResults(browser, keyword, location, retries);
  await browser.close();
}

async function processPin(browser, row, location, retries = 3) {
  const url = row.url;
  let tries = 0;
  let success = false;

  while (tries <= retries && !success) {
    const page = await browser.newPage();
    try {
      await page.goto(url, { timeout: 60000 });

      const mainCard = await page.$("div[data-test-id='CloseupDetails']");

      let website = 'n/a';
      const websiteHolder = await page.$("span[style='text-decoration: underline;']");
      if (websiteHolder) {
        website = await page.evaluate((element) => element.textContent, websiteHolder);
      }

      const starDivs = await page.$$("div[data-test-id='rating-star-full']");
      const stars = starDivs.length;

      const profileInfoDiv = await mainCard.$("div[data-test-id='follower-count']");
      if (profileInfoDiv === null) {
        throw new Error('Page failed to load, most likely blocked!');
      }
      const profileText = await page.evaluate((element) => element.textContent, profileInfoDiv);
      const accountNameDiv = await profileInfoDiv.$("div[data-test-id='creator-profile-name']");
      const nestedDiv = await accountNameDiv.$('div');
      const accountName = await page.evaluate((element) => element.getAttribute('title'), nestedDiv);
      const followerCount = profileText.replace(accountName, '').replace(' followers', '');

      const pinData = {
        name: accountName,
        website: website,
        stars: stars,
        follower_count: followerCount,
        image: row.image,
      };
      await writeToCsv([pinData], `${row.name.replace(' ', '-')}.csv`);
      success = true;
    } catch (err) {
      await page.screenshot({ path: 'ERROR.png' });
      console.log(`Error: ${err}, tries left: ${retries - tries}, url: ${url}`);
      tries++;
    } finally {
      await page.close();
    }
  }
}

async function processResults(csvFile, location, concurrencyLimit, retries) {
  const pins = await readCsv(csvFile);
  const browser = await puppeteer.launch();

  for (const pin of pins) {
    await processPin(browser, pin, location, retries);
  }

  await browser.close();
}

async function main() {
  const keywords = ['grilling'];
  const concurrencyLimit = 4;
  const location = 'uk';
  const retries = 3;
  const aggregateFiles = [];

  for (const keyword of keywords) {
    console.log('Crawl starting');
    await startScrape(keyword, location, retries);
    console.log('Crawl complete');
    aggregateFiles.push(`${keyword.replace(' ', '-')}.csv`);
  }

  console.log('Starting scrape');
  for (const file of aggregateFiles) {
    await processResults(file, location, concurrencyLimit, retries);
  }
  console.log('Scrape complete');
}

main();
```
To scrape multiple pins at once, we need to change our `processResults()` function to look like this.

```js
async function processResults(csvFile, location, concurrencyLimit, retries) {
  const pins = await readCsv(csvFile);
  const browser = await puppeteer.launch();

  while (pins.length > 0) {
    const currentBatch = pins.splice(0, concurrencyLimit);
    const tasks = currentBatch.map((pin) => processPin(browser, pin, location, retries));

    try {
      await Promise.all(tasks);
    } catch (err) {
      console.log(`Failed to process batch: ${err}`);
    }
  }

  await browser.close();
}
```
This time, we use a `while` loop. While `pins` is longer than 0, we splice from index 0 up to our `concurrencyLimit`. This shortens the array (therefore reducing its size in memory) and also runs `processPin()` on each row we spliced from the array. Once `await Promise.all(tasks)` resolves, we repeat the process, constantly shrinking the array and improving performance as time goes on.
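If the splice-and-batch pattern is new to you, here's a stripped-down sketch of the same idea with a dummy task; `sleep()` and the `items` array are made up purely for this demo:

```js
// Minimal illustration of batched concurrency with splice() and Promise.all().
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function runBatches(items, concurrencyLimit) {
  while (items.length > 0) {
    // Remove the next batch from the front of the array.
    const currentBatch = items.splice(0, concurrencyLimit);
    // Start every task in the batch, then wait for all of them to finish.
    await Promise.all(
      currentBatch.map(async (item) => {
        await sleep(100);
        console.log(`Processed ${item}`);
      })
    );
  }
}

runBatches([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 4);
```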
Finally, we need to route `processPin()` through the proxy. We need to replace `page.goto(url)` with the following line.

```js
await page.goto(getScrapeOpsUrl(url, location), { timeout: 60000 });
```

For extra redundancy, in `getScrapeOpsUrl()`, we'll be setting `residential` to true. Adding the `residential` argument reduces the likelihood that Pinterest will block the proxy. During extensive testing, the Pinterest server was able to detect and block the scraper a good portion of the time when not using `residential`.

Here is our updated proxy function.

```js
function getScrapeOpsUrl(url, location = 'us') {
  const params = new URLSearchParams({
    api_key: API_KEY,
    url: url,
    country: location,
    wait: 3000,
    residential: true,
  });
  return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}
```
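To see what the scraper actually requests, you can log the proxied URL for a pin. The pin URL below is made up purely for illustration:

```js
// Hypothetical pin URL -- for illustration only.
const example = getScrapeOpsUrl('https://www.pinterest.com/pin/1234567890/', 'us');
console.log(example);
// Logs something like:
// https://proxy.scrapeops.io/v1/?api_key=...&url=https%3A%2F%2Fwww.pinterest.com%2Fpin%2F1234567890%2F&country=us&wait=3000&residential=true
```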
After all of these changes, here is our final, production-ready code.

```js
const puppeteer = require('puppeteer');
const createCsvWriter = require('csv-writer').createObjectCsvWriter;
const csvParse = require('csv-parse');
const fs = require('fs');

const API_KEY = JSON.parse(fs.readFileSync('config.json')).api_key;

async function writeToCsv(data, outputFile) {
  if (!data || data.length === 0) {
    throw new Error('No data to write!');
  }
  const fileExists = fs.existsSync(outputFile);
  const headers = Object.keys(data[0]).map((key) => ({ id: key, title: key }));

  const csvWriter = createCsvWriter({ path: outputFile, header: headers, append: fileExists });

  try {
    await csvWriter.writeRecords(data);
  } catch (e) {
    throw new Error('Failed to write to csv');
  }
}

async function readCsv(inputFile) {
  const results = [];
  const parser = fs.createReadStream(inputFile).pipe(
    csvParse.parse({ columns: true, delimiter: ',', trim: true, skip_empty_lines: true })
  );

  for await (const record of parser) {
    results.push(record);
  }
  return results;
}

function getScrapeOpsUrl(url, location = 'us') {
  const params = new URLSearchParams({
    api_key: API_KEY,
    url: url,
    country: location,
    residential: true,
    wait: 3000,
  });
  return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}

async function scrapeSearchResults(browser, keyword, location = 'us', retries = 3) {
  let tries = 0;
  let success = false;

  while (tries <= retries && !success) {
    const formattedKeyword = keyword.replace(' ', '+');
    const page = await browser.newPage();
    await page.setJavaScriptEnabled(false);

    try {
      const url = `https://www.pinterest.com/search/pins/?q=${formattedKeyword}&rs=typed`;
      const proxyUrl = getScrapeOpsUrl(url, location);
      await page.goto(proxyUrl);
      console.log(`Successfully fetched: ${url}`);

      const divCards = await page.$$("div[data-grid-item='true']");

      for (const divCard of divCards) {
        const aElement = await divCard.$('a');
        const name = await page.evaluate((element) => element.getAttribute('aria-label'), aElement);
        const href = await page.evaluate((element) => element.getAttribute('href'), aElement);
        const imgElement = await divCard.$('img');
        const imgLink = await page.evaluate((element) => element.getAttribute('src'), imgElement);

        const searchData = {
          name: name,
          url: `https://www.pinterest.com${href.replace('https://proxy.scrapeops.io', '')}`,
          image: imgLink,
        };

        await writeToCsv([searchData], `${keyword.replace(' ', '-')}.csv`);
      }
      success = true;
    } catch (err) {
      console.log(`Error: ${err}, tries left ${retries - tries}`);
      tries++;
    } finally {
      await page.close();
    }
  }
}

async function startScrape(keyword, location, retries) {
  const browser = await puppeteer.launch();
  await scrapeSearchResults(browser, keyword, location, retries);
  await browser.close();
}

async function processPin(browser, row, location, retries = 3) {
  const url = row.url;
  let tries = 0;
  let success = false;

  while (tries <= retries && !success) {
    const page = await browser.newPage();
    try {
      await page.goto(getScrapeOpsUrl(url, location), { timeout: 60000 });

      const mainCard = await page.$("div[data-test-id='CloseupDetails']");

      let website = 'n/a';
      const websiteHolder = await page.$("span[style='text-decoration: underline;']");
      if (websiteHolder) {
        website = await page.evaluate((element) => element.textContent, websiteHolder);
      }

      const starDivs = await page.$$("div[data-test-id='rating-star-full']");
      const stars = starDivs.length;

      const profileInfoDiv = await mainCard.$("div[data-test-id='follower-count']");
      if (profileInfoDiv === null) {
        throw new Error('Page failed to load, most likely blocked!');
      }
      const profileText = await page.evaluate((element) => element.textContent, profileInfoDiv);
      const accountNameDiv = await profileInfoDiv.$("div[data-test-id='creator-profile-name']");
      const nestedDiv = await accountNameDiv.$('div');
      const accountName = await page.evaluate((element) => element.getAttribute('title'), nestedDiv);
      const followerCount = profileText.replace(accountName, '').replace(' followers', '');

      const pinData = {
        name: accountName,
        website: website,
        stars: stars,
        follower_count: followerCount,
        image: row.image,
      };
      await writeToCsv([pinData], `${row.name.replace(' ', '-')}.csv`);
      success = true;
    } catch (err) {
      await page.screenshot({ path: 'ERROR.png' });
      console.log(`Error: ${err}, tries left: ${retries - tries}, url: ${url}`);
      tries++;
    } finally {
      await page.close();
    }
  }
}

async function processResults(csvFile, location, concurrencyLimit, retries) {
  const pins = await readCsv(csvFile);
  const browser = await puppeteer.launch();

  while (pins.length > 0) {
    const currentBatch = pins.splice(0, concurrencyLimit);
    const tasks = currentBatch.map((pin) => processPin(browser, pin, location, retries));

    try {
      await Promise.all(tasks);
    } catch (err) {
      console.log(`Failed to process batch: ${err}`);
    }
  }

  await browser.close();
}

async function main() {
  const keywords = ['grilling'];
  const concurrencyLimit = 4;
  const location = 'uk';
  const retries = 3;
  const aggregateFiles = [];

  for (const keyword of keywords) {
    console.log('Crawl starting');
    await startScrape(keyword, location, retries);
    console.log('Crawl complete');
    aggregateFiles.push(`${keyword.replace(' ', '-')}.csv`);
  }

  console.log('Starting scrape');
  for (const file of aggregateFiles) {
    await processResults(file, location, concurrencyLimit, retries);
  }
  console.log('Scrape complete');
}

main();
```
As before, you can tweak your results by changing the constants in the `main` function that we'll be running.

```js
async function main() {
  const keywords = ['grilling'];
  const concurrencyLimit = 4;
  const location = 'us';
  const retries = 3;
  const aggregateFiles = [];

  for (const keyword of keywords) {
    console.log('Crawl starting');
    await startScrape(keyword, location, retries);
    console.log('Crawl complete');
    aggregateFiles.push(`${keyword.replace(' ', '-')}.csv`);
  }

  console.log('Starting scrape');
  for (const file of aggregateFiles) {
    await processResults(file, location, concurrencyLimit, retries);
  }
  console.log('Scrape complete');
}
```
When you scrape Pinterest (or any other site), you are subject to its Terms of Service and `robots.txt`. Pinterest's terms are available here. If you violate these terms, you can even lose your Pinterest account! Their `robots.txt` is available here.

Also, keep in mind whether the data you're scraping is public. Private data (data behind a login) can often be illegal to scrape. Generally, public data (data not behind a login) is public information and therefore fair game when scraping.

If you are unsure of the legality of your scraper, it is best to consult an attorney based in your jurisdiction.