Then check out ScrapeOps, the complete toolkit for web scraping.

Each of our results is nested inside a set of <div> elements. We pull the title from the <h3> tag and the href
that links to a websiteimport requestsfrom bs4 import BeautifulSoupfrom urllib.parse import urlparse, parse_qs, urlencodeimport csvimport concurrentfrom concurrent.futures import ThreadPoolExecutorimport osimport loggingimport timefrom dataclasses import dataclass, field, fields, asdictheaders = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.3'}proxy_url = "https://proxy.scrapeops.io/v1/"API_KEY = "YOUR-SUPER-SECRET-API-KEY" logging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str base_url: str link: str page: int result_number: int def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): if isinstance(getattr(self, field.name), str): if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_open = False def save_to_csv(self): self.csv_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) with open(self.csv_filename, mode="a", encoding="UTF-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate Item Found: {input_data.name}. Item dropped") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def get_scrapeops_url(url): payload = {'api_key': API_KEY, 'url': url, 'country': 'us'} proxy_url = 'https://proxy.scrapeops.io/v1/?' 
+ urlencode(payload) return proxy_url def search_page(query, page, location="United States", headers=headers, pipeline=None, num=100, retries=3): url = f"https://www.google.com/search?q={query}&start={page * num}&num={num}" payload = { "api_key": API_KEY, "url": url, } tries = 0 success = False while tries <= retries and not success: try: response = requests.get(get_scrapeops_url(url)) soup = BeautifulSoup(response.text, 'html.parser') divs = soup.find_all("div") index = 0 last_link = "" for div in divs: h3s = div.find_all("h3") if len(h3s) > 0: link = div.find("a", href=True) parsed_url = urlparse(link["href"]) base_url = f"{parsed_url.scheme}://{parsed_url.netloc}" site_info = {'title': h3s[0].text, "base_url": base_url, 'link': link["href"], "page": page, "result_number": index} search_data = SearchData( name = site_info["title"], base_url = site_info["base_url"], link = site_info["link"], page = site_info["page"], result_number = site_info["result_number"] ) if site_info["link"] != last_link: index += 1 last_link = site_info["link"] if pipeline: pipeline.add_data(search_data) success = True except: print(f"Failed to scrape page {page}") print(f"Retries left: {retries-tries}") tries += 1 if not success: print(f"Failed to scrape page {page}, no retries left") raise Exception(f"Max retries exceeded: {retries}") else: print(f"Scraped page {page} with {retries-tries} retries left") def full_search(query, pages=3, location="us", MAX_THREADS=5, MAX_RETRIES=3, num=10): with ThreadPoolExecutor(max_workers=MAX_THREADS) as executor: pipeline = DataPipeline(csv_filename=f"{query.replace(' ', '-')}.csv") tasks = [executor.submit(search_page, query, page, location, None, pipeline, num, MAX_RETRIES) for page in range(pages)] for future in tasks: future.result() pipeline.close_pipeline() if __name__ == "__main__": MAX_THREADS = 5 MAX_RETRIES = 5 queries = ["cool stuff"] logger.info("Starting full search...") for query in queries: full_search(query, pages=3, num=10) logger.info("Search complete.")
If you'd like more search_results, you can change the pages argument in the following line: full_search(query, pages=20). You can even try full_search(query, pages=100).

To search for different things, change the QUERIES array, for example: ["cool stuff", "boring stuff"].

Other things you can tweak are:

- location
- MAX_THREADS
- MAX_RETRIES
- num

Save the script as yourscript.py. Obviously you can name it whatever you want. Once you have your script and dependencies installed, run the following command:

python yourscript.py
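For instance, assuming the full script above, a tweaked main block might look like the sketch below (the query list, page count, and batch size are just example values):

if __name__ == "__main__":
    MAX_THREADS = 5
    MAX_RETRIES = 5
    # example values -- tweak these to suit your own search
    queries = ["cool stuff", "boring stuff"]

    logger.info("Starting full search...")
    for query in queries:
        # 20 pages of 10 results each, per query
        full_search(query, pages=20, num=10)
    logger.info("Search complete.")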
https://www.google.com/search?q=cool+stuff

Pay attention to the end of the address: ?q=cool+stuff. In the address bar, a question mark, ?, denotes a query (in this case we're querying q), and the value of the query follows the equals operator, =.

So ?q=cool+stuff means that our search query is for cool stuff. If we wanted to search for boring stuff, we could instead use ?q=boring+stuff.

In the days of old, at the bottom of the page we would see a list of page numbers, which made search results incredibly easy to scrape. While Google doesn't exactly give us page numbers anymore, it does give us a start query that we can use in order to paginate our results. We get our results in batches of 10. With variables figured in, our url will look like this:

https://www.google.com/search?q={query}&start={page * 10}
Google also gives us a num query that we can use to control the number of results that we get. Taking num into account, our url would look more like this:

https://www.google.com/search?q={query}&start={page * num}&num={num}

We can use num to request up to 100 results, but Google's response doesn't always give us these results when we request them. Multiple times throughout the writing of this article, I've used num=100 and been blocked or gotten smaller results. Other times I have gotten proper results.
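As a quick illustration, here is a small standalone sketch that builds these search urls for a few pages (build_search_url is just a hypothetical helper, not part of the scraper below):

from urllib.parse import quote_plus

def build_search_url(query, page, num=10):
    # q gets url-encoded; start tells Google which result to begin at
    return f"https://www.google.com/search?q={quote_plus(query)}&start={page * num}&num={num}"

for page in range(3):
    print(build_search_url("cool stuff", page))
# https://www.google.com/search?q=cool+stuff&start=0&num=10
# https://www.google.com/search?q=cool+stuff&start=10&num=10
# https://www.google.com/search?q=cool+stuff&start=20&num=10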
Each result holds its title in an <h3> tag. To find our results, we can simply use BeautifulSoup's .find_all() method. Some websites like to nest a bunch of different things inside of an element, and Google is no exception.

Here is the full HTML of our first result: <h3 class="LC20lb MBeuO DKV0Md">Cool Stuff</h3>. As you can see, the class name is a bunch of jumbled garbage and there is no link within the tag! This is because Google (like many other sites) nests all of our important information within a <div>.

If the class name of each result were more legible and not subject to change, I would recommend using it to parse the results. Since the class name is likely to change over time, we're simply going to get all of the <div> elements and find the <h3> elements nested inside of them. We'll use soup.find_all(), and we'll use a last_link variable. For each result we get, we'll compare its link to the last link. If the current link is the same as the last link, we'll ignore this element and move on to the next one (there's a short illustration of this idea after the url examples below).

So far, our url looks like this:

https://www.google.com/search?q={query}&start={page * 10}

To control where our results come from, we can add a geo_location parameter to our request. At the moment, our full request looks like this:

https://www.google.com/search?q={query}&start={page * 10}

With the location added, it becomes:

https://www.google.com/search?q={query}&start={page * 10}&geo_location={location}
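Here is a minimal sketch of the nested-div idea on a made-up HTML snippet (the markup is heavily simplified; real Google results are nested far more deeply):

from bs4 import BeautifulSoup

# simplified, made-up markup -- the same result shows up inside several nested divs
html = """
<div><div><h3>Cool Stuff</h3><a href="https://www.example.com/cool">link</a></div></div>
<div><h3>More Cool Stuff</h3><a href="https://www.example.com/more">link</a></div>
"""

soup = BeautifulSoup(html, "html.parser")
last_link = ""
for div in soup.find_all("div"):
    h3 = div.find("h3")
    link = div.find("a", href=True)
    if not h3 or not link:
        continue
    # skip repeats caused by nested divs wrapping the same result
    if link["href"] == last_link:
        continue
    last_link = link["href"]
    print(h3.text, link["href"])
# Cool Stuff https://www.example.com/cool
# More Cool Stuff https://www.example.com/more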
Let's get started by creating a new project folder called google-search-requests. You can create a new folder through your file explorer or enter the following command:

mkdir google-search-requests

This tutorial also uses Python3.10-venv. First, we'll create a new virtual environment:

Linux/Mac

python3 -m venv google-search

Windows

python -m venv google-search

Next, activate the environment:

Linux/Mac

source google-search/bin/activate

Windows

.\google-search\Scripts\Activate.ps1

Finally, install our dependencies, requests and beautifulsoup4:

pip install requests beautifulsoup4
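Before moving on, you can quickly confirm that both packages installed correctly (this snippet just prints the versions and isn't part of the scraper):

import requests
import bs4

print("requests:", requests.__version__)
print("beautifulsoup4:", bs4.__version__)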
With everything installed, here is our first iteration, a simple script that scrapes a single page of results:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, parse_qs, urlencode

#search a single page
def google_search(query, retries=3):
    tries = 0
    #runtime loop for the scrape
    while tries <= retries:
        try:
            url = f"https://www.google.com/search?q={query}"
            response = requests.get(url)
            results = []
            last_link = ""
            soup = BeautifulSoup(response.text, 'html.parser')
            index = 0
            for result in soup.find_all('div'):
                title = result.find('h3')
                if title:
                    title = title.text
                else:
                    continue
                base_url = ""
                link = result.find('a', href=True)
                if link:
                    link = link['href']
                    parsed_url = urlparse(link)
                    base_url = f"{parsed_url.scheme}://{parsed_url.netloc}"
                else:
                    continue
                #this is the full site info we wish to extract
                site_info = {'title': title, "base_url": base_url, 'link': link, "result_number": index}
                #only save the result if it isn't a repeat of the last link
                if last_link != site_info["link"]:
                    results.append(site_info)
                    index += 1
                    last_link = site_info["link"]
            #return our list of results
            print(f"Finished scrape with {tries} retries")
            return results
        except:
            print("Failed to scrape the page")
            print("Retries left:", retries-tries)
            tries += 1
    #if this line executes, the scrape has failed
    raise Exception(f"Max retries exceeded: {retries}")

if __name__ == "__main__":
    MAX_RETRIES = 5
    QUERIES = ["cool stuff"]
    for query in QUERIES:
        results = google_search(query, retries=MAX_RETRIES)
        for result in results:
            print(result)
In the code above:

- We create a google_search() function that takes our query as a parameter
- BeautifulSoup(response.text, 'html.parser') creates a BeautifulSoup instance to parse through the HTML
- soup.find_all("div") finds all the <div> objects
- result.find("h3") is used to find the header element of each result
- link = result.find('a', href=True) extracts the link from the result
- urlparse(link) parses our link
- base_url = f"{parsed_url.scheme}://{parsed_url.netloc}" reconstructs the base_url so we can save it
- We build a dict, site_info, from the data we've extracted
- If the link from site_info is different than last_link, we add our result to the results list
- Finally, we return the results list
Here is the script updated to scrape multiple pages:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, parse_qs

def google_search(query, pages=3, location="United States", retries=3):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.3'}
    results = []
    last_link = ""
    for page in range(0, pages):
        tries = 0
        success = False
        while tries <= retries and not success:
            try:
                url = f"https://www.google.com/search?q={query}&start={page * 10}&geo_location={location}"
                response = requests.get(url, headers=headers)
                soup = BeautifulSoup(response.text, 'html.parser')
                index = 0
                for result in soup.find_all('div'):
                    title = result.find('h3')
                    if title:
                        title = title.text
                    else:
                        continue
                    base_url = ""
                    #pull the raw link from the result
                    link = result.find('a', href=True)
                    if link:
                        link = link['href']
                        parsed_url = urlparse(link)
                        base_url = f"{parsed_url.scheme}://{parsed_url.netloc}"
                    else:
                        continue
                    #this is the full site info we wish to extract
                    site_info = {'title': title, "base_url": base_url, 'link': link, "page": page, "result_number": index}
                    #if the link is different from the last link
                    if last_link != site_info["link"]:
                        results.append(site_info)
                        index += 1
                        last_link = link
                print(f"Scraped page {page} with {retries-tries} retries left")
                success = True
            except:
                print(f"Failed to scrape page {page}")
                print(f"Retries left: {retries-tries}")
                tries += 1
        if not success:
            raise Exception(f"Max retries exceeded: {retries}")
    return results

if __name__ == "__main__":
    MAX_RETRIES = 5
    QUERIES = ["cool stuff"]
    for query in QUERIES:
        results = google_search(query, retries=MAX_RETRIES)
        for result in results:
            print(result)
The main change here is the addition of &start={page * 10} to our url. This is the basis for how we try to batch our results. We also add in functionality for our geo_location, but by the time we add our proxy, this functionality is actually going to be moved elsewhere in our code.

Each result comes back as a dict with key-value pairs. Here is the first result so you can see how the data is laid out:

{'title': 'Cool Stuff', 'base_url': '://', 'link': '/search?sca_esv=3d5aec0ebbda9031&q=cool+stuff&uds=AMwkrPusHYa-Y5lqXPwpg8jJI99FKYz2zi9dec3bfM0lH-hil3eHKWSsmwBdtnNX2uzO7rvzH_UOAG-8W6q5RMgyj5EtPQRweAkj97b7yv-dxhFjVNmTpUmjIG8LX5BTVMn1i8RvhFDaroRDPKXSl9mGzRdmu5ujMGh35B6t9hZQe5OWf6qF9qyxdHJPailq0Was2Ti5R1Efg6G0TWkZl8Q0a4QgLEUcLEh8uM-Gr_AIA73YM8e13Y_Y5x_btmkZoDODrensXIErfUplY9wGJ9in8N6PV9WQjCg77wu2IOm5pmE8706LnWQ&udm=2&prmd=isvnmbtz&sa=X&ved=2ahUKEwi459DNvrOFAxXzh1YBHfFMDlsQtKgLegQIEhAB', 'page': 0, 'result_number': 0}

Each result holds a title, base_url, link, page, and result_number. Because we have uniform data stored in key-value pairs, we already have the makings of a DataFrame and therefore a CSV.
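Since the results are just a list of uniform dicts, you could, for instance, drop them straight into a pandas DataFrame (pandas isn't used anywhere else in this article; the sample rows below are made up):

import pandas as pd

# a couple of made-up results in the same shape as ours
results = [
    {"title": "Cool Stuff", "base_url": "https://www.example.com", "link": "https://www.example.com/cool", "page": 0, "result_number": 0},
    {"title": "More Cool Stuff", "base_url": "https://www.example.com", "link": "https://www.example.com/more", "page": 0, "result_number": 1},
]

df = pd.DataFrame(results)
df.to_csv("cool-stuff.csv", index=False)

In this tutorial, though, we'll stick with Python's built-in csv module.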
To our imports, add the following line:

import csv

Here is our updated script:
import requestsfrom bs4 import BeautifulSoupfrom urllib.parse import urlparse, parse_qsimport csvfrom os import path def write_page_to_csv(filename, object_array): path_to_csv = filename file_exists = path.exists(filename) with open(path_to_csv, mode="a", newline="", encoding="UTF-8") as file: #name the headers after our object keys writer = csv.DictWriter(file, fieldnames=object_array[0].keys()) if not file_exists: writer.writeheader() writer.writerows(object_array) def google_search(query, pages=3, location="United States", retries=3): headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.3'} results = [] last_link = "" for page in range(0, pages): tries = 0 success = False while tries <= retries and not success: try: url = f"https://www.google.com/search?q={query}&start={page * 10}" response = requests.get(url, headers=headers) print(f"Response Code: {response.status_code}") soup = BeautifulSoup(response.text, 'html.parser') index = 0 for result in soup.find_all('div'): title = result.find('h3') if title: title = title.text else: continue base_url = "" #pull the raw link from the result link = result.find('a', href=True) if link: link = link['href'] parsed_url = urlparse(link) base_url = f"{parsed_url.scheme}://{parsed_url.netloc}" else: continue #this is the full site info we wish to extract site_info = {'title': title, "base_url": base_url, 'link': link, "page": page, "result_number": index} #if the link is different from the last link if last_link != site_info["link"]: results.append(site_info) index += 1 last_link = link print(f"Scraped page {page} with {retries} retries left") write_page_to_csv(f"{query}.csv", results) success = True except: print(f"Failed to scrape page {page}") print(f"Retries left: {retries-tries}") tries += 1 if not success: raise Exception(f"Max retries exceeded: {retries}") if __name__ == "__main__": MAX_RETRIES = 5 QUERIES = ["cool stuff"] for query in QUERIES: google_search("cool stuff", retries=MAX_RETRIES)
The biggest difference in this version is our new write_page_to_csv()
function:

def write_page_to_csv(filename, object_array):
    path_to_csv = filename
    file_exists = path.exists(filename)
    with open(path_to_csv, mode="a", newline="", encoding="UTF-8") as file:
        #name the headers after our object keys
        writer = csv.DictWriter(file, fieldnames=object_array[0].keys())
        if not file_exists:
            writer.writeheader()
        writer.writerows(object_array)
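As a quick usage sketch (assuming the function above is already defined), calling it with a couple of made-up results looks like this; it writes headers on the first call and appends rows on every call after that:

sample_results = [
    {"title": "Boring Stuff", "base_url": "https://www.example.com", "link": "https://www.example.com/boring", "page": 0, "result_number": 0},
    {"title": "More Boring Stuff", "base_url": "https://www.example.com", "link": "https://www.example.com/more-boring", "page": 0, "result_number": 1},
]
write_page_to_csv("demo.csv", sample_results)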
write_page_to_csv() takes an object_array (in this case our page results) and writes it to our filename. If the file doesn't exist yet, we create it and write the headers. If it already exists, we simply append to it. We open the file in append mode so we don't overwrite any important data that we've scraped previously.

To speed things up, we're going to break our google_search()
function into two separate functions, search_page() and full_search(). search_page() will search a single page, and full_search() will create multiple threads that call search_page() concurrently.

Add the following import statement:

from concurrent.futures import ThreadPoolExecutor

Next, we'll convert our google_search() function into our search_page()
function.def search_page(query, page, location="United States", retries=3, num=100): headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.3'} results = [] last_link = "" tries = 0 success = False while tries <= retries and not success: try: url = f"https://www.google.com/search?q={query}&start={page * num}&num={num}" response = requests.get(url, headers=headers) if response.status_code != 200: print("Failed server response", response.status_code) raise Exception("Failed server response!") print(f"Response Code: {response.status_code}") soup = BeautifulSoup(response.text, 'html.parser') index = 0 for result in soup.find_all('div'): title = result.find('h3') if title: title = title.text else: continue base_url = "" #pull the raw link from the result link = result.find('a', href=True) if link: link = link['href'] parsed_url = urlparse(link) base_url = f"{parsed_url.scheme}://{parsed_url.netloc}" else: continue #this is the full site info we wish to extract site_info = {'title': title, "base_url": base_url, 'link': link, "page": page, "result_number": index} #if the link is different from the last link if last_link != site_info["link"]: results.append(site_info) index += 1 last_link = link write_page_to_csv(f"{query}.csv", results) success = True except: print(f"Failed to scrape page {page}") print(f"Retries left: {retries-tries}") tries += 1 if not success: print(f"Failed to scrape page {page}, no retries left") raise Exception(f"Max retries exceeded: {retries}") else: print(f"Scraped page {page} with {retries} retries left")
In the new version:

- We remove the pages argument and replace it with page
- Instead of running a for loop and iterating through pages, we simply execute our parsing logic on the single page we're searching

Next, here is our full_search() function:

def full_search(query, pages=3, location="United States", MAX_THREADS=5, MAX_RETRIES=4, num=100):
    page_numbers = list(range(pages))
    full_results = []
    with ThreadPoolExecutor(max_workers=MAX_THREADS) as executor:
        executor.map(search_page, [query]*pages, page_numbers, [location]*pages, [MAX_RETRIES]*pages, [num]*pages)
The only required argument is the query. Everything else is a kwarg used to tweak our settings.

The strangest-looking part is probably executor.map(). As bizarre as it looks, it's actually pretty simple: it takes search_page as the first argument, and the rest of the args are just lists of parameters that we wish to pass into search_page()
.import requestsfrom bs4 import BeautifulSoupfrom urllib.parse import urlparse, parse_qsimport csvfrom os import pathfrom concurrent.futures import ThreadPoolExecutor def write_page_to_csv(filename, object_array): path_to_csv = filename file_exists = path.exists(filename) with open(path_to_csv, mode="a", newline="", encoding="UTF-8") as file: #name the headers after our object keys writer = csv.DictWriter(file, fieldnames=object_array[0].keys()) if not file_exists: writer.writeheader() writer.writerows(object_array) def search_page(query, page, location="United States", retries=3, num=100): headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.3'} results = [] last_link = "" tries = 0 success = False while tries <= retries and not success: try: url = f"https://www.google.com/search?q={query}&start={page * num}&num={num}" response = requests.get(url, headers=headers) if response.status_code != 200: print("Failed server response", response.status_code) raise Exception("Failed server response!") print(f"Response Code: {response.status_code}") soup = BeautifulSoup(response.text, 'html.parser') index = 0 for result in soup.find_all('div'): title = result.find('h3') if title: title = title.text else: continue base_url = "" #pull the raw link from the result link = result.find('a', href=True) if link: link = link['href'] parsed_url = urlparse(link) base_url = f"{parsed_url.scheme}://{parsed_url.netloc}" else: continue #this is the full site info we wish to extract site_info = {'title': title, "base_url": base_url, 'link': link, "page": page, "result_number": index} #if the link is different from the last link if last_link != site_info["link"]: results.append(site_info) index += 1 last_link = link write_page_to_csv(f"{query}.csv", results) success = True except: print(f"Failed to scrape page {page}") print(f"Retries left: {retries-tries}") tries += 1 if not success: print(f"Failed to scrape page {page}, no retries left") raise Exception(f"Max retries exceeded: {retries}") else: print(f"Scraped page {page} with {retries} retries left") def full_search(query, pages=3, location="United States", MAX_THREADS=5, MAX_RETRIES=4, num=100): page_numbers = list(range(pages)) full_results = [] with ThreadPoolExecutor(max_workers=MAX_THREADS) as executor: executor.map(search_page, [query]*pages, page_numbers, [location]*pages, [MAX_RETRIES], [num]) if __name__ == "__main__": MAX_RETRIES = 5 QUERIES = ["cool stuff"] for query in QUERIES: full_search(query, pages=1)
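If executor.map() still looks odd, here is a tiny standalone sketch (nothing to do with scraping) showing how it lines the argument lists up, one element per call:

from concurrent.futures import ThreadPoolExecutor

def label_page(page, query, num):
    return f"{query}: page {page}, {num} results"

pages = [0, 1, 2]
with ThreadPoolExecutor(max_workers=3) as executor:
    # call 0 gets (0, "cool stuff", 10), call 1 gets (1, "cool stuff", 10), and so on
    for line in executor.map(label_page, pages, ["cool stuff"] * 3, [10] * 3):
        print(line)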
To use the ScrapeOps Proxy, we add a new function, get_scrapeops_url()
. This is a really simply function that just performs some basic string formatting for us, but this is vital to our scraper. We now have the ability to convert any url into a proxied url with very minimal impact on our overall code. With this function, we can now run our Python script, without getting blocked!import requestsfrom bs4 import BeautifulSoupfrom urllib.parse import urlparse, parse_qs, urlencodeimport csvfrom os import pathfrom concurrent.futures import ThreadPoolExecutor #our default user agentheaders = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.3'}proxy_url = "https://proxy.scrapeops.io/v1/"API_KEY = "YOUR-SUPER-SECRET-API-KEY" def get_scrapeops_url(url, location='us'): payload = {'api_key': API_KEY, 'url': url, 'country': location} proxy_url = 'https://proxy.scrapeops.io/v1/?' + urlencode(payload) return proxy_url def write_page_to_csv(filename, object_array): path_to_csv = filename file_exists = path.exists(filename) with open(path_to_csv, mode="a", newline="", encoding="UTF-8") as file: #name the headers after our object keys writer = csv.DictWriter(file, fieldnames=object_array[0].keys()) if not file_exists: writer.writeheader() writer.writerows(object_array) def search_page(query, page, location="United States", retries=3, num=100): headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.3'} results = [] last_link = "" tries = 0 success = False while tries <= retries and not success: try: url = f"https://www.google.com/search?q={query}&start={page * num}&num={num}" response = requests.get(get_scrapeops_url(url), headers=headers) if response.status_code != 200: print("Failed server response", response.status_code) raise Exception("Failed server response!") print(f"Response Code: {response.status_code}") soup = BeautifulSoup(response.text, 'html.parser') index = 0 for result in soup.find_all('div'): title = result.find('h3') if title: title = title.text else: continue base_url = "" #pull the raw link from the result link = result.find('a', href=True) if link: link = link['href'] parsed_url = urlparse(link) base_url = f"{parsed_url.scheme}://{parsed_url.netloc}" else: continue #this is the full site info we wish to extract site_info = {'title': title, "base_url": base_url, 'link': link, "page": page, "result_number": index} #if the link is different from the last link if last_link != site_info["link"]: results.append(site_info) index += 1 last_link = link write_page_to_csv(f"{query}.csv", results) success = True except: print(f"Failed to scrape page {page}") print(f"Retries left: {retries-tries}") tries += 1 if not success: print(f"Failed to scrape page {page}, no retries left") raise Exception(f"Max retries exceeded: {retries}") else: print(f"Scraped page {page} with {retries} retries left") def full_search(query, pages=3, location="us", MAX_THREADS=5, MAX_RETRIES=4, num=100): page_numbers = list(range(pages)) full_results = [] with ThreadPoolExecutor(max_workers=MAX_THREADS) as executor: executor.map(search_page, [query]*pages, page_numbers, [location]*pages, [MAX_RETRIES]*pages, [num]*pages) if __name__ == "__main__": MAX_RETRIES = 5 RESULTS_PER_PAGE = 10 QUERIES = ["cool stuff"] for query in QUERIES: full_search(query, pages=3, num=RESULTS_PER_PAGE)
In the code above:

- We create a proxy_url
- We use get_scrapeops_url() to convert regular urls into proxied ones

In our production code below, we keep the search_page() function and we run our multithreading from a full_search() function called in the main block at the bottom of the script. We also added basic logging and file handling to prevent overwriting results.

Take note of the following classes: SearchData and DataPipeline. SearchData is a simpler class that basically just holds the data we're choosing to scrape. DataPipeline
is where the real heavy lifting gets done.import requestsfrom bs4 import BeautifulSoupfrom urllib.parse import urlparse, parse_qs, urlencodeimport csvimport concurrentfrom concurrent.futures import ThreadPoolExecutorimport osimport loggingimport timefrom dataclasses import dataclass, field, fields, asdict#our default user agentheaders = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.3'}proxy_url = "https://proxy.scrapeops.io/v1/"API_KEY = "YOUR-SUPER-SECRET-API-KEY" ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str base_url: str link: str page: int result_number: int def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): if isinstance(getattr(self, field.name), str): if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_open = False def save_to_csv(self): self.csv_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) with open(self.csv_filename, mode="a", encoding="UTF-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate Item Found: {input_data.name}. Item dropped") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def get_scrapeops_url(url): payload = {'api_key': API_KEY, 'url': url, 'country': 'us'} proxy_url = 'https://proxy.scrapeops.io/v1/?' 
+ urlencode(payload) return proxy_url def search_page(query, page, location="United States", headers=headers, pipeline=None, num=100, retries=3): url = f"https://www.google.com/search?q={query}&start={page * num}&num={num}" payload = { "api_key": API_KEY, "url": url, } tries = 0 success = False while tries <= retries and not success: try: response = requests.get(get_scrapeops_url(url)) soup = BeautifulSoup(response.text, 'html.parser') divs = soup.find_all("div") index = 0 last_link = "" for div in divs: h3s = div.find_all("h3") if len(h3s) > 0: link = div.find("a", href=True) parsed_url = urlparse(link["href"]) base_url = f"{parsed_url.scheme}://{parsed_url.netloc}" site_info = {'title': h3s[0].text, "base_url": base_url, 'link': link["href"], "page": page, "result_number": index} search_data = SearchData( name = site_info["title"], base_url = site_info["base_url"], link = site_info["link"], page = site_info["page"], result_number = site_info["result_number"] ) if site_info["link"] != last_link: index += 1 last_link = site_info["link"] if pipeline: pipeline.add_data(search_data) success = True except: print(f"Failed to scrape page {page}") print(f"Retries left: {retries-tries}") tries += 1 if not success: print(f"Failed to scrape page {page}, no retries left") raise Exception(f"Max retries exceeded: {retries}") else: print(f"Scraped page {page} with {retries-tries} retries left") def full_search(query, pages=3, location="us", MAX_THREADS=5, MAX_RETRIES=3, num=10): with ThreadPoolExecutor(max_workers=MAX_THREADS) as executor: pipeline = DataPipeline(csv_filename=f"{query.replace(' ', '-')}.csv") tasks = [executor.submit(search_page, query, page, location, None, pipeline, num, MAX_RETRIES) for page in range(pages)] for future in tasks: future.result() pipeline.close_pipeline() if __name__ == "__main__": MAX_THREADS = 5 MAX_RETRIES = 5 queries = ["cool stuff"] logger.info("Starting full search...") for query in queries: full_search(query, pages=3, num=10) logger.info("Search complete.")
In the code above:

- SearchData is a class that simply holds our data
- DataPipeline does all the heavy lifting of removing duplicates and writing the data to our csv file

Whenever you scrape a site, always check its robots.txt file to see what they allow. Generally, if you are scraping as a guest (not logged in), the information is considered to be public and scraping is usually alright. You can look at Google's robots.txt here. In addition, if you're unclear about whether or not you can scrape a site, check their Terms and Conditions. You can view Google's Terms and Conditions here.

Similar to many other companies, Google reserves the right to suspend, terminate or delete your account if they have reason to believe that you are connected to suspicious or malicious activity.

Also, do not collect and release anyone's personal data when scraping. In many countries this is illegal, and even if it is legal in your country, it's a pretty immoral thing to do. Always consider how your scraped data will be used as well. When you scrape a site from Google, some of the information you find might fall under the Terms and Conditions of that site as well.

Then check out ScrapeOps, the complete toolkit for web scraping.
from selenium import webdriverfrom selenium.webdriver.common.by import Byfrom time import sleepimport csvfrom concurrent.futures import ThreadPoolExecutorfrom urllib.parse import urlencodeimport osimport loggingfrom dataclasses import dataclass, field, fields, asdict #create a custom options instanceoptions = webdriver.ChromeOptions()#add headless mode to our optionsoptions.add_argument("--headless") API_KEY = "YOUR-SUPER-SECRET-API-KEY" logging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str link: str result_number: int page_number: int def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): if isinstance(getattr(self, field.name), str): if getattr(self, field.name) == '': setattr(self, field.name, f"No {field.name}") continue value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True self.data_to_save = [] self.data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not self.data_to_save: return keys = [field.name for field in fields(self.data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="UTF-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in self.data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. Item dropped") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def get_scrapeops_url(url): payload = {'api_key': API_KEY, 'url': url, 'country': 'us'} proxy_url = 'https://proxy.scrapeops.io/v1/?' + urlencode(payload) return proxy_url #this function performs a search and parses the resultsdef search_page(query, page, location): #start Chrome with our custom options driver = webdriver.Chrome(options=options) #go to the page driver.get(get_scrapeops_url(f"https://www.google.com/search?q={query}&start={page * 10}")) #find each div containing site info...THEY'RE SUPER NESTED!!! 
divs = driver.find_elements(By.CSS_SELECTOR, "div > div > div > div > div > div > div > div > div > div > div > div > div > div") #list to hold our results results = [] #index, this will be used to number the results index = 0 #last link last_link = "" #iterate through our divs for div in divs: #find the title element title = div.find_elements(By.CSS_SELECTOR, "h3") link = div.find_elements(By.CSS_SELECTOR, "a") if len(title) > 0 and len(link) > 0: #result number on the page result_number = index #site info object site_info = {"title": title[0].text, "link": link[0].get_attribute("href"), "result_number": result_number, "page": page} if site_info["link"] != last_link: #add the object to our list results.append(site_info) #increment the index index += 1 #update the last link last_link = site_info["link"] #the scrape has finished, close the browser driver.quit() #return the result list return results#function to search multiple pages, calls search_page() on eachdef full_search(query, pages=3, location="United States"): #list for our full results full_results = [] #list of page numbers page_numbers = list(range(0, pages)) #open with a max of 5 threads with ThreadPoolExecutor(max_workers=5) as executor: #call search page, pass all the following aruments into it future_results = executor.map(search_page, [query] * pages, page_numbers, [location] * pages) #for each thread result for page_result in future_results: #add it to the full_results full_results.extend(page_result) #return the finalized list return full_results if __name__ == "__main__": logger.info("Starting scrape") data_pipeline = DataPipeline(csv_filename="production-search.csv") search_results = full_search("cool stuff") for result in search_results: search_data = SearchData(name=result["title"], link=result["link"], result_number=result["result_number"] , page_number=result["page"]) data_pipeline.add_data(search_data) data_pipeline.close_pipeline() logger.info("Scrape Complete")
To run the script, enter the following command:

python your-script.py

Feel free to change "cool stuff" to whatever you'd like to query. If you want to scrape more pages, change the pages kwarg, for example: full_search("boring stuff", pages=100)
https://www.google.com/search?q=cool+stuff

Let's break this url down into its parts:

- https://www.google.com is the domain we're visiting
- /search is the endpoint
- ?q=cool+stuff represents the query we're making:
  - ? denotes the query
  - q is the parameter that we're querying
  - cool+stuff is equivalent to the string, "cool stuff"... the + denotes a space in the words

Each result title comes in an <h3>
tag, so this is a good place to look. If you choose to inspect the page further, you'll come to notice that each of these headers is deeply nested inside a number of <div> tags.

To find our results, we need to find all the div elements containing these h3 elements. If we properly identify and parse each div, we can extract all of the relevant information from it.

As mentioned above, ? denotes a query. We can actually add other query parameters using &. Google typically gives us results in batches of 10. With this in mind, we can actually request multiple "pages" by passing in a start query. After the start parameter is added, our formatted url looks like this:
'https://www.google.com/search?q={query}&start={page * 10}'
We use the page number multiplied by 10 because of the way our results get delivered. If we want to start at 0, our start would be {0 * 10}. The next batch of results would be {1 * 10}, then {2 * 10}, and so on and so forth.

By adding a geo_location parameter to our query, we can actually get results based on that individual location. Now, our formatted url would look like this:
'https://www.google.com/search?q={query}&start={page * 10}&geo_location={location}'
Create a new project folder:

mkdir google-search

Then create a new virtual environment:

Linux/Mac

python3 -m venv google-search

Windows

python -m venv google-search

Activate the environment:

Linux/Mac

source google-search/bin/activate

Windows

.\google-search\Scripts\Activate.ps1

Next, install Selenium with pip:

pip install selenium

Make sure you have Chrome installed as well. You can check your version of Chrome with the following command:

google-chrome --version

Google Chrome 123.0.6312.105
'https://www.google.com/search?q={query}&start={page * 10}&geo_location={location}'
from selenium import webdriverfrom selenium.webdriver.common.by import By#create a custom options instanceoptions = webdriver.ChromeOptions()#add headless mode to our optionsoptions.add_argument("--headless")#this function performs a search and parses the resultsdef search_page(query): #start Chrome with our custom options driver = webdriver.Chrome(options=options) #go to the page driver.get(f"https://www.google.com/search?q={query}") #find each div containing site info...THEY'RE SUPER NESTED!!! divs = driver.find_elements(By.CSS_SELECTOR, "div > div > div > div > div > div > div > div > div > div > div > div > div > div") #list to hold our results results = [] #index, this will be used to number the results index = 0 #iterate through our divs for div in divs: #find the title element title = div.find_elements(By.CSS_SELECTOR, "h3") #find the link element link = div.find_elements(By.CSS_SELECTOR, "a") #result number on the page result_number = index #if we have a result if len(title) > 0: #site info object site_info = {"title": title[0].text, "link": link[0].get_attribute("href"), "result_number": result_number} #add the object to our list results.append(site_info) #increment the index index += 1 #the scrape has finished, close the browser driver.quit() #return the result list return results ####this is our main program down here####search_results = search_page("cool stuff")#print our resultsfor result in search_results: print(result)
In the code above:

- We create an instance of ChromeOptions and add the "--headless" argument to it
- We write a search_page() function that takes a query as a parameter
- webdriver.Chrome(options=options) opens our browser in headless mode
- We use driver.get() to go to our site
- We find our div elements using their CSS Selector... They are SUPER NESTED!
- We create an index variable so that we can give each result a number
- We use find_elements() to get the title and link for each object (see the short sketch after this list)
- If the result of find_elements() is not empty, we save the following:
  - title.text
  - link.get_attribute("href")
  - result_number
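If find_elements() and CSS selectors are new to you, here is a small standalone sketch of the same pattern (example.com is just a stand-in page, not Google):

from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
driver.get("https://www.example.com")

# find_elements() returns a (possibly empty) list, so we can safely check its length
headers = driver.find_elements(By.CSS_SELECTOR, "h1")
links = driver.find_elements(By.CSS_SELECTOR, "a")

if len(headers) > 0:
    print("First header:", headers[0].text)
if len(links) > 0:
    print("First link:", links[0].get_attribute("href"))

driver.quit()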
Remember the format of our paginated url:

'https://www.google.com/search?q={query}&start={page * 10}&geo_location={location}'

Now, we'll add pagination and location support to our search_page()
function.from selenium import webdriverfrom selenium.webdriver.common.by import By#create a custom options instanceoptions = webdriver.ChromeOptions()#add headless mode to our optionsoptions.add_argument("--headless")#this function performs a search and parses the resultsdef search_page(query, page, location): #start Chrome with our custom options driver = webdriver.Chrome(options=options) #go to the page driver.get(f"https://www.google.com/search?q={query}&start={page * 10}&location={location}") #find each div containing site info...THEY'RE SUPER NESTED!!! divs = driver.find_elements(By.CSS_SELECTOR, "div > div > div > div > div > div > div > div > div > div > div > div > div > div") #list to hold our results results = [] #index, this will be used to number the results index = 0 #last link last_link = "" #iterate through our divs for div in divs: #find the title element title = div.find_elements(By.CSS_SELECTOR, "h3") #find the link element link = div.find_elements(By.CSS_SELECTOR, "a") #result number on the page result_number = index #if we have a result if len(title) > 0: #site info object site_info = {"title": title[0].text, "link": link[0].get_attribute("href"), "result_number": result_number, "page": page} if site_info["link"] != last_link: #add the object to our list results.append(site_info) #increment the index index += 1 #update the last link last_link = site_info["link"] #the scrape has finished, close the browser driver.quit() #return the result list return results#function to search multiple pages, calls search_page() on eachdef full_search(query, pages=3, location="United States"): #list for our full results full_results = [] #iterate through our pages for page in range(0, pages): #get the results of the page page_results = search_page(query, page, location) #add them to the full_results list full_results.extend(page_results) #return the finalized list return full_results####this is our main program down here####search_results = full_search("cool stuff")#print our resultsfor result in search_results: print(result)
In the updated code:

- search_page() now takes three arguments: query, page, and location
- page and location have been added into the formatted url
- We create a last_link variable and use it to prevent doubles from getting into our results
- We add a full_search() function
- full_search() simply runs search_page() on a list of pages and returns a full list of results

You have probably noticed that we return dict objects from each of our functions. The reason for using these dictionaries is simple: when you hold object data in a dict of key-value pairs, it's really easy to transform it into something else. Not all libraries are built to handle all data formats, but almost all of them support JSON or dictionaries (both of these formats are key-value pairs).

Now, we'll remove the following code from the bottom of the script:

#print our results
for result in search_results:
    print(result)
Then, add the following import:

import csv

At the bottom of the script, we'll write our results to a CSV file:
#path to the csv file
path_to_csv = "search-results.csv"
#open the file in write mode
with open(path_to_csv, "w") as file:
    #format the file based on the keys of the first result
    writer = csv.DictWriter(file, search_results[0].keys())
    #write the headers
    writer.writeheader()
    #write each object as a row in the file
    writer.writerows(search_results)
In this snippet, we:

- Create a path_to_csv variable
- Pass path_to_csv and "w" as arguments to open the file in write mode
- csv.DictWriter(file, search_results[0].keys()) tells the writer object to format our file based on the keys of the first dict object in our list
- writer.writeheader() writes the actual headers to the document
- writer.writerows(search_results) writes our actual search results to the csv file

Next, we'll refactor our full_search() function so that things are done concurrently. Here is our modified full_search()
function:

#function to search multiple pages, calls search_page() on each
def full_search(query, pages=3, location="United States"):
    #list for our full results
    full_results = []
    #list of page numbers
    page_numbers = list(range(0, pages))
    #open with a max of 5 threads
    with ThreadPoolExecutor(max_workers=5) as executor:
        #call search_page, pass all the following arguments into it
        future_results = executor.map(search_page, [query] * pages, page_numbers, [location] * pages)
        #for each thread result
        for page_result in future_results:
            #add it to the full_results
            full_results.extend(page_result)
    #return the finalized list
    return full_results
In the refactored function:

- We open a ThreadPoolExecutor instance with a max of 5 workers
- executor.map(search_page, [query] * pages, page_numbers, [location] * pages) calls search_page() and passes in lists of arguments to it
- We take each page_result and use extend() to add it to the full_results
listfrom selenium import webdriverfrom selenium.webdriver.common.by import Byimport csvfrom concurrent.futures import ThreadPoolExecutor#create a custom options instanceoptions = webdriver.ChromeOptions()#add headless mode to our optionsoptions.add_argument("--headless")#this function performs a search and parses the resultsdef search_page(query, page, location): #start Chrome with our custom options driver = webdriver.Chrome(options=options) #go to the page driver.get(f"https://www.google.com/search?q={query}&start={page * 10}&location={location}") #find each div containing site info...THEY'RE SUPER NESTED!!! divs = driver.find_elements(By.CSS_SELECTOR, "div > div > div > div > div > div > div > div > div > div > div > div > div > div") #list to hold our results results = [] #index, this will be used to number the results index = 0 #last link last_link = "" #iterate through our divs for div in divs: #find the title element title = div.find_elements(By.CSS_SELECTOR, "h3") #find the link element link = div.find_elements(By.CSS_SELECTOR, "a") #result number on the page result_number = index #if we have a result if len(title) > 0: #site info object site_info = {"title": title[0].text, "link": link[0].get_attribute("href"), "result_number": result_number, "page": page} if site_info["link"] != last_link: #add the object to our list results.append(site_info) #increment the index index += 1 #update the last link last_link = site_info["link"] #the scrape has finished, close the browser driver.quit() #return the result list return results#function to search multiple pages, calls search_page() on eachdef full_search(query, pages=3, location="United States"): #list for our full results full_results = [] #list of page numbers page_numbers = list(range(0, pages)) #open with a max of 5 threads with ThreadPoolExecutor(max_workers=5) as executor: #call search page, pass all the following aruments into it future_results = executor.map(search_page, [query] * pages, page_numbers, [location] * pages) #for each thread result for page_result in future_results: #add it to the full_results full_results.extend(page_result) #return the finalized list return full_results####this is our main program down here#####results from the searchsearch_results = full_search("cool stuff")#path to the csv filepath_to_csv = "concurrency.csv"#open the file in write modewith open(path_to_csv, "w") as file: #format the file based on the keys of the first result writer = csv.DictWriter(file, search_results[0].keys()) #write the headers writer.writeheader() #write each object as a row in the file writer.writerows(search_results)
To keep from getting blocked, we'll once again route our requests through the ScrapeOps Proxy. Here is our get_scrapeops_url() function:

def get_scrapeops_url(url):
    payload = {'api_key': API_KEY, 'url': url, 'country': 'us'}
    proxy_url = 'https://proxy.scrapeops.io/v1/?' + urlencode(payload)
    return proxy_url
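For example, wrapping one of our search urls produces something like this (the function is repeated here with a placeholder API key so the sketch runs on its own):

from urllib.parse import urlencode

API_KEY = "YOUR-SUPER-SECRET-API-KEY"  # placeholder

def get_scrapeops_url(url):
    payload = {'api_key': API_KEY, 'url': url, 'country': 'us'}
    return 'https://proxy.scrapeops.io/v1/?' + urlencode(payload)

print(get_scrapeops_url("https://www.google.com/search?q=cool+stuff&start=0"))
# https://proxy.scrapeops.io/v1/?api_key=YOUR-SUPER-SECRET-API-KEY&url=https%3A%2F%2Fwww.google.com%2Fsearch%3Fq%3Dcool%2Bstuff%26start%3D0&country=us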
After converting our url, we can simply driver.get()
this new proxied url just like we would with a non-proxied url. When scraping at scale, we need to use proxies consistently.The ScrapeOps Proxy rotates IP addresses and always uses the best proxy available for each request. This actually allows each of our requests to show up as a different user with potentially a different browser, OS and often a different location as well.When using a proxy, no one can block you based on your location, because your location changes whenever you make a new request to the site.Here is a proxied version of our script:from selenium import webdriverfrom selenium.webdriver.common.by import Byfrom time import sleepimport csvfrom concurrent.futures import ThreadPoolExecutorfrom urllib.parse import urlencode#create a custom options instanceoptions = webdriver.ChromeOptions()#add headless mode to our optionsoptions.add_argument("--headless") API_KEY = "YOUR-SUPER-SECRET-API-KEY"def get_scrapeops_url(url): payload = {'api_key': API_KEY, 'url': url, 'country': 'us'} proxy_url = 'https://proxy.scrapeops.io/v1/?' + urlencode(payload) return proxy_url #this function performs a search and parses the resultsdef search_page(query, page, location): #start Chrome with our custom options driver = webdriver.Chrome(options=options) #go to the page driver.get(get_scrapeops_url(f"https://www.google.com/search?q={query}&start={page * 10}")) #find each div containing site info...THEY'RE SUPER NESTED!!! divs = driver.find_elements(By.CSS_SELECTOR, "div > div > div > div > div > div > div > div > div > div > div > div > div > div") #list to hold our results results = [] #index, this will be used to number the results index = 0 #last link last_link = "" #iterate through our divs for div in divs: #find the title element title = div.find_elements(By.CSS_SELECTOR, "h3") link = div.find_elements(By.CSS_SELECTOR, "a") if len(title) > 0 and len(link) > 0: #result number on the page result_number = index #site info object site_info = {"title": title[0].text, "link": link[0].get_attribute("href"), "result_number": result_number, "page": page} if site_info["link"] != last_link: #add the object to our list results.append(site_info) #increment the index index += 1 #update the last link last_link = site_info["link"] #the scrape has finished, close the browser driver.quit() #return the result list return results#function to search multiple pages, calls search_page() on eachdef full_search(query, pages=3, location="United States"): #list for our full results full_results = [] #list of page numbers page_numbers = list(range(0, pages)) #open with a max of 5 threads with ThreadPoolExecutor(max_workers=5) as executor: #call search page, pass all the following aruments into it future_results = executor.map(search_page, [query] * pages, page_numbers, [location] * pages) #for each thread result for page_result in future_results: #add it to the full_results full_results.extend(page_result) #return the finalized list return full_results if __name__ == "__main__": search_results = full_search("cool stuff") #path to the csv file path_to_csv = "proxied.csv" #open the file in write mode with open(path_to_csv, "w") as file: #format the file based on the keys of the first result writer = csv.DictWriter(file, search_results[0].keys()) #write the headers writer.writeheader() #write each object as a row in the file writer.writerows(search_results)
"YOUR-SUPER-SECRET-API-KEY"
should be replaced by your API keyget_scrapeops_url()
converts normal urls into proxied onesmain
code block at the end of the script, this is because we're closer to productionSearchData
class and a DataPipeline
class. SearchData
doesn't do much other than hold and format individual results. The DataPipeline
is where the real heavy lifting gets done as far as our production storage.Here is our production scraper:from selenium import webdriverfrom selenium.webdriver.common.by import Byfrom time import sleepimport csvfrom concurrent.futures import ThreadPoolExecutorfrom urllib.parse import urlencodeimport osimport loggingfrom dataclasses import dataclass, field, fields, asdict #create a custom options instanceoptions = webdriver.ChromeOptions()#add headless mode to our optionsoptions.add_argument("--headless") API_KEY = "YOUR-SUPER-SECRET-API-KEY" logging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass SearchData: name: str link: str result_number: int page_number: int def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): if isinstance(getattr(self, field.name), str): if getattr(self, field.name) == '': setattr(self, field.name, f"No {field.name}") continue value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True self.data_to_save = [] self.data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not self.data_to_save: return keys = [field.name for field in fields(self.data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="UTF-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in self.data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. Item dropped") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def get_scrapeops_url(url): payload = {'api_key': API_KEY, 'url': url, 'country': 'us'} proxy_url = 'https://proxy.scrapeops.io/v1/?' + urlencode(payload) return proxy_url #this function performs a search and parses the resultsdef search_page(query, page, location): #start Chrome with our custom options driver = webdriver.Chrome(options=options) #go to the page driver.get(get_scrapeops_url(f"https://www.google.com/search?q={query}&start={page * 10}")) #find each div containing site info...THEY'RE SUPER NESTED!!! 
divs = driver.find_elements(By.CSS_SELECTOR, "div > div > div > div > div > div > div > div > div > div > div > div > div > div") #list to hold our results results = [] #index, this will be used to number the results index = 0 #last link last_link = "" #iterate through our divs for div in divs: #find the title element title = div.find_elements(By.CSS_SELECTOR, "h3") link = div.find_elements(By.CSS_SELECTOR, "a") if len(title) > 0 and len(link) > 0: #result number on the page result_number = index #site info object site_info = {"title": title[0].text, "link": link[0].get_attribute("href"), "result_number": result_number, "page": page} if site_info["link"] != last_link: #add the object to our list results.append(site_info) #increment the index index += 1 #update the last link last_link = site_info["link"] #the scrape has finished, close the browser driver.quit() #return the result list return results#function to search multiple pages, calls search_page() on eachdef full_search(query, pages=3, location="United States"): #list for our full results full_results = [] #list of page numbers page_numbers = list(range(0, pages)) #open with a max of 5 threads with ThreadPoolExecutor(max_workers=5) as executor: #call search page, pass all the following aruments into it future_results = executor.map(search_page, [query] * pages, page_numbers, [location] * pages) #for each thread result for page_result in future_results: #add it to the full_results full_results.extend(page_result) #return the finalized list return full_results if __name__ == "__main__": logger.info("Starting scrape") data_pipeline = DataPipeline(csv_filename="production-search.csv") search_results = full_search("cool stuff") for result in search_results: search_data = SearchData(name=result["title"], link=result["link"], result_number=result["result_number"] , page_number=result["page"]) data_pipeline.add_data(search_data) data_pipeline.close_pipeline() logger.info("Scrape Complete")
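To see how these two classes fit together in isolation, here is a trimmed-down sketch; the field names mirror SearchData above, the sample rows are made up, and pipeline-demo.csv is just a placeholder filename:

from dataclasses import dataclass, fields, asdict
import csv
import os

@dataclass
class SearchData:
    name: str
    link: str
    result_number: int
    page_number: int

# made-up results; the second one is a duplicate by name and gets dropped
scraped = [
    SearchData("Cool Stuff", "https://www.example.com/cool", 0, 0),
    SearchData("Cool Stuff", "https://www.example.com/cool", 1, 0),
]

# minimal stand-in for the duplicate check done by DataPipeline.is_duplicate()
names_seen = []
rows = []
for item in scraped:
    if item.name in names_seen:
        continue
    names_seen.append(item.name)
    rows.append(asdict(item))

# append to the csv, writing headers only if the file is new or empty
filename = "pipeline-demo.csv"
file_exists = os.path.isfile(filename) and os.path.getsize(filename) > 0
with open(filename, mode="a", newline="", encoding="UTF-8") as output_file:
    writer = csv.DictWriter(output_file, fieldnames=[field.name for field in fields(SearchData)])
    if not file_exists:
        writer.writeheader()
    writer.writerows(rows)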
"production-search.csv"
to your desired filename"cool stuff"
to whatever query you'd like to performpages
kwarg in the full_search()
function:full_search("boring stuff", pages=1000)
Whenever you scrape, pay attention to the site's robots.txt. You can view Google's robots.txt here.

Another thing to consider is the terms and conditions (T&C) policies of the websites you scrape. Unauthorized scraping or violating terms of service may result in legal action or being blocked from accessing services. According to Google's T&C policy, Google reserves the right to suspend or terminate your access to the services or delete your Google Account if they reasonably believe that your conduct causes harm or liability to a user, third party, or Google, for example by hacking, phishing, harassing, spamming, misleading others, or scraping content that doesn't belong to you.

It's crucial to consider not only the legality of scraping data but also how the scraped data will be used. Data scraped from Google or other websites may be subject to copyright laws or regulations governing personal data, depending on the jurisdiction and intended use.

Then check out ScrapeOps, the complete toolkit for web scraping.
From each search result, we extract a name and a link:
const puppeteer = require('puppeteer');const createCsvWriter = require('csv-writer').createObjectCsvWriter;const fs = require('fs'); const API_KEY = 'YOUR-SUPER-SECRET-API-KEY';const outputFile = 'production.csv';const fileExists = fs.existsSync(outputFile); //set up the csv writerconst csvWriter = createCsvWriter({ path: outputFile, header: [ { id: 'name', title: 'Name' }, { id: 'link', title: 'Link' }, { id: 'result_number', title: 'Result Number' }, { id: 'page', title: 'Page Number' }, ], append: fileExists,});//convert regular urls into proxied onesfunction getScrapeOpsURL(url, location) { const params = new URLSearchParams({ api_key: API_KEY, url: url, country: location, }); return `https://proxy.scrapeops.io/v1/?${params.toString()}`;}//scrape page, this is our main logicasync function scrapePage( browser, query, pageNumber, location, retries = 3, num = 100) { let tries = 0; while (tries <= retries) { const page = await browser.newPage(); try { const url = `https://www.google.com/search?q=${query}&start=${pageNumber * num}&num=${num}`; const proxyUrl = getScrapeOpsURL(url, location); //set a long timeout, sometimes the server take awhile await page.goto(proxyUrl, { timeout: 300000 }); //find the nested divs const divs = await page.$$( 'div > div > div > div > div > div > div > div' ); const scrapeContent = []; seenLinks = []; let index = 0; for (const div of divs) { const h3s = await div.$('h3'); const links = await div.$('a'); //if we have the required info if (h3s && links) { //pull the name const name = await div.$eval('h3', (h3) => h3.textContent); //pull the link const linkHref = await div.$eval('a', (a) => a.href); //filter out bad links if ( !linkHref.includes('https://proxy.scrapeops.io/') && !seenLinks.includes(linkHref) ) { scrapeContent.push({ name: name, link: linkHref, page: pageNumber, result_number: index, }); seenLinks.push(linkHref); index++; } } } //we failed to get a result, throw an error and attempt a retry if (scrapeContent.length === 0) { throw new Error(`Failed to scrape page ${pageNumber}`); //we have a page result, write it to the csv } else { await csvWriter.writeRecords(scrapeContent); //exit the function return; } } catch (err) { console.log(`ERROR: ${err}`); console.log(`Retries left: ${retries - tries}`); tries++; } finally { await page.close(); } } throw new Error(`Max retries reached: ${tries}`);}//function to launch a browser and scrape each page concurrentlyasync function concurrentScrape( query, totalPages, location, num = 10, retries = 3) { const browser = await puppeteer.launch(); const tasks = []; for (let i = 0; i < totalPages; i++) { tasks.push(scrapePage(browser, query, i, location, retries, num)); } await Promise.all(tasks); await browser.close();}//main functionasync function main() { const queries = ['cool stuff']; const location = 'us'; const totalPages = 3; const batchSize = 20; const retries = 5; console.log('Starting scrape...'); for (const query of queries) { await concurrentScrape( query, totalPages, location, (num = batchSize), retries ); console.log(`Scrape complete, results saved to: ${outputFile}`); }}//run the main functionmain();
To run a different search, change the queries array inside the main function. Feel free to change the location and totalPages variables (or any of the other constants in the main function) to change your results as well. Just remember to replace "YOUR-SUPER-SECRET-API-KEY" with your ScrapeOps API key.
We'll use puppeteer to perform our searches and interpret the results, and we'll use csv-writer and fs for handling the filesystem and storing our data. These dependencies give us the power to not only extract page data, but also filter and store our data safely and efficiently.

Early in development we'll try passing geo_location inside the url, but later on we remove this and let the ScrapeOps Proxy handle our location for us.

Our base search url looks like this:

https://www.google.com/search?q=${query}
If we want to look up cool stuff, our url would be:

https://www.google.com/search?q=cool+stuff
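Notice that the space in the query becomes a + in the url. You don't have to encode this by hand; here's a minimal sketch (this helper is not part of the scraper, just an illustration) showing URLSearchParams doing it for us:

```javascript
// Minimal sketch: let URLSearchParams handle the query-string encoding.
function buildSearchUrl(query) {
  const params = new URLSearchParams({ q: query });
  return `https://www.google.com/search?${params.toString()}`;
}

console.log(buildSearchUrl('cool stuff'));
// -> https://www.google.com/search?q=cool+stuff
```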
To control how many results come back per request, we can use the num query. The num query tends to get mixed results, since most normal users are on default settings with approximately 10 results. If you choose to use the num query, exercise caution: Google does block suspicious traffic, and the num query does make you look less human.

To add more parameters to the url, we simply append & followed by the query name and value. We'll explore these additional queries in the coming sections.

In the olden days, Google gave us actual pages. In the modern day, Google gives us all of our results on a single page. At first glance, this would make our scrape much more difficult; however, our results come in batches, which makes it incredibly simple to simulate pages. To control which result we start at, we can use the start parameter. If we want to start at result 0, our url would be:

https://www.google.com/search?q=cool+stuff&start=0
If we want to start at result 10 (the second batch of results), we would GET:

https://www.google.com/search?q=cool+stuff&start=10
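To make the relationship between the page number, the batch size, and the start offset concrete, here's a small sketch; the buildPagedUrl helper is purely illustrative and not part of the article's scraper:

```javascript
// Minimal sketch: start is just pageNumber multiplied by the batch size (num).
function buildPagedUrl(query, pageNumber, num = 10) {
  const params = new URLSearchParams({
    q: query,
    start: pageNumber * num,
    num: num,
  });
  return `https://www.google.com/search?${params.toString()}`;
}

for (let page = 0; page < 3; page++) {
  console.log(buildPagedUrl('cool stuff', page));
}
// prints urls with start=0, start=10, start=20
```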
To control our location, we can use the geo_location parameter. If we want to look up cool stuff and use a location of Japan, our url would look like this:

https://www.google.com/search?q=cool+stuff&geo_location=japan
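If you'd like to experiment with geo_location before we hand location handling over to the ScrapeOps Proxy later on, a quick sketch (again, the helper name is just for illustration) shows where the parameter slots in:

```javascript
// Sketch only: append geo_location alongside q. Later in the article we drop this
// and pass a country code to the ScrapeOps Proxy instead.
function buildLocatedUrl(query, geoLocation) {
  const params = new URLSearchParams({ q: query, geo_location: geoLocation });
  return `https://www.google.com/search?${params.toString()}`;
}

console.log(buildLocatedUrl('cool stuff', 'japan'));
// -> https://www.google.com/search?q=cool+stuff&geo_location=japan
```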
We'll be using puppeteer for web browsing and parsing HTML, csv-writer to store our data, and fs for basic file operations.

You can start by making a new folder in your file explorer, or you can create one from the command line with the commands below:

```
mkdir puppeteer-google-search
cd puppeteer-google-search
```

Then initialize the project and install the dependencies:

```
npm init --y
npm install puppeteer
npm install csv-writer
```

There's no need to install fs, because it comes with NodeJS. In our scraper, we simply require it.
```javascript
const puppeteer = require('puppeteer');
const createCsvWriter = require('csv-writer').createObjectCsvWriter;
const fs = require('fs');

async function scrapePage(query) {
  //set up our page and browser
  const url = `https://www.google.com/search?q=${query}`;
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  //go to the site
  await page.goto(url);
  //extract the nested divs
  const divs = await page.$$('div > div > div > div > div > div > div > div');
  const scrapeContent = [];
  const seenLinks = [];
  let index = 0;
  for (const div of divs) {
    const h3s = await div.$('h3');
    const links = await div.$('a');
    //if we have the required info
    if (h3s && links) {
      //pull the name
      const name = await div.$eval('h3', (h3) => h3.textContent);
      //pull the link
      const linkHref = await div.$eval('a', (a) => a.href);
      //filter out bad links
      if (
        !linkHref.includes('https://proxy.scrapeops.io/') &&
        !seenLinks.includes(linkHref)
      ) {
        scrapeContent.push({
          name: name,
          link: linkHref,
          result_number: index,
        });
        //add the link to our list of seen links
        seenLinks.push(linkHref);
        index++;
      }
    }
  }
  await browser.close();
  return scrapeContent;
}

//main function
async function main() {
  const results = await scrapePage('cool stuff');
  for (const result of results) {
    console.log(result);
  }
}

//run the main function
main();
```
- scrapeContent is an array that holds the results we return
- the seenLinks array is strictly for holding links we've already scraped
- index holds our index on the page
- const divs = await page.$$("div > div > div > div > div > div > div > div"); finds all of our super nested divs
- for each div, we:
  - use div.$() to check for the presence of h3 and a elements (see the short sketch after this list)
  - use div.$eval() to pull the name and the link
  - push the result into scrapeContent
  - add the link to seenLinks so we don't scrape it again
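If the difference between div.$() and div.$eval() is new to you, here's a tiny standalone sketch, separate from our scraper and pointed at example.com, that shows what each call gives back:

```javascript
// Standalone sketch: $() returns an ElementHandle (or null), while $eval() runs a
// function against the matched element inside the page and returns the result.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  const heading = await page.$('h1'); // presence check
  console.log('h1 present:', heading !== null);

  const headingText = await page.$eval('h1', (h1) => h1.textContent); // extract text
  console.log('h1 text:', headingText);

  await browser.close();
})();
```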
Remember the start parameter from earlier? Here's the url that starts at result 0:

https://www.google.com/search?q=cool+stuff&start=0
To fetch a specific page, we multiply our pageNumber by 10. Taking pagination into account, our url will now look like this:

https://www.google.com/search?q=${query}&start=${pageNumber * 10}
```javascript
const puppeteer = require('puppeteer');
const createCsvWriter = require('csv-writer').createObjectCsvWriter;
const fs = require('fs');

async function scrapePage(query, pageNumber) {
  //set up our page and browser
  const url = `https://www.google.com/search?q=${query}&start=${pageNumber * 10}`;
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  //go to the site
  await page.goto(url);
  //extract the nested divs
  const divs = await page.$$('div > div > div > div > div > div > div > div');
  const scrapeContent = [];
  const seenLinks = [];
  let index = 0;
  for (const div of divs) {
    const h3s = await div.$('h3');
    const links = await div.$('a');
    //if we have the required info
    if (h3s && links) {
      //pull the name
      const name = await div.$eval('h3', (h3) => h3.textContent);
      //pull the link
      const linkHref = await div.$eval('a', (a) => a.href);
      //filter out bad links
      if (
        !linkHref.includes('https://proxy.scrapeops.io/') &&
        !seenLinks.includes(linkHref)
      ) {
        scrapeContent.push({
          name: name,
          link: linkHref,
          pageNumber: pageNumber,
          result_number: index,
        });
        //add the link to our list of seen links
        seenLinks.push(linkHref);
        index++;
      }
    }
  }
  await browser.close();
  return scrapeContent;
}

//main function
async function main() {
  const results = await scrapePage('cool stuff', 0);
  for (const result of results) {
    console.log(result);
  }
}

//run the main function
main();
```
- scrapePage() now takes two arguments, query and pageNumber
- our start parameter is pageNumber multiplied by our typical batch size (10)
- const results = await scrapePage("cool stuff", 0) says we want our results to start at zero

The pageNumber argument is the foundation for everything we'll add in the coming sections. It's really hard for your scraper to organize its tasks and data if it has no idea which page it's on.
Earlier, we installed csv-writer and fs. Now it's time to use them. We'll use fs to check the existence of our outputFile and csv-writer to write the results to the actual CSV file.

Pay close attention to fileExists in this section. If our file already exists, we do not want to overwrite it. If it doesn't exist, we need to create a new file. The csvWriter in the code below does exactly this. Here's our adjusted code:
```javascript
const puppeteer = require('puppeteer');
const createCsvWriter = require('csv-writer').createObjectCsvWriter;
const fs = require('fs');

const outputFile = 'add-storage.csv';
const fileExists = fs.existsSync(outputFile);

//set up the csv writer
const csvWriter = createCsvWriter({
  path: outputFile,
  header: [
    { id: 'name', title: 'Name' },
    { id: 'link', title: 'Link' },
    { id: 'result_number', title: 'Result Number' },
    { id: 'page', title: 'Page Number' },
  ],
  append: fileExists,
});

async function scrapePage(query, pageNumber) {
  //set up our page and browser
  const url = `https://www.google.com/search?q=${query}&start=${pageNumber * 10}`;
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  //go to the site
  await page.goto(url);
  //extract the nested divs
  const divs = await page.$$('div > div > div > div > div > div > div > div');
  const scrapeContent = [];
  const seenLinks = [];
  let index = 0;
  for (const div of divs) {
    const h3s = await div.$('h3');
    const links = await div.$('a');
    //if we have the required info
    if (h3s && links) {
      //pull the name
      const name = await div.$eval('h3', (h3) => h3.textContent);
      //pull the link
      const linkHref = await div.$eval('a', (a) => a.href);
      //filter out bad links
      if (
        !linkHref.includes('https://proxy.scrapeops.io/') &&
        !seenLinks.includes(linkHref)
      ) {
        scrapeContent.push({
          name: name,
          link: linkHref,
          page: pageNumber,
          result_number: index,
        });
        //add the link to our list of seen links
        seenLinks.push(linkHref);
        index++;
      }
    }
  }
  await browser.close();
  //write this page's results to the csv
  await csvWriter.writeRecords(scrapeContent);
}

//main function
async function main() {
  console.log('Starting scrape...');
  await scrapePage('cool stuff', 0);
  console.log(`Scrape complete, results saved to: ${outputFile}`);
}

//run the main function
main();
```
- fileExists is a boolean: true if our file exists and false if it doesn't
- csvWriter opens the file in append mode if the file exists; otherwise it creates a new file

Our scraper writes each page to the outputFile as soon as it has been processed. This helps us save everything we possibly can, even in the event of a crash. Once we're scraping multiple pages at once, if our scraper succeeds on page 1 but fails on page 2 or page 0, we will still have some results that we can review!
To speed things up, we want to scrape multiple pages at the same time, and NodeJS's async support makes this completely doable. In this section, let's add a concurrentScrape() function. The goal of this function is simple: run the scrapePage() function on multiple pages at the same time.

Since we're dealing with Promise objects, it's a good idea to add some error handling in scrapePage(). We don't want a Promise to resolve with bad results. The code below adds concurrency and error handling to ensure our scrape completes properly.

```javascript
const puppeteer = require('puppeteer');
const createCsvWriter = require('csv-writer').createObjectCsvWriter;
const fs = require('fs');

const outputFile = 'add-concurrency.csv';
const fileExists = fs.existsSync(outputFile);

//set up the csv writer
const csvWriter = createCsvWriter({
  path: outputFile,
  header: [
    { id: 'name', title: 'Name' },
    { id: 'link', title: 'Link' },
    { id: 'result_number', title: 'Result Number' },
    { id: 'page', title: 'Page Number' },
  ],
  append: fileExists,
});

async function scrapePage(browser, query, pageNumber, location, retries = 3) {
  let tries = 0;
  while (tries <= retries) {
    const page = await browser.newPage();
    try {
      const url = `https://www.google.com/search?q=${query}&start=${pageNumber * 10}`;
      //set a long timeout, sometimes the server takes a while
      await page.goto(url, { timeout: 300000 });
      //find the nested divs
      const divs = await page.$$('div > div > div > div > div > div > div > div');
      const scrapeContent = [];
      const seenLinks = [];
      let index = 0;
      for (const div of divs) {
        const h3s = await div.$('h3');
        const links = await div.$('a');
        //if we have the required info
        if (h3s && links) {
          //pull the name
          const name = await div.$eval('h3', (h3) => h3.textContent);
          //pull the link
          const linkHref = await div.$eval('a', (a) => a.href);
          //filter out bad links
          if (
            !linkHref.includes('https://proxy.scrapeops.io/') &&
            !seenLinks.includes(linkHref)
          ) {
            scrapeContent.push({
              name: name,
              link: linkHref,
              page: pageNumber,
              result_number: index,
            });
            seenLinks.push(linkHref);
            index++;
          }
        }
      }
      //we failed to get a result, throw an error and attempt a retry
      if (scrapeContent.length === 0) {
        throw new Error(`Failed to scrape page ${pageNumber}`);
        //we have a page result, write it to the csv
      } else {
        await csvWriter.writeRecords(scrapeContent);
        //exit the function
        return;
      }
    } catch (err) {
      console.log(`ERROR: ${err}`);
      console.log(`Retries left: ${retries - tries}`);
      tries++;
    } finally {
      await page.close();
    }
  }
  throw new Error(`Max retries reached: ${tries}`);
}

//scrape multiple pages at once
async function concurrentScrape(query, totalPages) {
  const browser = await puppeteer.launch();
  const tasks = [];
  for (let i = 0; i < totalPages; i++) {
    tasks.push(scrapePage(browser, query, i));
  }
  await Promise.all(tasks);
  await browser.close();
}

//main function
async function main() {
  console.log('Starting scrape...');
  await concurrentScrape('cool stuff', 3);
  console.log(`Scrape complete, results saved to: ${outputFile}`);
}

//run the main function
main();
```
- scrapePage() now takes our browser as an argument, and instead of opening and closing a browser, it opens and closes a page
- if we fail to pull any results, we throw an error and retry the scrape
- once our try/catch logic has completed, we use finally to close the page and free up some memory
- concurrentScrape() runs scrapePage() on a bunch of separate pages asynchronously to speed up our results (a hedged alternative using Promise.allSettled() follows this list)
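One caveat worth knowing about, and this is an alternative rather than what the article's code does: Promise.all() rejects as soon as any single page exhausts its retries, so browser.close() above never runs for that batch. If you'd rather log the failures and still close the browser, Promise.allSettled() is a small change. This sketch assumes the same scrapePage() shown above:

```javascript
// Alternative sketch: wait for every page, report failures, and always close the browser.
async function concurrentScrapeSettled(query, totalPages) {
  const browser = await puppeteer.launch();
  const tasks = [];
  for (let i = 0; i < totalPages; i++) {
    tasks.push(scrapePage(browser, query, i));
  }
  const outcomes = await Promise.allSettled(tasks);
  outcomes.forEach((outcome, pageNumber) => {
    if (outcome.status === 'rejected') {
      console.log(`Page ${pageNumber} failed: ${outcome.reason}`);
    }
  });
  await browser.close();
}
```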
To use the ScrapeOps Proxy, all we really need to change is the url we pass into page.goto(url). In this section, we'll bring our scraper up to production quality and integrate it with the ScrapeOps proxy. To convert our urls, we'll write a small function called getScrapeOpsURL(). We also add a location parameter to scrapePage() and concurrentScrape() as well. We pass location to the ScrapeOps Proxy because they can then route us through an actual server in that location.

```javascript
const puppeteer = require('puppeteer');
const createCsvWriter = require('csv-writer').createObjectCsvWriter;
const fs = require('fs');

const API_KEY = 'YOUR-SUPER-SECRET-API-KEY';
const outputFile = 'production.csv';
const fileExists = fs.existsSync(outputFile);

//set up the csv writer
const csvWriter = createCsvWriter({
  path: outputFile,
  header: [
    { id: 'name', title: 'Name' },
    { id: 'link', title: 'Link' },
    { id: 'result_number', title: 'Result Number' },
    { id: 'page', title: 'Page Number' },
  ],
  append: fileExists,
});

//convert regular urls into proxied ones
function getScrapeOpsURL(url, location) {
  const params = new URLSearchParams({
    api_key: API_KEY,
    url: url,
    country: location,
  });
  return `https://proxy.scrapeops.io/v1/?${params.toString()}`;
}

//scrape a page, this is our main logic
async function scrapePage(browser, query, pageNumber, location, retries = 3, num = 100) {
  let tries = 0;
  while (tries <= retries) {
    const page = await browser.newPage();
    try {
      const url = `https://www.google.com/search?q=${query}&start=${pageNumber * num}&num=${num}`;
      const proxyUrl = getScrapeOpsURL(url, location);
      //set a long timeout, sometimes the server takes a while
      await page.goto(proxyUrl, { timeout: 300000 });
      //find the nested divs
      const divs = await page.$$('div > div > div > div > div > div > div > div');
      const scrapeContent = [];
      const seenLinks = [];
      let index = 0;
      for (const div of divs) {
        const h3s = await div.$('h3');
        const links = await div.$('a');
        //if we have the required info
        if (h3s && links) {
          //pull the name
          const name = await div.$eval('h3', (h3) => h3.textContent);
          //pull the link
          const linkHref = await div.$eval('a', (a) => a.href);
          //filter out bad links
          if (
            !linkHref.includes('https://proxy.scrapeops.io/') &&
            !seenLinks.includes(linkHref)
          ) {
            scrapeContent.push({
              name: name,
              link: linkHref,
              page: pageNumber,
              result_number: index,
            });
            seenLinks.push(linkHref);
            index++;
          }
        }
      }
      //we failed to get a result, throw an error and attempt a retry
      if (scrapeContent.length === 0) {
        throw new Error(`Failed to scrape page ${pageNumber}`);
        //we have a page result, write it to the csv
      } else {
        await csvWriter.writeRecords(scrapeContent);
        //exit the function
        return;
      }
    } catch (err) {
      console.log(`ERROR: ${err}`);
      console.log(`Retries left: ${retries - tries}`);
      tries++;
    } finally {
      await page.close();
    }
  }
  throw new Error(`Max retries reached: ${tries}`);
}

//launch a browser and scrape each page concurrently
async function concurrentScrape(query, totalPages, location, num = 10, retries = 3) {
  const browser = await puppeteer.launch();
  const tasks = [];
  for (let i = 0; i < totalPages; i++) {
    tasks.push(scrapePage(browser, query, i, location, retries, num));
  }
  await Promise.all(tasks);
  await browser.close();
}

//main function
async function main() {
  const queries = ['cool stuff'];
  const location = 'us';
  const totalPages = 3;
  const batchSize = 20;
  const retries = 5;

  console.log('Starting scrape...');
  for (const query of queries) {
    await concurrentScrape(query, totalPages, location, batchSize, retries);
    console.log(`Scrape complete, results saved to: ${outputFile}`);
  }
}

//run the main function
main();
```
- getScrapeOpsURL() converts our regular urls into proxied ones
- we pass location into concurrentScrape(), scrapePage(), and getScrapeOpsURL()
- instead of passing page.goto() a site directly, we pass the url into getScrapeOpsURL() and pass the result into page.goto()
- we also added the num parameter so we can tell Google how many results we want

Use num with caution. Google sometimes bans suspicious traffic, and the num query can make your scraper look abnormal. Even if they choose not to ban you, they hold the right to send you fewer than 100 results, causing your scraper to miss important data!
Everything you'd want to tweak is declared in the main() function. Take a look at our main():

```javascript
async function main() {
  const queries = ['cool stuff'];
  const location = 'us';
  const totalPages = 3;
  const batchSize = 20;
  const retries = 5;

  console.log('Starting scrape...');
  for (const query of queries) {
    await concurrentScrape(query, totalPages, location, batchSize, retries);
    console.log(`Scrape complete, results saved to: ${outputFile}`);
  }
}
```
If we wanted to scrape 100 pages of boring stuff, we'd change query to 'boring stuff' and totalPages to 100. To change the location, simply change the location variable from 'us' to whatever you'd like.

I named my production scraper production.js, and I can run it with the node command. The image below shows both the command to run it and the console output. In fact, feel free to change any of the constants declared in main(); that's exactly why they're there. These constants make it easy to tweak our results.
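For example, a tweaked main() for that 100-page boring stuff run might look like the sketch below; the rest of the script stays exactly the same, only the constants change:

```javascript
// Hypothetical tweak of main(): same structure as the production code above,
// just different constants. Swap location for any country code your proxy plan supports.
async function main() {
  const queries = ['boring stuff'];
  const location = 'us';
  const totalPages = 100;
  const batchSize = 20;
  const retries = 5;

  console.log('Starting scrape...');
  for (const query of queries) {
    await concurrentScrape(query, totalPages, location, batchSize, retries);
    console.log(`Scrape complete, results saved to: ${outputFile}`);
  }
}
```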
. That's exactly why they're there! These constants make it easy to tweak our results.robots.txt
file if you're not sure about something. You can view Google's robot.txt
here. If you're scraping as a guest (not logged into any site), the information your scraper sees is public and therefore fair game. If a site requires you to login, the information you see afterward is considered private. Don't log in with scrapers!!!Also, always pay attention to the Terms and Conditions of the site you're scraping. You can view Google's Terms here.Google does reserve the right to suspend, block, and/or delete your account if you violate their terms. Always check a site's Terms before you attempt to scrape it.Also, if you turn your Google Scraper into a crawler that scrapes the sites in your results, remember, you are subject to the Terms and Conditions of those sites as well!async
You now know how to build a Google Search scraper with Puppeteer and how to use async and Promise to improve speed and concurrency. Go build something!

If you'd like to learn more about the tech stack used in this article, you can find some links below: