Then check out ScrapeOps, the complete toolkit for web scraping.
config.json file in it (place your ScrapeOps API key inside this file). It should look similar to what you see below.

{
    "api_key": "YOUR-SUPER-SECRET-API-KEY"
}
import os
import csv
import requests
import json
import logging
import time  # used by DataPipeline.close_pipeline()
from urllib.parse import urlencode
from bs4 import BeautifulSoup
import concurrent.futures
from dataclasses import dataclass, field, fields, asdict

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]


def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location,
        "residential": True,
        "wait": 2000
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url


## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@dataclass
class ProfileData:
    name: str = ""
    follower_count: int = 0
    likes: int = 0
    video_count: int = 0
    nickname: str = ""
    verified: bool = False
    signature: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())


@dataclass
class VideoData:
    name: str = ""
    url: str = ""
    views: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())


class DataPipeline:
    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)
            if not file_exists:
                writer.writeheader()
            for item in data_to_save:
                writer.writerow(asdict(item))

        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if self.is_duplicate(scraped_data) == False:
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)
        if len(self.storage_queue) > 0:
            self.save_to_csv()


def scrape_channel(channel_name, location, data_pipeline=None, retries=3):
    url = f"https://www.tiktok.com/@{channel_name}"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            scrapeops_proxy_url = get_scrapeops_url(url, location=location)
            response = requests.get(scrapeops_proxy_url)
            logger.info(f"Received [{response.status_code}] from: {url}")
            if response.status_code == 200:
                success = True
            else:
                raise Exception(f"Failed request, Status Code {response.status_code}")

            ## Extract Data
            soup = BeautifulSoup(response.text, "html.parser")
            script_tag = soup.select_one("script[id='__UNIVERSAL_DATA_FOR_REHYDRATION__']")
            json_data = json.loads(script_tag.text)
            user_info = json_data["__DEFAULT_SCOPE__"]["webapp.user-detail"]["userInfo"]

            stats = user_info["stats"]
            follower_count = stats["followerCount"]
            likes = stats["heartCount"]
            video_count = stats["videoCount"]

            user_data = user_info["user"]
            unique_id = user_data["uniqueId"]
            nickname = user_data["nickname"]
            verified = user_data["verified"]
            signature = user_data["signature"]

            profile_data = ProfileData(
                name=unique_id,
                follower_count=follower_count,
                likes=likes,
                video_count=video_count,
                nickname=nickname,
                verified=verified,
                signature=signature
            )
            data_pipeline.add_data(profile_data)

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")


def start_scrape(channel_list, location, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_channel,
            channel_list,
            [location] * len(channel_list),
            [data_pipeline] * len(channel_list),
            [retries] * len(channel_list)
        )


def scrape_channel_content(row, location, retries):
    url = f"https://www.tiktok.com/@{row['name']}"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            scrapeops_proxy_url = get_scrapeops_url(url, location=location)
            response = requests.get(scrapeops_proxy_url)
            logger.info(f"Received [{response.status_code}] from: {url}")
            if response.status_code == 200:
                success = True
            else:
                raise Exception(f"Failed request, Status Code {response.status_code}")

            ## Extract Data
            video_pipeline = DataPipeline(csv_filename=f"{row['name']}.csv")
            soup = BeautifulSoup(response.text, "html.parser")
            main_content = soup.select_one("div[id='main-content-others_homepage']")
            links = main_content.find_all("a")

            for link in links:
                href = link.get("href")
                if row["name"] not in href:
                    continue

                views = 0
                views_present = link.select_one("strong[data-e2e='video-views']")
                if views_present:
                    views = views_present.text

                video_data = VideoData(
                    name=href.split("/")[-1],
                    url=href,
                    views=views
                )
                video_pipeline.add_data(video_data)

            success = True
            video_pipeline.close_pipeline()

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")


def process_results(csv_file, location, max_threads=5, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
            executor.map(
                scrape_channel_content,
                reader,
                [location] * len(reader),
                [retries] * len(reader)
            )


if __name__ == "__main__":

    MAX_RETRIES = 3
    MAX_THREADS = 5
    LOCATION = "uk"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    channel_list = [
        "paranormalpodcast",
        "theparanormalfiles",
        "jdparanormal",
        "paranormal.com7",
        "paranormal064",
        "marijoparanormal",
        "paranormal_activityghost",
        "youtube_paranormal",
        "paranormal140",
        "paranormal.51"
    ]

    ## Job Processes
    crawl_pipeline = DataPipeline(csv_filename="channels.csv")
    start_scrape(channel_list, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
    crawl_pipeline.close_pipeline()
    logger.info(f"Crawl complete.")

    logger.info("Starting content scrape...")
    process_results("channels.csv", LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
    logger.info("Content scrape complete")
To customize your results, replace channel_list with the channels you'd like to scrape. Then run the script:

python name_of_your_python_file.py
If you'd like to fine-tune the scraper, you can change any of the following constants from main as well:

- MAX_RETRIES: Defines the maximum number of times the script will attempt to retry scraping a particular TikTok channel or its content if the initial request fails. Increase MAX_RETRIES if you want the script to be more persistent in trying to scrape a channel.
- MAX_THREADS: Determines the number of threads the script will use for concurrent processing, i.e. how many channels or content pages it can scrape simultaneously. Increase MAX_THREADS to speed up the scraping process, especially if you have a large number of channels to scrape.
- LOCATION: Specifies the geographical location from which the scraping requests should appear to originate. This is useful because TikTok content can vary depending on the user's location due to regional restrictions or content preferences.

Here is the URL for one of the channels we'll be scraping:

https://www.tiktok.com/@paranormalpodcast

Every channel page follows the same format:

https://www.tiktok.com/@{name_of_channel}
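If you'd like to sanity-check that format in code before crawling, here is a tiny sketch; the profile_url helper and the example channel are purely illustrative.

# Hypothetical helper: build the profile URL for a TikTok channel name.
def profile_url(channel_name: str) -> str:
    return f"https://www.tiktok.com/@{channel_name}"

print(profile_url("paranormalpodcast"))
# https://www.tiktok.com/@paranormalpodcast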
The channel data we want comes embedded in the page as JSON, so we can load it with the json module and index through it like any other dict object. Take a look below.

The blob sits inside a script element with an id of __UNIVERSAL_DATA_FOR_REHYDRATION__. This is the data that TikTok uses to start building the page and this is the data that we're going to scrape.

To control our geolocation, we use the ScrapeOps country param. If we pass "country": "us", ScrapeOps will route us through a server in the US. If we pass "uk" in as our country, ScrapeOps will route us through the UK.

To get started, create a new project folder and move into it:

mkdir tiktok-scraper
cd tiktok-scraper
python -m venv venv
source venv/bin/activate
pip install requests
pip install beautifulsoup4
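If you want to confirm both installs worked before writing any scraper code, a quick import check like the one below (entirely optional) is enough.

# Quick sanity check that both dependencies are importable.
import requests
from bs4 import BeautifulSoup

print(requests.__version__)
print(BeautifulSoup("<p>ok</p>", "html.parser").text)  # prints "ok"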
To extract our profile data, we need to pull a script (JavaScript) element from the page. Embedded within this JavaScript is a JSON blob. The JSON blob holds all sorts of interesting information about the channel.

Along with some basic structure and retry logic, this script does exactly that. Take a look at the Python script below.

import os
import csv
import requests
import json
import logging
from urllib.parse import urlencode
from bs4 import BeautifulSoup
import concurrent.futures
from dataclasses import dataclass, field, fields, asdict

API_KEY = ""

with open("config.json", "r") as config_file:
    config = json.load(config_file)
    API_KEY = config["api_key"]


## Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def scrape_channel(channel_name, location, retries=3):
    url = f"https://www.tiktok.com/@{channel_name}"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            response = requests.get(url)
            logger.info(f"Received [{response.status_code}] from: {url}")
            if response.status_code == 200:
                success = True
            else:
                raise Exception(f"Failed request, Status Code {response.status_code}")

            ## Extract Data
            soup = BeautifulSoup(response.text, "html.parser")
            script_tag = soup.select_one("script[id='__UNIVERSAL_DATA_FOR_REHYDRATION__']")
            json_data = json.loads(script_tag.text)
            user_info = json_data["__DEFAULT_SCOPE__"]["webapp.user-detail"]["userInfo"]

            stats = user_info["stats"]
            follower_count = stats["followerCount"]
            likes = stats["heartCount"]
            video_count = stats["videoCount"]

            user_data = user_info["user"]
            unique_id = user_data["uniqueId"]
            nickname = user_data["nickname"]
            verified = user_data["verified"]
            signature = user_data["signature"]

            profile_data = {
                "name": unique_id,
                "follower_count": follower_count,
                "likes": likes,
                "video_count": video_count,
                "nickname": nickname,
                "verified": verified,
                "signature": signature
            }

            print(profile_data)

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")


def start_scrape(channel_list, location, max_threads=5, retries=3):
    for channel in channel_list:
        scrape_channel(channel, location, retries=retries)


if __name__ == "__main__":

    MAX_RETRIES = 3
    MAX_THREADS = 5
    LOCATION = "uk"

    logger.info(f"Scrape starting...")

    ## INPUT ---> List of keywords to scrape
    channel_list = [
        "paranormalpodcast",
        "theparanormalfiles",
        "jdparanormal",
        "paranormal.com7",
        "paranormal064",
        "marijoparanormal",
        "paranormal_activityghost",
        "youtube_paranormal",
        "paranormal140",
        "paranormal.51"
    ]

    ## Job Processes
    start_scrape(channel_list, LOCATION, retries=MAX_RETRIES)
    logger.info(f"Scrape complete.")
While we still have tries left and the operation has not succeeded:

- We find the embedded data with soup.select_one("script[id='__UNIVERSAL_DATA_FOR_REHYDRATION__']").
- json.loads(script_tag.text) converts the text of the script object into a dict we can index from Python (a standalone sketch of this step follows this list).
- From that dict we pull our name, follower_count, likes, video_count, nickname, verified, and signature fields.
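Here is a minimal, standalone sketch of that extraction step; the html string is a stripped-down stand-in for the real page source, which is far larger.

from bs4 import BeautifulSoup
import json

# A tiny stand-in for the real page source (the real blob is much larger).
html = """
<html><body>
<script id="__UNIVERSAL_DATA_FOR_REHYDRATION__">
{"__DEFAULT_SCOPE__": {"webapp.user-detail": {"userInfo": {"stats": {"followerCount": 12345}}}}}
</script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
script_tag = soup.select_one("script[id='__UNIVERSAL_DATA_FOR_REHYDRATION__']")
json_data = json.loads(script_tag.text)

# Once loaded, the blob is just a nested Python dict.
stats = json_data["__DEFAULT_SCOPE__"]["webapp.user-detail"]["userInfo"]["stats"]
print(stats["followerCount"])  # 12345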
To store this data properly, we need two classes: ProfileData and DataPipeline.

- ProfileData is used specifically for holding information from the profiles we scrape.
- The DataPipeline object takes a dataclass (in this case ProfileData) and pipes it into a CSV file while removing duplicates.

Here is our ProfileData class.

@dataclass
class ProfileData:
    name: str = ""
    follower_count: int = 0
    likes: int = 0
    video_count: int = 0
    nickname: str = ""
    verified: bool = False
    signature: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())
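As a quick illustration of what __post_init__ does for us, here is a small hypothetical usage example showing the empty-string defaults and whitespace stripping; the values are made up.

# Assumes the ProfileData class above has already been defined.
profile = ProfileData(name="  paranormalpodcast  ", follower_count=1000)

print(profile.name)       # "paranormalpodcast" -- whitespace stripped
print(profile.nickname)   # "No nickname" -- empty strings get a default
print(profile.signature)  # "No signature"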
Here is our DataPipeline.

class DataPipeline:
    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)
            if not file_exists:
                writer.writeheader()
            for item in data_to_save:
                writer.writerow(asdict(item))

        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if self.is_duplicate(scraped_data) == False:
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)
        if len(self.storage_queue) > 0:
            self.save_to_csv()
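Before wiring the pipeline into the scraper, here is a small hypothetical end-to-end usage on its own; the filename and channel names are just examples.

# Standalone demo (assumes ProfileData, DataPipeline, and logger are defined above).
pipeline = DataPipeline(csv_filename="demo-profiles.csv", storage_queue_limit=10)

pipeline.add_data(ProfileData(name="paranormalpodcast", follower_count=1000))
pipeline.add_data(ProfileData(name="jdparanormal", follower_count=500))
pipeline.add_data(ProfileData(name="paranormalpodcast"))  # duplicate name -> logged and dropped

# Nothing has hit the disk yet (queue limit not reached), so flush manually.
pipeline.close_pipeline()
# demo-profiles.csv now holds a header row plus two data rows.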
import osimport csvimport requestsimport jsonimport loggingfrom urllib.parse import urlencodefrom bs4 import BeautifulSoupimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass ProfileData: name: str = "" follower_count: int = 0 likes: int = 0 video_count: int = 0 nickname: str = "" verified: bool = False signature: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_channel(channel_name, location, data_pipeline=None, retries=3): url = f"https://www.tiktok.com/@{channel_name}" tries = 0 success = False while tries <= retries and not success: try: response = requests.get(url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code == 200: success = True else: raise Exception(f"Failed request, Status Code {response.status_code}") ## Extract Data soup = BeautifulSoup(response.text, "html.parser") script_tag = soup.select_one("script[id='__UNIVERSAL_DATA_FOR_REHYDRATION__']") json_data = json.loads(script_tag.text) user_info = json_data["__DEFAULT_SCOPE__"]["webapp.user-detail"]["userInfo"] stats = user_info["stats"] follower_count = stats["followerCount"] likes = stats["heartCount"] video_count = stats["videoCount"] user_data = user_info["user"] unique_id = user_data["uniqueId"] nickname = user_data["nickname"] verified = user_data["verified"] signature = user_data["signature"] profile_data = ProfileData( name=unique_id, follower_count=follower_count, likes=likes, video_count=video_count, nickname=nickname, verified=verified, signature=signature ) data_pipeline.add_data(profile_data) except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(channel_list, location, data_pipeline=None, retries=3): for channel in channel_list: scrape_channel(channel, location, data_pipeline=data_pipeline, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "uk" logger.info(f"Scrape starting...") ## INPUT ---> List of keywords to scrape channel_list = [ "paranormalpodcast", "theparanormalfiles", "jdparanormal", "paranormal.com7", "paranormal064", "marijoparanormal", "paranormal_activityghost", "youtube_paranormal", "paranormal140", "paranormal.51" ] ## Job Processes crawl_pipeline = DataPipeline(csv_filename="channels.csv") start_scrape(channel_list, LOCATION, data_pipeline=crawl_pipeline, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() logger.info(f"Scrape complete.")
Next, we'll use ThreadPoolExecutor to spawn scrape_channel() on multiple threads. This will greatly increase our speed and efficiency. The code snippet below replaces our for loop and runs scrape_channel() with ThreadPoolExecutor.

def start_scrape(channel_list, location, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_channel,
            channel_list,
            [location] * len(channel_list),
            [data_pipeline] * len(channel_list),
            [retries] * len(channel_list)
        )
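If the [location] * len(channel_list) style arguments look unusual, here is a tiny standalone sketch (with a made-up greet() worker) of how executor.map() consumes several iterables in lockstep; the real arguments are broken down right after.

import concurrent.futures

# Made-up worker just for illustration.
def greet(name, location, retries):
    return f"{name} / {location} / {retries}"

names = ["a", "b", "c"]

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    results = executor.map(
        greet,
        names,
        ["uk"] * len(names),   # same location for every call
        [3] * len(names)       # same retry count for every call
    )
    print(list(results))  # ['a / uk / 3', 'b / uk / 3', 'c / uk / 3']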
Pay attention to the arguments we pass into executor.map():

- scrape_channel tells executor to run scrape_channel() on every available thread.
- channel_list is the list of channels we want to pass into scrape_channel().
- We pass location, data_pipeline, and retries
in as arrays to be passed to each individual thread.import osimport csvimport requestsimport jsonimport loggingfrom urllib.parse import urlencodefrom bs4 import BeautifulSoupimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass ProfileData: name: str = "" follower_count: int = 0 likes: int = 0 video_count: int = 0 nickname: str = "" verified: bool = False signature: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_channel(channel_name, location, data_pipeline=None, retries=3): url = f"https://www.tiktok.com/@{channel_name}" tries = 0 success = False while tries <= retries and not success: try: response = requests.get(url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code == 200: success = True else: raise Exception(f"Failed request, Status Code {response.status_code}") ## Extract Data soup = BeautifulSoup(response.text, "html.parser") script_tag = soup.select_one("script[id='__UNIVERSAL_DATA_FOR_REHYDRATION__']") json_data = json.loads(script_tag.text) user_info = json_data["__DEFAULT_SCOPE__"]["webapp.user-detail"]["userInfo"] stats = user_info["stats"] follower_count = stats["followerCount"] likes = stats["heartCount"] video_count = stats["videoCount"] user_data = user_info["user"] unique_id = user_data["uniqueId"] nickname = user_data["nickname"] verified = user_data["verified"] signature = user_data["signature"] profile_data = ProfileData( name=unique_id, follower_count=follower_count, likes=likes, video_count=video_count, nickname=nickname, verified=verified, signature=signature ) data_pipeline.add_data(profile_data) except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(channel_list, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_channel, channel_list, [location] * len(channel_list), [data_pipeline] * len(channel_list), [retries] * len(channel_list) ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "uk" logger.info(f"Scrape starting...") ## INPUT ---> List of keywords to scrape channel_list = [ "paranormalpodcast", "theparanormalfiles", "jdparanormal", "paranormal.com7", "paranormal064", "marijoparanormal", "paranormal_activityghost", "youtube_paranormal", "paranormal140", "paranormal.51" ] ## Job Processes crawl_pipeline = DataPipeline(csv_filename="channels.csv") start_scrape(channel_list, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() logger.info(f"Scrape complete.")
def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location,
        "residential": True,
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url
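As a quick sanity check, you can print the proxied URL for one of our channels; the example output below is only an illustration of the shape, and your api_key value will differ.

# Assumes API_KEY has already been loaded from config.json above.
proxied = get_scrapeops_url("https://www.tiktok.com/@paranormalpodcast", location="uk")
print(proxied)
# e.g. https://proxy.scrapeops.io/v1/?api_key=...&url=https%3A%2F%2Fwww.tiktok.com%2F%40paranormalpodcast&country=uk&residential=True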
get_scrapeops_url() takes in a number of arguments and converts any URL into a ScrapeOps proxied URL. Here are the individual arguments:

- "api_key": your ScrapeOps API key.
- "url": the URL you'd like to scrape.
- "country": the location you'd like to be routed through.
- "residential": a boolean value. When we set residential to True
, we're telling ScrapeOps that we want a residential IP address. Anti-bots are far less likely to block a residential IP than a data center IP.import osimport csvimport requestsimport jsonimport loggingfrom urllib.parse import urlencodefrom bs4 import BeautifulSoupimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, "residential": True, } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass ProfileData: name: str = "" follower_count: int = 0 likes: int = 0 video_count: int = 0 nickname: str = "" verified: bool = False signature: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_channel(channel_name, location, data_pipeline=None, retries=3): url = f"https://www.tiktok.com/@{channel_name}" tries = 0 success = False while tries <= retries and not success: try: scrapeops_proxy_url = get_scrapeops_url(url, location=location) response = requests.get(scrapeops_proxy_url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code == 200: success = True else: raise Exception(f"Failed request, Status Code {response.status_code}") ## Extract Data soup = BeautifulSoup(response.text, "html.parser") script_tag = soup.select_one("script[id='__UNIVERSAL_DATA_FOR_REHYDRATION__']") json_data = json.loads(script_tag.text) user_info = json_data["__DEFAULT_SCOPE__"]["webapp.user-detail"]["userInfo"] stats = user_info["stats"] follower_count = stats["followerCount"] likes = stats["heartCount"] video_count = stats["videoCount"] user_data = user_info["user"] unique_id = user_data["uniqueId"] nickname = user_data["nickname"] verified = user_data["verified"] signature = user_data["signature"] profile_data = ProfileData( name=unique_id, follower_count=follower_count, likes=likes, video_count=video_count, nickname=nickname, verified=verified, signature=signature ) data_pipeline.add_data(profile_data) except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(channel_list, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_channel, channel_list, [location] * len(channel_list), [data_pipeline] * len(channel_list), [retries] * len(channel_list) ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "uk" logger.info(f"Scrape starting...") ## INPUT ---> List of keywords to scrape channel_list = [ "paranormalpodcast", "theparanormalfiles", "jdparanormal", "paranormal.com7", "paranormal064", "marijoparanormal", "paranormal_activityghost", "youtube_paranormal", "paranormal140", "paranormal.51" ] ## Job Processes crawl_pipeline = DataPipeline(csv_filename="channels.csv") start_scrape(channel_list, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() logger.info(f"Scrape complete.")
Take a look at our main. MAX_RETRIES is set to 3, MAX_THREADS is set to 5, and our location is set to "uk". Feel free to change any of these constants.

if __name__ == "__main__":

    MAX_RETRIES = 3
    MAX_THREADS = 5
    LOCATION = "uk"

    logger.info(f"Scrape starting...")

    ## INPUT ---> List of keywords to scrape
    channel_list = [
        "paranormalpodcast",
        "theparanormalfiles",
        "jdparanormal",
        "paranormal.com7",
        "paranormal064",
        "marijoparanormal",
        "paranormal_activityghost",
        "youtube_paranormal",
        "paranormal140",
        "paranormal.51"
    ]

    ## Job Processes
    crawl_pipeline = DataPipeline(csv_filename="channels.csv")
    start_scrape(channel_list, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
    crawl_pipeline.close_pipeline()
    logger.info(f"Scrape complete.")
Next, we'll build the content scraper. We'll add a wait parameter to the ScrapeOps URL, and we'll pull some data out of some incredibly nested elements. This scraper needs to find each post link on the channel page, pull its name, url, and views, and then save that data.

Here is our scrape_channel_content() function. It looks a lot like our first parsing function.

def scrape_channel_content(row, location, retries):
    url = f"https://www.tiktok.com/@{row['name']}"
    tries = 0
    success = False

    while tries <= retries and not success:
        try:
            response = requests.get(url)
            logger.info(f"Received [{response.status_code}] from: {url}")
            if response.status_code == 200:
                success = True
            else:
                raise Exception(f"Failed request, Status Code {response.status_code}")

            ## Extract Data
            soup = BeautifulSoup(response.text, "html.parser")
            main_content = soup.select_one("div[id='main-content-others_homepage']")
            links = main_content.find_all("a")

            for link in links:
                href = link.get("href")
                if row["name"] not in href:
                    continue

                views = 0
                views_present = link.select_one("strong[data-e2e='video-views']")
                if views_present:
                    views = views_present.text

                video_data = {
                    "name": href.split("/")[-1],
                    "url": href,
                    "views": views
                }

                print(video_data)

            success = True

        except Exception as e:
            logger.error(f"An error occurred while processing page {url}: {e}")
            logger.info(f"Retrying request for page: {url}, retries left {retries-tries}")
            tries += 1

    if not success:
        raise Exception(f"Max Retries exceeded: {retries}")
Inside the parsing function:

- We find all of the link elements with main_content.find_all("a").
- From each link element we pull the views, url, and name (id number).

Now we need a function similar to start_scrape(). We'll call this one process_results(). This function will read our CSV into an array object. Then it will iterate through all the rows of the array and call scrape_channel_content() on them.

Here is process_results().

def process_results(csv_file, location, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))
        for row in reader:
            scrape_channel_content(row, location, retries=retries)
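Each row handed to scrape_channel_content() is just a dict keyed by the crawler's CSV header, so row["name"] holds the channel's uniqueId. Here is a small sketch of that shape, assuming channels.csv already exists from the crawl.

import csv

# Assumes channels.csv was produced by the crawler above.
with open("channels.csv", newline="") as file:
    reader = list(csv.DictReader(file))

print(reader[0]["name"])  # e.g. "paranormalpodcast"
print(list(reader[0].keys()))
# ['name', 'follower_count', 'likes', 'video_count', 'nickname', 'verified', 'signature']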
import osimport csvimport requestsimport jsonimport loggingfrom urllib.parse import urlencodefrom bs4 import BeautifulSoupimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, "residential": True, "wait": 2000 } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass ProfileData: name: str = "" follower_count: int = 0 likes: int = 0 video_count: int = 0 nickname: str = "" verified: bool = False signature: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_channel(channel_name, location, data_pipeline=None, retries=3): url = f"https://www.tiktok.com/@{channel_name}" tries = 0 success = False while tries <= retries and not success: try: scrapeops_proxy_url = get_scrapeops_url(url, location=location) response = requests.get(scrapeops_proxy_url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code == 200: success = True else: raise Exception(f"Failed request, Status Code {response.status_code}") ## Extract Data soup = BeautifulSoup(response.text, "html.parser") script_tag = soup.select_one("script[id='__UNIVERSAL_DATA_FOR_REHYDRATION__']") json_data = json.loads(script_tag.text) user_info = json_data["__DEFAULT_SCOPE__"]["webapp.user-detail"]["userInfo"] stats = user_info["stats"] follower_count = stats["followerCount"] likes = stats["heartCount"] video_count = stats["videoCount"] user_data = user_info["user"] unique_id = user_data["uniqueId"] nickname = user_data["nickname"] verified = user_data["verified"] signature = user_data["signature"] profile_data = ProfileData( name=unique_id, follower_count=follower_count, likes=likes, video_count=video_count, nickname=nickname, verified=verified, signature=signature ) data_pipeline.add_data(profile_data) except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(channel_list, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_channel, channel_list, [location] * len(channel_list), [data_pipeline] * len(channel_list), [retries] * len(channel_list) ) def scrape_channel_content(row, location, retries): url = f"https://www.tiktok.com/@{row['name']}" tries = 0 success = False while tries <= retries and not success: try: response = requests.get(url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code == 200: success = True else: raise Exception(f"Failed request, Status Code {response.status_code}") ## Extract Data soup = BeautifulSoup(response.text, "html.parser") main_content = soup.select_one("div[id='main-content-others_homepage']") links = main_content.find_all("a") for link in links: href = link.get("href") if row["name"] not in href: continue views = 0 views_present = link.select_one("strong[data-e2e='video-views']") if views_present: views = views_present.text video_data = { "name": href.split("/")[-1], "url": href, "views": views } print(video_data) success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 if not success: raise Exception(f"Max Retries exceeded: {retries}") def process_results(csv_file, location, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) for row in reader: 
scrape_channel_content(row, location, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "uk" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape channel_list = [ "paranormalpodcast", "theparanormalfiles", "jdparanormal", "paranormal.com7", "paranormal064", "marijoparanormal", "paranormal_activityghost", "youtube_paranormal", "paranormal140", "paranormal.51" ] ## Job Processes crawl_pipeline = DataPipeline(csv_filename="channels.csv") start_scrape(channel_list, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() logger.info(f"Crawl complete.") logger.info("Starting content scrape...") process_results("channels.csv", LOCATION, retries=MAX_RETRIES) logger.info("Content scrape complete")
process_results() reads our CSV file into an array. It then runs scrape_channel_content() on each row from the file.

Since we already have a working DataPipeline, we just need a dataclass to pass into it. We'll call this one VideoData. This class will hold the following:

- name: the unique number given to the photo or video.
- url: the url where we can find the photo or video.
- views: the amount of views that the photo or video has received.

@dataclass
class VideoData:
    name: str = ""
    url: str = ""
    views: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())
Inside our parsing function, we now open a new DataPipeline and pass VideoData
objects into it.import osimport csvimport requestsimport jsonimport loggingfrom urllib.parse import urlencodefrom bs4 import BeautifulSoupimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, "residential": True, "wait": 2000 } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass ProfileData: name: str = "" follower_count: int = 0 likes: int = 0 video_count: int = 0 nickname: str = "" verified: bool = False signature: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) @dataclassclass VideoData: name: str = "" url: str = "" views: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_channel(channel_name, location, data_pipeline=None, retries=3): url = f"https://www.tiktok.com/@{channel_name}" tries = 0 success = False while tries <= retries and not success: try: scrapeops_proxy_url = get_scrapeops_url(url, location=location) response = requests.get(scrapeops_proxy_url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code == 200: success = True else: raise Exception(f"Failed request, Status Code {response.status_code}") ## Extract Data soup = BeautifulSoup(response.text, "html.parser") script_tag = soup.select_one("script[id='__UNIVERSAL_DATA_FOR_REHYDRATION__']") json_data = json.loads(script_tag.text) user_info = json_data["__DEFAULT_SCOPE__"]["webapp.user-detail"]["userInfo"] stats = user_info["stats"] follower_count = stats["followerCount"] likes = stats["heartCount"] video_count = stats["videoCount"] user_data = user_info["user"] unique_id = user_data["uniqueId"] nickname = user_data["nickname"] verified = user_data["verified"] signature = user_data["signature"] profile_data = ProfileData( name=unique_id, follower_count=follower_count, likes=likes, video_count=video_count, nickname=nickname, verified=verified, signature=signature ) data_pipeline.add_data(profile_data) except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(channel_list, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_channel, channel_list, [location] * len(channel_list), [data_pipeline] * len(channel_list), [retries] * len(channel_list) ) def scrape_channel_content(row, location, retries): url = f"https://www.tiktok.com/@{row['name']}" tries = 0 success = False while tries <= retries and not success: try: response = requests.get(url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code == 200: success = True else: raise Exception(f"Failed request, Status Code {response.status_code}") ## Extract Data video_pipeline = DataPipeline(csv_filename=f"{row['name']}.csv") soup = BeautifulSoup(response.text, "html.parser") main_content = soup.select_one("div[id='main-content-others_homepage']") links = main_content.find_all("a") for link in links: href = link.get("href") if row["name"] not in href: continue views = 0 views_present = link.select_one("strong[data-e2e='video-views']") if views_present: views = views_present.text video_data = VideoData( name=href.split("/")[-1], url=href, views=views ) video_pipeline.add_data(video_data) success = True video_pipeline.close_pipeline() except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 if not success: raise Exception(f"Max Retries exceeded: {retries}") def process_results(csv_file, location, retries=3): logger.info(f"processing 
{csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) for row in reader: scrape_channel_content(row, location, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "uk" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape channel_list = [ "paranormalpodcast", "theparanormalfiles", "jdparanormal", "paranormal.com7", "paranormal064", "marijoparanormal", "paranormal_activityghost", "youtube_paranormal", "paranormal140", "paranormal.51" ] ## Job Processes crawl_pipeline = DataPipeline(csv_filename="channels.csv") start_scrape(channel_list, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() logger.info(f"Crawl complete.") logger.info("Starting content scrape...") process_results("channels.csv", LOCATION, retries=MAX_RETRIES) logger.info("Content scrape complete")
This new dataclass gives us almost everything we need to properly scrape the content from all of these channels.

Next, we'll adjust process_results() to take advantage of multithreading with ThreadPoolExecutor.

def process_results(csv_file, location, max_threads=5, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
            executor.map(
                scrape_channel_content,
                reader,
                [location] * len(reader),
                [retries] * len(reader)
            )
In this new process_results() function, we now pass scrape_channel_content in as our first argument. Then we pass in the reader object (an array of dict objects we want to process). location and retries both get passed in as arrays as well.

Finally, back to get_scrapeops_url(). Before we call it again in our code, we're going to add one more argument to it, "wait": 2000. This will tell the ScrapeOps server to wait 2 seconds for content to render before sending it back to us. We need to do this so that the videos and photos from these channels can be fetched and loaded into the page.

Here is our finished proxy function.

def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location,
        "residential": True,
        "wait": 2000
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url

To use it, we once again swap the proxied URL in when making our request:

scrapeops_proxy_url = get_scrapeops_url(url, location=location)
response = requests.get(scrapeops_proxy_url)
import osimport csvimport requestsimport jsonimport loggingfrom urllib.parse import urlencodefrom bs4 import BeautifulSoupimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, "residential": True, "wait": 2000 } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass ProfileData: name: str = "" follower_count: int = 0 likes: int = 0 video_count: int = 0 nickname: str = "" verified: bool = False signature: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) @dataclassclass VideoData: name: str = "" url: str = "" views: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_channel(channel_name, location, data_pipeline=None, retries=3): url = f"https://www.tiktok.com/@{channel_name}" tries = 0 success = False while tries <= retries and not success: try: scrapeops_proxy_url = get_scrapeops_url(url, location=location) response = requests.get(scrapeops_proxy_url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code == 200: success = True else: raise Exception(f"Failed request, Status Code {response.status_code}") ## Extract Data soup = BeautifulSoup(response.text, "html.parser") script_tag = soup.select_one("script[id='__UNIVERSAL_DATA_FOR_REHYDRATION__']") json_data = json.loads(script_tag.text) user_info = json_data["__DEFAULT_SCOPE__"]["webapp.user-detail"]["userInfo"] stats = user_info["stats"] follower_count = stats["followerCount"] likes = stats["heartCount"] video_count = stats["videoCount"] user_data = user_info["user"] unique_id = user_data["uniqueId"] nickname = user_data["nickname"] verified = user_data["verified"] signature = user_data["signature"] profile_data = ProfileData( name=unique_id, follower_count=follower_count, likes=likes, video_count=video_count, nickname=nickname, verified=verified, signature=signature ) data_pipeline.add_data(profile_data) except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(channel_list, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_channel, channel_list, [location] * len(channel_list), [data_pipeline] * len(channel_list), [retries] * len(channel_list) ) def scrape_channel_content(row, location, retries): url = f"https://www.tiktok.com/@{row['name']}" tries = 0 success = False while tries <= retries and not success: try: scrapeops_proxy_url = get_scrapeops_url(url, location=location) response = requests.get(scrapeops_proxy_url) logger.info(f"Recieved [{response.status_code}] from: {url}") if response.status_code == 200: success = True else: raise Exception(f"Failed request, Status Code {response.status_code}") ## Extract Data video_pipeline = DataPipeline(csv_filename=f"{row['name']}.csv") soup = BeautifulSoup(response.text, "html.parser") main_content = soup.select_one("div[id='main-content-others_homepage']") links = main_content.find_all("a") for link in links: href = link.get("href") if row["name"] not in href: continue views = 0 views_present = link.select_one("strong[data-e2e='video-views']") if views_present: views = views_present.text video_data = VideoData( name=href.split("/")[-1], url=href, views=views ) video_pipeline.add_data(video_data) success = True video_pipeline.close_pipeline() except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 if not success: raise Exception(f"Max Retries exceeded: {retries}") def 
process_results(csv_file, location, max_threads=5, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_channel_content, reader, [location] * len(reader), [retries] * len(reader) ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "uk" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape channel_list = [ "paranormalpodcast", "theparanormalfiles", "jdparanormal", "paranormal.com7", "paranormal064", "marijoparanormal", "paranormal_activityghost", "youtube_paranormal", "paranormal140", "paranormal.51" ] ## Job Processes crawl_pipeline = DataPipeline(csv_filename="channels.csv") start_scrape(channel_list, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() logger.info(f"Crawl complete.") logger.info("Starting content scrape...") process_results("channels.csv", LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES) logger.info("Content scrape complete")
Once again, here is the main we'll be working with. Feel free to change any of the following constants:

- MAX_RETRIES
- MAX_THREADS
- LOCATION
- channel_list

if __name__ == "__main__":

    MAX_RETRIES = 3
    MAX_THREADS = 5
    LOCATION = "uk"

    logger.info(f"Crawl starting...")

    ## INPUT ---> List of keywords to scrape
    channel_list = [
        "paranormalpodcast",
        "theparanormalfiles",
        "jdparanormal",
        "paranormal.com7",
        "paranormal064",
        "marijoparanormal",
        "paranormal_activityghost",
        "youtube_paranormal",
        "paranormal140",
        "paranormal.51"
    ]

    ## Job Processes
    crawl_pipeline = DataPipeline(csv_filename="channels.csv")
    start_scrape(channel_list, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES)
    crawl_pipeline.close_pipeline()
    logger.info(f"Crawl complete.")

    logger.info("Starting content scrape...")
    process_results("channels.csv", LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES)
    logger.info("Content scrape complete")
When scraping any site, always pay attention to its terms of service and robots.txt. You may view TikTok's terms here, and their robots.txt is available here.
It's important to examine both of these files because violating them can get you blocked or even permanently banned.
If you are unsure of the legality of a scraping project, public data (data not behind a login) is generally considered public information and is therefore fair game when scraping. If your data is gated behind a login or some other type of authentication, it is considered private property, and you will be subject to individual privacy and intellectual property laws. Any time you're not sure whether your data is public or private, consult an attorney.
Then check out ScrapeOps, the complete toolkit for web scraping.
config.json
file.import osimport csvimport jsonimport loggingimport htmlfrom selenium import webdriverfrom selenium.webdriver.common.by import Byfrom bs4 import BeautifulSoupfrom urllib.parse import urlencodeimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] OPTIONS = webdriver.ChromeOptions()OPTIONS.add_argument("--headless")OPTIONS.add_argument("--disable-javascript") def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, "json_response": True, "wait": 2000 } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass ProfileData: name: str = "" follower_count: int = 0 likes: int = 0 video_count: int = 0 nickname: str = "" verified: bool = False signature: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) @dataclassclass VideoData: name: str = "" url: str = "" views: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_channel(channel_name, location, data_pipeline=None, retries=3): url = f"https://www.tiktok.com/@{channel_name}" tries = 0 success = False while tries <= retries and not success: try: driver = webdriver.Chrome(options=OPTIONS) scrapeops_proxy_url = get_scrapeops_url(url, location=location) driver.get(scrapeops_proxy_url) ## Extract Data json_stuff = driver.find_element(By.CSS_SELECTOR, "pre").get_attribute("innerHTML") page = json.loads(json_stuff) decoded_chunk = html.unescape(page["body"]) soup = BeautifulSoup(decoded_chunk, "html.parser") script_tag = soup.select_one("script[id='__UNIVERSAL_DATA_FOR_REHYDRATION__']") json_data = json.loads(script_tag.text) user_info = json_data["__DEFAULT_SCOPE__"]["webapp.user-detail"]["userInfo"] stats = user_info["stats"] follower_count = stats["followerCount"] likes = stats["heartCount"] video_count = stats["videoCount"] user_data = user_info["user"] unique_id = user_data["uniqueId"] nickname = user_data["nickname"] verified = user_data["verified"] signature = user_data["signature"] profile_data = ProfileData( name=unique_id, follower_count=follower_count, likes=likes, video_count=video_count, nickname=nickname, verified=verified, signature=signature ) data_pipeline.add_data(profile_data) success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(channel_list, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_channel, channel_list, [location] * len(channel_list), [data_pipeline] * len(channel_list), [retries] * len(channel_list) ) def scrape_channel_content(row, location, retries): url = f"https://www.tiktok.com/@{row['name']}" tries = 0 success = False while tries <= retries and not success: try: driver = webdriver.Chrome(options=OPTIONS) scrapeops_proxy_url = get_scrapeops_url(url, location=location) driver.get(scrapeops_proxy_url) ## Extract Data video_pipeline = DataPipeline(csv_filename=f"{row['name']}.csv") json_stuff = driver.find_element(By.CSS_SELECTOR, "pre").get_attribute("innerHTML") page = json.loads(json_stuff) decoded_page = html.unescape(page["body"]) soup = BeautifulSoup(decoded_page, "html.parser") main_content = soup.select_one("div[id='main-content-others_homepage']") links = main_content.find_all("a") for link in links: href = link.get("href") if row["name"] not in href or "https://www.tiktok.com" not in href: continue views = 0 views_present = link.select_one("strong[data-e2e='video-views']") if views_present: views = views_present.text name = href.split("/")[-1] try: int(name) except: logger.info(f"Ad found, skipping item: {name}") continue video_data = VideoData( name=href.split("/")[-1], url=href, views=views ) video_pipeline.add_data(video_data) success = True video_pipeline.close_pipeline() except Exception as e: logger.error(f"An error occurred while 
processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") def process_results(csv_file, location, max_threads=5, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_channel_content, reader, [location] * len(reader), [retries] * len(reader) ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "uk" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape channel_list = [ "paranormalpodcast", "theparanormalfiles", "jdparanormal", "paranormal.com7", "paranormal064", "marijoparanormal", "paranormal_activityghost", "youtube_paranormal" ] ## Job Processes crawl_pipeline = DataPipeline(csv_filename="channels.csv") start_scrape(channel_list, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() logger.info(f"Crawl complete.") logger.info("Starting content scrape...") process_results("channels.csv", LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES) logger.info("Content scrape complete")
Our tunable settings all live inside of main:
MAX_RETRIES: Defines the maximum number of times the script will attempt to scrape a URL if an error occurs during the process.
MAX_THREADS: Specifies the maximum number of concurrent threads that can run at the same time during the scraping process.
LOCATION: Defines the geographical location to simulate while sending requests to the target website.
channel_list: A list of TikTok channel usernames that the script is set to scrape.
Each channel page uses the same URL format. For example, https://www.tiktok.com/@paranormalpodcast follows the pattern https://www.tiktok.com/@{name_of_channel}.
Each channel page embeds a script (JavaScript) element packed with all the necessary data to load the channel page. If you look in the screenshot below, you can see evidence of this. You might notice the id: __UNIVERSAL_DATA_FOR_REHYDRATION__. With this unique information, we can write a CSS selector to find this tag on the page: script[id='__UNIVERSAL_DATA_FOR_REHYDRATION__'].
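To make that concrete, here is a minimal, standalone sketch of pulling the rehydration blob out of a page you've already downloaded. The variable page_html and the helper name extract_user_info() are placeholders for illustration; the selector and JSON keys are the ones used throughout this article.

import json
from bs4 import BeautifulSoup

def extract_user_info(page_html):
    # page_html: raw HTML of a TikTok channel page, fetched however you like
    soup = BeautifulSoup(page_html, "html.parser")
    # Grab the embedded rehydration data
    script_tag = soup.select_one("script[id='__UNIVERSAL_DATA_FOR_REHYDRATION__']")
    json_data = json.loads(script_tag.text)
    # Profile details and stats live under this path
    return json_data["__DEFAULT_SCOPE__"]["webapp.user-detail"]["userInfo"]

From the returned dict, user_info["stats"]["followerCount"], ["heartCount"], and ["videoCount"] give you the profile's numbers, exactly as the parser in this article does.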
Each video link on the page is an a element, but the CSS classes here are jumbled nonsense. To extract these, we'll have to get a bit more creative than a plain CSS selector. If you look below, the href of each element has a pretty uniform structure. These videos are laid out like this: https://www.tiktok.com/@paranormalpodcast/video/7423989765457857798. We can use these href values to filter our links against the pattern https://www.tiktok.com/@{name_of_channel}/video/{id_number}. A short sketch of this filter follows below.
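As a quick illustration of that filter (the channel name and the list of hrefs below are just sample values), we keep only the links that match the video pattern:

channel_name = "paranormalpodcast"
hrefs = [
    "https://www.tiktok.com/@paranormalpodcast/video/7423989765457857798",
    "https://www.example.com/sponsored-link",  # off-pattern links get dropped
]

video_links = []
for href in hrefs:
    # Real video links contain both the channel name and TikTok's base domain
    if channel_name not in href or "https://www.tiktok.com" not in href:
        continue
    video_links.append(href)

print(video_links)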
Proxy Aggregator gives us geotargeting through the country parameter. We simply need to pass a country code into this param. Proxy Aggregator reads the country code and routes the request through our desired country. If we wish to show up in the US, our country code would be us (a short usage example follows the table below).

Country | Country Code |
---|---|
Brazil | br |
Canada | ca |
China | cn |
India | in |
Italy | it |
Japan | jp |
France | fr |
Germany | de |
Russia | ru |
Spain | es |
United States | us |
United Kingdom | uk |
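For example, to route requests through the UK (the LOCATION used later in this article), we just pass uk as the country code. This is a trimmed-down sketch: only api_key, url, and country are shown, and the helper name proxied() is purely illustrative; the full get_scrapeops_url() used in this article adds wait and json_response as well.

from urllib.parse import urlencode

API_KEY = "YOUR-SUPER-SECRET-API-KEY"  # placeholder, load yours from config.json

def proxied(url, location="us"):
    payload = {"api_key": API_KEY, "url": url, "country": location}
    return "https://proxy.scrapeops.io/v1/?" + urlencode(payload)

# Appear to TikTok as a visitor from the United Kingdom
print(proxied("https://www.tiktok.com/@paranormalpodcast", location="uk"))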
Let's get started with a new project. First, create a new project folder and cd into the folder.

mkdir tiktok-selenium
cd tiktok-selenium

Next, create and activate a new virtual environment.

python -m venv venv
source venv/bin/activate

Finally, install the dependencies. The code in this article imports both Selenium and BeautifulSoup, so install beautifulsoup4 alongside selenium.

pip install selenium
pip install beautifulsoup4
In the code below, scrape_channel() performs our actual scraping logic, start_scrape() is used to trigger a scrape, and our runtime is held inside the main
block.import osimport csvimport jsonimport loggingimport htmlfrom selenium import webdriverfrom selenium.webdriver.common.by import Byfrom bs4 import BeautifulSoupfrom urllib.parse import urlencodeimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] OPTIONS = webdriver.ChromeOptions()OPTIONS.add_argument("--headless")OPTIONS.add_argument("--disable-javascript") ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) def scrape_channel(channel_name, location, retries=3): url = f"https://www.tiktok.com/@{channel_name}" tries = 0 success = False while tries <= retries and not success: try: driver = webdriver.Chrome(options=OPTIONS) driver.get(url) ## Extract Data json_stuff = driver.find_element(By.CSS_SELECTOR, "pre").get_attribute("innerHTML") page = json.loads(json_stuff) decoded_chunk = html.unescape(page["body"]) soup = BeautifulSoup(decoded_chunk, "html.parser") script_tag = soup.select_one("script[id='__UNIVERSAL_DATA_FOR_REHYDRATION__']") json_data = json.loads(script_tag.text) user_info = json_data["__DEFAULT_SCOPE__"]["webapp.user-detail"]["userInfo"] stats = user_info["stats"] follower_count = stats["followerCount"] likes = stats["heartCount"] video_count = stats["videoCount"] user_data = user_info["user"] unique_id = user_data["uniqueId"] nickname = user_data["nickname"] verified = user_data["verified"] signature = user_data["signature"] profile_data = { "name": unique_id, "follower_count": follower_count, "likes": likes, "video_count": video_count, "nickname": nickname, "verified": verified, "signature": signature } print(profile_data) success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(channel_list, location, data_pipeline=None, retries=3): for channel in channel_list: scrape_channel(channel, location, data_pipeline=data_pipeline, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "uk" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape channel_list = [ "paranormalpodcast", "theparanormalfiles", "jdparanormal", "paranormal.com7", "paranormal064", "marijoparanormal", "paranormal_activityghost", "youtube_paranormal" ] ## Job Processes start_scrape(channel_list, LOCATION, retries=MAX_RETRIES) logger.info(f"Crawl complete.")
A few lines here deserve close attention:
driver.find_element(By.CSS_SELECTOR, "pre").get_attribute("innerHTML") finds our JSON response from Proxy Aggregator. This will not actually work until we've hooked up the proxy. We take our response as JSON to prevent the page from being altered and rendered; if Selenium renders the page, we get an automatic redirect that skips over our target page.
html.unescape(page["body"]) decodes the HTML body of the response, which we then load into BeautifulSoup for parsing. Selenium's extraction features are built to interact with the page, and in this case, we don't want to interact with it. We want to keep it intact.
soup.select_one("script[id='__UNIVERSAL_DATA_FOR_REHYDRATION__']") finds the embedded data tag. We use BeautifulSoup for processing; this allows us to dig through the data without any page interactions that could potentially alter the page or redirect us.
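Condensed into one helper, those decode steps look like this. The function name decode_proxy_response() and the raw_pre_text argument are placeholders for illustration; the calls themselves mirror the article's code.

import html
import json
from bs4 import BeautifulSoup

def decode_proxy_response(raw_pre_text):
    # raw_pre_text: the innerHTML of the <pre> tag Selenium finds on the proxy response page
    page = json.loads(raw_pre_text)              # Proxy Aggregator's JSON envelope
    decoded_chunk = html.unescape(page["body"])  # the original TikTok HTML, unescaped
    soup = BeautifulSoup(decoded_chunk, "html.parser")
    script_tag = soup.select_one("script[id='__UNIVERSAL_DATA_FOR_REHYDRATION__']")
    return json.loads(script_tag.text)           # the rehydration data as a Python dict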
Right now, we simply print each scraped profile as a plain dict. This is great when we're just trying to get things working, but these data structures don't cover edge cases. For production use, we should use custom, strongly typed objects. We also need a way to pipe these objects into a CSV file.
This is our ProfileData class. We'll use it to replace the dict we used earlier. If a string field is missing from this object, it gets replaced by a default value: No {field.name}. Especially when dealing with data used by all sorts of moving parts (people, crawlers, scrapers, etc.), this is a much safer approach to prevent corruption.

@dataclass
class ProfileData:
    name: str = ""
    follower_count: int = 0
    likes: int = 0
    video_count: int = 0
    nickname: str = ""
    verified: bool = False
    signature: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())
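If you want to see the default handling in isolation, here is a quick sketch (the values are made up) assuming the ProfileData class above is in scope:

profile = ProfileData(name="paranormalpodcast", follower_count=1000, likes=5000)
print(profile.nickname)   # "No nickname": empty string fields get a default value
print(profile.signature)  # "No signature"
print(profile.name)       # "paranormalpodcast": non-empty strings are just stripped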
To store these objects properly, we need a DataPipeline. While it's waiting to be saved, our data gets held in a storage_queue. is_duplicate() allows us to filter out duplicate items from the pipeline. When we close our pipeline, we sleep for 3 seconds to wait for any other operations to complete, then save the storage_queue to a CSV file.

class DataPipeline:
    def __init__(self, csv_filename="", storage_queue_limit=50):
        self.names_seen = []
        self.storage_queue = []
        self.storage_queue_limit = storage_queue_limit
        self.csv_filename = csv_filename
        self.csv_file_open = False

    def save_to_csv(self):
        self.csv_file_open = True
        data_to_save = []
        data_to_save.extend(self.storage_queue)
        self.storage_queue.clear()
        if not data_to_save:
            return

        keys = [field.name for field in fields(data_to_save[0])]
        file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0
        with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file:
            writer = csv.DictWriter(output_file, fieldnames=keys)
            if not file_exists:
                writer.writeheader()
            for item in data_to_save:
                writer.writerow(asdict(item))

        self.csv_file_open = False

    def is_duplicate(self, input_data):
        if input_data.name in self.names_seen:
            logger.warning(f"Duplicate item found: {input_data.name}. Item dropped.")
            return True
        self.names_seen.append(input_data.name)
        return False

    def add_data(self, scraped_data):
        if self.is_duplicate(scraped_data) == False:
            self.storage_queue.append(scraped_data)
            if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False:
                self.save_to_csv()

    def close_pipeline(self):
        if self.csv_file_open:
            time.sleep(3)  # note: the full script needs "import time" for this call
        if len(self.storage_queue) > 0:
            self.save_to_csv()
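Here is a minimal usage sketch (the filename and values are illustrative) assuming the DataPipeline and ProfileData classes above are in scope:

pipeline = DataPipeline(csv_filename="example-channels.csv")
pipeline.add_data(ProfileData(name="paranormalpodcast", follower_count=1000))
pipeline.add_data(ProfileData(name="paranormalpodcast"))  # duplicate name: logged and dropped
pipeline.close_pipeline()  # flushes whatever is left in storage_queue to the CSV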
In the full code below, instead of printing a dict to the console, we create a ProfileData object. Then, we pass it into the DataPipeline
for safe and effective storage.import osimport csvimport jsonimport loggingimport htmlfrom selenium import webdriverfrom selenium.webdriver.common.by import Byfrom bs4 import BeautifulSoupfrom urllib.parse import urlencodeimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] OPTIONS = webdriver.ChromeOptions()OPTIONS.add_argument("--headless")OPTIONS.add_argument("--disable-javascript") ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass ProfileData: name: str = "" follower_count: int = 0 likes: int = 0 video_count: int = 0 nickname: str = "" verified: bool = False signature: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_channel(channel_name, location, data_pipeline=None, retries=3): url = f"https://www.tiktok.com/@{channel_name}" tries = 0 success = False while tries <= retries and not success: try: driver = webdriver.Chrome(options=OPTIONS) driver.get(url) ## Extract Data json_stuff = driver.find_element(By.CSS_SELECTOR, "pre").get_attribute("innerHTML") page = json.loads(json_stuff) decoded_chunk = html.unescape(page["body"]) soup = BeautifulSoup(decoded_chunk, "html.parser") script_tag = soup.select_one("script[id='__UNIVERSAL_DATA_FOR_REHYDRATION__']") json_data = json.loads(script_tag.text) user_info = json_data["__DEFAULT_SCOPE__"]["webapp.user-detail"]["userInfo"] stats = user_info["stats"] follower_count = stats["followerCount"] likes = stats["heartCount"] video_count = stats["videoCount"] user_data = user_info["user"] unique_id = user_data["uniqueId"] nickname = user_data["nickname"] verified = user_data["verified"] signature = user_data["signature"] profile_data = ProfileData( name=unique_id, follower_count=follower_count, likes=likes, video_count=video_count, nickname=nickname, verified=verified, signature=signature ) data_pipeline.add_data(profile_data) success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(channel_list, location, data_pipeline=None, retries=3): for channel in channel_list: scrape_channel(channel, location, data_pipeline=data_pipeline, retries=retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "uk" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape channel_list = [ "paranormalpodcast", "theparanormalfiles", "jdparanormal", "paranormal.com7", "paranormal064", "marijoparanormal", "paranormal_activityghost", "youtube_paranormal" ] ## Job Processes crawl_pipeline = DataPipeline(csv_filename="channels.csv") start_scrape(channel_list, LOCATION, data_pipeline=crawl_pipeline, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() logger.info(f"Crawl complete.")
start_scrape() already takes in a list of channels to scrape. However, we iterate through them using a for loop. This isn't the most efficient process because our crawler needs to look them up one at a time. With ThreadPoolExecutor, we can parse multiple channels at the same time.
Here is our rewritten trigger function. The real magic comes from executor.map(). Take a look at the args we pass into it:
scrape_channel: The function we want to call on each thread.
channel_list: The list of channels we actually wish to crawl.
All remaining args get passed in as lists the same length as channel_list. executor.map() then takes each arg from its respective list and passes it into a separate instance of scrape_channel.

def start_scrape(channel_list, location, data_pipeline=None, max_threads=5, retries=3):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        executor.map(
            scrape_channel,
            channel_list,
            [location] * len(channel_list),
            [data_pipeline] * len(channel_list),
            [retries] * len(channel_list)
        )
import osimport csvimport jsonimport loggingimport htmlfrom selenium import webdriverfrom selenium.webdriver.common.by import Byfrom bs4 import BeautifulSoupfrom urllib.parse import urlencodeimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] OPTIONS = webdriver.ChromeOptions()OPTIONS.add_argument("--headless")OPTIONS.add_argument("--disable-javascript") ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass ProfileData: name: str = "" follower_count: int = 0 likes: int = 0 video_count: int = 0 nickname: str = "" verified: bool = False signature: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_channel(channel_name, location, data_pipeline=None, retries=3): url = f"https://www.tiktok.com/@{channel_name}" tries = 0 success = False while tries <= retries and not success: try: driver = webdriver.Chrome(options=OPTIONS) driver.get(url) ## Extract Data json_stuff = driver.find_element(By.CSS_SELECTOR, "pre").get_attribute("innerHTML") page = json.loads(json_stuff) decoded_chunk = html.unescape(page["body"]) soup = BeautifulSoup(decoded_chunk, "html.parser") script_tag = soup.select_one("script[id='__UNIVERSAL_DATA_FOR_REHYDRATION__']") json_data = json.loads(script_tag.text) user_info = json_data["__DEFAULT_SCOPE__"]["webapp.user-detail"]["userInfo"] stats = user_info["stats"] follower_count = stats["followerCount"] likes = stats["heartCount"] video_count = stats["videoCount"] user_data = user_info["user"] unique_id = user_data["uniqueId"] nickname = user_data["nickname"] verified = user_data["verified"] signature = user_data["signature"] profile_data = ProfileData( name=unique_id, follower_count=follower_count, likes=likes, video_count=video_count, nickname=nickname, verified=verified, signature=signature ) data_pipeline.add_data(profile_data) success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(channel_list, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_channel, channel_list, [location] * len(channel_list), [data_pipeline] * len(channel_list), [retries] * len(channel_list) ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "uk" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape channel_list = [ "paranormalpodcast", "theparanormalfiles", "jdparanormal", "paranormal.com7", "paranormal064", "marijoparanormal", "paranormal_activityghost", "youtube_paranormal" ] ## Job Processes crawl_pipeline = DataPipeline(csv_filename="channels.csv") start_scrape(channel_list, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() logger.info(f"Crawl complete.")
Aside from our standard parameters (api_key, url), we have a couple of other parameters we need to get the right response from TikTok. We pass an arbitrary number of milliseconds into wait, and Proxy Aggregator will wait that much time for the page to render. We also need to set json_response to True. This is imperative so we can freeze the page. When we receive our JSON response, we pass the body into BeautifulSoup so we can parse the page non-interactively.
Here is the function that makes everything work.

def get_scrapeops_url(url, location="us"):
    payload = {
        "api_key": API_KEY,
        "url": url,
        "country": location,
        "json_response": True,
        "wait": 2000
    }
    proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload)
    return proxy_url
import osimport csvimport jsonimport loggingimport htmlfrom selenium import webdriverfrom selenium.webdriver.common.by import Byfrom bs4 import BeautifulSoupfrom urllib.parse import urlencodeimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] OPTIONS = webdriver.ChromeOptions()OPTIONS.add_argument("--headless")OPTIONS.add_argument("--disable-javascript") def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, "json_response": True, "wait": 2000 } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass ProfileData: name: str = "" follower_count: int = 0 likes: int = 0 video_count: int = 0 nickname: str = "" verified: bool = False signature: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_channel(channel_name, location, data_pipeline=None, retries=3): url = f"https://www.tiktok.com/@{channel_name}" tries = 0 success = False while tries <= retries and not success: try: driver = webdriver.Chrome(options=OPTIONS) scrapeops_proxy_url = get_scrapeops_url(url, location=location) driver.get(scrapeops_proxy_url) ## Extract Data json_stuff = driver.find_element(By.CSS_SELECTOR, "pre").get_attribute("innerHTML") page = json.loads(json_stuff) decoded_chunk = html.unescape(page["body"]) soup = BeautifulSoup(decoded_chunk, "html.parser") script_tag = soup.select_one("script[id='__UNIVERSAL_DATA_FOR_REHYDRATION__']") json_data = json.loads(script_tag.text) user_info = json_data["__DEFAULT_SCOPE__"]["webapp.user-detail"]["userInfo"] stats = user_info["stats"] follower_count = stats["followerCount"] likes = stats["heartCount"] video_count = stats["videoCount"] user_data = user_info["user"] unique_id = user_data["uniqueId"] nickname = user_data["nickname"] verified = user_data["verified"] signature = user_data["signature"] profile_data = ProfileData( name=unique_id, follower_count=follower_count, likes=likes, video_count=video_count, nickname=nickname, verified=verified, signature=signature ) data_pipeline.add_data(profile_data) success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(channel_list, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_channel, channel_list, [location] * len(channel_list), [data_pipeline] * len(channel_list), [retries] * len(channel_list) ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "uk" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape channel_list = [ "paranormalpodcast", "theparanormalfiles", "jdparanormal", "paranormal.com7", "paranormal064", "marijoparanormal", "paranormal_activityghost", "youtube_paranormal" ] ## Job Processes crawl_pipeline = DataPipeline(csv_filename="channels.csv") start_scrape(channel_list, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() logger.info(f"Crawl complete.")
Once again, our runtime is held inside main. Feel free to change any of the following to customize your results: MAX_RETRIES, MAX_THREADS, LOCATION, channel_list.
if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "uk" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape channel_list = [ "paranormalpodcast", "theparanormalfiles", "jdparanormal", "paranormal.com7", "paranormal064", "marijoparanormal", "paranormal_activityghost", "youtube_paranormal" ] ## Job Processes crawl_pipeline = DataPipeline(csv_filename="channels.csv") start_scrape(channel_list, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() logger.info(f"Crawl complete.")
Remember, each of these requests includes a wait of 2 seconds per page, so this level of speed is almost unfathomable!
Next up is scrape_channel_content(), our new parser. Like we did earlier, this function is built to take a JSON response from Proxy Aggregator in order to freeze the page and prevent it from interacting with the browser. First, we find our main_content. From there, we gather all of our links
.All links must contain the profile name and TikTok's base domain. If they don't they're likely ads or sponsored posts.def scrape_channel_content(row, location, retries): url = f"https://www.tiktok.com/@{row['name']}" tries = 0 success = False while tries <= retries and not success: try: driver = webdriver.Chrome(options=OPTIONS) driver.get(url) ## Extract Data json_stuff = driver.find_element(By.CSS_SELECTOR, "pre").get_attribute("innerHTML") page = json.loads(json_stuff) decoded_page = html.unescape(page["body"]) soup = BeautifulSoup(decoded_page, "html.parser") main_content = soup.select_one("div[id='main-content-others_homepage']") links = main_content.find_all("a") for link in links: href = link.get("href") if row["name"] not in href or "https://www.tiktok.com" not in href: continue views = 0 views_present = link.select_one("strong[data-e2e='video-views']") if views_present: views = views_present.text name = href.split("/")[-1] try: int(name) except: logger.info(f"Ad found, skipping item: {name}") continue video_data = { "name": href.split("/")[-1], "url": href, "views": views } print(video_data) success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}")
Let's break down the key steps in the parser:
json_stuff = driver.find_element(By.CSS_SELECTOR, "pre").get_attribute("innerHTML") pulls the JSON response from the page containing the API response.
We convert it to a Python dict with json.loads(), then take the body from the response and load it into BeautifulSoup for static processing. We cannot load it into the browser, or we will get redirected.
main_content.find_all("a") finds all of our links. We filter out any bad links that don't contain the proper information.
views gets a default value of 0. link.select_one("strong[data-e2e='video-views']") checks to see if the video has views. If it does, we save the text of this element to views.
name, the id number of the video, gets extracted from the href of the link.
The sketch below condenses these per-link checks into a single helper.
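The helper name parse_video_link() is illustrative; link is assumed to be one of the BeautifulSoup a tags returned by main_content.find_all("a").

def parse_video_link(link, channel_name):
    href = link.get("href")
    # Keep only links that contain the channel name and TikTok's base domain
    if channel_name not in href or "https://www.tiktok.com" not in href:
        return None
    views = 0
    views_present = link.select_one("strong[data-e2e='video-views']")
    if views_present:
        views = views_present.text
    # Real videos end in a numeric id; ads and promos don't
    name = href.split("/")[-1]
    try:
        int(name)
    except ValueError:
        return None
    return {"name": name, "url": href, "views": views}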
Our crawler writes its results to a CSV file. Now we need a function that reads that file and runs the parser on each row; process_results() does exactly that. With process_results(), we read the CSV file into an array of dict objects. We then iterate through them with a for loop and call scrape_channel_content() on each row. Later on, we'll replace the for loop and add concurrency.

def process_results(csv_file, location, max_threads=5, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        for row in reader:
            scrape_channel_content(row, location, retries)
import osimport csvimport jsonimport loggingimport htmlfrom selenium import webdriverfrom selenium.webdriver.common.by import Byfrom bs4 import BeautifulSoupfrom urllib.parse import urlencodeimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] OPTIONS = webdriver.ChromeOptions()OPTIONS.add_argument("--headless")OPTIONS.add_argument("--disable-javascript") def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, "json_response": True, "wait": 2000 } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass ProfileData: name: str = "" follower_count: int = 0 likes: int = 0 video_count: int = 0 nickname: str = "" verified: bool = False signature: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_channel(channel_name, location, data_pipeline=None, retries=3): url = f"https://www.tiktok.com/@{channel_name}" tries = 0 success = False while tries <= retries and not success: try: driver = webdriver.Chrome(options=OPTIONS) scrapeops_proxy_url = get_scrapeops_url(url, location=location) driver.get(scrapeops_proxy_url) ## Extract Data json_stuff = driver.find_element(By.CSS_SELECTOR, "pre").get_attribute("innerHTML") page = json.loads(json_stuff) decoded_chunk = html.unescape(page["body"]) soup = BeautifulSoup(decoded_chunk, "html.parser") script_tag = soup.select_one("script[id='__UNIVERSAL_DATA_FOR_REHYDRATION__']") json_data = json.loads(script_tag.text) user_info = json_data["__DEFAULT_SCOPE__"]["webapp.user-detail"]["userInfo"] stats = user_info["stats"] follower_count = stats["followerCount"] likes = stats["heartCount"] video_count = stats["videoCount"] user_data = user_info["user"] unique_id = user_data["uniqueId"] nickname = user_data["nickname"] verified = user_data["verified"] signature = user_data["signature"] profile_data = ProfileData( name=unique_id, follower_count=follower_count, likes=likes, video_count=video_count, nickname=nickname, verified=verified, signature=signature ) data_pipeline.add_data(profile_data) success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(channel_list, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_channel, channel_list, [location] * len(channel_list), [data_pipeline] * len(channel_list), [retries] * len(channel_list) ) def scrape_channel_content(row, location, retries): url = f"https://www.tiktok.com/@{row['name']}" tries = 0 success = False while tries <= retries and not success: try: driver = webdriver.Chrome(options=OPTIONS) driver.get(url) ## Extract Data json_stuff = driver.find_element(By.CSS_SELECTOR, "pre").get_attribute("innerHTML") page = json.loads(json_stuff) decoded_page = html.unescape(page["body"]) soup = BeautifulSoup(decoded_page, "html.parser") main_content = soup.select_one("div[id='main-content-others_homepage']") links = main_content.find_all("a") for link in links: href = link.get("href") if row["name"] not in href or "https://www.tiktok.com" not in href: continue views = 0 views_present = link.select_one("strong[data-e2e='video-views']") if views_present: views = views_present.text name = href.split("/")[-1] try: int(name) except: logger.info(f"Ad found, skipping item: {name}") continue video_data = { "name": href.split("/")[-1], "url": href, "views": views } print(video_data) success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: 
{retries}") def process_results(csv_file, location, max_threads=5, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) for row in reader: scrape_channel_content(row, location, retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "uk" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape channel_list = [ "paranormalpodcast", "theparanormalfiles", "jdparanormal", "paranormal.com7", "paranormal064", "marijoparanormal", "paranormal_activityghost", "youtube_paranormal" ] ## Job Processes crawl_pipeline = DataPipeline(csv_filename="channels.csv") start_scrape(channel_list, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() logger.info(f"Crawl complete.") logger.info("Starting content scrape...") process_results("channels.csv", LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES) logger.info("Content scrape complete")
We already have a working DataPipeline; we just need to feed it some dataclass objects. To do this, we'll write one more class. It will be similar to our ProfileData class from earlier. It will have the same methods, but our fields will be a bit different.
Take a look at VideoData. We use it to hold the name, url, and views for each video we scrape.

@dataclass
class VideoData:
    name: str = ""
    url: str = ""
    views: str = ""

    def __post_init__(self):
        self.check_string_fields()

    def check_string_fields(self):
        for field in fields(self):
            # Check string fields
            if isinstance(getattr(self, field.name), str):
                # If empty set default text
                if getattr(self, field.name) == "":
                    setattr(self, field.name, f"No {field.name}")
                    continue
                # Strip any trailing spaces, etc.
                value = getattr(self, field.name)
                setattr(self, field.name, value.strip())
Next, we open a DataPipeline from inside scrape_channel_content(). We feed our VideoData
into it until we're done scraping. Once we've completed the parse, we close the pipeline.import osimport csvimport jsonimport loggingimport htmlfrom selenium import webdriverfrom selenium.webdriver.common.by import Byfrom bs4 import BeautifulSoupfrom urllib.parse import urlencodeimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] OPTIONS = webdriver.ChromeOptions()OPTIONS.add_argument("--headless")OPTIONS.add_argument("--disable-javascript") def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, "json_response": True, "wait": 2000 } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass ProfileData: name: str = "" follower_count: int = 0 likes: int = 0 video_count: int = 0 nickname: str = "" verified: bool = False signature: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) @dataclassclass VideoData: name: str = "" url: str = "" views: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_channel(channel_name, location, data_pipeline=None, retries=3): url = f"https://www.tiktok.com/@{channel_name}" tries = 0 success = False while tries <= retries and not success: try: driver = webdriver.Chrome(options=OPTIONS) scrapeops_proxy_url = get_scrapeops_url(url, location=location) driver.get(scrapeops_proxy_url) ## Extract Data json_stuff = driver.find_element(By.CSS_SELECTOR, "pre").get_attribute("innerHTML") page = json.loads(json_stuff) decoded_chunk = html.unescape(page["body"]) soup = BeautifulSoup(decoded_chunk, "html.parser") script_tag = soup.select_one("script[id='__UNIVERSAL_DATA_FOR_REHYDRATION__']") json_data = json.loads(script_tag.text) user_info = json_data["__DEFAULT_SCOPE__"]["webapp.user-detail"]["userInfo"] stats = user_info["stats"] follower_count = stats["followerCount"] likes = stats["heartCount"] video_count = stats["videoCount"] user_data = user_info["user"] unique_id = user_data["uniqueId"] nickname = user_data["nickname"] verified = user_data["verified"] signature = user_data["signature"] profile_data = ProfileData( name=unique_id, follower_count=follower_count, likes=likes, video_count=video_count, nickname=nickname, verified=verified, signature=signature ) data_pipeline.add_data(profile_data) success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(channel_list, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_channel, channel_list, [location] * len(channel_list), [data_pipeline] * len(channel_list), [retries] * len(channel_list) ) def scrape_channel_content(row, location, retries): url = f"https://www.tiktok.com/@{row['name']}" tries = 0 success = False while tries <= retries and not success: try: driver = webdriver.Chrome(options=OPTIONS) driver.get(url) ## Extract Data video_pipeline = DataPipeline(csv_filename=f"{row['name']}.csv") json_stuff = driver.find_element(By.CSS_SELECTOR, "pre").get_attribute("innerHTML") page = json.loads(json_stuff) decoded_page = html.unescape(page["body"]) soup = BeautifulSoup(decoded_page, "html.parser") main_content = soup.select_one("div[id='main-content-others_homepage']") links = main_content.find_all("a") for link in links: href = link.get("href") if row["name"] not in href or "https://www.tiktok.com" not in href: continue views = 0 views_present = link.select_one("strong[data-e2e='video-views']") if views_present: views = views_present.text name = href.split("/")[-1] try: int(name) except: logger.info(f"Ad found, skipping item: {name}") continue video_data = VideoData( name=href.split("/")[-1], url=href, views=views ) video_pipeline.add_data(video_data) success = True video_pipeline.close_pipeline() except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, 
retries left {retries-tries}") tries+=1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") def process_results(csv_file, location, max_threads=5, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) for row in reader: scrape_channel_content(row, location, retries) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "uk" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape channel_list = [ "paranormalpodcast", "theparanormalfiles", "jdparanormal", "paranormal.com7", "paranormal064", "marijoparanormal", "paranormal_activityghost", "youtube_paranormal" ] ## Job Processes crawl_pipeline = DataPipeline(csv_filename="channels.csv") start_scrape(channel_list, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() logger.info(f"Crawl complete.") logger.info("Starting content scrape...") process_results("channels.csv", LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES) logger.info("Content scrape complete")
The last step is to remove the for loop in our trigger function. ThreadPoolExecutor is an excellent tool for this.
Our rewritten function is in the snippet below. Our first arg is the function we wish to call, scrape_channel_content. Next, we pass in our CSV file data with reader. All other args get passed in as lists the length of reader.

def process_results(csv_file, location, max_threads=5, retries=3):
    logger.info(f"processing {csv_file}")
    with open(csv_file, newline="") as file:
        reader = list(csv.DictReader(file))

        with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
            executor.map(
                scrape_channel_content,
                reader,
                [location] * len(reader),
                [retries] * len(reader)
            )
To route the content scraper through the proxy, we only need to change the driver.get() line from the parser:

scrapeops_proxy_url = get_scrapeops_url(url, location=location)
driver.get(scrapeops_proxy_url)
import osimport csvimport jsonimport loggingimport htmlfrom selenium import webdriverfrom selenium.webdriver.common.by import Byfrom bs4 import BeautifulSoupfrom urllib.parse import urlencodeimport concurrent.futuresfrom dataclasses import dataclass, field, fields, asdict API_KEY = "" with open("config.json", "r") as config_file: config = json.load(config_file) API_KEY = config["api_key"] OPTIONS = webdriver.ChromeOptions()OPTIONS.add_argument("--headless")OPTIONS.add_argument("--disable-javascript") def get_scrapeops_url(url, location="us"): payload = { "api_key": API_KEY, "url": url, "country": location, "json_response": True, "wait": 2000 } proxy_url = "https://proxy.scrapeops.io/v1/?" + urlencode(payload) return proxy_url ## Logginglogging.basicConfig(level=logging.INFO)logger = logging.getLogger(__name__) @dataclassclass ProfileData: name: str = "" follower_count: int = 0 likes: int = 0 video_count: int = 0 nickname: str = "" verified: bool = False signature: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) @dataclassclass VideoData: name: str = "" url: str = "" views: str = "" def __post_init__(self): self.check_string_fields() def check_string_fields(self): for field in fields(self): # Check string fields if isinstance(getattr(self, field.name), str): # If empty set default text if getattr(self, field.name) == "": setattr(self, field.name, f"No {field.name}") continue # Strip any trailing spaces, etc. value = getattr(self, field.name) setattr(self, field.name, value.strip()) class DataPipeline: def __init__(self, csv_filename="", storage_queue_limit=50): self.names_seen = [] self.storage_queue = [] self.storage_queue_limit = storage_queue_limit self.csv_filename = csv_filename self.csv_file_open = False def save_to_csv(self): self.csv_file_open = True data_to_save = [] data_to_save.extend(self.storage_queue) self.storage_queue.clear() if not data_to_save: return keys = [field.name for field in fields(data_to_save[0])] file_exists = os.path.isfile(self.csv_filename) and os.path.getsize(self.csv_filename) > 0 with open(self.csv_filename, mode="a", newline="", encoding="utf-8") as output_file: writer = csv.DictWriter(output_file, fieldnames=keys) if not file_exists: writer.writeheader() for item in data_to_save: writer.writerow(asdict(item)) self.csv_file_open = False def is_duplicate(self, input_data): if input_data.name in self.names_seen: logger.warning(f"Duplicate item found: {input_data.name}. 
Item dropped.") return True self.names_seen.append(input_data.name) return False def add_data(self, scraped_data): if self.is_duplicate(scraped_data) == False: self.storage_queue.append(scraped_data) if len(self.storage_queue) >= self.storage_queue_limit and self.csv_file_open == False: self.save_to_csv() def close_pipeline(self): if self.csv_file_open: time.sleep(3) if len(self.storage_queue) > 0: self.save_to_csv() def scrape_channel(channel_name, location, data_pipeline=None, retries=3): url = f"https://www.tiktok.com/@{channel_name}" tries = 0 success = False while tries <= retries and not success: try: driver = webdriver.Chrome(options=OPTIONS) scrapeops_proxy_url = get_scrapeops_url(url, location=location) driver.get(scrapeops_proxy_url) ## Extract Data json_stuff = driver.find_element(By.CSS_SELECTOR, "pre").get_attribute("innerHTML") page = json.loads(json_stuff) decoded_chunk = html.unescape(page["body"]) soup = BeautifulSoup(decoded_chunk, "html.parser") script_tag = soup.select_one("script[id='__UNIVERSAL_DATA_FOR_REHYDRATION__']") json_data = json.loads(script_tag.text) user_info = json_data["__DEFAULT_SCOPE__"]["webapp.user-detail"]["userInfo"] stats = user_info["stats"] follower_count = stats["followerCount"] likes = stats["heartCount"] video_count = stats["videoCount"] user_data = user_info["user"] unique_id = user_data["uniqueId"] nickname = user_data["nickname"] verified = user_data["verified"] signature = user_data["signature"] profile_data = ProfileData( name=unique_id, follower_count=follower_count, likes=likes, video_count=video_count, nickname=nickname, verified=verified, signature=signature ) data_pipeline.add_data(profile_data) success = True except Exception as e: logger.error(f"An error occurred while processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") def start_scrape(channel_list, location, data_pipeline=None, max_threads=5, retries=3): with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_channel, channel_list, [location] * len(channel_list), [data_pipeline] * len(channel_list), [retries] * len(channel_list) ) def scrape_channel_content(row, location, retries): url = f"https://www.tiktok.com/@{row['name']}" tries = 0 success = False while tries <= retries and not success: try: driver = webdriver.Chrome(options=OPTIONS) scrapeops_proxy_url = get_scrapeops_url(url, location=location) driver.get(scrapeops_proxy_url) ## Extract Data video_pipeline = DataPipeline(csv_filename=f"{row['name']}.csv") json_stuff = driver.find_element(By.CSS_SELECTOR, "pre").get_attribute("innerHTML") page = json.loads(json_stuff) decoded_page = html.unescape(page["body"]) soup = BeautifulSoup(decoded_page, "html.parser") main_content = soup.select_one("div[id='main-content-others_homepage']") links = main_content.find_all("a") for link in links: href = link.get("href") if row["name"] not in href or "https://www.tiktok.com" not in href: continue views = 0 views_present = link.select_one("strong[data-e2e='video-views']") if views_present: views = views_present.text name = href.split("/")[-1] try: int(name) except: logger.info(f"Ad found, skipping item: {name}") continue video_data = VideoData( name=href.split("/")[-1], url=href, views=views ) video_pipeline.add_data(video_data) success = True video_pipeline.close_pipeline() except Exception as e: logger.error(f"An error occurred while 
processing page {url}: {e}") logger.info(f"Retrying request for page: {url}, retries left {retries-tries}") tries+=1 finally: driver.quit() if not success: raise Exception(f"Max Retries exceeded: {retries}") def process_results(csv_file, location, max_threads=5, retries=3): logger.info(f"processing {csv_file}") with open(csv_file, newline="") as file: reader = list(csv.DictReader(file)) with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor: executor.map( scrape_channel_content, reader, [location] * len(reader), [retries] * len(reader) ) if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "uk" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape channel_list = [ "paranormalpodcast", "theparanormalfiles", "jdparanormal", "paranormal.com7", "paranormal064", "marijoparanormal", "paranormal_activityghost", "youtube_paranormal" ] ## Job Processes crawl_pipeline = DataPipeline(csv_filename="channels.csv") start_scrape(channel_list, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() logger.info(f"Crawl complete.") logger.info("Starting content scrape...") process_results("channels.csv", LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES) logger.info("Content scrape complete")
As before, feel free to change MAX_RETRIES, MAX_THREADS, LOCATION, and channel_list
.if __name__ == "__main__": MAX_RETRIES = 3 MAX_THREADS = 5 LOCATION = "uk" logger.info(f"Crawl starting...") ## INPUT ---> List of keywords to scrape channel_list = [ "paranormalpodcast", "theparanormalfiles", "jdparanormal", "paranormal.com7", "paranormal064", "marijoparanormal", "paranormal_activityghost", "youtube_paranormal" ] ## Job Processes crawl_pipeline = DataPipeline(csv_filename="channels.csv") start_scrape(channel_list, LOCATION, data_pipeline=crawl_pipeline, max_threads=MAX_THREADS, retries=MAX_RETRIES) crawl_pipeline.close_pipeline() logger.info(f"Crawl complete.") logger.info("Starting content scrape...") process_results("channels.csv", LOCATION, max_threads=MAX_THREADS, retries=MAX_RETRIES) logger.info("Content scrape complete")