How To Bypass Anti-Bots With Python
Websites use various techniques like CAPTCHAs, rate limiting, IP blocking, JavaScript challenges, and behavioral analysis to detect and block bots, protecting their content and preventing malicious activities like data theft and transaction fraud. As these defenses keep evolving, it gets tougher for web scrapers to collect data without getting blocked or slowed down.
In this guide, we'll explore different methods to bypass anti-bot measures using Python.
- TLDR: How To Bypass Anti-Bots With Python
- Understanding Anti-Bot Mechanisms
- How To Bypass Anti-Bots With Python
- Method #1: Optimize Request Fingerprints
- Method #2: Use Rotating Proxy Pools
- Method #3: Use Fortified Headless Browsers
- Method #4: Use Managed Anti-Bot Bypasses
- Method #5: Solving CAPTCHAs
- Method #6: Scrape Google Cache Version
- Method #7: Reverse Engineer Anti-Bot
- Case Study - Scrape Twitchtracker.com
- Conclusion
- More Python Web Scraping Guides
Need help scraping the web?
Then check out ScrapeOps, the complete toolkit for web scraping.
TLDR: How To Bypass Anti-Bots With Python
To bypass anti-bots with Python, one of the most efficient methods is to use managed anti-bot bypasses like those provided by ScrapeOps.
These services handle everything for you, including IP rotation, request fingerprint optimization, CAPTCHA solving, and JavaScript execution.
Here's an example of how you can bypass a Cloudflare-protected site using ScrapeOps' anti-bot managed service:
import requests
response = requests.get(
url='https://proxy.scrapeops.io/v1/',
params={
'api_key': 'YOUR_API_KEY',
'url': 'http://example.com/', ## Cloudflare protected website
'bypass': 'cloudflare_level_1',
},
)
print('Body: ', response.content)
- This script makes a request to a Cloudflare-protected website using ScrapeOps.
- It sends the target URL along with your API key to the ScrapeOps API endpoint.
- The `bypass` parameter specifies the level of Cloudflare protection to bypass.
- The response content is then printed, showing the retrieved data.
If you'd like to explore other options for bypassing anti-bot systems, keep reading for more techniques and tools.
Understanding Anti-Bot Mechanisms
Websites use a variety of clever techniques to detect and block bots, aiming to protect their content and services from automated scripts like web scrapers. Understanding these techniques can help us figure out how to bypass them.
Here are some of the most common anti-bot mechanisms you'll encounter:
- CAPTCHAs: You've seen these before. They make you solve puzzles that are easy for humans but tough for bots. They might be presented in the following forms:
- Text-based CAPTCHAs: Users must identify and enter distorted text.
- Image-based CAPTCHAs: Users select images that match a specific criterion.
- Audio CAPTCHAs: Users listen to a sequence of numbers or letters and enter them correctly.
- One-click CAPTCHAs: Users click a checkbox to verify they are not a bot, with underlying algorithms analyzing mouse movements to confirm human activity.
While CAPTCHAs are effective at blocking basic bots, they can be bypassed using CAPTCHA-solving services or AI-based solvers, which have been reported to decode text CAPTCHAs with up to 99.8% accuracy and numbers in images with 90% accuracy.
- Rate Limiting: This limits how many requests you can make in a short time. Too many requests too quickly, and you get blocked.
- IP Blocking: If your IP address makes too many requests or acts suspiciously, it gets blocked.
- JavaScript Challenges: Some websites require your browser to run JavaScript to prove you're not a bot. Simple bots can't handle this, hence they get blocked.
- Behavioral Analysis: Websites watch your behavior, like mouse movements and keystrokes, to see if you're human or a bot.
These are the basic anti-bot systems used by most websites. Now, let's dive into some advanced anti-bot systems and how they work:
- PerimeterX: PerimeterX uses behavioral fingerprinting and machine learning to spot and block bots. It tracks user behavior, creating unique profiles for each user to identify anomalies that suggest bot activity.
- DataDome: DataDome provides real-time bot protection using AI to analyze and filter out malicious traffic. It monitors web traffic continuously, using AI models to recognize bot behavior by examining request headers, IP addresses, and browsing patterns.
- Cloudflare: Cloudflare offers tools to block bots and ensure legitimate traffic gets through. It uses JavaScript challenges, rate limiting, and a database of known malicious IPs to keep bots at bay.
Beyond these advanced systems, there are other methods websites use to combat bots:
- User-Agent Detection: This method identifies and blocks bots by analyzing the user-agent string in HTTP headers. Websites compare incoming requests against lists of known bot user-agents and block or challenge matches.
- Device Fingerprinting: This sophisticated technique collects and analyzes device-specific information to create a unique fingerprint for each device. It's highly effective because it's difficult to spoof.
- Request Throttling: This limits the number of requests a single user or IP address can make within a specified time frame. Advanced systems can adapt their limits based on real-time traffic conditions and historical data.
- Geolocation Restrictions: This method blocks or restricts access based on the geographic location of the IP address making the request. It's especially useful for websites with localized user bases, reducing the risk of bot attacks by limiting access to known and trusted regions.
How To Bypass Anti-Bots With Python
When it comes to web scraping, websites don't make it easy for bots to access their content. They employ various anti-bot mechanisms to protect their data and services.
Now that we know what we're up against, let's talk about how we can overcome these challenges using Python. Here are some strategies:
- CAPTCHA Solvers: While CAPTCHAs are tough for bots, services like 2Captcha or AI-based solvers can crack them. These services can solve text-based, image-based, and even audio CAPTCHAs for you.
- Rotating Proxies: To avoid rate limiting and IP blocking, use a pool of proxies and rotate them. This way, you distribute your requests across multiple IP addresses, making it harder for the website to detect and block you.
- Headless Browsers: Tools like Selenium can run a real browser in the background. This helps bypass JavaScript challenges since it can execute JavaScript just like a regular browser.
- Human-Like Behavior: Mimic human behavior by adding random delays between your actions and simulating mouse movements and keystrokes. Libraries like pyautogui can help with this (see the short sketch after this list).
- Using Advanced APIs: Some anti-bot systems like PerimeterX and Cloudflare are tough to bypass on your own. Advanced web scraping APIs, such as those provided by ScrapeOps, handle these protections for you so you can focus on extracting the data.
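Here's a minimal sketch of the simplest form of the human-like-behavior idea above: adding randomized pauses between plain requests calls (the URLs below are placeholders):
import random
import time
import requests

urls = [
    "https://example.com/page1",  # placeholder URLs
    "https://example.com/page2",
]

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)
    # Pause for a random 2-6 seconds so the request pattern looks less robotic
    time.sleep(random.uniform(2, 6))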
Now, let's explore some methods of bypassing anti-bots with Python and the challenges solved by these methods.
Method #1: Optimize Request Fingerprints
Challenge: Websites may analyze user-agent strings and headers sent by web browsers to identify and block bot traffic.
When you visit a website, your browser sends a request to the server, and this request contains various pieces of information known as HTTP headers.
HTTP headers are key-value pairs sent by your browser to the server with each HTTP request. They contain important information about the request and the client making the request. Examples of this information include:
- User-Agent: Contains information about the browser and operating system.
- Accept-Language: Indicates the language preferences of the client.
- Referer: The URL of the previous webpage from which a link to the currently requested page was followed.
One of the most important headers is the User-Agent string. This little piece of information tells the server what kind of device and software is making the request. A User-Agent string looks something like this:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36
Anti-bot systems often analyze these headers, and if they spot the exact same values repeatedly, they'll likely block you. So, how do we get around this?
To mitigate this issue, we need to optimize our headers. This means making our requests look like they’re coming from different genuine browsers.
You can do this by rotating through a list of User-Agent strings, varying other header values, and mimicking human behavior by adding random delays between requests. This way, our traffic looks more natural and less like it’s coming from a bot.
Solution: Header Optimization
One of the easiest ways to optimize our headers is through User-Agent rotation. This means dynamically changing the contents of HTTP headers by randomly selecting from a list of User-Agent strings and other header values.
By doing this, each request looks different and more like typical human behavior, making it harder for anti-bot systems to detect and block our scraping activities.
Here’s a simple example of how we can rotate headers in Python using the requests library:
import requests
import random
# List of User-Agent strings
user_agents = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Safari/605.1.15",
"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
]
# Function to get a random User-Agent string
def get_random_user_agent():
return random.choice(user_agents)
# Example of making a request with a rotated User-Agent
url = "http://httpbin.org/headers" # this should be the target website you want to scrape
headers = {
"User-Agent": get_random_user_agent(),
"Accept-Language": "en-US,en;q=0.9",
"Referer": "https://google.com"
}
response = requests.get(url, headers=headers)
print(response.status_code)
print(response.text)
In the code above, we have:
- User-Agent List: A list of User-Agent strings representing different browsers and devices.
- Random User-Agent Function: A function to select a random User-Agent string from the list.
- Headers: A headers dictionary combining a random User-Agent string with other headers like Accept-Language and Referer.
- HTTP Request: An example of making an HTTP GET request with the rotated headers.
Use Cases
- 403 Forbidden: Rotating headers helps in avoiding 403 errors by making your requests look legitimate and diverse, thereby reducing the chances of being blocked.
- 429 Too Many Requests: By rotating headers and adding random delays between requests (as sketched below), you can stay under the rate limits imposed by the server, avoiding 429 errors.
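Here's a minimal sketch of that second use case: rotate the User-Agent on every attempt and back off with a growing random delay whenever the server answers 429 (the target URL and retry count are placeholders):
import random
import time
import requests

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0",
]

def get_with_backoff(url, max_retries=3):
    response = None
    for attempt in range(max_retries):
        headers = {"User-Agent": random.choice(user_agents)}
        response = requests.get(url, headers=headers)
        if response.status_code != 429:
            return response
        # Rate limited: wait a random, growing delay before retrying
        time.sleep(random.uniform(2, 5) * (attempt + 1))
    return response

print(get_with_backoff("https://example.com").status_code)  # placeholder URL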
If you would like to learn more about how to use fake User-Agents and Browser Headers, check our detailed guide.
Method #2: Use Rotating Proxy Pools
Challenge: Websites may block access from IP addresses that exhibit suspicious behavior or high request rates, effectively preventing bots from accessing the site.
Another common challenge for web scrapers is IP blocking. Websites often monitor and block IP addresses that make too many requests in a short period. To counter this, we use IP rotation, which means switching between multiple IP addresses to mask our identity and mimic the behavior of different users.
IP rotation involves periodically changing the IP address used for web scraping or after a certain number of requests. This technique spreads our requests across a pool of IP addresses, helping us avoid detection and blocking. If an IP address gets blocked, we can simply switch to another one from the pool and keep scraping without interruption.
We achieve IP rotation by using a proxy server, which acts as an intermediary between our scraper and the target website. The proxy server assigns different IP addresses to our outgoing requests, making it appear as if they’re coming from different users and locations. This method helps us evade anti-bot systems and avoid IP bans effectively.
Solution: IP Rotation Using Proxy
There are two major ways to obtain proxies for IP rotation:
- Subscribing to a proxy rotation service or
- Using a list of publicly available proxies.
While free or public proxies might seem appealing because there's no cost, they often come with significant drawbacks, such as low reliability, slow speeds, and a high likelihood of being blocked.
Paid proxy services, on the other hand, offer several advantages:
- Reliability: Paid proxies are less likely to go offline and usually provide higher uptime.
- Speed: Paid proxies typically offer faster connection speeds, which is crucial for efficient web scraping.
- Anonymity: Premium proxies are less likely to be flagged and blocked, as they are used by fewer people.
- Support: Paid services often come with customer support to help troubleshoot any issues.
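If you do assemble your own proxy list, whether free or paid, a minimal sketch of rotating it with the requests library could look like this (the proxy addresses below are placeholders):
import random
import requests

# Placeholder proxies; replace with real addresses from your provider
proxy_pool = [
    "http://111.111.111.111:8080",
    "http://222.222.222.222:8080",
    "http://333.333.333.333:8080",
]

def get_via_random_proxy(url):
    proxy = random.choice(proxy_pool)
    # Route both HTTP and HTTPS traffic through the chosen proxy
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)

response = get_via_random_proxy("https://httpbin.org/ip")
print(response.text)  # shows which IP address the target site saw
Maintaining, testing, and refreshing such a list yourself quickly becomes tedious, which is where managed services come in.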
One of the most efficient ways to implement IP rotation is by using a web scraping API. These APIs simplify the data-gathering process from websites by handling things like request headers, IP rotation, and CAPTCHAs for you. They come with built-in proxy pools, making it easier to rotate IP addresses without the hassle of managing proxy servers yourself.
ScrapeOps Proxy Aggregator is a leading web scraping API service that offers a rotating proxy pool feature. This can be easily integrated into your Python scripts.
Here’s an example of how to use the ScrapeOps API for IP rotation:
import requests
# Define the ScrapeOps API endpoint and your API key
SCRAPEOPS_API_URL = "https://proxy.scrapeops.io/v1/"
API_KEY = "your_scrapeops_api_key"
def scrape_with_proxies(url):
response = requests.get(
SCRAPEOPS_API_URL,
params={
'api_key': API_KEY,
'url': url,
'country': 'us' # Optional: specify the country of the IP address
}
)
return response.text # the proxy returns the target page's HTML
# Example usage
url = "https://example.com"
data = scrape_with_proxies(url)
print(data)
In this example, we used the ScrapeOps API to manage IP rotation.
- The API handles proxy management and provides different IP addresses for each request, ensuring anonymity and reducing the risk of IP bans.
- The `scrape_with_proxies` function sends a GET request to the ScrapeOps API with the target URL and the API key.
- The response contains the data from the target website.
If you would like to learn more about how to use and rotate proxies with Python Requests library, check our detailed Python Requests: How to Use & Rotate Proxies article.
Use Cases
- 403 Forbidden Errors: Some websites automatically block IP addresses that make too many requests in a short period. Using a rotating proxy pool can help distribute the requests across multiple IP addresses, reducing the likelihood of encountering a 403 Forbidden error.
- JavaScript Challenges: Websites that use JavaScript to detect bots can be trickier to scrape. Rotating proxies can help bypass these challenges by making it appear as if the requests are coming from different users, thus avoiding detection.
Method #3: Use Fortified Headless Browsers
Challenge: Websites may employ JavaScript challenges to detect bot behavior, such as requiring user interaction or executing JavaScript code to verify the browser's capabilities.
Modern websites often use advanced bot detection techniques like JavaScript challenges and user interaction requirements to verify the browser's capabilities.
Anti-bot services such as Cloudflare and DataDome are known for employing these methods to identify and block scrapers. To bypass these sophisticated systems, we can use fortified headless browsers.
Solution: Dynamic Rendering Using a Headless Browser
A headless browser is essentially a web browser without a graphical user interface (GUI). You can launch it using tools like Playwright or Selenium. It renders web pages just like a regular browser but runs in the background, making it perfect for automated tasks like web scraping, testing, and accessing dynamic content that requires JavaScript execution.
Here are some benefits of using a headless browser to dynamically render a website and evade bot detection:
- JavaScript Execution: Headless browsers can execute JavaScript, allowing them to render dynamic content that standard HTTP libraries (like requests) can't handle.
- Simulating User Interaction: They can simulate user actions such as clicks, form submissions, and scrolling, which are often needed to fully load dynamic content.
- Avoiding Detection: By mimicking real user behavior, headless browsers can help evade sophisticated anti-bot mechanisms. Tools like Playwright Stealth or Selenium's Undetected Chromedriver can modify browser properties to avoid detection by systems like Cloudflare or DataDome.
- Handling Complex Web Elements: They can interact with complex web elements that require JavaScript to be fully loaded and displayed.
Here's an example of how to set up and use Playwright for rendering dynamic content:
import asyncio
from playwright.async_api import async_playwright
async def main():
async with async_playwright() as pw:
browser = await pw.chromium.launch(headless=True)
page = await browser.new_page()
await page.goto('https://example.com')
await page.wait_for_selector('h1')
content = await page.content()
print(content)
await browser.close()
asyncio.run(main())
In the code above:
- We imported asyncio and `playwright.async_api` for asynchronous web scraping.
- The `main()` function launches a Chromium browser in headless mode (without a GUI).
- A new browser page is created, and it navigates to `https://example.com`.
- The script waits for an `h1` element to appear on the page.
- The page's HTML content is retrieved and printed.
- The browser is then closed to free up resources.
Sometimes, using a headless browser for evading bot detection might not be enough and may still reveal the fact that we are using a bot. This kind of exposure is known as a browser leak.
Browser leaks refer to the unintentional exposure of information by a web browser that can be used to identify automated browsing activities, such as those conducted by bots or headless browsers. These leaks are particularly relevant in the context of web scraping and automated browsing, as they can lead to the detection and blocking of bots by sophisticated anti-bot systems. These leaks can include:
JavaScript Execution Differences:
Headless browsers can execute JavaScript, but there may be subtle differences in how they do so compared to regular browsers. These differences can include variations in the timing of script execution, discrepancies in the JavaScript environment, or the presence of properties that indicate a headless mode (e.g. `navigator.webdriver`).
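To see how easily a site's own JavaScript can spot this, here's a short sketch (reusing Playwright from the earlier example) that reads the property the same way an anti-bot script would; in a plain headless Chromium session it typically reports True:
import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto("https://example.com")
        # The same check an anti-bot script can run inside the page
        is_webdriver = await page.evaluate("() => navigator.webdriver")
        print("navigator.webdriver:", is_webdriver)
        await browser.close()

asyncio.run(main())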
Rendering Anomalies:
Headless browsers may render web pages slightly differently than standard browsers. These anomalies can be detected by anti-bot systems. For example, differences in how fonts, images, or canvas elements are rendered can serve as indicators that the browser is operating in headless mode.
Variations in Browser Properties:
Browser properties such as the user agent string, screen resolution, installed plugins, and other fingerprintable attributes can vary between headless and non-headless browsers.
Anti-bot systems can query these properties and compare them against expected values for legitimate user agents. If the properties indicate inconsistencies or default values often associated with headless browsers, the request can be flagged as suspicious.
To address these problems, fortified headless browsers come into play. Fortified headless browsers are enhanced versions of headless browsers designed to evade detection by sophisticated anti-bot systems like DataDome and Cloudflare.
They use techniques such as randomizing browser fingerprints, simulating human-like browsing patterns, and bypassing common detection methods. One such fortified headless browser is Selenium's Undetected ChromeDriver.
It's a modified version of ChromeDriver designed to bypass these detection mechanisms by mimicking human-like browsing behavior and avoiding common detection pitfalls.
Let's see how we can use Selenium Undetected ChromeDriver to bypass anti-bot systems:
First, you need to install the undetected-chromedriver package. This can be done easily using pip:
pip install --upgrade undetected_chromedriver
Once installed, you can set up the Undetected ChromeDriver in your Python script. Here's how to get started:
import undetected_chromedriver as uc
import time
def main():
# Initialize Chrome options
chrome_options = uc.ChromeOptions()
# chrome_options.add_argument('--headless') # Uncomment this to run in headless mode
chrome_options.add_argument('--disable-gpu') # Disable GPU usage for compatibility
chrome_options.add_argument('--no-sandbox') # Disable sandboxing for compatibility
# Initialize the undetected ChromeDriver
driver = uc.Chrome(options=chrome_options)
try:
# Navigate to a webpage
driver.get('https://example.com')
# Wait for a few seconds to allow the page to load
time.sleep(5)
# Print the contents of the page
print(driver.page_source)
finally:
driver.quit()
if __name__ == "__main__":
main()
In this script:
- We use `undetected_chromedriver` to launch a Chrome browser that can bypass detection.
- Chrome options are configured, including disabling GPU usage and sandboxing for compatibility.
- The undetected ChromeDriver is initialized with these options.
- The browser navigates to `https://example.com` and waits for 5 seconds to ensure the page loads completely.
- The HTML content of the loaded page is printed.
- Finally, the browser is properly closed to free up resources.
Using proxies can further enhance your ability to bypass detection by rotating IP addresses.
Here's sample code to set up a proxy in your Selenium script:
import undetected_chromedriver as uc

PROXY = "87.83.230.229:8080" # Your proxy address
chrome_options = uc.ChromeOptions()
chrome_options.add_argument(f'--proxy-server={PROXY}')
driver = uc.Chrome(options=chrome_options)
For a more comprehensive guide and additional details, check our Selenium Undetected Chromedriver: Bypass Anti-Bots With Ease guide on bypassing anti-bot systems using Selenium Undetected Chromedriver.
Method #4: Use Managed Anti-Bot Bypasses
Challenge: Some anti-bot systems are very difficult to bypass and require the use of highly fortified browsers in combination with residential proxies and optimized headers/cookies.
Sometimes, even the best fortified headless browsers can’t get past advanced anti-bot systems like Datadome and PerimeterX. These systems are designed to be incredibly tough on scrapers. This is where managed anti-bot bypasses come into play. Instead of trying to handle everything on your own, these services offer managed solutions to bypass anti-bot measures.
Let's look at some companies providing these services and their pricing systems.
Solution: Use a proxy provider that offers managed anti-bot bypasses instead of doing it yourself.
When dealing with very difficult anti-bot systems, using a managed anti-bot bypass service can save you a lot of time and hassle.
Here are some options:
| Proxy Provider | Anti-Bot Solution | Pricing Method |
|---|---|---|
| ScrapeOps | Anti-Bot Bypasses | Pay per successful request |
| BrightData | Web Unlocker | Pay per successful request |
| Oxylabs | Web Unblocker | Pay per GB |
| Smartproxy | Site Unblocker | Pay per GB |
| Zyte | Zyte API | Pay per successful request |
| ScraperAPI | Ultra Premium | Pay per successful request |
| ScrapingBee | Stealth Proxy | Pay per successful request |
| Scrapfly | Anti-Scraping Protection | Pay per successful request |
These anti-bot solutions do work, but they can become extremely expensive at scale, with prices ranging from $1,000 to $5,000 to scrape 1M pages per month.
As part of the ScrapeOps Proxy Aggregator, we aggregate these anti-bot bypassing solutions together and find the best performing and cheapest option for your use case.
For example, you can activate the Cloudflare bypass by simply adding `bypass=cloudflare_level_1` to your API request, and the ScrapeOps proxy will use the best & cheapest Cloudflare bypass available for your target domain.
import requests
response = requests.get(
url='https://proxy.scrapeops.io/v1/',
params={
'api_key': 'YOUR_API_KEY',
'url': 'http://example.com/', ## Cloudflare protected website
'bypass': 'cloudflare_level_1',
},
)
print('Body: ', response.content)
In this script:
- A GET request is sent to the ScrapeOps API endpoint using the provided API key.
- The target URL is specified along with a bypass parameter to handle Cloudflare protection.
- The response content from the requested URL is printed.
Here is a list of available bypasses:
| Bypass | Description |
|---|---|
| cloudflare_level_1 | Use to bypass Cloudflare protected sites with low security settings. |
| cloudflare_level_2 | Use to bypass Cloudflare protected sites with medium security settings. |
| cloudflare_level_3 | Use to bypass Cloudflare protected sites with high security settings. |
| incapsula | Use to bypass Incapsula protected sites. |
| perimeterx | Use to bypass PerimeterX protected sites. |
| datadome | Use to bypass DataDome protected sites. |
Method #5: Solving CAPTCHAs
Challenge: CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) challenges present images, text, or puzzles that humans can typically solve but are difficult for bots.
CAPTCHA challenges are designed to distinguish human users from automated bots. These challenges usually involve recognizing text from distorted images, selecting certain images, or solving simple puzzles. While humans can easily solve these tasks, they present significant challenges for bots.
Solution: Utilize CAPTCHA-Solving Services
There are several techniques for automating CAPTCHA solving; some of these include:
- Optical Character Recognition (OCR): OCR technology can be used to decode text-based CAPTCHAs by recognizing and extracting text from images (see the short sketch after this list). However, this technique is less effective against more complex CAPTCHAs.
- Image Recognition: For CAPTCHAs that involve identifying objects within images, advanced image recognition algorithms can be utilized. This approach is more challenging and often less reliable.
- Third-Party CAPTCHA Solving Services: The most effective method involves using third-party services like 2Captcha, Anti-Captcha, or Capsolver. These services offer APIs that can be integrated into your web scraping scripts to handle CAPTCHA challenges in real-time.
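For the OCR route, here's a minimal sketch for simple text CAPTCHAs, assuming Tesseract is installed on your system along with the pytesseract and Pillow packages (the image path is a placeholder):
from PIL import Image
import pytesseract

# Path to a saved CAPTCHA image (placeholder); download it from the CAPTCHA page first
captcha_image = Image.open("captcha_image.png")

# Run OCR over the image; simple, undistorted text works best
captcha_text = pytesseract.image_to_string(captcha_image).strip()
print("Recognized text:", captcha_text)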
Let's see how we can integrate some of these services into our web scraping scripts. We'll start with 2Captcha.
First install the package using pip:
pip install 2captcha-python
Create an instance of TwoCaptcha like this:
from twocaptcha import TwoCaptcha
solver = TwoCaptcha('YOUR_API_KEY')
Finally solve the CAPTCHA using:
try:
# Solve the CAPTCHA by providing the image URL or base64 string
result = solver.normal('path/to/captcha_image.png') # This is the path to the image from the CAPTCHA page. You can save this image using regular BeautifulSoup code and pass it here or simply pass the image URL.
print(f"CAPTCHA Solved: {result['code']}")
# From here, you can process the result by automatically filling in the CAPTCHA box.
except Exception as e:
print(f"Error solving CAPTCHA: {e}")
You can also customize some of the options for the created instance:
config = {
'server': '2captcha.com',
'apiKey': 'YOUR_API_KEY',
'softId': 123,
'callback': 'https://your.site/result-receiver',
'defaultTimeout': 120,
'recaptchaTimeout': 600,
'pollingInterval': 10,
}
solver = TwoCaptcha(**config)
Now let's try the Anti-Captcha service:
Step 1: Install the package using pip.
pip3 install anticaptchaofficial
Step 2: Integrate with the script:
from anticaptchaofficial.imagecaptcha import *
solver = imagecaptcha()
solver.set_verbose(1)
solver.set_key("YOUR_API_KEY_HERE")
captcha_text = solver.solve_and_return_solution("captcha.jpeg") # path to image
if captcha_text != 0:
    print("captcha text " + captcha_text)
else:
    print("task finished with error " + solver.error_code)
Here's an explanation of the script above:
- Both examples start by importing the necessary libraries and initializing the CAPTCHA-solving service with an API key.
- The `solver.normal` method in the 2Captcha example and the `solver.solve_and_return_solution` method in the Anti-Captcha example are used to send the CAPTCHA image to the service.
- The service processes the image and returns the solution; this result can then be programmatically submitted to the CAPTCHA page so scraping can continue.
Method #6: Scrape Google Cache Version
Challenge: Bypassing advanced bot detection systems such as Cloudflare.
Websites protected by advanced anti-bot systems like Cloudflare can be extremely challenging to scrape directly due to their sophisticated security mechanisms, including JavaScript challenges, CAPTCHAs, and behavior analysis. These protections can quickly detect and block scraping attempts.
Solution: Scrape Google Cache Version
When Google indexes web pages, it creates a cached version of the content. Many websites protected by Cloudflare allow Google to crawl their content, making this cached data accessible.
Scraping the Google cache can be more straightforward than scraping a website directly protected by Cloudflare. However, this method is most effective if the target website's data doesn't change frequently, as the cached version may not always be up-to-date.
Why does this work? Google's cache is designed to store and serve copies of web pages to improve load times and provide access to content when the original page is unavailable. Since Googlebot can bypass most anti-bot protections to index pages, the cached versions are generally accessible without triggering the same security mechanisms.
Advantages
- Reduced Blocking: Since requests to Google Cache are not directed to the original site, they are less likely to be blocked or flagged.
- Simplified Scraping: Cached pages often omit some of the dynamic content and JavaScript protections, making them easier to parse.
- Bypassing Rate Limits: Accessing content through Google Cache helps avoid rate limits imposed by the target website.
Disadvantages
- Outdated Content: The cached version may not be up-to-date, leading to discrepancies if the site content changes frequently.
- Partial Content: Some dynamic elements or features of the original page might not be fully captured in the cached version.
- Cache Invalidation: If the target website frequently updates or changes their content, the cache may be invalidated or not available for scraping.
Below is sample code to scrape a page using the Google Cache version:
import requests
from bs4 import BeautifulSoup
def scrape_google_cache(url):
# Construct the Google Cache URL
cache_url = f"http://webcache.googleusercontent.com/search?q=cache:{url}"
# Send a request to the Google Cache URL
response = requests.get(cache_url)
if response.status_code == 200:
# Parse the content using BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
return soup.prettify()
else:
return None
# Example usage
cached_content = scrape_google_cache('https://example.com')
if cached_content:
print(cached_content)
else:
print("Failed to retrieve cached content.")
Here's an explanation of the script above:
- Construct the Cache URL: The `cache_url` variable is created by appending the target URL to the Google Cache prefix.
- Request the Cached Page: The `requests.get` method fetches the cached page.
- Check Response Status: The script checks if the request was successful (HTTP status code 200).
- Parse and Return Content: Using BeautifulSoup, the script parses the HTML content of the cached page and prints it.
Method #7: Reverse Engineer Anti-Bot
Challenge: Some advanced anti-bot systems might be able to detect and block the majority of the above techniques.
Sometimes, anti-bot systems are so advanced that the usual techniques just won't cut it. When that happens, we might need to roll up our sleeves and reverse engineer these systems.
Solution: Reverse Engineer Anti-Bot Systems
Reverse engineering an anti-bot system means digging into the detection mechanisms they use to spot bots. By understanding how these systems work, you can create methods to bypass their protections without always relying on fortified headless browsers or proxy solutions for every request. This approach is complex, but it can be incredibly effective for large-scale scraping operations.
Advantages of Reverse Engineering
- Resource Efficiency: Avoid the high costs associated with running numerous full headless browser instances by developing a lightweight, custom bypass solution.
- Scalability: Perfect for large-scale operations, such as scraping over 500 million pages per month, where resource optimization is crucial.
- Precision: Tailor your solution to specifically meet the requirements of passing various anti-bot checks including JavaScript challenges, TLS fingerprints, and IP reputation tests.
Disadvantages and Considerations
- Complexity: Advanced anti-bot systems are intentionally complex, requiring significant expertise and time to understand and circumvent.
- Maintenance: Continuous updates from anti-bot protection providers mean your bypass will require ongoing maintenance to remain effective.
- Initial Investment: The initial development phase demands a substantial investment of time and engineering resources, which may only be justified for large-scale operations or businesses that depend heavily on cost-effective scraping solutions.
Who Should Consider This Approach?
- Enthusiasts: Those with a keen interest in the intellectual challenge of reverse engineering sophisticated systems.
- Large-Scale Operators: Companies or proxy services that scrape high volumes of data and need to minimize operational costs.
For most developers, simpler methods of bypassing anti-bot protection are sufficient and less resource-intensive. However, for companies scraping at very large volumes (500M+ pages per month), or for smart proxy solutions whose businesses depend on cost-effective ways to access sites, building a custom bypass solution might be a good option.
Let's be honest—reverse engineering is no walk in the park. It takes a lot of grit and determination. If you're up for the challenge, a good place to start is by carefully studying the "How it works" section on these anti-bot systems' websites.
Understanding their own descriptions of their technology can give you valuable insights into how they detect and block bots.
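For a small taste of what this involves, one signal many of these systems inspect is the TLS fingerprint of your HTTP client. A hedged sketch using the third-party curl_cffi library (assuming it is installed with pip install curl_cffi) to impersonate a real browser's TLS handshake might look like this:
# A sketch, not a guaranteed bypass: curl_cffi mimics a real Chrome TLS fingerprint,
# which the plain requests library cannot do.
from curl_cffi import requests as cffi_requests

response = cffi_requests.get(
    "https://example.com/",  # placeholder target URL
    impersonate="chrome110",  # impersonate Chrome 110's TLS/JA3 fingerprint
)
print(response.status_code)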
Case Study - Scrape TwitchTracker
In this section, we're going to try scraping TwitchTracker and implement the various methods to bypass anti-bot systems listed above.
TwitchTracker maintains up-to-date statistics for Twitch streamers, including useful data like follower count and average viewers. It is protected by Cloudflare, which makes it a great example to test what we have learned so far.
Scraping With Vanilla Python
Here, we will try to use just BeautifulSoup and Python Requests without applying any form of anti-bot bypasses.
Let's see the results.
import requests
from bs4 import BeautifulSoup
url = 'https://twitchtracker.com/sheschardcore/streams'
r = requests.get(url)
status_code = r.status_code # obtaining the status code
soup = BeautifulSoup(r.content, 'html.parser') # parsing the content with bs4, whether the request was successful or not
if status_code==200:
print('Request was successful')
print(soup.text)
else:
print(f'we got a status code of {status_code}, therefore we are unable to scrape.')
print(soup.text)
The script above:
- Sends a GET request to a specified URL, checks if the request was successful, and parses the HTML content of the page using BeautifulSoup.
- If the request is successful, it prints a success message along with the text content of the page.
- If the request fails, it prints an error message with the status code and the text content.
After running this code, we get the following errors, implying that we cannot scrape this site without implementing some kind of anti-bot bypass:
Optimizing Request Fingerprints
Next, let's try optimizing our request fingerprints to make it look like our requests are coming from a real device.
import requests
import random
# List of User-Agent strings
user_agents = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Safari/605.1.15",
"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
]
# Function to get a random User-Agent string
def get_random_user_agent():
return random.choice(user_agents)
# Example of making a request with a rotated User-Agent
url = "https://twitchtracker.com/sheschardcore/streams" # this should be the target website you want to scrape
headers = {
"User-Agent": get_random_user_agent(),
"Accept-Language": "en-US,en;q=0.9",
"Referer": "https://google.com"
}
response = requests.get(url, headers=headers)
if response.status_code == 200:
print('Request Successful, We are Able to scrape the Data')
else:
print(f'We got a status code of {response.status_code}, unable to scrape Data.' )
- Our updated script sends a GET request to a specified URL, using randomly selected User-Agent strings to avoid detection and blocking by the server.
- It also includes additional headers to simulate a legitimate browser request.
- The script checks if the request was successful by examining the status code of the response and prints an appropriate message based on the result.
After running our scraper with optimized fingerprints, we can see that we got a status code of 200, meaning we are able to extract the data successfully.
As we can see, optimizing our request fingerprints allowed us to successfully scrape data from the website. However, it's important to note that some websites with stringent anti-bot systems will require more than just fingerprint optimization to bypass them.
The effectiveness of these methods depends on the specific anti-bot defenses in place, so if one method doesn't work, you should try another from the list above.
Rotating Proxy Pools
This technique helps us avoid detection by constantly changing our IP address from a pool of proxies managed by ScrapeOps, making it look like our requests are coming from different locations. To use this method, you need to use the ScrapeOps proxy.
First, obtain an API key by signing up for a free account here.
import requests
from bs4 import BeautifulSoup
response = requests.get(
url='https://proxy.scrapeops.io/v1/',
params={
'api_key': 'XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX', # Replace with your ScrapeOps API key
'url': 'https://twitchtracker.com/sheschardcore/streams',
},
)
status_code = response.status_code
soup = BeautifulSoup(response.content,'html.parser')
if status_code==200:
print(f'Request was successful with a status code of {status_code}')
title = soup.title.string
print(title)
else:
print(f'we got a status code of {status_code}, therefore we are unable to scrape.')
The code above:
- Sends a GET request to the ScrapeOps proxy API with the API key, targeting the Twitchtracker URL.
- Checks the status code of the response to determine if the request was successful.
- Parses the HTML content of the response using BeautifulSoup.
- If the request is successful (status code 200), it prints a success message with the status code, then extracts and prints the page title.
- If the request fails, it prints an error message with the status code, indicating that the scraping was unsuccessful.
The code's output confirms that we have been successfully able to scrape the data using the ScrapeOps proxy, as shown by the status code 200 in the screenshot.
Using Fortified Headless Browsers
Now, let's try using fortified headless browsers. In this example, we will use Selenium's Undetected ChromeDriver to render the page and print its content.
First, we need to install the undetected-chromedriver package. This can be done using pip:
pip install --upgrade undetected_chromedriver
Now, let's look at the code:
import undetected_chromedriver as uc
import time
def main():
# Initialize Chrome options
chrome_options = uc.ChromeOptions()
chrome_options.add_argument('--blink-settings=imagesEnabled=false') # Disable image loading to speed up page rendering
chrome_options.add_argument('--headless') # Run the browser without a visible window
chrome_options.add_argument('--no-sandbox') # Disable sandboxing for compatibility
# Initialize the undetected ChromeDriver
driver = uc.Chrome(options=chrome_options)
try:
# Navigate to a webpage
driver.get('https://twitchtracker.com/sheschardcore/streams')
# Wait for a few seconds to allow the page to load
time.sleep(5)
status_code = driver.execute_script(
'return fetch(document.location.href).then(response => response.status);' # extracting the status code using JavaScript
)
if status_code == 200:
print(f'Request was successful with a status code of {status_code}')
page_title = driver.title
# Print the page title
print(f'Page title: {page_title}')
else:
print(f'Request was NOT successful with a status code of {status_code}')
finally:
try:
driver.close()
driver.quit()
except Exception:
pass
if __name__ == "__main__":
main()
In the code above:
- We set up Chrome options and initialize the undetected ChromeDriver to help avoid detection.
- Next, we navigate to the TwitchTracker stream page and give it a few seconds to load.
- We then execute a JavaScript command to fetch and print the status code of the response, checking if our request was successful.
- After that, we retrieve and print the page title if the status code is 200; otherwise, we print the other status code.
- Finally, we ensure the browser closes properly, even if something goes wrong.
The results from this code show that we successfully bypassed their Cloudflare protection and scraped the required data. This is shown in the screenshot below:
Using Managed Anti-Bot Bypasses
Next, we'll use the ScrapeOps Managed Anti-Bot Service to bypass anti-bot protection. We can do that using the code below.
import requests
from bs4 import BeautifulSoup
response = requests.get(
url='https://proxy.scrapeops.io/v1/',
params={
'api_key': 'XXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX', # use your Scrapeops API key
'url': 'https://twitchtracker.com/sheschardcore/streams', ## Cloudflare protected website
'bypass': 'cloudflare_level_1',
},
)
status_code = response.status_code
soup = BeautifulSoup(response.content, 'html.parser')
title = soup.title.string if soup.title else "Title not found"
print('Title: ', title)
if status_code == 200:
print(f'Request was successful with status code of {status_code}')
else:
print(f'Request was not successful with status code of {status_code}')
In the code above:
- We send a GET request to the ScrapeOps proxy API endpoint with the specified URL and API key, targeting the TwitchTracker stream page. We include a parameter to bypass Cloudflare's level 1 protection.
- Next, we check the status code of the response to determine if our request was successful.
- We then parse the HTML content of the response using BeautifulSoup and attempt to extract the page title.
- If the request is successful (status code 200), we print a success message along with the extracted title. If the request fails, we print an error message with the status code.
- Finally, we ensure that the script handles the absence of a title gracefully by printing "Title not found" if the title element is missing.
This method has also successfully bypassed the site's Cloudflare anti-bot system, returning a status code of 200 and allowing us to scrape the data we needed.
Scrape Google Cache Version
Let's try using Google's cached version of the website to scrape its content. This method involves scraping the version of the site that Google has indexed. While this can be useful, keep in mind that not all pages you want to scrape may be indexed by Google, and even if they are, the content might be outdated.
Below is sample code to scrape a page using the Google Cache version:
import requests
from bs4 import BeautifulSoup
def scrape_google_cache(url):
# Construct the Google Cache URL
cache_url = f"http://webcache.googleusercontent.com/search?q=cache:{url}"
# Send a request to the Google Cache URL
response = requests.get(cache_url)
if response.status_code == 200:
# Parse the content using BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
return f'Request was successful with status code {response.status_code}', f' Title : {soup.title.string}'
else:
return None
# Example usage
cached_content = scrape_google_cache('https://twitchtracker.com/sheschardcore/statistics')
if cached_content:
print()
print(cached_content)
else:
print("Failed to retrieve cached content.")
In the code above:
- We defined a function `scrape_google_cache` that constructs the Google Cache URL for the given webpage.
- The function sends a GET request to the Google Cache URL and checks if the request was successful by checking the status code.
- If the status code is 200, indicating success, the function parses the HTML content of the response using BeautifulSoup and retrieves the page title.
- The function returns a success message with the status code and the page title. If the request fails, the function returns None.
- In the example usage, we called the `scrape_google_cache` function with the TwitchTracker URL and printed the cached content, or an error message if the content couldn't be retrieved.
The screenshot below shows that we successfully retrieved and scraped the content.
Each anti-bot bypass method has its strengths and limitations. Understanding these can help you choose the right approach for your web scraping needs. Here's a comparison of these methods based on performance, ease of implementation, and usability:
| Method | Performance | Ease of Implementation | Usability |
|---|---|---|---|
| Scraping With Vanilla Python | Low | High | Limited - Quickly blocked by sites with anti-bot protections. |
| Optimizing Request Fingerprints | Medium | Medium | Moderate - Useful for sites with less sophisticated anti-bot systems. |
| Rotating Proxy Pools | High | Medium | High - Effective for many sites with IP-based blocking. |
| Using Fortified Headless Browsers | High | Medium | High - Works well for sites with complex anti-bot measures. |
| Using Managed Anti-Bot Bypasses | High | High | Very High - Reliable for scraping sites with advanced protections like DataDome. |
| Scrape Google Cache Version | Medium | Medium | Limited - Only works for cached pages, which may not be up-to-date. |
It is important to note that the effectiveness of these methods depends on the specific anti-bot defenses in place, so if one method doesn't work, you should try another from the list above.
Conclusion
In summary, the best practices for bypassing anti-bot systems really depend on the specific measures you're up against. It's crucial to understand that no single technique is a silver bullet. Instead, adapting and combining these strategies will give you the best chance of successfully scraping websites protected by anti-bot systems.
Experiment with different methods, from optimizing request fingerprints to using fortified headless browsers, and don't hesitate to leverage third-party CAPTCHA-solving services. The more flexible and resourceful you are, the better your results will be.
More Python Web Scraping Guides
Now that you have a solid grasp of various techniques to bypass anti-bot systems using Python, you should have a good understanding of when and how to use each method.
You know how to optimize request fingerprints, utilize CAPTCHA-solving services, and even implement managed anti-bot bypasses to do the heavy lifting for you. You've also learned how to set up proxy connections and use fortified headless browsers to your advantage.
Want to take your scraping skills to the next level? Check out these additional guides: