How To Bypass Anti-Bots in 2024
Over the past few years, there has been growing trend of websites integrating anti-bot or anti-scraping systems on their websites to prevent companies and developers getting automated access to their websites.
Be it Cloudflare, DataDome, PerimeterX, etc. these anti-scraping systems can either completely block developers from websites
There are several approaches you can take to bypassing anti-bot systems, all with their own pros and cons.
In this guide, we're going to discuss the range of options you have as a developer to bypass these anti-bot systems so you can choose the one that works best for you.
- Option #1: Scrape Google Cache Version
- Option #2: Scrape With Fortified Headless Browsers
- Option #3: Open Source Anti-Bot Solvers
- Option #4: Premium Anti-Bot Solvers
- Option #5: Send Requests To Origin Server
- Option #6: Reverse Engineer The Anti-Bot Protection
Need help scraping the web?
Then check out ScrapeOps, the complete toolkit for web scraping.
TLDR: Options To Bypass Anti-Bot Systems
There are numerous ways to bypass a website's anti-bot system. Here are the details on the main bypassing options:
Options | Effectiveness | Reliability | Technical Expertise | Cost |
---|---|---|---|---|
Scrape Google Cache | Sometimes works, but gives old Data | If it works, reliable | Easy to use | Low |
Fortified Headless Browser | Effective | If it works, reliable | Depending on anti-bot, can be tricky | Medium to high cost |
Open Source Anti-Bot Solvers | Often don't work | Unreliable, inevitably get blocked | Need to setup additional server. | Low |
Premium Anti-Bot Solvers | Very effective | Most reliable | Easy to use | Medium to high cost |
Scrape Origin Server | If available, very effective. | Very reliable | Can be hard to find origin server, but easy to scrape afterwards. | Low cost. |
Reverse Engineer Anti-Bot Protection | Effective | Very unreliable | Can be very hard to reverse engineer. | Low cost. |
Which option you choose is heavily dependent on the anti-bot you are trying to bypass, your level of technical expertise, budget and priorities.
The easiest, fastest and most reliable option is to use one of the premium anti-bot solvers listed below or a proxy aggregator like the ScrapeOps Proxy Aggregator that finds the best performing and cheapest bypassing option for your use case.
For example, a user can activate the Cloudflare Bypass by simply adding bypass=cloudflare_level_1
to your API request, and the ScrapeOps proxy will use the best & cheapest Cloudflare bypass available for your target domain.
import requests
response = requests.get(
url='https://proxy.scrapeops.io/v1/',
params={
'api_key': 'YOUR_API_KEY',
'url': 'http://example.com/', ## Cloudflare protected website
'bypass': 'cloudflare_level_1',
},
)
print('Body: ', response.content)
Here is a list of available bypasses:
Bypass | Description |
---|---|
cloudflare_level_1 | Use to bypass Cloudflare protected sites with low security settings enabled. |
cloudflare_level_2 | Use to bypass Cloudflare protected sites with medium security settings enabled. |
cloudflare_level_3 | Use to bypass Cloudflare protected sites with high security settings enabled. |
incapsula | Use to bypass Incapsula protected sites. |
perimeterx | Use to bypass PerimeterX protected sites. |
datadome | Use to bypass DataDome protected sites. |
You can get a ScrapeOps API key with 1,000 free API credits by signing up here.
Option #1: Scrape Google Cache Version
Depending on how fresh your data needs to be, a easy option is to scrape the data from the Google Cache instead of the actual website.
When Google crawls the web to index web pages, it creates a cache of the data it finds. Most anti-bot protected websites let Google crawl their websites so you can scrape this cache instead.
Scraping the Google cache can be easier than scraping a Cloudflare, DataDome, etc. protected website, but it is only a viable option if the data on the website you are looking to scrape doesn't change that often.
To scrape the Google cache simply add https://webcache.googleusercontent.com/search?q=cache:
to the start of the URL you would like to scrape.
For example, if you would like to scrape https://www.petsathome.com/shop/en/pets/dog
then the URL to scrape the Google cache version would be:
'https://webcache.googleusercontent.com/search?q=cache:https://www.petsathome.com/shop/en/pets/dog'
Some websites (like LinkedIn), tell Google to not cache their web pages or Google's crawl frequency is too low meaning some pages mightn't be cached already. So this method doesn't work with every website.
Option #2: Scrape With Fortified Headless Browsers
If scraping the Google Cache version of the website isn't viable for your use case, then another option is to use a headless browser that has been fortified to look like a real user's browser.
Vanilla headless browsers leak their identify in their JS fingerprints which anti-bot systems can easily detect. However, you can patch these leaks so that your headless browser looks more like a real user.
You can fortify the browser yourself or use a fortified version of the headless browser released by the developer community.
Developers have released a number of fortified headless browsers for Puppeteer, Playwright and Selenium that patch the biggest leaks:
- Puppeteer: The puppeteer-extra-plugin-stealth plugin for puppeteer.
- Playwright: The stealth plugin is coming to Playwright soon. Follow developments here and here. However, puppeteer-extra-plugin-stealth plugin is compatible with Playwright too.
- Selenium: The undetected-chromedriver is an optimized version of Selenium Chromedriver that patches the major leaks.
Common Browser Leaks
For example, a commonly known leak present in headless browsers like Puppeteer, Playwright and Selenium is the value of the navigator.webdriver
. In normal browsers, this is set to false
, however, in unfortified headless browsers it is set to true
.
There are over 200 known headless browser leaks which stealth plugins attempt to patch. However, it is believed to be much higher as browsers are constantly changing and it is in browser developers & anti-bot companies interest to not reveal all the leaks they know of.
Headless browser stealth plugins patch a large majority of these browser leaks, and can often bypass a lot of anti-bot services like Cloudflare, PerimeterX, Incapsula, DataDome depending on what security level has been implemented on the website.
However, they don't get them all. To truely make your headless browser appear like a real browser then you will have to do this yourself.
Using Residential & Mobile Proxies
Another way to make your headless browsers more undetectable is to pair them with high-quality residential or mobile proxies. These proxies typically have higher IP address reputation scores than datacenter proxies and anti-bot services are more relucant to block them making them more reliable.
The downside of pairing headless browsers with residential/mobile proxies is that costs can rack up fast.
As residential & mobile proxies are typically charged per GB of bandwidth used and a page rendered with a headless browser can consume 2MB on average (versus 250kb without headless browser). Meaning it can get very expensive as you scale.
The following is an example of using residential proxies from BrightData with a headless browser assuming 2MB per page.
Pages | Bandwidth | Cost Per GB | Total Cost |
---|---|---|---|
25,000 | 50 GB | $13 | $625 |
100,000 | 200 GB | $10 | $2000 |
1 Million | 2TB | $8 | $16,000 |
If you want to compare proxy providers you can use this free proxy comparison tool, which can compare residential proxy plans and mobile proxy plans.
Example: Puppeteer Extra Stealth
Puppeteer-Extra is a light-weight wrapper around Puppeteer that augments its functionality with plugins.
It has a variety of plugins, but one of the most useful is puppeteer-extra-plugin-stealth which uses various anti-detection evasion modules to make your Puppeteer scraper less detectable to anti-bot systems. It is actively maintained and enhanced with new evasion modules by its dedicated open-source community as they encounter and address new bot detection challenges.
To use puppeteer-extra-plugin-stealth, you first need to install Puppeteer along with Puppeteer-Extra and Puppeteer-Extra-Plugin-Stealth, run this command:
npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
Once you've installed all the necessary dependencies, you're ready to proceed. Puppeteer-Extra provides the use()
method to incorporate plugins into your Puppeteer script. Here's a sample code to kickstart your project:
const puppeteer = require('puppeteer-extra')
const StealthPlugin = require('puppeteer-extra-plugin-stealth')()
puppeteer.use(StealthPlugin)
// Your regular Puppeteer code goes here
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('http://example.com/');
const pageContent = await page.evaluate(() => document.body.innerText);
console.log(pageContent);
})();
To use proxies with Puppeteer Stealth then we would do the following:
const puppeteer = require('puppeteer-extra')
const StealthPlugin = require('puppeteer-extra-plugin-stealth')()
puppeteer.use(StealthPlugin)
const proxyUrl = '11.456.448.110:8080';
const launchOptions = {
args: [
`--proxy-server=${proxyUrl}`
]
};
// Your regular Puppeteer code goes here
(async () => {
const browser = await puppeteer.launch(launchOptions);
const page = await browser.newPage();
await page.goto('http://example.com/');
const pageContent = await page.evaluate(() => document.body.innerText);
console.log(pageContent);
})();
By default, Puppeteer leaks a lot of information that anti-bot systems can use to determine if it is an automated browser/scraper or a real user visiting the website.
puppeteer-extra-plugin-stealth fortifies the standard Puppeteer driver by patching the vast majority of the ways anti-bot systems can use to detect your Puppeteer bot/scraper.
Making it much harder for anti-bot systems like DataDome, Imperva, Perimeterx, Botprotect.io and Cloudflare to detect and block your Puppeteer bot/scraper.
For more information about how to use puppeteer-extra-plugin-stealth then check out our guide here.
Example: Selenium Undetected ChromeDriver
For example, here is how you could use Selenium's undetected-chromedriver to scrape a anti-bot protected website.
First, you just need to install the undetected-chromedriver package via pip:
pip install undetected-chromedriver
Now with undetected-chromedriver installed we can setup our scraper/bot to use it instead of the default Chromedriver.
import undetected_chromedriver as uc
driver = uc.Chrome()
driver.get('https://cloudflare.com/')
To enable the use of authenticated proxies, in the below example we will load the undetected_chromedriver
from seleniumwire
instead of directly from the undetected-chromedriver package and pass the proxy settings into the seleniumwire_options
attribute of the Chromedriver.
import seleniumwire.undetected_chromedriver as uc
## Chrome Options
chrome_options = uc.ChromeOptions()
## Proxy Options
proxy_options = {
'proxy': {
'http': 'http://user:pass@ip:port',
'https': 'https://user:pass@ip:port',
'no_proxy': 'localhost,127.0.0.1'
}
}
## Create Chrome Driver
driver = uc.Chrome(
options=chrome_options,
seleniumwire_options=proxy_options
)
driver.get('https://cloudflare.com/')
The standard Selenium ChromeDriver leaks a lot of information that anti-bot systems can use to determine if it is a automated browser/scraper or a real user visiting the website.
The Selenium Undetected ChromeDriver fortifies the standard Selenium ChromeDriver by patching the vast majority of the ways anti-bot systems can use to detect your Selenium bot/scraper.
Making it much harder for anti-bot systems like DataDome, Imperva, Perimeterx, Botprotect.io and Cloudflare to detect and block your Selenium bot/scraper.
For more information about how to use the Selenium Undetected ChromeDriver then check out our guide here.
Option #3: Open Source Anti-Bot Solvers
For popular anti-bots like Cloudflare, some developers have developed and open-sourced anti-bot solvers designed to act as a middle man for your requests and solve the anti-bot challenges.
These anti-bot solvers do work. However, as they are publically available the anti-bot companies often quickly deploy fixes to block these anti-bot solvers so they only have a limited shelf-life if they aren't properly maintained and updated by the developer community.
Cloudflare Solvers
The most common type of anti-bot solvers are designed to bypass Cloudflare:
- cloudscraper Guide here
- cloudflare-scrape
- CloudflareSolverRe
- Cloudflare-IUAM-Solver
- cloudflare-bypass [Archived]
- CloudflareSolverRe
However, they often go out of date and stop working due to Cloudflare updates.
Currently, the best performing Cloudflare solver is FlareSolverr.
FlareSolverr
FlareSolverr is a proxy server you can use to bypass Cloudflare and DDoS-GUARD protection.
When run, FlareSolverr starts a proxy server which forwards your requests to the Cloudflare protected website using puppeteer and the stealth plugin, and waits until the Cloudflare challenge is solved (or timesout) before returning the response and cookies to your scraper.
From here you can use those cookies to bypass Cloudflare using your normal HTTP clients.
The advantage of this approach over using a fortified headless browser for every request is that you only need to use FlareSolverr to retrieve valid Cloudflare cookies and then can continue scraping with much less resource intensive HTTP clients (like Python Requests, HTTPX, Node Axios, etc.).
You can install FlareSolverr on a server using Docker (Firefox browser already included) so it is pretty simple to get setup.
docker run -d \
--name=flaresolverr \
-p 8191:8191 \
-e LOG_LEVEL=info \
--restart unless-stopped \
ghcr.io/flaresolverr/flaresolverr:latest
When run, FlareSolverr starts a server that uses Python Selenium with undetected-chromedriver to solve Cloudflares Javascript and browser fingerprinting challenges by impersonating a real web browser.
FlareSolverr opens the target URL with a Selenium browser and waits until the Cloudflare challenge is solved, before returning the HTML and cookies Cloudflare returns to the browser.
As FlareSolverr is using a Selenium undetected-chromedriver behind the scenes to bypass Cloudflare it can also be used to bypass DataDome in certain situations (might require modifications).
To use FlareSolverr you need to configure your scraper to send the URLs you want to scrape to the FlareSolverr server:
import requests
post_body = {
"cmd": "request.get",
"url":"https://cloudflare.com/",
"maxTimeout": 60000
}
response = requests.post('http://localhost:8191/v1', headers={'Content-Type': 'application/json'}, json=post_body)
print(response.json())
It will then respond with the cookies & the HTML response:
{
"status": "ok",
"message": "Challenge not detected!",
"solution": {
"url": "https://cloudflare.com/",
"status": 200,
"cookies": [
{
"domain": ".cloudflare.com",
"expiry": 1705160731,
"httpOnly": false,
"name": "datadome",
"path": "/",
"sameSite": "Lax",
"secure": true,
"value": "5H6S1eVa4qoqPbzbQxo4fGjFNdeY7ZUE40Qlk0ZQTiLk5b8aqv4nYNE6-JC1MQtUs4k4lBXf-ScmiijLOk1QlolRRVVlUTtc1i_maPBzFSz4AJVtM~_iWqJGNPZpbJge"
}
...
],
"userAgent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
"headers": {},
"response": "<html><head>...</head><body>...</body></html>"
},
"startTimestamp": 1673459546891,
"endTimestamp": 1673459560345,
"version": "3.0.2"
}
Using FlareSolverr for bypassing Cloudflare isn't as reliable as it doesn't detect Cloudflare challenges and bans, so you need to validate the response yourself.
As headless browsers can consume a lot of memory and each request to FlareSolverr launches a new browser window, FlareSolverr can crash your server if you send to many requests to it and your machine doesn't have enough RAM. Therefore you need to throttle the number of requests you send and/or deploy it on a larger server.
Sometimes CloudFlare not only gives mathematical computations and Javascript browser tests to be solved, but sometimes require the user to solve a CAPTCHA. Although FlareSolverr does support CAPTCHA solving via third party CAPTCHA solvers, currently, none of the automated CAPTCHA solving solutions work as Cloudflare uses hCAPTCHA.
Option #4: Premium Anti-Bot Solvers
The downsides with using open source Anti-Bot Solvers and Pre-Fortified Headless Browsers, is that anti-bot companies like Cloudflare and DataDome can see how they bypass their anti-bot protections systems and easily patch the issues that they exploit.
As a result, most open source anti-bot bypasses only have a couple months of shelf life before they stop working.
The alternative to using open source anti-bot bypasses, is to use commercial anti-bot bypasses.
With more and more websites now using anti-bot systems to try prevent scraping, proxy provider have launched a range of premium anti-bot bypassing solutions to help users bypass anti-bots like Cloudflare, DataDome, PerimeterX, etc.
Proxy Provider | Anti-Bot Solution | Pricing Method |
---|---|---|
BrightData | Web Unlocker | Pay per successful request |
Oxylabs | Web Unblocker | Pay per GB |
Smartproxy | Site Unblocker | Pay per GB |
Zyte | Zyte API | Pay per successful request |
ScraperAPI | Ultra Premium | Pay per successful request |
ScrapingBee | Stealth Proxy | Pay per successful request |
Scrapfly | Anti-Scraping Protection | Pay per successful request |
These anti-bot solutions do work, but they can become extremely expensive when used at scale. With prices ranging from $1,000 to $5,000 to scrape 1M pages per month.
ScrapeOps Proxy Aggregator
As part of the ScrapeOps Proxy Aggregator we aggregate these anti-bot bypassing solutions together and find the best performing and cheapest option for your use case.
For example, a user can activate the Cloudflare Bypass by simply adding bypass=cloudflare_level_1
to your API request, and the ScrapeOps proxy will use the best & cheapest Cloudflare bypass available for your target domain.
import requests
response = requests.get(
url='https://proxy.scrapeops.io/v1/',
params={
'api_key': 'YOUR_API_KEY',
'url': 'http://example.com/', ## Cloudflare protected website
'bypass': 'cloudflare_level_1',
},
)
print('Body: ', response.content)
Here is a list of available bypasses:
Bypass | Description |
---|---|
cloudflare_level_1 | Use to bypass Cloudflare protected sites with low security settings enabled. |
cloudflare_level_2 | Use to bypass Cloudflare protected sites with medium security settings enabled. |
cloudflare_level_3 | Use to bypass Cloudflare protected sites with high security settings enabled. |
incapsula | Use to bypass Incapsula protected sites. |
perimeterx | Use to bypass PerimeterX protected sites. |
datadome | Use to bypass DataDome protected sites. |
You can get a ScrapeOps API key with 1,000 free API credits by signing up here.
The advantage of taking this approach is that you can use your normal HTTP client and don't have to worry about:
- Finding origin servers
- Fortifying headless browsers
- Managing numerous headless browser instances & dealing with memory issues
- Reverse engineering the anti-bot protection
As this is all managed within the ScrapeOps Proxy Aggregator.
Option #5: Send Requests To Origin Server
Although it isn't always possible, one of the easiest ways to bypass some anti-bot systems (for example, Cloudflare) is to send the request directly to the websites origin servers IP address instead of to anti-bot's CDN network.
For example, in the case of Cloudflare, instead of having to trick Cloudflare into thinking your requests are from a real user, you instead bypass Cloudflare completely by finding the IP address of the origin server that hosts the website and send your requests to that instead.
Completely bypassing Cloudflare and all its protections!
This approach doesn't work with anti-bots like DataDome or PerimeterX, as this loophole doesn't exist. Their protections are always active on the website and typically they are correctly integrated into the website.
Anti-bot systems are often very sophisticated, however, they are setup by humans who:
- Mightn't fully understand how to anti-bot system works,
- Might cut corners, or
- Might make mistakes when setting up the anti-bot system on their website.
Because of this, sometimes with a bit snooping around you can find the IP address of the server that hosts the master version of the website.
Once you find this IP address, you can configure your scrapers to send the requests to this server instead of anti-bot protected CDN servers.
For example, the origin IP address of PetsAtHome.com, a Cloudflare protected site is publically accessible:
Origin IP Address --> 'http://88.211.26.45/'
Sometimes accessing the website via the origin IP address by inserting it in your browsers address bar won't work, as the server may be expecting a HTTP HOST
header. When this is the case, you can query the origin server with a tool like curl or Postman which allows you to set HOST
headers or add a static mapping to your hosts file.
Finding The IP Address of the Origin Server
There are a number of ways to find the origin IP address of a websites server. Here are the top 3 methods:
Method 1: SSL Certificates
If the target website is using SSL certificates (most sites are), then those SSL certificates are registered in the Censys database.
Although websites have deployed their website onto the Cloudflare CDN, sometimes their current or old SSL certificates are registered to the original server.
You can look up the website in the Censys database and see if any of the these servers host the origin website.
Method 2: DNS Records Of Other Services
Sometimes other subdomains, mail exchanger (MX) servers, FTP/SCP services or hostnames are hosted on the same server as the main website but haven't been protected by the Cloudflare network.
Here you can check the DNS records for other subdomains or A, AAAA, CNAME, and MX DNS records that relieve the IP address of the main server using Censys database or Shodan.
Provided that the website isn't using a 3rd part email provider, one trick is to send a email to a non-existing emaill address at your target website fakeemail@targetwebsite.com
, and assuming the delievery fails you should recieve a notification from the email server which will contain the IP address.
Method 3: Old DNS Records
The DNS history of every server is available on the internet so it is sometimes the case that the website is still being hosted on the same server as it was before they deployed it to the Cloudflare CDN. As a result, you can use a tool like CrimeFlare to find it.
CrimeFlare maintains a database of likely origin servers for websites hosted on Cloudflare, derived from current and old DNS records.
Tools To Help
The following are some of the best tools available to help you find the original IP address of the server:
Sometimes even if you find the actual IP address of the website server it is not possible to access it for example when the websites administrators correctly limits the server to only respond to Cloudflare IP ranges, redirects any requests to the Cloudflare CDN, or if Origin CA certificates are used.
If find what looks like an origin server, it may in fact be a development or staging server for the real website. Although you can never be 100% sure that the server you found is the origin server, if you can browse around, the data looks the same as the Cloudflare protected site, can register an account on the "origin version" and login to the real website with it then it should be okay to treat this website as the real website.
For more information about finding the IP addresses of the origin server check out these guides:
- Bypassing Cloudflare WAF with the origin server IP address
- Introducing CFire: Evading CloudFlare Security Protections
- CloudFlair: Bypassing Cloudflare using Internet-wide scan data
If after all this you can't find the IP address of the origin server, don't worry. There are plenty other ways to bypass Cloudflare protection.
Option #6: Reverse Engineer Anti-Bot Protection
The final and most complex way to bypass the anti-bot protection is to actually reverse engineer anti-bot protection system itself and develop a bypass that passes all anti-bot checks without the need to use a full fortified headless browser instance.
Technically this is possible, but can be extremely complex, time consume and unreliable.
Most anti-bot detection systems can be split into two categories:
- Backend Detection Techniques: These are bot fingerprinting techniques that are performed on the backend server.
- Client-Side Detection Techniques: These are bot fingerprinting techniques that are performed in the users browser (client-side).
Some anti-bot system only use backend detection techniques, but the most sophisticated anti-bot systems like DataDome use bot backend and frontend detection techniques. Making them very difficult to reverse engineer and bypass.
Passing Backend Detection Techniques
The following are some of the common backend bot fingerprinting techniques anti-bot systems often perform on the server side prior to returning a HTML response and how to pass them:
#1: IP Quality
One of the most fundamental tests DataDome conducts is computing an IP address reputation score for the IP addresses you use to send requests. Taking into account factors like is it known to be part of any known bot networks, its location, ISP, reputation history.
To obtain the highest IP address reputation scores you should use residential/mobile proxies over datacenter proxies or any proxies associated with VPNs. However, datacenter proxies can still work if they are high quality.
#2: HTTP Browser Headers
DataDome analyses the HTTP headers you send with your requests and compares them to a database of known browser headers patterns.
Most HTTP clients send user-agents and other headers that clearly identify them by default, so you need to override these headers and use a complete set of browser headers that match the type of browser you want to appear as. In this header optimization guide, we go into detail on how to do this and you can use our free Fake Browser Headers API to generate a list of fake browser headers.
#3: TLS & HTTP/2 Fingerprints
DataDome also uses is TLS & HTTP/2 fingerprinting which is a much more complex anti-bot detection method. Every HTTP request client generates a static TLS and HTTP/2 fingerprint that DataDome can use to determine if the request is coming from a real user or bot.
Different versions of browsers and HTTP clients tend to posess different TLS and HTTP/2 fingerprints which DataDome can then compare to the browser headers you send to match sure that you really are who claim to be in the browser headers you set.
The problem is that faking TLS and HTTP/2 fingerprints is much harder than simply adding fake browser headers to your request. You first need to capture and analyze the packets from the browsers you want to impersonate, then alter the TLS and HTTP/2 fingerprints used to make the request.
However, many HTTP clients like Python Requests don't give you the ability to alter these TLS and HTTP/2 fingerprints. You will need to use programming languages and HTTP client like Golang HTTP or Got which gives you enough low-level control of the request that you can fake the TLS and HTTP/2 fingerprints.
Libraries like CycleTLS, Got Scraping. utls help you spoof TLS/JA3 fingerprints in GO and Javascript.
This is a complicated topic, so I would suggest you dive into how TLS & HTTP/2 fingerprinting works. Here are some resources to help you:
- Passive Fingerprinting of HTTP/2 Clients (Akami White Paper, 2017)
- What happens in a TLS handshake?
- TLS Fingerprinting with JA3 and JA3S
The way anti-bot systems detect your scrapers with these fingerprinting methods is when you make a request using user-agents and browser headers that say you are a Chrome browser, however, your TLS and HTTP/2 fingerprints say you are using the Python Requests HTTP client.
So to trick anti-bot fingerprinting techniques you need to make sure your browser headers, TLS & HTTP/2 fingerprints are all consistent and are telling anti-bot system the request is coming from a real browser.
When you use a automated browser to make the requests then all of this is handled for you. However, it gets quite tricky when you are trying to make requests using a normal HTTP client.
Server-side detection techniques are most anti-bot systems first line of defense. If you fail any of these tests your request will be challenged with a DataDome CAPTCHA page or blocked completely by the anti-bot.
The server-side detection techniques assign your request a risk score which DataDome then uses to determine if client side.
Each individual website can set their own anti-bot protection risk thresholds, to determine who should be challenged and with what challenges (background client-side challenges or CAPTCHAs). So your goal is to obtain the lowest risk score possible. Especially for the most protected websites.
Passing DataDome's Client-Side Detection
Okay, assuming you've been able to build a system to pass all DataDome's server-side anti-bot checks, now you need to deal with its client-side verfication tests.
The following are the main client-side bot fingerprinting techniques DataDome performs in the users browser which you will need to pass:
#1: Browser Web APIs
Modern browsers have hundreds of APIs that allow us as developers to design apps that interact with the users browser. Unfortuntately, when DataDome loads in the users browser it gets access to all these APIs too.
Allowing it to access huge amounts of information about the browser environment, that it can then use to detect scrapers lying about their true identies. For example DataDome can query:
- Browser-Specific APIs: Some web APIs like
window.chrome
only exists on a Chrome browser. So if your browser headers, TLS and HTTP/2 fingerprints all say that you are making a request with a Chrome browser, but thewindow.chrome
API doesn't exist when DataDome checks the browser then it is a clear sign that you are faking your fingerprints. - Automated Browser APIs: Automated browsers like Selenium have APIs like
window.document.__selenium_unwrapped
. If DataDome sees that these APIs exist then it knows you aren't a real user. - Sandbox Browser Emulatator APIs: Sandboxed browser browser emulators like JSDOM, which runs in NodeJs, has the
process
object which only exists in NodeJs. - Environment APIs: If your user-agent is saying you are using a MacOs or Windows machine but the
navigator.platform
value is set toLinux x86_64
, then that makes your request look suspicious.
If you are using a fortified browser it will have fixed a lot of these leaks, however, you will likely have to fix more and make sure that your browser headers and TLS & HTTP/2 fingerprints match the values returned from the browser web APIs.
#2: Canvas Fingerprinting
DataDome uses canvas fingerprinting libraries like WebGL to render an image and create a canvas fingerprint.
Canvas fingerprinting is a technique that allows DataDome to classify the type of device being used (combination of browser, operating system, and graphics hardware of the system).
Canvas fingerprinting is one of the most common browser fingerprinting techniques that uses the HTML5 API to draw graphics and animations of a page with Javascript, which can then be used to product a fingerprint of the device.
You can use BrowserLeaks Live Demo to see your browsers canvas fingerprint.
DataDome maintains a large dataset of legitimate canvas fingerprints and user-agent pairs. So when a request is coming from a user who is claiming to be a Firefox browser running on a Windows machine in their headers, but their canvas fingerprint is saying they are actually a Chrome browser running on a Linux machine then is a sign for DataDome to challenge or block the request.
#3: Event Tracking
If you need to mavigate around or interact with a web page to get the data you need, then you will have to contend with DataDome's event tracking.
DataDome adds event listeners to webpages so that it can monitor user actions like mouse movements, clicks, and key presses. If you have a scraper that need to interacts with a page, but the mouse never moves then it is a clear sign to DataDome that the request is coming from an automated browser and not a real user.
More Web Scraping Guides
So when it comes to bypassing DataDome you have multiple options. Some are pretty quick and easy, others are a lot more complex. Each with their own tradeoffs.
If you would like to learn how to scrape some popular websites then check out our other How To Scrape Guides:
Or if you would like to learn more about web scraping in general, then be sure to check out The Web Scraping Playbook, or check out one of our more in-depth guides: