
How To Bypass DataDome in 2024

DataDome is one of the most sophisticated and hardest-to-bypass anti-scraping protection systems in use today.

Companies like AngelList use DataDome to try to prevent companies and developers from scraping data from their websites.

DataDome's technology is very sophisticated. However, with the right set of tools, you can bypass it and reliably scrape the data you need.

So in this guide, we're going to go through each of the options for bypassing DataDome so you can choose the one that works best for you.

First, let's get a quick overview of what DataDome is.


What is DataDome?

DataDome is an online fraud and bot management firm that provides a suite of tools to protect web applications from online fraud, web scraping, scalping, credential stuffing, account takeover, DDoS attacks, and card fraud.

They use machine learning algorithms and risk scores to analyse request fingerprints and behavioral signals, detecting and blocking bot attacks in real time.

In contrast to other anti-bot solutions like Cloudflare's Bot Management, DataDome's solution isn't a CDN, so you can't bypass it by simply finding the website's origin server.

Instead, you need to optimize your requests so that their fingerprints don't get detected by DataDome's anti-bot system.



Option #1: Scrape Google Cache Version

Depending on how fresh your data needs to be, one option to bypass DataDome is to scrape the data from the Google Cache instead of the actual website.

When Google crawls the web to index web pages, it creates a cache of the data it finds. Most DataDome-protected websites let Google crawl their pages, so you can scrape this cache instead.

Scraping the Google cache can be easier than scraping a DataDome-protected website, but it is only a viable option if the data on the website you are looking to scrape doesn't change that often.

To scrape the Google cache, simply add https://webcache.googleusercontent.com/search?q=cache: to the start of the URL you would like to scrape.

For example, if you would like to scrape https://datadome.co/ then the URL to scrape the Google cache version would be:


'https://webcache.googleusercontent.com/search?q=cache:https://datadome.co/'
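
For reference, here is a minimal sketch of this approach using Python Requests (assuming the target page is present in Google's cache):


import requests

## Build the Google cache URL by prefixing the target URL
target_url = 'https://datadome.co/'
cache_url = 'https://webcache.googleusercontent.com/search?q=cache:' + target_url

response = requests.get(cache_url)
print(response.status_code)
print(response.text[:500])  ## first 500 characters of the cached HTML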

Websites Not Cached

Some websites (like LinkedIn) tell Google not to cache their web pages, and Google's crawl frequency can be too low for some pages to be cached yet. So this method doesn't work with every website.



Option #2: Scrape With Fortified Headless Browsers

If you want to scrape the live website, then one option is to do the entire scraping job with a headless browser that has been fortified to look like a real user's browser.

Vanilla headless browsers leak their identity in their JS fingerprints, which anti-bot systems like DataDome can easily detect. However, developers have released a number of fortified headless browsers that patch the biggest leaks.

For example, a commonly known leak present in headless browsers like Puppeteer, Playwright and Selenium is the value of navigator.webdriver. In normal browsers this is set to false, however, in unfortified headless browsers it is set to true.

[Image: the navigator.webdriver leak in a headless browser]

There are over 200 known headless browser leaks which these stealth plugins attempt to patch. However, the true number is believed to be much higher, as browsers are constantly changing and it is in browser developers' & anti-bot companies' interest not to reveal all the leaks they know of.

Headless browser stealth plugins patch a large majority of these browser leaks, and can often bypass a lot of anti-bot services like DataDome, Incapsula, and Cloudflare, depending on the security level the website has implemented them with.

However, they don't get them all. To truly make your headless browser appear like a real browser, you will have to patch the remaining leaks yourself.

DataDome has much more sophisticated IP address fingerprinting than a lot of other anti-bot solutions, so in most cases you will need to pair the fortified browser with residential/mobile proxies to bypass it.

Residential and mobile proxies typically have higher IP address reputation scores than datacenter proxies, and anti-bot services are more reluctant to block them, making them more reliable.

The downside of pairing headless browsers with residential/mobile proxies is that costs can rack up fast.

As residential & mobile proxies are typically charged per GB of bandwidth used, and a page rendered with a headless browser can consume 2MB on average (versus 250KB without a headless browser), it can get very expensive as you scale.

The following is an example of using residential proxies from BrightData with a headless browser, assuming 2MB per page.

Pages | Bandwidth | Cost Per GB | Total Cost
25,000 | 50 GB | $13 | $625
100,000 | 200 GB | $10 | $2,000
1 Million | 2 TB | $8 | $16,000
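
The arithmetic behind these numbers is just pages × bandwidth per page × price per GB; a quick sketch:


## Rough bandwidth-cost arithmetic behind the table above
pages = 100_000
mb_per_page = 2        ## average page size with a headless browser
cost_per_gb = 10       ## $/GB at this volume tier

bandwidth_gb = pages * mb_per_page / 1000
total_cost = bandwidth_gb * cost_per_gb
print(f'{bandwidth_gb:.0f} GB -> ${total_cost:,.0f}')  ## 200 GB -> $2,000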
Find Cheap Residential & Mobile Proxies

If you want to compare proxy providers you can use this free proxy comparison tool, which can compare residential proxy plans and mobile proxy plans.


Example: Selenium Undetected ChromeDriver

For example, here is how you could use Selenium's undetected-chromedriver to scrape a DataDome protected website.

First, you just need to install the undetected-chromedriver package via pip:


pip install undetected-chromedriver

Now, with undetected-chromedriver installed, we can set up our scraper/bot to use it instead of the default ChromeDriver.


import undetected_chromedriver as uc

## Launch a fortified Chrome instance and request the page
driver = uc.Chrome()
driver.get('https://datadome.co/')

To enable the use of authenticated proxies, in the below example we load undetected_chromedriver from seleniumwire instead of directly from the undetected-chromedriver package, and pass the proxy settings into the seleniumwire_options argument of the Chrome driver.


import seleniumwire.undetected_chromedriver as uc

## Chrome Options
chrome_options = uc.ChromeOptions()

## Proxy Options
proxy_options = {
    'proxy': {
        'http': 'http://user:pass@ip:port',
        'https': 'https://user:pass@ip:port',
        'no_proxy': 'localhost,127.0.0.1'
    }
}

## Create Chrome Driver
driver = uc.Chrome(
    options=chrome_options,
    seleniumwire_options=proxy_options
)

driver.get('https://datadome.co/')

The standard Selenium ChromeDriver leaks a lot of information that anti-bot systems can use to determine if it is an automated browser/scraper or a real user visiting the website.

The Selenium Undetected ChromeDriver fortifies the standard Selenium ChromeDriver by patching the vast majority of the ways anti-bot systems can use to detect your Selenium bot/scraper.

This makes it much harder for anti-bot systems like DataDome, Imperva, PerimeterX, BotProtect.io and Cloudflare to detect and block your Selenium bot/scraper.

For more information about how to use the Selenium Undetected ChromeDriver, check out our guide here.



Option #3: Anti-Bot Solvers

Another way to bypass DataDome is to use anti-bot solvers, tools designed to solve anti-bot challenges for you.

Although there are no anti-bot solvers that have been specifically designed for bypassing DataDome (that we know of, anyway), some of the anti-bot solvers designed for other systems like Cloudflare might work.

For example, FlareSolverr, one of the best performing Cloudflare solvers, can work with DataDome in certain instances.

FlareSolverr is a proxy server that you can use to bypass Cloudflare's anti-bot protection so you can scrape data from websites that have deployed their content on Cloudflare's CDN.

When run, FlareSolverr starts a server that uses Python Selenium with undetected-chromedriver to solve Cloudflare's JavaScript and browser fingerprinting challenges by impersonating a real web browser.

FlareSolverr opens the target URL with a Selenium browser and waits until the Cloudflare challenge is solved, before returning the HTML and cookies Cloudflare returns to the browser.

As FlareSolverr uses Selenium's undetected-chromedriver behind the scenes to bypass Cloudflare, it can also be used to bypass DataDome in certain situations (though it might require modifications).

To use FlareSolverr you need to download and run the docker image which will spin up a FlareSolverr server:


docker run -d \
  --name=flaresolverr \
  -p 8191:8191 \
  -e LOG_LEVEL=info \
  --restart unless-stopped \
  ghcr.io/flaresolverr/flaresolverr:latest

Then configure your scraper to send the URLs you want to scrape to the FlareSolverr server:


import requests

post_body = {
    "cmd": "request.get",
    "url": "https://datadome.co/",
    "maxTimeout": 60000
}

response = requests.post(
    'http://localhost:8191/v1',
    headers={'Content-Type': 'application/json'},
    json=post_body
)

print(response.json())

It will then respond with the cookies & the HTML response:


{
    "status": "ok",
    "message": "Challenge not detected!",
    "solution": {
        "url": "https://datadome.co/",
        "status": 200,
        "cookies": [
            {
                "domain": ".datadome.co",
                "expiry": 1705160731,
                "httpOnly": false,
                "name": "datadome",
                "path": "/",
                "sameSite": "Lax",
                "secure": true,
                "value": "5H6S1eVa4qoqPbzbQxo4fGjFNdeY7ZUE40Qlk0ZQTiLk5b8aqv4nYNE6-JC1MQtUs4k4lBXf-ScmiijLOk1QlolRRVVlUTtc1i_maPBzFSz4AJVtM~_iWqJGNPZpbJge"
            }
            ...
        ],
        "userAgent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
        "headers": {},
        "response": "<html><head>...</head><body>...</body></html>"
    },
    "startTimestamp": 1673459546891,
    "endTimestamp": 1673459560345,
    "version": "3.0.2"
}

Using FlareSolverr to bypass DataDome isn't as reliable, as FlareSolverr doesn't detect DataDome challenges and bans, so you need to validate the response yourself, as sketched below.
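
Here is a minimal sketch of what that validation might look like. The block indicators are assumptions on our part: DataDome CAPTCHA pages typically load their challenge from captcha-delivery.com, so its presence in the HTML is a reasonable (but not guaranteed) signal.


import requests

post_body = {
    "cmd": "request.get",
    "url": "https://datadome.co/",
    "maxTimeout": 60000
}

response = requests.post(
    'http://localhost:8191/v1',
    headers={'Content-Type': 'application/json'},
    json=post_body
)

solution = response.json().get('solution', {})
html = solution.get('response', '')

## Hypothetical DataDome block check - a 403 status or a reference to
## DataDome's CAPTCHA domain in the HTML suggests the request was challenged
blocked = solution.get('status') == 403 or 'captcha-delivery.com' in html

if blocked:
    print('DataDome challenge/ban detected - retry with a new proxy/identity')
else:
    print('Success:', len(html), 'characters of HTML')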

Memory Issues

As headless browsers can consume a lot of memory and each request to FlareSolverr launches a new browser window, FlareSolverr can crash your server if you send too many requests to it and your machine doesn't have enough RAM. Therefore, you need to throttle the number of requests you send and/or deploy it on a larger server.
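
One simple way to throttle is to cap the number of concurrent requests you send to FlareSolverr; a sketch, with the worker count chosen to suit your server's RAM:


from concurrent.futures import ThreadPoolExecutor
import requests

def fetch(url):
    ## Send one URL to the FlareSolverr server and return the parsed response
    return requests.post(
        'http://localhost:8191/v1',
        headers={'Content-Type': 'application/json'},
        json={'cmd': 'request.get', 'url': url, 'maxTimeout': 60000},
    ).json()

urls = ['https://datadome.co/'] * 10

## Cap concurrency so only a few browser windows are open at once
with ThreadPoolExecutor(max_workers=2) as executor:
    for result in executor.map(fetch, urls):
        print(result.get('status'))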



Option #4: Smart Proxy With A DataDome Bypass

The downside of using open source pre-fortified headless browsers and anti-bot solvers is that anti-bot companies like DataDome can see how they bypass their anti-bot protection systems and easily patch the issues that they exploit.

As a result, most open source DataDome bypasses only have a couple of months of shelf life before they stop working.

The alternative to using open source DataDome bypasses is to use smart proxies that develop and maintain their own private DataDome bypasses.

These are typically more reliable, as it is harder for DataDome to develop patches for them, and they are developed by proxy companies who are financially motivated to stay one step ahead of DataDome and fix their bypasses the minute they stop working.

One of the best options is to use the ScrapeOps Proxy Aggregator as it integrates over 20 proxy providers into the same proxy API, and finds the best/cheapest proxy provider for your target domains.

You can activate ScrapeOps' DataDome Bypass by simply adding bypass=datadome to your API request, and the ScrapeOps proxy will use the best & cheapest DataDome bypass available for your target domain.


import requests

response = requests.get(
    url='https://proxy.scrapeops.io/v1/',
    params={
        'api_key': 'YOUR_API_KEY',
        'url': 'https://datadome.co/',  ## DataDome protected website
        'bypass': 'datadome',
    },
)

print('Body: ', response.content)

You can get a ScrapeOps API key with 1,000 free API credits by signing up here.

The advantage of taking this approach is that you can use your normal HTTP client and don't have to worry about:

  • Fortifying headless browsers
  • Managing numerous headless browser instances & dealing with memory issues
  • Reverse engineering DataDome's anti-bot protection

As this is all managed within the ScrapeOps Proxy Aggregator.



Option #5: Reverse Engineer DataDome's Anti-Bot Protection

The final and most complex way to bypass DataDome's anti-bot protection is to actually reverse engineer its anti-bot protection system and develop a bypass that passes all of DataDome's anti-bot checks without the need for a full fortified headless browser instance.

Technically this is possible, but unlike reverse engineering other anti-bot systems like Cloudflare and PerimeterX, reverse engineering DataDome is next to impossible as it is much more sophisticated.

DataDome's bot detection system can be split into two categories:

  • Backend Detection Techniques: These are bot fingerprinting techniques that are performed on the backend server.
  • Client-Side Detection Techniques: These are bot fingerprinting techniques that are performed in the user's browser (client-side).

To bypass DataDome you must pass both sets of verification tests.


Passing DataDome's Backend Detection Techniques

The following are the known backend bot fingerprinting techniques DataDome performs on the server side, prior to returning an HTML response, and how to pass them:

#1: IP Quality

One of the most fundamental tests DataDome conducts is computing an IP address reputation score for the IP addresses you use to send requests. This takes into account factors like whether the IP is part of any known bot network, its location, its ISP, and its reputation history.

To obtain the highest IP address reputation scores you should use residential/mobile proxies over datacenter proxies or any proxies associated with VPNs. However, datacenter proxies can still work if they are high quality.

#2: HTTP Browser Headers

DataDome analyses the HTTP headers you send with your requests and compares them to a database of known browser header patterns.

Most HTTP clients send user-agents and other headers that clearly identify them by default, so you need to override these headers and use a complete set of browser headers that match the type of browser you want to appear as. In this header optimization guide, we go into detail on how to do this, and you can use our free Fake Browser Headers API to generate a list of fake browser headers.
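
For illustration, here is a sketch of sending a Chrome-like header set with Python Requests. The specific header values below are examples and should be replaced with values that exactly match a real browser version:


import requests

## Illustrative Chrome-like header set - values are examples only
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'accept-language': 'en-US,en;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'upgrade-insecure-requests': '1',
    'sec-fetch-site': 'none',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'sec-fetch-dest': 'document',
}

response = requests.get('https://datadome.co/', headers=headers)
print(response.status_code)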

#3: TLS & HTTP/2 Fingerprints

DataDome also uses TLS & HTTP/2 fingerprinting, which is a much more complex anti-bot detection method. Every HTTP client generates a static TLS and HTTP/2 fingerprint that DataDome can use to determine if the request is coming from a real user or a bot.

Different versions of browsers and HTTP clients tend to possess different TLS and HTTP/2 fingerprints, which DataDome can then compare to the browser headers you send to make sure that you really are who you claim to be in the browser headers you set.

The problem is that faking TLS and HTTP/2 fingerprints is much harder than simply adding fake browser headers to your request. You first need to capture and analyze the packets from the browsers you want to impersonate, then alter the TLS and HTTP/2 fingerprints used to make the request.

However, many HTTP clients like Python Requests don't give you the ability to alter these TLS and HTTP/2 fingerprints. You will need to use programming languages and HTTP clients like Golang HTTP or Got, which give you enough low-level control of the request that you can fake the TLS and HTTP/2 fingerprints.

Libraries like CycleTLS, Got Scraping and utls help you spoof TLS/JA3 fingerprints in Go and JavaScript.
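
In Python, one option (our suggestion, not mentioned above) is the curl_cffi package, which can impersonate the TLS and HTTP/2 fingerprints of real browsers:


## pip install curl_cffi
from curl_cffi import requests

## Make the request with Chrome's TLS & HTTP/2 fingerprint so the
## low-level fingerprint matches the browser we claim to be
response = requests.get(
    'https://datadome.co/',
    impersonate='chrome110',  ## impersonate a specific Chrome version
)
print(response.status_code)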

This is a complicated topic, so I would suggest you dive into how TLS & HTTP/2 fingerprinting works. Here are some resources to help you:

Important: Matching Browser Headers, TLS & HTTP/2 Fingerprints

DataDome detects your scrapers with these fingerprinting methods by spotting mismatches: for example, your user-agents and browser headers say you are a Chrome browser, but your TLS and HTTP/2 fingerprints say you are using the Python Requests HTTP client.

So to trick DataDome's fingerprinting techniques, you need to make sure your browser headers and TLS & HTTP/2 fingerprints are all consistent and are telling DataDome the request is coming from a real browser.

When you use an automated browser to make the requests, all of this is handled for you. However, it gets quite tricky when you are trying to make requests using a normal HTTP client.

DataDome's server-side detection techniques are its first line of defence. If you fail any of these tests, your request will be challenged with a DataDome CAPTCHA page or blocked completely.

The server-side detection techniques assign your request a risk score, which DataDome then uses to determine whether and how to challenge you on the client side.

Each individual website can set its own anti-bot protection risk thresholds to determine who should be challenged and with what challenges (background client-side challenges or CAPTCHAs). So your goal is to obtain the lowest risk score possible, especially for the most protected websites.


Passing DataDome's Client-Side Detection

Okay, assuming you've been able to build a system to pass all of DataDome's server-side anti-bot checks, now you need to deal with its client-side verification tests.

The following are the main client-side bot fingerprinting techniques DataDome performs in the user's browser, which you will need to pass:

#1: Browser Web APIs

Modern browsers have hundreds of APIs that allow us as developers to design apps that interact with the user's browser. Unfortunately, when DataDome loads in the user's browser, it gets access to all these APIs too.

This gives it access to huge amounts of information about the browser environment, which it can then use to detect scrapers lying about their true identities. For example, DataDome can query:

  1. Browser-Specific APIs: Some web APIs like window.chrome only exist on a Chrome browser. So if your browser headers, TLS and HTTP/2 fingerprints all say that you are making a request with a Chrome browser, but the window.chrome API doesn't exist when DataDome checks the browser, then it is a clear sign that you are faking your fingerprints.
  2. Automated Browser APIs: Automated browsers like Selenium have APIs like window.document.__selenium_unwrapped. If DataDome sees that these APIs exist then it knows you aren't a real user.
  3. Sandboxed Browser Emulator APIs: Sandboxed browser emulators like JSDOM, which runs in Node.js, have the process object, which only exists in Node.js.
  4. Environment APIs: If your user-agent is saying you are using a MacOs or Windows machine but the navigator.platform value is set to Linux x86_64, then that makes your request look suspicious.

If you are using a fortified browser, it will have fixed a lot of these leaks; however, you will likely have to fix more and make sure that your browser headers and TLS & HTTP/2 fingerprints match the values returned from the browser web APIs. The sketch below shows how to inspect a few of these values yourself.
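
Here is a minimal sketch (reusing undetected-chromedriver from Option #2) that queries a few of the same web APIs an anti-bot script can see, so you can check them for consistency:


import undetected_chromedriver as uc

driver = uc.Chrome()
driver.get('https://datadome.co/')

## Query the same browser web APIs an anti-bot script can see
checks = driver.execute_script("""
    return {
        webdriver: navigator.webdriver,  // should be false/undefined
        hasChrome: typeof window.chrome !== 'undefined',  // true on real Chrome
        platform: navigator.platform,  // should match the user-agent's OS
        userAgent: navigator.userAgent
    };
""")
print(checks)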

#2: Canvas Fingerprinting

DataDome uses browser rendering APIs like Canvas and WebGL to render an image and create a canvas fingerprint.

Canvas fingerprinting is a technique that allows DataDome to classify the type of device being used (combination of browser, operating system, and graphics hardware of the system).

Canvas fingerprinting is one of the most common browser fingerprinting techniques. It uses the HTML5 canvas API to draw graphics and animations on a page with JavaScript, which can then be used to produce a fingerprint of the device.

Check Out Your Canvas Fingerprint

You can use the BrowserLeaks Live Demo to see your browser's canvas fingerprint.

DataDome maintains a large dataset of legitimate canvas fingerprint and user-agent pairs. So when a request comes from a user claiming to be a Firefox browser running on a Windows machine in their headers, but their canvas fingerprint says they are actually a Chrome browser running on a Linux machine, that is a sign for DataDome to challenge or block the request.
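
To make this concrete, here is a rough sketch of what a canvas fingerprinting script does under the hood (not DataDome's actual script): draw to a canvas with JavaScript, then hash the rendered pixels.


import hashlib
import undetected_chromedriver as uc

driver = uc.Chrome()
driver.get('https://datadome.co/')

## Draw to a canvas the way fingerprinting scripts typically do,
## then hash the rendered pixels to get a device fingerprint
data_url = driver.execute_script("""
    const canvas = document.createElement('canvas');
    const ctx = canvas.getContext('2d');
    ctx.textBaseline = 'top';
    ctx.font = "14px 'Arial'";
    ctx.fillStyle = '#f60';
    ctx.fillRect(125, 1, 62, 20);
    ctx.fillStyle = '#069';
    ctx.fillText('Hello, world!', 2, 15);
    return canvas.toDataURL();
""")
print(hashlib.md5(data_url.encode()).hexdigest())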

#3: Event Tracking

If you need to navigate around or interact with a web page to get the data you need, then you will have to contend with DataDome's event tracking.

DataDome adds event listeners to webpages so that it can monitor user actions like mouse movements, clicks, and key presses. If you have a scraper that needs to interact with a page, but the mouse never moves, then it is a clear sign to DataDome that the request is coming from an automated browser and not a real user.


More Web Scraping Guides

So when it comes to bypassing DataDome, you have multiple options. Some are pretty quick and easy, others are a lot more complex, each with their own tradeoffs.

If you would like to learn how to scrape some popular websites then check out our other How To Scrape Guides:

Or if you would like to learn more about web scraping in general, then be sure to check out The Web Scraping Playbook, or check out one of our more in-depth guides: