How To Bypass Cloudflare in 2022

With an estimated 40% of websites using Cloudflare's Content Delivery Network (CDN), bypassing Cloudflare's anti-bot protection system has become a big requirement for developers looking to scrape some of the most popular websites on the internet.

Luckily for us, bypassing Cloudflare's anti-bot protection is possible. However, it isn't an easy task.

There are a number of approaches you can take to bypass Cloudflare, all with their own pros and cons.

They range from the easy, like using off-the-shelf tools, to the extremely complex, like completely reverse engineering how Cloudflare detects and blocks scrapers.

So in this guide, we're going to go through each of those options so you can choose the one that works best for you.


Option #1: Send Requests To Origin Server

It isn't always possible, but one of the easiest ways to bypass Cloudflare is to send your requests directly to the IP address of the website's origin server instead of to Cloudflare's CDN.

Here, instead of having to trick Cloudflare into thinking your requests are from a real user, you find the IP address of the origin server that hosts the website and send your requests straight to it.

That bypasses Cloudflare and all its protections completely!

Cloudflare is a sophisticated anti-bot protection system, but it is set up by humans who:

  1. Mightn't fully understand Cloudflare,
  2. Might cut corners, or
  3. Might make mistakes when setting up their website on Cloudflare.

Because of this, sometimes with a bit of snooping around you can find the IP address of the server that hosts the master version of the website.

Once you find this IP address, you can configure your scrapers to send their requests to this server instead of Cloudflare's servers, which have the anti-bot protection active.

For example, the origin IP address of PetsAtHome.com, a Cloudflare-protected site, is publicly accessible:


Origin IP Address --> 'http://88.211.26.45/'

Accessing The Website

Sometimes accessing the website via the origin IP address by entering it in your browser's address bar won't work, as the server may be expecting an HTTP Host header. When this is the case, you can either query the origin server with a tool like curl or Postman, which let you set the Host header, or add a static mapping to your hosts file.
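
The Host header approach can be sketched in Python as follows. This is a minimal example using only the standard library; the hostname www.petsathome.com is assumed from the example URLs in this guide, and the helper name build_origin_request is purely illustrative. The request is only built here, not sent.

```python
import urllib.request


def build_origin_request(origin_ip: str, hostname: str, path: str = "/") -> urllib.request.Request:
    """Build a request aimed at the origin IP that still names the real site in the Host header."""
    url = f"http://{origin_ip}{path}"
    # The Host header tells the origin server which virtual host we want,
    # even though the URL itself only contains the bare IP address.
    return urllib.request.Request(url, headers={"Host": hostname})


request = build_origin_request("88.211.26.45", "www.petsathome.com")
print(request.full_url, request.get_header("Host"))

# To actually send it (network access required):
# with urllib.request.urlopen(request, timeout=10) as response:
#     html = response.read()
```

Whether the origin server answers at all depends on how it has been configured, so treat a timeout or error here as a normal outcome.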


Finding The IP Address of the Origin Server

There are a number of ways to find the origin IP address of a website's server. Here are the top 3 methods:

Method 1: SSL Certificates

If the target website is using SSL certificates (most sites are), then those certificates are likely to be indexed in the Censys database.

Although the website has been deployed onto the Cloudflare CDN, sometimes its current or old SSL certificates are still registered to the original server.

You can look up the website in the Censys database and see if any of the servers listed host the origin website.

Method 2: DNS Records Of Other Services

Sometimes other subdomains, mail exchanger (MX) servers, FTP/SCP services or hostnames are hosted on the same server as the main website but haven't been protected by the Cloudflare network.

Here you can check the DNS records for other subdomains, or any A, AAAA, CNAME, and MX records, that reveal the IP address of the main server using the Censys database or Shodan.
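
Once you have a list of hostnames and the IP addresses they resolve to, a quick way to spot candidates is to filter out anything that sits inside Cloudflare's own address space. The sketch below assumes a hard-coded, partial snapshot of Cloudflare's published IPv4 ranges (check Cloudflare's current list before relying on it), and the hostnames and lookup results in the example are hypothetical.

```python
import ipaddress

# Assumption: a partial snapshot of Cloudflare's published IPv4 ranges.
# The authoritative list is maintained by Cloudflare and may have changed.
CLOUDFLARE_IPV4_RANGES = [
    "173.245.48.0/20", "103.21.244.0/22", "103.22.200.0/22", "103.31.4.0/22",
    "141.101.64.0/18", "108.162.192.0/18", "190.93.240.0/20", "188.114.96.0/20",
    "197.234.240.0/22", "198.41.128.0/17", "162.158.0.0/15", "104.16.0.0/13",
    "104.24.0.0/14", "172.64.0.0/13", "131.0.72.0/22",
]
_NETWORKS = [ipaddress.ip_network(cidr) for cidr in CLOUDFLARE_IPV4_RANGES]


def is_cloudflare_ip(ip: str) -> bool:
    """Return True if the address falls inside one of the listed Cloudflare ranges."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in _NETWORKS)


def non_cloudflare_records(records: dict[str, str]) -> dict[str, str]:
    """Keep only the hostnames whose IP is NOT behind Cloudflare (origin candidates)."""
    return {host: ip for host, ip in records.items() if not is_cloudflare_ip(ip)}


# Hypothetical lookup results (hostname -> resolved IPv4 address).
lookups = {
    "www.targetwebsite.com": "104.16.0.1",
    "mail.targetwebsite.com": "203.0.113.10",
}
print(non_cloudflare_records(lookups))
```

Any hostname that survives the filter is only a candidate; you still need to confirm it actually serves the target website.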

Provided that the website isn't using a third-party email provider, one trick is to send an email to a non-existent email address at your target website (e.g. fakeemail@targetwebsite.com). Assuming the delivery fails, you should receive a bounce notification from the email server which will contain its IP address.
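
If you save the bounce notification with its full headers, the IP addresses can be pulled out of the Received headers. The following is a rough sketch using Python's standard email module; the sample message, hostnames and the 203.0.113.10 address are made up for illustration, and real bounce messages vary in format.

```python
import re
from email import message_from_string

# Matches an IPv4 address written in square brackets, as commonly seen in Received headers.
IPV4_IN_BRACKETS = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")


def received_ips(raw_email: str) -> list[str]:
    """Collect the bracketed IPv4 addresses found in every Received header."""
    message = message_from_string(raw_email)
    ips = []
    for header in message.get_all("Received", []):
        ips.extend(IPV4_IN_BRACKETS.findall(header))
    return ips


# Hypothetical bounce message, trimmed to the relevant headers.
SAMPLE_BOUNCE = (
    "Received: from mail.targetwebsite.com (mail.targetwebsite.com [203.0.113.10])\n"
    "\tby mx.example.net with ESMTP; Sat, 1 Jan 2022 00:00:00 +0000\n"
    "From: MAILER-DAEMON@targetwebsite.com\n"
    "Subject: Undelivered Mail Returned to Sender\n"
    "\n"
    "The address fakeemail@targetwebsite.com does not exist.\n"
)

print(received_ips(SAMPLE_BOUNCE))
```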

Method 3: Old DNS Records

DNS history is recorded by a number of services on the internet, and it is sometimes the case that the website is still hosted on the same server as it was before it was deployed to the Cloudflare CDN. As a result, you can use a tool like CrimeFlare to find it.

CrimeFlare maintains a database of likely origin servers for websites hosted on Cloudflare, derived from current and old DNS records.

Tools To Help

The following are some of the best tools available to help you find the original IP address of the server:

Sometimes, even if you find the actual IP address of the website's server, it is not possible to access it. For example, the website's administrators may have correctly limited the server to only respond to Cloudflare IP ranges, redirected all requests to the Cloudflare CDN, or used Origin CA certificates.

Staging & Development Servers

If you find what looks like an origin server, it may in fact be a development or staging server for the real website. Although you can never be 100% sure that the server you found is the origin server, if you can browse around, the data looks the same as on the Cloudflare-protected site, and you can register an account on the "origin version" and log in to the real website with it, then it should be okay to treat this server as the real website.

For more information about finding the IP addresses of the origin server check out these guides:

If after all this you can't find the IP address of the origin server, don't worry. There are plenty of other ways to bypass Cloudflare protection.


Option #2: Scrape Google Cache Version

Depending on how fresh your data needs to be, another option is to scrape the data from the Google Cache instead of the actual website.

When Google crawls the web to index web pages, it creates a cache of the data it finds. Most Cloudflare protected websites let Google crawl their websites so you can scrape this cache instead.

Scraping the Google cache can be easier than scraping a Cloudflare protected website, but it is only a viable option if the data on the website you are looking to scrape doesn't change that often.

To scrape the Google cache simply add https://webcache.googleusercontent.com/search?q=cache: to the start of the URL you would like to scrape.

For example, if you would like to scrape https://www.petsathome.com/shop/en/pets/dog then the URL to scrape the Google cache version would be:


'https://webcache.googleusercontent.com/search?q=cache:https://www.petsathome.com/shop/en/pets/dog'
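
As a minimal sketch, the prefixing step looks like this in Python. The function name google_cache_url is illustrative; the prefix is the one given above.

```python
GOOGLE_CACHE_PREFIX = "https://webcache.googleusercontent.com/search?q=cache:"


def google_cache_url(url: str) -> str:
    """Return the Google cache URL for a page by prepending the cache prefix."""
    return GOOGLE_CACHE_PREFIX + url


print(google_cache_url("https://www.petsathome.com/shop/en/pets/dog"))
```

You would then request the returned URL with your normal HTTP client instead of the original page URL.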

Websites Not Cached

Some websites (like LinkedIn) tell Google not to cache their web pages, and for others Google's crawl frequency is too low, meaning some pages mightn't be cached yet. So this method doesn't work with every website.


Option #3: Cloudflare Solvers

Okay, if you can't find the origin server and using the Google Cache isn't an option for you, then you need to bypass Cloudflare directly.

One way to bypass Cloudflare is to use one of a number of Cloudflare solvers that solve the challenges shown on the Cloudflare challenge page.

A number of Cloudflare solvers have been developed. However, they often go out of date and stop working due to Cloudflare updates.

Currently, the best performing Cloudflare solver is FlareSolverr.

FlareSolverr

FlareSolverr is a proxy server you can use to bypass Cloudflare and DDoS-GUARD protection.

When run, FlareSolverr starts a proxy server which forwards your requests to the Cloudflare-protected website using puppeteer and the stealth plugin, and waits until the Cloudflare challenge is solved (or times out) before returning the response and cookies to your scraper.

From here you can use those cookies to bypass Cloudflare using your normal HTTP clients.

The advantage of this approach over using a fortified headless browser for every request is that you only need to use FlareSolverr to retrieve valid Cloudflare cookies and then can continue scraping with much less resource intensive HTTP clients (like Python Requests, HTTPX, Node Axios, etc.).

You can install FlareSolverr on a server using Docker (Firefox browser already included), so it is pretty simple to get set up.
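
The cookie hand-off described above could look roughly like the sketch below. It assumes FlareSolverr is listening on its default local endpoint (http://localhost:8191/v1) and that the response follows the shape described in its README (a solution object with cookies and userAgent); check both against the version you run. The helper names are illustrative.

```python
import json
import urllib.request

# Assumption: FlareSolverr's default local endpoint.
FLARESOLVERR_URL = "http://localhost:8191/v1"


def build_payload(target_url: str, max_timeout_ms: int = 60000) -> dict:
    """JSON body asking FlareSolverr to load a page and wait for the challenge to clear."""
    return {"cmd": "request.get", "url": target_url, "maxTimeout": max_timeout_ms}


def solve(target_url: str) -> dict:
    """Send the request to FlareSolverr (needs a running instance) and return its JSON reply."""
    body = json.dumps(build_payload(target_url)).encode("utf-8")
    request = urllib.request.Request(
        FLARESOLVERR_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=90) as response:
        return json.load(response)


def session_headers(result: dict) -> dict:
    """Turn the solved cookies and user-agent into headers for a lightweight HTTP client."""
    solution = result["solution"]
    cookie_header = "; ".join(f"{c['name']}={c['value']}" for c in solution["cookies"])
    return {"User-Agent": solution["userAgent"], "Cookie": cookie_header}


# Hypothetical reply, trimmed to the fields used above.
sample_result = {
    "status": "ok",
    "solution": {
        "userAgent": "Mozilla/5.0 (X11; Linux x86_64)",
        "cookies": [{"name": "cf_clearance", "value": "abc123"}],
    },
}
print(session_headers(sample_result))
```

Reusing the same user-agent (and, in practice, the same IP address) as the browser that solved the challenge matters, as the cookies are generally not accepted from a client that looks different.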

Memory Issues

As headless browsers can consume a lot of memory and each request to FlareSolverr launches a new browser window, FlareSolverr can crash your server if you send too many requests to it and your machine doesn't have enough RAM. Therefore you need to throttle the number of requests you send and/or deploy it on a larger server.
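
One simple way to throttle is to cap how many requests are in flight at once. The sketch below uses a thread pool with a small worker limit; fake_solve is a stand-in for whatever function actually calls FlareSolverr, and the limit of 2 is an arbitrary example value you would tune to your server's RAM.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_throttled(urls, solve_fn, max_concurrent=2):
    """Run solve_fn over urls with at most max_concurrent calls in flight at any time."""
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        # pool.map preserves the order of the input URLs in the results.
        return list(pool.map(solve_fn, urls))


# Stand-in solver that records how many calls overlap (no network involved).
stats = {"active": 0, "peak": 0}
stats_lock = threading.Lock()


def fake_solve(url):
    with stats_lock:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
    time.sleep(0.01)  # pretend the browser is busy solving the challenge
    with stats_lock:
        stats["active"] -= 1
    return f"solved:{url}"


pages = [f"https://example.com/page/{i}" for i in range(6)]
results = run_throttled(pages, fake_solve, max_concurrent=2)
print(results, "peak concurrency:", stats["peak"])
```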

Sometimes Cloudflare not only presents mathematical computations and JavaScript browser tests to be solved, but also requires the user to solve a CAPTCHA. Although FlareSolverr does support CAPTCHA solving via third-party CAPTCHA solvers, currently none of the automated CAPTCHA solving solutions work, as Cloudflare uses hCaptcha.


Option #4: Scrape With Fortified Headless Browsers

The other option is to do the entire scraping job with a headless browser that has been fortified to look like a real user's browser.

Vanilla headless browsers leak their identity in their JS fingerprints, which anti-bot systems can easily detect. However, developers have released a number of fortified headless browsers that patch the biggest leaks.

For example, a commonly known leak present in browsers automated with tools like Puppeteer, Playwright and Selenium is the value of navigator.webdriver. In normal browsers this is set to false, whereas in unfortified headless browsers it is set to true.

There are over 200 known headless browser leaks which these stealth plugins attempt to patch. However, the true number is believed to be much higher, as browsers are constantly changing and it is in the interest of browser developers and anti-bot companies not to reveal all the leaks they know of.

Headless browser stealth plugins patch a large majority of these browser leaks, and can often bypass anti-bot services like Cloudflare, PerimeterX, Incapsula and DataDome, depending on the security level they have been configured with on the website.

However, they don't get them all. To truly make your headless browser appear like a real browser, you will have to patch the remaining leaks yourself.

Another way to make your headless browsers more undetectable is to pair them with high-quality residential or mobile proxies. These proxies typically have higher IP address reputation scores than datacenter proxies, and anti-bot services are more reluctant to block them, making them more reliable.

The downside of pairing headless browsers with residential/mobile proxies is that costs can rack up fast.

This is because residential and mobile proxies are typically charged per GB of bandwidth used, and a page rendered with a headless browser can consume 2MB on average (versus 250KB without a headless browser). This means it can get very expensive as you scale.

The following is an example of using residential proxies from BrightData with a headless browser assuming 2MB per page.

Pages       | Bandwidth | Cost Per GB | Total Cost
25,000      | 50 GB     | $13         | $625
100,000     | 200 GB    | $10         | $2,000
1 Million   | 2 TB      | $8          | $16,000
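
The arithmetic behind these estimates is simple enough to script for your own numbers. The sketch below assumes decimal units (1 GB = 1,000 MB) and uses the per-GB prices from the 100,000 and 1 million page rows; the function name is illustrative.

```python
def bandwidth_cost(pages: int, mb_per_page: float, price_per_gb: float) -> tuple[float, float]:
    """Return (bandwidth in GB, total cost in dollars) for a scraping job."""
    gigabytes = pages * mb_per_page / 1000  # decimal GB: 1 GB = 1,000 MB
    return gigabytes, gigabytes * price_per_gb


# Headless browser at roughly 2MB per page.
print(bandwidth_cost(100_000, 2, 10))
print(bandwidth_cost(1_000_000, 2, 8))

# The same 100,000 pages with a plain HTTP client at roughly 250KB (0.25MB) per page.
print(bandwidth_cost(100_000, 0.25, 10))
```

Comparing the first and last calls shows why the choice between a headless browser and a plain HTTP client dominates the proxy bill at scale.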
Find Cheap Residential & Mobile Proxies

If you want to compare proxy providers you can use this free proxy comparison tool, which can compare residential proxy plans and mobile proxy plans.


Option #5: Smart Proxy With Cloudflare Built-In Bypass

The downside of using open source Cloudflare solvers and pre-fortified headless browsers is that anti-bot companies like Cloudflare can see how they bypass their anti-bot protection systems and easily patch the issues that they exploit.

As a result, most open source Cloudflare bypasses only have a couple of months of shelf life before they stop working.

The alternative to using open source Cloudflare bypasses is to use smart proxies that develop and maintain their own private Cloudflare bypasses.

These are typically more reliable, as it is harder for Cloudflare to develop patches for them, and they are developed by proxy companies who are financially motivated to stay one step ahead of Cloudflare and fix their bypasses as soon as they stop working.

Most smart proxy providers (ScraperAPI, Scrapingbee, Oxylabs, Smartproxy) have some form of Cloudflare bypass; these work to varying degrees and vary in cost.

However, one of the best options is to use the ScrapeOps Proxy Aggregator as it integrates over 20 proxy providers into the same proxy API, and finds the best/cheapest proxy provider for your target domains.

You can activate ScrapeOps' Cloudflare Bypass by simply adding bypass=cloudflare to your API request, and the ScrapeOps proxy will use the best & cheapest Cloudflare bypass available for your target domain.


import requests

response = requests.get(
    url='https://proxy.scrapeops.io/v1/',
    params={
        'api_key': 'YOUR_API_KEY',
        'url': 'http://example.com/',  ## Cloudflare protected website
        'bypass': 'cloudflare',
    },
)

print('Body: ', response.content)

You can get a ScrapeOps API key with 1,000 free API credits by signing up here.

The advantage of taking this approach is that you can use your normal HTTP client and don't have to worry about:

  • Finding origin servers
  • Fortifying headless browsers
  • Managing numerous headless browser instances & dealing with memory issues
  • Reverse engineering the Cloudflare anti-bot protection

This is all managed within the ScrapeOps Proxy Aggregator.


Option #6: Reverse Engineer Cloudflare Anti-Bot Protection

The final and most complex way to bypass the Cloudflare anti-bot protection is to actually reverse engineer Cloudflare's anti-bot protection system and develop a bypass that passes all of Cloudflare's anti-bot checks without the need to use a full fortified headless browser instance.

This approach works (and is what many smart proxy solutions do), however, it is not for the faint-hearted.

Advantages: If you are scraping at large scale and don't want to run hundreds (if not thousands) of costly full headless browser instances, you can instead develop the most resource-efficient Cloudflare bypass possible: one that is solely designed to pass the Cloudflare JS, TLS and IP fingerprint tests.

Disadvantages: You will have to dive deep into an anti-bot system that has been made purposely hard to understand from the outside, and split test different techniques to trick its verification system. Then you will have to maintain this system as Cloudflare continues to develop its anti-bot protection.

It is possible to do this, but I would only recommend taking this approach if either:

  1. You are genuinely interested in the intellectual challenge of reverse engineering a sophisticated anti-bot system, or
  2. The economic returns from having a more cost-effective Cloudflare bypass warrant the days or weeks of engineering time that you will have to devote to building and maintaining it.

For companies scraping at very large volumes (500M+ pages per month) or smart proxy solutions whose businesses depend on cost-effective ways to access sites, building your own custom Cloudflare bypass might be a good option.

For most other developers, you are probably better off using one of the other five Cloudflare bypassing methods.

For those of you who do want to take the plunge, the following is a rundown of how Cloudflare's Web Application Firewall (WAF) works and how you can approach bypassing it.


Understanding Cloudflare's Bot Manager

When we say we want to bypass Cloudflare, what we really mean is that we want to bypass their Bot Manager, which is part of their Web Application Firewall (WAF): a system designed to mitigate attacks from malicious bots without impacting real users.

Cloudflare's bot detection techniques can be split into two categories:

  • Backend Detection Techniques: These are bot fingerprinting techniques that are performed on the backend server.
  • Client-Side Detection Techniques: These are bot fingerprinting techniques that are performed in the user's browser (client-side).

To bypass Cloudflare you must pass both sets of verification tests.


Passing Cloudflare's Backend Detection Techniques

The following are the known backend bot fingerprinting techniques Cloudflare performs on the server side and how to pass them:

#1: Proxy Quality

One of the most basic tests Cloudflare conducts is computing an IP address reputation score for the IP addresses you use to send requests. This takes into account factors like whether the IP is known to be part of any bot networks, its location, its ISP and its reputation history.

To obtain the highest IP address reputation scores you should use residential/mobile proxies over datacenter proxies or any proxies associated with VPNs. However, datacenter proxies can still work if they are high quality.

#2: HTTP Browser Headers

Cloudflare also analyses the HTTP headers you send with your requests and compares them to a database of known browser header patterns.

Most HTTP clients send user-agents and other headers that clearly identify them by default, so you need to override these headers and use a complete set of browser headers that match the type of browser you want to appear as. In this header optimization guide, we go into detail on how to do this and you can use our free Fake Browser Headers API to generate a list of fake browser headers.
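
To make the default-header problem concrete, the sketch below shows the user-agent Python's standard urllib client announces by default and how a request can be built with a browser-style header set instead. The header values are illustrative examples modelled on desktop Chrome on Windows, not a complete or current set; a real scraper should copy the full set from the browser version it wants to appear as.

```python
import urllib.request

# Illustrative header set modelled on desktop Chrome on Windows.
# Assumption: these values are examples, not a complete or current browser profile.
CHROME_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Upgrade-Insecure-Requests": "1",
}


def build_browser_request(url: str, headers: dict = CHROME_HEADERS) -> urllib.request.Request:
    """Build (but do not send) a request that carries browser-style headers."""
    return urllib.request.Request(url, headers=dict(headers))


# What urllib announces when you don't override anything, e.g. "Python-urllib/3.x".
default_headers = dict(urllib.request.build_opener().addheaders)
print("default:", default_headers["User-agent"])

request = build_browser_request("http://example.com/")
# urllib normalises header names, so the key is stored as "User-agent".
print("overridden:", request.get_header("User-agent"))
```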

#3: TLS & HTTP/2 Fingerprints

The more complex fingerprint detection system Cloudflare uses is TLS & HTTP/2 fingerprinting. Every HTTP client generates a static TLS and HTTP/2 fingerprint that Cloudflare can use to determine if the request is coming from a real user or a bot.

Different versions of browsers and HTTP clients tend to possess different TLS and HTTP/2 fingerprints, which Cloudflare can then compare to the browser headers you send to make sure that you really are who you claim to be in those headers.

The problem is that faking TLS and HTTP/2 fingerprints is much harder than simply adding fake browser headers to your request. You first need to capture and analyze the packets from the browsers you want to impersonate, then alter the TLS and HTTP/2 fingerprints used to make the request.

However, many HTTP clients like Python Requests don't give you the ability to alter these TLS and HTTP/2 fingerprints. You will need to use a programming language and HTTP client, like Golang's HTTP client or Got, that gives you enough low-level control of the request that you can fake the TLS and HTTP/2 fingerprints.

Libraries like CycleTLS, Got Scraping and utls help you spoof TLS/JA3 fingerprints in Go and JavaScript.
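
To give a feel for what a JA3 fingerprint actually is, here is a small sketch of how one is derived from the fields of a TLS ClientHello: the TLS version, cipher suites, extensions, elliptic curves and point formats are joined into a string and MD5-hashed. The numeric values below are made up for illustration and were not captured from any real browser or HTTP client.

```python
import hashlib


def ja3_string(tls_version, ciphers, extensions, curves, point_formats) -> str:
    """Join ClientHello fields JA3-style: commas between fields, dashes within a field."""
    groups = (ciphers, extensions, curves, point_formats)
    fields = [str(tls_version)] + ["-".join(str(value) for value in group) for group in groups]
    return ",".join(fields)


def ja3_hash(ja3: str) -> str:
    """The JA3 fingerprint is the MD5 hex digest of the JA3 string."""
    return hashlib.md5(ja3.encode("ascii")).hexdigest()


# Illustrative values only (assumption: not taken from a real client).
fingerprint_input = ja3_string(
    tls_version=771,                 # 771 is the decimal code for TLS 1.2
    ciphers=[4865, 4866, 4867],
    extensions=[0, 23, 65281, 10, 11],
    curves=[29, 23, 24],
    point_formats=[0],
)
print(fingerprint_input)
print(ja3_hash(fingerprint_input))

# Offering the same ciphers in a different order already yields a different fingerprint.
reordered = ja3_string(771, [4866, 4865, 4867], [0, 23, 65281, 10, 11], [29, 23, 24], [0])
print(ja3_hash(reordered) != ja3_hash(fingerprint_input))
```

Because the order and contents of these fields are decided by the TLS library rather than your scraping code, two clients sending identical HTTP headers can still be told apart.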

This is a complicated topic, so I would suggest you dive into how TLS & HTTP/2 fingerprinting works. Here are some resources to help you:

Important: Matching Browser Headers, TLS & HTTP/2 Fingerprints

Cloudflare detects your scrapers with these fingerprinting methods when you make a request using user-agents and browser headers that say you are a Chrome browser, but your TLS and HTTP/2 fingerprints say you are using the Python Requests HTTP client.

So to trick Cloudflare's fingerprinting tests you need to make sure your browser headers, TLS & HTTP/2 fingerprints are all consistent and are telling Cloudflare the request is coming from a real browser.

When you use an automated browser to make the requests then all of this is handled for you. However, it gets quite tricky when you are trying to make requests using a normal HTTP client.

Cloudflare's server-side detection techniques are its first line of defence. If you fail any of these tests your request will be challenged or blocked by Cloudflare.

The server-side detection techniques assign your request a risk score which Cloudflare then uses to determine what challenges to show you (if any) on the client side.

Each individual website can set its own anti-bot protection risk thresholds to determine who should be challenged and with what challenges (background client-side challenges or CAPTCHAs). So your goal is to obtain the lowest risk score possible, especially for the most protected websites.


Passing Cloudflare's Client-Side Detection Techniques

Okay, assuming you've been able to build a system to pass all of Cloudflare's server-side anti-bot checks, now you need to deal with its client-side verification tests.

These client-side verification tests occur when Cloudflare shows you its security page prior to giving you access to the website.

When you (or your scraper) first visit a website, Cloudflare will display this page while in the background your browser solves various challenges to prove to Cloudflare that you aren't a robot.

If you get flagged as a bot, you will be given a 403 Access Denied / Forbidden error.

The risk score your request obtained during the server-side tests can have an effect on which client-side verification tests Cloudflare runs. Most importantly, whether it requires you to solve a CAPTCHA or not.

There are three general approaches to solving the client-side anti-bot challenges that occur whilst you are waiting on this page:

  • Use Automated Browser: As mentioned previously, if you use a fortified browser to open the page then it will take care of a lot of the heavy lifting of solving the Cloudflare JavaScript challenges.
  • Emulate A Browser In A Sandbox: You could emulate a browser in a sandbox using a library like JSDOM, which would be less resource intensive and give you finer control over what you want it to render.
  • Build A Challenge Solver Algorithm: Build an algorithm that can pass the checks without a browser. This is the hardest approach as you need to fully understand Cloudflare's client-side checks, deobfuscate the JavaScript challenge scripts and then create an algorithm to solve them.

The following are the main client-side bot fingerprinting techniques Cloudflare performs in the user's browser which you will need to pass:

#1: Browser Web APIs

Modern browsers have hundreds of APIs that allow us as developers to design apps that interact with the user's browser. Unfortunately, when Cloudflare loads in the user's browser it gets access to all these APIs too.

This allows it to access huge amounts of information about the browser environment, which it can then use to detect scrapers lying about their true identities. For example, Cloudflare can query:

  1. Browser-Specific APIs: Some web APIs like window.chrome only exist in a Chrome browser. So if your browser headers, TLS and HTTP/2 fingerprints all say that you are making a request with a Chrome browser, but the window.chrome API doesn't exist when Cloudflare checks the browser, then it is a clear sign that you are faking your fingerprints.
  2. Automated Browser APIs: Automated browsers like Selenium have APIs like window.document.__selenium_unwrapped. If Cloudflare sees that these APIs exist then it knows you aren't a real user.
  3. Sandbox Browser Emulator APIs: Sandboxed browser emulators like JSDOM, which runs in Node.js, have the process object, which only exists in Node.js.
  4. Environment APIs: If your user-agent is saying you are using a macOS or Windows machine but the navigator.platform value is set to Linux x86_64, then that makes your request look suspicious.

If you are using a fortified browser it will have fixed a lot of these leaks, however, you will likely have to fix more and make sure that your browser headers and TLS & HTTP/2 fingerprints match the values returned from the browser web APIs.
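
A cheap sanity check you can run on your own setup is to compare the user-agent you send with the navigator.platform value your browser reports. The sketch below is a deliberately simplified heuristic of my own (real detection logic is not public and is certainly more involved); it only distinguishes Windows, macOS and Linux.

```python
def expected_platform_prefix(user_agent: str):
    """Map a user-agent string to the prefix navigator.platform would normally start with."""
    if "Windows" in user_agent:
        return "Win"      # e.g. "Win32"
    if "Macintosh" in user_agent:
        return "Mac"      # e.g. "MacIntel"
    if "Linux" in user_agent:
        return "Linux"    # e.g. "Linux x86_64"
    return None           # unknown operating system: nothing to compare against


def platform_matches_user_agent(user_agent: str, navigator_platform: str) -> bool:
    """Return False when the claimed OS and navigator.platform clearly disagree."""
    prefix = expected_platform_prefix(user_agent)
    return prefix is None or navigator_platform.startswith(prefix)


windows_ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"

print(platform_matches_user_agent(windows_ua, "Win32"))         # consistent
print(platform_matches_user_agent(windows_ua, "Linux x86_64"))  # the mismatch described above
```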

#2: Canvas Fingerprinting

Another technique Cloudflare uses to detect scrapers is canvas fingerprinting, a technique that allows Cloudflare to classify the type of device being used (combination of browser, operating system, and graphics hardware of the system).

Cloudflare uses Google's Picasso Fingerprinting to generate canvas fingerprints.

Canvas fingerprinting is one of the most common browser fingerprinting techniques. It uses the HTML5 canvas API to draw graphics and animations on a page with JavaScript, the rendered output of which can then be used to produce a fingerprint of the device.

Check Out Your Canvas Fingerprint

You can use BrowserLeaks Live Demo to see your browser's canvas fingerprint.

Cloudflare maintains a large dataset of legitimate canvas fingerprints and user-agent pairs. So when a request is coming from a user who is claiming to be a Firefox browser running on a Windows machine in their headers, but their canvas fingerprint is saying they are actually a Chrome browser running on a Linux machine, then that is a sign for Cloudflare to challenge or block the request.

#3: Event Tracking

If you need to navigate around or interact with a web page to get the data you need, then you will have to contend with Cloudflare's event tracking.

Cloudflare adds event listeners to webpages so that it can monitor user actions like mouse movements, clicks, and key presses. If you have a scraper that needs to interact with a page, but the mouse never moves, then it is a clear sign to Cloudflare that the request is coming from an automated browser and not a real user.
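
If your scraper does need to interact with the page, one option is to move the mouse along a plausible path before clicking rather than jumping straight to the target. The sketch below generates such a path as a slightly jittered quadratic Bezier curve; it is an illustration of the idea only, the parameters are arbitrary, and it makes no claim about what Cloudflare's event tracking actually accepts. Each point would be passed to whatever mouse-move call your browser automation tool exposes.

```python
import random


def mouse_path(start, end, steps=20, jitter=2.0, seed=None):
    """Return steps + 1 integer (x, y) points curving from start to end with small jitter."""
    rng = random.Random(seed)
    (x0, y0), (x2, y2) = start, end
    # A random control point bends the path so it isn't a perfectly straight line.
    cx = (x0 + x2) / 2 + rng.uniform(-100, 100)
    cy = (y0 + y2) / 2 + rng.uniform(-100, 100)
    points = []
    for i in range(steps + 1):
        t = i / steps
        # Quadratic Bezier interpolation between start, control point and end.
        x = (1 - t) ** 2 * x0 + 2 * (1 - t) * t * cx + t ** 2 * x2
        y = (1 - t) ** 2 * y0 + 2 * (1 - t) * t * cy + t ** 2 * y2
        if 0 < i < steps:
            # Jitter only the interior points so the path still starts and ends exactly.
            x += rng.uniform(-jitter, jitter)
            y += rng.uniform(-jitter, jitter)
        points.append((round(x), round(y)))
    return points


path = mouse_path((100, 200), (640, 360), steps=20, seed=42)
print(path[0], path[-1], len(path))
```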

#4: CAPTCHAs

Probably the hardest Cloudflare anti-bot challenge you will face when scraping a Cloudflare protected website is solving their CAPTCHA challenges.

Cloudflare only shows CAPTCHA challenges to users when:

  1. Cloudflare gives the request a high risk score.
  2. The website has configured its security to show a CAPTCHA challenge sometimes or all the time.

Luckily, most websites prefer not to show CAPTCHA challenges as they are known to hurt user experience.

In the rare case that a website administrator has configured Cloudflare to show a CAPTCHA on every request, you will need to use a human-based CAPTCHA solving service to solve the hCaptcha challenge, as automated CAPTCHA solvers aren't able to solve hCaptcha CAPTCHAs. This isn't ideal, as it can make scraping quite slow and expensive.

Otherwise, you should optimise your scrapers as much as possible to reduce the risk score Cloudflare assigns to them. This way you should be able to avoid having to deal with CAPTCHAs altogether.


Low-Level Bypass

Overall, actually reverse engineering and developing a low-level bypass (one that doesn't use a headless browser) for Cloudflare's anti-bot system is extremely challenging, as you will need to:

  • Intercept the Cloudflare network requests when it loads the Waiting Room page
  • Deobfuscate the Cloudflare code
  • Decrypt the Javascript challenges contained in the obfuscated code
  • Understand the Javascript challenges contained in the deobfuscated code
  • Solve the Javascript challenges and return the correct result.

Here is a deobfuscated snippet of some of the Browser API tests Cloudflare carries out.


function _0x15ee4f(_0x4daef8) {
  return {
    /* .. */
    wb: !(!_0x4daef8.navigator || !_0x4daef8.navigator.webdriver),
    wp: !(!_0x4daef8.callPhantom && !_0x4daef8._phantom),
    wn: !!_0x4daef8.__nightmare,
    ch: !!_0x4daef8.chrome,
    ws: !!(
      _0x4daef8.document.__selenium_unwrapped ||
      _0x4daef8.document.__webdriver_evaluate ||
      _0x4daef8.document.__driver_evaluate
    ),
    wd: !(!_0x4daef8.domAutomation && !_0x4daef8.domAutomationController),
  };
}

We will go into more detail on how to actually reverse engineer Cloudflare's JavaScript challenges in another article, as that is a big topic.


More Web Scraping Guides

So when it comes to bypassing Cloudflare you have multiple options. Some are pretty quick and easy, others are a lot more complex, and each comes with its own tradeoffs.
