Web Scraping Without Getting Blocked
The goal of every web scraper is to not stand out, instead doing everything you can to blend into a website's normal traffic.
However, not being flagged as a scraper is getting harder and harder as anti-bot technologies become more sophisticated and more widely used.
Today, you can be detected by:
- IP Address
- TLS or TCP/IP fingerprint
- HTTP headers (values, order and cases used)
- Browser fingerprints
- Cookies/Sessions
In this guide we're going to share with you some of the common ways websites detect you as a scraper, and how to optimise your scrapers so that they blend into a website's normal traffic and don't get blocked:
- Header Optimisation
- IP Addresses & Proxies
- Browser Fingerprinting
- TLS Fingerprinting
- Request Profiling
Need help scraping the web?
Then check out ScrapeOps, the complete toolkit for web scraping.
Header Optimisation
The first step every developer should take when making their scrapers production ready, so they don't get blocked, is optimising the headers they send with their requests.
The headers you use are one of the easiest ways for a website to detect that you are a scraper and not a real user.
You need to make sure the headers you use look like the headers a real web browser would send, and that they are consistent with the identity you are trying to present.
In our header optimisation guide, we go through in detail how you should optimise your headers when scraping, however, here are the main points:
1. Use Real Web Browser Headers
By default, most HTTP libraries (Python Requests, Scrapy, NodeJs Axios, etc.) either don't attach real browser headers to your requests or include headers that identify the library being used. Both of these immediately tell the website you are trying to scrape that you are a scraper, not a real user.
So with every request you should use a real set of browser headers, and vary them from request to request. For example, here is a set of headers sent by Chrome on a macOS machine:
Host: 127.0.0.1:65432
Connection: keep-alive
Cache-Control: max-age=0
sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="99", "Google Chrome";v="99"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "macOS"
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Sec-Fetch-Dest: document
Accept-Encoding: gzip, deflate, br
Accept-Language: en-GB,en-US;q=0.9,en;q=0.8
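As a rough sketch of what this looks like in practice, the snippet below (using the Python Requests library) keeps a small pool of real browser header profiles and picks one at random for each request. The profile shown mirrors the Chrome-on-macOS example above, and the placeholder comment marks where you would add other header sets captured from real browsers. Note that Requests won't preserve the header order for you, which is the subject of the next point.

```python
# A minimal sketch: rotate realistic browser header profiles between requests.
import random
import requests

HEADER_PROFILES = [
    {
        "sec-ch-ua": '" Not A;Brand";v="99", "Chromium";v="99", "Google Chrome";v="99"',
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-platform": '"macOS"',
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-User": "?1",
        "Sec-Fetch-Dest": "document",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8",
    },
    # ...add more profiles captured from real browsers (Windows/Chrome, macOS/Safari, etc.)
]

def get(url):
    # Pick a random (but internally consistent) browser profile for each request.
    headers = random.choice(HEADER_PROFILES)
    return requests.get(url, headers=headers)

response = get("https://example.com/")
print(response.status_code)
```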
2. Pay Attention To Header Order
Most web browsers attach headers in a specific order that doesn't change. However, a lot of HTTP clients either use their own header ordering or randomise the order, which makes identifying web scrapers very easy.
Some HTTP clients, like the popular Python Requests library, don't respect the header order you define in your request (see issue 5814), making it easier for websites to detect requests with unnatural header ordering.
To combat this you should use an HTTP client that respects the header order you define, so you can match it exactly to how a browser would send the headers. In the case of Python, the httpx library does respect header order, so it is a good alternative to Python Requests.
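For example, here is a minimal sketch using httpx, which sends the headers in the order you define them (Python dicts preserve insertion order), so you can mirror the order a real Chrome browser uses. The target URL is a placeholder.

```python
# A minimal sketch: httpx preserves the header order you define,
# so the order below is the order the headers are sent in.
import httpx

ordered_headers = {
    "Connection": "keep-alive",
    "Cache-Control": "max-age=0",
    "sec-ch-ua": '" Not A;Brand";v="99", "Chromium";v="99", "Google Chrome";v="99"',
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": '"macOS"',
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-User": "?1",
    "Sec-Fetch-Dest": "document",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8",
}

with httpx.Client(headers=ordered_headers) as client:
    response = client.get("https://example.com/")
    print(response.status_code)
```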
3. Optimise Headers For Specific Websites
Depending on what website you are scraping, using specific header combinations may increase your scraping performance.
Sometimes websites require you to include specific headers when accessing lower-level pages on the site, or certain header combinations may increase your success rates (for example, setting the referer header to facebook.com versus google.com).
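As a quick illustration, here is a hedged sketch of setting the referer header on a request with httpx. The URLs are placeholders, and which referer (if any) actually improves success rates is something you have to test for each site.

```python
# A minimal sketch: add a Referer so the request looks like it came from a
# social/search listing rather than from nowhere.
import httpx

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36",
    "Referer": "https://www.facebook.com/",  # try https://www.google.com/ as an alternative
}

response = httpx.get("https://example.com/some-deep-page", headers=headers)
print(response.status_code)
```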
A big part of the reason why proxy services like ScraperAPI, Scrapingbee, or ScrapingAnt can get much higher performance out of data center proxies than you can is that they are much better at managing headers and have systems in place to constantly split-test various header combinations to maximise performance.
Takeaways
- Use real browser header and user-agent combinations with every request.
- Make sure you use an HTTP client that respects header ordering, so the header order looks natural.
- Optimise your headers for specific websites.
IP Addresses & Proxies
If you want to scrape a website quickly or at scale (more than a couple of thousand requests per day), then the next bottleneck you will run into is websites determining that you are a scraper from the IP address you are using.
A real human user will rarely request more than 5 pages per second from the same website.
However, a scraper making concurrent requests certainly can, and it is pretty obvious to the website that those requests aren't coming from a real user.
This means that if you want to disguise your scraper then you will have to start using proxies.
You have numerous options when it comes to picking a proxy solution for your scrapers, from using:
- Free proxy lists
- Data center proxies
- Residential or mobile proxies
- Proxy APIs
- Building your own proxy network
All of these have their own pros and cons, from better performance to lower or higher costs; however, you will likely need some proxy solution at some point if you want to prevent your scrapers from getting blocked.
Depending on which option you go with, you will also need to build a system to manage your proxy infrastructure, including proxy selection, rotation, blacklisting, unblocking, etc.
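As a rough illustration of the simplest possible version of this, the sketch below rotates requests through a small proxy pool using the Python Requests library. The proxy URLs are placeholders for whatever provider or pool you use, and a production system would also need ban detection, retries, blacklisting, and so on.

```python
# A minimal sketch: a naive rotating proxy setup with Python Requests.
import random
import requests

PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]

def get_via_proxy(url):
    # Pick a random proxy from the pool for each request.
    proxy = random.choice(PROXY_POOL)
    proxies = {"http": proxy, "https": proxy}
    return requests.get(url, proxies=proxies, timeout=30)

response = get_via_proxy("https://example.com/")
print(response.status_code)
```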
Picking the best proxy provider for your particular use case is quite a big topic, so if you want more info then check out our proxy guides and tools:
- Proxy Pool Optimization Guide
- How To Pick The Best Proxy Solution For Your Use Case
- Proxy Provider Comparison Tool
Takeaways
Check out our Proxy Pool Optimization Guide for more information, but here are some of the key things to remember:
- If you are scraping protected websites or scraping at scale, it is very likely you will need to use proxies to disguise your scraper.
- The type of proxy you use (datacenter, ISP, residential, mobile, etc.) can have a big effect on your performance.
- You need a diversified pool of proxies (proxies from different subnets, not just different IP addresses).
- You need to put in place a system to manage your proxies, otherwise you will get poor performance and ultimately burn out the proxy pool.
Request Profiling
A common mistake we see developers make is sending requests that make it very obvious to the target website that they come from a web scraper and not a real user.
Sometimes two developers can be scraping the same website with the exact same header and proxy setup, yet one gets consistently blocked while the other scrapes without any issues, solely because of how they structure their requests.
One developer's requests are believable as a real user's, whereas the other's are obviously those of a scraper.
Here are some of the common mistakes that can quickly give you away as a scraper:
1. Unrealistic URLs
The URL you use to make the request can often give you away as a scraper. The question you should be asking yourself is: are you sending requests to URLs that a real user would never visit?
A common example of this is scraping e-commerce sites. To keep their scrapers as simple as possible, a lot of developers design their scrapers to request URLs built from just a product's ID or ASIN number. For example: https://www.shop.com/product/[product_id]
These URLs can work; however, if you are browsing the website as a real user you will rarely see the site format the URL like this, which makes it much easier for the website to detect you as a scraper.
Instead, you should make your URLs look like the URLs a real user would request. For example, https://www.walmart.com/ip/Surface-Bassu-Moisture-Conditioner-2-oz-Pack-of-2/643905888 will have a much higher success rate than https://www.walmart.com/ip/643905888, even though both URLs point to the same product.
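One practical way to get these realistic URLs is to harvest them from category or search pages rather than constructing them from product IDs. The sketch below uses httpx and parsel; the category URL and CSS selector are hypothetical and will differ for every site.

```python
# A minimal sketch: collect full, human-looking product URLs from a category page
# instead of constructing bare ID-only URLs yourself.
from urllib.parse import urljoin

import httpx
from parsel import Selector

category_url = "https://www.shop.com/category/conditioners"  # hypothetical listing page
html = httpx.get(category_url).text
sel = Selector(text=html)

# Harvest the same slugged URLs a real user would click on...
product_urls = [
    urljoin(category_url, href)
    for href in sel.css("a.product-link::attr(href)").getall()  # hypothetical selector
]

# ...rather than building https://www.shop.com/product/<product_id> style URLs.
for url in product_urls:
    print(url)
```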
2. Request Patterns
The pattern in which you make your requests can also be an easy giveaway that you are a scraper and not a real user.
For example, if you want to scrape all the products in a category on an e-commerce store, and you start with page 1 and product 1 and scrape the entire category in sequential order at a constant rate, then it is highly likely the website will conclude the requests are automated.
Instead of scraping every page and product in order, you should scrape different pages and products in a random order and vary the interval between your requests (from seconds to minutes).
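As a minimal sketch of what this can look like, the snippet below shuffles the pages to scrape and sleeps for a random interval between requests. The URL pattern and delay range are purely illustrative.

```python
# A minimal sketch: randomise crawl order and the delay between requests.
import random
import time

import requests

# Hypothetical paginated category URLs.
page_urls = [f"https://www.shop.com/category/conditioners?page={n}" for n in range(1, 21)]

# Shuffle so pages aren't hit in neat sequential order...
random.shuffle(page_urls)

for url in page_urls:
    response = requests.get(url)
    print(url, response.status_code)
    # ...and vary the delay between requests instead of hitting a constant rate.
    time.sleep(random.uniform(2, 30))
```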
3. Location
A big giveaway for websites with a very specific geographic focus is making requests from a location that their normal users would never be in.
If you are scraping a South American real estate platform but you are making all your requests through Russian proxies, it will show up very quickly in the website's analytics that there is suspicious traffic coming from Russia. They might then decide to show more CAPTCHAs to Russian traffic in future, or block it completely.
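If your proxy provider lets you pick exit locations, a simple approach is to key your proxies by country and pick one that matches where the site's real users are. The sketch below is purely illustrative; the provider endpoints, country codes, and target URL are placeholders.

```python
# A minimal sketch: route requests through proxies in the same region as the
# site's real users. Endpoints below are placeholders for your provider's.
import requests

PROXIES_BY_COUNTRY = {
    "br": "http://user:pass@br.proxy.example.com:8000",
    "ar": "http://user:pass@ar.proxy.example.com:8000",
    "us": "http://user:pass@us.proxy.example.com:8000",
}

def get_from(url, country):
    proxy = PROXIES_BY_COUNTRY[country]
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)

# Scraping a South American real estate site? Use South American exit nodes.
response = get_from("https://example-realestate.com.br/listings", country="br")
print(response.status_code)
```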
Takeaways
- Use believable URLs that normal users would request when scraping a website, not the shortest/simplest URLs to code into your scrapers.
- Randomise how you scrape a site and the intervals between your requests to make it less obvious the requests are automated.
- Make requests from locations where the website's real users actually live, not from a country on the other side of the world.
Browser Fingerprinting
Increasingly, to combat websites using anti-bot technologies, lots of developers are turning to headless browsers like Puppeteer, Playwright, or Selenium to avoid getting blocked when scraping a website.
Using a headless browser does make your requests look more like a real user's than using an HTTP client does. However, headless browsers aren't a magic bullet, and they open up a Pandora's box of ways for websites to test whether you are a scraper or a real user.
Modern anti-bot technologies use browser fingerprinting, can detect browser automation leaks, and integrate honeypots and other challenges into the page that your scrapers could fail.
Here are some of the major issues:
1. Fixing Browser Leaks
Browsers expose information about themselves in the JavaScript execution context, which the website's own scripts can query to verify that the browser belongs to a real user and not a bot.
By default, most headless browsers leak information that tells the website the browser is automated and not a real user. To avoid being blocked you need to patch these leaks and fortify your browser fingerprint so that your scraper isn't detected.
In our guide to fortifying your headless browser we go through in detail what some of these fingerprint leaks are, and how to patch them.
However, when you are using a headless browser for web scraping you should always use the stealth versions, as they often have the most common leaks fixed.
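For example, here is a hedged sketch using Selenium with the selenium-stealth plugin to patch some of the most common headless Chrome leaks. The fingerprint values passed to stealth() are illustrative and should be kept consistent with the rest of the identity you present.

```python
# A minimal sketch: Selenium + selenium-stealth to patch common headless Chrome leaks.
from selenium import webdriver
from selenium_stealth import stealth

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)

# Patch navigator.webdriver, vendor/platform strings, WebGL info, etc.
# The values below are illustrative; keep them consistent with your user-agent.
stealth(
    driver,
    languages=["en-US", "en"],
    vendor="Google Inc.",
    platform="Win32",
    webgl_vendor="Intel Inc.",
    renderer="Intel Iris OpenGL Engine",
    fix_hairline=True,
)

driver.get("https://example.com/")
print(driver.title)
driver.quit()
```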
2. Consistent Identity
A common issue that many developers overlook is making sure that the identity you present in your headers, user-agent, browser, server and proxy is consistent, with all the pieces matching each other.
For example, if you are using a headless browser, you need to make sure the user-agent string you define matches the browser version you are actually running. Likewise, your scraper shouldn't run on a Linux machine while its user-agent claims to be a Windows machine.
Inconsistencies like this won't happen for real users visiting a website, so if you don't make sure you use a consistent identity with every request then you are likely to get blocked.
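For example, here is a hedged sketch using Playwright's sync API where the user-agent, locale, timezone, and proxy location are all chosen to tell the same story (a US-based Chrome user on Windows). The specific values and the proxy endpoint are illustrative.

```python
# A minimal sketch: keep the presented identity consistent across UA, locale,
# timezone, and proxy location.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        proxy={"server": "http://us.proxy.example.com:8000"},  # US exit node...
    )
    context = browser.new_context(
        # ...so the user-agent, locale and timezone should also look like a US
        # Windows user, and the Chrome version in the UA should match the
        # Chromium build Playwright actually launches.
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36",
        locale="en-US",
        timezone_id="America/New_York",
    )
    page = context.new_page()
    page.goto("https://example.com/")
    print(page.title())
    browser.close()
```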
Takeaways
When you start using headless browsers for web scraping you can get a performance boost, but websites can still detect you, and the number of ways they can detect you is truly massive. Check out our guide to fortifying your headless browser, which goes into much more detail on the topic.
- Always use the stealth version of your automation library, be it puppeteer-stealth, playwright-stealth or selenium-stealth.
- Make sure the identity you present with the request is consistent amongst the headers, user-agent, browser, server and proxies you use.