
The Web Scraping Playbook - The Ethics of Web Scraping

The Ethics of Web Scraping

Question: "Is web scraping ethical?"

Not: "Is web scraping legal?"

That is a different question, one that leads you down a rabbit hole of court cases, data protection and copyright law.

Just: Is the act of scraping data from someone's website ethical?

It is a simple question, but one on which it is quite hard to reach a consensus, primarily because web scraping is so polarising.

Some people see web scraping as a sinister menace on the internet, stealing company data, impacting user experience for real users, and driving up infrastructure costs.

Whilst others argue that web scraping is simply the conversion of HTML data on public pages into more machine-readable formats, or tout web scraping's positive impact, as it powers numerous products that consumers and businesses love.

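This is a polarising topic, one rife with hypocrisy, so we're going to view it from both sides and explore:

  • The Ethical Case Against Web Scraping
  • The Ethical Case For Web Scraping
  • The Ethical Hypocrisy Of Companies

We'll finish by suggesting some principles for an ethical web scraper.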



The Ethical Case Against Web Scraping

The case for why web scraping is unethical is pretty straightforward.

If I'm a company, and I say in my robots.txt and my Terms of Service that the automated scraping of my site is prohibited, and you proceed to scrape my data anyway, then doesn't that make scraping my content unethical?

A clause like this in someone's Terms of Service could not be more explicit:

You may only use or reproduce the Content for your own personal and non-commercial use. The framing, scraping, data-mining, extraction or collection of the Content of the Sites in any form and by any means whatsoever is strictly prohibited. Furthermore, you may not mirror any material contained on this Sites.

Here, I have clearly stated that the data I have assembled and published on my website should not be scraped, so why would it be ethical for anyone to do so?

Not only that, if someone proceeds to scrape my website and I ban their IP address with the message "web scraping is prohibited", and they respond by hiding their identity behind proxies and fake user agents, then they can be under no illusion: I don't want my data to be scraped.

Isn't that proof enough as to why web scraping is unethical?



The Ethical Case For Web Scraping

The other perspective is that of the web scraper.

Often their point of view is that the data has been published publicly on the open web, and all they are doing is converting HTML into machine-readable JSON.

In doing so, they bring value to the world by using the data to create new products that benefit society as a whole, oftentimes adding value to the original data owner's product ecosystem.

In some circumstances, they may be violating the website's robots.txt or Terms of Service, but if the website has made the data public and the scraped data is providing value to society as a whole, doesn't that make web scraping ethical?

Obviously, there are types of web scraping that break this argument and are very questionable ethically:

  • Scraping personal information with the purpose of harming or annoying a person.
  • Scraping data from websites and directly replicating it on their own website.
  • Scalpers using bots to create multiple accounts on sneaker and ticketing websites, having those bots snap up offers ahead of real users, then reselling the goods at higher prices.
  • Spamming the comments of someone's blog, YouTube videos, or Twitter DMs with untargeted advertising or the promotion of scams.

And yes, web scraping with bad practices can put extra load on a website's servers, impacting user experience and increasing server costs.

However, a responsible web scraper can mitigate some or all of these issues with a well-designed scraping architecture.

And if the original publishers of the data provided a public API, or another way of consuming the data, then web scrapers would use that instead, virtually removing the infrastructure burden.

Is this enough to redeem web scraping and make it ethical?



The Ethical Hypocrisy Of Companies

The ethical argument around web scraping is pretty straightforward when you are only on one side of the debate: either you are a web scraper or you are a website being scraped.

However, there is a real ethical hypocrisy for companies when they are both prohibiting/blocking scraping of their own content, and web scraping themselves.

How can a company argue that web scraping is prohibited, whilst the same company is actively scraping other websites themselves?

From the monitoring data we collect using ScrapeOps, our free web scraping monitoring tool, we can say with certainty that the majority of web scraping is targeted at a handful of large websites.

And in the vast majority of cases, whilst these companies are trying to stop people scraping their websites, at the same time they are aggressively scraping their competitors' websites.

If a company is ignoring the Terms of Service of another website and using advanced proxy solutions to scrape someone else's data, how can it argue that its own Terms of Service should be respected?

If a company wants to prohibit the scraping of its website, then shouldn't it ban the practice within its own company as well?

But just because a company has hypocritical ethics, does that mean it is ethical to scrape their content?



A Manifesto For The Responsible Scraper

There are so many entrenched views and incentives around web scraping that it will likely be impossible to ever reach a web scraping code of ethics that everyone agrees upon.

However, that doesn't mean that you as a web scraper shouldn't strive to act in an ethical and responsible manner.

So if you want to be a responsible scraper (maybe not ethical in some people's eyes), here are some principles to live by:

Principles of a Responsible Web Scraper

  • If you provide a public API to retrieve the data I require, I will use that instead of scraping.
  • I will always endeavor to minimise the impact of my scraping by scraping at off-peak times and scraping at a reasonable rate.
  • I will try to make my requests as light as possible on your servers. I will only use headless browsers when it is 100% necessary.
  • If the data can be accessed via a hidden API endpoint, then I will use this instead of requesting the HTML response.
  • I will only scrape the data that is essential for my requirements. If I can get the data I need from a product shelf page, I won't scrape individual product pages.
  • I will scrape with the goal of creating value with your data, not simply duplicating it.
  • I will respect your copyright, and not pass off your content as my own.
  • If I am causing a burden to your website, I will make changes.
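
Several of the principles above (scraping at a reasonable rate, preferring off-peak times, and making changes when you are causing a burden) can be expressed directly in code. What follows is only a minimal sketch, not a definitive implementation: the `PoliteThrottle` class, the `is_off_peak` helper, the 2 second default delay, and the 01:00 to 06:00 window are all assumptions chosen for illustration, and the right values depend on the website you are scraping.

```python
import time
from datetime import datetime, time as dtime


class PoliteThrottle:
    """Hypothetical helper: space requests out and slow down when the server struggles."""

    def __init__(self, min_delay=2.0, max_delay=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay    # assumed baseline gap between requests (seconds)
        self.max_delay = max_delay    # assumed upper bound for the backoff
        self.delay = min_delay        # current gap, grows when the site is under strain
        self._clock = clock
        self._sleep = sleep
        self._last_request = None

    def wait(self):
        """Sleep just long enough to keep requests at least `delay` seconds apart."""
        if self._last_request is not None:
            remaining = self.delay - (self._clock() - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
        self._last_request = self._clock()

    def record_status(self, status_code):
        """Double the delay on 429/503 responses, otherwise ease back toward the minimum."""
        if status_code in (429, 503):
            self.delay = min(self.delay * 2, self.max_delay)
        else:
            self.delay = max(self.delay / 2, self.min_delay)


def is_off_peak(now, start=dtime(1, 0), end=dtime(6, 0)):
    """Return True if `now` falls inside the assumed off-peak window [start, end)."""
    return start <= now.time() < end
```

In a scraping loop you would call `throttle.wait()` before each request and `throttle.record_status(response_status)` after it, and only start a large job when `is_off_peak(...)` is true for the target website's local time. Treating 429 and 503 responses as a signal to slow down is one simple way of honouring "if I am causing a burden to your website, I will make changes".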

Be A Truly Ethical Scraper

Some people might disagree with the statements above, feeling they do not go far enough.

And in certain circumstances they might be right.

So if you would like to be a truly ethical web scraper, we can add the following principles:

Principles of a Truly Ethical Web Scraper

  • I will obey your robots.txt and Terms of Service at all times. If they forbid web scraping, then I will not scrape your website.
  • If I really want your data, I will reach out beforehand and seek your permission before scraping your website.
  • If I do scrape your website, I will clearly identify myself in the User Agent, and provide a way for you to contact me.

The last point is important.

If I am the owner of a website and I notice a surge in traffic that is impacting my website's performance, having a clear way to identify and contact a web scraper is of huge value to me.

As web scrapers, we can make the website owner's job so much easier simply by identifying ourselves in our User-Agents, and giving them a way to contact us.


{'user-agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64); John Doe (johndoe@gmail.com)'}
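
Obeying robots.txt and identifying yourself can be combined in a few lines. The sketch below uses Python's standard-library `urllib.robotparser` and `urllib.request`. The `USER_AGENT` string and the `build_robots_parser` and `build_request` helpers are hypothetical names introduced for illustration, and the sketch assumes you have already downloaded the text of the website's robots.txt. It only builds the request; it does not send anything.

```python
from urllib import robotparser
from urllib.request import Request

# Hypothetical identity: replace with your own project name and contact address.
USER_AGENT = "JohnDoeResearchBot/1.0 (+mailto:johndoe@gmail.com)"


def build_robots_parser(robots_txt):
    """Parse the text of a robots.txt file (fetched separately) into a parser."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def build_request(url, parser, user_agent=USER_AGENT):
    """Return a Request carrying our identity, or None if robots.txt disallows the URL."""
    if not parser.can_fetch(user_agent, url):
        return None
    return Request(url, headers={"User-Agent": user_agent})


# Example rules, made up for this sketch.
rules = "User-agent: *\nDisallow: /private/\n"
parser = build_robots_parser(rules)

blocked = build_request("https://example.com/private/data", parser)
allowed = build_request("https://example.com/products", parser)
```

Here `blocked` is `None` because the example rules disallow `/private/`, while `allowed` is a request whose User-Agent header tells the website owner who is scraping and how to reach them. Note that robots.txt says nothing about a website's Terms of Service, which still have to be read by a human.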


The Ethical Website Owner

Some might disagree, but ethics in web scraping is a two-way street.

Whilst web scrapers have a duty to act ethically and responsibly, it can be argued that website owners have a duty too.

Here are some principles for a website owner to live by:

Principles for an Ethical Website Owner

  • I acknowledge the fact that scraping is a fact of life on the open web.
  • If my data is in high demand, I will consider exposing a public API as an alternative to scraping.
  • If someone uses transparent User Agents, I will reward their transparency and not block them unless they are being irresponsible.
  • I won't block web scrapers unless they are being a burden to our website, and/or using the data in negative ways.

Verdict

Web scraping is a fact of life on the modern internet, and is unlikely to change soon.

However, that doesn't mean that web scrapers don't have a duty to behave responsibly and ethically.

They need to take into account the burden they place on the websites they scrape and the damage the data they scrape might do to others.

Similarly, website owners have to acknowledge that if they provide valuable data for free on the public web, then web scraping is likely to occur.

If website owners were to provide free (or paid) APIs to their data, then a lot of web scraping would go away.

It is in nobody's interest for one side to spend time and money building web scrapers and proxy networks while the other side spends time and money trying to block them. It would be cheaper for everyone involved to have access to the data via APIs.

If you would like to learn more about web scraping in general, then be sure to check out The Web Scraping Playbook.