What Is Web Scraping? A Beginner's Guide On How To Get Started
Web scraping is a polarizing topic. Some people see it as a sinister menace on the internet; others see its positive impact, as it powers many of the products consumers and businesses love to use.
Part of the reason is that web scraping can take many different forms, from the simple conversion of HTML data on public pages into more machine-readable data formats, to bots that simulate real-user behavior and have more sinister aims.
In this guide, we're going to give you a broad understanding of web scraping, and try to dispel some of the myths and misconceptions associated with it.
- What Is Web Scraping?
- What Is The Difference Between Web Scraping & Web Crawling?
- Who Uses Web Scraping?
- When To Use Web Scraping
- Scraping Tools: What is the Right Way to Scrape Data?
- Why Is Web Scraping Controversial?
So that begs the question, what exactly is web scraping?
What Is Web Scraping?
Web scraping, also called data extraction or screen scraping, is an automated technique for extracting data from websites or HTML pages.
Web scraping is normally used when a website doesn't provide a public API from which external individuals and companies can access the public data displayed on their websites.
Typically, the data we see on websites is unstructured HTML data, which is hard for machines to understand. The goal of web scraping is to convert that unstructured HTML data into more machine-friendly data that companies can use to power their products and data analytics.
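To make that conversion concrete, here is a minimal sketch that turns a small HTML snippet into structured records. The HTML, the class names (`product`, `name`, `price`) and the helper `parse_products` are all made up for illustration, and the sketch uses only Python's standard library `html.parser` rather than a dedicated scraping library.

```python
import json
from html.parser import HTMLParser

# Hypothetical product listing, standing in for a page fetched from a website.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Kettle</span><span class="price">$24.99</span></li>
  <li class="product"><span class="name">Toaster</span><span class="price">$39.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects the text of the name and price spans for each product item."""

    def __init__(self):
        super().__init__()
        self.products = []   # one dict per <li class="product">
        self._field = None   # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self.products.append({})
        elif tag == "span" and css_class in ("name", "price") and self.products:
            self._field = css_class

    def handle_data(self, data):
        if self._field:
            self.products[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def parse_products(html):
    """Return a list of {"name": str, "price": float} records found in html."""
    parser = ProductParser()
    parser.feed(html)
    # Convert the price text into a number so it can be sorted or compared.
    return [
        {"name": p["name"], "price": float(p["price"].lstrip("$"))}
        for p in parser.products
    ]


if __name__ == "__main__":
    print(json.dumps(parse_products(SAMPLE_HTML), indent=2))
```

The output is a plain list of dictionaries, which can be written to JSON, a CSV file or a database far more easily than the original markup.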
What Is The Difference Between Web Scraping & Web Crawling?
Although commonly confused, web scraping is not the same as web crawling.
Web crawling is the process of traversing a website by following links in pages or the sitemap, often indexing the web pages as it goes. Google uses web crawling to index the web and provide a powerful search engine that can find content anywhere.
Web scraping, on the other hand, is the process of extracting specific bits of data from a target web page and storing it for your own purposes.
The confusion often arises because sometimes developers use a combination of web crawling and web scraping to extract the data they need.
They have a web crawler that traverses a website finding web pages that they want to scrape data from, then use a web scraper to extract the data.
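The combination can be sketched in a few lines. In this rough example, the "website" is a hypothetical three-page dictionary (`FAKE_SITE`) so the code runs without network access, the rule that product pages contain `/products/` in their URL is an assumption, and `crawl_and_scrape` is an illustrative name rather than part of any library.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical three-page site held in memory so the sketch runs offline.
FAKE_SITE = {
    "https://example.com/": '<a href="/products/1">One</a> <a href="/products/2">Two</a>',
    "https://example.com/products/1": '<h1>Kettle</h1> <a href="/">Home</a>',
    "https://example.com/products/2": '<h1>Toaster</h1> <a href="/products/1">Prev</a>',
}


class PageParser(HTMLParser):
    """Records every link (for crawling) and the first <h1> text (for scraping)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.heading = None
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "h1":
            self._in_h1 = True

    def handle_data(self, data):
        if self._in_h1 and self.heading is None:
            self.heading = data.strip()

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False


def crawl_and_scrape(start_url, fetch):
    """Breadth-first crawl from start_url, scraping the heading of product pages."""
    seen, queue, results = {start_url}, deque([start_url]), {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = PageParser()
        parser.feed(html)
        # Scraping step: only product pages carry the data we want.
        if "/products/" in url and parser.heading:
            results[url] = parser.heading
        # Crawling step: queue links we have not visited yet.
        for href in parser.links:
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return results


if __name__ == "__main__":
    print(crawl_and_scrape("https://example.com/", FAKE_SITE.get))
```

Passing `fetch` in as a parameter keeps the crawling logic separate from how pages are downloaded, so the in-memory lookup could be swapped for a real HTTP client.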
Who Uses Web Scraping?
The use cases for web scraping are extremely broad and varied. They range from small projects that pull the data from one web page instead of transcribing it by hand, to large-scale data extraction operations designed to scrape data from millions of web pages every month.
Web scraping is used by marketers, research companies, developers, and businesses to:
- Monitor competitor products & their pricing.
- Build comparison tools for their markets.
- Monitor the company's own products to make sure resellers are sticking to their reseller agreements.
- Track their rankings in Google & Bing.
- Monitor social media accounts.
- Build email outreach lists for their sales teams.
- Build datasets that can be used for investment decisions.
When To Use Web Scraping
Web scraping makes the most sense when you need to extract data at scale or regularly.
That means either when you need large amounts of data from a single web page, or when you need to regularly scrape data from numerous web pages.
Two examples would be:
- If you want to extract the contact details of every person in a business directory so you don't need to manually transcribe them, or
- If you want to monitor a competitor's product prices every day on an eCommerce site.
Both of these are great use cases for web scraping, as using an automated web scraper will save you hours of manual data entry and greatly improve your ability to make data-based decisions.
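For the price-monitoring case, the scraping itself is only half the job; the other half is comparing each day's results. A minimal sketch of that comparison is below. The product names and prices are invented, and `price_changes` is a hypothetical helper that assumes each daily run has already been reduced to a dictionary of product name to price.

```python
def price_changes(previous, current):
    """Return {product: (old_price, new_price)} for products whose price changed."""
    return {
        name: (previous[name], price)
        for name, price in current.items()
        if name in previous and previous[name] != price
    }


# Hypothetical results of two daily scraping runs.
yesterday = {"Kettle": 24.99, "Toaster": 39.50}
today = {"Kettle": 22.49, "Toaster": 39.50, "Blender": 59.00}

if __name__ == "__main__":
    for name, (old, new) in price_changes(yesterday, today).items():
        print(f"{name}: {old:.2f} -> {new:.2f}")
```

Products that appear only in the newer run are ignored here; a real monitor would probably report those separately as new listings.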
Scraping Tools: What is the Right Way to Scrape Data?
There is no right or wrong way to scrape data from a website. The tool you use really depends on your use case and your technical skills.
If you are a developer, the preferred method is to use a common web scraping stack to send requests to your target website and extract the data from the response.
Common web scraping stacks include:
- Python Requests + BeautifulSoup: Use the Python Requests library to fetch the HTML data from the website and use BeautifulSoup to extract the data from the page.
- Python Scrapy: Scrapy is a full-stack web scraping framework that is great for large-scale web scraping projects.
- Node.js Requests/Fetch/Axios + Cheerio.js: Use a popular HTTP requests library like Requests, Fetch, or Axios to get the HTML data and use Cheerio to parse the data you want from the web page.
- Node.js Puppeteer or Playwright: If you need to scrape JavaScript-heavy websites, then using a headless browser like Puppeteer or Playwright is a very popular option.
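Whichever stack you pick, the flow is the same two steps: send a request, then extract data from the response. The sketch below shows that flow using only Python's standard library (`urllib.request` and `html.parser`) as a stand-in for Requests + BeautifulSoup, so it is not the code those libraries would use. The function names and the `example-scraper/0.1` User-Agent string are placeholders.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TitleParser(HTMLParser):
    """Captures the text inside the page's <title> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False


def fetch_html(url, timeout=10):
    """Step 1: send the request and return the response body as text."""
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_title(html):
    """Step 2: extract the data we want from the response."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()
```

Calling `extract_title(fetch_html(url))` on a page you are allowed to scrape ties the two steps together. Note that this approach only sees the HTML the server returns, which is why JavaScript-heavy sites call for a headless browser instead.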
If you do not know how to code, then you will need to use a web scraping tool like:
- ScrapingRobot
- Import.io
Why Is Web Scraping Controversial?
Web scraping, in general, is a controversial practice. Some find it ethical, while others find it unethical.
The controversy surrounding the topic of web scraping is due to a couple of factors:
#1 Violates The Terms Of Use
A big issue is that web scraping violates the stated terms of use of many websites.
When a website owner provides their content using a Creative Commons license, they are granting users the right to use the content so long as they adhere to the terms of the license. However, in most cases, websites do not allow web scraping as part of their terms of use.
This topic isn't black and white though, as many argue that since a website has made its data publicly available, scraping it isn't a breach of its terms of use because the web scraper hasn't explicitly agreed to those terms.
#2 Stealing Data
Another reason web scraping is viewed as bad is that businesses can sometimes profit from another website's data without permission or compensation.
For example, if you were to create an app like Yelp! by scraping all of its business data without permission from Yelp!, you would be taking advantage of the years Yelp! spent building up this dataset.
#3 Increased Infrastructure Costs
Web scraping can be a thorn in the side of many developers and website admins because it can increase the load on their website's servers and drive up their infrastructure costs.
A single automated web scraper can emulate the load of thousands of real users, driving up the costs of servers and degrading the performance of the website for real users.
All of these arguments are legitimate and are real concerns for businesses whose data is being scraped. That is why when scraping any site it is very important to do so responsibly and ethically.
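One common way to act on this is to honor a site's robots.txt file and to space out requests. The sketch below is one possible approach under stated assumptions, not a complete policy: the robots.txt content, the `example-scraper` user agent and the helper names are hypothetical, and the rules are parsed with Python's standard `urllib.robotparser`.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "example-scraper"  # placeholder name


def build_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def polite_fetch_all(urls, rules, fetch, default_delay=1.0, sleep=time.sleep):
    """Fetch only the URLs robots.txt allows, pausing between requests."""
    # Use the site's requested delay if it gives one, otherwise our own default.
    delay = rules.crawl_delay(USER_AGENT) or default_delay
    pages = {}
    for url in urls:
        if not rules.can_fetch(USER_AGENT, url):
            continue  # the site asked crawlers to stay out of this path
        if pages:
            sleep(delay)  # spread requests out instead of sending them in a burst
        pages[url] = fetch(url)
    return pages
```

Here `fetch` is whatever function downloads a page, and `sleep` is injectable so the pacing can be checked without actually waiting. Respecting robots.txt does not settle the terms-of-use question, but it does address the server-load concern directly.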
More Web Scraping Guides
This was a high-level overview of web scraping: what it is, how it is used, and how you can get started.
If you would like to learn more about web scraping in general, then be sure to check out The Web Scraping Playbook, or check out one of our more in-depth guides: