Mastering Web Scraper Logging and Alerts with ELK Stack

When dealing with software, things break. Web scrapers are no exception to this rule.

Whether it's just a small change or a complete site redesign, these changes can leave your scraper stranded trying to look for a page element that no longer exists.

In this situation, a good scraper should throw an error and write it to a log file. A good stack can read this log file and let you know that something is broken.

"In these situations, you might find it more comfortable to move the monitor closer to the front of the desk than you would ordinarily have it. You'll find this makes it much easier to repeatedly bang your head against it when you're feeling particularly frustrated."

--Pete Goodliffe, Becoming a Better Programmer

ELK (Elasticsearch, Logstash, Kibana) Stack eliminates the need to manually dig through log files. You can find your broken scrapers in a snap... no head banging required!

In this guide, we'll walk you through installing the ELK (Elasticsearch, Logstash, Kibana) stack, integrating your scraper logs, and setting up real-time alerts.

By the end, you'll have a fully functioning local setup that gives you complete visibility into your web scraping workflows, no more silent failures.


TLDR: Quick Takeaways

  • ELK Stack centralizes logs from multiple scrapers, making them searchable and easy to visualize.
  • Logstash ingests your scraper logs, pushes them to Elasticsearch, and Kibana provides a slick web interface for filtering and analyzing them.
  • Alerts can be configured to fire on specific log patterns (e.g., level:ERROR), letting you know the moment something breaks.
  • Scaling is straightforward by containerizing or distributing the stack, and you can tweak resource usage to avoid crashing your local machine.

Need help scraping the web?

Then check out ScrapeOps, the complete toolkit for web scraping.


Why ELK? The Power of Centralized Logging

Logging is Critical: Scrapers need to log their actions. Without logging, you might not spot broken code until the data has made it into your report.

A single web scraper can have a variety of logs: debug messages, site structure changes, or authentication errors. Multiply this by multiple scrapers, and you have a logging nightmare.

Enter the ELK Stack: Good logging isn't just jotting down events, it's tracking your scrapers in real time so you can stay ahead of potential issues. A plaintext log file will tell you what's wrong, but it can take hours, even days, just to find the error.

ELK stack processes our logs for us. This way, you don't need to dig through the muck.

  • Elasticsearch: A backend API that can effectively process and serve your logs.

  • Logstash: Process your logs and feed them into the Elasticsearch backend.

  • Kibana: A highly intuitive web app that visualizes the Elasticsearch backend in your browser.

ELK doesn't just hand you a basic log file. You get a complete web app, fully integrated with your logs, that lets you search, filter, and visualize them however you need.

By centralizing logs from all your scrapers, you can quickly spot issues, like 404 errors, missing data, or website layout changes, without scanning endless text files.


Setting Up the ELK Stack on Your Local Machine

Next, we'll go through the requirements of setting up ELK stack.

Elasticsearch Setup

The following instructions will work with a native Ubuntu install or Ubuntu using WSL. If you need to install on a different OS, Elastic's official installation docs have you covered.

Installing Elasticsearch

To start, we need to download the Elasticsearch GPG signing key. APT uses this key to verify the integrity of the packages we're about to install.

curl -fsSL https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo gpg --dearmor -o /usr/share/keyrings/elastic.gpg

Next, we need to add the Elastic repository to the sources.list.d directory. This is where APT keeps the list of repositories it can install software from on Ubuntu.

echo "deb [signed-by=/usr/share/keyrings/elastic.gpg] https://artifacts.elastic.co/packages/7.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-7.x.list

Update your APT repositories. The Ubuntu package manager uses these repos to install software on your computer.

sudo apt update

We're finally ready to install Elasticsearch.

sudo apt-get install elasticsearch

After installing, you should see this output or something similar.

Web Scraping ELK Stack - Installation Finished

Starting Elasticsearch

Start Elasticsearch. systemctl allows you to run it as a background process.

sudo systemctl start elasticsearch

Optionally, you can set Elasticsearch to start automatically whenever the system boots.

sudo systemctl enable elasticsearch

Create a new password. This password will give you superuser access to Elasticsearch. We'll use it to log in to Kibana later on.

sudo /usr/share/elasticsearch/bin/elasticsearch-reset-password -u elastic

When prompted, enter y to generate a new random password. The new password is printed to the terminal, so copy it somewhere safe.

Testing with curl.

curl -u "elastic:<YOUR_PASSWORD>" -X GET "https://localhost:9200"

If it's running properly, it should spit out a JSON object with basic information about your Elasticsearch build.

Web Scraping ELK Stack - cURL Output
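Since Elasticsearch 8 enables TLS with a self-signed certificate by default, curl may refuse to connect. If that happens, one option (assuming the default certificate path for a deb install; adjust it if yours differs) is to point curl at the CA certificate Elasticsearch generated during installation:

curl --cacert /etc/elasticsearch/certs/http_ca.crt -u "elastic:<YOUR_PASSWORD>" -X GET "https://localhost:9200"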


Logstash Setup

This part's pretty similar to Elasticsearch. The following instructions will work with a native Ubuntu install or Ubuntu using WSL.

Installing Logstash

Start by getting the signing key used for Logstash packages.

wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo gpg --dearmor -o /usr/share/keyrings/elastic-keyring.gpg

Install apt-transport-https. APT uses it to fetch packages from the Elastic repository over HTTPS.

sudo apt-get install apt-transport-https

Save the repository information to your APT sources.

echo "deb [signed-by=/usr/share/keyrings/elastic-keyring.gpg] https://artifacts.elastic.co/packages/8.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-8.x.list

Install Logstash.

sudo apt-get update && sudo apt-get install logstash

Starting Logstash

Start Logstash.

sudo systemctl start logstash

Optionally, you can set Logstash to start automatically whenever the system boots.

sudo systemctl enable logstash

Kibana Setup

Finally, we'll install Kibana.

Installing Kibana

The good news is that Kibana is already included with the repositories we added earlier.

sudo apt-get install kibana

Successful installation will look like this.

Web Scraping ELK Stack - Successfully Installed Kibana

Starting Kibana

Start Kibana.

sudo systemctl start kibana

Optionally, you can set Kibana to start automatically whenever the system boots.

sudo systemctl enable kibana

Generate a new enrollment token.

sudo /usr/share/elasticsearch/bin/elasticsearch-create-enrollment-token --scope kibana

You can access Kibana at http://localhost:5601/. Enter your new enrollment token into the input box.

Get your Kibana verification code.

sudo /usr/share/kibana/bin/kibana-verification-code

Enter the verification code.

Web Scraping ELK Stack - Kibana Enter Your Verification Code

Setup takes a minute or two. Then log in using elastic as your username and the password you generated when setting up Elasticsearch.

Web Scraping ELK Stack - Logging Into Kibana

Finally, you'll be taken to the Kibana dashboard.

Web Scraping ELK Stack - Kibana Dashboard


Integrating Logging into Your Web Scraper

For those of you unfamiliar with the process, we'll create a basic web scraper. Nothing fancy, just scrape a few quotes and add some basic logging info.

In your own personal ELK stack, this piece will differ, but the concepts remain the same.

Implementing a Robust Logging System in Python

import logging
import os

# make sure that we have a "logs" folder
os.makedirs("logs", exist_ok=True)

# configure logging
logging.basicConfig(
    filename="logs/scraper.log",
    level=logging.DEBUG,
    format="%(asctime)s - %(levelname)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)

logger = logging.getLogger(__name__)

# begin log messages
logger.info("Web scraper initialized.")

Web Scraping ELK Stack - Begin Scraper Logging


Sample Web Scraper with Logging

Next, we add some simple scraping logic to our logger.

We create a JsonFormatter class to automatically convert our data into JSON as we pass it into the log.

Most APIs prefer JSON, and Elasticsearch is no exception. Our second url variable is commented out.

We'll use it to automatically trigger errors later on. In production, you won't need this.

import logging
import json
import os
import requests
from bs4 import BeautifulSoup

# Ensure "logs" folder exists
os.makedirs("logs", exist_ok=True)

# Define a custom JSON formatter
class JsonFormatter(logging.Formatter):
    def format(self, record):
        log_entry = {
            "timestamp": self.formatTime(record, self.datefmt),
            "level": record.levelname,
            "message": record.getMessage(),
            "module": record.module,
            "function": record.funcName,
            "line": record.lineno
        }
        return json.dumps(log_entry)

# Configure logging with JSON formatter
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)

file_handler = logging.FileHandler("logs/scraper.log", mode="a", encoding="utf-8")
file_handler.setFormatter(JsonFormatter())

logger.addHandler(file_handler)

# Log the initialization message
logger.info(json.dumps({"message": "Web scraper initialized."}))

def scrape_page(url):
    try:
        # Keep the request inside the try block so connection failures get logged too
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        quotes = soup.find_all(class_="quote")
        for quote in quotes:
            log_data = {
                "quote": quote.text.strip(),
                "url": url
            }
            logger.info(json.dumps(log_data))
    except Exception as e:
        logger.error(json.dumps({"error": str(e), "url": url}))
    finally:
        logger.info(json.dumps({"message": f"Finished Scraping {url}"}))

if __name__ == "__main__":
    url = "https://quotes.toscrape.com"
    #url = "https://httpbin.org/500"
    scrape_page(url)

If we run the scraper and check the log, we'll see an entry for each quote found on the page. At the bottom of the log, you'll see the exit message we created.

Web Scraping ELK Stack - Logging the Quotes

Afterward, comment out url = "https://quotes.toscrape.com" and uncomment #url = "https://httpbin.org/500". Run the scraper again to create an error in the logs.
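For reference, an error entry in logs/scraper.log will look roughly like the line below. The timestamp, module, and line number will differ on your machine; the important part is the top-level level field, which is what we'll filter on in Kibana later.

{"timestamp": "2025-01-01 12:00:00,000", "level": "ERROR", "message": "{\"error\": \"500 Server Error: INTERNAL SERVER ERROR for url: https://httpbin.org/500\", \"url\": \"https://httpbin.org/500\"}", "module": "scraper", "function": "scrape_page", "line": 42}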


Sending Logs to ELK and Visualizing in Kibana

Next, we'll configure Logstash to process our logs and send them to Elasticsearch.

Configuring Logstash to Process Logs

First, we'll create a configuration file for Logstash and open it in a text editor.

We're using nano since it comes preinstalled on Ubuntu, but any text editor will do.

sudo nano /etc/logstash/conf.d/logstash.conf

Paste the following code into the file.

Later on, we'll create an index for our scraper logs.

Next, do the following:

  • Replace the path with the path to your desired log.
  • Replace scraper_logs with your index name.
  • Replace the password with the password you generated earlier.

This is the same one you used to log into Kibana.

input {
file {
path => "/home/nultinator/clients/ahmet/elk-stack/logs/scraper.log"
start_position => "beginning"
sincedb_path => "/dev/null"
codec => "json"
}
}

filter {
mutate {
add_field => { "[@metadata][index]" => "scraper_logs" }
}
}

output {
elasticsearch {
hosts => ["https://localhost:9200"]
user => "elastic"
password => "sjL9IHGPSFOo0fMyOd*f"
index => "%{[@metadata][index]}"
ssl_certificate_verification => false
}

stdout { codec => rubydebug }
}

Web Scraping ELK Stack - Logstash Configuration File
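After saving the file, restart Logstash so it picks up the new pipeline from /etc/logstash/conf.d.

sudo systemctl restart logstash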


Using Kibana to Analyze and Filter Logs

Now, from Kibana, click on Stack Management in the Navbar.

Web Scraping ELK Stack - Kibana Stack Management

Next, click Index Management.

Web Scraping ELK Stack - Kibana Index Management

Now, you can click the Create Index button and follow the prompts to create your index. Give this index the same name you used in your Logstash configuration file earlier.

Web Scraping ELK Stack - Kibana Create Index

If your logs aren't showing up, you'll need to click on Discover index.

Web Scraping ELK Stack - Kibana Discover Index

Hover over level on the sidebar and you should see a popup. Click Filter for level: ERROR.

Web Scraping ELK Stack - Kibana Finding Errors

Afterward, your display should filter out all non-error logs.

Web Scraping ELK Stack - Kibana Filtered Errors


Creating Visualizations

Next, we'll create a visualization to display our logs.

Using your Navbar, open Data Views.

If your index doesn't show up right away, click the button to create it and fill in the required fields.

If your index is called scraper_logs, your index pattern should be scraper_logs*. The * wildcard matches every index whose name starts with scraper_logs, so all of its documents are searchable from the data view.

Web Scraping ELK Stack - Kibana Create a Data View

Now, you might see an empty visuals dashboard. If so, click on Search entire time range.

Web Scraping ELK Stack - Kibana Empty Visuals

You can filter your data however you wish. Below, we filter using level.keyword:ERROR. This shows all logs with the level of ERROR.

Web Scraping ELK Stack - Kibana Error Visual


Automating Alerts for Scraper Failures

Great, so we've got our logs in Kibana. Now, it's time to create custom alerts.

This is where Kibana really shines.

With custom alerts, you don't need to dig through log files or wait until your data pipeline breaks.

You just need to check your email or log into Kibana.


Setting Up Encryption Keys

Before we configure our alerts, we need to set up encryption keys. Kibana uses these keys to encrypt the saved objects, such as rules and connector details, that alerting depends on.

cd into your kibana folder. Depending on your installation method, your path may vary.

cd /usr/share/kibana/bin

Create the keys. When doing this myself, I continually received an error without using sudo.

sudo ./kibana-encryption-keys generate

Open kibana.yml. Make sure your new keys are inside the file.

sudo nano /etc/kibana/kibana.yml
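The generate command prints three settings that need to end up in kibana.yml. They'll look something like this, with your own randomly generated values in place of the placeholders:

xpack.encryptedSavedObjects.encryptionKey: "<32+ character random string>"
xpack.reporting.encryptionKey: "<32+ character random string>"
xpack.security.encryptionKey: "<32+ character random string>"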

Web Scraping ELK Stack - Added Encryption Keys


Detecting Errors with Kibana

Once your keys are in the file, restart Kibana.

sudo systemctl restart kibana

Now, it's time to create a rule. Under Stack Management, click on Create rule.

Web Scraping ELK Stack - Kibana Create Rule

Select Index Threshold since we're setting a threshold for errors.

Web Scraping ELK Stack - Kibana Index Threshold

In the example below, when our count is above zero during the last 5 minutes, we want the rule to trigger. level:ERROR tells Kibana that we only want this alert to count error logs.

Web Scraping ELK Stack - Kibana Threshold Settings

Once all of your settings are correct, click Create rule.

Web Scraping ELK Stack - Kibana Create the Rule

If your rule hasn't triggered yet, run your scraper to trigger an error. As you can see in the image below, our rule executed successfully.

Web Scraping ELK Stack - Kibana Trigger the Rule


Sending Email Alerts for Critical Failures

Okay, so we've got our rule set up. Now, it's time to send an email alert.

To send email alerts, navigate to Connectors and click on Create connector.

Web Scraping ELK Stack - Kibana Create Connector

If you decide to upgrade to the Gold license, you can add any connector you want, like email or Slack. Without a Gold license, you're limited to Index and Server Log, which we already use to operate the stack.

Web Scraping ELK Stack - Kibana Choosing a Connector


Alternative: Setting Up Alerts in Kibana

When you created your rule, you already set up the alert.

To view it, you just need to click the "Alerts" tab.

If you look at the image below, our Alert Status is Active. This is because the scraper error tripped the alert.

Web Scraping ELK Stack - Kibana Viewing Your Alerts

When there are no alerts present, this tab will appear empty.

Web Scraping ELK Stack - Kibana No Active Alerts


Scaling Your ELK-Based Logging System

Now that ELK is humming along and catching your issues, you can take it to the next level.

You can grow your setup by adding more scrapers, logs and machines.

Instead of babysitting the stack to keep it working, you can focus on making it grow with your operation.

  • Cutting Log Noise: Only log when necessary in your scraper. You really only need to log when a scraper exited successfully or encountered an error.

  • Containerize: With Docker, you can build your stack locally and deploy it straight to your server with no tweaks or fuss. See the sketch after this list.

  • Add Other Monitoring Tools: You can pair this stack with Prometheus and Grafana for top quality insights into your scraper health.
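As a rough starting point for containerizing, here's a minimal Docker Compose sketch for a single-node, local-testing stack. It assumes you're comfortable disabling security for local experiments, that you'll pin whichever 8.x image tag you actually use, and that logstash.conf and your logs folder sit next to the compose file; none of this comes from the steps above, so treat it as a sketch rather than a drop-in config.

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.13.4
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false   # local testing only
      - ES_JAVA_OPTS=-Xms1g -Xmx1g
    ports:
      - "9200:9200"

  logstash:
    image: docker.elastic.co/logstash/logstash:8.13.4
    volumes:
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf:ro
      - ./logs:/logs:ro   # scraper log files
    depends_on:
      - elasticsearch

  kibana:
    image: docker.elastic.co/kibana/kibana:8.13.4
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch

With security disabled like this, the Logstash output would point at http://elasticsearch:9200 without a password, and the file path in logstash.conf would change to the mounted /logs/scraper.log.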

ELK stack gives you a strong, scalable base that will grow with you. The sky's the limit.


Tips and Troubleshooting

The following tips and troubleshooting steps are based on my own experience. If you've got other tips or suggestions, please let us know in the comments.

Resource Limitations

When I first set up my stack, my laptop would not stop crashing. Elasticsearch and Logstash are major resource hogs. If you don't manage their resources, they might crash yours too.

To limit Elasticsearch's resource usage, create a systemd override with systemctl edit. In the image below, I set MemoryMax (the hard cap on RAM) to 3GB. MemoryHigh tells systemd to start throttling and reclaiming the service's memory once usage passes 2.5GB.

sudo systemctl edit elasticsearch
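systemctl edit opens an override file for the service. The lines that produce the limits described above look roughly like this; treat the numbers as a starting point and tune them for your hardware:

[Service]
MemoryHigh=2500M
MemoryMax=3G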

Web Scraping ELK Stack - Setting Hardware Limits

We can edit the settings for Logstash the same way.

sudo systemctl edit logstash

The CPUQuota and CPUShares directives work for any systemd service as well, not just the ELK components.
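For example, a Logstash override that caps CPU as well as memory might look like this; the values are illustrative, not recommendations:

[Service]
MemoryHigh=1500M
MemoryMax=2G
CPUQuota=200%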

Web Scraping ELK Stack - Setting Limits For Logstash


Logstash Not Sending Documents

I also ran into file permission issues. Logstash needs to be able to read the logs. Make sure that both you and Logstash have permission to access the log file. Your scraper needs to write it and Logstash needs to read it.

The commands below make you the file's owner, hand group ownership to the existing logstash group, and grant the appropriate read and write permissions.

# Set your user as the owner (replace 'youruser' with your actual username)
sudo chown youruser /path/to/your/logs/scraper.log

# Give the 'logstash' group read access
sudo chgrp logstash /path/to/your/logs/scraper.log

# Grant read/write to owner (youruser) and read to group (logstash)
sudo chmod u+rw,g+r /path/to/your/logs/scraper.log

# Verify permissions
ls -l /path/to/your/logs/scraper.log
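If Logstash still isn't picking the file up, a quick way to confirm the permissions work is to try reading the file as the logstash user:

sudo -u logstash head /path/to/your/logs/scraper.log

If that fails with a permission error even though the file's permissions look right, check that every parent directory in the path is traversable by the logstash group.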

Conclusion

Implementing logging and alerts with the ELK Stack transforms your web scraping game.

Gone are the days of sifting through mountains of text files or discovering errors days late. With Elasticsearch, Logstash, and Kibana, you can spot issues in real time, visualize trends, and scale your solution as your scraping projects multiply.

Once you've tamed your logs, consider weaving in additional tools like the ScrapeOps Monitoring SDK to track performance metrics and streamline scheduling. Because who wants to stare at logs 24/7?

If you've got experience with ELK stack, let us know your opinion. At ScrapeOps, we love to talk shop and compare notes.


More Web Scraping Guides

In this guide, we went through how to set up the ELK stack and use it to monitor your web scrapers.

If you would like to learn more about web scraping in general, then be sure to check out The Web Scraping Playbook, or take a look at one of our other in-depth guides.

Ready to take your scraping game to the next level? Check out the ScrapeOps Proxy API Aggregator for seamless proxy switching, or streamline your workflow with the ScrapeOps Scraper Scheduler.