Making sense of access logs with the Modern Streaming Stack

Building a DIY analytics stack with Apache Kafka, Filebeat, Faust, Apache Pinot, and Superset.

Mark Needham
Tributary Data

If you run a website, you’ll likely have access logs, and the ELK (Elasticsearch/Logstash/Kibana) Stack has become the go-to choice for building analytics on top of the data stored in those log files.

But it’s sometimes fun to live on the wild side and explore alternative approaches to see what’s out there.

Earlier this year, Dunith coined the term Modern Streaming Stack (or Real-Time Analytics Stack, as I like to think of it!), a set of tools that you can use to build real-time analytics applications. In this blog post, we’re going to use some of the tools that Dunith describes to analyse HTTP access log files.

We’ll cheat a bit and use some of the tools from the ELK Stack as well — we are working with log files after all!😏

The goal of this blog post is this:

We have a bunch of log files that contain useful data and we want to make sense of that data.

Before we dive into the detail, below is an architecture diagram of what we’re going to build:

An architecture diagram for analysing log files

Let’s break down the data flows in this system.

1. Access logs are picked up by the Filebeat log shipper, which creates an event per log entry in Apache Kafka.

2. The Faust Stream processor takes those events and enriches them by parsing the log messages and pulling out the various components before writing the results to a new stream.

3. Apache Pinot ingests the enriched stream from Apache Kafka.

4. The dashboarding tool executes queries against Apache Pinot to give an overview of the state of our web application.

The GitHub Repository

You can find all the examples used in this blog post at github.com/mneedham/analysing-log-files.

If you have any questions please send me a message on Twitter @markhneedham and I’ll try to help.

Setup

We’re going to use Docker Compose to spin up all the components used in this application.

You can find the Docker Compose file at github.com/mneedham/analysing-log-files/blob/main/docker-compose.yml. We’ll go through the config for each component in that file as we get to them.

We can launch everything by running docker-compose up.

This blog post is heavily based on Python tooling, in particular the libraries described in the requirements.txt file.

cat requirements.txt
requirements.txt

You can install these dependencies by running the following:

pip install -r requirements.txt

Generating log messages

We’ll be using Kirit Basu’s excellent Apache Log Generator to generate log messages. I’ve adjusted the script a bit so that it generates logs with timestamps that are closer together, and you can run the generator like this:

python apache-fake-log-gen.py

This will generate a single message, which we can see below:

An HTTP access log message
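The generator produces entries in the familiar Apache combined log format, so a generated line looks roughly like this (the values here are made up for illustration):

13.21.196.182 - - [27/Jun/2022:10:31:02 -0700] "GET /list HTTP/1.0" 200 4962 "http://example.com/" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) Gecko/20100101 Firefox/3.8"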

If we want to generate an infinite stream of log entries, we’ll need to pass in -n 0, like this:

python apache-fake-log-gen.py -n 0 | 
tee access-logs/access_logs_$(date +%s).log

This combination of commands will stream the log entries to the command line as well as write them to a file in the access-logs directory.

The next step is to get log entries out of the log files and into Apache Kafka.

Shipping log entries to Apache Kafka

Elastic has a nice tool called Filebeat that we can use. Filebeat is one of the tools that comprise the Elastic Stack and is described as a “lightweight shipper for logs”.

The Docker Compose Config for this component is shown below:

Filebeat Docker Compose Config

You would usually have it ship logs to Elasticsearch or Logstash, but we’re going to ship them to Kafka instead.

Note that we’re mounting the host file system’s access-logs directory to /var/log/apache2, which we’ll refer to next.

We’ll need to define both an input and an output in the Filebeat config file, /config/filebeat.yml, which looks like this:

YAML configuration for Filebeat

Based on this configuration, Filebeat will take any logs written to /var/log/apache2/ and will create an event per log entry in the access topic in Kafka. The Docker Compose Config for Kafka is shown below:

Kafka/Zookeeper Docker Compose Config

Viewing log entries in Kafka

Once we’ve started writing log entries to Kafka, we can view them using the kcat command line tool. The command below will print out one of the messages in the access stream:

kcat -C -b localhost:9092 -t access -c1 | jq '.'
A message in the access stream

We can see that Filebeat has added a bunch of metadata to the event, and in the middle, under the “message” property, is the log message itself.

The next step is to extract the various components of that message so that we can query them.

Parsing log messages

We can extract these components with help from the excellent apachelogs library. The code below shows how we can use the library to extract the components from the message that we saw in Kafka:

Apache HTTP Access Logs Parser
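As a rough sketch of that parsing step, assuming the combined log format that the generator produces and reusing the made-up entry from earlier:

from apachelogs import LogParser, COMBINED

# The access logs are in the Apache combined log format
parser = LogParser(COMBINED)

message = '13.21.196.182 - - [27/Jun/2022:10:31:02 -0700] "GET /list HTTP/1.0" 200 4962 "http://example.com/" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) Gecko/20100101 Firefox/3.8"'
output = parser.parse(message)

print(output.remote_host)   # 13.21.196.182
print(output.final_status)  # 200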

The values that we’re interested in can be extracted via the headers_in attribute:

print(output.headers_in)
Headers in the access log message

And directives:

print(output.directives)
Directives in the access log message

We can tidy up that code so that it returns a single dictionary that we can join to the rest of the fields that we got from the event in the access Kafka stream.

The full code for the access log parser is available in access_log_parser.py.
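A sketch of what that refactored parser might look like is shown below; this is the idea rather than the exact contents of access_log_parser.py, and the dictionary keys are my own choices:

from apachelogs import LogParser, COMBINED

class AccessLogParser:
    def __init__(self):
        self.parser = LogParser(COMBINED)

    def parse(self, message):
        entry = self.parser.parse(message)
        # Flatten the parsed entry into a single dictionary that can be
        # merged with the other fields from the Filebeat event
        return {
            "remote_host": entry.remote_host,
            "request_time": entry.request_time.isoformat(),
            "request_line": entry.request_line,
            "status": entry.final_status,
            "bytes_sent": entry.bytes_sent,
            "referer": entry.headers_in["Referer"],
            "user_agent": entry.headers_in["User-Agent"],
        }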

After we’ve done that we’ll want to write the new event to a new stream. But how do we do that?

Extracting message components using Faust

We need a stream processor, and since we’re already using Python, let’s give Faust a try. Faust describes itself as a library for building streaming applications in Python, which is exactly what we need.

As far as I understand, it implements maybe 50% of the functionality of Kafka Streams, but that’s easily good enough for what we want to do.

At this point, I think it’s useful to note that there are two versions of the library: Faust and Faust-Streaming. Faust hasn’t had a release in more than two years and doesn’t support Python 3.10, so we’re going to use Faust-Streaming.

Our Faust app is defined below in a file called app.py:

Faust Application
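A stripped-down sketch of the same idea is shown below; the topic names follow the description in this post, while the app name and serializer settings are my own assumptions rather than the exact contents of app.py:

import json

import faust

from access_log_parser import AccessLogParser

# An assumed app name; the broker address matches the kcat commands above
app = faust.App("access-log-enricher", broker="kafka://localhost:9092")
access_topic = app.topic("access", value_serializer="raw")
enriched_topic = app.topic("enriched-access-logs", value_serializer="json")
parser = AccessLogParser()

@app.agent(access_topic)
async def enrich(stream):
    async for raw_event in stream:
        event = json.loads(raw_event)
        # Parse the raw log line and merge the extracted fields into the event
        event.update(parser.parse(event["message"]))
        await enriched_topic.send(value=event)

Using the raw serializer on the input topic means we deserialise the Filebeat JSON ourselves before merging in the parsed fields.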

The AccessLogParser imported at the top of the file is a refactored version of the code that we wrote to extract the components from an access log entry in the previous section.

We create an agent to listen to the access topic and we then iterate over every message received on that stream. Next, we extract the components of the access log message, before we write the enriched event to the enriched-access-logs stream.

We can run a local Faust agent by executing the following command:

faust -A app worker -l info

Leave that running for a few seconds and the messages will start streaming into our new topic. We can check that the messages have made their way into that stream using kcat again:

kcat -C -b localhost:9092 -t enriched-access-logs -c1 | jq '.'
An event in the enriched-access-logs stream

Now it’s time to do some analysis of these logs. If we were only doing ad hoc analysis, we could probably do it with a stream processor, but for a longer-term solution, we’ll want a purpose-built serving layer.

Importing the enriched stream into Apache Pinot

Apache Pinot is one such serving layer, initially created at LinkedIn to answer OLAP queries with low latency. It has support for ingesting data from Kafka streams, which is what we’re going to do next.

The Docker Compose configuration for this component is shown below:

Pinot Docker Compose Config

Pinot stores data in tables. Tables have schemas that define the columns, their types, and how those columns are categorised (as dimensions, metrics, or date-time columns). The schema for our table, schema.json, is shown below:

Apache Pinot Schema

We’re only going to be extracting a subset of the fields included in events published to the enriched-access-logs stream.

The corresponding table config, table.json, is shown below:

Apache Pinot Table Config

The most relevant part of this config is ingestionConfig.transformConfigs. In this section, we define functions that pull nested data out of the events in the enriched-access-logs stream and map it to the columns defined in our schema.

https://youtu.be/2GKTLW_vwu8

Let’s add this table to Pinot:

docker run -v $PWD/pinot/config:/config \
--network analysing-log-files_default \
apachepinot/pinot:0.11.0 \
AddTable \
-schemaFile /config/schema.json \
-tableConfigFile /config/table.json \
-controllerHost pinot-controller \
-exec

Once that table is created, Pinot will automatically start ingesting data. If we navigate to http://localhost:9000, we can write SQL queries against the access_logs table.

Let’s start by determining the most used browser:

Find the top 10 browsers
The top 10 browsers

Firefox was the most popular at the time I ran the query. We could include the browser version as well:

Find the top 10 browsers and versions
The top 10 browsers and versions

Firefox 3.8 is out in front with a couple of Safari versions quite far behind in 2nd and 3rd place.
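If you’d rather run these queries from Python than from the Pinot query console, the pinotdb driver works as well. Here’s a rough sketch, assuming the schema defines a browser column and that the broker’s query port (8099 by default) is mapped to localhost:

from pinotdb import connect

# Connect to the Pinot broker's SQL endpoint (adjust the port if your
# Docker Compose file maps it differently)
conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
curs = conn.cursor()

curs.execute("""
    SELECT browser, COUNT(*) AS requests
    FROM access_logs
    GROUP BY browser
    ORDER BY COUNT(*) DESC
    LIMIT 10
""")

for browser, requests in curs:
    print(browser, requests)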

It’s all very well writing these SQL queries, but we need to automate this process so that we can see how the results of the queries change over time. We can probably come up with a more pleasing way of displaying the data as well.

This is where dashboards come in and that’s what we’ll be doing in the next section.

Building a Dashboard with Apache Superset

Apache Superset describes itself as a modern data exploration and visualization platform.

It integrates with Apache Pinot through SQLAlchemy, and once we’ve got the Pinot integration configured, we can start building charts that we’ll then put together into a dashboard.

The Docker Compose configuration for this component is shown below:

Superset Docker Compose Config

Before we can create our first dashboard, we’ll need to add an admin user:

Adding an Admin user to Superset

And then run the following to initialise Superset:

Initialise Superset

The above instructions and other useful information about integrating Superset and Pinot are included in the Pinot documentation.

Once we’ve done that, navigate to localhost:8088 and log in using the admin/admin credentials.

Once we’re in, we’ll need to add a database and a dataset:

Adding a database and dataset in Superset

The URI for the database is shown below:

pinot+http://pinot-controller:8000/query/sql?controller=http://pinot-controller:9000/
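As a quick sanity check, pinotdb also registers a pinot dialect with SQLAlchemy, so the same connection string can be tested from Python. Note that the pinot-controller hostname only resolves inside the Docker network, so from the host machine you’d substitute localhost and the mapped port:

from sqlalchemy import create_engine, text

# The same connection string that Superset will use
engine = create_engine(
    "pinot+http://pinot-controller:8000/query/sql?controller=http://pinot-controller:9000/"
)

with engine.connect() as conn:
    print(conn.execute(text("SELECT COUNT(*) FROM access_logs")).fetchone())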

Now it’s time to start adding charts. I’ve created a few and added them to a simple dashboard, which you can see below:

Superset Dashboard

This dashboard shows the total requests, the types of requests, the most popular browsers, and the most popular operating systems.

I’ve sped it up so that the gif didn’t get too big — in a real dashboard, you wouldn’t be refreshing this frequently!

You can recreate that dashboard along with all its charts by importing this dashboard file into Superset. You can also add other charts of your own; I’ve only scratched the surface of what you can do with Superset in this blog post.

Summary

Hopefully, this post has shown you that it is possible to build your own stack to analyse your HTTP access log files.

The thing I like best about using the modern streaming stack is that you can swap out individual tools if you don’t like one of the choices in someone else’s architecture. The one constraint in this particular blog post is that Filebeat ties us to the Kafka protocol on the streaming side, but since Redpanda speaks that protocol, we could presumably use it if we wanted.

But apart from that, we could easily use a different stream processor (e.g. Kafka Streams), a different serving layer (e.g. Druid), and even a different front end (e.g. Grafana).

And I think that’s pretty awesome! 😊
