Making sense of access logs with the Modern Streaming Stack
Building a DIY analytics stack with Apache Kafka, Filebeat, Faust, Apache Pinot, and Superset.
If you run a website, you’ll likely have access logs, and the ELK (Elasticsearch/Logstash/Kibana) Stack has become the go-to choice for building analytics on top of the data stored in those log files.
But it’s sometimes fun to live on the wild side and explore alternative approaches to see what’s out there.
Earlier this year Dunith coined the Modern Streaming Stack (or Real-Time Analytics Stack as I like to think of it!), a set of tools that you can use to build real-time analytics applications. In this blog post, we’re going to use some of the tools that Dunith describes to analyse HTTP access log files.
We’ll cheat a bit and use some of the tools from the ELK Stack as well — we are working with log files after all!😏
The goal of this blog post is this:
We have a bunch of log files that contain useful data and we want to make sense of that data.
Before we dive into the details, below is an architecture diagram of what we’re going to build:
Let’s break down the data flows in this system.
1. Access logs are picked up by the Filebeat log shipper, which creates an event per log entry in Apache Kafka.
2. The Faust Stream processor takes those events and enriches them by parsing the log messages and pulling out the various components before writing the results to a new stream.
3. Apache Pinot ingests the enriched stream from Apache Kafka.
4. The dashboarding tool executes queries against Apache Pinot to give an overview of the state of our web application.
The GitHub Repository
You can find all the examples used in this blog post at github.com/mneedham/analysing-log-files.
If you have any questions please send me a message on Twitter @markhneedham and I’ll try to help.
Setup
We’re going to use Docker Compose to spin up all the components used in this application.
You can find the Docker Compose file at github.com/mneedham/analysing-log-files/blob/main/docker-compose.yml. We’ll go through the config for each component in that file as we get to them.
We can launch everything by running docker-compose up.
This blog post is heavily based on Python tooling, in particular the libraries described in the requirements.txt file.
cat requirements.txt
You can install these dependencies by running the following:
pip install -r requirements.txt
Generating log messages
We’ll be using Kirit Basu’s excellent Apache Log Generator to generate log messages. I’ve adjusted the script a bit to generate logs with more compressed timestamps and you can run the generator like this:
python apache-fake-log-gen.py
This will generate a single message, which we can see below:
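The generator emits entries in Apache’s combined log format; an illustrative line (not the generator’s exact output) looks like this:

```
13.22.109.241 - - [25/Oct/2022:14:31:22 +0000] "GET /list HTTP/1.0" 200 4968 "http://example.com/" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36"
```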
If we want to generate an infinite stream of log entries, we’ll need to pass in -n 0, like this:
python apache-fake-log-gen.py -n 0 |
tee access-logs/access_logs_$(date +%s).log
This combination of commands will stream the log entries to the command line as well as write them to a file in the access-logs directory.
The next step is to get log entries out of the log files and into Apache Kafka.
Shipping log entries to Apache Kafka
Elastic has a nice tool called Filebeat that we can use. Filebeat is one of the tools that comprise the Elastic Stack and is described as a “lightweight shipper for logs”.
The Docker Compose Config for this component is shown below:
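The original embed isn’t reproduced here, but a minimal sketch of what that service definition might look like (the image version and paths are my assumptions, not necessarily the repository’s exact config):

```yaml
filebeat:
  image: docker.elastic.co/beats/filebeat:8.4.0
  container_name: filebeat
  volumes:
    # mount the host's access-logs directory into the container
    - ./access-logs:/var/log/apache2
    # mount the Filebeat configuration file
    - ./config/filebeat.yml:/usr/share/filebeat/filebeat.yml
  depends_on:
    - kafka
```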
You would usually have it ship logs to Elasticsearch or Logstash, but we’re going to ship them to Kafka instead.
Note that we’re mounting the host file system’s access-logs directory to /var/log/apache2, which we’ll refer to next.
We’ll need to define both input and output sources in the Filebeat config file, /config/filebeat.yml, which looks like this:
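A sketch of such a config, assuming the container paths from the volume mount described above:

```yaml
filebeat.inputs:
  - type: log
    paths:
      - /var/log/apache2/*.log

output.kafka:
  hosts: ["kafka:9092"]
  topic: access
```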
Based on this configuration, Filebeat will take any logs written to /var/log/apache2/ and will create an event per log entry in the access topic in Kafka. The Docker Compose Config for Kafka is shown below:
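Again as a hedged sketch rather than the repository’s exact file, a single-broker Kafka setup might be defined like this:

```yaml
zookeeper:
  image: zookeeper:3.8.0
  container_name: zookeeper

kafka:
  image: wurstmeister/kafka:latest
  container_name: kafka
  ports:
    - "9092:9092"
  environment:
    KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
    # a real setup would usually advertise separate in-network and
    # host-facing listeners; a single listener keeps this sketch simple
    KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
  depends_on:
    - zookeeper
```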
Viewing log entries in Kafka
Once we’ve started writing log entries to Kafka, we can view them using the kcat command line tool. The command below will print out one of the messages in the access stream:
kcat -C -b localhost:9092 -t access -c1 | jq '.'
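The full output isn’t reproduced here, but the shape of a Filebeat event is roughly this (trimmed and illustrative):

```json
{
  "@timestamp": "2022-10-25T14:31:23.001Z",
  "agent": {"type": "filebeat", "version": "8.4.0"},
  "log": {"file": {"path": "/var/log/apache2/access_logs_1666708282.log"}},
  "input": {"type": "log"},
  "message": "13.22.109.241 - - [25/Oct/2022:14:31:22 +0000] \"GET /list HTTP/1.0\" 200 4968 \"-\" \"Mozilla/5.0\""
}
```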
We can see that Filebeat has added a bunch of metadata to the event and in the middle under the “message” property is the log message.
The next step is to extract the various components of that message so that we can query them.
Parsing log messages
We can extract these components with help from the excellent apachelogs library. The code below shows how we can use the library to extract the components from the message that we saw in Kafka:
The values that we’re interested in can be extracted by calling headers_in:
print(output.headers_in)
And directives:
print(output.directives)
We can tidy up that code so that it returns a single dictionary that we can join to the rest of the fields that we got from the event in the access Kafka stream.
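One way to do that tidy-up is a small helper function. The function name and field choices below are my own, not necessarily what the repository uses:

```python
def entry_to_dict(entry):
    """Flatten an apachelogs LogEntry into a single flat dictionary."""
    row = {
        "remote_host": entry.remote_host,
        "request_time": str(entry.request_time),
        "request_line": entry.request_line,
        "status": entry.final_status,
        "bytes_sent": entry.bytes_sent,
    }
    # Prefix the request headers so they can't clash with the fields above
    for name, value in entry.headers_in.items():
        row[f"header_{name.lower().replace('-', '_')}"] = value
    return row
```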
The full code for the access log parsing is available in access_log_parser.py.
After we’ve done that we’ll want to write the new event to a new stream. But how do we do that?
Extracting message components using Faust
We need a stream processor and since we’re already using Python, let’s give Faust a try. Faust defines itself as a library for building streaming applications in Python, which is exactly what we need.
As far as I understand, it has implemented roughly 50% of the functionality of Kafka Streams, but that will easily be good enough for what we want to do.
At this point, I think it’s useful to note that there are two versions of the library: Faust and Faust-Streaming. Faust hasn’t had a release for more than two years and doesn’t support Python 3.10, so we’re going to use Faust-Streaming.
Our Faust app is defined below in a file called app.py:
The AccessLogParser referenced on the second line is a refactored version of the code that we wrote to extract the components from an access log entry in the previous section.
We create an agent to listen to the access topic and we then iterate over every message received on that stream. Next, we extract the components of the access log message, before we write the enriched event to the enriched-access-logs stream.
We can run a local Faust agent by executing the following command:
faust -A app worker -l info
Leave that running for a few seconds and the messages will start streaming into our new topic. We can check that the messages have made their way into that stream using kcat again:
kcat -C -b localhost:9092 -t enriched-access-logs -c1 | jq '.'
Now it’s time to do some analysis of these logs. If we were only doing some ad hoc analysis we could probably get away with a stream processor alone, but for a longer-term solution, we’ll want a purpose-built serving layer.
Importing the enriched stream into Apache Pinot
Apache Pinot is one such serving layer, initially created at LinkedIn to answer OLAP queries with low latency. It has support for ingesting data from Kafka streams, which is what we’re going to do next.
The Docker Compose configuration for this component is shown below:
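As with the other services, the embedded config isn’t reproduced here; a sketch of a minimal Pinot setup (version and ports are assumptions) might look like this:

```yaml
pinot-controller:
  image: apachepinot/pinot:0.11.0
  command: "StartController -zkAddress zookeeper:2181"
  ports:
    - "9000:9000"
  depends_on:
    - zookeeper

pinot-broker:
  image: apachepinot/pinot:0.11.0
  command: "StartBroker -zkAddress zookeeper:2181"
  depends_on:
    - pinot-controller

pinot-server:
  image: apachepinot/pinot:0.11.0
  command: "StartServer -zkAddress zookeeper:2181"
  depends_on:
    - pinot-broker
```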
Pinot stores data in tables. Tables have schemas that define columns, column types, and also categorise those columns. The schema for our table, schema.json, is shown below:
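The column names below are my own guesses at a plausible subset rather than the repository’s exact schema, but they show how Pinot categorises columns into dimensions, metrics, and date-time fields:

```json
{
  "schemaName": "access_logs",
  "dimensionFieldSpecs": [
    {"name": "remoteHost", "dataType": "STRING"},
    {"name": "method", "dataType": "STRING"},
    {"name": "browser", "dataType": "STRING"},
    {"name": "operatingSystem", "dataType": "STRING"}
  ],
  "metricFieldSpecs": [
    {"name": "bytesSent", "dataType": "LONG"}
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "ts",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ]
}
```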
We’re only going to be extracting a subset of the fields included in events published to the enriched-access-logs stream.
The corresponding table config, table.json, is shown below:
The most relevant part of this config is ingestionConfig.transformConfigs. In this section, we define functions that pull nested data out of the events in the enriched-access-logs stream and map it to the columns defined in our schema.
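A trimmed sketch of what that table config might contain (the column names and JSON paths are illustrative guesses):

```json
{
  "tableName": "access_logs",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "ts",
    "schemaName": "access_logs",
    "replicasPerPartition": "1"
  },
  "tableIndexConfig": {
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.topic.name": "enriched-access-logs",
      "stream.kafka.broker.list": "kafka:9092",
      "stream.kafka.consumer.type": "lowlevel",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder"
    }
  },
  "ingestionConfig": {
    "transformConfigs": [
      {
        "columnName": "browser",
        "transformFunction": "jsonPathString(user_agent, '$.name')"
      }
    ]
  },
  "tenants": {},
  "metadata": {}
}
```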
Let’s add this table to Pinot:
docker run -v $PWD/pinot/config:/config \
--network analysing-log-files_default \
apachepinot/pinot:0.11.0 \
AddTable \
-schemaFile /config/schema.json \
-tableConfigFile /config/table.json \
-controllerHost pinot-controller \
-exec
Once that table is created Pinot will automatically start ingesting data. If we navigate to http://localhost:9000 we can then write SQL queries against the access_logs table.
Let’s start by determining the most used browser:
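Assuming the browser name landed in a STRING column called browser, a query along these lines does the job:

```sql
SELECT browser, COUNT(*) AS requests
FROM access_logs
GROUP BY browser
ORDER BY requests DESC
LIMIT 10
```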
Firefox is the most popular at the time I ran the query. We could include the browser version as well:
Firefox 3.8 is out in front with a couple of Safari versions quite far behind in 2nd and 3rd place.
It’s all very well writing these SQL queries, but we need to automate this process so that we can see how the results of the queries change over time. We can probably come up with a more pleasing way of displaying the data as well.
This is where dashboards come in and that’s what we’ll be doing in the next section.
Building a Dashboard with Apache Superset
Apache Superset describes itself as a modern data exploration and visualization platform.
It integrates with Apache Pinot through SQLAlchemy and once we’ve got the Pinot integration configured, we can start to build charts that will then be put together into a dashboard.
The Docker Compose configuration for this component is shown below:
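A sketch of a minimal Superset service (image and environment are assumptions):

```yaml
superset:
  image: apache/superset:latest
  container_name: superset
  ports:
    - "8088:8088"
  environment:
    # newer Superset releases refuse to start without an explicit secret key
    SUPERSET_SECRET_KEY: "change-me"
```

Note that the image also needs the pinotdb SQLAlchemy driver installed (for example via a custom Dockerfile) before the Pinot connection will work.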
Before we can create our first dashboard, we’ll need to add an admin user:
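Assuming the Superset container is named superset, the standard command for this is:

```shell
docker exec -it superset superset fab create-admin \
  --username admin \
  --firstname Superset \
  --lastname Admin \
  --email admin@superset.com \
  --password admin
```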
And then run the following to initialise Superset:
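Again assuming the container is named superset:

```shell
docker exec -it superset superset db upgrade
docker exec -it superset superset init
```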
The above instructions and other useful information about integrating Superset and Pinot are included in the Pinot documentation.
Once we’ve done that, navigate to localhost:8088 and log in using the admin/admin credentials.
Once we’re in we’ll need to add a database and dataset:
The URI for the database is shown below:
pinot+http://pinot-controller:8000/query/sql?controller=http://pinot-controller:9000/
Now it’s time to start adding charts. I’ve created a few and added them to a simple dashboard, which you can see below:
This dashboard shows the total requests, the types of requests, the most popular browsers, and the most popular operating systems.
I’ve sped it up so that the gif didn’t get too big — in a real dashboard, you wouldn’t be refreshing this frequently!
You can import that dashboard along with all its charts by importing this dashboard file into Superset. You can also add other charts to the dashboard because I’ve only skimmed the surface of what you can do with Superset in this blog post.
Summary
Hopefully, this post has shown you that it is possible to build your own stack to analyse your HTTP access log files.
The thing I like best about using the modern streaming stack is that you can substitute different tools if you don’t like one of the tools in someone else’s architecture. The one constraint in this particular blog post is that Filebeat’s message-broker output only supports Kafka, but since Redpanda is Kafka API-compatible, we could presumably use that if we wanted.
But apart from that, we could easily use a different stream processor (e.g. Kafka Streams), a different serving layer (e.g. Druid), and even a different front end (e.g. Grafana).
And I think that’s pretty awesome! 😊