Analysing GitHub Events with Apache Pinot and Streamlit
Building a real-time analytics dashboard on streaming data
Just over a year ago Kenny Bastani wrote a blog post showing how to analyse GitHub Events using Apache Pinot and Kafka. In this blog post, we will build on Kenny’s work, but instead of using Apache Superset to visualise the data, we’ll be using a Python library called Streamlit.
The code used in this blog post is available in the mneedham/pinot-github-events repository if you want to follow along or try it out afterwards:
To recap, Apache Pinot is a data store that’s optimised for user-facing analytical workloads. It’s good at running real-time queries on event data.
We’re going to launch a local instance of Pinot using the GitHubEventsQuickStart, as shown in the Docker Compose file below:
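The compose file in the repository may differ in its details; a minimal sketch, assuming the official apachepinot/pinot image and passing the token through from the environment, might look like this:

```yaml
version: "3"
services:
  pinot:
    image: apachepinot/pinot:0.7.1  # the image tag is an assumption; use the one from the repository
    container_name: pinot
    command: "GitHubEventsQuickStart -personalAccessToken ${GITHUB_TOKEN}"
    ports:
      - "9000:9000"  # Pinot Controller / Query Console
      - "8000:8000"  # Pinot Broker (the port the quick start uses)
```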
Before we launch Pinot we’ll need to create a GitHub personal access token and make it available as the GITHUB_TOKEN environment variable. I put my token into a .env file:
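Mine contains a single line (the value shown is a placeholder):

```
GITHUB_TOKEN=<your-personal-access-token>
```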
If we then run docker-compose up, it'll spin up a single Docker container running Pinot's components (Controller, Server, Broker, and Zookeeper), as well as Kafka.
After those components are running, it’ll start querying the GitHub Events API, write the results to a Kafka topic, and then import the corresponding events into the pullRequestMergedEvents real-time table with the schema described in Kenny’s post:
Streamlit is a Python tool that makes it easy to build data-based single page web applications. I first came across it at the end of 2019 and have used it for pretty much every data-based side project that I’ve worked on since then.
It is available as a PyPI package and can be installed by running the following command:
pip install streamlit
Once we've done that, we'll create a file called app.py with the following contents:
Let’s run our Streamlit app:
streamlit run app.py
We’ll see the following output:
You can now view your Streamlit app in your browser.

Local URL: http://localhost:8501
Network URL: http://192.168.86.26:8501
If we navigate to that URL we’ll see the following web app:
It's obviously very simple so far, but it is up and running. If we make a change to app.py, the web app will show the following message in the top right-hand corner:
We’ll tell it to Always rerun so that any changes we make are automatically reflected in the UI. I find this feature really useful for incrementally building data apps.
We can install the Pinot driver and Pandas by running the following:
pip install pinotdb pandas
Now we can update our Streamlit app to run a simple query against the pullRequestMergedEvents table, put the results into a Pandas DataFrame, and then render the table to the screen:
This query finds the organisations that have had the most pull requests submitted since we started the Pinot Docker container. If we go back to our web browser, we’ll now see the following output:
We can see some well-known organisations in that list, like Microsoft, Google, Elastic, and Apache. The most active, however, is Mu-L, which seems to contain forks of popular repositories.
If we refresh the Streamlit app we’ll get an updated table based on the last event that’s been taken off the Kafka topic and applied to Pinot.
In this query, we found the most active organisations by counting the number of events (PRs), but we could also find the organisations or even repositories that have the most comments, lines of code changed, or indeed any of the metric columns.
We can also filter by one of the time columns to only include events that happened, for example, in the last week:
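One way to express a last-week filter is to compute a millisecond cutoff in Python and interpolate it into the query. This is a sketch; the mergedTimeMillis column name is taken from the schema in Kenny's post:

```python
import time

# Millisecond timestamp for "7 days ago" -- the table's time column
# (mergedTimeMillis) stores epoch milliseconds
one_week_ago_ms = int((time.time() - 7 * 24 * 60 * 60) * 1000)

WEEKLY_QUERY = f"""
SELECT organization, count(*) AS prCount
FROM pullRequestMergedEvents
WHERE mergedTimeMillis > {one_week_ago_ms}
GROUP BY organization
ORDER BY prCount DESC
LIMIT 10
"""
```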
If we refresh our Streamlit app we’ll see a new table underneath the other one:
So far we've hardcoded the metric that we want to group by and the period of time that we're interested in, but we can do better. Streamlit has form controls that let us dynamically change the selected metric and time range.
The code below shows how we can build a form with dropdowns to select the metric and time range:
You can see how this works in the gif below:
Each time that we press the Refresh button the query is run again against the Pinot Broker.
This is pretty cool, but we can go further and generate word clouds of active organisations, repositories, and users, like Kenny did in his blog post. We'll use the word_cloud library to do this. We won't show the code in this blog post, but you can find it in the overview.py file:
Below is a screenshot of the Streamlit app rendering word clouds instead of tables:
We can also write queries that drill down into particular organisations or repositories to see what’s going on at that level.
For example, the screenshot below shows the activity in the apache organisation in the last 7 days:
In this blog post we've shown how to build a real-time analytics dashboard using Streamlit and Apache Pinot, based on data ingested from the GitHub Events API.
We’ve only touched on some of the types of queries that we can run over this dataset, but hopefully it’s given you an idea of how you might be able to use these tools on your own datasets.
Below are some resources if you want to learn more about either of these tools:
Apache Pinot
Getting Started: https://docs.pinot.apache.org/getting-started
Streamlit
Getting Started: https://docs.streamlit.io/en/stable/getting_started.html
Community Forum: https://discuss.streamlit.io/