
Real-time Web Analytics with Arroyo

Working with real-time data can be daunting. We're working to solve that by building a new stream processing engine that's easy enough for anyone to use. So how easy is it to solve real-world streaming problems with Arroyo today? I decided to find out.

Micah Wylde

CEO of Arroyo


You can find all of the code for the tutorial on GitHub: ArroyoSystems/analytics-tutorial.

Walkthrough

Ngrok

To run this locally, you will need to create a tunnel from the internet to your computer. For the tutorial we used ngrok, but there are many similar tools.
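With ngrok installed, and assuming Vector will listen on local port 8080 (a placeholder — use whatever port your Vector HTTP source actually binds), starting the tunnel looks like:

$ ngrok http 8080

ngrok prints a public URL; that URL is the endpoint the analytics snippet should post to.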

Instrumenting our site

We start by instrumenting our site. For the tutorial, we used our homepage (https://arroyo.dev) to demonstrate. We need some way to get HTTP events from our visitors' browsers to our server.

While there are many ways to accomplish this, for simplicity we used a small JavaScript snippet, wired up to our Next.js application to fire on changes to the Next.js router.

That code can be found in analytics.ts.

It can be integrated into your application by adding that component somewhere in your source code (we added ours in src/app/analytics.tsx):

<ArroyoPageview endpoint="<ngrok endpoint>" />
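The real snippet lives in analytics.ts in the tutorial repo; as a rough sketch of what such a beacon looks like (the type, function names, and event fields here are illustrative, not the tutorial's actual schema):

```typescript
// Illustrative sketch of a pageview beacon. The actual implementation is in
// analytics.ts in the tutorial repo, and its event schema may differ.
type PageviewEvent = {
  path: string;      // route that was viewed
  referrer: string;  // document.referrer at the time of the view
  timestamp: string; // ISO-8601 event time
};

function buildPageview(path: string, referrer: string, now: Date): PageviewEvent {
  return { path, referrer, timestamp: now.toISOString() };
}

// Fire-and-forget POST to the collection endpoint (the ngrok tunnel in front of Vector).
function sendPageview(endpoint: string, event: PageviewEvent): void {
  void fetch(endpoint, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(event),
    keepalive: true, // let the request survive page navigations
  });
}
```

In a Next.js app, `sendPageview` would be invoked from an effect subscribed to router changes, which is what the `<ArroyoPageview>` component above wraps up.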

Vector

Vector is a great swiss-army knife for connecting various data systems and shuttling observability data throughout your data infra. We use its HTTP server source to collect the analytics events and its Kafka sink to expose them to Arroyo, using this vector.toml config file.
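The shape of that pipeline is roughly the following sketch — addresses, topic names, and IDs here are placeholders; the tutorial's actual vector.toml is in the repo:

```toml
# Sketch of a Vector pipeline: HTTP in, Kafka out.
# Placeholder values; see the tutorial's vector.toml for the real config.
[sources.pageviews]
type = "http_server"
address = "0.0.0.0:8080"
decoding.codec = "json"

[sinks.kafka_out]
type = "kafka"
inputs = ["pageviews"]
bootstrap_servers = "kafka:9092"
topic = "pageviews"
encoding.codec = "json"
```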

Kafka

Kafka is a distributed log that works great with Arroyo because it enables exactly-once processing. Here we use it as both source and sink: to get data into Arroyo, and out to Postgres via Debezium.

Arroyo

Arroyo is a powerful yet simple stream processing engine that lets you execute complex SQL queries against streams of data in real-time. Here, Arroyo reads in the raw analytics events from Kafka, performs various windowed aggregations, and writes the results back to Kafka.
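Conceptually, the 15-minute pageview counts look something like this sketch — table and column names are illustrative, and the tutorial's actual SQL (linked below) may differ:

```sql
-- Illustrative sketch of a windowed aggregation in Arroyo SQL;
-- the tutorial's real query and schema live in the repo.
SELECT
    tumble(interval '15 minutes') AS window,
    path,
    count(*) AS views
FROM pageviews
GROUP BY 1, 2;
```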

The final query we use can be found in the ArroyoSystems/analytics-tutorial repository.

Debezium

Debezium supports connecting relational databases like Postgres and MySQL to Kafka, providing both sources to read from DBs and sinks to write to them. Arroyo integrates with Debezium, and here we use it to sink our query results to Postgres.

PostgreSQL

PostgreSQL is a powerful RDBMS that we use to store our results for querying.
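Judging by the Grafana queries later in the walkthrough, the metrics table has roughly this shape — a sketch inferred from those queries, not the tutorial's actual DDL:

```sql
-- Inferred sketch of the results table; column types are guesses.
CREATE TABLE metrics (
    time   timestamptz NOT NULL,
    metric text        NOT NULL,  -- e.g. 'views_15_minute'
    tag    text,                  -- grouping dimension (e.g. the page)
    value  bigint
);
```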

Grafana

Grafana makes it easy to build dashboards, and includes a Postgres plugin that allows us to query results directly from the database. We use this to visualize the results.

Running the tutorial

The tutorial components are packaged up via Docker Compose. With Docker installed, you should just need to run:

$ git clone https://github.com/ArroyoSystems/analytics-tutorial.git
$ cd analytics-tutorial
$ docker compose up

Once everything has finished starting, open http://localhost:8000/pipelines/new to create the pipeline.

Paste in the query from the tutorial repo and click "Start Pipeline."

Graphing the results

Open up Grafana at http://localhost:3000. Then create a Postgres data source with the options:

  • Host: postgres
  • Database: postgres
  • User: postgres
  • TLS/SSL Mode: disable

Then we can graph the metrics using this query:

SELECT sum(value), time, tag
FROM metrics
WHERE metric = 'views_15_minute' AND $__timeFilter(time)
GROUP BY time, tag
ORDER BY time;

(Make sure to change the format to Time series.)

We can show the top pages in a table with the query:

SELECT page, count, rank
FROM top_pages
WHERE time = (
  SELECT max(time) from top_pages
)
ORDER BY rank;

Get in touch

If you need help or have any questions, comments, or suggestions, you can find us on Discord.