
Why Not Flink?

Flink is a mature and powerful streaming engine. So why didn't we build Arroyo on top of it?

Micah Wylde

CEO of Arroyo

We started Arroyo with the mission of making stream processing so easy that every company is able to adopt it—to make real-time the default for data.

As a tech lead for streaming at Lyft and Splunk, I'd seen the power of stream processing, but also how challenging it could be for users of these systems to build pipelines that are correct, reliable, and efficient. Without solving those usability challenges, streaming would never become mainstream.

This led to the first big decision for the company: should we build our platform around an existing engine, or build a new one from scratch?

If you're reading this, it's probably not a spoiler to say that we took the more ambitious approach: building an entirely new engine to power the Arroyo Cloud, which is now a fast-growing open-source project.

We've gotten a number of questions about why we went down that path, when there are many existing streaming engines, including Apache Flink, Spark Streaming, and ksqlDB.

And indeed, most of the new streaming platform companies have opted to build UIs and hosting around Flink, the most sophisticated and mature of the existing open-source engines.


What's a stream processing engine?

Before diving into Flink, let's start with some background on what this is all about. Traditionally, data processing operated in batch. Events come in from your data sources[1], pile up in some storage system[2], and periodically (every hour, day, etc.) you run a big query over all of that accumulated data.

Stream processing, by contrast, operates on data as it arrives, instead of waiting for it to accumulate and processing it all at once. Stream processing engines come in various flavors, but the most powerful ones allow writing complex queries (often in SQL) over these data streams, giving answers in milliseconds instead of hours or days.

Apache Flink is the most advanced open-source stream processing engine, and by far the most widely used. It started as a research project at TU Berlin and has grown into a large and successful project with thousands of users around the world. It's also the base for many of the next generation of streaming-as-a-service companies, including Decodable, Immerok (acquired by Confluent), and Deltastream.

[Image: An example Flink job pipeline, courtesy of the Flink docs]

Flink models streaming jobs as a distributed dataflow: a directed acyclic graph (DAG) of computation, with nodes performing operations like “map” and “window” and edges over which data flows. In contrast to simpler streaming systems, nodes are stateful: they store data internally that can be updated as events come in. This statefulness is needed to implement features like windowing, aggregations, and joins.

Flink is written in Java, and runs on the JVM. Flink pipelines can be written in various languages including Java, Scala, Python, and SQL.
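To make the dataflow model concrete, here's a minimal sketch of a DataStream pipeline in Java. The socket source, one-minute window, and job name are illustrative choices of mine, not anything prescribed by Flink, but the shape is exactly the DAG described above: a source, a stateless map, a keyed stateful window aggregation, and a sink.

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WordCounts {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.socketTextStream("localhost", 9999)            // source node of the DAG
            .map(line -> Tuple2.of(line, 1))               // stateless "map" node
            .returns(Types.TUPLE(Types.STRING, Types.INT)) // required: lambdas lose generic types to erasure
            .keyBy(t -> t.f0)                              // edge: hash-partition the stream by key
            .window(TumblingProcessingTimeWindows.of(Time.minutes(1))) // stateful "window" node
            .sum(1)                                        // aggregate counts within each window
            .print();                                      // sink node

        env.execute("word-counts");                        // submit the DAG for execution
    }
}

Even in this toy example the seams show: the .returns() call is only there because Java's type erasure defeats Flink's type extraction for lambdas.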

On a personal level, Flink has been great for my career. I spent four years building Flink-based streaming infrastructure at Lyft and Splunk, including leading development of Lyft's flinkk8soperator, which was the best way to run Flink on Kubernetes.

At Lyft we were trying to build a self-service platform that was reliable and operationally simple enough for any engineer at the company. Our Kubernetes operator (and associated tooling) made substantial progress towards that goal by automating many of the difficult and tedious operational tasks involved in running Flink pipelines. By the end of my tenure we had nearly a hundred pipelines running, and critical parts of the product, including dynamic pricing and our near-real-time data warehouse, depended on Flink.

But even with a strong team building this infrastructure (including several Flink committers and project leaders) we never achieved our goal of providing a truly self-serve platform. The most successful use cases still relied on expert involvement at every stage, from conception to implementation and ongoing operations and debugging.

Why is it so hard to build a good streaming service[3] around Flink? There are a variety of reasons, but here I'm going to focus on two core challenges: authoring pipelines and operating them reliably.

Authoring pipelines

Pipelines express the particular logic of a streaming application. Writing Flink pipelines is hard. Flink's core DataStream API is a Java interface that allows users to directly describe the pipeline topology and implement their own stateful operators. In fact, this is the same API in which Flink's higher-level constructs, like windows and joins, are implemented. This is very powerful, and enables some truly impressive and efficient applications.

But for users, it's like being thrown into the deep end of the pool. You have to immediately understand a host of complex stateful stream processing concepts like event time, checkpointing, partitioning, statefulness, and backpressure.
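To give a feel for that deep end, here's a sketch of roughly the simplest custom stateful operator you can write with the DataStream API, a per-key running count (the class and state names here are my own). Even this forces you to deal with keyed state registration, type information, and uninitialized-state handling before you get to any business logic.

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class RunningCount extends KeyedProcessFunction<String, String, Long> {
    // Keyed state: partitioned by key, checkpointed, and restored on recovery
    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        // State must be registered up front via a descriptor with type information
        count = getRuntimeContext().getState(
            new ValueStateDescriptor<>("count", Types.LONG));
    }

    @Override
    public void processElement(String event, Context ctx, Collector<Long> out)
            throws Exception {
        Long current = count.value(); // null until the first event for this key
        long next = (current == null ? 0 : current) + 1;
        count.update(next);
        out.collect(next);
    }
}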

[Image: Debugging why your operator state size is growing exponentially]

Beyond the inherent complexity, there are plenty of foot-guns that can cause huge performance and operational burdens[4]. Flink was so early in providing a stream processing API that it had to discover how to do it without any precedent. But because this is both the user-facing API and the internal implementation, it's nearly impossible to change or fix at this point without breaking everyone's applications.

Of the dozens of engineers I've helped build Flink applications with the Java API, only a few ever became truly self-sufficient.

The Flink project now provides SQL and Python APIs, but they're implemented as thin layers on top of that Java API. Understanding why queries are behaving the way they are still requires diving into those lower layers, as does extending SQL with user-defined functions.
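For example, even a trivial user-defined function for Flink SQL has to be written and registered at the Java layer. Here's a sketch (the function name and query are illustrative):

import org.apache.flink.table.functions.ScalarFunction;

// A one-line SQL UDF still requires dropping into Java.
public class ToUpper extends ScalarFunction {
    public String eval(String s) {
        return s == null ? null : s.toUpperCase();
    }
}

// Registered and used through the Table API:
//   tableEnv.createTemporarySystemFunction("TO_UPPER", ToUpper.class);
//   tableEnv.executeSql("SELECT TO_UPPER(name) FROM users");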

And while the quality of the SQL implementation has come a long way in the past few years, further improvements are running up against the limits of what is possible given Flink's APIs and architecture[5].

The end result is a SQL engine that's riddled with limitations and caveats.

Operating pipelines

Back when Flink was first developed, in the early 2010s, AWS had just launched. Big data meant running Hadoop MapReduce. Hadoop clusters were statically provisioned, typically ran on-prem, and were managed by an in-house ops team. Flink's distributed runtime was developed with that model in mind.

Since then how we run software has changed dramatically. AWS and other cloud providers now offer highly elastic compute resources, and the operating model for data processing has shifted towards self-service and automation. Flink was not designed with this model in mind; its job topologies are completely static and can't be changed without a restart.

[Image: Doing Big Data circa 2012]

This combines in challenging ways with how Flink manages state. When a pipeline needs to remember something, that's stored in an operator's state. For example, an operator that computes a count-distinct over a 30 day window needs to store (some representation of) the input data for those 30 days.

At scale, this can add up to terabytes of data. In Flink, all of that state is stored locally on the worker nodes (either in RocksDB or in memory) and is periodically snapshotted and backed up externally (typically to S3) so that it can be recovered if a job fails.
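Here's a sketch of how that setup is typically wired into a job; the checkpoint interval and bucket path are placeholders of mine:

import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Keep operator state in a local RocksDB instance on each worker;
        // `true` enables incremental checkpoints (only changed files are uploaded)
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true));

        // Periodically snapshot that state and back it up externally (here, S3)
        // so the job can be recovered after a failure
        env.enableCheckpointing(60_000); // every 60 seconds
        env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/checkpoints");

        // ... build the pipeline and call env.execute() as usual ...
    }
}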

In other words, every Flink worker is like a small, weird database that runs arbitrary user code. And nothing is harder to operate than a database!

Taking these two properties together (static topologies and statefulness), you get the following standard procedure for making changes to a pipeline:

  1. Spin up a new set of Flink nodes to run the new version of the job, and wait for them to be ready
  2. Trigger a final checkpoint (in Flink terms, a savepoint) on the existing job, and wait for it to finish
  3. Start the new version of the job on the new cluster, and wait for it to download its state from S3
  4. Finally, tear down the old cluster

When jobs have a lot of state, this whole process can take anywhere from minutes to hours. For much of that time the job is unavailable[6]! This makes every operation fraught and challenging to automate. For efficiency you'd like to autoscale pipelines according to data volume, but with Flink that's mostly too risky and slow. Similarly, you can't let Kubernetes move worker pods to improve cluster utilization without incurring significant downtime. And code updates are so hard to perform that it's common to go months without redeploying an existing pipeline.

While Flink wasn't the right choice for the product we're building, there are many cases where it works very well. Flink shines for building streaming applications that require specialized stateful logic, built and operated by an expert team. The low-level control offered by the DataStream API is a huge advantage in these cases, and is not easily replicated in a product targeted at a broader audience like ours.

The Arroyo Vision

Hundreds of very smart engineers (including myself, many personal friends, and former colleagues) have contributed years of effort to build Flink and its ecosystem. With a small team, we're not going to be able to cover the vast surface area that Flink offers. We also don't aim for the level of flexibility that Flink provides.

But we think that there is a huge opportunity for a smaller, more focused product that solves some problems exceptionally well.

[Image: Arroyos are seasonal desert streams that can grow from a trickle to a torrent. There's a metaphor there somewhere...]

This starts with our guiding principle: Arroyo must be usable by folks with no particular knowledge or expertise in streaming. SQL support should be comprehensive and unsurprising, with performance characteristics that are intuitive to SQL users. It should have operational characteristics that allow it to be run as a service—fully automated, without requiring operators to understand or tune individual pipelines.

We're accomplishing this by:

  1. Implementing optimizations through our entire stack, from the SQL planner to the storage layer, to provide excellent performance for any reasonable SQL query
  2. Separating storage and compute to allow fast and automated pipeline operations
  3. Supporting dynamic reconfiguration of pipelines
  4. Running UDFs in WASM containers to provide predictable and controlled performance across a variety of languages

Part two of this blog post will go into more detail about how we're addressing these issues in Arroyo.

In the meantime, Arroyo is easy to get started with. See the guide or run the Docker container with:

docker run -p 8000:8000 -p 8001:8001 \
    ghcr.io/arroyosystems/arroyo-single:latest

Footnotes

  1. For example from logs, web or mobile apps, network devices…

  2. Today, typically S3 or a data warehouse like Snowflake

  3. Here I want to emphasize that we're trying to build a self-service platform. A good platform can be used by users who are not experts in the underlying infrastructure without the platform operator (us) having to be experts in their particular applications. Take for example AWS Lambda, which can run extremely diverse workloads without involving an AWS engineer.

  4. One of these foot-guns, broadcast state, caused a three day outage at a previous company. The process of debugging this is a great story that I will happily tell after a couple of beers.

  5. See this post for some examples of that.

  6. In real-world Flink applications there are a couple of approaches to dealing with this downtime. One is blue-green deployments, where the new job is brought up before the old one is torn down. This eliminates downtime, but requires extra cost and resources while both pipelines run, and the sink system needs to be able to handle writes from both. The other approach—which Lyft uses for its most critical pipelines, like pricing—is to simply run two copies of the pipeline continuously, so that if one is being modified or having issues, the other is still running.