Blog

Updates from the Arroyo team

Open-sourcing the Arroyo Streaming Engine

After launching our state-of-the-art cloud real-time data processor, we're opening up the technology that powers it: the Arroyo streaming engine

Micah Wylde

Micah Wylde

CEO of Arroyo

The Kit Fox is one of the fastest animals in the American South West, and the Arroyo Systems mascot

After launching our state-of-the-art cloud real-time data processor, we're opening up the technology that powers it: the Arroyo streaming engine is now an Apache-licensed open source project.

Arroyo is distributed stream processing engine written in Rust, designed to efficiently perform stateful computations on streams of data. Arroyo lets you ask complex questions of high-volume real-time data with sub-second results.

To achieve our goal of making real-time data the default, we need to make this technology broadly accessible. By open-sourcing our core engine we hope to build a community around real-time data infrastructure. As a Rust data project (built on top of Rust data libraries like DataFusion) we are providing a powerful new capability to the emerging Rust data stack.

Why we built Arroyo

There are already a number of streaming engines out there, including Apache Flink and KSQL. Why create a new one?

Before co-founding Arroyo, I spent years leading streaming platform teams at Lyft and Splunk. My teams built two excellent real-time data platforms on top of Flink. At Lyft, this platform powers core features like dynamic pricing, ETAs, and promotions. But we found that while Flink is a very powerful framework, it is too complex for end-users like data scientists and product engineers. And as an infrastructure team, it's too hard to reliably run dozens or hundreds of user-written pipelines.

We created Arroyo because we believe that the default for data processing should be real-time, and that making this the reality will require a huge leap in technology.

Our initial technical design for Arroyo is centered around three goals:

  • Serverless operations: Arroyo pipelines are designed to run in modern cloud environments, supporting seamless scaling, recovery, and rescheduling
  • High performance SQL: SQL is a first-class concern, with consistently excellent performance
  • Designed for non-experts: Arroyo cleanly separates the pipeline APIs from its internal implementation: you don't need to be a streaming expert to build real-time data pipelines

Try it out

Getting started with the open source release of Arroyo is easy. See the getting started guide for all of the details, but in short:

$ docker run -p 8000:8000 -p 8001:8001 \
  ghcr.io/arroyosystems/arroyo-single:latest

Then, load the Web UI at http://localhost:8000.

Once you have Arroyo running, follow the tutorial to create your first real-time pipeline.

Get involved

We're building a new community around real-time data in Rust, and we welcome your contributions to Arroyo. Check out the repo and the dev setup to get started hacking on Arroyo. Join us on Discord to connect with the Arroyo community and stay up-to-date on the latest developments.