Arroyo is joining Cloudflare

Arroyo has been acquired by Cloudflare to bring serverless SQL stream processing to the Cloudflare Developer Platform, integrated with Queues, Workers, and R2. The Arroyo Engine will remain open-source and self-hostable.

Micah Wylde, CEO of Arroyo

I’m incredibly excited to announce that Arroyo has been acquired by Cloudflare, where we will be continuing our mission to bring stream processing to everyone who works with data.

Here’s the short version: Arroyo is coming to Cloudflare’s Developer Platform. You’ll get the same stateful aggregations, joins, and transformations on a fully managed platform, seamlessly integrated with Cloudflare Queues, R2 object storage, and Workers-powered UDFs. Arroyo will remain fully open-source and self-hostable.

To those who know Cloudflare primarily as the backbone of the modern internet, that may sound like an odd combination. What do DDoS protection and CDNs have to do with data processing? I had a similar confusion. But as we started talking about working together, I learned that Cloudflare’s ambitions were much larger: to build a new type of cloud, designed around their global network of compute and storage. Through many conversations over the past year, it became clear to me that there was no better place to build a next-generation data platform.

Some backstory

But let’s back up a bit. Jackson and I started Arroyo in 2022 to democratize real-time data processing.

Modern companies rely on data pipelines to power their applications and businesses, from user customization, recommendations, and anti-fraud to the emerging world of AI agents. But today, most of these pipelines operate in batch, running once per hour, day, or even month. Having spent many years working on stream processing at companies like Lyft and Splunk, we knew why: it was just too hard for developers and data scientists to build correct, performant, and reliable pipelines. Large tech companies hire streaming experts to build and operate these systems, but everyone else is stuck waiting for batches to arrive.

When we started, the dominant solution for streaming pipelines, and the one we ran at Lyft and Splunk, was Apache Flink. Flink was the first system that successfully combined a fault-tolerant (able to recover consistently from failures), distributed (running across multiple machines), stateful (able to remember data about past events) dataflow with a graph-construction API. This combination of features meant that we could finally build powerful real-time data applications, with capabilities like windows, aggregations, and joins. But while Flink had the necessary power, in practice the API proved too hard and low-level for non-expert users, and the stateful nature of the resulting services required endless operations.
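To make "stateful" and "windows" concrete, here is a minimal single-process sketch in Python of a tumbling-window count. It is an illustration only, not Flink or Arroyo code: the function name, the event shape, and the sample data are all hypothetical, and it ignores distribution, fault tolerance, and late-arriving events.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical event shape for illustration: (timestamp in seconds, key).
Event = Tuple[int, str]


def tumbling_window_counts(
    events: Iterable[Event], width: int
) -> Dict[Tuple[int, str], int]:
    """Count events per (window start, key) over fixed, non-overlapping windows."""
    # This dictionary is the "state": what the pipeline remembers about
    # past events while it processes new ones.
    state: Dict[Tuple[int, str], int] = defaultdict(int)
    for timestamp, key in events:
        # Align each event to the start of the window that contains it.
        window_start = timestamp - (timestamp % width)
        state[(window_start, key)] += 1
    return dict(state)


# Made-up sample data.
events = [(1, "rider_a"), (4, "rider_b"), (7, "rider_a"), (12, "rider_a"), (15, "rider_b")]
counts = tumbling_window_counts(events, width=10)
for (window_start, key), count in sorted(counts.items()):
    print(f"window starting at {window_start}s, {key}: {count}")
```

In a real streaming engine this state has to survive machine failures and be partitioned across workers, which is where most of the operational difficulty described above comes from.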

We realized we would need to build a new streaming engine: one with the power of Flink, but designed for product engineers and data scientists, and built to run on modern cloud infrastructure. We started with SQL as our API because it’s easy to use, widely known, and declarative. We built it in Rust for speed and operational simplicity (no JVM tuning required!). We constructed an object-storage-native state backend, simplifying the challenge of running stateful pipelines, each of which is like a weird, specialized database.

And then in the summer of 2023, we open-sourced it. Today, dozens of companies are running Arroyo pipelines with use cases including data ingestion, anti-fraud, IoT observability, and financial trading.

We always knew that the engine was just one piece of the puzzle. To make streaming as easy as batch, users need to be able to develop and test query logic, backfill on historical data, and deploy serverlessly without having to worry about cluster sizing or ongoing operations. Democratizing streaming ultimately meant building a complete data platform. And Cloudflare, we realized, already had all of the other pieces: R2 for object storage of state and data at rest, Queues for data in transit, and Workers to safely and efficiently run user code.

What’s next

In the short term, the Arroyo team will be working to integrate the engine with Cloudflare’s compute infrastructure, bringing SQL processing capabilities to Cloudflare Pipelines (out in beta today). The Arroyo engine will remain fully open source (Apache-licensed) with support for self-hosting on VMs, Kubernetes, and serverless container platforms.

While much of this work will be Cloudflare-specific, we will continue contributing fixes and features to Arroyo open source. Together we will have significantly more resources to invest in stability, performance, and operability, and we hope to see the project and community continue to thrive in this new era.

We are extraordinarily thankful to everyone who helped us get to this point—our employees, investors, contributors, supporters, and friends. I want to particularly thank our early users, who took a bet on a young piece of data infrastructure and made it possible for us to build Arroyo into what it is today.

This is the end of our startup journey, but it’s still just the beginning of our mission to reinvent data processing. This is the serverless stream processing platform we started the company to build, and we couldn’t be more thrilled to do it with Cloudflare.