
Running Arroyo on EKS

The easiest way to run a highly-scaled production Arroyo cluster is on Kubernetes. Setting up a Kubernetes cluster used to be a daunting task, but services like Amazon EKS have made it much easier. This post will walk through how to set up an EKS cluster and deploy Arroyo to it.

Micah Wylde

CEO of Arroyo

Prometheus holding a torch on a mountain

Arroyo was designed for a wide range of scales and use cases, from small single-node deployments to fully scaled-out clusters with hundreds of nodes.

For the latter case, the typical way to run it is within Kubernetes. Accordingly, Arroyo ships with support for both running the control plane on Kubernetes (via a Helm chart) and running the data plane on Kubernetes (via the Kubernetes scheduler).¹

Kubernetes has a reputation for being complicated (and it certainly can be!) but in recent years the major clouds have made it dramatically easier to create and operate Kubernetes clusters, with services like Google Kubernetes Engine (GKE) and Amazon's Elastic Kubernetes Service (EKS).

This post is going to walk through the process of getting a production-level Arroyo cluster up and running on EKS.


Creating the EKS Cluster

The first step is to create an EKS cluster. If you already have a cluster, you can skip this step.

  1. Start by navigating to https://console.aws.amazon.com/eks/home?#/cluster-create, logging into your AWS account if necessary. On the first screen we configure the cluster name and version. Choose whatever you'd like for the name, and leave the rest of the configs as default.
  2. If this is the first time you've created an EKS cluster, you will need to create a Cluster Service Role, which allows EKS to create and manage resources on your behalf. Follow the guide here to do so.

Create an EKS cluster

  3. Next we configure networking. For the simplest setup, this can also be left as the defaults. A more secure option is to use a private cluster endpoint and access the cluster over a VPN, but setting that up is out of scope for this guide.
  4. The rest of the settings (observability and add-ons) can all be left with the default configurations.
  5. Continue through the wizard until you get to “Create.” Now our cluster is creating.
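If you'd rather script this step than click through the console, the eksctl CLI can create a comparable cluster. This is a minimal sketch, assuming the cluster name and region used throughout this guide; we skip the node group here since we create one in the next section:

$ eksctl create cluster --name arroyo --region us-east-2 --without-nodegroup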

Creating the Node Group

Next, we will need to create a node group, which controls which EC2 instances will be created to provide compute resources for our cluster.

  1. Before we can create the node group, we need to create an IAM role for it. Navigate to https://console.aws.amazon.com/iam/home#/roles/create, select AWS service as the type of trusted entity, and select EC2 as the Service in the drop-down. On the next screen, add the following policies (a command-line alternative is sketched after the screenshot below):
    • AmazonEBSCSIDriverPolicy
    • AmazonEC2ContainerRegistryReadOnly
    • AmazonEKS_CNI_Policy
    • AmazonEKSWorkerNodePolicy
    • AmazonS3FullAccess²
  2. Finish creating the role, and give it a name like EksNodePolicy.
  3. Back in the EKS console, click on the cluster you created in the previous step, click into the Compute tab, and click “Add Node Group”.
  4. On the next screen, select Amazon Linux 2 (AL2_x86_64, or AL2_ARM_64 if you choose a Graviton instance type) and configure the instance type you would like. We recommend c7i, or c7g (Amazon's Graviton ARM instances, which can offer substantial cost savings over x86). You can also increase the maximum number of nodes, which will support running larger clusters.

Configure the node group
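If you prefer the command line, here's a rough equivalent of steps 1 and 2 using the AWS CLI. This sketch assumes a local ec2-trust-policy.json file containing a standard trust policy that allows ec2.amazonaws.com to assume the role; repeat the attach-role-policy command for each of the managed policies listed above:

$ aws iam create-role --role-name EksNodePolicy \
    --assume-role-policy-document file://ec2-trust-policy.json
$ aws iam attach-role-policy --role-name EksNodePolicy \
    --policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy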

  5. The next two screens can be left as-is, and on the final screen click “Create”.
  6. The last step is to add the EBS CSI driver add-on, which will allow us to create the persistent storage needed for Postgres, Arroyo's metadata store. Click the “Add-ons” tab, then “Get more add-ons,” then select the Amazon EBS CSI Driver and scroll down to click “Next”. Finish creating it with the default options. (An equivalent CLI command is shown after the screenshot.)

Select the EBS CSI addon
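The add-on can also be enabled from the command line; a sketch, assuming your cluster is named arroyo:

$ aws eks create-addon --cluster-name arroyo --addon-name aws-ebs-csi-driver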

Creating an S3 Bucket

Now we have a cluster configured. While that's spinning up, we'll create an S3 bucket to store Arroyo checkpoints and outputs.

Navigate to the S3 bucket creation page and create a new bucket with a unique name (I used arroyo-eks), then create it in your region with the default settings.

Create an S3 bucket
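Or, if you prefer the CLI, something like this should work (the bucket name and region are the ones used in this guide; note that the LocationConstraint argument must be omitted for us-east-1):

$ aws s3api create-bucket --bucket arroyo-eks --region us-east-2 \
    --create-bucket-configuration LocationConstraint=us-east-2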

Setting up Kubectl

Now that we have the EKS cluster ready to go, we can deploy Arroyo to it. For this, we'll need to set up kubectl and aws-cli to be able to talk to our new Kubernetes cluster.

First, install kubectl if you don't have it already:

$ brew install kubectl       # MacOS (Homebrew)
$ sudo apt install kubectl   # Debian/Ubuntu

Next we need to install aws-cli. On MacOS this is most easily done with Homebrew:

$ brew install awscli

For Linux and Windows, refer to Amazon's instructions.

Next you will need to configure awscli to be able to authenticate to your AWS account. How this should be done will depend on the security settings for your account, and full details can be found in the AWS docs.
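Once your credentials are configured, a quick way to verify that they work is:

$ aws sts get-caller-identity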

Now we should be able to configure kubectl to talk to and auth against our new EKS cluster:

$ aws eks update-kubeconfig --region us-east-2 --name arroyo

(replace the region and name arguments according to the region you created the cluster in, and the name you gave it.)
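If you don't remember the name you gave the cluster, you can list the clusters in a region:

$ aws eks list-clusters --region us-east-2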

To verify this is set up correctly, we can list the nodes in our cluster. It should show something like this:

$ kubectl get nodes
NAME                                        STATUS   ROLES    AGE   VERSION
ip-172-31-5-66.us-east-2.compute.internal   Ready    <none>   20h   v1.28.3-eks-4f4795d

Deploying Arroyo

To deploy Arroyo to our new EKS cluster we will use Helm, a Kubernetes package manager. This makes it easy to share configurations for Kubernetes applications like the Arroyo control plane via Helm charts.

Run these commands to install Helm and register the Arroyo repository:

$ brew install helm
$ helm repo add arroyo https://arroyosystems.github.io/helm-repo
$ helm search repo arroyo
NAME         	CHART VERSION	APP VERSION	DESCRIPTION
arroyo/arroyo	0.7.1        	0.7.1      	Helm chart for the Arroyo stream processing engine

Now that we have the Arroyo Helm repo registered, we need to create a configuration file for our Arroyo cluster. There are many properties that can be configured (see all of the options here: https://artifacthub.io/packages/helm/arroyo/arroyo).

For now we're just going to configure a couple of paths, to tell Arroyo to use our new S3 bucket for artifacts and checkpoints.

Create a file called values.yaml with content that looks like this:

artifactUrl: "https://s3.us-east-2.amazonaws.com/arroyo-eks/artifacts"
checkpointUrl: "https://s3.us-east-2.amazonaws.com/arroyo-eks/checkpoints"

Replace the region with the correct region for your bucket, and the bucket name (arroyo-eks) with the name of the S3 bucket you created.
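Before deploying, it's worth sanity-checking the bucket name and region from your shell (this exercises your local credentials rather than the node role, but it will catch typos):

$ aws s3 ls s3://arroyo-eks/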

We're finally ready to deploy Arroyo! We can do that with helm install, which will deploy our chart to Kubernetes with our configuration overrides:

$ helm install arroyo-cluster arroyo/arroyo -f values.yaml
NAME: arroyo-cluster
LAST DEPLOYED: Thu Nov  9 14:40:53 2023
NAMESPACE: default
STATUS: deployed
REVISION: 1
NOTES:
You've successfully installed arroyo!
 
This release is arroyo-cluster
 
Once the release is fully deployed, you can navigate to the web ui by running
 
  $ open "http://$(kubectl get service/arroyo-cluster-api -o jsonpath='{.spec.clusterIP}')"
 
(note this might take a few minutes if you are also deploying Postgres)
 
If that doesn't work (for example on MacOS or a remote Kubernetes cluster), you can also try
 
  $ kubectl port-forward service/arroyo-cluster-api 8000:80
 
And opening http://localhost:8000 in your browser.
 
See the documentation at https://doc.arroyo.dev, and ask questions on our discord: https://discord.gg/cjCr5rVmyR

The cluster can take a few minutes to fully roll out, primarily waiting for Postgres to be ready. You can monitor the progress by running

$ kubectl get pods -l app.kubernetes.io/instance=arroyo-cluster

While Postgres is coming up, you may see the Arroyo pods restarting; that's expected, as they won't start until the database is available. Once all pods are in the Running state, you can launch the Web UI by running

$ kubectl port-forward service/arroyo-cluster-api 8000:80

and opening http://localhost:8000 in your browser:

The Arroyo Web UI

If you see arroyo-cluster-postgresql-0 stuck in the Pending state, you may not have added the Amazon EBS CSI Driver to your cluster. Refer back to that step.
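To confirm what's blocking the pod, inspect its events:

$ kubectl describe pod arroyo-cluster-postgresql-0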

And that's it! You're ready to start running streaming applications at scale.

Footnotes

  1. Arroyo (like many distributed systems) is separated into two parts. The control plane is a persistent set of services including the API server, controller, and compiler. It's responsible for serving the web UI and API, scheduling jobs, taking checkpoints, recovering from failures, and monitoring progress. The actual work of processing happens in the data plane, which is made up of instances of arroyo-worker. Resources in the data plane will come and go as jobs are started, stopped, and scaled.

  2. The S3 policy can also be set up to only allow access to the specific bucket you created, but that's out of scope for this guide.