How CoreWeave Promotes Releases Across 400 Clusters and 4,000 GPUs
Outgrowing Kargo and consolidating deployments, firmware, and site bring-up onto a single release control plane.
CoreWeave runs one of the largest GPU cloud platforms in the world. Their fleet has to keep moving — new clusters come online every week, firmware patches land across whole regions, and inference and control-plane services ship multiple times a day. Every one of those changes has to land somewhere in that 400-cluster footprint, in the right order, without taking customers down.
This is the story of how that happens, and why CoreWeave outgrew the tooling they started with.
The starting point: Kargo
CoreWeave's platform team initially ran Kubernetes promotions through Kargo. Kargo is a good product. For teams running ArgoCD across a handful of stages, it does exactly what it says on the tin.
At 400 clusters, three problems showed up.
Manifest pre-rendering got slow. Kargo pre-renders manifests as part of its promotion model. With CoreWeave's fan-out — clusters per region, services per cluster, overlays per cluster tier — pre-render time stopped being a background detail and started being part of the deploy critical path.
Promotion granularity didn't match the topology. When CoreWeave promoted a release in Kargo, it deployed to everything in production at once. There was no native way to express "land this on tier-3 clusters first, observe for 30 minutes, then walk it up to tier-1." That kind of staged, blast-radius-aware rollout had to be built outside the tool.
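For context, the behavior the team wanted looks roughly like the sketch below. It is a minimal illustration, not Kargo or Ctrlplane code; the tier names, cluster names, window length, and health check are all assumptions.

```python
import time
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    clusters: list[str]
    observe_seconds: int  # how long to watch metrics before promoting further

# Illustrative topology: lowest blast radius first, tier-1 last.
TIERS = [
    Tier("tier-3", ["lab-01", "lab-02"], observe_seconds=30 * 60),
    Tier("tier-2", ["us-east-12", "us-west-07"], observe_seconds=30 * 60),
    Tier("tier-1", ["us-east-01", "us-west-01"], observe_seconds=0),
]

def deploy(cluster: str, release: str) -> None:
    print(f"deploying {release} to {cluster}")  # stand-in for the real executor

def healthy(cluster: str, release: str) -> bool:
    return True  # stand-in for an error-rate / latency check against the cluster

def promote(release: str) -> bool:
    """Walk a release up the tiers, halting at the first tier that fails observation."""
    for tier in TIERS:
        for cluster in tier.clusters:
            deploy(cluster, release)
        time.sleep(tier.observe_seconds)  # the "observe for 30 minutes" window
        if not all(healthy(c, release) for c in tier.clusters):
            print(f"halting {release} at {tier.name}")
            return False
    return True
```

In Kargo's model this whole loop collapses into a single "promote to production" step, so everything between the tiers had to live outside the tool.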
Kargo only knows about Kubernetes. This is the deeper one. CoreWeave's release surface is not just Kubernetes. It includes:
- Firmware updates on GPU nodes
- Site bring-up runs executed through AWX
- Drain and hydrate windows scheduled around customer load
- Dynamic cluster inventory pulled from NetBox and VictoriaMetrics
None of those live inside a Kubernetes API. So the platform team did what every platform team eventually does — they started building it themselves. Custom operators. A version store. Glue logic for "is this cluster eligible to receive this release right now?" An aggregation service to pull cluster state from NetBox and VictoriaMetrics so the rest of the system knew what was actually out there.
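To give a sense of that glue, here is a hedged sketch of the per-cluster eligibility check, assuming a NetBox inventory record and VictoriaMetrics' Prometheus-compatible query API. The field names, metric, and thresholds are invented for illustration.

```python
import requests

NETBOX_URL = "https://netbox.internal"           # hypothetical internal hostnames
VM_URL = "https://victoriametrics.internal"

def cluster_record(cluster: str) -> dict:
    """Look the cluster up in NetBox's inventory."""
    resp = requests.get(
        f"{NETBOX_URL}/api/virtualization/clusters/",
        params={"name": cluster},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["results"][0]

def node_readiness(cluster: str) -> float:
    """Ask VictoriaMetrics (Prometheus-compatible /api/v1/query) how many nodes are Ready."""
    query = f'avg(kube_node_status_condition{{cluster="{cluster}",condition="Ready",status="true"}})'
    resp = requests.get(f"{VM_URL}/api/v1/query", params={"query": query}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def eligible(cluster: str) -> bool:
    """The 'can this cluster take a release right now?' question, answered from two systems."""
    record = cluster_record(cluster)
    active = record.get("status", {}).get("value") == "active"  # status field assumed
    return active and node_readiness(cluster) >= 0.95
```

Multiply that by drain windows, firmware state, and version bookkeeping, and the "glue" becomes a product of its own.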
That's the moment the cost of building a control plane internally crosses the cost of adopting one.
What changed with Ctrlplane
The pitch we made to CoreWeave was simple: stop building the orchestration layer above Kubernetes, and let Ctrlplane be that layer. Kargo can stay where it makes sense. Everything that doesn't fit a strict Kubernetes-only mental model — firmware, AWX, scheduled drains, dynamic cluster discovery — moves into a single control plane that understands all of it.
Ctrlplane treats clusters, regions, tenants, and arbitrary external resources as first-class objects. A "release" isn't a Kubernetes manifest — it's any change you want to govern, with any underlying executor. That means the same promotion policy can drive an argocd sync, a terraform apply, an AWX job, a firmware push, or a custom script, with the same sequencing, observation windows, and audit trail.
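As a rough sketch of what "same policy, any executor" means in practice (names invented, not Ctrlplane's actual API):

```python
import subprocess
from typing import Protocol

class Executor(Protocol):
    def apply(self, target: str, version: str) -> None: ...

class ArgoCDSync:
    """Kubernetes rollout via an Argo CD application sync."""
    def apply(self, target: str, version: str) -> None:
        subprocess.run(["argocd", "app", "sync", f"{target}-app"], check=True)

class AWXJob:
    """Site bring-up or maintenance via an AWX job template (template name and flags assumed)."""
    def apply(self, target: str, version: str) -> None:
        subprocess.run(
            ["awx", "job_templates", "launch", "site-bringup",
             "--extra_vars", f'{{"cluster": "{target}"}}'],
            check=True,
        )

class FirmwarePush:
    """GPU firmware rollout via a hypothetical internal CLI."""
    def apply(self, target: str, version: str) -> None:
        subprocess.run(["firmware-push", "--cluster", target, "--bundle", version], check=True)

def promote(version: str, targets: list[str], executor: Executor) -> None:
    """One promotion loop, any executor: sequencing, observation, and audit stay identical."""
    for target in targets:
        executor.apply(target, version)
        # ...observation window and health gate, as in the earlier sketch
```

The point is not the specific commands; it's that the policy layer never needs to know which of them is underneath.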
Today at CoreWeave:
- All 80 Kubernetes services governed by Ctrlplane have staged, region-aware rollouts. Some teams continue to use Kargo where it fits; the two tools coexist.
- All 8 firmware packages are pushed through Ctrlplane. Same promotion policy primitives, totally different executor underneath.
- AWX site bring-up is integrated with Ctrlplane. When a new cluster is provisioned, Ctrlplane picks it up automatically and starts deploying the standard software stack to it. No more pinging four teams to say "hey, new cluster, please start your deploy."
- Cluster tiering by customer count drives promotion order. A release lands on a low-density tier first, observation passes, and Ctrlplane walks it up the tiers automatically. A short sketch of how that ordering falls out of customer counts follows this list.
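Where those tiers come from is deliberately boring: a cluster's customer count maps to a tier, and the tier order is the promotion order. The thresholds below are illustrative, not CoreWeave's actual cut-offs.

```python
def tier_for(customer_count: int) -> str:
    """Map customer density to a promotion tier (thresholds are assumptions)."""
    if customer_count == 0:
        return "tier-3"   # empty or internal-only clusters absorb a release first
    if customer_count <= 10:
        return "tier-2"
    return "tier-1"       # the densest clusters see a release last

def promotion_order(customer_counts: dict[str, int]) -> list[str]:
    """Order clusters so the lowest-blast-radius tiers are deployed first."""
    rank = {"tier-3": 0, "tier-2": 1, "tier-1": 2}
    return sorted(
        customer_counts,
        key=lambda c: (rank[tier_for(customer_counts[c])], customer_counts[c]),
    )

# promotion_order({"us-east-07": 0, "us-west-03": 4, "us-east-01": 42})
# -> ["us-east-07", "us-west-03", "us-east-01"]
```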
The "no more pinging" change sounds small. It's not. At 400 clusters, the cross-team coordination tax was real — every new cluster was a Slack thread and a calendar invite before it could host production workloads. Removing that coordination cost is what makes the platform actually scale with the fleet.
The five-minute rollback
The clearest moment of "this was worth it" came on a regular release day.
CoreWeave was rolling out a new version of one of their database-adjacent services. The release looked clean in CI. It passed the first observation window. It started walking up the cluster tiers like every other change.
About fifteen minutes in, weird signals started showing up in Datadog. Write latencies on the affected service spiked in a way that didn't match a healthy deploy. The on-call engineer pulled up the rollout in Ctrlplane, rejected the release, and let the policy engine do the rest.
Within five minutes, every impacted cluster was rolled back.
Not "we shipped the rollback build." Rolled back. Across every cluster the bad version had reached, in the order it had been deployed in, with the same staged guardrails on the way out. No coordination calls, no manual cluster-by-cluster intervention, no "wait, did region C get the rollback?"
That's the kind of incident that, at 400 clusters, used to be a multi-hour war room. Now it's a button.
What's next
CoreWeave is expanding Ctrlplane's footprint in two directions.
The first is more services and more lifecycle stages. The plan is to bring more of the platform's release surface — internal services, scheduled maintenance jobs, longer-running migrations — under the same governance model.
The second is cluster provisioning from the Ctrlplane console. Today, new clusters are stood up by a separate controller, and Ctrlplane picks them up downstream. The next step is closing that loop: provisioning a cluster, configuring it, and bringing it into the deployment fleet from a single workflow. That collapses the "stand up a cluster" and "make it part of production" steps into one motion.
Why this matters if you're not CoreWeave
You don't need 400 clusters or 4,000 GPUs for this story to apply. The pattern shows up everywhere infrastructure crosses a certain threshold:
- A pure-Kubernetes promotion tool stops being enough because your real release surface is bigger than Kubernetes.
- Manifest pre-rendering, all-at-once production deploys, and "the team has to build the rest themselves" become the limiters.
- The hidden cost is the in-house orchestration layer your platform team is quietly building to fill the gap.
Kargo, ArgoCD, Flux, Terraform, AWX — none of those tools are wrong. They just weren't designed to be the layer above themselves. That layer is what Ctrlplane is.
