DevOps

Designing for 1000 Clusters: What Changes

Justin Brooks
Posted: April 28, 2026

This is the third post in a series. The first two were about why the tools you have stop working. This one is about what you have to redesign once you accept that.

At ten clusters you are configuring infrastructure. At a hundred you are managing a fleet. At a thousand you are running a product whose users happen to be your other systems. Almost every assumption your platform was built on quietly inverts somewhere between those numbers.

I am not going to claim a clean threshold. The shift is gradual and uneven. But the design choices below are the ones I have watched teams get wrong, repeatedly, by carrying small-fleet instincts into large-fleet realities.

Targets become first-class objects

At small scale, a "target" is implied. You have prod-us-east in your config and that is enough. Everyone knows what it means.

At a thousand clusters, a target needs to be a real entity in your system, with:

  • a stable identity (not a string in a YAML file)
  • properties (region, tier, customer, compliance class, kubelet version, etc.)
  • a current state (what is actually running on it, right now)
  • a desired state (what should be running on it)
  • a history (what has happened to it, by whom, when)

small fleet                    large fleet
───────────                    ───────────
target = "us-east-prod"        target = {
in deploy.yml                    id, region, tier, customer,
                                 compliance, version, drift,
                                 policy_set, owner, ...
                               }

If your target is just a string, every question you ask about the fleet has to be answered by grep. That does not scale past a few dozen.

The real shift is from configuring deployments to querying a fleet. "Show me every cluster running v4.2 in EU, owned by team Atlas, that has not been touched in 30 days." That sentence has to be cheap.
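
A minimal sketch of what "targets as data" buys you. Everything here is illustrative, not any particular product's API: the Target fields, the Filter helper, and the query are assumptions about how you might model a fleet in Go.

package fleet

import (
    "strings"
    "time"
)

// Target is the illustrative shape of a first-class fleet target:
// stable identity plus queryable properties, not a string in YAML.
type Target struct {
    ID          string
    Region      string
    Tier        string
    Customer    string
    Compliance  string
    Version     string
    Owner       string
    Labels      []string
    LastTouched time.Time
}

// Filter answers fleet questions cheaply, without grep.
func Filter(fleet []Target, keep func(Target) bool) []Target {
    var out []Target
    for _, t := range fleet {
        if keep(t) {
            out = append(out, t)
        }
    }
    return out
}

// The sentence from the text, expressed against the model: every
// cluster on v4.2 in EU, owned by Atlas, untouched for 30 days.
func StaleAtlasEU(fleet []Target) []Target {
    cutoff := time.Now().AddDate(0, 0, -30)
    return Filter(fleet, func(t Target) bool {
        return t.Version == "v4.2" &&
            strings.HasPrefix(t.Region, "eu-") &&
            t.Owner == "atlas" &&
            t.LastTouched.Before(cutoff)
    })
}

Once the question is one cheap function call, it can also be a dashboard, an alert, or the input to a rollout wave.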

Selectors replace lists

Related, but worth saying separately. At ten clusters you list them. At a thousand you select them.

# does not scale
clusters:
  - prod-us-east-1
  - prod-us-east-2
  - prod-us-west-1
  - ...
  - prod-eu-central-997
  - prod-eu-central-998

# scales
target_selector:
  region: ["us-*", "eu-*"]
  tier: "production"
  exclude_labels: ["frozen", "compliance-hold"]

Lists encode the wrong thing. They encode "what is true right now" instead of "what I mean." The day you onboard customer 1001, every list in your repo is wrong. Selectors keep working.

This is the same shift Kubernetes made between Pods and Deployments. You stop naming the units. You name the rule that picks them.
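
Selector evaluation is small. A sketch of how the YAML above could resolve against the fleet model, reusing the Target type from the earlier sketch; the glob matching uses Go's standard path.Match, and the field names are assumptions.

package fleet

import "path"

// Selector names the rule that picks targets, not the targets.
type Selector struct {
    Regions       []string // glob patterns, e.g. "us-*", "eu-*"
    Tier          string
    ExcludeLabels []string
}

// Matches reports whether a target satisfies the selector. Onboarding
// cluster 1001 requires no edit here: the rule stays true.
func (s Selector) Matches(t Target) bool {
    regionOK := false
    for _, pat := range s.Regions {
        if ok, _ := path.Match(pat, t.Region); ok {
            regionOK = true
            break
        }
    }
    if !regionOK || t.Tier != s.Tier {
        return false
    }
    for _, excl := range s.ExcludeLabels {
        for _, l := range t.Labels {
            if l == excl {
                return false
            }
        }
    }
    return true
}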

Identity boundaries get harder, not easier

Small fleets share trust. One IAM role, one set of credentials, a couple of robots. Engineers debug by SSHing or kubectl exec-ing as needed.

A thousand clusters cannot share trust. Some of them are in customer VPCs you do not own. Some are air-gapped. Some belong to a regulated tenant where engineer-level access is a paperwork event. The blast radius of a compromised credential goes from "embarrassing" to "we have to notify regulators."

What this forces:

  • per-cluster identity, not shared identity
  • short-lived credentials, not long-lived
  • agent-pull, not central-push, for the things you cannot reach
  • a clear audit trail of which control plane action touched which cluster, with which identity

If your platform's permission model is "the deploy bot is admin everywhere," you are a phishing email away from a very bad week.
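
What "per-cluster, short-lived, audited" looks like in miniature. A sketch only: the grant shape and the ten-minute TTL are assumptions, and in practice this job belongs to your IdP, SPIFFE, or a cloud STS rather than hand-rolled signing; the standard-library ed25519 calls here just make the shape concrete.

package fleet

import (
    "crypto/ed25519"
    "encoding/json"
    "time"
)

// Grant is a short-lived, per-cluster credential: it names exactly one
// target, one actor, one action, and it expires. No fleet-wide admin.
type Grant struct {
    ClusterID string    `json:"cluster_id"`
    Actor     string    `json:"actor"`
    Action    string    `json:"action"`
    ExpiresAt time.Time `json:"expires_at"`
}

// Issue signs a grant valid for minutes, not months. Every issuance is
// also an audit event: which identity asked for what, on which cluster.
func Issue(key ed25519.PrivateKey, clusterID, actor, action string) ([]byte, []byte, error) {
    g := Grant{
        ClusterID: clusterID,
        Actor:     actor,
        Action:    action,
        ExpiresAt: time.Now().Add(10 * time.Minute),
    }
    payload, err := json.Marshal(g)
    if err != nil {
        return nil, nil, err
    }
    return payload, ed25519.Sign(key, payload), nil
}

// Verify checks the signature and the expiry before honoring a grant.
func Verify(pub ed25519.PublicKey, payload, sig []byte) (Grant, bool) {
    var g Grant
    if !ed25519.Verify(pub, payload, sig) {
        return g, false
    }
    if json.Unmarshal(payload, &g) != nil || time.Now().After(g.ExpiresAt) {
        return g, false
    }
    return g, true
}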

Drift is the default, not the exception

In a small fleet you can pretend desired state equals actual state. You ran the pipeline, it succeeded, the cluster matches. Done.

At a thousand clusters this is fiction. At any moment some non-trivial fraction of the fleet has drifted. Reasons:

  • a cert rotated locally during an outage
  • a customer ops team patched a CVE before you got to it
  • an autoscaler made a decision in the middle of the night
  • somebody used kubectl edit and meant to revert it
  • a regional provider quietly changed default behavior

You need a system that:

  • continuously observes actual state
  • compares it against desired
  • classifies drift (acceptable, transient, suspicious, dangerous)
  • reconciles only what should be reconciled
  • never silently overwrites things that humans intentionally changed during an incident

cluster
   │
   ├── desired state ─────┐
   │                       ▼
   ├── observed state ──> drift detector ──> classify ──> act / alert / accept
   │                       ▲
   └── change events ──────┘

Pure reconciliation loops are too aggressive at this scale. They will undo the fix somebody made at 2am. You need policy that knows the difference between "I drifted because I am broken" and "I drifted because we patched me on purpose."
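
A sketch of the classification step in the diagram, with the 2am case handled explicitly. The drift classes, the incident-hold flag, and the policy-violation flag are assumptions about how you might model this, not a standard.

package fleet

// Drift is one observed divergence between desired and actual state.
type Drift struct {
    Desired        string // e.g. rendered manifest hash
    Observed       string
    IncidentHold   bool // a human marked this target "changed on purpose"
    ViolatesPolicy bool // e.g. unsigned image, forbidden region
}

// Action is the outcome of classification: act, alert, or accept.
type Action string

const (
    Accept    Action = "accept"
    Alert     Action = "alert"
    Reconcile Action = "reconcile"
)

// Classify reconciles only what should be reconciled, and never
// silently undoes an intentional incident-time fix.
func Classify(d Drift) Action {
    switch {
    case d.Desired == d.Observed:
        return Accept
    case d.IncidentHold:
        // The 2am case: drifted because we patched it on purpose.
        // Surface it for follow-up; do not overwrite.
        return Alert
    case d.ViolatesPolicy:
        // Drifted because it is broken or unsafe: fix it now.
        return Reconcile
    default:
        // Unexplained divergence gets a human before a robot does.
        return Alert
    }
}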

Multi-cloud stops being a marketing word

At a hundred clusters you might mostly be on one cloud and tolerate a second. At a thousand, you almost certainly span:

  • 2-3 hyperscalers
  • a sovereign cloud or two for specific regions
  • on-prem clusters for regulated customers
  • edge sites for latency-sensitive workloads
  • air-gapped or DR environments that only sometimes phone home

Your platform abstractions cannot be cloud-shaped anymore. They have to be capability-shaped:

  • "I need a place to run a workload of class X with capability Y in region Z."
  • The platform decides which substrate satisfies that.

If your deploy logic has if aws: ... elif gcp: ... branches, every new substrate doubles the matrix. That is a year-two pain that becomes a year-three crisis.
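
One way to keep that matrix linear instead of multiplicative. A sketch of capability-shaped placement: substrates advertise what they can do, and nothing in the deploy path branches on provider names. The capability strings are made up for illustration.

package fleet

// Substrate is anything that can run workloads: a hyperscaler region,
// a sovereign cloud, an on-prem cluster, an edge site.
type Substrate struct {
    Name         string
    Region       string
    Capabilities map[string]bool // e.g. "gpu", "air-gapped", "fedramp"
}

// Request is capability-shaped: class X, capability Y, region Z.
type Request struct {
    Region       string
    Capabilities []string
}

// Place picks the first substrate that satisfies the request. Adding a
// new provider means registering a Substrate, not writing a new branch.
func Place(substrates []Substrate, r Request) (Substrate, bool) {
    for _, s := range substrates {
        if s.Region != r.Region {
            continue
        }
        ok := true
        for _, c := range r.Capabilities {
            if !s.Capabilities[c] {
                ok = false
                break
            }
        }
        if ok {
            return s, true
        }
    }
    return Substrate{}, false
}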

Compliance becomes a runtime concern

Small-fleet compliance is mostly process. Reviews, approvals, attestation documents, an annual audit. The system itself does not care.

Large-fleet compliance is part of the runtime. Some clusters cannot run certain images. Some regions cannot accept changes during defined windows. Some customer tenants require dual approval. Some workloads must be pinned to specific certifications.

This means policy needs to live next to the orchestrator, not in a Confluence page:

  • which workloads can run where
  • which approvals are required for which targets
  • which windows are valid for changes
  • which signatures and provenance are required on artifacts
  • what gets blocked, automatically, at deploy time

The audit log is no longer a side effect. It is a product surface. Your customers will ask for it and your auditors will assume you have it.
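
The shape of policy living next to the orchestrator rather than in a wiki. A sketch of a deploy-time gate, again reusing the Target type from earlier; the rule fields are assumptions about what such a policy might carry.

package fleet

import "time"

// Policy is evaluated at deploy time, per target, before anything ships.
type Policy struct {
    AllowedRegions    map[string]bool
    RequireDualOK     bool
    RequireSignature  bool
    ChangeWindowStart int // hour of day, UTC
    ChangeWindowEnd   int
}

// Decision doubles as an audit record: since the log is a product
// surface, every denial carries its reason.
type Decision struct {
    Allowed bool
    Reason  string
}

// Gate blocks automatically at deploy time instead of relying on process.
func Gate(p Policy, t Target, approvals int, signed bool, now time.Time) Decision {
    switch {
    case !p.AllowedRegions[t.Region]:
        return Decision{false, "workload not permitted in " + t.Region}
    case p.RequireSignature && !signed:
        return Decision{false, "artifact missing required signature"}
    case p.RequireDualOK && approvals < 2:
        return Decision{false, "dual approval required for this tenant"}
    case now.UTC().Hour() < p.ChangeWindowStart || now.UTC().Hour() >= p.ChangeWindowEnd:
        return Decision{false, "outside change window"}
    }
    return Decision{true, "all checks passed"}
}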

The mental model shift

Pulling it together. A few summary contrasts:

small fleet                       large fleet
─────────────                     ──────────────
configure environments            query a fleet
named targets in YAML             targets as data
deploy = run pipeline             deploy = orchestrate waves
shared trust                      per-target identity
desired == actual (assumed)       drift is normal, observed
one cloud, mostly                 capability over substrate
compliance is paperwork           compliance is runtime policy
audit is a log file               audit is a product surface

Almost none of these are about adding features. They are about the same problem repriced once you cross a scale threshold. The work is in the abstractions, not the throughput.

What this means for the tooling stack

Bringing this back to the GitOps and CI/CD posts: nothing in this list is provided by Git or by your CI system. They are the wrong shape for it. Git stores intent. CI builds artifacts. Neither maintains a model of the fleet, observes drift, enforces runtime policy, or coordinates rollouts across heterogeneous substrates.

That is the layer this series has been pointing at the whole time. Call it a control plane, call it a deployment orchestrator, call it whatever your team will not roll its eyes at. The label matters less than the fact that someone, deliberately, has to own this layer.

If nobody owns it, it gets built anyway, in fragments, in CI YAML and bash scripts and a Slack channel called #release-coordination. That works for a while. It does not work at a thousand clusters.

Final take

Designing for a thousand clusters is not designing for more of the same. It is designing for a different category of system, where the things you used to leave implicit (targets, identity, drift, policy) all have to become explicit, queryable, and governed.

Get this right early and the next zero of growth costs you a quarter of work. Get it wrong and it costs you a rewrite.
