This is the third post in a series. The first two were about why the tools you have stop working; this one is about what you have to redesign once you accept that.
At ten clusters you are configuring infrastructure. At a hundred you are managing a fleet. At a thousand you are running a product whose users happen to be your other systems. Almost every assumption your platform was built on quietly inverts somewhere between those numbers.
I am not going to claim a clean threshold. The shift is gradual and uneven. But the design choices below are the ones I have watched teams get wrong, repeatedly, by carrying small-fleet instincts into large-fleet realities.
Targets become first-class objects
At small scale, a "target" is implied. You have prod-us-east in your config and that is enough. Everyone knows what it means.
At a thousand clusters, a target needs to be a real entity in your system, with:
- a stable identity (not a string in a YAML file)
- properties (region, tier, customer, compliance class, kubelet version, etc.)
- a current state (what is actually running on it, right now)
- a desired state (what should be running on it)
- a history (what has happened to it, by whom, when)
small fleet                        large fleet

target = "us-east-prod"            target = {
  in deploy.yml                      id, region, tier, customer,
                                     compliance, version, drift,
                                     policy_set, owner, ...
                                   }
If your target is just a string, every question you ask about the fleet has to be answered by grep. That does not scale past a few dozen.
The real shift is from configuring deployments to querying a fleet. "Show me every cluster running v4.2 in EU, owned by team Atlas, that has not been touched in 30 days." That sentence has to be cheap.
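To make that concrete, here is a minimal sketch of targets as data with a toy in-memory query. The Target fields and the query helper are illustrative assumptions, not any real platform's API:

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

@dataclass
class Target:
    # A target is a record with identity and properties,
    # not a string in a YAML file.
    id: str
    region: str
    tier: str
    owner: str
    version: str
    labels: set[str] = field(default_factory=set)
    last_touched: datetime = datetime.min.replace(tzinfo=timezone.utc)

def query(fleet: list[Target], **wanted) -> list[Target]:
    # Attribute-equality query over the fleet; cheap by construction.
    return [t for t in fleet
            if all(getattr(t, k) == v for k, v in wanted.items())]

fleet = [
    Target("c-001", "eu-central", "production", "atlas", "v4.2"),
    Target("c-002", "us-east", "production", "atlas", "v4.2"),
]

# "Every cluster running v4.2 in EU, owned by team Atlas,
#  not touched in 30 days" becomes one cheap expression:
cutoff = datetime.now(timezone.utc) - timedelta(days=30)
stale = [t for t in query(fleet, region="eu-central", owner="atlas", version="v4.2")
         if t.last_touched < cutoff]
```

The point is not the data structure; it is that the sentence in the previous paragraph compiles down to a filter instead of a grep.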
Selectors replace lists
Related, but worth saying separately. At ten clusters you list them. At a thousand you select them.
# does not scale
clusters:
  - prod-us-east-1
  - prod-us-east-2
  - prod-us-west-1
  - ...
  - prod-eu-central-997
  - prod-eu-central-998

# scales
target_selector:
  region: ["us-*", "eu-*"]
  tier: "production"
  exclude_labels: ["frozen", "compliance-hold"]
Lists encode the wrong thing. They encode "what is true right now" instead of "what I mean." The day you onboard customer 1001, every list in your repo is wrong. Selectors keep working.
This is the same shift Kubernetes made between Pods and Deployments. You stop naming the units. You name the rule that picks them.
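A selector of the shape above can be evaluated with a few lines; the field names mirror the YAML example and are illustrative:

```python
from fnmatch import fnmatch

def matches(cluster: dict, selector: dict) -> bool:
    # Glob match on region, exact match on tier, label-based exclusions.
    region_ok = any(fnmatch(cluster["region"], pat) for pat in selector["region"])
    tier_ok = cluster["tier"] == selector["tier"]
    excluded = any(l in cluster["labels"] for l in selector["exclude_labels"])
    return region_ok and tier_ok and not excluded

selector = {
    "region": ["us-*", "eu-*"],
    "tier": "production",
    "exclude_labels": ["frozen", "compliance-hold"],
}

clusters = [
    {"name": "prod-us-east-1", "region": "us-east", "tier": "production", "labels": []},
    {"name": "prod-eu-central-7", "region": "eu-central", "tier": "production", "labels": ["frozen"]},
    {"name": "stage-ap-south-1", "region": "ap-south", "tier": "staging", "labels": []},
]
selected = [c["name"] for c in clusters if matches(c, selector)]
```

Onboarding cluster 1001 changes `clusters`, not the rule. The rule keeps working.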
Identity boundaries get harder, not easier
Small fleets share trust. One IAM role, one set of credentials, a couple of robots. Engineers debug by SSHing or kubectl exec-ing as needed.
A thousand clusters cannot share trust. Some of them are in customer VPCs you do not own. Some are air-gapped. Some belong to a regulated tenant where engineer-level access is a paperwork event. The blast radius of a compromised credential goes from "embarrassing" to "we have to notify regulators."
What this forces:
- per-cluster identity, not shared identity
- short-lived credentials, not long-lived
- agent-pull, not central-push, for the things you cannot reach
- a clear audit trail of which control plane action touched which cluster, with which identity
If your platform's permission model is "the deploy bot is admin everywhere," you are a phishing email away from a very bad week.
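One way to sketch the first and last of those requirements together: per-cluster, short-lived credentials whose every use lands in the audit trail. The token shape and signing scheme here are a toy stand-in (a real system would use a KMS-backed key and something like SPIFFE or OIDC identities):

```python
import hashlib
import hmac
import json
import time

SIGNING_KEY = b"control-plane-demo-key"  # stand-in for a KMS-backed key

def issue_token(cluster_id: str, actor: str, ttl_s: int = 300) -> dict:
    # Per-cluster, short-lived credential: scoped to one target,
    # expires in minutes, signed so an agent can verify it.
    claims = {"cluster": cluster_id, "actor": actor,
              "exp": int(time.time()) + ttl_s}
    body = json.dumps(claims, sort_keys=True).encode()
    sig = hmac.new(SIGNING_KEY, body, hashlib.sha256).hexdigest()
    return {"claims": claims, "sig": sig}

audit_log: list[dict] = []

def record(action: str, token: dict) -> None:
    # Every control-plane action is logged with the identity that did it.
    audit_log.append({"action": action, **token["claims"]})

tok = issue_token("prod-eu-central-42", actor="deploy-bot")
record("apply-manifest", tok)
```

The shape matters more than the crypto: one credential touches one cluster, dies quickly, and leaves a trail.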
Drift is the default, not the exception
In a small fleet you can pretend desired state equals actual state. You ran the pipeline, it succeeded, the cluster matches. Done.
At a thousand clusters this is fiction. At any moment some non-trivial fraction of the fleet has drifted. Reasons:
- a cert rotated locally during an outage
- a customer ops team patched a CVE before you got to it
- an autoscaler made a decision in the middle of the night
- somebody used kubectl edit and meant to revert it
- a regional provider quietly changed default behavior
You need a system that:
- continuously observes actual state
- compares it against desired
- classifies drift (acceptable, transient, suspicious, dangerous)
- reconciles only what should be reconciled
- never silently overwrites things that humans intentionally changed during an incident
cluster
  │
  ├── desired state ───────────┐
  │                            ▼
  ├── observed state ──> drift detector ──> classify ──> act / alert / accept
  │                            ▲
  └── change events ───────────┘
Pure reconciliation loops are too aggressive at this scale. They will undo the fix somebody made at 2am. You need policy that knows the difference between "I drifted because I am broken" and "I drifted because we patched me on purpose."
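The classification step can be sketched as a function. The field names and annotation convention are hypothetical; the point is that the classifier runs before the reconciler does anything:

```python
def classify_drift(field: str, desired, observed, annotations: set[str]) -> str:
    # Classify before reconciling: a blind loop would undo the 2am fix.
    if desired == observed:
        return "none"
    if "incident-override" in annotations:
        return "acceptable"       # humans changed it on purpose; do not revert
    if field in {"replicas"}:     # autoscaler-owned fields settle on their own
        return "transient"
    if field in {"image"}:        # an unexpected image change needs eyes
        return "suspicious"
    return "dangerous"

cases = [
    ("replicas", 3, 7, set()),                             # autoscaler at 3am
    ("image", "app:v4.2", "app:v4.3", set()),              # unexplained change
    ("tls_cert", "cert-v1", "cert-v2", {"incident-override"}),  # 2am fix
]
results = [classify_drift(*c) for c in cases]
```

Only the "dangerous" bucket should trigger automatic reconciliation; the rest get accepted, waited out, or escalated to a human.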
Multi-cloud stops being a marketing word
At a hundred clusters you might mostly be on one cloud and tolerate a second. At a thousand, you almost certainly span:
- 2-3 hyperscalers
- a sovereign cloud or two for specific regions
- on-prem clusters for regulated customers
- edge sites for latency-sensitive workloads
- airgapped or DR environments that only sometimes phone home
Your platform abstractions cannot be cloud-shaped anymore. They have to be capability-shaped:
- "I need a place to run a workload of class X with capability Y in region Z."
- The platform decides which substrate satisfies that.
If your deploy logic has if aws: ... elif gcp: ... branches, every new substrate doubles the matrix. That is a year-two pain that becomes a year-three crisis.
Compliance becomes a runtime concern
Small-fleet compliance is mostly process. Reviews, approvals, attestation documents, an annual audit. The system itself does not care.
Large-fleet compliance is part of the runtime. Some clusters cannot run certain images. Some regions cannot accept changes during defined windows. Some customer tenants require dual approval. Some workloads must be pinned to specific certifications.
This means policy needs to live next to the orchestrator, not in a Confluence page:
- which workloads can run where
- which approvals are required for which targets
- which windows are valid for changes
- which signatures and provenance are required on artifacts
- what gets blocked, automatically, at deploy time
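A deploy-time policy gate along those lines might look like the following sketch. The policy fields (dual_approval, change_window) are hypothetical names, not any real orchestrator's schema:

```python
from datetime import datetime, timezone

def check_deploy(target: dict, artifact: dict, approvals: int,
                 now: datetime) -> list[str]:
    # Runtime policy, evaluated by the orchestrator at deploy time,
    # not in a Confluence page. Returns reasons to block, if any.
    blocks = []
    if not artifact.get("signed"):
        blocks.append("artifact unsigned")
    if target.get("dual_approval") and approvals < 2:
        blocks.append("dual approval required")
    start, end = target.get("change_window", (0, 24))
    if not (start <= now.hour < end):
        blocks.append("outside change window")
    return blocks

target = {"dual_approval": True, "change_window": (8, 18)}
blocks = check_deploy(target, {"signed": True}, approvals=1,
                      now=datetime(2025, 1, 15, 3, 0, tzinfo=timezone.utc))
```

An empty list means the deploy proceeds; anything else is blocked automatically, with the reasons landing in the audit trail.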
The audit log is no longer a side effect. It is a product surface. Your customers will ask for it and your auditors will assume you have it.
The mental model shift
Pulling it together. A few summary contrasts:
small fleet                      large fleet
─────────────────────────        ─────────────────────────
configure environments           query a fleet
named targets in YAML            targets as data
deploy = run pipeline            deploy = orchestrate waves
shared trust                     per-target identity
desired == actual (assumed)      drift is normal, observed
one cloud, mostly                capability over substrate
compliance is paperwork          compliance is runtime policy
audit is a log file              audit is a product surface
Almost none of these are about adding features. They are about the same problem repriced once you cross a scale threshold. The work is in the abstractions, not the throughput.
What this means for the tooling stack
Bringing this back to the GitOps and CI/CD posts: nothing in this list is provided by Git or by your CI system. They are the wrong shape for it. Git stores intent. CI builds artifacts. Neither maintains a model of the fleet, observes drift, enforces runtime policy, or coordinates rollouts across heterogeneous substrates.
That is the layer this series has been pointing at the whole time. Call it a control plane, call it a deployment orchestrator, call it whatever your team will not roll its eyes at. The label matters less than the fact that someone, deliberately, has to own this layer.
If nobody owns it, it gets built anyway, in fragments, in CI YAML and bash scripts and a Slack channel called #release-coordination. That works for a while. It does not work at a thousand clusters.
Final take
Designing for a thousand clusters is not designing for more of the same. It is designing for a different category of system, where the things you used to leave implicit (targets, identity, drift, policy) all have to become explicit, queryable, and governed.
Get this right early and the next zero of growth costs you a quarter of work. Get it wrong and it costs you a rewrite.
Related Articles
Why CI/CD Breaks Down at 200 Customer Environments
CI/CD pipelines were designed to deploy one artifact to a few targets. At 200 customer environments, the pipeline stops being a step and starts being its own system. Here is what breaks and why orchestration has to live above it.
Why GitOps Doesn't Work at Scale (and What to Do Instead)
GitOps is excellent for small systems, but large enterprises hit failure modes around dependency coordination, rollback safety, compliance workflows, and configuration sprawl.