This is a follow-up to Why GitOps Doesn't Work at Scale. Same root issue, different surface: the tool you trusted for one environment is the same tool you are now shoving 200 environments through.
Most CI/CD pipelines were designed for one mental model: build once, deploy somewhere. That somewhere was usually staging, then production. Maybe two production regions if you were ambitious.
Then enterprise happened. Now you have:
- 12 SaaS regions
- 40 single-tenant customer instances
- 6 air-gapped on-prem deployments
- 30 ephemeral preview environments
- A handful of compliance-locked tenants nobody likes touching
Same artifact. 200 targets. The pipeline that took you from zero to one is the same pipeline you are asking to take you from one to many. It will not.
The math hasn't changed
Same formula from the GitOps post: P(failure) = 1 - p^n.
CI/CD did not move the levers. It just gave you a faster way to feed n.
p = 0.99 per target
n = 200 targets
1 - 0.99^200 ≈ 0.866
So a "stable" pipeline rolling to 200 environments will see at least one failure on roughly 87% of releases. That is not a flaky pipeline. That is the floor.
P(failure) by fleet size at p=0.99
n=10 ▇▇ 0.10
n=50 ▇▇▇▇▇▇▇▇▇ 0.39
n=100 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.63
n=200 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.87
n=500 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.99
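
The chart is one line of arithmetic, if you want to check it yourself (a minimal sketch, using the same p = 0.99 assumption as above):

```python
# P(at least one failed target) = 1 - p^n,
# where p is the per-target success probability and n is the fleet size.
def fleet_failure_probability(p: float, n: int) -> float:
    return 1 - p ** n

for n in (10, 50, 100, 200, 500):
    print(f"n={n:>3}  P(failure) ≈ {fleet_failure_probability(0.99, n):.2f}")
```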
Every "release night" you stay up for is n quietly compounding against you.
What CI/CD was designed to do
CI/CD is excellent at:
- compiling, testing, and packaging an artifact
- promoting that artifact through a few well-known stages
- keeping a linear, auditable history of what shipped where
- integrating with the repo and the PR
That model assumes the deployment target is mostly fungible. One env is a copy of the last one. Promote, repeat.
What it was not designed to do
Once your fleet has identity (per-customer config, per-region constraints, per-tier release cadence), the assumption breaks. You start asking the pipeline to do things it was never meant to do:
- coordinate rollouts across hundreds of independent targets
- enforce dependency ordering between services that live in different repos
- pause mid-rollout when a canary in region B starts misbehaving
- skip the three customers in a regulatory freeze window
- resume from the exact target where the last run died
- track version skew across the fleet without a spreadsheet
CI/CD pipelines respond by accumulating YAML. Matrix builds. Reusable workflows. Conditional steps. Twelve-deep if ladders. Eventually somebody writes a Python script that calls the CI API to launch other CI jobs. That script becomes load-bearing.
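
That script is rarely exotic. A sketch of the usual shape, where the CI endpoint, token variable, and customer file are all hypothetical:

```python
# fan_out.py: "temporary", year two of production use.
# Triggers one CI deploy job per customer, serially; stops on first failure.
import os
import sys

import requests

CI_API = "https://ci.internal.example.com/api/v4/jobs"  # hypothetical endpoint

def trigger_deploy(customer: str, version: str) -> bool:
    resp = requests.post(
        CI_API,
        headers={"Authorization": f"Bearer {os.environ['CI_TOKEN']}"},
        json={"job": "deploy", "customer": customer, "version": version},
        timeout=30,
    )
    return resp.ok

version = sys.argv[1]
for customer in open("customers.txt").read().split():
    if not trigger_deploy(customer, version):
        # No per-target state, no resume: the next run starts from the top.
        sys.exit(f"deploy failed at {customer}; fix it and re-run everything")
```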
You did not build a pipeline. You built a fleet orchestrator out of CI primitives, and you built it badly because CI was never the right substrate.
The patterns that get hacked together
Here is what every team I have seen ends up with by year two:
the pipeline iceberg
repo CI (visible, governed)
│
▼
manual "deploy to fleet" job (still tracked)
│
▼
internal script that fans out (one engineer maintains it)
│
▼
per-customer override files (live in 4 repos)
│
▼
slack thread with the order (this is how rollback works)
Symptoms you have already crossed the line:
- A senior engineer is the only person who can run the rollout safely.
- "Did we ship to customer X yet?" is answered by reading job logs.
- Rollbacks involve re-running the previous build with a flag.
- The rollout doc is a Confluence page with "(do not edit, ask Maya)" in the title.
- You have written, at least once, the phrase "we will fix this in the orchestration layer."
There is no orchestration layer. That is the problem.
What CI/CD cannot give you, no matter how much YAML you write
These are not things you patch onto a pipeline. They are properties of a system that sits above the pipeline.
Fleet-aware rollout strategy. The unit of deployment is no longer "this artifact." It is "this artifact, in waves, with these gates, paused on this signal." A pipeline thinks in steps. A fleet thinks in policy.
Per-target state. Customer A is on v4.2 and frozen until quarter end. Customer B is on v4.3 and gets canaries first. CI does not have a model of the target. The pipeline has no opinion about where things currently are.
Dependency coordination. Service A must be at v2 before Service B moves to v3. Across two repos, two pipelines, two teams. CI/CD has no shared brain to enforce that.
Drift detection and reconciliation. Out-of-band changes happen. They will happen. A pipeline only knows what it last ran. It does not observe the world.
Approval at the right granularity. Not "approve this PR." Approve "this rollout, to this wave, in this window, with this rollback plan."
Recovery. A failed rollout to 200 targets needs surgical resume, not a full rerun. CI's idea of recovery is "click retry."
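
None of this is abstract. Here is a minimal sketch of the per-target state an orchestrator has to hold, state a pipeline has no slot for (every field and name here is hypothetical):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Target:
    # The state CI never models: where this target is,
    # what it may receive, and when.
    name: str
    current_version: str           # observed state, not "last deployed by us"
    wave: int                      # rollout order: canaries first
    frozen_until: datetime | None  # regulatory / customer freeze window
    requires: dict[str, str]       # e.g. service-a must be at v2 first

def eligible(t: Target, now: datetime, fleet: dict[str, str]) -> bool:
    if t.frozen_until and now < t.frozen_until:
        return False  # skip the customers in a freeze window
    # Dependency gate: every prerequisite service at its required version.
    return all(fleet.get(svc) == ver for svc, ver in t.requires.items())

locked = Target("acme", "v4.2", wave=3,
                frozen_until=datetime(2026, 1, 1), requires={})
print(eligible(locked, datetime(2025, 6, 1), {}))  # False: still frozen
```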
What actually scales
Same shape as the GitOps answer. Keep the pipeline for what it is good at. Put a real orchestrator on top of it.
      artifact path                     fleet path

CI ──> build, test, sign ──> registry
                                │
                                ▼
                          orchestrator
                            │   │   │
                            │   │   └─> wave 1: canaries
                            │   └─────> wave 2: SaaS regions
                            └─────────> wave N: locked tenants

orchestrator owns: target inventory, policy, dependency ordering,
                   live state, drift, approvals, resume, audit
The pipeline produces an artifact and a claim about it. The orchestrator decides where, when, in what order, under what conditions, and what to do when something goes wrong.
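
In code terms, "where, when, in what order, under what conditions" is roughly the loop below. A sketch only: deploy_to and healthy are hypothetical stand-ins for real delivery and signal mechanics, and the JSON file stands in for a real state store.

```python
import json

STATE_FILE = "rollout-state.json"

def deploy_to(target: str, artifact: str) -> None:
    """Hypothetical: hand the artifact to this target's delivery mechanism."""
    print(f"deploying {artifact} to {target}")

def healthy(target: str) -> bool:
    """Hypothetical: read this target's post-deploy signal."""
    return True  # placeholder

def rollout(artifact: str, waves: list[list[str]]) -> None:
    # Persisted per-target state is what makes resume surgical:
    # a re-run skips everything that already shipped.
    try:
        with open(STATE_FILE) as f:
            done = set(json.load(f))
    except FileNotFoundError:
        done = set()

    for wave in waves:
        for target in wave:
            if target in done:
                continue  # resume from the exact target where the last run died
            deploy_to(target, artifact)
            done.add(target)
            with open(STATE_FILE, "w") as f:
                json.dump(sorted(done), f)
        if not all(healthy(t) for t in wave):
            # Gate between waves: pause the fleet, keep the state.
            raise RuntimeError(f"unhealthy after wave {wave}; rollout paused")

rollout("app:v4.3", [["canary-1", "canary-2"],
                     ["eu-1", "us-1", "ap-1"],
                     ["tenant-locked-1"]])
```

CI can still trigger this. It just cannot be this.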
Stop asking your CI to be a release manager.
The rule of thumb
If your CI YAML has a for_each_customer loop in it, you have already outgrown CI as a deployment tool. The loop is a tell. It means the pipeline is being asked to model a fleet, and it has no language for fleets.
You do not need a bigger pipeline. You need a different layer.
Final take
CI/CD is necessary. It is not sufficient.
At 200 environments, the deploy step stops being a step in a pipeline and starts being its own system. Treating it as anything less is how you end up with a release process that only works on Tuesdays, and only if Maya is online.
Related Articles
Designing for 1000 Clusters: What Changes
At 1000 clusters, almost every assumption your platform was built on inverts. Targets become first-class objects, identity boundaries harden, drift is the default, and compliance moves into the runtime.
Why GitOps Doesn't Work at Scale (and What to Do Instead)
GitOps is excellent for small systems, but large enterprises hit failure modes around dependency coordination, rollback safety, compliance workflows, and configuration sprawl.