This is a follow-up to Why GitOps Doesn't Work at Scale. Same root issue, different surface: the tool you trusted for one environment is the same tool you are now shoving 200 environments through.
Most CI/CD pipelines were designed for one mental model: build once, deploy somewhere. That somewhere was usually staging, then production. Maybe two production regions if you were ambitious.
Then enterprise happened. Now you have:
- 12 SaaS regions
- 40 single-tenant customer instances
- 6 air-gapped on-prem deployments
- 30 ephemeral preview environments
- A handful of compliance-locked tenants nobody likes touching
Same artifact. 200 targets. The pipeline that took you from zero to one is the same pipeline you are asking to take you from one to many. It will not.
The math hasn't changed
Same formula from the GitOps post: P(failure) = 1 - p^n.
CI/CD did not move the levers. It just gave you a faster way to feed n.
p = 0.99 per target
n = 200 targets
1 - 0.99^200 ≈ 0.866
So a "stable" pipeline rolling to 200 environments will see at least one failure on roughly 87% of releases. That is not a flaky pipeline. That is the floor.
P(failure) by fleet size at p=0.99
n=10 ▇▇ 0.10
n=50 ▇▇▇▇▇▇▇▇▇ 0.39
n=100 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.63
n=200 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.87
n=500 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.99
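
The chart is one line of arithmetic, if you want to check it yourself (a minimal sketch, using the same p = 0.99 assumption as above):

```python
# P(at least one failed target) = 1 - p^n,
# where p is the per-target success probability and n is the fleet size.
def fleet_failure_probability(p: float, n: int) -> float:
    return 1 - p ** n

for n in (10, 50, 100, 200, 500):
    print(f"n={n:>3}  P(failure) ≈ {fleet_failure_probability(0.99, n):.2f}")
```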
Every "release night" you stay up for is n quietly compounding against you.
What CI/CD was designed to do
CI/CD is excellent at:
- compiling, testing, and packaging an artifact
- promoting that artifact through a few well-known stages
- keeping a linear, auditable history of what shipped where
- integrating with the repo and the PR
That model assumes the deployment target is mostly fungible. One env is a copy of the last one. Promote, repeat.
What it was not designed to do
Once your fleet has identity (per-customer config, per-region constraints, per-tier release cadence), the assumption breaks. You start asking the pipeline to do things it was never meant to do:
- coordinate rollouts across hundreds of independent targets
- enforce dependency ordering between services that live in different repos
- pause mid-rollout when a canary in region B starts misbehaving
- skip the three customers in a regulatory freeze window
- resume from the exact target where the last run died
- track version skew across the fleet without a spreadsheet
CI/CD pipelines respond by accumulating YAML. Matrix builds. Reusable workflows. Conditional steps. Twelve-deep if ladders. Eventually somebody writes a Python script that calls the CI API to launch other CI jobs. That script becomes load-bearing.
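
That script is rarely exotic. A sketch of the usual shape, where the CI endpoint, token variable, and customer file are all hypothetical:

```python
# fan_out.py: "temporary", year two of production use.
# Triggers one CI deploy job per customer, serially; stops on first failure.
import os
import sys

import requests

CI_API = "https://ci.internal.example.com/api/v4/jobs"  # hypothetical endpoint

def trigger_deploy(customer: str, version: str) -> bool:
    resp = requests.post(
        CI_API,
        headers={"Authorization": f"Bearer {os.environ['CI_TOKEN']}"},
        json={"job": "deploy", "customer": customer, "version": version},
        timeout=30,
    )
    return resp.ok

version = sys.argv[1]
for customer in open("customers.txt").read().split():
    if not trigger_deploy(customer, version):
        # No per-target state, no resume: the next run starts from the top.
        sys.exit(f"deploy failed at {customer}; fix it and re-run everything")
```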
You did not build a pipeline. You built a fleet orchestrator out of CI primitives, and you built it badly because CI was never the right substrate.
The patterns that get hacked together
Here is what every team I have seen ends up with by year two:
the pipeline iceberg
repo CI (visible, governed)
│
▼
manual "deploy to fleet" job (still tracked)
│
▼
internal script that fans out (one engineer maintains it)
│
▼
per-customer override files (live in 4 repos)
│
▼
slack thread with the order (this is how rollback works)
Symptoms you have already crossed the line:
- A senior engineer is the only person who can run the rollout safely.
- "Did we ship to customer X yet?" is answered by reading job logs.
- Rollbacks involve re-running the previous build with a flag.
- The rollout doc is a Confluence page with "(do not edit, ask Maya)" in the title.
- You have written, at least once, the phrase "we will fix this in the orchestration layer."
There is no orchestration layer. That is the problem.
What CI/CD cannot give you, no matter how much YAML you write
These are not things you patch onto a pipeline. They are properties of a system that sits above the pipeline.
Fleet-aware rollout strategy. The unit of deployment is no longer "this artifact." It is "this artifact, in waves, with these gates, paused on this signal." A pipeline thinks in steps. A fleet thinks in policy.
Per-target state. Customer A is on v4.2 and frozen until quarter end. Customer B is on v4.3 and gets canaries first. CI does not have a model of the target. The pipeline has no opinion about where things currently are.
Dependency coordination. Service A must be at v2 before Service B moves to v3. Across two repos, two pipelines, two teams. CI/CD has no shared brain to enforce that.
Drift detection and reconciliation. Out-of-band changes happen. They will happen. A pipeline only knows what it last ran. It does not observe the world.
Approval at the right granularity. Not "approve this PR." Approve "this rollout, to this wave, in this window, with this rollback plan."
Recovery. A failed rollout to 200 targets needs surgical resume, not a full rerun. CI's idea of recovery is "click retry."
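
None of this is abstract. Here is a minimal sketch of the per-target state an orchestrator has to hold, state a pipeline has no slot for (every field and name here is hypothetical):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Target:
    # The state CI never models: where this target is,
    # what it may receive, and when.
    name: str
    current_version: str           # observed state, not "last deployed by us"
    wave: int                      # rollout order: canaries first
    frozen_until: datetime | None  # regulatory / customer freeze window
    requires: dict[str, str]       # e.g. service-a must be at v2 first

def eligible(t: Target, now: datetime, fleet: dict[str, str]) -> bool:
    if t.frozen_until and now < t.frozen_until:
        return False  # skip the customers in a freeze window
    # Dependency gate: every prerequisite service at its required version.
    return all(fleet.get(svc) == ver for svc, ver in t.requires.items())

locked = Target("acme", "v4.2", wave=3,
                frozen_until=datetime(2026, 1, 1), requires={})
print(eligible(locked, datetime(2025, 6, 1), {}))  # False: still frozen
```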
What actually scales
Same shape as the GitOps answer. Keep the pipeline for what it is good at. Put a real orchestrator on top of it.
      artifact path                     fleet path

CI ──> build, test, sign ──> registry
                                │
                                ▼
                          orchestrator
                            │   │   │
                            │   │   └─> wave 1: canaries
                            │   └─────> wave 2: SaaS regions
                            └─────────> wave N: locked tenants

orchestrator owns: target inventory, policy, dependency ordering,
                   live state, drift, approvals, resume, audit
The pipeline produces an artifact and a claim about it. The orchestrator decides where, when, in what order, under what conditions, and what to do when something goes wrong.
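
In code terms, "where, when, in what order, under what conditions" is roughly the loop below. A sketch only: deploy_to and healthy are hypothetical stand-ins for real delivery and signal mechanics, and the JSON file stands in for a real state store.

```python
import json

STATE_FILE = "rollout-state.json"

def deploy_to(target: str, artifact: str) -> None:
    """Hypothetical: hand the artifact to this target's delivery mechanism."""
    print(f"deploying {artifact} to {target}")

def healthy(target: str) -> bool:
    """Hypothetical: read this target's post-deploy signal."""
    return True  # placeholder

def rollout(artifact: str, waves: list[list[str]]) -> None:
    # Persisted per-target state is what makes resume surgical:
    # a re-run skips everything that already shipped.
    try:
        with open(STATE_FILE) as f:
            done = set(json.load(f))
    except FileNotFoundError:
        done = set()

    for wave in waves:
        for target in wave:
            if target in done:
                continue  # resume from the exact target where the last run died
            deploy_to(target, artifact)
            done.add(target)
            with open(STATE_FILE, "w") as f:
                json.dump(sorted(done), f)
        if not all(healthy(t) for t in wave):
            # Gate between waves: pause the fleet, keep the state.
            raise RuntimeError(f"unhealthy after wave {wave}; rollout paused")

rollout("app:v4.3", [["canary-1", "canary-2"],
                     ["eu-1", "us-1", "ap-1"],
                     ["tenant-locked-1"]])
```

CI can still trigger this. It just cannot be this.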
Stop asking your CI to be a release manager.
The rule of thumb
If your CI YAML has a for_each_customer loop in it, you have already outgrown CI as a deployment tool. The loop is a tell. It means the pipeline is being asked to model a fleet, and it has no language for fleets.
You do not need a bigger pipeline. You need a different layer.
Final take
CI/CD is necessary. It is not sufficient.
At 200 environments, the deploy step stops being a step in a pipeline and starts being its own system. Treating it as anything less is how you end up with a release process that only works on Tuesdays, and only if Maya is online.
Related Articles
Designing for 1000 Clusters: What Changes
At 1000 clusters, almost every assumption your platform was built on inverts. Targets become first-class objects, identity boundaries harden, drift is the default, and compliance moves into the runtime.
Why GitOps Doesn't Work at Scale (and What to Do Instead)
GitOps is excellent for small systems, but large enterprises hit failure modes around dependency coordination, rollback safety, compliance workflows, and configuration sprawl.