Cloud cost is an architecture problem wearing a finance costume
Most cloud-cost tools help you see spend. They rarely help you reduce it. The leverage is upstream, in the architecture — and in the way teams make trade-offs.
Every cloud platform now ships with a cost explorer. Every major observability vendor has a FinOps dashboard. The industry is not short of visibility. So why does the bill still go up?
Because cost is not primarily a finance problem. It is an architecture problem.
Where cloud spend actually comes from
After a few dozen cost audits, the shape of most overruns is the same. It is rarely a surprise service or a forgotten test environment. It is one of five things:
- Idle capacity held by habit. RDS instances sized for the old monolith; Kubernetes nodes scaled for the worst day of 2021; cross-AZ traffic charged on every hop of internal chatter between services.
- Data in the wrong place. Logs shipped to a premium tier and kept forever; object storage never moved to infrequent access; egress because a service talks to itself across regions.
- Autoscaling that does not. HPAs pinned to a fixed replica count “to avoid flapping”; instance families chosen three years ago; on-demand node groups running workloads that would happily run on spot.
- Observability bills bigger than compute. Every team emitting every metric at one-second resolution because the first default in the agent said so.
- Platform products paid for twice. Two secrets managers, two API gateways, two message buses — because each team picked their own before there was a platform team to pick for everyone.
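The first pattern — idle capacity — is usually found by ranking, not by staring at dashboards. A minimal triage sketch; the field names, the 40% threshold, and the proportional waste estimate are all illustrative assumptions, not a recommendation:

```python
from dataclasses import dataclass

@dataclass
class InstanceStats:
    name: str
    vcpus: int
    peak_cpu_pct: float   # highest CPU % observed over the lookback window
    monthly_cost: float   # on-demand cost in dollars

def idle_candidates(fleet, peak_threshold_pct=40.0):
    """Return (instance, estimated monthly waste) for under-used instances.

    Waste is a rough proportional estimate: the share of the bill paying
    for headroom above the observed peak. Crude, but enough to rank targets.
    """
    out = []
    for inst in fleet:
        if inst.peak_cpu_pct < peak_threshold_pct:
            waste = round(inst.monthly_cost * (1 - inst.peak_cpu_pct / 100), 2)
            out.append((inst, waste))
    return sorted(out, key=lambda pair: -pair[1])
```

A fleet report sorted this way tends to surface the “sized for the old monolith” instances within minutes; whether 40% peak CPU is the right bar depends on the workload’s burst profile.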
Tools show you the numbers. They do not solve these. Architecture does.
Treat cost as a first-class non-functional requirement
In the systems we build, cost is a design input, not a review finding. Concretely:
- Unit-cost budgets, not absolute ones. “This service must cost less than $X per million requests.” Unit targets are easier to check for regressions than fixed monthly numbers, and they let the business grow.
- Cost in the design doc. Every new service gets a back-of-envelope monthly cost estimate in its design doc. It does not need to be precise. It needs to be argued with.
- Cost gates in CI. Infrastructure-as-code plans surface the estimated monthly delta on pull requests. A 5× increase does not block the merge; it forces a conversation.
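The first and third practices fit in a few lines. A sketch of the unit-cost check and the PR gate, assuming you can get monthly cost estimates from somewhere (tools like Infracost can produce them from IaC plans); the function names and the 5× ratio are illustrative:

```python
def unit_cost(monthly_cost_usd: float, monthly_requests: int) -> float:
    """Cost per million requests — the number the budget is written against."""
    return monthly_cost_usd / (monthly_requests / 1_000_000)

def cost_gate(previous_monthly: float, planned_monthly: float,
              flag_ratio: float = 5.0) -> bool:
    """True when the plan's estimated delta deserves a conversation on the
    pull request. It flags; it does not block the merge."""
    if previous_monthly <= 0:
        return planned_monthly > 0  # brand-new spend always gets a look
    return planned_monthly / previous_monthly >= flag_ratio
```

The point of `cost_gate` returning a boolean rather than failing the build is the whole cultural trick: the gate starts an argument instead of ending one.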
These are small practices. They change what engineers argue about — which is where leverage lives.
Where we find the big wins
Roughly in order of typical ROI, across the clients we work with:
- Storage tiering on object stores. Lifecycle rules for S3/GCS/Blob that move cold data to infrequent-access and then to archive. Usually saves 30–60% of the storage bill with no application change.
- Log volume rationalisation. Sampling strategies, cold storage for compliance copies, and killing the five noisiest log sources that nobody reads. Often halves the observability bill.
- Right-sizing fleets. Move to current-generation instance families; match node groups to actual workload shapes; use Graviton where the runtime supports it.
- Spot, responsibly. Stateless, restart-tolerant workloads on spot with proper diversification can cut compute by 60–80%.
- Autoscaling that actually scales. Real load tests, not theoretical ones; HPAs tuned against real traffic; cluster autoscaler configured to evict politely.
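Before writing any lifecycle rules, it is worth sizing the prize. A back-of-envelope model for the storage-tiering win; the per-GB prices below are illustrative placeholders, not current cloud list prices, and the model ignores retrieval and transition fees:

```python
# Illustrative per-GB-month prices — NOT current cloud list prices.
TIER_PRICE = {"standard": 0.023, "infrequent": 0.0125, "archive": 0.004}

def tiering_savings(total_gb: float, cold_fraction: float,
                    cold_tier: str = "infrequent") -> float:
    """Estimated monthly saving from moving the cold share of a bucket
    off the standard tier. cold_fraction comes from access logs or
    storage analytics, not from guesswork."""
    cold_gb = total_gb * cold_fraction
    return cold_gb * (TIER_PRICE["standard"] - TIER_PRICE[cold_tier])
```

For a 100 TB bucket with half its data cold, even these placeholder prices put the monthly saving in the hundreds of dollars — enough to justify the afternoon the lifecycle rules take to write.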
Each of these is a few weeks of work. Each of them is also a cultural intervention: you are teaching the team to treat cost the way you teach them to treat performance.
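The sampling strategies in the log-volume point above usually start with head sampling. A minimal sketch, assuming structured log records carrying a severity level and a trace id — sampling by trace id keeps whole traces intact instead of punching random holes in them:

```python
import hashlib

def keep_log(level: str, trace_id: str, info_sample_pct: float = 1.0) -> bool:
    """Head sampling: keep every warning and above, keep a deterministic
    slice of info/debug chosen by hashing the trace id."""
    if level in ("warning", "error", "critical"):
        return True
    # Hash the trace id into one of 10,000 buckets; the same trace always
    # lands in the same bucket, so a trace is kept or dropped as a unit.
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < info_sample_pct * 100
```

Sampling info logs at 1% while keeping all warnings is often the single biggest lever on the observability bill; the compliance copy goes to cold storage unsampled.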
The part that tools cannot do
No dashboard will decide, for you, whether to close down one of your two message buses. No cost explorer will ask the uncomfortable question about the SaaS product that nobody uses but everyone is afraid to cancel. Those are architecture and governance conversations.
If you want the bill to come down and stay down, make cost part of the way you design systems. Ship less. Ship sharper. Measure it.
The savings follow.