
AI Infrastructure in the Cloud: a systems map from silicon to SLAs

A new playbook for architects and platform leads—GPU fleets, networks, storage, orchestration, and FinOps as one coupled chain, not separate dashboards.

Topics

  • AI infrastructure
  • cloud architecture
  • GPU computing
  • platform engineering
  • FinOps
  • systems thinking

The meeting where utilization looked fine

Most platform teams have lived through a version of this week-three review. GPU utilization on the new fleet reads in the sixties, which is not embarrassing. Finance still shows spend nearly doubling quarter over quarter. Product insists training is slow, but nobody can point to a single saturated chart. In the room, someone says the network is undersized. Someone else blames the framework. Storage quietly mentions checkpoint directories that hold a small country's worth of small files.

The fix, when it arrives, is rarely a faster chip. It is a whiteboard chain: bursty east-west traffic during checkpoint serialization, a metadata tier never sized for LLM-scale file creation, and a sidecar that retries aggressively when that tier hiccups. The GPUs were not broken. They were waiting—politely, expensively—while the rest of the stack argued about whose dashboard looked fine.

That pattern is why I wrote AI Infrastructure in the Cloud: A Systems Architect's Playbook. It covers accelerated compute, networking, and operations, from silicon to SLAs, for teams building AI in public clouds. The book is a map you can reuse when the room splits between "it's the network" and "it's the scheduler," and both sides have partial evidence.

AI in the cloud is not a model problem wearing deployment YAML. It is a systems problem where mathematics, silicon, networks, storage, orchestration, economics, and human process meet—and the weakest link wins.

A chain of bottlenecks, not a ladder of best practices

Architects like boxes and arrows. For AI infrastructure, the more useful picture is a chain of coupled bottlenecks. Each link changes the cost and failure modes of the whole system. Optimizing one box without seeing the chain is how teams buy the latest accelerators and still miss SLAs.

The map runs from accelerators and host CPUs through memory hierarchy, scale-up and scale-out interconnects, storage, networks, orchestration, observability, security, and finally customer-visible outcomes and money. None of these layers is garnish:

  • A kernel launch pattern can leave tensor cores idle while the chip waits on memory.
  • A checkpoint strategy can turn a healthy training run into a storage incident.
  • A scheduler default can pack workloads that fight over NUMA or NIC bandwidth in ways that look like "bad GPUs."
  • A missing budget tag can hide the experiment nobody will admit they left running.
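The first bullet can be made concrete with a roofline-style check. What follows is a minimal sketch, not a method taken from the book: the `Accelerator` class, its field names, and every number are hypothetical placeholders to be replaced with datasheet values for your own hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    """Hypothetical device figures; substitute your vendor's datasheet values."""
    peak_tflops: float        # peak math throughput, TFLOP/s
    mem_bandwidth_tbs: float  # peak device memory bandwidth, TB/s


def attainable_tflops(dev: Accelerator, flops_per_byte: float) -> float:
    """Roofline bound: min(compute roof, bandwidth * arithmetic intensity)."""
    return min(dev.peak_tflops, dev.mem_bandwidth_tbs * flops_per_byte)


def is_memory_bound(dev: Accelerator, flops_per_byte: float) -> bool:
    """A kernel is memory-bound when its intensity sits below the ridge point."""
    ridge = dev.peak_tflops / dev.mem_bandwidth_tbs  # FLOPs per byte
    return flops_per_byte < ridge


if __name__ == "__main__":
    # Illustrative numbers only, not a real product.
    dev = Accelerator(peak_tflops=400.0, mem_bandwidth_tbs=2.0)
    for name, intensity in [("elementwise add", 0.25), ("large matmul", 500.0)]:
        bound = attainable_tflops(dev, intensity)
        label = "memory-bound" if is_memory_bound(dev, intensity) else "compute-bound"
        print(f"{name}: at most {bound:.1f} TFLOP/s ({label})")
```

With these made-up figures the ridge point is 200 FLOPs per byte. A kernel well below that leaves the math units idle however new the chip is, which is the situation the bullet describes.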

Two habits repeat across every layer. Coupling beats labels: what is filed as a storage problem can turn out to be a network metadata problem, or a scheduler retry problem. Feedback beats heroics: postmortems should route into quotas, budgets, and architecture reviews rather than end as one-off fixes.

Technology, people, and culture in the same incident

The technical chain is only half the story. The same incident exposes how teams are organized:

Technology — metrics that look healthy in isolation (utilization, queue depth on one service) while the causal chain stays invisible.

People — specialists who speak different languages (network, storage, ML framework) without a shared map to reconcile symptoms into one story.

Culture — incentives that reward buying capacity before naming the bottleneck, or shipping features before checkpoint and quota discipline exists.

The book treats those three dimensions as connected systems, not separate workstreams. That is the same lens I use in advisory work: strategy only holds if teams can operate it.

Who this book is for—and what it is not

This is written for intermediate practitioners: platform engineers, cloud architects, SRE leads, and engineering managers who own AI capacity in public clouds. You already know containers and Kubernetes. You may not yet have a disciplined way to connect NCCL wait time, filesystem latency, and invoice line items on one screen.

It is not a substitute for ML theory or algorithm design. The boundary is explicit: once someone chooses a model family and training approach, how does it become reliable, efficient infrastructure?

It is not vendor marketing dressed as architecture. Patterns recur across environments; where a topic is vendor-shaped, the transferable lesson is what would change if you moved clouds or chip generations.

Across 18 chapters and roughly 320 pages, the book walks the chain in the order bottlenecks tend to appear as teams grow—foundations, silicon and topology, cluster services, production disciplines, and forward-looking trends. Each technical chapter follows the same rhythm: what the thing is, how practitioners reason about it, a composite case (symptoms and investigation, not benchmark flex), and a checklist you can copy into a runbook.

Training, inference, and mixed production

Teams blur these words constantly. For consistency throughout the book:

Training updates parameters at scale. It stresses scale-out networking, checkpoint durability, and gang scheduling. Failures cost hours of GPU time.

Inference applies a trained model to new inputs. Tail latency, autoscaling, cold start, and cost per request dominate.

Real platforms are mixed: fine-tuning, evaluation, batch scoring, and research sandboxes share clusters with production inference. Mixed is not a third algorithm; it is a scheduling and governance problem. Expect quotas, noisy neighbors, and capacity planning whenever you read about production AI platforms.
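The two definitions above imply different unit economics: training pays for failures in redone GPU-hours, while inference pays per request. The sketch below shows one way to put rough numbers on each. The function names and the simplifying assumptions (failures land uniformly between checkpoints, one replica serves a steady request rate) are mine, not the book's.

```python
def expected_lost_gpu_hours(num_gpus: int, checkpoint_interval_h: float) -> float:
    """Expected GPU-hours of work redone after one training failure.

    Assumes the failure is equally likely at any point between checkpoints,
    so half an interval is lost on average, and that every GPU in the gang
    rolls back together. Restart and reload time are ignored.
    """
    return num_gpus * checkpoint_interval_h / 2.0


def cost_per_1k_requests(gpu_hourly_usd: float, requests_per_second: float) -> float:
    """Serving cost per 1,000 requests for one replica at a sustained rate."""
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    requests_per_hour = requests_per_second * 3600.0
    return gpu_hourly_usd / requests_per_hour * 1000.0


if __name__ == "__main__":
    # Illustrative inputs only.
    print(expected_lost_gpu_hours(num_gpus=256, checkpoint_interval_h=2.0))
    print(cost_per_1k_requests(gpu_hourly_usd=3.6, requests_per_second=10.0))
```

Neither function captures tail latency or cold start. They are only meant to show why the same cluster needs different budgets and alerts for the two workload classes.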

What to do next

If you lead AI platform work, read Chapter 1 once as a compass, then hop by incident. Bring the checklists into your next design review or postmortem—name the bottleneck, quantify it with a simple model, pick two plausible fixes, and choose the one that survives time, money, risk, and people.
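"Quantify it with a simple model" can be as small as a step-time breakdown. The following is a hedged sketch under stated assumptions: the component names, the fleet size, and the prices are hypothetical, and the components are treated as non-overlapping (exposed) time, which real profilers may not report directly.

```python
from __future__ import annotations


def name_bottleneck(step_seconds: dict[str, float]) -> tuple[str, float]:
    """Return the largest non-compute component and its share of step time."""
    total = sum(step_seconds.values())
    if total <= 0:
        raise ValueError("total step time must be positive")
    waits = {k: v for k, v in step_seconds.items() if k != "compute"}
    if not waits:
        raise ValueError("no non-compute components supplied")
    worst = max(waits, key=lambda k: waits[k])
    return worst, waits[worst] / total


def monthly_wait_cost(
    step_seconds: dict[str, float],
    num_gpus: int,
    gpu_hourly_usd: float,
    hours_per_month: float = 730.0,
) -> float:
    """Dollars per month paid for time the GPUs spend not computing."""
    total = sum(step_seconds.values())
    if total <= 0:
        raise ValueError("total step time must be positive")
    wait_fraction = 1.0 - step_seconds.get("compute", 0.0) / total
    return wait_fraction * num_gpus * gpu_hourly_usd * hours_per_month


if __name__ == "__main__":
    # Hypothetical per-step timings in seconds (compute share is 65%).
    step = {"compute": 6.5, "exposed_comm": 1.0,
            "checkpoint_stall": 2.0, "data_loading": 0.5}
    worst, share = name_bottleneck(step)
    cost = monthly_wait_cost(step, num_gpus=100, gpu_hourly_usd=2.0)
    print(f"largest wait: {worst} ({share:.0%} of step time)")
    print(f"monthly cost of waiting: ${cost:,.0f}")
```

A model this crude will not settle an argument on its own, but it gives the room one number per candidate bottleneck, so two plausible fixes can be compared in the same units.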

Browse books for related titles and updates. When the playbook is listed there, you will find purchase details alongside other work on AI and cloud systems.

If you want help applying this map inside your organization—capacity planning, operating model, or executive alignment—book a call and we can scope where the chain is actually breaking for you.

AI Infrastructure in the Cloud: a systems map from silicon to SLAs | Sam Advisory Hub