Many of us have spent the last decade wrestling with hyperscale cloud providers and their ever-growing suite of managed services. The mainstream narrative pushes these as the ultimate solution for scalability and reliability. Yet, the specific discussions around "building a cloud," often sparked by critiques like David Crawshaw's, point to a deep dissatisfaction. The current state for many organizations looks like this:
This diagram, while simplified, shows the typical Kubernetes-centric architecture. It's a system designed for extreme elasticity and resilience, but it comes with a steep learning curve and significant operational overhead. The promise is abstraction; the reality is often a flood of YAML, obscure error messages, and a constant battle with `kubectl`.
When someone says, "building a cloud," they usually mean they want to replace this with something simpler. Often, it's a return to virtual machines, perhaps orchestrated by a basic scheduler, running applications directly. They want to own the stack, from the hypervisor up. This isn't a new idea; it's a re-evaluation of where the complexity should reside.
The Architecture You're Trying to Escape (and What You're Really Building)
The illusion of control is a powerful motivator. You think you're shedding complexity by moving away from Kubernetes, but when you build your own cloud you're simply shifting it. You're not eliminating the problems of distributed systems; you're inheriting them, often without the battle-tested tooling and expertise that hyperscalers have spent decades building.
Consider a basic scenario: you're running a few VMs, each hosting a service. What happens when one VM fails? In a hyperscale environment, a managed service like EC2 Auto Scaling or GKE handles replacement and rescheduling automatically. In your custom cloud, you now need to build that yourself. You need a scheduler, a health checker, a provisioning system. You need to manage network overlays, storage replication, and secret distribution.
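To make "you need a health checker" concrete, here is a minimal sketch of the kind of component you'd be signing up to own. The TCP probe and three-strikes restart policy are illustrative assumptions, not a production design:

```python
import socket


class HealthChecker:
    """Tracks consecutive probe failures for one VM-hosted service and
    decides when a restart is warranted -- one tiny slice of what a
    managed service like EC2 Auto Scaling does for you."""

    def __init__(self, host, port, threshold=3, timeout=1.0):
        self.host, self.port = host, port
        self.threshold = threshold  # consecutive failures before restart
        self.timeout = timeout
        self.failures = 0

    def probe(self):
        """Return True if a plain TCP connect to the service succeeds."""
        try:
            with socket.create_connection((self.host, self.port), self.timeout):
                return True
        except OSError:
            return False

    def record(self, healthy):
        """Update the failure count; return True when a restart is due."""
        self.failures = 0 if healthy else self.failures + 1
        return self.failures >= self.threshold
```

And this is only the detection half: acting on that restart signal means deregistering the instance from service discovery, draining connections, and re-provisioning it elsewhere, each of which is another subsystem you now maintain.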
Why Building a Cloud Will Break at Scale
This is where operational debt starts to pile up. You're not just building an application; you're building a platform. That means:
- Talent Drain: You need a team of highly specialized engineers who understand networking, virtualization, storage, and distributed consensus. These aren't cheap or easy to find.
- Tribal Knowledge Trap: Without solid documentation and automated processes, your cloud's operational knowledge becomes siloed within a few key individuals. If they leave, your cloud's stability is at risk.
- The Thundering Herd: When a service instance fails and restarts, or when a sudden traffic spike hits, your custom load balancer or service discovery mechanism might not handle the sudden influx of requests or connection attempts gracefully. This can cascade, bringing down other services that were otherwise healthy. Hyperscalers have spent years mitigating this with sophisticated backoff strategies and circuit breakers. You'll have to build your own.
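One of the backoff strategies mentioned above can be sketched in a few lines. This is the "full jitter" variant of exponential backoff, which spreads reconnect attempts across a random window instead of synchronizing them; the base and cap values are illustrative:

```python
import random


def backoff_delay(attempt, base=0.1, cap=30.0):
    """'Full jitter' exponential backoff: retry attempt N waits a random
    duration in [0, min(cap, base * 2**N)]. The randomness de-synchronizes
    clients, so a mass restart doesn't become a reconnect stampede."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

Every custom client in your platform needs something like this baked in; with a hyperscaler, the SDKs typically ship with it.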
You might achieve greater stability and cost reduction for specific, predictable workloads with simpler VM setups, as many on Hacker News point out. But the moment you need true elasticity, multi-tenancy, or self-service for diverse teams, you start re-inventing the very abstractions you tried to escape.
The Hard Choices for Building a Cloud: Consistency, Availability, and Your Custom Platform
This is where the rubber meets the road. When you build a cloud and its distributed systems, you are forced to confront the CAP theorem: during a network partition, a system can remain available (AP) or consistent (CP), but not both. A design that claims both is ignoring Brewer's theorem. Hyperscalers offer services that make these trade-offs explicit, like DynamoDB (AP by default) or Spanner (CP with global consistency). When building your own, you're making these decisions implicitly, often without realizing the long-term implications.
Let's say you're building a custom storage layer. Do you prioritize strong consistency, meaning every read sees the most recent write, even if it means higher latency or reduced availability during network partitions? Or do you opt for eventual consistency, where reads might return stale data for a period, but the system remains highly available?
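For a replicated storage layer, this choice often reduces to quorum sizing. With N replicas, W write acknowledgements, and R read acknowledgements, every read quorum overlaps every write quorum exactly when R + W > N, which is what gives you "every read sees the most recent write". A sketch of that arithmetic:

```python
def is_strongly_consistent(n, w, r):
    """With n replicas, w write acks, and r read acks, any read quorum
    overlaps any write quorum iff r + w > n -- so every read intersects
    the replicas holding the latest acknowledged write."""
    return r + w > n


# N=3: majority writes and reads (W=2, R=2) give strong consistency,
# at the cost of latency. Fast single-replica reads and writes
# (W=1, R=1) leave you with eventual consistency.
```

Raising W and R buys consistency but costs latency and availability during partitions; lowering them does the reverse. That is the trade-off you'll be re-deriving for every stateful service you build.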
If you choose eventual consistency, you then need to design your application logic to handle stale reads and potential conflicts. This means:
- Idempotency: Every operation your services perform must be idempotent. If a message queue delivers a message twice (`Kafka`, for example, guarantees at-least-once delivery by default), your consumer shouldn't double-charge a customer or create duplicate records. This is a non-negotiable requirement in any distributed system, but it becomes even more critical when you're managing the underlying infrastructure.
- Conflict Resolution: How do you resolve conflicts when two nodes write to the same data item concurrently? Hyperscalers provide mechanisms for this; you'll need to implement your own.
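The idempotency requirement above can be sketched as a consumer that deduplicates on a message ID before applying any side effect. The in-memory `seen` set here is a stand-in for the durable, replicated store a real system would need:

```python
def make_charge_consumer():
    """Returns an at-least-once-safe consumer: redelivered messages
    (same message_id) are applied to the balances dict exactly once."""
    seen = set()     # in production: a durable store, not process memory
    balances = {}

    def consume(message_id, customer, amount):
        if message_id in seen:
            return False  # duplicate delivery; side effect already applied
        seen.add(message_id)
        balances[customer] = balances.get(customer, 0) + amount
        return True

    return consume, balances
```

Note that the dedup check and the side effect must commit atomically; if the process crashes between them, you're back to double-charging. That atomicity is precisely the kind of detail a managed queue-plus-database combination handles for you.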
These aren't trivial problems. They are fundamental challenges in distributed systems design. Without a clear architectural stance on these trade-offs, your custom cloud will be a source of constant data integrity issues and unpredictable behavior.
When Building a Cloud Makes Sense (and What You'll Need)
So, when does "building a cloud" actually make sense? It's a niche scenario, not a universal solution for cloud fatigue.
- Highly Predictable, High-Utilization Workloads: If you have a stable set of applications with consistent resource demands, where you can achieve near 100% utilization of your hardware, a private cloud can offer TCO advantages. Think large-scale batch processing or dedicated compute farms.
- Strict Regulatory Compliance: Industries with stringent data residency or security requirements might find it easier to meet compliance by owning the entire stack.
But even in these cases, the investment is immense. You're not just buying servers; you're building a platform engineering organization.
Here's what you'll need to consider:
- Minimal Abstractions: Resist the urge to re-create Kubernetes. Focus on simpler, well-defined interfaces for provisioning compute, storage, and networking. Think `libvirt` and `QEMU` with a basic API layer, not a full-blown container orchestrator.
- Observability from Day One: You cannot operate what you cannot see. Implement solid logging, metrics, and tracing across every layer of your custom stack. This isn't an afterthought; it's foundational.
- Automated Operations: Manual intervention is the enemy of scale and reliability. Automate provisioning, deployment, scaling, and recovery. This means investing heavily in `Ansible`, `Terraform`, or similar tools, and building a continuous delivery pipeline for your infrastructure.
- Clear Data Consistency Model: Define your data consistency requirements for each service. If you need strong consistency, use distributed consensus protocols like `Raft` or `Paxos` for your critical state. If eventual consistency is acceptable, design for it explicitly with idempotent operations and conflict resolution strategies.
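For the eventual-consistency path in the last point, one explicit conflict-resolution strategy is a last-writer-wins merge of replica states. This sketch assumes comparable timestamps and breaks ties deterministically on the value; a production system would prefer node IDs, vector clocks, or CRDTs over wall-clock timestamps:

```python
def lww_merge(a, b):
    """Last-writer-wins merge of two replica states, each a dict of
    key -> (timestamp, value). Ties break on the value so the merge is
    deterministic and commutative across replicas."""
    merged = dict(a)
    for key, (ts, val) in b.items():
        if key not in merged or (ts, val) > merged[key]:
            merged[key] = (ts, val)
    return merged
```

The point is not this particular policy but that the policy is written down and applied the same way on every replica; "whichever write happens to land last" is not a strategy, it's a data-corruption bug waiting to be filed.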
This is a more realistic "simple cloud" architecture. It still has distributed components, but the focus is on explicit management rather than implicit orchestration.
The idea that building a cloud automatically grants superior control and cost efficiency is often an illusion. For most organizations, it leads to substantial technical debt and unforeseen complexities. The hyperscalers exist for a reason: they've solved these problems at a scale and with an investment that few individual companies can match. If you're going to build your own cloud, understand that you're not just building infrastructure; you're building a platform company, and that's a different kind of business entirely.