Project: Zero Downtime Lab — What Happens When Your Cluster Breaks?

TL;DR

  • An interactive lab where you attack real Kubernetes pods and observe the impact on service availability
  • Built to explore what horizontal scaling actually protects against — and where its limits are
  • Hands-on cluster administration experience matters more than ever in the age of AI

The idea

Kubernetes promises high availability through horizontal scaling. Deploy three replicas, and if one goes down, the others pick up the load. But what actually happens when components fail? How does the failure of individual pods affect the service that real users see? And where do single points of failure still exist, even in a horizontally scaled setup?

These are the questions that motivated the Zero Downtime Lab: an interactive environment where visitors can actively attack a running Kubernetes cluster and observe the consequences in real time.

What you can do

The lab exposes a simple web service running as three replicas inside a Kubernetes cluster. As a visitor, you can target individual pods, reduce their health, and eventually take them down. Meanwhile, you see the actual service response — either a healthy 200 or, if you manage to overwhelm all replicas, a real 503 error. Nothing is simulated. When you kill a pod, Kubernetes deletes it and spins up a replacement. The 503 you see is a genuine service outage.
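The setup behind this can be sketched as an ordinary Deployment. This is an illustrative sketch only, not the lab's actual manifest; the names, image, and probe path are placeholders:

```yaml
# Hypothetical sketch of the lab's shape: three replicas of one web service.
# All names, the image, and the probe path are invented for illustration.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: demo-web
  template:
    metadata:
      labels:
        app: demo-web
    spec:
      containers:
        - name: web
          image: example/demo-web:latest
          ports:
            - containerPort: 8080
          # The readiness probe is what lets the Service stop routing
          # traffic to a pod whose health has been degraded.
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 5
```

Kill any one of these pods and the Deployment's ReplicaSet immediately creates a replacement. Only when every replica is down or failing readiness at the same time is there no healthy backend left, and whatever sits in front of the Service answers clients with an error.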

The core question: Can you bring the service down?

It turns out that taking down a horizontally scaled service is harder than you might think. Kubernetes restarts killed pods automatically, the replicas you are not targeting quietly heal while you focus on one, and to cause a real outage you have to overwhelm all replicas simultaneously. This is exactly the resilience that horizontal scaling provides — and experiencing it interactively makes the concept tangible in a way that reading documentation never can.

Beyond pod scaling

What makes this project interesting to me goes far beyond just killing pods. The real learning happens when you start asking: what else can fail?

In a managed cloud environment, many components are abstracted away. The load balancer, the ingress controller, DNS resolution, the container runtime — these feel invisible. But in a self-hosted setup on Raspberry Pis in my living room, every single component is something I had to set up, configure, and maintain. And every single one of them is a potential point of failure.

Horizontal scaling solves one layer of the problem. But what about:

  • The ingress controller that routes traffic to your pods — what if that goes down?
  • The network switch that connects your nodes — can you scale that?
  • The router that provides your public IP — is there a failover?
  • Stateful services like databases — how do you horizontally scale those?

These questions are easy to ignore in the cloud, where the provider handles them transparently. In a home setup, they become very real, very fast. And that is exactly the point.

Why hands-on experience matters now more than ever

We live in a time where AI can generate Kubernetes manifests, write Helm charts, and explain cluster architecture in seconds. But there is a fundamental difference between having AI produce a configuration and truly understanding what happens when that configuration meets reality.

When a pod gets stuck in CrashLoopBackOff at 2 AM, when Traefik stops routing traffic after a certificate renewal, when a node runs out of memory because you forgot to set resource limits — these are the moments where real experience counts. No amount of generated YAML prepares you for debugging a live cluster under pressure.
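The resource-limit case in particular comes down to a few lines in the container spec. The values below are arbitrary examples, not a recommendation:

```yaml
# Illustrative fragment of a container spec. Without limits, one runaway
# container can exhaust a node's memory and take unrelated pods down with it.
resources:
  requests:
    memory: "128Mi"
    cpu: "100m"
  limits:
    memory: "256Mi"   # past this, the container is OOM-killed -- not the node
```

Note the asymmetry: exceeding a CPU limit merely throttles the container, while exceeding a memory limit kills it. Forgetting limits entirely leaves both decisions to the node, usually at the worst possible moment.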

The Zero Downtime Lab is my way of deliberately creating these situations. Not in a staging environment that nobody cares about, but on a service that real visitors interact with. The pressure of maintaining an actual service, even a small one, forces a level of care and understanding that sandbox experiments simply do not provide.

What is next

The lab is designed to evolve. The current level lets you attack web service pods. Future levels will expose deeper cluster components — the ingress controller, internal routing, and other infrastructure that most people never think about until it breaks. Each level reveals another layer of the stack and another set of assumptions about what high availability really means.

If you want to try it yourself: Zero Downtime Lab