
Why Kubernetes Operators Exist — The Problem They Actually Solve

Author: Moinuddin M Masud

Part of the series: From Zero to Kubernetes Operators with Kubebuilder (KubeAgent Journey)

Full code for this post: github.com/madmmas/kubeoperator-journey/tree/blog-01
Run it: go run ./cmd/why-operators inside the repo root


Most operator tutorials open with a scaffold command. You run kubebuilder init, stare at generated files, and follow along without understanding what problem you're actually solving.

This series does it differently. We're going to feel the pain first.

By the end of this post you'll understand why operators exist at a level that makes every Kubebuilder concept click into place — because you'll have seen exactly what life looks like without one.


The Infrastructure Automation Gap

Kubernetes is exceptional at managing stateless workloads. You define a Deployment, set your replica count, and Kubernetes handles the rest — scheduling, restarts, rolling updates, scaling. The built-in controllers are mature, battle-tested, and handle failure gracefully.

But stateful systems are a different problem.

Consider a PostgreSQL cluster. The steps to provision it correctly aren't just "apply a YAML file." You need to:

  • Create PersistentVolumes and bind them to specific nodes
  • Bootstrap the primary replica before secondaries can join
  • Configure replication slots and WAL settings per node
  • Wait for readiness in the right order (not just pod running, but actually accepting connections)
  • Register the cluster with your monitoring system
  • Configure backup cron jobs with correct IAM permissions
  • Document the current state somewhere so the next engineer isn't guessing

Now multiply this by every database, every environment, every team. Add upgrades, scaling events, disaster recovery drills, and version migrations. This is what platform teams actually do when there is no operator.

They write runbooks. They build shell scripts that half-work. They create Ansible playbooks that drift from reality. They get paged at 2am when something falls out of sync and nobody noticed.

Kubernetes gave us a declarative API for pods. But for complex stateful systems, we were still writing imperative procedures.

That's the gap operators fill.


What We're Going to Simulate

The companion code for this post (cmd/why-operators) simulates exactly this gap — the manual toil of managing database clusters without an operator.

It's not pretty. It's not supposed to be. It's an honest representation of what these runbooks feel like at runtime.

git clone https://github.com/madmmas/kubeoperator-journey
cd kubeoperator-journey
git checkout blog-01
go run ./cmd/why-operators

Let's walk through what it does and why each part matters.


The Manual Operator Problem

The internal/problem/manual_operator.go file models a ManualOperator — a simulation of a human engineer (or a fragile script) managing a DatabaseCluster struct:

type DatabaseCluster struct {
    Name      string
    Replicas  int
    Version   string
    BackupURL string
    Status    ClusterStatus
}

This is the thing we're trying to manage. The ClusterStatus values represent states that matter — Provisioning, Running, Degraded, Failed — and in a real system, you'd want something watching these states continuously and acting on them.

Without an operator, nobody is watching. The status only updates when someone runs a check manually.
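
The post does not show how ClusterStatus is defined. A minimal sketch, assuming it is a string type with one constant per state: StatusDegraded and StatusFailed appear in the excerpts later in this post, while StatusProvisioning, StatusRunning, and the needsAttention helper are names assumed here for illustration.

```go
package main

import "fmt"

// ClusterStatus is assumed to be a string-backed enum.
type ClusterStatus string

const (
	StatusProvisioning ClusterStatus = "Provisioning" // assumed name
	StatusRunning      ClusterStatus = "Running"      // assumed name
	StatusDegraded     ClusterStatus = "Degraded"
	StatusFailed       ClusterStatus = "Failed"
)

// needsAttention reports whether something should act on this state.
// Hypothetical helper, not part of the companion code.
func needsAttention(s ClusterStatus) bool {
	return s == StatusDegraded || s == StatusFailed
}

func main() {
	all := []ClusterStatus{StatusProvisioning, StatusRunning, StatusDegraded, StatusFailed}
	for _, s := range all {
		fmt.Printf("%-12s needs attention: %v\n", s, needsAttention(s))
	}
}
```

The function is trivial. What the manual setup lacks is anything that calls it continuously.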

Provisioning: The Runbook Problem

Here's the provisioning function:

func (m *ManualOperator) Provision(name string, replicas int, version string) error {
    fmt.Println("[MANUAL] Step 1: SSH into node, create PersistentVolume manually...")
    time.Sleep(200 * time.Millisecond)

    fmt.Println("[MANUAL] Step 2: Apply StatefulSet YAML (hope it's the right version)...")
    time.Sleep(200 * time.Millisecond)

    fmt.Println("[MANUAL] Step 3: Wait for pods... (checking every 10s like it's 2014)...")
    time.Sleep(300 * time.Millisecond)

    fmt.Println("[MANUAL] Step 4: Configure replication manually between nodes...")
    time.Sleep(200 * time.Millisecond)

    // Simulate what makes on-call miserable: random failure
    if rand.Float32() < 0.3 {
        return fmt.Errorf("provisioning failed: node not ready")
    }
    // ...
}

The time.Sleep calls aren't padding. They stand in for real latency: each step in a real provisioning flow takes time, whether that is waiting for pods to schedule, for volumes to bind, or for replication to sync. The 30% failure rate is an illustrative number, not a measurement, but it will feel familiar to anyone who has provisioned complex stateful systems without automated retries.

When you run this, you'll see provisioning fail for some clusters. That's the point. In real infrastructure, failures like this trigger Slack threads, incident timelines, and postmortems. With an operator, they trigger a retry — automatically, without waking anyone up.

Health: The Visibility Problem

func (m *ManualOperator) CheckHealth(name string) ClusterStatus {
    // cluster: the DatabaseCluster for name (lookup elided in this excerpt)

    // Simulate drift: the cluster silently degrades over time
    if rand.Float32() < 0.2 {
        cluster.Status = StatusDegraded
        fmt.Println("[MANUAL] !! Replica drift detected. Add to backlog.")
    }
    return cluster.Status
}

This reveals a subtler problem than provisioning failures: you only know the health status when you check. Between checks, the cluster could be degraded for hours. In a manual system, health checks are a cron job. They run periodically. In between runs, you're operating blind.

An operator reconciles continuously. It watches the actual state of the cluster, compares it against the desired state, and acts as soon as a change event shows they have diverged, rather than waiting for the next cron interval.

Upgrades: The 2am Problem

func (m *ManualOperator) UpgradeVersion(name, newVersion string) error {
    // cluster: the DatabaseCluster for name (lookup elided in this excerpt)

    fmt.Println("[MANUAL] Step 1: Read 47-step upgrade runbook...")
    fmt.Println("[MANUAL] Step 2: Take manual snapshot (hope there's enough disk space)...")
    fmt.Println("[MANUAL] Step 3: Rolling restart, praying for no data loss...")
    time.Sleep(500 * time.Millisecond)

    if rand.Float32() < 0.4 {
        cluster.Status = StatusFailed
        return fmt.Errorf("upgrade failed: replica 2 didn't rejoin. Page the DBA")
    }
    // ...
}

A 40% failure rate on upgrades sounds absurd until you've done rolling upgrades on stateful clusters without automation. The failure modes are real: a replica fails to rejoin after restart, a schema migration times out, a configuration difference between nodes causes split-brain. The runbook tells you what to do when things go right. It's rarely detailed enough for what happens when they don't.


Running the Full Simulation

When you run go run ./cmd/why-operators, you'll see all five operations play out in sequence: provision, check health, scale, configure backup, upgrade. The output will be different every run because failures are random — just like production.

━━━ STEP 1: Provisioning ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[MANUAL] Provisioning "production-postgres" 3 replicas, version 15.3
[MANUAL]   Step 1: SSH into node, create PersistentVolume manually...
[MANUAL]   Step 2: Apply StatefulSet YAML (hope it's the right version)...
[MANUAL]   Step 3: Wait for pods... (checking every 10s like it's 2014)...
[MANUAL]   Step 4: Configure replication manually between nodes...
[MANUAL]   !! FAILED: Node not ready. Check Slack. Good luck.

  In real life: escalate to senior engineer, delay deployment,
   update incident timeline, write postmortem.
   With an operator: it retries automatically and reports status.

After running it, open internal/crd/database-cluster-example.yaml.


The Operator Alternative

This is everything the simulation above does, expressed as a custom resource (an instance of a DatabaseCluster type that a CRD would define):

apiVersion: databases.madmmas.dev/v1alpha1
kind: DatabaseCluster
metadata:
  name: production-postgres
  namespace: data-platform
spec:
  replicas: 3
  version: "15.4"
  storage:
    size: 100Gi
    storageClass: fast-ssd
  backup:
    enabled: true
    schedule: "0 2 * * *"
    destination: s3://backups/production-postgres
    retention: 30d
  monitoring:
    enabled: true
    alertOnDegradedReplica: true

You write this. You apply it with kubectl apply -f cluster.yaml. The operator handles everything else.
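
On the Go side, an operator needs types that mirror this spec. Below is a sketch of what they might look like, with field names inferred from the YAML above. The types the series eventually generates with Kubebuilder may differ, and the monitoring block is left out for brevity. JSON is used only because the standard library can decode it.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Types inferred from the example YAML; assumptions, not the series' final API.
type StorageSpec struct {
	Size         string `json:"size"`
	StorageClass string `json:"storageClass"`
}

type BackupSpec struct {
	Enabled     bool   `json:"enabled"`
	Schedule    string `json:"schedule"`
	Destination string `json:"destination"`
	Retention   string `json:"retention"`
}

type DatabaseClusterSpec struct {
	Replicas int         `json:"replicas"`
	Version  string      `json:"version"`
	Storage  StorageSpec `json:"storage"`
	Backup   BackupSpec  `json:"backup"`
}

// parseSpec decodes the spec portion of a DatabaseCluster from JSON.
func parseSpec(data []byte) (DatabaseClusterSpec, error) {
	var spec DatabaseClusterSpec
	err := json.Unmarshal(data, &spec)
	return spec, err
}

func main() {
	raw := []byte(`{
		"replicas": 3,
		"version": "15.4",
		"storage": {"size": "100Gi", "storageClass": "fast-ssd"},
		"backup": {"enabled": true, "schedule": "0 2 * * *",
		           "destination": "s3://backups/production-postgres", "retention": "30d"}
	}`)
	spec, err := parseSpec(raw)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Printf("%d replicas of version %s on %s storage\n",
		spec.Replicas, spec.Version, spec.Storage.StorageClass)
}
```

Everything the operator does starts from a value like spec: it is the desired state that the control loop compares against reality.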

Not just once — continuously. It watches the cluster. When a replica falls behind, it resyncs it. When the backup job fails, it retries and alerts. When you change replicas: 3 to replicas: 5, it scales safely. When you update the version, it runs a rolling upgrade with automatic rollback if something goes wrong.

The operator is not a deployment script. It's not a cron job. It's a continuous control loop running inside your cluster, watching your resources, and reconciling the actual state of the world toward the state you declared.

That's the fundamental shift.


Kubernetes Was Already Doing This

The insight that makes operators click: Kubernetes itself works this way.

When you create a Deployment with replicas: 3, you're not telling Kubernetes how to create three pods. You're declaring that you want three pods. The built-in Deployment controller, which is itself just a control loop watching Deployment resources, handles the rest. It creates ReplicaSets and drives rolling updates, and the ReplicaSet controller beneath it keeps the pod count correct. If you delete a pod manually, that controller notices the actual state (2 pods) diverges from the desired state (3 pods) and creates a new one.
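
The compare-and-act step for replicas can be sketched as a pure function. This is a toy model, not the real controller code: Action and reconcileReplicas are hypothetical names, and real controllers create and delete pods through the API server rather than returning a value.

```go
package main

import "fmt"

// Action describes what a controller would do to close the gap.
type Action struct {
	Create int // pods to create
	Delete int // pods to delete
}

// reconcileReplicas compares desired and actual pod counts and returns the
// action needed to converge. Once the action has been applied, calling it
// again returns a zero Action, so the loop is safe to repeat.
func reconcileReplicas(desired, actual int) Action {
	switch {
	case actual < desired:
		return Action{Create: desired - actual}
	case actual > desired:
		return Action{Delete: actual - desired}
	default:
		return Action{}
	}
}

func main() {
	desired, actual := 3, 3
	actual-- // someone deletes a pod manually
	act := reconcileReplicas(desired, actual)
	fmt.Printf("actual=%d desired=%d -> create %d, delete %d\n",
		actual, desired, act.Create, act.Delete)

	actual += act.Create - act.Delete // apply the action
	fmt.Printf("after reconcile: %+v\n", reconcileReplicas(desired, actual))
}
```

Notice that the function never asks how the cluster got into its current state. It only looks at where things are and where they should be.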

Operators extend this pattern to your own custom resources.

The same mechanism that Kubernetes uses internally to manage Pods, Services, and Deployments — you get to use it for your database clusters, your ML model deployments, your backup schedules, your certificate renewals.

Built-in resources:   Pod, Deployment, Service → managed by built-in controllers
Your custom resources: DatabaseCluster, BackupSchedule, MLModel → managed by operators YOU write

The API is the same. The control loop pattern is the same. The only difference is you're now writing the controller.


Real-World Operator Examples

To make this concrete before we write any code, here's what production operators actually automate:

Domain        What Operators Handle
Databases     Provisioning, replication, failover, backup, restore, version upgrades
Certificates  Issuance, renewal, rotation (cert-manager)
ML Models     Model serving deployment, traffic splitting, rollback
Kafka         Topic creation, partition rebalancing, user ACLs
Policy        Admission control, quota enforcement, network policy
Monitoring    Alertmanager config, Grafana dashboards as code

The OperatorHub.io catalogue has over 300 published operators. Every major database vendor ships one. The pattern has become the standard way to package complex operational knowledge in Kubernetes.


What Operators Are Not

One important boundary to set early:

Operators are not magic. They encode the same operational knowledge that lives in your runbooks — provisioning steps, health checks, upgrade procedures, failure recovery. The difference is that knowledge becomes code, runs continuously, and doesn't require a human to execute it.

An operator is only as good as the operational knowledge it encodes. A poorly written operator that doesn't handle failure modes correctly is worse than a careful human following a good runbook — it will fail automatically, at scale, without waking anyone up to notice.

This series pays attention to that. We'll write operators that handle the hard cases: failed reconciliations, partial state, external resource cleanup, upgrade rollback. Not just the happy path.


What's Next

In Blog 2, we get into the mechanism: the Kubernetes control loop. We'll cover how etcd, the kube-apiserver, and the kube-controller-manager work together, what "watch" actually means at the API level, and how the Watch → Compare → Act cycle drives everything Kubernetes does, including the operators you write.

By Blog 3, we'll have Kubebuilder installed and our first project scaffolded. The DatabaseCluster resource from this post's YAML is the custom resource we build toward throughout the series, starting with the CRD that defines it.

See you in Blog 2.


This is Blog 1 of the series "From Zero to Kubernetes Operators with Kubebuilder (KubeAgent Journey)." The series covers Kubernetes foundations, real operator development with Kubebuilder, and evolving into intelligent KubeAgent-style systems.
