Engineering · May 20, 2025 · 11 min read

How We Achieved 99.99% Uptime

The architecture decisions, chaos engineering practices, and on-call culture that keep our platform highly available.

Nina Patel
Site Reliability Engineer

99.99% uptime sounds impressive until you do the math: it still allows about 52 minutes of downtime per year (525,600 minutes × 0.01% ≈ 52.6). For a platform where customers run production workloads, even that feels like too much. This post explains how we designed KubeBid for high availability and the practices that keep us reliable.

Defining Reliability

First, what do we mean by "uptime"? We measure availability across three components: the control-plane API, the auction engine, and cluster provisioning.

Our SLA covers all three. A failure in any one of them counts against our uptime budget.

Architecture for Availability

Multi-Region, Active-Active

KubeBid runs in 40 regions, but that's not just about being close to users. It's about fault isolation. Each region is an independent failure domain:

┌─────────────────────────────────────────────────────────────┐
│                    Global Load Balancer                      │
│                   (Cloudflare / Route53)                     │
└─────────────────────────────────────────────────────────────┘
                              │
         ┌────────────────────┼────────────────────┐
         ▼                    ▼                    ▼
  ┌─────────────┐      ┌─────────────┐      ┌─────────────┐
  │  us-west-2  │      │  us-east-1  │      │  eu-west-1  │
  │             │      │             │      │             │
  │  ┌───────┐  │      │  ┌───────┐  │      │  ┌───────┐  │
  │  │ API   │  │      │  │ API   │  │      │  │ API   │  │
  │  │ Server│  │      │  │ Server│  │      │  │ Server│  │
  │  └───────┘  │      │  └───────┘  │      │  └───────┘  │
  │  ┌───────┐  │      │  ┌───────┐  │      │  ┌───────┐  │
  │  │ etcd  │  │      │  │ etcd  │  │      │  │ etcd  │  │
  │  │cluster│  │      │  │cluster│  │      │  │cluster│  │
  │  └───────┘  │      │  └───────┘  │      │  └───────┘  │
  │  ┌───────┐  │      │  ┌───────┐  │      │  ┌───────┐  │
  │  │Auction│  │      │  │Auction│  │      │  │Auction│  │
  │  │Engine │  │      │  │Engine │  │      │  │Engine │  │
  │  └───────┘  │      │  └───────┘  │      │  └───────┘  │
  └─────────────┘      └─────────────┘      └─────────────┘

If us-west-2 goes down completely, traffic automatically routes to healthy regions. Customer clusters in that region would be affected, but the platform remains operational.
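DNS is what makes that failover automatic. As an illustrative sketch (the provider split between Cloudflare and Route53, the domain names, and the IPs here are assumptions, not our exact setup), latency-based Route53 records with per-region health checks look roughly like this in CloudFormation:

# Illustrative sketch: one latency-based record per region, each tied to a
# health check. A failing health check pulls that region out of DNS rotation
# while the other regions keep serving.
Resources:
  UsWest2HealthCheck:
    Type: AWS::Route53::HealthCheck
    Properties:
      HealthCheckConfig:
        Type: HTTPS
        FullyQualifiedDomainName: us-west-2.api.example.com  # placeholder endpoint
        ResourcePath: /healthz
        RequestInterval: 10
        FailureThreshold: 3

  UsWest2Record:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneName: example.com.
      Name: api.example.com.
      Type: A
      TTL: "60"
      ResourceRecords:
        - 203.0.113.10                   # placeholder regional load balancer IP
      Region: us-west-2                  # latency-based routing
      SetIdentifier: us-west-2
      HealthCheckId: !Ref UsWest2HealthCheck
  # us-east-1 and eu-west-1 get analogous health checks and record sets.

Cloudflare's load balancer expresses the same idea with origin pools and health monitors.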

Cell-Based Architecture

Within each region, we use a cell-based architecture. Each cell is a self-contained unit that serves a subset of customers, so the blast radius of a failure in one cell is limited to the customers assigned to it.
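One way to picture a cell is as routing metadata: customers are assigned to cells, and the regional router consults that assignment on every request. A deliberately simplified, hypothetical sketch of what such an assignment could look like (the ConfigMap name and schema are illustrative, not our actual ones):

# Hypothetical cell-assignment map consumed by a regional request router.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cell-assignments        # illustrative name
  namespace: routing
data:
  cells.yaml: |
    cells:
      - name: cell-01
        maxCustomers: 200       # capping cell size keeps the blast radius bounded
        customers: ["acme-corp", "globex"]
      - name: cell-02
        maxCustomers: 200
        customers: ["initech"]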

Redundancy at Every Layer

Every critical component has redundancy:

Component             Redundancy                Recovery Time
API Servers           5 replicas, 3 AZs         <1s (automatic)
etcd                  5 nodes, 3 AZs            <30s (leader election)
Database (Postgres)   Primary + 2 replicas      <60s (failover)
Auction Engine        3 replicas per region     <5s (automatic)
Load Balancers        Multi-AZ, multi-region    <1s (automatic)
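As a concrete example of the first row, the five API server replicas are spread evenly across three availability zones. A minimal sketch of expressing that with a Kubernetes topology spread constraint (the Deployment name, labels, and image are placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server                                 # placeholder name
spec:
  replicas: 5
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      topologySpreadConstraints:
      - maxSkew: 1                                 # no zone gets more than one extra replica
        topologyKey: topology.kubernetes.io/zone   # spread across availability zones
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: api-server
      containers:
      - name: api-server
        image: registry.example.com/api-server:v1  # placeholder image

With five replicas spread this way (2/2/1 across three zones), losing an entire zone takes out at most two of them.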

Chaos Engineering

We don't wait for failures to happen; we cause them intentionally. Our chaos engineering program runs continuously in production (yes, production).

Chaos Experiments We Run

# Example: Kill random API server pods
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: api-server-failure
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - kube-system
    labelSelectors:
      component: kube-apiserver
  scheduler:
    cron: "*/30 * * * *"  # Every 30 minutes
---
# Example: Network partition between AZs
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: az-partition
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - production
  direction: both
  target:
    mode: all
    selector:
      labelSelectors:
        zone: us-west-2b
  duration: "5m"
  scheduler:
    cron: "0 3 * * *"  # Daily at 3 AM

Game Days

Monthly, we run "game days" where we simulate major failures: full region outages, database corruption, DDoS attacks. The entire engineering team participates, and we use these to find gaps in our runbooks and automation.

Observability

You can't fix what you can't see. We instrument every service and track a small set of service level indicators (SLIs).

Key SLIs We Monitor

# Service Level Indicators
- name: api_availability
  description: "Percentage of successful API requests"
  query: |
    sum(rate(http_requests_total{code!~"5.."}[5m])) /
    sum(rate(http_requests_total[5m]))
  target: 0.9999

- name: api_latency_p99
  description: "99th percentile API latency"
  query: |
    histogram_quantile(0.99,
      sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
  target: 0.5  # 500ms

- name: auction_match_rate
  description: "Percentage of bids matched within SLA"
  query: |
    sum(rate(auction_matches_total{within_sla="true"}[5m])) /
    sum(rate(auction_bids_total[5m]))
  target: 0.999

- name: cluster_provision_time_p95
  description: "95th percentile cluster provisioning time"
  query: |
    histogram_quantile(0.95,
      sum by (le) (rate(cluster_provision_duration_seconds_bucket[1h])))
  target: 30  # 30 seconds
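These SLIs feed our alerting directly. A simplified sketch of an availability alert in standard Prometheus rule syntax (the rule name, threshold window, and labels are illustrative):

groups:
- name: slo-alerts
  rules:
  - alert: APIAvailabilityBelowTarget
    # Fire when the 5-minute success ratio drops below the 99.99% target
    expr: |
      (
        sum(rate(http_requests_total{code!~"5.."}[5m])) /
        sum(rate(http_requests_total[5m]))
      ) < 0.9999
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "API availability is below the 99.99% SLO target"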

Incident Response

When things go wrong (and they do), speed matters. Our incident response process:

  1. Detection: Automated alerts fire within 60 seconds of an anomaly (see the routing sketch after this list)
  2. Triage: On-call engineer assesses severity within 5 minutes
  3. Communication: Status page updated within 10 minutes
  4. Mitigation: Focus on restoring service, not root cause
  5. Resolution: Full fix deployed after service is stable
  6. Post-mortem: Blameless review within 48 hours
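The detection and paging in step 1 are automated end to end. A minimal Alertmanager routing sketch (the receiver name, timings, and integration key are illustrative):

route:
  receiver: pagerduty-oncall
  group_by: [alertname, region]
  group_wait: 15s                # keep paging well inside the 60-second detection budget
  repeat_interval: 1h
  routes:
  - matchers:
    - severity="critical"
    receiver: pagerduty-oncall

receivers:
- name: pagerduty-oncall
  pagerduty_configs:
  - routing_key: REPLACE_WITH_PAGERDUTY_INTEGRATION_KEY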

On-Call Culture

We rotate on-call weekly, with every engineer participating—including founders. This ensures everyone understands operational pain points and incentivizes building reliable systems.

Deployment Safety

Most outages are caused by changes. We've built multiple safety nets:

# Argo Rollouts canary strategy
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-server
spec:
  strategy:
    canary:
      steps:
      - setWeight: 1                    # start by sending 1% of traffic to the canary
      - pause: {duration: 5m}
      - analysis:                       # gate the rollout on the success-rate analysis
          templates:
          - templateName: success-rate
      - setWeight: 10
      - pause: {duration: 10m}
      - analysis:
          templates:
          - templateName: success-rate
      - setWeight: 50
      - pause: {duration: 10m}
      - setWeight: 100
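The success-rate analysis referenced in the steps above is what actually gates each traffic increase. A sketch of that AnalysisTemplate (the Prometheus address, label selectors, and threshold are assumptions, not our exact values):

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
  - name: success-rate
    interval: 1m
    count: 5
    # Abort the rollout if the observed success ratio drops below 99.9%
    successCondition: result[0] >= 0.999
    failureLimit: 1
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090   # assumed in-cluster Prometheus
        # A production rule would scope this query to canary pods via template args.
        query: |
          sum(rate(http_requests_total{code!~"5..", app="api-server"}[5m])) /
          sum(rate(http_requests_total{app="api-server"}[5m]))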

What We've Learned

After running at scale for over two years, here are our key learnings:

  1. Simple systems are reliable systems. Every component we add is a potential failure point. We ruthlessly remove complexity.
  2. Test your recovery, not just your systems. Backups you've never restored aren't backups. Failover you've never triggered isn't failover.
  3. Invest in observability early. You can't diagnose problems you can't see. We spent 6 months building observability before launching.
  4. Make the right thing easy. Engineers will take shortcuts. Build systems where the shortcut is also the safe path.

Our Uptime Record

99.97%
Last 12 months
Roughly 2.6 hours of total downtime across all services

We're not at 99.99% yet, but we're close. Every incident teaches us something, and we're continuously improving.

If you're interested in building reliable systems at scale, we're hiring SREs. And if you want to run your workloads on a platform built for reliability, try KubeBid.

Nina Patel is a Site Reliability Engineer at KubeBid. Previously, she was an SRE at Netflix and helped build their chaos engineering program.
