Case Studies

Real Kubernetes Results From Real Engagements

Every number below is from an actual cluster. No inflated estimates, no hypotheticals.

  • 99.97% cluster uptime after stabilization
  • 45 min → 8 min deploy time after K8s migration
  • 0 OOMKill / CrashLoopBackOff events post-engagement
  • 100% full Terraform and Helm handoff rate
B2B SaaS · Kubernetes Migration

18 Microservices Migrated to Kubernetes in 7 Weeks. Zero Downtime.

B2B SaaS Platform · 40-Person Engineering Team

The Problem

A B2B SaaS platform with 40 engineers was running 18 microservices directly on EC2 instances. Every deployment was a manual process: SSH into servers, pull new images, restart services one by one. Releases took 45 minutes and required a senior engineer standing by. A failed deployment meant 20 minutes of recovery. The team wanted to move to Kubernetes but didn't have the in-house expertise to do it safely.

What We Did

  • Containerized all 18 services with optimized multi-stage Dockerfiles
  • Authored Helm charts for every service with per-environment value overrides
  • Provisioned EKS cluster on AWS using Terraform with Karpenter autoscaling
  • Set up ArgoCD app-of-apps structure for dev, staging, and production environments (see the sketch after this list)
  • Built GitHub Actions CI pipeline: build, test, push to ECR, trigger ArgoCD sync
  • Deployed kube-prometheus-stack with 14 custom Grafana dashboards
  • Ran phased migration: dev first, then staging, then production in 3 waves
  • Delivered live cluster walkthrough and runbook documentation
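
For readers who haven't seen the app-of-apps pattern: one parent ArgoCD Application points at a Git directory of child Application manifests, so adding a service or promoting it to a new environment is a single pull request. Below is the general shape of such a parent as an illustrative sketch only; the repository URL, paths, and names are placeholders, not the client's actual setup.

```yaml
# Parent "app of apps": ArgoCD renders every child Application manifest
# found under apps/production and keeps them all synced from Git.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-apps            # placeholder name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/platform-gitops   # placeholder repo
    targetRevision: main
    path: apps/production        # directory of child Application manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true                # delete resources that were removed from Git
      selfHeal: true             # revert manual, out-of-band cluster changes
```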

The Outcome

All 18 services running on Kubernetes with zero production downtime during the migration. Deploy time dropped from 45 minutes to 8 minutes. The team went from 2 releases per week to 15+. Zero production incidents in the 4 months following handoff. The team was independently managing the cluster within 30 days.

Fintech Startup · Cluster Reliability

Unstable Cluster Stabilized: 99.2% to 99.97% Uptime in 3 Weeks

B2B Fintech Company · 25-Person Engineering Team

The Problem

A fintech startup had been running their own Kubernetes cluster for 8 months. OOMKill events on the payment service were triggering CrashLoopBackOff weekly. The on-call rotation was being paged 3-4 times per week, often at night. There was no Prometheus, no Grafana, and no alerting beyond CloudWatch. Engineers were debugging with kubectl logs and guessing. The CEO was fielding customer complaints about payment processing timeouts.

What We Did

  • Full cluster audit: analyzed events, node conditions, and resource utilization across all namespaces
  • Identified the root cause: the payment service had no memory limits and was being OOMKilled under peak load
  • Tuned resource requests and limits for all 12 workloads based on 30-day p95 usage data
  • Configured HPA for the payment service and API gateway with correct CPU and memory metrics
  • Deployed kube-prometheus-stack with AlertManager routed to PagerDuty
  • Wrote 22 PromQL alerting rules covering node pressure, pod restarts, PVC usage, and latency (one example is sketched after this list)
  • Built 6 Grafana dashboards: cluster overview, namespace drill-down, and per-service panels
  • Wrote incident runbooks for every alert, including triage steps and rollback procedures
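
To give a feel for those alerting rules, here is one illustrative example in the PrometheusRule format that kube-prometheus-stack consumes. The threshold, window, names, and runbook URL are placeholders, not the client's actual values.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: workload-health          # placeholder name
  namespace: monitoring
spec:
  groups:
    - name: pod-restarts
      rules:
        - alert: PodCrashLooping
          # Fires when any container restarts more than 3 times in 15 minutes
          # (the metric comes from kube-state-metrics).
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "{{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
            runbook_url: https://runbooks.example.com/pod-crash-looping  # placeholder
```

Pairing each alert with a runbook link this way keeps triage steps one click from the page, so the on-call engineer never starts from a blank terminal.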

The Outcome

Cluster uptime went from 99.2% to 99.97% in the month following the engagement. On-call pages dropped from 14 per month to 0. The fintech team had full visibility into their cluster for the first time. Infrastructure costs dropped 38% through right-sizing. The team could now handle incidents independently using the runbooks.

AI Startup · Kubernetes Platform Engineering

Production Kubernetes Platform Built From Scratch in 9 Weeks

AI Inference Startup · 15-Person Engineering Team

The Problem

An AI startup had been running model inference workloads on a mix of spot instances and a loosely configured EKS cluster that was set up "to get something running." GPU nodes were not scheduled efficiently, there was no autoscaling for inference pods, cold start times were 3-4 minutes, and the team had no CI/CD workflow for deploying model updates. They were preparing for a Series A and needed production-grade infrastructure.

What We Did

  • Redesigned EKS cluster with Karpenter for GPU node provisioning and cost optimization
  • Implemented NVIDIA device plugin and GPU resource requests across inference pods
  • Built Helm charts for model serving deployments with configurable replica counts and GPU allocations
  • Set up ArgoCD with GitOps workflow for model version deployments
  • Implemented Kubernetes HPA with custom metrics from Prometheus for inference queue depth (sketched after this list)
  • Built GitHub Actions pipeline for container builds and automated model deployment
  • Deployed full Prometheus and Grafana observability with GPU utilization dashboards
  • Provisioned all infrastructure via Terraform modules: VPC, EKS, node groups, IAM, ECR
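
The queue-depth autoscaling is worth sketching. Assuming an adapter such as prometheus-adapter exposes the queue metric through the Kubernetes external metrics API, an autoscaling/v2 HPA can scale the model servers on queued requests per replica rather than on CPU. The names, metric, and targets below are illustrative placeholders, not the client's actual configuration.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server             # placeholder deployment name
  namespace: inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 2                 # keep warm capacity to soften GPU cold starts
  maxReplicas: 20
  metrics:
    - type: External
      external:
        metric:
          name: inference_queue_depth   # placeholder metric served by the adapter
        target:
          type: AverageValue
          averageValue: "10"            # aim for ~10 queued requests per replica
```

Scaling on queue depth instead of GPU utilization means replicas are added while requests are still waiting, not after the GPUs are already saturated.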

The Outcome

Cold start time dropped from 3-4 minutes to under 40 seconds with Karpenter pre-provisioning. GPU utilization increased from 34% to 81% through right-sized node groups and efficient bin-packing. Model deployment time went from a manual 2-hour process to a 12-minute automated GitOps workflow. Infrastructure costs dropped 42% despite doubling inference volume after launch.

Want Results Like These?

Schedule a free 30-minute Kubernetes infrastructure review. We'll look at your cluster and tell you exactly where the biggest opportunities are.