Karpenter Scaling – Drop It Like It's Spot

case study

created 2026-03-21

Background

We had incidents. We increased our resources (mem/cpu, # of instances). Had more incidents, increased resources again. On and on, app after app and worker after worker, we left our resources high. Over time, the app teams fixed the underlying causes, but we never went back to rightsize them. Eventually the various app problems were fixed, and our resources didn't match our workloads - not even close. It was time to address it.

We also largely ran spot instances. They are way cheaper, but prone to disruptions. That's mostly okay, because our workloads are generally fault tolerant, and we were running excess services. But it's relevant to the story later on.

Scaling Down

Time to address the elephant-sized resources in the room. We (the infra team) were at KubeCon in Georgia. We'd been hearing a lot about the new Vertical Pod Autoscaler (VPA) objects, and there were a ton of companies selling cluster autoscaling. The first step was bumping our Kubernetes version so we could try out the newer VPA objects. The next step was installing CastAI. We only signed up for a trial in our dev environment, but holy cow - so much waste. We started to get excited. We hadn't taken the time to really measure this before.

First things first - identifying waste. Turns out, Karpenter was running far more nodes than necessary. It wasn't just that we were overprovisioning our resources. Karpenter was a huge source of our waste. It wasn't binpacking well at all.

Rightsizing Karpenter

On deeper inspection, we found that Karpenter was having a terrible time with node consolidation. We had a bunch of node pools (to minimize noisy neighbor problems), and we had too many rules that affected consolidation (PDBs, node affinities, topology spread constraints). Basically, Karpenter couldn't consolidate and couldn't adequately binpack.

We let CastAI run for a bit. Its node-side binpacking was slightly better. Between better packing and rightsizing, we saw an insane 75% reduction in EC2 spend. We observed, we learned, and we copied. With two pools (resilient and general) and fewer scheduling restrictions, we cut node-side slack (the gap between what we provision and what pods request) in dev from around 36% down to ~14% - same neighborhood as CastAI.
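To make the slack number concrete, here is a minimal sketch of the calculation. It assumes slack is expressed as a fraction of provisioned capacity, which the text implies but doesn't spell out. The function name and the sample figures are hypothetical, picked only to reproduce the 36% and 14% above.

```python
def node_side_slack(provisioned: float, requested: float) -> float:
    """Fraction of provisioned capacity that no pod has requested.

    Assumption: slack = (provisioned - requested) / provisioned.
    Units (vCPU, GiB, ...) don't matter as long as both match.
    """
    if provisioned <= 0:
        raise ValueError("provisioned must be positive")
    return (provisioned - requested) / provisioned


# Hypothetical totals: 100 vCPU provisioned across all nodes in dev.
before = node_side_slack(provisioned=100.0, requested=64.0)
after = node_side_slack(provisioned=100.0, requested=86.0)

print(f"before: {before:.0%}, after: {after:.0%}")
```

In practice the two inputs would come from summing node allocatable capacity and pod requests across the cluster.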

Fixing Resources

By now, we were feeling hungry to save more money. CastAI saved a bunch of money based on their vertical autoscaling, but we figured horizontal autoscaling was going to be a bigger bang. I configured the KEDA autoscaler for almost every queue and service. I configured scheduling rules to scale up during peak hours and scale down at night, and metrics rules to scale up if traffic spiked. And, to be extra safe, I made a rule to scale up incredibly quickly in the event of 502-504 errors. KEDA's nightly scale-down now drops dev and production CPU by roughly half versus daytime peak.
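The three kinds of rules (schedule, metrics, errors) combine simply: when a workload has several scaling triggers, the Horizontal Pod Autoscaler that KEDA drives uses the largest replica count any of them asks for. A rough sketch of that decision follows. It is not KEDA configuration, and every threshold and name in it is a made-up placeholder.

```python
import math


def desired_replicas(
    hour: int,
    queue_depth: int,
    error_rate: float,
    *,
    peak_hours: range = range(8, 20),  # hypothetical peak window
    peak_floor: int = 6,               # hypothetical daytime minimum
    night_floor: int = 2,              # hypothetical nighttime minimum
    jobs_per_replica: int = 500,       # hypothetical queue target
    error_threshold: float = 0.01,     # hypothetical 502-504 rate
    error_burst: int = 12,             # hypothetical emergency count
    max_replicas: int = 20,
) -> int:
    """Pick the largest replica count requested by any rule."""
    # Schedule rule: higher floor during peak hours, lower at night.
    schedule = peak_floor if hour in peak_hours else night_floor
    # Metric rule: enough replicas to work through the queue.
    metric = math.ceil(queue_depth / jobs_per_replica)
    # Error rule: jump straight to a large count when errors spike.
    error = error_burst if error_rate >= error_threshold else 0
    return min(max_replicas, max(schedule, metric, error))


print(desired_replicas(hour=3, queue_depth=0, error_rate=0.0))     # quiet night
print(desired_replicas(hour=3, queue_depth=4000, error_rate=0.0))  # queue spike
print(desired_replicas(hour=3, queue_depth=0, error_rate=0.05))    # error spike
```

Taking the maximum means the schedule only sets a floor, so a traffic or error spike at night still scales up.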

We'd captured most of the savings - on to the VPA. We tried the VPA for some services. It turned out that the VPA was way too aggressive. For the few workloads we enabled it on in our dev environment, we quickly started seeing issues: frequent OOM errors and CPU throttling. We started getting way more timeouts. We reeled that back.

We could have done a one-time vertical rightsizing effort - but we'd be in the same situation a year later. We needed continuous rightsizing. The VPA wasn't intelligent enough on its own and CastAI was too expensive, so I built a cron in our infra service that adjusted workload resources based on Prometheus and the VPA's own recommendations.

For CPU, I set the requests to the p90 + 20%. According to best practices, I didn't set limits for CPU. I figured this would be safe - the requests were already set pretty high.

For Memory, I set the requests to the p95 + 20% - OOM is a little more sensitive. Plus, if the metrics detected an OOM it would automatically bump the memory by 10% (with a reasonable cap, and alerts at the cap).
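A minimal sketch of the percentile-plus-headroom calculation, assuming usage samples have already been pulled from Prometheus into a list. The nearest-rank percentile and the function names are illustrative choices, not the actual cron code.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def recommend_request(samples: Sequence[float], pct: int, headroom: float = 0.20) -> float:
    """Request = chosen percentile of observed usage plus a headroom margin."""
    return percentile(samples, pct) * (1 + headroom)


# Hypothetical usage samples: 1..100 (millicores for CPU, MiB for memory).
usage = [float(v) for v in range(1, 101)]

cpu_request = recommend_request(usage, pct=90)     # p90 + 20%
memory_request = recommend_request(usage, pct=95)  # p95 + 20%

print(round(cpu_request), round(memory_request))
```

One thing the sketch makes visible: a p90-based request ignores the top 10% of samples by design, so a workload whose bursts live in that tail gets sized below its peaks.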

Additionally, as a guard, neither cpu nor memory could ever scale down more than 10% per day. And, memory scale downs were blocked entirely if there'd been an OOM in the last 7 days. That way, when an incident pushed resources up, they'd stay up until the app team fixed the underlying cause, and then drift back down without thrashing. Predictive modeling would have been nicer, but I'm not CastAI - that's a whole product.
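The guards above can be sketched as a single function that turns a raw recommendation into the value actually applied. This assumes the job runs once per day and that scale-ups are applied immediately, neither of which the text states. The parameter names are hypothetical and alerting at the cap is left out.

```python
from typing import Optional


def next_request(
    current: float,
    recommended: float,
    *,
    is_memory: bool = False,
    oom_detected: bool = False,
    days_since_last_oom: Optional[int] = None,
    cap: Optional[float] = None,
    max_daily_drop: float = 0.10,
    oom_bump: float = 0.10,
    oom_cooldown_days: int = 7,
) -> float:
    """Apply the scale-down guards to one day's recommendation."""

    def capped(value: float) -> float:
        return value if cap is None else min(value, cap)

    # A fresh OOM bumps memory by 10%, up to the cap.
    if is_memory and oom_detected:
        return capped(current * (1 + oom_bump))

    # Scale-ups pass through (assumption: only scale-downs are guarded).
    if recommended >= current:
        return capped(recommended)

    # Memory scale-downs are blocked while an OOM is still recent.
    recent_oom = days_since_last_oom is not None and days_since_last_oom < oom_cooldown_days
    if is_memory and recent_oom:
        return current

    # Otherwise drift down by at most 10% per run.
    return max(recommended, current * (1 - max_daily_drop))


print(next_request(1000, 500))                                            # limited drop
print(next_request(1000, 500, is_memory=True, days_since_last_oom=3))     # blocked
print(next_request(1000, 500, is_memory=True, oom_detected=True, cap=1050))  # bump, capped
```

With a 10% daily limit, a request that is double what it should be takes about a week to drift back down, which is what keeps it from thrashing.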

The memory side has worked great - it's still running months later. CPU was a different story. Even at p90 + 20% with no limit, workloads got throttled and caused problems - in the worst case, TLS handshakes couldn't complete, causing saturation of the database connection pool. It's great for workloads to be able to flex up on CPU, but some of our workloads were too bursty for a simple p90 + 20% calculation. I turned the CPU portion off and kept the memory auto-bumps. I also built a CLI command into our dev tooling to help me do rightsizing by hand.

The Dragons

Things were great for a while. But over time, our dev environment started having a bunch of errors. Production was fine, but dev hit a trifecta that caused some pain.

1 - We were running fewer services - thus they were sensitive to large disruptions.

2 - Our spot disruptions started being super frequent. The nodes that we settled on due to their high availability stopped having such high availability.

3 - We had a worker that started misbehaving with scaling events. It would require 20+ nodes at 26GB to scale up to process 10k jobs. We would scale from 1 worker to 20, back to 1, and back up a few hours later, and so on and so on.

Karpenter doesn't take spot terminations into account when picking nodes to consolidate. So during all that churn, Karpenter would voluntarily evict a node right as spot reclaimed another - two gone at once, and with services scaled so low, things went down. Node churn is generally bad for scaling health. It's worse when you're running few services on a node type with high spot cancellation rates.

Configuring Karpenter

We couldn't easily fix the worker churn - the application behavior was too deeply rooted. But there were two more knobs that we could control. We could scale up the services and fix the spot terminations. I had constrained Karpenter to have near-perfect binpacking, but once spot disruptions got worse for the node type, Karpenter didn't have any room to make better decisions. It needed some room to breathe.

The quickest thing to fix was slightly increasing the number of services we ran. Then, I got to work on Karpenter. I opened up which node types it could use, and gave it a wider selection of sizes. I knew this would result in less perfect binpacking, but it wasn't going to regress us that badly. I moved us to the Bottlerocket AMI so that nodes would spin up faster. Additionally, I allowed on-demand capacity at equal weight with spot. Karpenter was now free to make decisions, and I was able to scale our services back down without a rise in errors.

In the End

Karpenter is way healthier than before and our costs are way down. Dev runs about 72% lower in provisioned CPU than where we started. Node-side slack is tight in dev (~14% slack). Production runs tighter at night and looser during the day - we gave Karpenter more room after the CPU and spot issues, since tight binpacking plus high spot churn turned out to be a bad combination.

The leftover frustration is the pod-side slack (the gap between what pods request and what they actually use). Request-vs-used gaps are still wide, especially in production where we request roughly 7x what we actually use during the day. CastAI was at a similar ratio in our dev trial - they pulled requests way down but still kept plenty of headroom over what we actually used. Sizing tighter runs into CPU throttling. I'm not yet sure how to close that gap without inviting throttling back. I think this deserves some load testing to really find the limits.

Why not just pay for CastAI now? At our current efficiency, it would save maybe another 10% on compute, but the platform fees plus per-node charges would cost about 2x what we'd save - and that's at our 50% discount. Plus a year commitment, with no way to test production workloads during the trial. CastAI optimizes aggressively, and after watching thin CPU margins cause incidents in our own cron, I don't trust an outside system to make those calls in prod - at least not without rigorous testing, which we weren't allowed to do.

We have alarms on binpacking efficiency and on vertical/horizontal sizing drift. Rightsizing is a continuous process, so we'll also revisit node costs in a quarterly review.