CPU Showdown – 500m vs TLS Handshakes
created 2026-02-05
Problem
We had a PHP website with a 500m CPU limit that hit intermittent "connection pool full" errors from our PgBouncer/Postgres layer during traffic spikes. We also saw a large spike in P99 latency and 5xx errors that did not correlate with high average CPU utilization.
Troubleshooting
I needed to find out why the database pool was running out of connections even though the database itself had plenty of headroom, and why the service was slow despite using very little CPU. We had recently been rightsizing our clusters and had just run into an issue with CFS (the kernel's Completely Fair Scheduler) affecting CoreDNS. My hunch: this app could no longer burst during traffic spikes because the other applications now had their CPU requests set and had become first-class citizens. The dashboard we had set up for container_cpu_cfs_throttled_seconds_total supported this: throttled time kept climbing while CPU usage stayed steady and low.
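For anyone who wants to check throttling without the dashboard, here is a rough sketch of the same idea taken from the container's cgroup cpu.stat file. It assumes the file exposes nr_periods and nr_throttled counters; the sample values and the function names are made up for illustration and are not from our cluster.

```python
def parse_cpu_stat(text: str) -> dict:
    """Parse the 'key value' lines of a cgroup cpu.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_fraction(before: dict, after: dict) -> float:
    """Fraction of CFS periods between two samples in which the cgroup was throttled."""
    periods = after["nr_periods"] - before["nr_periods"]
    throttled = after["nr_throttled"] - before["nr_throttled"]
    return throttled / periods if periods > 0 else 0.0


# Hypothetical samples taken 60 seconds apart (600 periods of 100 ms each).
SAMPLE_T0 = "nr_periods 1000\nnr_throttled 100\n"
SAMPLE_T1 = "nr_periods 1600\nnr_throttled 460\n"

if __name__ == "__main__":
    frac = throttled_fraction(parse_cpu_stat(SAMPLE_T0), parse_cpu_stat(SAMPLE_T1))
    print(f"throttled in {frac:.0%} of periods")  # prints "throttled in 60% of periods"
```

A high throttled fraction alongside low average CPU usage is the pattern we saw: the container burns its quota early in each period and then sits idle until the next one.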
Solution
I increased the CPU limit to 1500m to see if it would help, and the TLS handshake failures stopped immediately.
Followup
Looking at this later, it makes sense that TLS handshakes were intermittently failing to complete. A TLS handshake requires cryptographic math, and while it generally takes about 15ms of CPU time, a pod can easily receive 100 requests while it is booting up because of web scraping. Most of those requests need a database connection (to look up things like zip codes), and our PHP backend isn't optimized for caching. So if the service allows 10 connections to the pool but has only 500m of CPU, and each connection needs 15ms of CPU time while other requests may already be generating HTML pages, that 500m gets split between all of them. For every 100ms CFS period the app gets 50ms of CPU time, so each of the 10 threads gets 5ms, assuming no other threads are running, which is unlikely. A 15ms handshake then needs 3 periods to finish, and that is a fairly favorable scenario.
It also seems that PgBouncer wasn't recognizing that those TLS connections had been dropped by the application, so when the application requested more connections and those failed too, the connection pool saturated very quickly. This thrashing repeated over and over until we increased the CPU to 1500m.
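The back-of-the-envelope math above can be written out as a small sketch. It assumes a 100ms CFS period and an even split of the quota across runnable threads, which is a simplification of what the scheduler really does; periods_to_finish is a hypothetical helper, not something we run in production.

```python
import math


def periods_to_finish(limit_millicores: int, runnable_threads: int,
                      cpu_ms_needed: float, period_ms: float = 100.0) -> int:
    """CFS periods one thread needs to accumulate cpu_ms_needed of CPU time.

    Assumes the container's quota is shared evenly among the runnable threads.
    """
    quota_ms = period_ms * limit_millicores / 1000.0  # CPU time the container gets per period
    share_ms = quota_ms / runnable_threads            # CPU time each thread gets per period
    return math.ceil(cpu_ms_needed / share_ms)


if __name__ == "__main__":
    # 10 pool connections, each needing roughly 15 ms of CPU for its handshake.
    print(periods_to_finish(500, 10, 15))   # 3
    print(periods_to_finish(1500, 10, 15))  # 1
```

At 500m the handshake spans three periods, so it can be stalled for most of 200ms or more of wall-clock time; at 1500m it fits inside a single period, which matches what we saw after the change.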
Next Steps
I would have loved to drill in and figure out how many requests it takes to cause the CPU starvation, but I don't have the time for that investigation. We are comfortable running this service at 1500m, and we are asking the frontend team to implement stronger caching and rate limits.