Architecture
GPU-aware load balancing: managing AI compute like a pro
Round-robin is a relic when LLM requests span 50 tokens to 50,000. Prefill vs decode disaggregation, KV-cache-aware routing, prefix matching, and the four metrics that matter — how to route AI traffic so your P99 stops bleeding.
Read post →