To be used with vLLM or any other inference engine.
Built on top of the IGW roadmap.
WG or well-lit path:
- P/D disagg serving
- working implementation
- Think of a large MoE such as R1, served at a certain QPS
- NS vs. EW (north-south vs. east-west) KV cache management
NS Caching:
- System resources
- Inference Scheduler handles each node separately (HPA)
EW Caching:
- Global KV Manager to share KV across nodes
- Scheduler-aware KV (related to the scheduler WG); see the sketch below
- Autoscaling (KEDA)
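A minimal sketch of the scheduler-aware KV idea, assuming a global index that records which KV blocks each endpoint holds and a scorer that prefers the endpoint with the longest cached prefix. The names (GlobalKVIndex, pick_endpoint) and the fixed block size are illustrative assumptions, not the llm-d implementation.

```python
import hashlib
from collections import defaultdict

BLOCK_SIZE = 16  # tokens per KV block (assumed, vLLM-style paged KV)

def block_hashes(token_ids):
    """Chain-hash each full block of tokens, so a hash identifies the whole prefix up to that block."""
    hashes, prev = [], b""
    usable = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, usable, BLOCK_SIZE):
        prev = hashlib.sha256(prev + str(token_ids[i:i + BLOCK_SIZE]).encode()).digest()
        hashes.append(prev)
    return hashes

class GlobalKVIndex:
    """EW-style global view: which endpoints currently hold which KV blocks."""
    def __init__(self):
        self.block_to_endpoints = defaultdict(set)

    def report(self, endpoint, token_ids):
        # An endpoint reports the prompts it has cached (e.g. via KV events).
        for h in block_hashes(token_ids):
            self.block_to_endpoints[h].add(endpoint)

    def cached_prefix_blocks(self, endpoint, token_ids):
        # Count how many leading blocks of this prompt the endpoint already holds.
        count = 0
        for h in block_hashes(token_ids):
            if endpoint in self.block_to_endpoints.get(h, set()):
                count += 1
            else:
                break
        return count

def pick_endpoint(index, endpoints, token_ids):
    """Scheduler-aware KV scoring: prefer the endpoint with the longest cached prefix."""
    return max(endpoints, key=lambda e: index.cached_prefix_blocks(e, token_ids))

idx = GlobalKVIndex()
idx.report("pod-a", list(range(64)))                            # pod-a cached the first 4 blocks
print(pick_endpoint(idx, ["pod-a", "pod-b"], list(range(80))))  # -> pod-a
```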
Autoscaling optimization: https://docs.google.com/document/d/1X-VQD2U0E2Jb0ncmjxCruyQO02Z_cgB46sinpVk97-A/edit
SIG notes: https://docs.google.com/document/d/1dHLWBy8CXaURT-4W562pfFDP6HDrn-WgCtDQb08tD7k/edit
Autoscaling examples: https://docs.google.com/document/d/1IFsCwWtIGMujaZZqEMR4ZYeZBi7Hb1ptfImCa1fFf1A
Use cases:
- Google: no strong incentive for autoscaling on large workloads
  - Single production workload: configurations, extensions, and optimizations of llm-d
  - High customer SLO expectations
  - Provisioned for peak (no room to flex)
- IBM: small to medium size (18)
  - Think of model as a service
  - On-prem: customers bring their own models → scale up
  - Multiple models + dynamic sets of components
Autoscaling background: llm-d/llm-d
- Creating an ILP to solve the bin-packing problem to control routing of requests (see the sketch after this list)
- Dynamism is not a big part of the workloads: a few big models
- Areas to explore for involvement in llm-d:
  - Financial customers
  - Red Hat internal customers
- Reproduced the DeepSeek serving architecture
- Scheduler decisions: latency-focused vs. throughput-focused
- Request scheduling
- Opinions on P/D on TPU (mixed batching)
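A hedged sketch of the ILP item above: treat request placement as bin packing, where each pending request has a KV-cache footprint, each node a capacity, and the solver picks an assignment that routes traffic onto as few nodes as possible. The data, the PuLP/CBC solver choice, and the capacity numbers are illustrative assumptions, not the actual formulation.

```python
import pulp  # assumed MILP modeling library; any ILP solver would do

# Illustrative data: KV-cache footprint per pending request (blocks) and per-node capacity.
requests = {"r1": 40, "r2": 25, "r3": 60, "r4": 10}
nodes = {"n1": 80, "n2": 80, "n3": 80}

prob = pulp.LpProblem("request_bin_packing", pulp.LpMinimize)
assign = pulp.LpVariable.dicts("assign", [(r, n) for r in requests for n in nodes], cat="Binary")
used = pulp.LpVariable.dicts("used", list(nodes), cat="Binary")

# Objective: use as few nodes as possible.
prob += pulp.lpSum(used[n] for n in nodes)

# Each request is routed to exactly one node.
for r in requests:
    prob += pulp.lpSum(assign[(r, n)] for n in nodes) == 1

# A node's total assigned footprint must fit its capacity, and only if the node is used.
for n in nodes:
    prob += pulp.lpSum(requests[r] * assign[(r, n)] for r in requests) <= nodes[n] * used[n]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for r in requests:
    target = next(n for n in nodes if assign[(r, n)].value() > 0.5)
    print(f"{r} -> {target}")
```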
Meeting notes: https://docs.google.com/document/d/1-VzYejdGXWYXnneSBRDlU0bo22DC6_TTbjuKeGezvTc
- Input type / volume of requests / hardware matrices
- Scale up/down based on usage (see the sketch after this list)
- Heterogeneous vs. homogeneous resources
- Vertical vs. horizontal scaling
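For the "scale up/down based on usage" point, a minimal HPA/KEDA-style sketch, assuming in-flight concurrent requests as the scaling metric; the per-replica target and the bounds are illustrative assumptions.

```python
import math

def desired_replicas(in_flight_requests, target_per_replica, min_replicas=1, max_replicas=16):
    """HPA-style rule: replicas = ceil(observed load / per-replica target), clamped to bounds."""
    wanted = math.ceil(in_flight_requests / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))

# e.g. 130 in-flight requests, each replica sized for ~20 concurrent requests -> 7 replicas
print(desired_replicas(in_flight_requests=130, target_per_replica=20))
```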
Offload dynamic:
- on startup
- outside-in (combination + artefacts)
- model server (range)
- data-point (performance curve)
- max KV cache the model server can support (minimum cycle latency on instance, model can serve)
- max concurrency the model can serve (see the sketch below)
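For the last two bullets, a back-of-the-envelope sketch of how max concurrency could be derived from the KV cache a model server can hold after loading weights. All numbers (model shape, dtype size, memory fraction) are illustrative assumptions; in practice this data point would come from the measured performance curve.

```python
def max_concurrency(gpu_mem_gib, weight_mem_gib, layers, kv_heads, head_dim,
                    avg_seq_len, dtype_bytes=2, kv_mem_fraction=0.9):
    """Rough ceiling on concurrent sequences one model server can hold in its KV cache."""
    kv_budget_bytes = (gpu_mem_gib - weight_mem_gib) * kv_mem_fraction * 1024**3
    # 2x for keys and values, per layer, per KV head, per head dim, per token.
    bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
    return int(kv_budget_bytes // (bytes_per_token * avg_seq_len))

# e.g. an 8B-class model (16 GiB of fp16 weights) on an 80 GiB GPU, 4k average sequence length
print(max_concurrency(gpu_mem_gib=80, weight_mem_gib=16,
                      layers=32, kv_heads=8, head_dim=128, avg_seq_len=4096))  # ~115
```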