To be used with vLLM or any other inference engine.
Built on top of IGW.

Roadmap (per WG or well-lit path):
- P/D (prefill/decode) disaggregated serving
- A working implementation exists
- Think of serving a large MoE such as R1 at a certain QPS
NS (north-south) vs. EW (east-west) KV Cache management:
NS Caching:
- System resources
- Inference Scheduler handles each node separately (HPA)

EW Caching:
- Global KV Manager to share KV across nodes
- Scheduler-aware KV (related to the scheduler WG)
- Autoscaling (KEDA)
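One way to picture the EW side is a global KV manager that tracks which prefix blocks each node holds, plus a cache-aware scheduler that routes a request to the node with the longest cached prefix. This is a minimal sketch under assumed semantics; `GlobalKVManager`, `schedule`, and the block size are all illustrative, not part of any real scheduler API.

```python
# Hypothetical sketch of EW KV cache management: a global index of
# KV prefix blocks, consulted by a cache-aware scheduler.
BLOCK = 4  # tokens per KV block (assumed for illustration)

def blocks(tokens: list[int]) -> list[tuple[int, ...]]:
    # Split a token sequence into full fixed-size prefix blocks.
    full = len(tokens) - len(tokens) % BLOCK
    return [tuple(tokens[i:i + BLOCK]) for i in range(0, full, BLOCK)]

class GlobalKVManager:
    def __init__(self) -> None:
        # block contents -> set of nodes holding that block
        self.index: dict[tuple[int, ...], set[str]] = {}

    def record(self, node: str, tokens: list[int]) -> None:
        # Called after a node serves a request, so its cache is indexed.
        for b in blocks(tokens):
            self.index.setdefault(b, set()).add(node)

    def cached_prefix(self, node: str, tokens: list[int]) -> int:
        # Number of leading blocks of `tokens` already cached on `node`.
        n = 0
        for b in blocks(tokens):
            if node not in self.index.get(b, set()):
                break
            n += 1
        return n

def schedule(mgr: GlobalKVManager, nodes: list[str], tokens: list[int]) -> str:
    # Cache-aware routing: pick the node with the most reusable prefix
    # blocks (ties go to the first node listed).
    return max(nodes, key=lambda n: mgr.cached_prefix(n, tokens))

mgr = GlobalKVManager()
mgr.record("node-a", [1, 2, 3, 4, 5, 6, 7, 8])
mgr.record("node-b", [1, 2, 3, 4])
best = schedule(mgr, ["node-a", "node-b"], [1, 2, 3, 4, 5, 6, 7, 8, 9])
# node-a holds two matching prefix blocks, node-b only one -> "node-a"
```

This is also where the NS/EW contrast shows up: under NS caching the scheduler treats each node independently (HPA scales on per-node resources), while under EW caching the shared index makes routing and autoscaling (e.g. via KEDA on a cache-hit or queue metric) cache-aware across nodes.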