To be used with vLLM or any other inference engine.
Built on top of the IGW roadmap.
WG or well-lit path:
- P/D disagg serving
- working implementation
- Think of a large MoE such as R1, served at a certain QPS
- NS vs. EW (north-south vs. east-west) KV cache management
NS Caching:
- System resources
- Inference Scheduler handles each node separately (HPA)
EW Caching:
- Global KV Manager to share KV across nodes
- Scheduler-aware KV (related to the scheduler WG); see the sketch below
- Autoscaling (KEDA)
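A minimal sketch of the scheduler-aware KV idea, assuming a global index that records which KV blocks each endpoint holds and a scorer that prefers the endpoint with the longest cached prefix. The names (GlobalKVIndex, pick_endpoint) and the fixed block size are illustrative assumptions, not the llm-d implementation.

```python
import hashlib
from collections import defaultdict

BLOCK_SIZE = 16  # tokens per KV block (assumed, vLLM-style paged KV)

def block_hashes(token_ids):
    """Chain-hash each full block of tokens, so a hash identifies the whole prefix up to that block."""
    hashes, prev = [], b""
    usable = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, usable, BLOCK_SIZE):
        prev = hashlib.sha256(prev + str(token_ids[i:i + BLOCK_SIZE]).encode()).digest()
        hashes.append(prev)
    return hashes

class GlobalKVIndex:
    """EW-style global view: which endpoints currently hold which KV blocks."""
    def __init__(self):
        self.block_to_endpoints = defaultdict(set)

    def report(self, endpoint, token_ids):
        # An endpoint reports the prompts it has cached (e.g. via KV events).
        for h in block_hashes(token_ids):
            self.block_to_endpoints[h].add(endpoint)

    def cached_prefix_blocks(self, endpoint, token_ids):
        # Count how many leading blocks of this prompt the endpoint already holds.
        count = 0
        for h in block_hashes(token_ids):
            if endpoint in self.block_to_endpoints.get(h, set()):
                count += 1
            else:
                break
        return count

def pick_endpoint(index, endpoints, token_ids):
    """Scheduler-aware KV scoring: prefer the endpoint with the longest cached prefix."""
    return max(endpoints, key=lambda e: index.cached_prefix_blocks(e, token_ids))

idx = GlobalKVIndex()
idx.report("pod-a", list(range(64)))                            # pod-a cached the first 4 blocks
print(pick_endpoint(idx, ["pod-a", "pod-b"], list(range(80))))  # -> pod-a
```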
Autoscaling optimization: https://docs.google.com/document/d/1X-VQD2U0E2Jb0ncmjxCruyQO02Z_cgB46sinpVk97-A/edit
SIG notes: https://docs.google.com/document/d/1dHLWBy8CXaURT-4W562pfFDP6HDrn-WgCtDQb08tD7k/edit
Autoscaling examples: https://docs.google.com/document/d/1IFsCwWtIGMujaZZqEMR4ZYeZBi7Hb1ptfImCa1fFf1A
Use cases:
- Google: no strong incentive for autoscaling on large workloads
  - Single production workload: configurations, extensions, and optimizations of llm-d
  - High customer SLO expectations
  - Provisioned for peak (no room to flex)
- IBM: small to medium size (18)
  - Think of model as a service
  - On-prem: customers bring their own models → scale up
  - Multiple models + dynamic sets of components
Autoscaling background: llm-d/llm-d
- Creating an ILP to solve the bin-packing problem to control routing of requests (see the sketch after this list)
- Dynamism is not a big part of the workloads: a few big models
- Areas to explore for involvement in llm-d:
  - Financial customers
  - Red Hat internal customers
- Reproduced the DeepSeek serving architecture
- Scheduler decisions: latency-focused vs. throughput-focused
- Request scheduling
- Opinions on P/D on TPU (mixed batching)
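A hedged sketch of the ILP item above: treat request placement as bin packing, where each pending request has a KV-cache footprint, each node a capacity, and the solver picks an assignment that routes traffic onto as few nodes as possible. The data, the PuLP/CBC solver choice, and the capacity numbers are illustrative assumptions, not the actual formulation.

```python
import pulp  # assumed MILP modeling library; any ILP solver would do

# Illustrative data: KV-cache footprint per pending request (blocks) and per-node capacity.
requests = {"r1": 40, "r2": 25, "r3": 60, "r4": 10}
nodes = {"n1": 80, "n2": 80, "n3": 80}

prob = pulp.LpProblem("request_bin_packing", pulp.LpMinimize)
assign = pulp.LpVariable.dicts("assign", [(r, n) for r in requests for n in nodes], cat="Binary")
used = pulp.LpVariable.dicts("used", list(nodes), cat="Binary")

# Objective: use as few nodes as possible.
prob += pulp.lpSum(used[n] for n in nodes)

# Each request is routed to exactly one node.
for r in requests:
    prob += pulp.lpSum(assign[(r, n)] for n in nodes) == 1

# A node's total assigned footprint must fit its capacity, and only if the node is used.
for n in nodes:
    prob += pulp.lpSum(requests[r] * assign[(r, n)] for r in requests) <= nodes[n] * used[n]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for r in requests:
    target = next(n for n in nodes if assign[(r, n)].value() > 0.5)
    print(f"{r} -> {target}")
```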
Meeting notes: https://docs.google.com/document/d/1-VzYejdGXWYXnneSBRDlU0bo22DC6_TTbjuKeGezvTc
- Input type / volume of requests / hardware matrices
- Scale up/down based on usage (see the sketch after this list)
- Heterogeneous vs. homogeneous resources
- Vertical vs. horizontal scaling
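For the "scale up/down based on usage" point, a minimal HPA/KEDA-style sketch, assuming in-flight concurrent requests as the scaling metric; the per-replica target and the bounds are illustrative assumptions.

```python
import math

def desired_replicas(in_flight_requests, target_per_replica, min_replicas=1, max_replicas=16):
    """HPA-style rule: replicas = ceil(observed load / per-replica target), clamped to bounds."""
    wanted = math.ceil(in_flight_requests / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))

# e.g. 130 in-flight requests, each replica sized for ~20 concurrent requests -> 7 replicas
print(desired_replicas(in_flight_requests=130, target_per_replica=20))
```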
Offload dynamic:
- on startup
- outside-in (combination + artefacts)
- model server (range)
- data-point (performance curve)
- max KV cache the model server can support (minimum cycle latency on instance, model can serve)
- max concurrency the model can serve (see the sketch below)
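For the last two bullets, a back-of-the-envelope sketch of how max concurrency could be derived from the KV cache a model server can hold after loading weights. All numbers (model shape, dtype size, memory fraction) are illustrative assumptions; in practice this data point would come from the measured performance curve.

```python
def max_concurrency(gpu_mem_gib, weight_mem_gib, layers, kv_heads, head_dim,
                    avg_seq_len, dtype_bytes=2, kv_mem_fraction=0.9):
    """Rough ceiling on concurrent sequences one model server can hold in its KV cache."""
    kv_budget_bytes = (gpu_mem_gib - weight_mem_gib) * kv_mem_fraction * 1024**3
    # 2x for keys and values, per layer, per KV head, per head dim, per token.
    bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
    return int(kv_budget_bytes // (bytes_per_token * avg_seq_len))

# e.g. an 8B-class model (16 GiB of fp16 weights) on an 80 GiB GPU, 4k average sequence length
print(max_concurrency(gpu_mem_gib=80, weight_mem_gib=16,
                      layers=32, kv_heads=8, head_dim=128, avg_seq_len=4096))  # ~115
```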