
LWS (LeaderWorkerSet)

GitHub: kubernetes-sigs/lws

Architecture: one leader StatefulSet, plus one workers StatefulSet per leader
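The leader/worker topology can be sketched by enumerating the pods one group produces, assuming LWS's documented naming convention (leader `<name>-<group>`, workers `<name>-<group>-<index>`):

```python
def lws_pods(name: str, replicas: int, size: int):
    """Enumerate pod names for a LeaderWorkerSet-style topology.

    Each replica (group) has one leader pod plus (size - 1) worker pods,
    mirroring the one-leader-StatefulSet / one-workers-StatefulSet-per-leader
    layout. Naming follows LWS's documented convention:
    leader "<name>-<group>", workers "<name>-<group>-<index>".
    """
    groups = {}
    for g in range(replicas):
        leader = f"{name}-{g}"
        workers = [f"{name}-{g}-{i}" for i in range(1, size)]
        groups[leader] = workers
    return groups

# Two groups of size 3: one leader plus two workers each.
topology = lws_pods("vllm", replicas=2, size=3)
```

This is only a naming sketch; the actual controller manages the underlying StatefulSets and rolling updates.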

llm-d

To be used with vLLM or any other inference engine.

Built on top of IGW (Inference Gateway)

Roadmap: llm-d/llm-d#26

WGs (working groups), organized around "well-lit paths":

  1. P/D disaggregated serving
    • Working implementation
    • Think of a large MoE (e.g. R1) served at a certain QPS
  2. NS vs. EW KV cache management
    NS caching:
    • System resources
    • Inference Scheduler handles each node separately (HPA)
    EW caching:
    • Global KV manager to share the cache across nodes
    • Scheduler-aware KV (related to the scheduler WG)
    • Autoscaling (KEDA)
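The EW idea above — a global KV manager that lets the scheduler route to nodes already holding a prefix — can be sketched as a shared index from prefix hashes to nodes. This is an illustrative stand-in, not the actual llm-d component; all names are hypothetical:

```python
import hashlib


class GlobalKVIndex:
    """Hypothetical sketch of an east-west (EW) global KV-cache index:
    a shared map from prefix hashes to the set of nodes holding that
    prefix, so a scheduler can prefer nodes with cache hits."""

    def __init__(self):
        self._index = {}

    @staticmethod
    def _key(prefix_tokens):
        # Hash the token prefix so the index stores fixed-size keys.
        return hashlib.sha256(repr(tuple(prefix_tokens)).encode()).hexdigest()

    def publish(self, node: str, prefix_tokens) -> None:
        """A node advertises that it holds this prefix in its KV cache."""
        self._index.setdefault(self._key(prefix_tokens), set()).add(node)

    def candidates(self, prefix_tokens):
        """Nodes that already hold this prefix; scheduler prefers these."""
        return self._index.get(self._key(prefix_tokens), set())


idx = GlobalKVIndex()
idx.publish("node-a", (1, 2, 3))
idx.publish("node-b", (1, 2, 3))
```

Contrast with NS caching, where each node's cache is invisible to the others and the scheduler treats nodes independently.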
Autoscaling

optimization:

SIG notes:

autoscaling examples:

use cases:

  • Google: no strong incentive for autoscaling on large workloads
    • Single production workload - configurations, extensions, and optimizations of llm-d
    • High customer SLO expectations
    • Provision for peak (no room to flex)
  • IBM: small-to-medium-size deployments (18)
    • Think of the model as a service
    • On-prem: customers bring their own models, scale up
    • Multiple models + dynamic sets

components.

Autoscaling

Background: llm-d/llm-d

  • Creating an ILP to solve the bin-packing problem to control request routing

  • Dynamism is not a big part of these workloads; a few big models dominate

  • Exploration areas and involvement in llm-d:

    • Financial customers
    • Red Hat customers
  • Reproduced DeepSeek serving architecture

  • Scheduler decisions: latency-focused vs. throughput-focused serving

  • Request scheduling

  • Opinions for P/D on TPU (mixed batching)
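The ILP bin-packing idea mentioned above can be illustrated with a tiny brute-force version of the same problem (a real system would hand this to an ILP solver; the function and its parameters are hypothetical, not llm-d code):

```python
from itertools import product


def route(requests, capacity):
    """Brute-force the bin-packing assignment an ILP would solve:
    place each request (sized by, e.g., expected KV-cache tokens) onto
    a node without exceeding its capacity, minimizing nodes used.
    Returns {node: [request sizes]} or None if infeasible."""
    nodes = list(capacity)
    best = None
    for assign in product(nodes, repeat=len(requests)):
        load = {n: 0 for n in nodes}
        for req, n in zip(requests, assign):
            load[n] += req
        if any(load[n] > capacity[n] for n in nodes):
            continue  # violates a capacity constraint
        used = sum(1 for n in nodes if load[n] > 0)
        if best is None or used < best[0]:
            best = (used, assign)  # fewer nodes used is better
    if best is None:
        return None
    placement = {n: [] for n in nodes}
    for req, n in zip(requests, best[1]):
        placement[n].append(req)
    return placement


# Three requests onto two nodes of capacity 5.
placement = route([4, 3, 2], {"n0": 5, "n1": 5})
```

The brute force is exponential in the number of requests; the point is only to show the objective (minimize nodes used) and constraints (per-node capacity) that the ILP formulation encodes.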

docs.google.com/1IF[...]f1A

docs.google.com/1j2[...]0hc

docs.google.com/1iG[...]kwA

KV Cache Transfer


Meeting notes: docs.google.com/1-V[...]vTc

Input Type / Volume of requests / Hardware matrices

Scale up/down based on usage
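Usage-based scale up/down typically follows the Kubernetes HPA formula, desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric). A minimal sketch (the tolerance default mirrors HPA's, but the metric choice here is an assumption):

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """Kubernetes-HPA-style replica count: scale so the per-replica
    metric (e.g. queue depth or GPU utilization) approaches the target.
    The tolerance band avoids flapping on small deviations."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # within tolerance: no change
    return max(1, math.ceil(current_replicas * ratio))
```

For LLM serving, KEDA-style autoscalers plug inference-specific metrics (queue length, KV-cache pressure) into this same shape of rule.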

Heterogeneous vs homogeneous resources

Vertical vs horizontal scaling

Offload

dynamic:

  • On startup:
    • outside-in (combination + artefacts)
    • model server (range)
    • data point (performance curve)
    • max KV cache the model server supports (minimum cycle latency on an instance the model can serve)
    • max model concurrency it can serve
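The "max KV cache the model server supports" and "max concurrency" data points can be estimated from GPU memory at startup. A rough back-of-the-envelope sketch, assuming a standard transformer KV layout (2 tensors per token per layer); all parameter values are hypothetical:

```python
def max_concurrency(gpu_mem_gib: float, weights_gib: float,
                    layers: int, kv_heads: int, head_dim: int,
                    bytes_per_elem: int, avg_seq_len: int,
                    overhead_frac: float = 0.1) -> int:
    """Estimate how many sequences fit in the KV cache at once.

    Per-token KV bytes = 2 (K and V) * layers * kv_heads * head_dim * dtype size.
    KV budget = GPU memory minus weights minus a fixed overhead fraction.
    """
    kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    budget = (gpu_mem_gib * (1 - overhead_frac) - weights_gib) * 1024**3
    tokens = int(budget // kv_bytes_per_token)
    return max(0, tokens // avg_seq_len)


# Hypothetical 80 GiB GPU, 16 GiB of weights, GQA model, fp16 KV, 2k sequences.
est = max_concurrency(gpu_mem_gib=80, weights_gib=16, layers=32, kv_heads=8,
                      head_dim=128, bytes_per_elem=2, avg_seq_len=2048)
```

A model server would refine this with a measured performance curve (the "data point" above) rather than a static formula.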