Continuous batching

Tag

ml

Published
08 Feb 2024
Modified
07 Nov 2024

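Continuous batching (Yu et al., 2022) addresses the main inefficiency of static batching, where a batch runs until its longest sequence finishes and completed slots sit idle. Instead, the scheduler works at the level of individual decoding iterations: as soon as a sequence finishes, a waiting request is appended to the running batch alongside the existing KV cache. This reduces cost and improves throughput. ¹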
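To make the difference concrete, here is a minimal toy sketch that counts decode iterations under both policies. It is not the scheduler from Orca, vLLM, or TGI: the function names are hypothetical, each request is reduced to the number of tokens it will generate, and prefill cost and KV-cache memory limits are ignored.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode iterations when each batch runs until its longest request ends."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # The whole batch is held until the slowest request completes.
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode iterations when freed slots are refilled before every iteration."""
    waiting = deque(lengths)  # requests not yet admitted (assumed lengths > 0)
    running = []              # tokens still to generate per in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode iteration: every running request emits one token.
        running = [remaining - 1 for remaining in running]
        # Finished requests leave immediately, freeing their slots.
        running = [remaining for remaining in running if remaining > 0]
        steps += 1
    return steps


if __name__ == "__main__":
    lengths = [2, 8, 3, 1, 5, 4]  # hypothetical output lengths in tokens
    print(static_batching_steps(lengths, batch_size=2))
    print(continuous_batching_steps(lengths, batch_size=2))
```

With these six hypothetical requests and two slots, the static policy needs 16 iterations and the continuous one needs 12, because short requests no longer wait for the long one sharing their batch.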

Bibliography

Yu, G.-I., Jeong, J. S., Kim, G.-W., Kim, S., & Chun, B.-G. (2022). Orca: A Distributed Serving System for Transformer-Based Generative Models. 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), 521–538. https://www.usenix.org/conference/osdi22/presentation/yu

Note

See the paper and the accompanying presentation. The most notable open-source implementation is vLLM.

P.S. Actually, I think it was first implemented in huggingface/tgi.
