
tldr:

  • Structured decoding allows precise control over LLM output formats
  • vLLM now supports both outlines and XGrammar backends for structured decoding
  • Recent XGrammar integration brings up to 5x improvement in time per output token (TPOT) under load
  • Upcoming v1 release focuses on enhanced performance and scheduler-level mask broadcasting for mixed-request batch support

vLLM is the high-throughput and efficient inference engine for running large language models (LLMs). In this post, we will explore the annotated history of language models, describe the current state of structured decoding in vLLM and the recent integration with XGrammar, and share our tentative roadmap for structured decoding in vLLM's v1.

We also invite you to approach this blog post from a philosophical perspective: in doing so, we try to posit that structured decoding represents a fundamental shift in how we think about LLM outputs. It also plays an important role in building complex agentic systems.

For more information about vLLM, please check out our documentation.

language model, a brief historical context

If you have read about the history of the field before, feel free to skip this part and jump ahead to why do we need structured decoding?

The inception of AI might well be traced back to the origin of logic, where emphasis was placed on reducing reasoning to specific sets of calculations (a reductionist approach). As such, Plato generalised the belief in total formalisation of knowledge, where knowledge must be universally applicable with explicit definitions 1.

In 1950, Alan Turing posited that a high-speed digital computer, programmed with rules, would exhibit emergent behaviour of intelligence (Turing, 1950).

A paradigm quickly emerged among researchers in the 1950s, where expert systems were designed to replicate the decision-making capabilities of a human specialist 2 (or symbolic reasoning systems), referred to by Haugeland as Good Old-Fashioned AI (GOFAI) (Haugeland, 1997). However, it quickly ran into funding problems because its semantic representations could not be scaled up to generalized tasks (also known as the “AI Winter” (Hendler, 2008)).

Concurrently, Donald Norman’s Parallel Distributed Processing (Rumelhart et al., 1986) group investigated variations of Rosenblatt’s perceptron (Rosenblatt, 1958), proposing hidden layers within the network, alongside inputs and outputs, to extrapolate appropriate responses based on what the network had learned during training. These connectionist networks, often built on top of statistical methods3, are referred to as New-Fangled AI (NFAI) (Haugeland, 1997). Given the abundance of data and Moore’s Law4 resulting in an unprecedented amount of compute available, we now see the complete dominance of connectionist networks in both research and production use cases, with variants of decoder-only transformers5 for text-generation tasks.

In retrospect, GOFAI is deterministic in the sense that intentionality is injected into symbolic tokens through explicit programming. Connectionist networks, on the other hand, are often considered black-box models, given the hidden nature of their intermediate representations. Unlike GOFAI, their internal representation is determined by the state of the entire network rather than any single unit. Although these models exhibit emergent behaviour of intelligence, one should be aware that this is not artificial general intelligence yet, largely due to researchers’ observer-expectancy effect.

In summary:

  • GOFAI is deterministic and rule-based, given that its intentionality is injected through explicit programming
  • NFAI models are often considered “black-box” (input in, some output out) and data-driven, given the networked, complex nature of their internal representations

why do we need structured decoding?

figure1: Shoggoth as GPTs. In a sense, RLHF, or any such method, is an injection of rules (a GOFAI system) into post-training

LLMs excel at the following heuristic: given a blob of text, the model generates a contiguous piece of text that it predicts to be the most probable continuation. For example, if you give it a Wikipedia article, the model should produce text consistent with the remainder of that article.

These models work well given the following assumption: the input prompt must be coherent and well-structured around the problem the user wants to solve. In other words, generation is non-deterministic and can be unreliable when you need output in a specific format. Think of asking a model to generate JSON: without guidance, it might produce valid text that breaks the JSON specification7.
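To make this concrete, here is a toy illustration (the raw output string is hypothetical) of how an almost-correct completion can still fail to parse:

```python
import json

# Hypothetical raw completion: looks like JSON, but a trailing comma breaks the spec.
raw_output = '{"name": "Alice", "age": 30,}'

try:
    json.loads(raw_output)
except json.JSONDecodeError as exc:
    print(f"not valid JSON: {exc}")
```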

This failure mode gives rise to the need for applying explicit rules and grammars8 (an addition of a GOFAI system) that let users control certain aspects of the output format while keeping the non-deterministic nature of the overall system.

OpenAI then introduced JSON mode to constrain 9 the output format of these models. If you have built with such functionality before (function calling, coding assistants), chances are you are already using structured decoding under the hood.

Guided decoding is to LLMs what validation is to APIs: it acts as a guarantee that what comes out matches what you expect. Guided decoding ensures structural integrity, which allows developers to integrate LLMs into their applications with ease!

structured decoding and vLLM.

In layman's terms, structured decoding gives LLMs a “template” to follow. Users provide a schema that “influences” the model’s output, ensuring compliance with the desired structure:

```mermaid
graph LR
    subgraph Structured automaton
        A[Schema] -->|Defines| B[Logit Bias]
    end

    subgraph Decoding
        C[Language Model] -->|Token distribution| D[Constrained Sampling]
    end

    B --> D
    D --> E[Structured Output]

    style A fill:#e1f3f8,stroke:#333,stroke-width:2px
    style B fill:#e1f3f8,stroke:#333,stroke-width:2px
    style C fill:#f8e1e1,stroke:#333,stroke-width:2px
    style D fill:#f8e1e1,stroke:#333,stroke-width:2px
    style E fill:#e1f8e1,stroke:#333,stroke-width:2px

    classDef default fill:#fff,stroke:#333,stroke-width:2px;
```

From a technical perspective, an inference engine can modify the probability distribution of the next token by applying a bias (often via logit masks) over all tokens, derived from a given schema. To apply these biases, Outlines proposed constructing a finite-state machine 10 (FSM) from a given schema and using it to guide generation (Willard & Louf, 2023). This allows us to track the current state during decoding and filter out invalid tokens by applying a logit bias to the output 11.
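To make the mechanism concrete, here is a minimal, illustrative sketch of FSM-guided logit masking. This is not vLLM's or Outlines' actual implementation; the transition table and token IDs are made up for the example:

```python
import torch

# Hypothetical FSM: maps (state, token_id) -> next_state.
# In practice, Outlines compiles such a table from a JSON schema or regex.
fsm_transitions = {
    (0, 11): 1,   # e.g. token 11 == '{'
    (1, 42): 2,   # e.g. token 42 == '"name"'
    (2, 13): 3,   # e.g. token 13 == ':'
}

def allowed_tokens(state: int) -> list[int]:
    """Tokens that have a defined transition from the current FSM state."""
    return [tok for (s, tok) in fsm_transitions if s == state]

def mask_logits(logits: torch.Tensor, state: int) -> torch.Tensor:
    """Apply a logit bias: disallowed tokens get -inf so they are never sampled."""
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_tokens(state)] = 0.0
    return logits + mask

# Toy vocabulary of 64 tokens, starting in FSM state 0.
logits = torch.randn(64)
constrained = mask_logits(logits, state=0)
next_token = int(torch.argmax(constrained))  # guaranteed to be a legal token (here: 11)
```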

figure2: courtesy of (LMSys, 2024)

In vLLM, you can use this by passing a JSON schema to the sampling params (either through the Python SDK or HTTP requests).
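For example, here is a minimal sketch using the offline Python SDK; the exact API surface may differ slightly between vLLM versions, and the model name is only a placeholder:

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

# Placeholder model; any instruct-tuned model served by vLLM works here.
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")

json_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

sampling_params = SamplingParams(
    max_tokens=100,
    guided_decoding=GuidedDecodingParams(json=json_schema),
)

outputs = llm.generate(
    "Generate a JSON profile for a fictional person.",
    sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)  # output should now follow the schema
```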

Note

In some cases, it can even improve the native decoding performance for LLMs!

previous limitations in vLLM.

There are a few limitations with vLLM's current support of the Outlines backend:

  1. Slow decoding: the FSM has to be constructed at the token level, meaning it can only transition the state one token per step. Therefore, it can only decode one token at a time, resulting in slow decoding.
  2. Batch processing bottlenecks: vLLM's implementation relies heavily on the logit processor12, which sits on the critical path of the sampling process 13. In batching use cases, compiling the FSM per request and computing the mask synchronously means that all requests in a given batch are blocked, resulting in high time-to-first-token (TTFT) and lower throughput (see the sketch after this list).
    • We found FSM compilation to be a relatively expensive task, making it a significant contributor to the increased TTFT.
  3. Performance issues with CFG mode: with the Outlines integration, while JSON mode is relatively fast, CFG mode runs significantly slower and can occasionally crash the engine.
  4. Limited advanced feature support: techniques like jump-forward decoding are currently not possible with the logit-processor approach. Jump-forward decoding requires prefilling a set of k next tokens, whereas logit processors can only deal with the next token.
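To illustrate why limitation (2) bites, the following simplified, hypothetical sketch shows a per-request logits-processor loop (not vLLM's actual scheduler code): FSM compilation and mask computation happen synchronously inside the sampling step, so every request in the batch waits on them.

```python
import torch

def sample_step(batch_logits, requests, compile_fsm, compute_mask):
    """Hypothetical sampling step with per-request guided decoding.

    compile_fsm(schema) and compute_mask(fsm, state) stand in for the expensive,
    synchronous work done by an Outlines-style logits processor.
    """
    next_tokens = []
    for logits, req in zip(batch_logits, requests):
        if req.get("schema") is not None:
            fsm = req.get("fsm") or compile_fsm(req["schema"])  # expensive, blocks the whole batch
            req["fsm"] = fsm
            logits = logits + compute_mask(fsm, req["state"])   # synchronous mask computation
        next_tokens.append(int(torch.argmax(logits)))
    return next_tokens

# Toy usage with stand-in callables: one guided request, one unconstrained request.
reqs = [{"schema": {"type": "object"}, "state": 0, "fsm": None}, {"schema": None}]
logits = torch.randn(2, 64)
tokens = sample_step(
    logits, reqs,
    compile_fsm=lambda schema: "fsm",
    compute_mask=lambda fsm, state: torch.zeros(64),
)
```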

integrations with XGrammar

XGrammar introduces a new technique for batched constrained decoding via a pushdown automaton (PDA). You can think of a PDA as a “collection of FSMs, where each FSM represents a context-free grammar (CFG).” One significant advantage of the PDA is its recursive nature, allowing us to execute multiple state transitions. XGrammar also includes additional optimisations (for those who are interested) to reduce grammar compilation overhead.
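For intuition about why a stack matters, consider a tiny recursive grammar for nested lists, written below as a GBNF-style string for illustration (the exact syntax accepted by XGrammar may differ). A plain FSM cannot track arbitrary nesting depth, but a PDA can, because its stack remembers how many brackets remain open:

```python
# GBNF-style grammar (illustrative): a value is either a number or a
# bracketed, comma-separated list of values -- note the recursion on "value".
nested_list_grammar = r"""
root   ::= value
value  ::= number | list
list   ::= "[" (value ("," value)*)? "]"
number ::= [0-9]+
"""

# "[1,[2,[3]]]" is valid at any depth; matching every "[" with a "]" requires
# a stack, which is exactly what a pushdown automaton adds on top of an FSM.
```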

This is great because it addresses limitation (1) by moving grammar compilation out of Python into C, utilising pthreads. Additionally, XGrammar lays the groundwork for addressing limitation (4) in future releases. Below are performance comparisons between the XGrammar and Outlines backends:

figure3: courtesy of Michael Goin (Red Hat)

In vLLM’s v0 architecture, we’ve implemented XGrammar as a logit processor, optimizing it with caching for tokenizer data. While the performance improvements are encouraging, we believe there’s still significant room for optimisation.

integrations with v0

There are still a few usability concerns before we reach feature parity across all use cases:

  • It does not yet support grammars other than the GBNF format (PR on vLLM: github)
  • It does not yet support regex
  • It does not yet support complex JSON schemas that use regex patterns or numeric ranges 14

Important

vLLM now has basic support for XGrammar by default. In cases where we know XGrammar is insufficient to serve the request, we fall back to Outlines.

Note that vLLM also includes support for lm-format-enforcer. However, from our testing we found that in some long-context test cases, lm-format-enforcer fails to enforce correct outputs and is not up to par with Outlines in terms of performance.
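Regardless of which backend serves a given request, the client-side usage stays the same. Here is a hedged sketch of requesting guided JSON through the OpenAI-compatible server; the guided_json field and the backend flag shown in the comment are assumptions about your vLLM version and may vary:

```python
# Assumes an OpenAI-compatible vLLM server is already running, e.g. started with
#   vllm serve Qwen/Qwen2.5-1.5B-Instruct --guided-decoding-backend xgrammar
# (flag name and default backend may differ between vLLM versions).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[{"role": "user", "content": "Give me a JSON profile for a fictional person."}],
    extra_body={  # vLLM-specific extension for guided decoding
        "guided_json": {
            "type": "object",
            "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
            "required": ["name", "age"],
        }
    },
)
print(completion.choices[0].message.content)
```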

tentative plans for v1

With the release of v1 on the horizon, the following includes a tentative plan for structured decoding:

  1. Moving guided decoding towards the scheduler level:
    • Reason: at the scheduler level we have more context regarding which requests use structured decoding, so those requests shouldn't block other requests within the batch (tentatively addressing limitation (2)). In a sense, this moves guided decoding out of the critical path.
    • This would also allow for more natural vertical integration with jump-forward decoding (addressing limitation (4)).
  2. Allowing bitmask calculation in one process instead of in each GPU worker:
    • Reason: we can then broadcast this bitmask to every GPU worker instead of repeating the computation per worker (a rough sketch follows after this list).
    • We will carefully analyze the bandwidth implications of broadcasting masks for every sample of every request that uses guided decoding.
  3. Good baseline for speculative decoding and tool use:
    • Reason: XGrammar includes plans to support tool use, such that we can move away from Python's tool parser.
    • Tree scoring in speculative decoding can then use the same API as jump-forward decoding (which depends on the integration of guided decoding at the scheduler level).
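As a rough illustration of point (2), the sketch below (hypothetical, not the v1 implementation) computes one boolean allowed-token mask per request in a single process; in a tensor-parallel deployment this mask would be broadcast once per step (e.g. via torch.distributed.broadcast) and applied to the logits on every GPU worker. A real implementation would bit-pack the mask to reduce broadcast bandwidth.

```python
import torch
# import torch.distributed as dist  # used in a real multi-GPU setup

def build_allowed_mask(allowed_token_ids: list[list[int]], vocab_size: int) -> torch.Tensor:
    """One boolean row per request; True where the grammar allows the token."""
    mask = torch.zeros(len(allowed_token_ids), vocab_size, dtype=torch.bool)
    for row, ids in enumerate(allowed_token_ids):
        mask[row, ids] = True
    return mask

def apply_mask(logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Disallowed tokens get -inf so constrained sampling never picks them."""
    return logits.masked_fill(~mask, float("-inf"))

# Toy example: request 0 may only emit tokens {3, 7}; request 1 only token {0}.
# In a tensor-parallel deployment the mask would be computed once on rank 0 and
# broadcast to all workers, e.g. dist.broadcast(mask, src=0), before applying.
logits = torch.randn(2, 64)
mask = build_allowed_mask([[3, 7], [0]], vocab_size=64)
constrained = apply_mask(logits, mask)
```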

NOTE: if you have any more suggestions, we are more than happy to take them into consideration. Consider joining the vLLM Slack via #feat-structured-output

acknowledgement

I want to thank the vLLM team, the XGrammar team, Michael Goin (Red Hat), Chendi Xue (Intel), and Russell Bryant (Red Hat) for their valuable feedback and collaboration on bringing XGrammar to vLLM and for the continuous effort to improve structured decoding in vLLM.