Prefill-Decode Scheduling and Disaggregation
The contrast between the resource and performance characteristics of the prefill and decode stages presents both challenges and opportunities: prefill can saturate the GPU with even a single long prompt and processes several tokens in parallel, while decode is memory-bound, incapable of saturating the GPU, and emits tokens serially. Speculative decoding attempts to increase efficiency and reduce latency by making the decode phase look more like prefill, since verification is essentially a prefill-like parallel phase.
For a single request, prefill has to be completed before decode can commence; however, the ordering of prefill and decode across multiple requests can be tuned to better prioritize throughput and resource efficiency or response latency. Schedulers can choose to prioritize batches of entirely prefill or decode operations or run mixed prefill-decode batches. Prioritizing prefill or mixing prefill-decode batches increases resource efficiency at the cost of latency since decode stages are either slowed down or stalled by the prefills. Conversely, schedulers which prioritize decode-only batches before allowing any new prefills to occur have better latency but worse throughput.
Chunked Prefills
Sarathi attempts to balance this trade-off by not taking an all-or-nothing approach to prefill and instead dividing the initial prompt into a number of chunks and running mixed prefill-decode batches. This approach is called chunked prefills, and the size of the chunk is determined by the maximum number of tokens the GPU can process in a batch while achieving its SLO, as well as the number of decode requests already in the batch. The downside is that prefill is a naturally parallelizable operation that could be completed in a single pass. Dividing it into smaller chunks is less efficient both because it operates on smaller matrices which make saturating the GPU harder and because fetching model weights and populating the KV cache multiple times increases memory bandwidth pressure. Consequently, if the chunk sizes get too small, the operation becomes memory-bound and throughput and efficiency suffer.
Prefill-Decode Disaggregation
DistServe takes a more extreme approach to preventing prefill and decode from interfering with each other by moving them to separate devices: most systems co-locate prefill and decode on the same device because it allows both model parameters and the KV cache to be reused which preserves precious HBM. DistServe turns this on its head by replicating the model parameters and instead running prefills on one GPU and then shipping the populated KV cache for a request over to another GPU which runs the autoregressive phase. This is not actually a huge deal because moving the KV cache either within a node or across nodes using a low-latency interconnect is significantly faster than prefill and decode, resulting in only a small impact on request latency (0.1% with a p95 < 30ms).
DistServe is targeted towards environments with multiple GPUs, both within a single node, and across multiple nodes, where latency is the key consideration. Disaggregating prefill and decode also simplifies scheduling (no longer any need to balance prefill or decode prioritization), and allows resources to be scaled independently for either task. The core of DistServe is a placement engine which determines the number of prefill and decode instances and how they should be situated on physical hardware. While DistServe is able to get rid of all prefill-decode interference, it does little to improve the utilization of the decode resources.
Prefill-Decode disaggregation is available in most major serving systems. Splitwise explores disaggregation from the perspective of a hosting provider with heterogeneous fleets and how disaggregation allows prefill and decode to run on the hardware best suited to their specific resource needs—for instance, decode requires lots of memory and little compute and can be run on GPUs with less tensor cores or potentially even CPUs with large memory capacity.
Efficiently transferring the KV cache between the prefill and decode machines is key to getting reasonable performance in disaggregated systems. DistServe takes a pull approach where the KV cache lives in the memory of the prefill nodes and is pulled by the decode nodes just before they are ready to start processing the request. Splitwise overlaps prefill computation with KV transfer by eagerly transferring the KV cache for different layers as they get calculated, while DejaVu also proposes prefill-decode disaggregation as a means to address the bimodal nature of prompt processing and token generation. It focuses on the design and implementation of a layer-by-layer streaming KV management library which swaps the KV cache out of HBM to system memory or across a low-latency network to other devices between the prefill and decode stages. DejaVu also observes that the same mechanisms for KV cache offload and transfer allow for replication and for the system to provide fault tolerance as needed.
Prefill-decode disaggregation is not the only attractive disaggregation target: the attention and the feed-forward network also have vastly different resource and performance characteristics and are potential future areas of research.