Speculative Decoding in LLMs
An overview of major techniques
I recently gave a talk on Speculative Decoding for the EleutherAI ML Performance Reading Group, focusing on how modern speculative sampling techniques improve LLM inference latency and throughput.
The talk covers the evolution of speculative decoding from both a research and a systems perspective, including:
- The original speculative decoding framework introduced independently by Leviathan et al. and Chen et al.
- Medusa, which removes the need for a separate draft model via multi-head token prediction
- The EAGLE (1/2/3) series, which advances speculative decoding through feature-level autoregression, dynamic trees, and training-time rollout
Beyond algorithmic intuition, the talk discusses systems tradeoffs such as acceptance rates, verification cost, batching behavior, KV-cache management, and practical implementations in vLLM and SGLang.
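For readers who want the core idea before watching, here is a minimal sketch of the accept/reject rule from the original speculative decoding framework. It is not code from the talk, vLLM, or SGLang. The names `verify_draft` and `residual_distribution` are hypothetical, and the distributions are plain Python lists over a toy vocabulary rather than real model outputs. A drafted token `x` is accepted with probability `min(1, p(x) / q(x))`, where `p` is the target distribution and `q` is the draft distribution. On the first rejection, a replacement token is sampled from the normalized residual `max(0, p - q)` and the step ends.

```python
import random
from typing import List, Sequence


def residual_distribution(p: Sequence[float], q: Sequence[float]) -> List[float]:
    """Normalized max(0, p - q), used to resample after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    if total == 0.0:
        # p and q are identical, so a rejection cannot occur; fall back to p.
        return list(p)
    return [d / total for d in diff]


def verify_draft(
    draft_tokens: Sequence[int],
    q_dists: Sequence[Sequence[float]],
    p_dists: Sequence[Sequence[float]],
    rng: random.Random,
) -> List[int]:
    """Verify k drafted tokens against the target model's distributions.

    draft_tokens: k token ids proposed by the draft model.
    q_dists:      k draft distributions, one per drafted position.
    p_dists:      k + 1 target distributions; the last one is used for the
                  extra token emitted when every draft token is accepted.
    Returns between 1 and k + 1 tokens.
    """
    emitted: List[int] = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_dists[i], q_dists[i]
        # q[tok] > 0 because tok was sampled from q.
        accept_prob = min(1.0, p[tok] / q[tok])
        if rng.random() < accept_prob:
            emitted.append(tok)
            continue
        # First rejection: resample from the residual and stop this step.
        residual = residual_distribution(p, q)
        emitted.append(rng.choices(range(len(p)), weights=residual)[0])
        return emitted
    # All k drafts accepted: sample one extra token from the target model.
    last = p_dists[-1]
    emitted.append(rng.choices(range(len(last)), weights=last)[0])
    return emitted


if __name__ == "__main__":
    rng = random.Random(0)
    uniform = [0.5, 0.5]
    # Draft and target agree, so both drafted tokens are always accepted.
    print(verify_draft([0, 1], [uniform, uniform], [uniform] * 3, rng))
```

Each verification step emits at least one token, and the accepted prefix length is what the acceptance rate measures. In a real system the `k + 1` target distributions come from a single forward pass of the target model over the drafted positions.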
Talk & slides:
- Video: YouTube
- Slides: Google Slides