Multi-Token Prediction Explained: How Gemma 4 Runs 3x Faster Locally

The 3x speedup isn't magic — it's speculative arithmetic: a tiny drafter bets on the next five tokens while the big model checks the work in a single pass.

If you have ever run a 30-billion-parameter model on your own machine and watched the cursor blink for four seconds between sentences, you have felt the autoregressive bottleneck in your bones. It is not a hardware problem in the way most people assume. A modern GPU is not idle during that wait — it is fully occupied, loading billions of weights from memory to generate exactly one token before repeating the entire cycle. On May 5, 2026, Google released something that changes that equation without asking you to buy new silicon: Multi-Token Prediction (MTP) drafters for the entire Gemma 4 model family.

Google's release of MTP drafters for the Gemma 4 family uses a specialized speculative decoding architecture to deliver up to a 3x speedup without any degradation in output quality or reasoning logic. The weights are available immediately on Hugging Face and Kaggle, and the runtimes that already serve Gemma 4 — Hugging Face Transformers, MLX, vLLM, SGLang, Ollama, and LiteRT-LM — pick up the drafters with minimal configuration. This tutorial walks you through exactly why the old approach was slow, how three competing strategies try to fix it, what Google actually engineered inside Gemma 4's drafter architecture, and what real throughput numbers look like on the hardware sitting on your desk.

Whether you are an ML practitioner tuning production pipelines or a vibe coder who just wants chat to feel snappy, the mechanics here are worth understanding — because the same pattern is about to propagate across every major open-weight family.

The Autoregressive Bottleneck: Why One Token at a Time Is Expensive

Every transformer language model you have ever used generates text the same way: one token per forward pass through the full model, sequentially. This is not an accident of implementation — it is a structural consequence of how autoregressive models are trained. Each token is conditioned on every previous token, so the model cannot know what token 47 should be until it has committed to tokens 1 through 46.

The problem is that this sequential structure collides badly with modern GPU memory architecture. Standard LLM inference is slow not because your GPU lacks processing power — it is slow because of memory bandwidth. Every time the model generates a single token, the processor must move billions of parameters from VRAM to the compute units, perform a forward pass, produce one token, and then repeat the entire cycle. A 31-billion-parameter model in bfloat16 occupies roughly 62 GB of parameter storage, and for every token the GPU's memory controller has to shuttle some portion of those weights across the bus. The result is under-utilized compute and high latency, especially on consumer-grade hardware.

The cruel irony is that the GPU's actual arithmetic units — the tensor cores doing matrix multiplication — are sitting largely idle during those memory transfers. The compute-to-memory-bandwidth ratio (arithmetic intensity) of a single-token forward pass is far too low to saturate a modern GPU. The bottleneck is the bus, not the cores. This is why throwing a faster GPU at the problem provides diminishing returns past a certain point: you are not compute-bound, you are memory-bandwidth-bound.
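A back-of-the-envelope roofline makes that bound concrete. The sketch below assumes an RTX 4090's roughly 1,008 GB/s of memory bandwidth and the 62 GB bfloat16 weight footprint from above; quantization, caching, and kernel efficiency all shift the real number, but the shape of the ceiling is the same.

```python
# Back-of-the-envelope roofline: if generating each token requires streaming
# the full weight set from VRAM, memory bandwidth caps tokens/sec.
# Assumed figures -- adjust for your own card and quantization level.
weight_bytes = 31e9 * 2           # 31B parameters in bfloat16 (2 bytes each) ~= 62 GB
bandwidth_bytes_per_s = 1.008e12  # ~1008 GB/s, RTX 4090 spec-sheet bandwidth

max_tokens_per_s = bandwidth_bytes_per_s / weight_bytes
print(f"Bandwidth-bound ceiling: ~{max_tokens_per_s:.1f} tokens/sec")
# => roughly 16 tokens/sec, no matter how fast the tensor cores are.
```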

This process dedicates the same amount of computation to predicting an obvious continuation — like predicting "words" after "Actions speak louder than…" — as it does to solving a complex logic puzzle. That asymmetry is the key intuition. Most tokens in a typical generation are, in fact, fairly predictable. A fast, cheap model can probably guess them. The slow, powerful model only needs to weigh in when things get hard.

Three Strategies to Break the Bottleneck

The research community has converged on three families of approaches to recover throughput from memory-bound inference. They are related but architecturally distinct.

Approach 1 — Native Multi-Token Prediction (MTP)

Pure MTP is a training-time modification. Instead of training the model with only a next-token prediction objective, you add auxiliary heads that simultaneously predict tokens at positions +1, +2, +3, and so on. Multi-token prediction trains the model to predict multiple future tokens at once rather than one at a time; at inference time, this enables speculative decoding schemes in which the model's own draft heads propose multiple tokens in parallel and a single verification pass checks them. The canonical academic treatment is Gloeckle et al.'s "Better & Faster Large Language Models via Multi-Token Prediction" (2024), which showed that MTP training both improves downstream quality and enables speedups at inference.
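To make the idea concrete, here is a minimal toy sketch of an auxiliary multi-token prediction objective — illustrative PyTorch, not Gemma's training code: a few extra linear heads on a shared trunk, with head i trained to predict the token i positions ahead.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTPHeads(nn.Module):
    """Toy multi-token prediction: k output heads over a shared hidden state.
    Head i is trained to predict the token at offset +i."""
    def __init__(self, hidden_size: int, vocab_size: int, k: int = 4):
        super().__init__()
        self.k = k
        self.heads = nn.ModuleList(
            nn.Linear(hidden_size, vocab_size) for _ in range(k)
        )

    def loss(self, hidden: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # hidden: [batch, seq, hidden] from the shared trunk; tokens: [batch, seq]
        total = 0.0
        for i, head in enumerate(self.heads, start=1):
            logits = head(hidden[:, :-i])   # positions that have a target i steps ahead
            targets = tokens[:, i:]         # the token at offset +i
            total = total + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
            )
        return total / self.k
```

The auxiliary losses are added to the usual next-token loss during pretraining; at inference the extra heads can be dropped or reused as cheap draft proposals.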

The catch: MTP support is only meaningful for models that were trained with MTP heads, which means models that included the auxiliary multi-token prediction objective during pretraining. You cannot bolt MTP onto an existing checkpoint the way you can apply quantization. It requires a new training run — or at minimum, a fine-tune of the head layers on the frozen backbone.

Approach 2 — Classic Speculative Decoding

Speculative decoding, introduced by Leviathan, Kalman, and Matias at Google in the foundational 2022 paper "Fast Inference from Transformers via Speculative Decoding", takes a different angle: sample from an autoregressive model faster, without any change to its outputs, by computing several tokens in parallel. The key insight is that hard language-modeling tasks often contain easier subtasks that a much cheaper model can approximate well. Using speculative execution and a carefully designed sampling rule, the large model can be run in parallel over the cheap model's proposals, committing several tokens per pass while leaving the output distribution exactly unchanged.

In practice, a small model proposes a sequence of candidate tokens. The large model then verifies all of them in a single parallel forward pass. Accepted tokens are free; rejected tokens fall back to the large model's own sample. The output distribution is provably identical to running the large model alone.
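A simplified sketch of that loop is below. It assumes greedy decoding and illustrative model callables; the published algorithm uses a modified rejection-sampling rule so that even temperature sampling matches the large model's distribution exactly.

```python
import torch

@torch.no_grad()
def speculate_step(target, draft, input_ids, k=5):
    """One draft-then-verify step (greedy variant).
    target/draft are callables returning logits of shape [1, seq, vocab]."""
    # 1. The small model proposes k candidate tokens autoregressively (cheap).
    draft_ids = input_ids
    for _ in range(k):
        next_tok = draft(draft_ids)[:, -1].argmax(dim=-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_tok], dim=-1)

    # 2. The large model scores all k candidates in ONE forward pass.
    logits = target(draft_ids)
    start = input_ids.shape[1]
    preds = logits[:, start - 1:-1].argmax(dim=-1)   # target's own choice at each drafted position
    proposed = draft_ids[:, start:]

    # 3. Accept the longest matching prefix; the first mismatch (or the bonus
    #    token after a full match) comes from the target itself, for free.
    matches = (preds == proposed).long().cumprod(dim=-1)
    n_accepted = int(matches.sum())
    accepted = proposed[:, :n_accepted]
    bonus = logits[:, start - 1 + n_accepted].argmax(dim=-1, keepdim=True)
    return torch.cat([input_ids, accepted, bonus], dim=-1)
```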

Approach 3 — The Draft Model

A draft model is the practical implementation of speculative decoding: a standalone smaller model trained to predict what the large model would say. The small, fast draft model generates several candidate tokens, and the larger, more accurate model quickly verifies them — producing the same high-quality output much faster than decoding with the large model alone.

Classic draft models are independent networks — they share a tokenizer with the target model, but they carry their own weights, their own KV cache, and their own memory footprint. Existing speculative decoding in llama.cpp requires maintaining two separate models in memory, a small draft model and a large verification model, which complicates setup and increases VRAM requirements. This is the setup's main limitation: you pay a VRAM tax for the second model, and there is no architectural guarantee that the draft model's internal representations align well with the target.

What Google shipped for Gemma 4 is technically in the draft-model category — but with three critical engineering choices that make it far tighter than a classic independent drafter. That is the interesting part.

What Google Actually Shipped: The Embedded MTP Drafter

Google DeepMind shipped MTP drafter models paired with four Gemma 4 variants: the 31B dense flagship, the 26B A4B Mixture-of-Experts model, and the on-device E2B and E4B edge models. Each drafter is a separate checkpoint, but it is not a conventional independent model. Gemma 4's MTP drafters are not independent small models — they are tightly coupled to the target, with three architectural choices making this work.

Shared input embeddings. The drafter reuses the target model's embedding table instead of learning its own. This eliminates one of the largest parameter matrices from the drafter's footprint and ensures that the drafter and target model operate in identical token-space.

Target-activation conditioning. The draft model uses the activations from the last layer of the target model, concatenates them with the token embeddings, and down-projects them to the drafter model's dimension. This is architecturally significant: the drafter is not trying to independently reconstruct the target's world model from scratch. It is taking the target's own final hidden state as input, then asking "given where the target's representation already landed, what token is it most likely to pick next?" That is a much easier problem than running a full autoregressive pass.

Shared KV cache. Drafters reuse the target's key-value cache rather than rebuilding context, which removes the dominant prefill cost in long-context generation. In a classic two-model setup, both models independently compute attention over the full context. Here the drafter piggybacks on work the target has already done.
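Pieced together, the drafter's forward pass looks roughly like the sketch below. The layer names and sizes are illustrative, attention masking and KV-cache sharing are omitted for brevity, and this is not Google's released architecture code — it only shows how shared embeddings and target-activation conditioning fit together.

```python
import torch
import torch.nn as nn

class TinyDrafter(nn.Module):
    """Illustrative drafter conditioned on the target's last-layer activations."""
    def __init__(self, target_embeddings: nn.Embedding, target_hidden: int, draft_hidden: int):
        super().__init__()
        self.embed = target_embeddings                  # shared with the target: no new table
        emb_dim = target_embeddings.embedding_dim
        self.down_proj = nn.Linear(emb_dim + target_hidden, draft_hidden)
        self.block = nn.TransformerEncoderLayer(draft_hidden, nhead=8, batch_first=True)
        self.to_emb = nn.Linear(draft_hidden, emb_dim, bias=False)

    def forward(self, last_token_ids, target_last_hidden):
        # Concatenate the token embedding with the target's final hidden state,
        # down-project into the drafter's (much smaller) width, run a cheap block,
        # then score against the SHARED embedding matrix (tied output head).
        x = torch.cat([self.embed(last_token_ids), target_last_hidden], dim=-1)
        h = self.block(self.down_proj(x))
        logits = self.to_emb(h) @ self.embed.weight.T
        return logits
```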

For the edge variants, there is an additional optimization. To avoid the expensive operation of predicting across the entire vocabulary, the model groups similar tokens into clusters and first identifies the most likely clusters, then restricts its final calculations to only the tokens within those selected clusters. On a mobile SoC where the softmax over a 256k vocabulary is a meaningful fraction of total inference cost, this clustering trick recovers real throughput.
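Conceptually, the clustered vocabulary trick looks like the following sketch: score a small number of cluster centroids first, then compute full logits only for tokens inside the winning clusters. The function and tensor names are illustrative, not the released kernel.

```python
import torch

def clustered_logits(hidden, cluster_centroids, cluster_members, token_weights, top_c=8):
    """hidden: [d] final hidden state; cluster_centroids: [C, d];
    cluster_members: list of C LongTensors of token ids; token_weights: [V, d]."""
    # Stage 1: cheap scoring over C clusters (C is hundreds, not 256k).
    cluster_scores = cluster_centroids @ hidden            # [C]
    top_clusters = cluster_scores.topk(top_c).indices      # [top_c]

    # Stage 2: full logits only for tokens inside the selected clusters.
    candidate_ids = torch.cat([cluster_members[int(c)] for c in top_clusters])
    candidate_logits = token_weights[candidate_ids] @ hidden
    return candidate_ids, candidate_logits
```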

The end-to-end flow is: the target model completes its forward pass and generates token N. The drafter takes the target's last-layer activations plus the embedding for token N, and rapidly proposes tokens N+1 through N+k. The target then runs a single batched verification pass over all k candidates. If the target model agrees with the draft, it accepts the entire sequence — and even generates one additional token of its own in the process, meaning the application can output the full drafted sequence plus one extra token in roughly the same wall-clock time it would normally take to generate just one token.

Since the primary Gemma 4 model retains the final verification step, the output is identical to what the target model would have produced on its own, token-by-token — there is no quality tradeoff, it is a lossless speedup.

The heuristic scheduling in the Hugging Face Transformers integration handles one more variable: how many tokens to draft. You can set num_assistant_tokens_schedule to "heuristic", which automatically adapts the number of drafted tokens at runtime: if all tokens are accepted, it increases the draft length by 2; if any are rejected, it reduces it by 1. This means you do not need to hand-tune draft depth for your workload — the scheduler converges on the acceptance-rate sweet spot automatically.
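In spirit, the adjustment rule is tiny — something like the following simplification of what the heuristic schedule does internally:

```python
def adjust_draft_length(num_assistant_tokens: int, num_accepted: int, num_drafted: int) -> int:
    # All drafted tokens accepted -> be more ambitious; any rejection -> back off.
    if num_accepted == num_drafted:
        return num_assistant_tokens + 2
    return max(1, num_assistant_tokens - 1)
```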

The following snippet shows the shape of the call in Hugging Face Transformers. It is modeled on the pattern in the official Google AI developer documentation, with checkpoint names following the convention described below, so treat the exact identifiers and arguments as illustrative:
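```python
# Sketch of assisted generation with the Gemma 4 MTP drafter.
# Checkpoint names follow the "-assistant" suffix convention described in this
# article; verify them against the official model cards before use.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "google/gemma-4-31B-it"             # target (verifier) model
drafter_id = "google/gemma-4-31B-it-assistant"  # MTP drafter checkpoint

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(
    target_id, torch_dtype=torch.bfloat16, device_map="auto"
)
drafter = AutoModelForCausalLM.from_pretrained(
    drafter_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("Explain speculative decoding in one paragraph.",
                   return_tensors="pt").to(target.device)

# assistant_model turns on assisted generation (speculative decoding);
# the heuristic schedule adapts draft depth to the observed acceptance rate.
outputs = target.generate(
    **inputs,
    assistant_model=drafter,
    num_assistant_tokens_schedule="heuristic",
    max_new_tokens=256,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```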

Note the naming convention: the two checkpoints share a name up to a suffix — the drafter for the 31B instruction-tuned model lives at google/gemma-4-31B-it-assistant on Hugging Face, one suffix away from the target. The MTP drafters for the Gemma 4 family are available under the same open-source Apache 2.0 license as Gemma 4.

Real-World Local Hardware Impact

The 3x figure is real but narrow. The 3x gain occurs on the 26B MoE model on NVIDIA RTX PRO 6000 class hardware at optimal batch sizes, generating conversational text where the drafter can predict tokens accurately — a specific, narrow set of conditions. Here is what developers on more typical hardware should expect.

RTX 4090 (24 GB VRAM). The 4090 is the sweet spot for the Gemma 4 26B MoE at 4-bit quantization — the model's active parameter footprint fits comfortably. Community benchmarks show consumer NVIDIA GPUs (RTX 4090 class) achieving a 1.8x to 2.5x speedup on conversational tasks. For the dense 31B model, the 24 GB of VRAM becomes a bottleneck: the system has to offload parts of the model to system RAM, and generation speed drops steeply. If the dense model is what you want, run the 26B MoE instead or wait for an RTX 5090.

RTX 5090 (32 GB VRAM). A single RTX 5090 can fit Gemma 4 31B at Q4 quantization without offloading. With the full model in VRAM and MTP drafters enabled, peak conversational throughput should land in the 2.2x–2.8x range over baseline, as the elimination of CPU offload removes the dominant latency source. Real-world numbers will depend on driver and vLLM version.

M3 / M4 Mac (unified memory). Apple Silicon is an interesting case for the MoE model specifically. While the 26B mixture-of-experts model presents unique routing challenges at a batch size of 1 on Apple Silicon, processing multiple requests simultaneously at batch sizes of 4 to 8 unlocks up to a ~2.2x speedup locally. The routing overhead at batch size 1 means solo chat feels less dramatic than the headline number. If you are serving multiple sessions — say, a personal API that handles several simultaneous requests from editor plugins — batch size 4 is easy to reach and the gains materialize. MLX-lm benchmarks on an Apple M4 Pro logged a jump from 15.3 tokens/sec to 23.3 tokens/sec on a 27B four-bit model with an ~80.6% draft acceptance rate, a throughput gain of roughly 1.5x in real-world single-stream generation.

NVIDIA server-class (A100 / H100). On NVIDIA H100, the Gemma 4 31B Dense reaches approximately 27 tokens per second with MTP versus roughly 14 without — about a 1.9x gain. Cloud deployments serving multiple concurrent users will see the gains compound, since the batching that unlocks MoE efficiency also increases draft acceptance rates. On DGX Spark hardware, Gemma 4 26B-A4B FP8 plus Google's drafter hits 108.78 tok/s single-stream — 2.66x over the 40.85 baseline — and 674 tok/s aggregate at concurrency 8.

The honest bottom-line number across typical developer hardware is 1.7x to 2.2x. If Gemma 4 31B was generating at 14 tokens/sec before, reaching 24 tokens/sec changes the interactive feel of a chat application from sluggish to usable — and the quality guarantee is absolute, because the target model retains final verification authority and the output is bit-for-bit identical to what standard inference would have produced.

One gotcha worth flagging from community testing: always pair the drafter with an instruction-tuned target, never a base model. With a base target, one benchmark recorded a 41 tok/s baseline dropping to 25 tok/s with MTP — a roughly 39% slowdown, not a speedup. The drafter was trained against instruction-tuned hidden states, and the distributional mismatch is severe enough to flip the sign of the speedup.

This is the same software-over-hardware efficiency logic that made DeepSeek's architecture choices so disruptive — and it is also why on-device inference like Chrome's embedded Gemini Nano keeps getting more practical: inference math is improving faster than hardware roadmaps.

Who Copies Next: Mistral, Llama 4, Qwen 3, and DeepSeek

Google's release sets a new table-stakes expectation: frontier open-weight models should ship with purpose-built, officially maintained drafter checkpoints, not leave inference speedups to community experimentation. What Gemma 4 ships is the first first-party, openly-licensed pairing of frontier open-weight models with purpose-built drafters that share embeddings, activations, and KV cache. The pressure on other labs is real.

Qwen 3 / Qwen 3.5. Alibaba's models are the most MTP-ready family outside Google. Qwen 3.5 models include a built-in MTP head exposed through the checkpoint configuration. Community tooling around those heads is already live — llama.cpp merged beta support for multi-token prediction, and community benchmarks show 1.5x to 2x throughput gains in real-world single-stream generation on compatible models like DeepSeek V3 and Qwen 3.5. What is missing is a first-party drafter checkpoint with the tight embedding and activation sharing that Gemma 4 has. Alibaba's incentive to close that gap is high; watch for it in the next Qwen point release.

DeepSeek V4 / V4-Pro. DeepSeek V3 and DeepSeek R1 have MTP heads in their architecture. The MTP heads in DeepSeek V3 were used primarily as a training signal rather than an inference accelerator — current implementations either discard the MTP modules entirely during inference, reverting to standard next-token prediction, or keep only the first MTP module for multi-token prediction. DeepSeek has the architecture; they need the drafter checkpoint and the KV-sharing plumbing. Given that DeepSeek is now raising at a $45B valuation and has massive incentive to improve developer experience, an official drafter release seems likely for V4.

Llama 4 (Meta). The broader universe of Llama 3, Mistral, and Gemma models does not currently include MTP heads, and adding MTP capability to an existing model is a training problem, not a fine-tuning problem. Llama 4 Maverick and Scout launched in April 2025 without MTP heads, meaning Meta would need to retrain or release a Llama 4 MTP variant from scratch. This is the largest gap in the ecosystem. The architecture pressure is real all the same — Qwen and DeepSeek already train MTP-aware variants but ship without official drafter checkpoints, community drafters exist but are uneven in quality, and Google's polished Apache 2.0 drafter release sets a baseline that other vendors will likely match.

Mistral. A developer running Mistral 7B or Llama 3.1 8B today will not see any difference from llama.cpp's MTP support until their model of choice ships a new checkpoint trained with MTP objectives. Mistral Small 4 launched in 2026 without native MTP heads. Mistral has historically been quick to adopt efficiency techniques — Mixtral shipped a sparse mixture-of-experts architecture before most open-weight rivals — so an MTP-aware Mistral release is plausible in H2 2026. Watch their model card notes for any mention of auxiliary prediction objectives in the training setup.

What to watch in release notes. When any of these labs drops a new checkpoint, scan the model card and technical report for: (1) mention of an auxiliary MTP loss during pretraining, (2) any -assistant or -drafter suffix checkpoints published alongside the main model, (3) KV cache sharing documentation in the architecture section, and (4) vLLM or SGLang PR references enabling speculative decoding with the new model. Those four signals together indicate a Gemma 4-class deployment upgrade. For AI agents and on-chain inference, where every round-trip token counts, this is the inference infrastructure detail worth tracking. And for broader context on how Google DeepMind continues to push applied AI research, the MTP drafter release is another signal that the lab is focused on practical deployment efficiency, not just benchmark scores.

What to Do Next

If you are running Gemma 4 today, the upgrade path is immediate and free. Pull the -assistant checkpoint for your model size from Hugging Face's Google model hub, pass it as assistant_model in your generate call, set num_assistant_tokens_schedule="heuristic", and measure your token throughput before and after. Expect 1.7x–2.5x on typical consumer GPU and Apple Silicon hardware, with the ceiling approaching 3x on high-end NVIDIA workstation GPUs at optimal batch sizes. The output is mathematically identical to your current runs — same model, same quality, faster wall clock. Model weights are available on Hugging Face and Kaggle, with framework support across Transformers, MLX, vLLM, SGLang, and Ollama. For edge deployments, mobile users can try the E-series drafters directly through Google AI Edge Gallery on Android and iOS. The broader lesson for the open-weight ecosystem is simple: training-time MTP objectives are becoming a baseline expectation, and the labs that ship matching drafter checkpoints at release will win developer preference over those that leave the work to the community.
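To quantify the gain on your own hardware, a minimal before/after timing harness — reusing the target, drafter, and tokenizer from the earlier snippet — can be as simple as the sketch below. The helper name and prompt are illustrative.

```python
import time

def tokens_per_second(model, tokenizer, prompt, assistant=None, n_new=256):
    """Greedy generation timed end to end; returns new tokens per second."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    kwargs = {"assistant_model": assistant} if assistant is not None else {}
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=n_new, do_sample=False, **kwargs)
    elapsed = time.perf_counter() - start
    return (out.shape[1] - inputs["input_ids"].shape[1]) / elapsed

# baseline = tokens_per_second(target, tokenizer, prompt)
# assisted = tokens_per_second(target, tokenizer, prompt, assistant=drafter)
# print(f"speedup: {assisted / baseline:.2f}x")
```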



The next time a model drops with a matching -assistant checkpoint in the release notes, you will know exactly what it means — and exactly how to turn it on.

