ds4 fork: state of the work vs antirez/ds4 main

PR-prep snapshot on branch pr-prep-2026-05-18 (HEAD 8c4525b, branched from restore-cublas-and-dispatch). 132 commits ahead of origin/main at c9dd9499. 30,723 lines added, 367 deleted across 55 files. Three pillars: CUDA mmq lift, in-process VMM weight arena + sidecar, generalized engine proof harness.

Contents
  1. Executive summary
  2. In plain English — the pitch
  3. Scope — what changed where
  4. Themes (eight pillars, with intent and impact)
    1. CUDA mmq kernel lift (vendored llama.cpp)
    2. In-process CUDA VMM weight arena
    3. CUDA weight server sidecar
    4. Generalized engine proof harness
    5. MTP exact verifier path
    6. Correctness gates (Option D, dispatcher, merge fix)
    7. Bench, observability, server
    8. Documentation & vendor pin
  5. Performance results (charts & tables)
  6. Public surface, env vars, and back-compat
  7. Architectural impact diagram
  8. Risks, known issues, gaps
  9. Path to upstreaming
  10. Appendix — commit-by-commit by theme

New reader? Skim §1, read §2, jump to §5 for charts. Reviewer? Read in order; the deep dive is §4.

1. Executive summary

Prefill ceiling

5.88×
PRO 6000 Blackwell ctx=2048: 2193 t/s on the fork (mmq dispatch + in-process VMM arena) vs 373 t/s upstream baseline (cublas + cudaMalloc 4 KiB pages). Compound of 2.93× from mmq dispatch and 2.04× from VMM page layout.

Generation throughput

+16.3%
PRO 6000 ctx=2048: 38.0 → 44.2 gen t/s vs upstream. Layered from mmq prefill (+3.9%), mmvq decode (+13.7% cumulative), and graph capture with bidirectional stream sync (+16.3% cumulative) — all default-on. GB10 gen is flat within run-to-run noise (integrated LPDDR5X caps decode regardless of dispatch path).

Sidecar payoff

N−1
The weight server amortises the base upload (~20–70 s, fast NVMe to slow storage) across N concurrent workers (proof harness, profile sweeps, MTP correctness).

Kernel work

12.5 k vendored
+ 7.9 k novel
12.5 k LOC of llama.cpp mmq/mmvq kernels vendored verbatim (or lightly patched) as a matmul platform. 7.9 k LOC of ds4-original CUDA on top: adapter + dispatcher + parity tests, 14 purpose-built MTP verifier kernels, in-process VMM arena, weight-server binary, host-side verifier orchestration.

Proof runner

5-tuple
profile × suite × prompt × budget × contract. Owns weight-server lifecycle, enforces parent-PID exit, emits a single pass/fail verdict.

CUDA MTP

enabled
CUDA speculative decoding path landed end-to-end — top-2 draft, paired Q/KV, exact 2-token verifier, acceptance-history gating, session-snapshot persistence. Ships as experimental: throughput is currently neutral-to-negative vs no-MTP, so it is opt-in and the showcase config remains best-nomtp.
One-line framing: the fork has reorganised the CUDA backend around three independent components — a vendored matmul kernel family (mmq/mmvq), a memory-layout pipeline (VMM arena in-process; sidecar across processes), and a proof harness that makes the first two upgradeable without losing argmax determinism. Each component is independently useful and independently revertable.

2. In plain English — the pitch

60-second version. The fork does four things on top of antirez/ds4's CUDA backend: (1) replaces the slow matmul path with a fast one, (2) changes how model weights are laid out in GPU memory so the chip's address-translation cache stops thrashing, (3) builds a test harness that proves the first two didn't change any generated tokens, (4) plumbs CUDA speculative decoding end-to-end (correct, but not yet a perf win, so it's opt-in). The first two compound to 5.88× prefill throughput on the discrete GPU we target. The third is what made it safe to land that fast. The fourth is a future win waiting on amortisation.

1 — A faster way to do the math

Every layer of the model needs to multiply two big quantised matrices together. ds4 used to do this by expanding the weights to FP16 first, then calling NVIDIA's cuBLAS — a step that taxed both memory and time. The fork replaces it on two fronts: we vendored llama.cpp's hand-tuned matmul family (~12.5 k LOC under cuda/mmq/), which gives us fast kernels for both the matrix-shaped prefill case and the vector-shaped single-token decode case, and we wrote the ds4-side adapter, dispatcher, and a CPU-reference parity harness around it (~3.3 k LOC of original code). Discrete-GPU prefill jumps from ~373 to ~1100 tokens/sec on this work alone, and the same compute platform underpins the +16.3% generation throughput headlined in §1.

Pillar 4.1 (mmq lift)

2 — A better way to lay out 80 GB of weights

GPU math speed depends partly on where the weights live in memory: specifically, whether each tensor's base address sits on a big alignment boundary that the GPU's caches and tile-load hardware can exploit. cudaMalloc packs all ~80 GB of weights into one big chunk at arbitrary internal offsets. The fork uses a CUDA driver API to allocate each weight tensor at its own 2 MB-aligned virtual address. That single change lets the matmul kernels coalesce tile loads and hit L2 more reliably; discrete-GPU prefill roughly doubles (2.04× at ctx=2048 on PRO 6000) from this layout alone. No setup; single-process runs just get it. (The same alignment also produces a small, deterministic FP32 reduction-order drift on tight-margin tokens; same root cause, documented in misc/cuda-env-vars.md.)

Pillar 4.2 (in-process VMM arena)

3 — A sidecar for when you need more than one process

Some workflows want several ds4 processes running against the same model at the same time — e.g. comparing two configurations head-to-head, or running the proof harness. Without help, each process pays the base-weights upload cost on its own — somewhere between ~20 seconds (fast NVMe) and ~70 seconds (slower storage) per process for the V4 Flash quantised weights. The fork introduces a small ds4_weight_server binary that owns the weights once and shares them with the workers through a Unix socket. Auto-disabled when not needed, so single-process users never notice it.

Pillar 4.3 (weight server)

4 — A safety net for moving fast

Every change in this branch could in principle flip a generated token somewhere. The new proof harness boots ds4 in any pair of configurations, feeds them the same prompts, and verifies they produce byte-identical output. If something silently drifts, the harness fails. This is what made it safe to land the perf changes above without anxiety — and what gates any future "let's flip this default on" decision.

Pillar 4.4 (proof harness)

5 — CUDA speculative decoding, plumbed end-to-end

"Speculative decoding" runs a small drafter model ahead of the main one, then has the main model verify several tokens at once. It's a known way to speed up generation. The CUDA backend had no path for it before; this branch lands the full pipeline behind 14 purpose-built CUDA kernels we wrote ourselves — paired Q8 projections, two-token top-2 candidate verifier, candidate-certification + merge, fused MoE down-sum for two tokens, batched FFN body — plus host-side state-barrier transactions, acceptance-history gating, and session persistence. It works and proves byte-equivalent. Throughput isn't yet a net win on the configurations we measured, so it ships behind opt-in flags as experimental.

Pillar 4.5 (MTP exact verifier)

6 — Two guard rails against subtle bugs

(a) Inside the speculative verifier we still use the old matmul kernel, because the drafter was trained against that kernel's exact rounding behavior — mixing kernels would flip tight-margin tokens and collapse draft acceptance. (b) A single env var picks the matmul strategy and the rest of the dispatcher flows from it, with documented fallbacks if the preferred path fails to initialise.

Pillar 4.6 (correctness gates)

7 — Better measurement

The first generated token is much slower than the rest (the model has to "warm up"). Reporting just one average hides that. The fork adds a steady-state generation throughput column to ds4-bench alongside the total, plus a built-in MTP-vs-no-MTP comparison mode and reproducible CSV-emitting frontier sweeps. The HTTP server also emits per-token IDs in the SSE stream at the right structural level so external benchmark tools (vLLM-style) just work against ds4-server.

Pillar 4.7 (bench / server)

8 — An operator's manual

A new AGENT.md documents every environment variable, the matmul dispatcher behaviour, the weight-server-vs-VMM decision tree, and the safety rules. Plus operator guides for CUDA MTP and the proof harness under docs/, and a vendor-pin file under cuda/mmq/ that records exactly which llama.cpp commit we tracked and how to re-sync from upstream.

Pillar 4.8 (docs & vendor pin)

3. Scope — what changed where

Class | Files | Added | Deleted | Notes
New: vendored matmul kernels | cuda/mmq/{mmq,mma,vecdotq,common,mmid,quantize,unary,mmvq,ggml-*}.{cuh,cu,h} + vendors/cuda.h | ~12,500 | 0 | llama.cpp mmq/mmvq family, pinned at 5c0e9468. ~11.3 k verbatim + ~1.2 k patched (mmvq: gated ggml-backend entries, promoted mul_mat_vec_q_switch_type)
New: ds4-original adapter in cuda/mmq/ | ds4_mmq.{h,cu}, ds4_ggml_stubs.{h,cu} | ~1,920 | 0 | Host C ABI, dispatcher entry points, ggml-stub types, context mgmt: what makes the templated kernels callable without the ggml runtime
New: mmq parity & bench harness | cuda/mmq/test/*, tests/mmq_bench_stats.py, tests/run_mmq_bench.sh | ~1,670 | 0 | CPU-reference parity for Q8_0/Q2_K/IQ2_XXS/MoE paths; tile-width sweep harness
New: weight server | tools/ds4_weight_server.cu | 1,802 | 0 | Fully ds4-original: GGUF parser + CUDA Driver VMM host + Unix-socket FD broker + manifest negotiation + session mgmt
New: proof harness | tests/ds4_proof.py, cuda_mtp_proof_matrix.py, ds4_weight_server_harness_smoke.py, proof/*.json | 2,551 | 0 | Generalized 5-tuple runner
New: docs | AGENT.md, docs/cuda-mtp/README.md, docs/proof-harness/README.md, cuda/mmq/VENDOR.md, speed-bench/.../README.md | 1,026 | 0 | Operator guides, env-var inventory, vendor pin
Modified: core engine | ds4.c | 4,495 | 202 | MTP exact verifier, accept-gate state, helpers
Modified: CUDA backend | ds4_cuda.cu, ds4_gpu.h | 3,380 | 81 | ~1.6 k novel CUDA: 14 MTP verifier kernels (~690 LOC), VMM arena (~250), Q8→FP16 fallback cache (~300), CUDA graph caches (~160), FD broker (~100), Q8_0 strategy + bandwidth probe (~75), verifier gate (~20). Remainder is mmq/mmvq wiring and refactors.
Modified: bench/CLI/server | ds4_bench.c, ds4_cli.c, ds4_server.c | 358 | 8 | Steady-state gen_tps, no-MTP baselines, token_ids SSE
Modified: misc | Makefile, .gitignore, README.md, ds4.h, ds4_metal.m | 377 | 22 | Build glue + Metal counterparts to GPU API extensions
Total | 55 files | 30,723 | 367 | 132 commits

4. Themes

4.1 CUDA mmq kernel lift — vendored platform + ds4-native dispatch

25 commits · ~12.5 k vendored · ~3.6 k ds4-original in this theme · phases 0–8

The previous CUDA backend dispatched every Q8_0 dense matmul through a Q8→FP16 expansion cache plus cublasGemmEx. A profile of the V4 Flash IQ2XXS w2Q2K AProjQ8 SExpQ8 OutQ8 model at ctx=2048 showed the expansion + GEMM as the dominant prefill cost: 373 t/s on PRO 6000 Blackwell. The fix has two layers, and it's important to keep them distinct in the PR conversation: the vendored llama.cpp mmq/mmvq kernel platform, and the ds4-original adapter, dispatcher, and parity tests built on top of it.

Coverage matrix

Dispatch site | Quantization | n_tokens | Routed via
Attention projections (Q/K/V/O), shared expert, lm_head | Q8_0 | ≥ 2 (prefill) | ds4_mmq_q8_0_dense (mmq matrix-shaped)
Attention projections decode | Q8_0 | = 1 | ds4_mmq_q8_0_dense_vec (mmvq)
Dense Q4_K (e.g. attn_output_b) | Q4_K | ≥ 2 | ds4_mmq_q4_K_dense
Routed MoE gate & up | IQ2_XXS / Q4_K | ≥ 2 | ds4_mmq_{iq2_xxs,q4_K}_moe (paired API shares Q8_1 act)
Routed MoE down | Q2_K / Q4_K | ≥ 2 | ds4_mmq_{q2_K,q4_K}_moe
Routed MoE decode | IQ2_XXS / Q2_K / Q4_K | = 1, n_expert_used ≤ 8 | ds4_mmq_*_moe_vec (mmvq)

Dispatcher hierarchy (DS4_CUDA_PREFILL_PATH)

A startup-time strategy probe writes one of three resolved paths into a single dispatch variable, then the per-call hot path is a load of that variable. cuBLAS is initialised regardless of selection because we observed the cuBLAS init triggers driver state that makes mmq ~4× faster on sm_121.

[Figure: Q8_0 prefill throughput by dispatch (V4 Flash, ctx=2048, no weight server), t/s]
Arch | mmq | cublas | warp8 | mmq vs cublas
PRO 6000 Blackwell sm_120 (~1.8 TB/s GDDR7) | 1092 | 373 | 373 | 2.93×
GB10 Spark sm_121 (~546 GB/s LPDDR5X) | 458 | 401 | 56 | +14%
Bar: Q8_0 prefill t/s by dispatch strategy. mmq is the validated default on both arches; cublas remains as an escape hatch and warp8 is reserved for the MTP verifier (Option D, §4.6).

mmvq decode wedge (Step 6)

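At n_tokens=1 the matrix-shaped mmq path wastes column tiling on a single output column. We additionally vendor mmvq.{cu,cuh} and wire two decode-only sites: the dense Q8_0 attention projections (ds4_mmq_q8_0_dense_vec) and the routed-MoE block (ds4_mmq_*_moe_vec, gated on n_expert_used ≤ 8).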

Knobs: DS4_CUDA_NO_MMVQ_DECODE=1 to opt out; DS4_CUDA_MMVQ_DECODE_MAX_TOKENS=N to extend mmvq into short prefill batches (still bound by the gate above).
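The interaction of the two knobs with the decode gate can be sketched as a small predicate. This is a minimal illustration in Python, not the CUDA backend's code: the function name `use_mmvq_decode` is hypothetical, and clamping the cap to the documented 0–8 range is an assumption about how out-of-range values are handled.

```python
def use_mmvq_decode(n_tokens, n_expert_used=None, env=None):
    """Return True if a call would take the mmvq (vector-shaped) path."""
    env = env or {}
    if env.get("DS4_CUDA_NO_MMVQ_DECODE") == "1":
        return False  # explicit opt-out
    # Default cap is 1 token; clamping to 0-8 is an assumption.
    cap = int(env.get("DS4_CUDA_MMVQ_DECODE_MAX_TOKENS", "1"))
    cap = min(8, max(0, cap))
    if n_tokens < 1 or n_tokens > cap:
        return False
    # Routed-MoE sites are additionally bound by n_expert_used <= 8.
    if n_expert_used is not None and n_expert_used > 8:
        return False
    return True
```

With the defaults, only single-token calls qualify; raising the cap admits short prefill batches, but the routed-MoE expert bound still applies.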

CUDA Graph capture+replay (Step 8, default-on with opt-out)

Each kernel sequence in the mmvq routed-MoE decode block and the n_tok=1 dense Q8_0 vec path is captured into a cudaGraphExec_t on first execution with a given (layer-shape, buffer-pointer) tuple. The MoE cache holds 256 entries; the dense Q8_0 cache holds 1024 entries. Replay eliminates ~5–15µs of CPU↔driver round-trip per launch — the dominant overhead at decode where individual kernels are small.
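The capture-once, replay-afterwards pattern keyed on a (layer-shape, buffer-pointer) tuple can be modelled in a few lines. The sketch below is plain Python standing in for the CUDA side: `GraphCache` is a hypothetical name, `capture` stands in for stream capture plus graph instantiation, and the behaviour when the cache is full (stop inserting, no eviction) is an assumption, since the text only states the capacities.

```python
class GraphCache:
    """Capture on first use of a key, replay on every later use."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}
        self.captures = 0
        self.replays = 0

    def run(self, layer_shape, buffer_ptrs, capture):
        key = (layer_shape, buffer_ptrs)
        graph = self.entries.get(key)
        if graph is None:
            graph = capture()            # stands in for capture + instantiate
            self.captures += 1
            if len(self.entries) < self.capacity:   # assumption: no eviction
                self.entries[key] = graph
        else:
            self.replays += 1            # stands in for a graph launch
        return graph()


moe_cache = GraphCache(capacity=256)     # MoE cache size from the text
make_graph = lambda: (lambda: "out")
moe_cache.run((4096, 2048), (0x1000, 0x2000), make_graph)
moe_cache.run((4096, 2048), (0x1000, 0x2000), make_graph)   # same key: replay
moe_cache.run((4096, 2048), (0x1000, 0x3000), make_graph)   # new pointer: capture
```

Because buffer pointers are part of the key, a change in any buffer address forces a fresh capture rather than replaying a graph that references stale memory.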

Default ON again (commit 7967154) after the cross-stream race that motivated the b66b5d6 revert was diagnosed and fixed. Two legs: (a) captured outputs read by stream=0 without a wait; (b) captured inputs read on g_moe_stream before stream=0 finished writing them. Both are now closed by cudaEventRecord + cudaStreamWaitEvent brackets around every cudaGraphLaunch (commit 687c783). Validated on PRO 6000 (sm_120) and GB10 (sm_121): smoke parity ON vs OFF, MTP-active output coherent on the previously-failing path, ds4-bench gen positive on PRO 6000 (sweep CSV) and flat-within-noise on GB10. Opt-out via DS4_CUDA_MOE_GRAPHS=0.

Parity tests

cuda/mmq/test/test_mmq_parity.cu (1,245 LOC) compares every wired mmq shape to a CPU reference at multiple n_tokens and tile widths. The bench harness tests/run_mmq_bench.sh + tests/mmq_bench_stats.py drove the X_max sweep that picked the default tile (X=128 vanilla wins on sm_120 by ~6–20% over X∈{32,64,96}).

4.2 In-process CUDA VMM weight arena

5 commits · +170 LOC ds4_cuda.cu · 2.04× prefill on PRO 6000 · 0% on GB10 (neutral)

Background: when ds4 loads weights via cudaMalloc, the legacy arena packs all ~80 GB of weights into one large chunk where each tensor sits at an arbitrary 256-byte-aligned internal offset. The ds4_weight_server sidecar avoids this by using the CUDA Driver VMM API (cuMemCreate + cuMemAddressReserve per range), giving each weight tensor its own 2 MiB-aligned virtual address. The fork extracts the same machinery into the in-process path so single-process runs (ds4-bench, ds4-server, one-shot CLI) reach the same prefill ceiling without spawning a sidecar. The chunk-size bisect we ran during this work updated our understanding of why the VMM arena is fast. The original framing ("2 MiB pages reduce TLB pressure") is incomplete: VMM with one large 1792 MiB chunk performs identically to cudaMalloc (~1080 t/s prefill on PRO 6000), even though the cuMemCreate-backed memory is still 2 MiB-paged. The actual differentiator is per-tensor 2 MiB-aligned base addresses: when each weight tensor sits at its own fresh cuMemAddressReserve-handed VA, matmul kernels' tile-load coalescing and L2 spatial-locality patterns improve enough to roughly double prefill. Pack the same VMM-paged memory into one big chunk and the bases land at sub-granularity offsets — the perf advantage disappears.
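To make the base-address argument concrete, the following sketch compares the two layouts on made-up tensor sizes. It is a toy address model in Python, not driver code: the helper names and sizes are hypothetical, and real VMM bases come from cuMemAddressReserve rather than from a contiguous counter.

```python
ALIGN_2MIB = 2 * 1024 * 1024

def packed_bases(sizes, chunk_base, align=256):
    """Bump-allocate tensors inside one chunk at `align`-byte offsets."""
    bases, offset = [], 0
    for size in sizes:
        offset = (offset + align - 1) // align * align
        bases.append(chunk_base + offset)
        offset += size
    return bases

def per_tensor_bases(sizes, va_start):
    """Give every tensor its own 2 MiB-aligned base address."""
    bases, va = [], va_start
    for size in sizes:
        bases.append(va)
        va += (size + ALIGN_2MIB - 1) // ALIGN_2MIB * ALIGN_2MIB
    return bases

def aligned_count(bases):
    return sum(1 for b in bases if b % ALIGN_2MIB == 0)

sizes = [3_000_000, 5_000_000, 1_234_567]        # made-up sizes in bytes
packed = packed_bases(sizes, chunk_base=100 * ALIGN_2MIB)
separate = per_tensor_bases(sizes, va_start=100 * ALIGN_2MIB)
print(aligned_count(packed), aligned_count(separate))
```

In the packed layout only the first tensor inherits the chunk's alignment; every later base lands at a sub-granularity offset, which is the situation the bisect associates with the lost prefill advantage.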

Mechanism

  1. cuda_vmm_arena_supported() — on first call, probes cuMemGetAllocationGranularity. Hard-gated off when DS4_CUDA_WEIGHT_IPC_MANIFEST is set (sidecar already provides identical VMM ranges — running both would double-allocate). Soft-gated off via DS4_CUDA_VMM_ARENA=0.
  2. cuda_vmm_arena_alloc(bytes, label): bump-pointer over the sequence cuMemCreate(CU_MEM_HANDLE_TYPE_NONE) → cuMemAddressReserve → cuMemMap → cuMemSetAccess(PROT_READWRITE). Default chunk size = request size rounded up to granularity, matching the weight server's per-range allocation. Override via DS4_CUDA_VMM_ARENA_CHUNK_MB=N.
  3. cuda_model_range_ptr_from_fd hot path tries VMM first, falls back to the legacy cuda_model_arena_alloc (cudaMalloc) on probe failure.
  4. Teardown via cuda_vmm_arenas_release_all from the existing cuda_model_range_release_all path.
Why default chunk = request size: the first sanity probe used a 1024 MiB chunk and would have allocated 138 chunks for 138 weight ranges — 138 GiB on a 96 GiB card. Commit 6a89ea5 defaulted mb=0 (chunk = request size). Verified post-fix: 138 chunks, 80.77 GiB allocated for 80.76 GiB raw (0.01% overhead).
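The arithmetic behind the chunk-size default can be checked directly. This is a small Python sketch using only the figures quoted above; `round_up` is an illustrative helper, and the 2 MiB granularity is taken from the alignment discussed in this section.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB

def round_up(nbytes, granularity):
    """Round a request up to the next multiple of the allocation granularity."""
    return (nbytes + granularity - 1) // granularity * granularity

n_ranges = 138
# One fixed 1024 MiB chunk per weight range, as in the first sanity probe.
fixed_chunk_total = n_ranges * 1024 * MIB

# Allocated vs raw GiB after the fix, as reported in the text.
overhead = 80.77 / 80.76 - 1.0

print(fixed_chunk_total / GIB)        # GiB requested with fixed chunks
print(round(overhead * 100, 2))       # percent overhead with chunk = request size
```

With chunk = request size, the worst-case padding per range is just under one granule, which is why the total overhead stays far below the fixed-chunk case.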
[Figure: Prefill t/s vs ctx tokens (2048 to 32768), arena baseline vs in-process VMM. Series: PRO 6000 arena (legacy cudaMalloc 4 KiB pages); PRO 6000 in-process VMM (per-tensor 2 MiB-aligned bases, new default); GB10 arena/VMM (overlap, no delta on integrated).]
Source: local/docs/ds4_vmm_landing_merged/{pod,gb10}_{arena,vmm}.csv. PRO 6000 holds 2.04× at ctx=2048 and 1.87× at ctx=32768. GB10 lines overlap because the integrated LPDDR5X path already operates near the page-table-insensitive limit.

Why GB10 is neutral and that's fine

On the integrated Spark, weights live in the same LPDDR5X pool as everything else and the per-tensor-base-alignment effect that drives the discrete-GPU win doesn't translate — VMM yields no measurable delta (−0.16% mean across the sweep, no row worse than −0.75%). The earlier worry that integrated GPUs would OOM under VMM was wrong — the weight server has run on GB10 with --reserve-gb 24 for weeks; in-process VMM has the same memory profile.

4.3 CUDA weight server sidecar

12 commits · 1,802 LOC tools/ds4_weight_server.cu · 2 backends (VMM, IPC)

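A standalone CUDA process that owns the weight allocations and exposes them to one or more ds4 workers through a manifest file. Two transports (backends) are available: VMM, where workers receive the server's Driver VMM ranges through the Unix-socket FD broker, and IPC.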

Operating envelope

When to use which (operator decision tree)

Workload | Recommended path | Why
One-shot ./ds4 -p ..., single ds4-bench, single ds4-server | In-process VMM arena (auto) | Same per-tensor 2 MiB-aligned base layout as the sidecar, zero setup tax.
Proof harness running N profiles in parallel | Weight server, scope=base | Base upload (~20–70 s, NVMe-dependent) amortised over N workers.
MTP correctness work (base + MTP gguf concurrent) | Weight server, scope=both | Single-allocation fragmentation can OOM even with sufficient free VRAM.
Multi-profile bench sweeps | Weight server, scope=base | Same as proof harness.
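The "amortised over N workers" rows reduce to simple arithmetic, sketched below in Python. The model is deliberately simplified and is an assumption: it charges each unshared worker one full base upload and ignores whatever a worker spends importing ranges from the server.

```python
def upload_seconds(n_workers, base_upload_s, shared):
    """Total base-weight upload time across all workers."""
    return base_upload_s if shared else n_workers * base_upload_s

def saved_seconds(n_workers, base_upload_s):
    """Upload time avoided by sharing: (N - 1) uploads."""
    return (upload_seconds(n_workers, base_upload_s, shared=False)
            - upload_seconds(n_workers, base_upload_s, shared=True))

# Four workers at the fast and slow ends of the quoted 20-70 s range.
print(saved_seconds(4, 20), saved_seconds(4, 70))
```

For a single worker the saving is zero, which matches the recommendation to use the in-process arena for one-shot runs.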

4.4 Generalized engine proof harness

27 commits · 1,860 LOC tests/ds4_proof.py · 5-tuple data model

A proof run is modeled as the 5-tuple profile × suite × prompt × budget × contract.
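One way to picture the 5-tuple is as a frozen record expanded by a cross product. The sketch below is an illustration only: `ProofCase`, `expand`, and all field values except the best-nomtp profile name are hypothetical, and whether tests/ds4_proof.py enumerates the full cross product is an assumption suggested by the × notation.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class ProofCase:
    profile: str    # engine configuration under test
    suite: str      # named group of checks
    prompt: str     # input fed to every profile
    budget: int     # generation budget (tokens, assumed)
    contract: str   # what must hold, e.g. byte-identical output

def expand(profiles, suites, prompts, budgets, contracts):
    """Enumerate every combination of the five axes."""
    return [ProofCase(*combo)
            for combo in product(profiles, suites, prompts, budgets, contracts)]

cases = expand(["best-nomtp", "exact-mtp"], ["smoke"], ["p0", "p1"],
               [64], ["byte-identical"])
print(len(cases))
```

Freezing the record makes each case hashable, so results can be keyed by case when comparing profiles against each other.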

Lifecycle ownership

--start-weight-server hands the sidecar's lifecycle to the runner. The runner:

  1. Runs --dry-run preflight; refuses to start if it won't fit.
  2. Launches the server with --exit-on-parent-pid.
  3. Polls the manifest for readiness; only sets DS4_CUDA_WEIGHT_IPC_MANIFEST in worker env after ready.
  4. On profile failure, still tears the server down. If the server dies mid-run, the whole proof fails.
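The four steps above amount to a guarded lifecycle in which teardown always runs. A minimal Python sketch follows; the function and its callable parameters are hypothetical stand-ins for the runner's internals, and only the step order, the flag names in the comments, and the DS4_CUDA_WEIGHT_IPC_MANIFEST variable come from the text.

```python
def run_with_weight_server(preflight, launch, wait_ready,
                           run_profiles, is_alive, teardown):
    """Order the lifecycle steps; every argument is a caller-supplied callable."""
    if not preflight():                      # 1. --dry-run preflight must fit
        return {"passed": False, "reason": "preflight"}
    server = launch()                        # 2. started with --exit-on-parent-pid
    try:
        manifest = wait_ready(server)        # 3. poll until the manifest is ready
        worker_env = {"DS4_CUDA_WEIGHT_IPC_MANIFEST": manifest}
        profiles_ok = run_profiles(worker_env)
        if not is_alive(server):             # server died mid-run: proof fails
            return {"passed": False, "reason": "server-died"}
        return {"passed": bool(profiles_ok),
                "reason": None if profiles_ok else "profile"}
    finally:
        teardown(server)                     # 4. torn down even on failure


log = []
result = run_with_weight_server(
    preflight=lambda: True,
    launch=lambda: "server-handle",
    wait_ready=lambda server: "/tmp/manifest.json",   # hypothetical path
    run_profiles=lambda env: False,                   # simulate a failing profile
    is_alive=lambda server: True,
    teardown=lambda server: log.append("teardown"),
)
```

Setting the manifest variable only after readiness means a worker can never observe a half-written manifest.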

Verdict surface

The JSON report carries a top-level weight_server_validation verdict that automation can gate on. It checks ready state, backend, scope, preflight result, upload telemetry for the requested model scope, parent-PID guard, lock acquisition, shutdown observation, and clean termination. VMM runs additionally check support telemetry, plans, broker startup, and broker request activity. weight_server carries the raw command, manifest path, log path, dry-run preflight, startup time, and cleanup result.
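The verdict can be thought of as an all-checks-must-pass aggregation, with extra checks for VMM runs. The sketch below is a hedged illustration in Python: the key names are hypothetical labels for the checks listed above, not the report's actual JSON fields.

```python
BASE_CHECKS = ["ready", "backend", "scope", "preflight", "upload_telemetry",
               "parent_pid_guard", "lock_acquired", "shutdown_observed",
               "clean_termination"]
VMM_CHECKS = ["vmm_support", "vmm_plans", "broker_started", "broker_requests"]

def weight_server_verdict(observed, backend):
    """Pass only if every required check was observed as true."""
    required = BASE_CHECKS + (VMM_CHECKS if backend == "vmm" else [])
    failed = [name for name in required if not observed.get(name, False)]
    return {"passed": not failed, "failed": failed}

all_base = {name: True for name in BASE_CHECKS}
print(weight_server_verdict(all_base, backend="ipc"))
print(weight_server_verdict(all_base, backend="vmm"))
```

Reporting the list of failed checks alongside the boolean lets automation gate on one field while still leaving a trail for diagnosis.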

MTP-specific reporting

23cd345 promoted MTP acceptance to a first-class metric so optimisation runs can read it directly from the report instead of grepping logs. 59cd8d2 exposes derived weight artifacts (prebuilt Q8 expansion tables, etc.) so the harness can re-run with the same artifacts the WS owner produced.

4.5 MTP exact verifier path — experimental enablement

35+ commits · +~3 k LOC ds4.c orchestration · 14 novel CUDA kernels (~690 LOC) · CUDA MTP enabled end-to-end · opt-in: throughput neutral-to-negative

The deliverable here is enablement, not a perf win. Prior to this work the CUDA backend had no exact speculative-decoding path; this branch lands the full pipeline — top-2 draft, paired Q/KV projections, fused MoE down-sum, exact 2-token verifier with state-barrier rollback, certified row-0 logits, acceptance-history gating, and session-snapshot persistence of the accept-gate state — behind opt-in env flags. The throughput case is currently neutral to slightly negative vs no-MTP on the workloads we measured, so MTP is shipped as experimental and is not part of the recommended fast baseline.

Novel CUDA kernels we wrote for the verifier

These are not adaptations of vendored code — they are purpose-built for ds4's two-token exact-verification model and do not exist in cuda/mmq/ upstream:

Kernel family | Kernel symbols | ~LOC | Job
Top-2 logits (decode pair) | matmul_q8_0_top2_warp8_kernel, matmul_q8_0_top2_logits_n2_warp8_kernel | ~108 | Argmax + runner-up for row 0; row-1 logits piped straight to the verifier without a full output projection.
Candidate certification & merge | matmul_q8_0_candidates_warp8_kernel, q8_0_row_group_norms_warp_kernel, q8_0_x_group_norms_kernel, q8_0_candidate_certify_prune_warp8_kernel, q8_0_candidate_certify_merge_kernel, q8_0_top2_merge_kernel | ~280 | Proves the drafted row-1 token is the row-0 argmax under a derived norm bound, so the row-0 top-2 scan can be skipped on certified pairs; falls back to exact top-2 on miss.
Paired Q8 projections / batched FFN body | matmul_q8_0_pair_preq_warp8_kernel, matmul_q8_0_pair_preq_batch_warp8_kernel, matmul_q8_0_hc_expand_preq_warp8_kernel, matmul_q8_0_hc_expand_preq_n2_warp8_kernel, matmul_q8_0_preq_batch_warp8_kernel, matmul_q8_0_preq_n2_warp8_kernel | ~275 | Shared Q8_0 activation across gate+up; paired attention output A; HC-row and prefix-row direct writes; scalar-order n2 path for exact FFN body batching across the two verifier tokens.

Plus ds4_gpu_set_mtp_verifier / g_in_mtp_verifier — a 20-LOC thread-local gate that forces the Q8_0 dispatcher onto warp8 for the duration of a verifier call, because mmq's stream-k + MMA FP32 reduction order drifts ~1 ULP/layer from the legacy kernel and the drafter is trained against legacy decoding (analyst measured 0/314 acceptance on GB10 with an mmq verifier active).

The fork's MTP work targets exact (bit-identical to no-MTP) speculative decoding on CUDA, building toward a 2-token verifier. The headline path:

  1. Top-2 draft (DS4_CUDA_MTP_TOP2=1) — CUDA top-2 output for draft decisions.
  2. Verify top-2/top-1 (DS4_CUDA_MTP_VERIFY_TOP2=1) — verifier shortcut when full logits are not needed.
  3. Pair-batched Q projections (3cad23c) — two-token Q/KV projections batched into one kernel.
  4. Fused MoE down-sum (6b9fe82) — verifier accumulates two-token MoE down in a single pass.
  5. Exact pair output projection verifier (4b8518b, f6051da) — default-on for CUDA after parity tests.
  6. Certified row-0 logits (614b5a8, opt-in via DS4_MTP_CERT_LOGITS=1) — proves drafted row1 token is row0 argmax before skipping the row0 top-2 scan; uncertified rows fall back to exact top-2.
  7. State-barrier verifier transaction (68166ad) — the verifier appears atomic from the rest of the engine; rejected speculations cleanly roll back KV state.
  8. Acceptance-history gating (f72ba58, d71ee5e, aad1401) — exact speculation only kicks in once recent acceptance exceeds a threshold, so cold prompts don't pay the speculation tax.
  9. Session-snapshot persistence (6a86411) — accept-gate state survives --save-cache / --load-cache.
  10. Decode2 layer entrypoint + batched routed MoE FFN (a5af40d, af5cd0a, 0b99bea, a1f4723, 7a20ab7, 454f539, ce61a50) — the exact verifier replays a full decoder layer for the pair, with in-place HC/prefix-row writes and routed MoE batched inside.
Why experimental, not default-on: on GB10 at --mtp-draft 2 over a full-context sweep, exact MTP is effectively tied with no-MTP (342 t/s prefill, 12.6 t/s gen mean), and it pays a ~3.7 s setup tax at gen=128 that the current generation lengths don't amortise. The plumbing is exact and the harness can prove it; the wash is structural — higher draft depths, lower verifier scheduling overhead, or longer generations are the levers to break it. Until that happens the showcase config in benchmarks stays best-nomtp and MTP is reached only by setting DS4_CUDA_MTP_TOP2=1 + DS4_CUDA_MTP_VERIFY_TOP2=1 + --mtp.
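The acceptance-history gating in item 8 can be sketched as a sliding window over recent accept/reject outcomes. This is a minimal Python illustration, not the ds4.c implementation: the class name, window size, threshold, and minimum sample count are all hypothetical, and it assumes acceptance outcomes are still observable while the gate is closed.

```python
from collections import deque

class AcceptGate:
    """Open the speculation gate only when recent acceptance is high enough."""

    def __init__(self, window=32, threshold=0.5, min_samples=8):
        self.history = deque(maxlen=window)   # oldest outcomes fall off
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, accepted):
        self.history.append(bool(accepted))

    def should_speculate(self):
        if len(self.history) < self.min_samples:
            return False                      # cold prompt: skip speculation
        rate = sum(self.history) / len(self.history)
        return rate > self.threshold


gate = AcceptGate(window=4, threshold=0.5, min_samples=2)
cold = gate.should_speculate()
gate.record(True); gate.record(True)
warm = gate.should_speculate()
for _ in range(3):
    gate.record(False)
cooled = gate.should_speculate()
```

Because the window is bounded, a run of rejections closes the gate again, which limits how long a poorly predicted stretch keeps paying for verification.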

4.6 Correctness gates

Two structural decisions defend the fork against silent regressions:

(a) Option D — legacy kernels inside the MTP verifier

DS4_CUDA_MTP_VERIFIER_USE_MMQ default unset / 0. The CUDA backend honors ds4_gpu_set_mtp_verifier(1) bracketing by routing all Q8_0 dense matmuls (and routed-MoE dispatch via the same gate) onto the legacy warp8 kernels for the duration of one verifier call. Necessary because mmq's stream-k + MMA FP32 reduction order drifts ~1 ULP/layer from warp8; the drafter is trained against legacy-style decoding, so an mmq verifier flips tight-margin argmax tokens and collapses draft acceptance (analyst measured 0/314 on GB10 with mmq verifier active). Setting the env var to 1 reproduces the broken behavior for bisection.
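The bracketing behaviour can be modelled as a thread-local flag that overrides the resolved dispatch path for the duration of one call. The Python sketch below mirrors the idea only: `mtp_verifier` and `q8_0_path` are hypothetical names, and the `use_mmq` argument stands in for DS4_CUDA_MTP_VERIFIER_USE_MMQ=1.

```python
import threading
from contextlib import contextmanager

_state = threading.local()

@contextmanager
def mtp_verifier(use_mmq=False):
    """Force legacy kernels while a verifier call is in flight."""
    previous = getattr(_state, "in_verifier", False)
    _state.in_verifier = not use_mmq     # use_mmq=True bypasses the gate
    try:
        yield
    finally:
        _state.in_verifier = previous    # restore even if the call raises

def q8_0_path(resolved):
    """Pick the kernel family for one Q8_0 matmul on this thread."""
    return "warp8" if getattr(_state, "in_verifier", False) else resolved


outside = q8_0_path("mmq")
with mtp_verifier():
    inside = q8_0_path("mmq")
with mtp_verifier(use_mmq=True):
    bisect = q8_0_path("mmq")
after = q8_0_path("mmq")
```

Keeping the flag thread-local means a verifier call on one thread does not change which kernels another thread's ordinary decode uses.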

(b) Dispatcher with explicit downgrade chain

On init failure, mmq downgrades to cublas, cublas downgrades to warp8. Strategy logged once on first dispatch with arch and bandwidth, e.g.

ds4: CUDA Q8_0 dispatch: mmq (sm_120, 1792 GB/s memory bandwidth) [default]

DS4_CUDA_PREFILL_PATH=mmq|cublas|warp8|auto is the modern knob; the legacy DS4_CUDA_USE_MMQ=0 still works and resolves to cublas. DS4_CUDA_PREFILL_PATH takes precedence if both are set.
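The precedence and downgrade rules can be captured in a short resolver. This is a Python sketch of the described behaviour, not the backend's code: the function names are hypothetical, treating unrecognised values like auto is an assumption, and `init_ok` stands in for whatever initialisation check the real dispatcher performs.

```python
DOWNGRADE = {"mmq": "cublas", "cublas": "warp8"}

def requested_path(env):
    """Apply precedence: DS4_CUDA_PREFILL_PATH wins over DS4_CUDA_USE_MMQ."""
    path = env.get("DS4_CUDA_PREFILL_PATH")
    if path is not None:
        # auto (and, by assumption, any unknown value) resolves to mmq.
        return path if path in ("mmq", "cublas", "warp8") else "mmq"
    if env.get("DS4_CUDA_USE_MMQ") == "0":
        return "cublas"                  # legacy alias
    return "mmq"

def resolve(env, init_ok):
    """Walk the downgrade chain until a path initialises."""
    path = requested_path(env)
    while not init_ok(path) and path in DOWNGRADE:
        path = DOWNGRADE[path]
    return path

print(resolve({}, lambda p: True))
print(resolve({"DS4_CUDA_USE_MMQ": "0"}, lambda p: True))
print(resolve({}, lambda p: p != "mmq"))
```

Resolving once and storing the result keeps the per-call hot path to a single variable load, as described for the startup-time probe.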

4.7 Bench, observability, server

8 commits · 358 LOC across bench/CLI/server

ds4-bench
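A minimal sketch of why a steady-state generation figure differs from the total, assuming steady-state means "excluding the first generated token". That definition, the function name, and the timings are assumptions for illustration; the exact formula behind the gen_tps_ss column is not stated here.

```python
def gen_throughput(token_times):
    """token_times[i] is seconds from generation start to token i."""
    n = len(token_times)
    total_tps = n / token_times[-1]
    if n > 1:
        # Drop the first token and the time spent producing it.
        steady_tps = (n - 1) / (token_times[-1] - token_times[0])
    else:
        steady_tps = total_tps
    return total_tps, steady_tps

# Made-up timings: a slow first token, then one token every 0.25 s.
times = [0.5, 0.75, 1.0, 1.25, 1.5]
total, steady = gen_throughput(times)
print(round(total, 2), steady)
```

The gap between the two numbers shrinks as the generation gets longer, which is why short benchmark runs benefit most from reporting both.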

CLI

Server / SSE

4.8 Documentation & vendor pin

42 commits touch docs · ~1 k LOC across 5 docs

5. Performance results

5.1 Headline prefill at ctx=2048 (V4 Flash IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8)

Arch | Upstream baseline (cublas + cudaMalloc) | Fork: mmq + arena (prior default) | Fork: mmq + in-process VMM (current default) | Total speedup
PRO 6000 Blackwell sm_120 | ~373 t/s | 1078.86 t/s | 2193.29 t/s | 5.88×
GB10 Spark sm_121 | ~401 t/s | 461.24 t/s | 460.49 t/s | 1.15×

PRO 6000 numbers from local/docs/ds4_vmm_landing_merged/pod_{arena,vmm}.csv; GB10 from the matching gb10_{arena,vmm}.csv. Upstream baseline figures from the AGENT.md dispatch table. Each fork column reports a single ctx=2048 frontier; the full sweeps show the VMM gain holds (1.87× arena→VMM at ctx=32768 on PRO 6000) and GB10 is flat across the sweep (the ×1.15 there is purely the mmq dispatcher, not page layout).
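The speedup columns can be reproduced from the table values with a few divisions. The Python sketch below uses only the figures in the table; the upstream entries are the approximate (~) baselines, so the ratios inherit that approximation.

```python
rows = {
    "PRO 6000": {"upstream": 373.0, "arena": 1078.86, "vmm": 2193.29},
    "GB10": {"upstream": 401.0, "arena": 461.24, "vmm": 460.49},
}

def speedups(row):
    """Return (dispatch, layout, total) ratios for one architecture."""
    dispatch = row["arena"] / row["upstream"]   # mmq vs cublas, same layout
    layout = row["vmm"] / row["arena"]          # VMM vs arena, same dispatch
    total = row["vmm"] / row["upstream"]
    return dispatch, layout, total

for arch, row in rows.items():
    dispatch, layout, total = speedups(row)
    print(arch, round(dispatch, 2), round(layout, 2), round(total, 2))
```

Within this table the two per-stage ratios multiply to the total by construction. The 2.93× dispatch figure quoted elsewhere comes from the separate dispatch chart (1092 vs 373 t/s), so multiplying it by 2.04× gives about 5.98 rather than exactly 5.88.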

5.2 Generation throughput vs upstream — PRO 6000

The mmq lift on its own leaves decode roughly unchanged (mmq is matrix-shaped — at n_tokens=1 there's nothing to tile across). The decode gain comes from a separate piece of work: vendoring llama.cpp's mmvq vector-matmul family and routing the n_tok=1 routed-MoE and dense Q8_0 attention projection paths through it (Step 6 of the mmq optimisation plan).

Stage | Dispatch path | Gen t/s @ ctx=2048 | vs upstream
Upstream baseline | cuBLAS + Q8→FP16 expansion + legacy fused decode | ~38.0 | baseline
mmq prefill, legacy decode | USE_MMQ=1, NO_MMVQ_DECODE=1 | ~39.5 | +3.9%
mmq + mmvq decode (prior default) | USE_MMQ=1, mmvq decode on | 43.2 | +13.7%
+ CUDA graphs with stream sync (current default) | auto, opt-out via DS4_CUDA_MOE_GRAPHS=0 | 44.2 | +16.3%

Source: speed-bench on PRO 6000 Blackwell sm_120, CUDA 13.0, V4 Flash IQ2XXS GGUF. The legacy / mmq-only / mmvq-decode rows are from local/docs/ds4_mmq_optimization_session2.html (gen-tokens=128, n=10, p=0.0079 for each pairwise step). The graphs-with-sync row is from the post-fix bench (commits 687c783 + 7967154): ctx=2048 gen 43.31 → 44.20 t/s. On GB10 the same fix doesn't translate into a measurable decode gain: the committed sweep CSV shows gen at ctx=2048 as 14.17 → 14.16 t/s, flat within run-to-run noise. LPDDR5X bandwidth caps GB10 decode regardless of dispatch path. The PRO 6000 graphs gain is also smaller than the pre-revert benchmark because the new pre/post sync brackets serialize stream=0 with g_moe_stream, trading some parallelism for the correctness fix.

[Figure: Generation throughput trajectory on PRO 6000 (ctx=2048, gen=128), tok/s. Upstream baseline: 38.0; mmq prefill only: 39.5 (+3.9%); mmq + mmvq decode: 43.2 (+13.7%); + CUDA graphs with stream sync (current default): 44.2 (+16.3%).]
Each stage is cumulative on the one above; the current shipping default is the fourth bar.

5.3 Decode is preserved under the VMM layout switch

The arena→VMM transition is a memory-layout change for the weights, not a kernel change — decode is bandwidth-bound, so VMM should be neutral on it. Verified:

Arch | gen_tps_ss arena | gen_tps_ss VMM | Δ
PRO 6000 (ctx=2048, gen=128) | 43.21 | 43.44 | +0.5%
PRO 6000 (ctx=32768, gen=128) | 37.77 | 38.05 | +0.7%
GB10 (ctx=2048, gen=64) | 14.11 | 14.12 | +0.1%
GB10 (ctx=16384, gen=64) | 13.52 | 13.49 | −0.2%

The only regression is −0.2% (GB10, ctx=16384); every other row is flat or slightly positive. VMM is safe to enable as the single-process default for decode-heavy workloads.

5.4 Single-machine speed snapshot at ~12k-token prompt

Single-run CLI numbers for both CUDA targets alongside the existing Mac entries in the project README, with identical settings on each: --ctx 32768 --nothink --temp 0 -n 256, q2 quant, long prompt = first ~40 kB of speed-bench/promessi_sposi.txt (12,461 tokens on the V4 Flash tokenizer). Mac numbers are unchanged from the existing table; the two CUDA rows are the fresh measurements taken from the same build the rest of this report describes (mmq + mmvq + graphs with bidirectional stream sync + in-process VMM arena, all default-on).

[Figure: Prefill at ~12k-token prompt, tok/s]
Machine | Backend / memory | Prefill tok/s
PRO 6000 Blackwell | CUDA, 96 GB GDDR7 | 1920.7
Mac Studio M3 Ultra | Metal, 512 GB, q2 | 468.0
Mac Studio M3 Ultra | Metal, 512 GB, q4 | 448.8
DGX Spark GB10 | CUDA, 128 GB | 403.6
MacBook Pro M3 Max | Metal, 128 GB, q2 | 250.1
PRO 6000 Blackwell leads long-prompt prefill at ~4.1× the fastest Mac entry and ~7.7× the M3 Max baseline. The two CUDA rows are new; the Mac rows are unchanged from the README table.
[Figure: Generation throughput at ~12k-token prompt, tok/s]
Machine | Gen tok/s
PRO 6000 Blackwell | 41.1
M3 Ultra q2 | 27.4
M3 Ultra q4 | 26.6
DGX Spark GB10 | 13.6
M3 Max q2 | 21.5
PRO 6000 also leads generation (~1.5× M3 Ultra). GB10 stays decode-bandwidth-bound on integrated LPDDR5X. Order kept the same across both charts for visual continuity.

Both CUDA rows use the imatrix-tuned q2 variant (...-imatrix.gguf) for apples-to-apples comparison. As a sanity check we also re-ran GB10 against the non-imatrix q2 variant: results landed within 0.3–1.7% of the imatrix numbers across both prefill and gen, i.e. inside run-to-run noise — imatrix vs non-imatrix doesn't change the throughput story at this granularity.

5.5 MTP exact verifier — GB10 full-context sweep (experimental opt-in)

Run | Prefill mean | Gen mean | Gen first frontier | Gen last frontier (38912)
no-MTP (showcase baseline) | 342.20 t/s | 12.68 t/s | 14.00 t/s | 11.63 t/s
exact MTP, draft=2 (experimental) | 341.02 t/s | 12.62 t/s | 12.53 t/s | 11.68 t/s

Source: speed-bench/mtp-compare-2026-05-14/{gb10_nomtp,gb10_exact_mtp}.csv. The point of these numbers is to validate that the new CUDA MTP path runs end-to-end and stays exact; throughput is currently neutral-to-negative (behind on the cold frontier from setup tax, marginally ahead at the largest contexts). Ships opt-in until a configuration breaks the wash in MTP's favor.

6. Public surface, env vars, and back-compat

New env-var surface

| Variable | Default | Effect |
|---|---|---|
| DS4_CUDA_PREFILL_PATH | auto → mmq | Q8_0 dispatch: mmq / cublas / warp8 / auto. Explicit override. |
| DS4_CUDA_USE_MMQ | unset | Legacy alias: 0 = cublas. Lower precedence than DS4_CUDA_PREFILL_PATH. |
| DS4_CUDA_MMQ_MOE_MIN_TOKENS | 2 | Minimum n_tokens at which routed-MoE uses the mmq matrix-shaped path. |
| DS4_CUDA_MMQ_X_MAX | unset (128) | Diagnostic: clip get_mmq_x_max_host to N. |
| DS4_CUDA_NO_MMVQ_DECODE | unset | Opt-out of mmvq for n_tok=1 decode (routed-MoE and dense Q8_0). |
| DS4_CUDA_MMVQ_DECODE_MAX_TOKENS | 1 | Cap on n_tokens routed through the mmvq decode branch (0–8). |
| DS4_CUDA_MOE_GRAPHS | ON | CUDA Graph capture/replay for routed-MoE decode and dense Q8_0 vec, with bidirectional stream sync. Opt-out via 0. |
| DS4_CUDA_MTP_VERIFIER_USE_MMQ | unset / 0 | Bisection switch: 1 reproduces the broken mmq-in-verifier behavior. |
| DS4_CUDA_VMM_ARENA | enabled | 0 disables the in-process VMM allocator (escape hatch). |
| DS4_CUDA_VMM_ARENA_CHUNK_MB | 0 (request-size) | Force a minimum chunk size per cuMemCreate; rarely needed. |
| DS4_CUDA_WEIGHT_IPC_MANIFEST | unset | Worker-side: import weights from the manifest path. Hard-gates in-process VMM off. |
| DS4_CUDA_MTP_TOP2 / DS4_CUDA_MTP_VERIFY_TOP2 | unset | Enable CUDA top-2 draft + verifier shortcut. |
| DS4_MTP_CERT_LOGITS / DS4_MTP_CERT_LOGITS_SHADOW | unset | Opt-in row-0 certificate / shadow validator. |
| DS4_MTP_STRICT (= --quality) | unset | Force byte-identical target stream behavior. |
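The precedence between DS4_CUDA_PREFILL_PATH and the legacy DS4_CUDA_USE_MMQ alias can be sketched as a small resolver. This is a minimal illustration of the rules stated in the table, not the dispatcher in ds4_cuda.cu: the enum, the function name, and the handling of unrecognized values (ignored, as if unset) are assumptions.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names; the real dispatcher in ds4_cuda.cu may differ. */
typedef enum { PATH_MMQ, PATH_CUBLAS, PATH_WARP8 } prefill_path;

/* Resolve the Q8_0 prefill path from the two environment values.
 * Pass NULL for an unset variable (which is what getenv() returns). */
static prefill_path resolve_prefill_path(const char *path_env,
                                         const char *use_mmq_env)
{
    /* 1. DS4_CUDA_PREFILL_PATH is the explicit override and wins when set. */
    if (path_env != NULL) {
        if (strcmp(path_env, "mmq") == 0)    return PATH_MMQ;
        if (strcmp(path_env, "cublas") == 0) return PATH_CUBLAS;
        if (strcmp(path_env, "warp8") == 0)  return PATH_WARP8;
        if (strcmp(path_env, "auto") == 0)   return PATH_MMQ; /* auto -> mmq */
        /* Assumption: an unrecognized value is ignored, as if unset. */
    }

    /* 2. Legacy alias, lower precedence: DS4_CUDA_USE_MMQ=0 selects cublas. */
    if (use_mmq_env != NULL && strcmp(use_mmq_env, "0") == 0)
        return PATH_CUBLAS;

    /* 3. Default: mmq. */
    return PATH_MMQ;
}
```

A caller would typically pass `getenv("DS4_CUDA_PREFILL_PATH")` and `getenv("DS4_CUDA_USE_MMQ")` once at startup and cache the result.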

Public C API additions (ds4_gpu.h)

Back-compat posture

7. Architectural impact diagram

CUDA backend dataflow: upstream vs fork

upstream antirez/ds4
  1. mmap'd GGUF (host pages)
  2. cudaMalloc arena (4 KiB pages)
  3. Q8→FP16 expansion cache
  4. cublasGemmEx (FP16)
  Result: PRO 6000 ctx=2048: 373 prefill t/s; page-table heavy on 80 GiB weights

fork (PR-prep)
  1. mmap'd GGUF (host pages, direct I/O on upload)
  2. In-process VMM arena (cuMemCreate per tensor, 2 MiB-aligned bases); or, when DS4_CUDA_WEIGHT_IPC_MANIFEST is set, imports VMM ranges from ds4_weight_server via Unix-socket broker
  3. DS4_CUDA_PREFILL_PATH dispatcher (auto→mmq): mmq (default), cublas (fallback), warp8 (MTP verifier, Option D), mmvq (n_tok=1 decode)
  Result: PRO 6000 ctx=2048: 2193 prefill t/s, 5.88× upstream; single-process reaches WS ceiling within 0.07%, zero setup tax

tests/ds4_proof.py (profile × suite × prompt × budget × contract): owns weight-server lifecycle, emits pass/fail verdict, gates default-on flips
The fork separates three concerns: (a) where weights live in device memory (VMM in-process; sidecar across processes), (b) which kernel executes the matmul (dispatcher + mmq/mmvq/cublas/warp8), (c) how regressions are caught (proof harness with pass/fail verdict).
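To make the "page-table heavy" point concrete, the sketch below counts how many mappings 80 GiB of weights need at 4 KiB versus 2 MiB granularity. It assumes the 2 MiB base alignment quoted for the VMM arena is also the mapping granularity, which this report does not state explicitly; the helper names are hypothetical.

```c
#include <stdint.h>

#define KIB (1024ULL)
#define MIB (1024ULL * KIB)
#define GIB (1024ULL * MIB)

/* Round n up to a multiple of align. align must be a power of two. */
static uint64_t align_up(uint64_t n, uint64_t align)
{
    return (n + align - 1) & ~(align - 1);
}

/* Number of pages of page_size bytes needed to cover `bytes` bytes. */
static uint64_t page_count(uint64_t bytes, uint64_t page_size)
{
    return align_up(bytes, page_size) / page_size;
}
```

Under these assumptions, `page_count(80 * GIB, 4 * KIB)` is 20,971,520 while `page_count(80 * GIB, 2 * MIB)` is 40,960, a 512× reduction in entries to map. The same `align_up` is the usual way to place each tensor on a 2 MiB-aligned base.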

8. Risks, known issues, gaps

R1 — DS4_CUDA_MOE_GRAPHS resolved. The cross-stream race that motivated the b66b5d6 revert is now diagnosed and fixed (commits 687c783 + 7967154). Two race legs across the g_moe_stream / stream=0 boundary — pre-launch input read and post-launch output read — are closed by cudaEventRecord + cudaStreamWaitEvent brackets around every cudaGraphLaunch. Validated on both targets: smoke parity ON vs OFF bit-identical; MTP-active output coherent on GB10 (was previously garbled); ds4-bench gen positive on PRO 6000 in the committed sweep, flat within run-to-run noise on GB10 (LPDDR5X bandwidth-bound). Default flipped back to ON; opt-out via DS4_CUDA_MOE_GRAPHS=0. The sync overhead trades some parallelism for correctness, so the gain is smaller than the pre-revert benchmark and the GB10 gain in particular does not survive into the committed sweep.
R2 — MTP ships experimental, not as a perf win. The CUDA MTP path is enabled end-to-end and proves exact, but its throughput is currently neutral-to-negative vs no-MTP on the workloads we measured (GB10 draft=2 sweep: 12.62 vs 12.68 t/s gen, −0.5%; cold frontier 1.5 t/s behind from the ~3.7 s setup tax at gen=128). It is reached only via opt-in flags and is excluded from the showcase config. Path to a default-on case: deeper draft acceptance (draft=3+), lower verifier scheduling overhead, or generation regimes where the setup tax amortises. The exact verifier and proof harness remain useful as a correctness floor regardless.
R3 — Vendored kernels carry a re-sync obligation. 15.8 k LOC are pinned to llama.cpp 5c0e9468 (2026-05-14). The two upstream files we patched (mmvq.{cu,cuh}) need rebasing on re-sync. cuda/mmq/VENDOR.md documents the procedure but the cost is real; we should plan re-syncs at most quarterly unless an upstream fix is load-bearing.
R4 — Two-instance OOM hazard on unified-memory boxes. The weight server allocates VMM on the same LPDDR5X pool the rest of the system uses. Starting a second large DS4 process while the WS is resident can OOM. AGENT.md warns about this, but it is still a foot-gun. The instance lock and parent-PID exit guard mitigate it, but it is worth a callout in the PR notes.
R5 — warp8 dispatch is degenerate on GB10. The fallback runs at 56 t/s on GB10 (vs 458 t/s mmq). It's the right correctness baseline for the MTP verifier (Option D) but anyone manually selecting warp8 outside the verifier on Spark will be very surprised. The dispatcher logs the choice once, which helps, but the env-var docs should flag this more loudly.
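R4 mentions an instance lock and a parent-PID exit guard. The following is a minimal sketch of how such guards could look on POSIX, assuming an flock-based lock file. The function names are hypothetical, and the fork's actual mechanism (which may live in the proof harness rather than in C) could differ.

```c
#include <fcntl.h>
#include <sys/file.h>
#include <sys/types.h>
#include <unistd.h>

/* Try to take an exclusive, non-blocking lock on lock_path.
 * Returns an open fd that holds the lock, or -1 if another holder exists
 * or the file cannot be opened. Closing the fd releases the lock. */
static int try_instance_lock(const char *lock_path)
{
    int fd = open(lock_path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    if (flock(fd, LOCK_EX | LOCK_NB) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Nonzero once the process has been re-parented, i.e. the parent that
 * launched it has exited. A long-lived server could poll this and shut
 * down instead of keeping device memory resident. */
static int parent_is_gone(pid_t original_parent)
{
    return getppid() != original_parent;
}
```

A server would record `getppid()` at startup, take the lock before allocating any device memory, and exit early if either check fails. Because flock locks are released when the holder's descriptor is closed or the process dies, a crashed server does not leave a stale lock behind.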

Gaps acknowledged

9. Path to upstreaming

The work is too large for one PR. A natural split:

| # | PR title | Scope | Risk | Reviewer ask |
|---|---|---|---|---|
| 1 | cuda: vendor llama.cpp mmq + adapter + parity tests | cuda/mmq/ + Makefile + cuda/mmq/VENDOR.md | low (additive, default off behind DS4_CUDA_USE_MMQ=1 on the first PR if desired) | vendor-pin policy; licensing already settled: both MIT, ds4 LICENSE credits "The ggml authors" |
| 2 | cuda: route Q8_0 / Q4_K / IQ2_XXS / Q2_K dispatch through mmq | dispatcher in ds4_cuda.cu + env vars + AGENT.md table | medium (perf shift) | bench reproducibility on at least sm_120 + one other arch |
| 3 | cuda: in-process VMM weight arena | 5 commits, ds4_cuda.cu + AGENT.md | low (gated, falls back to legacy on probe failure) | portability across CUDA versions |
| 4 | tools: ds4_weight_server + broker + import API | tools/ds4_weight_server.cu, ds4_gpu.h, AGENT.md | medium (new binary, FD-over-socket protocol) | operator UX, lock file semantics |
| 5 | tests: generalized engine proof harness | tests/ds4_proof.py + docs/proof-harness | low (test-only) | scope of contract + budget naming |
| 6 | mtp: exact verifier + acceptance-history gating + session-snapshot accept-state | ds4.c, ds4_cuda.cu, docs/cuda-mtp | medium (default-on behavior change, but acceptance-gated) | strict-mode equivalence proof |
| 7 | cuda: Option D MTP-verifier kernel routing | tiny ds4.c bracketing + ds4_cuda.cu dispatcher hook | low (default-on) | none beyond #6 |
| 8 | bench/server: steady-state gen_tps + token_ids SSE + session-path no-MTP baselines | ds4_bench.c, ds4_cli.c, ds4_server.c | low (additive) | SSE wire-format match with vLLM |

If antirez prefers a single mega-PR: #1+#2 must stay together (mmq is dead code without the dispatcher), but every other group can be sequenced.

10. Appendix — commits by theme

10.1 mmq lift & dispatcher (27 commits)
101831c cuda: vendor llama.cpp's mmq kernel family for fused dequant matmul
06747ea cuda: implement ds4_mmq_q8_0_dense and parity-test against CPU reference
c7e0b8c cuda: add ds4_mmq_q2_K_dense and ds4_mmq_iq2_xxs_dense
0bf8040 cuda: add MoE _id matmul wrappers (ds4_mmq_{q8_0,q2_K,iq2_xxs}_moe)
39d3877 cuda: route Q8_0 dense matmuls through cuda/mmq when DS4_CUDA_USE_MMQ=1
a56e07a cuda: route IQ2_XXS/Q2_K routed-MoE through cuda/mmq when DS4_CUDA_USE_MMQ=1
09b38f2 cuda: gate mmq routed-MoE on n_tokens >= 2 to preserve decode throughput
046f02e docs: AGENT.md and cuda/mmq/VENDOR.md - phase 7 lock and env-var inventory
944482d cuda: flip DS4_CUDA_USE_MMQ to opt-out - cuda/mmq is now the default
298022f tests: sprint 0 bench harness - variance <1% at ctx=2048 on PRO 6000
1bbacf6 cuda: step 1 - delete legacy Q8->FP16 expansion cache and cuBLAS path
af99da7 cuda: step 2 - wire Q4_K dense and MoE through cuda/mmq
387e58d cuda: step 3 - paired moe API to share Q8_1 activation across gate+up
66fa20f cuda: step 4 - mmq_x_max sweep diagnostic hook (default unchanged)
380f9bf cuda: step 6 partial - vendor mmvq.{cu,cuh} and unary.cuh
71ca87d cuda: step 6.a/b - vec ABI + parity-tested impls for mmvq
4425ed0 cuda: step 6.c/d - wire mmvq decode into routed_moe and dense Q8_0
abe2657 cuda: step 8 - CUDA Graph capture+replay for routed-MoE decode
2200c67 cuda: step 8.2 - graphs for dense Q8_0 vec + flip default to ON
b66b5d6 cuda: revert DS4_CUDA_MOE_GRAPHS default to OFF (R1, stopgap)
687c783 cuda: bracket every captured cudaGraphLaunch with pre+post stream sync
7967154 cuda: re-enable DS4_CUDA_MOE_GRAPHS by default after sync-fix validation
8df4b2a cuda: route MTP verifier matmuls to legacy kernels (Option D)
961e57a Revert "cuda: step 1 - delete legacy Q8->FP16 expansion cache and cuBLAS path"
90fec89 cuda: pick Q8_0 strategy by device memory bandwidth at startup
38572f9 cuda: query memoryClockRate via cudaDeviceGetAttribute (CUDA 13 compat)
6658bde cuda: simplify Q8_0 dispatch to mmq-default with explicit overrides
10.2 In-process VMM arena (5 commits)
11e37c2 cuda: probe + supported() helper for in-process VMM arena
2e1f4f6 cuda: VMM-backed weight arena allocator + teardown
65e5045 cuda: route fd-cache through the VMM arena when supported
6a89ea5 cuda: VMM arena chunks default to request-size, not 1024 MiB
ba42c13 docs: in-process VMM arena is the default for single-process runs
10.3 Weight server sidecar (12 commits)
87d0a60 weight-server: probe VMM backend support
53fbfd4 weight-server: allocate and upload VMM-owned model ranges
71e803d weight-server: broker VMM allocation file descriptors
48d9e49 cuda: import VMM weight ranges from the broker
e7f7ce1 weight-server: use direct I/O for VMM uploads
e7edd96 Bind CUDA fd-cache to its owning model_map
5076c07 cuda: fix merged fd-cache owner declaration
5b48096 Pre-cache MTP model tensors at CUDA startup
414c790 cuda: let weight server own derived verifier artifacts
6f271e9 cuda: import prebuilt Q8 expansion artifacts
c76edf4 docs: add weight-server guidance to AGENT.md
0acb50e proof: add weight server harness smoke test
10.4 Proof harness (15 selected commits)
08d8fa1 proof: generalize engine proof runner
f2f424e proof: add CUDA weight server lifecycle
76cbc9a proof: support scoped weight server imports
5a6d3ed proof: record weight server telemetry
0b1370d proof: guard weight server parent lifetime
7268f68 proof: lock CUDA weight server ownership
cda4b4d proof: reject stale weight manifests
7536533 proof: report weight server validation verdict
015be9a proof: validate VMM weight-server runs
9466a03 proof: add named budgets and steady-state timing reports
5153bfb proof: report steady-state generation throughput
23cd345 proof: report MTP acceptance as a first-class metric
59cd8d2 proof: expose derived weight artifacts in harness
600e2e8 mtp: add CUDA proof matrix
dbeeec9 proof: default shared weights to base without mtp
10.5 MTP exact verifier (48 selected commits)
ec2dd7c cuda: add MTP top2 verifier paths
2141a37 mtp: reduce exact verifier output scheduling
6b9fe82 cuda: fuse MTP verifier MoE down sum for two tokens
3cad23c cuda: batch Q8 pair projections for MTP verifier
c5a0c36 cuda: pair decode Q and KV projections
143623f cuda: write attention output A directly to low layout
0ccbfcf mtp: add shadow check for custom attention output B
2fcdeb6 cuda: keep Q8 pair batching on verifier-sized prefill
0e85718 cuda: bypass sorted MoE gate setup for two-token verify
7eeed1f proof: make no-opt verifier output the fast MTP baseline
4be36f8 mtp: suppress draft probes when speculation is disabled
9b9755a mtp: account full speculative cycle timing
362bb32 mtp: introduce explicit verifier plan results
c6d1487 mtp: add fixed-depth decode3 verifier shadow
5dbed76 mtp: capture two verifier prefix depths
8f3216e mtp: make raw draft state commits explicit
bbf61e5 proof: add verifier v2 shadow profile
462b613 mtp: add opt-in decode3 verifier path
18c3f59 proof: expose verifier v2 active candidate
bda1ebb mtp: isolate batch-first verifier experiments
31cb5e0 proof: add batch-first verifier profile
aad1401 mtp: make exact verification the default
d206b04 mtp: prefer sequential exact suffix verification
d71ee5e mtp: select exact verifier from acceptance history
f72ba58 mtp: gate exact speculation on verified acceptance
e07e014 mtp: trim exact decode2 output-head work
a5af40d mtp: route exact decode2 through a layer entrypoint
3032763 mtp: profile exact decode2 layer replay
9ab947e mtp: add shadowed candidate output logits
614b5a8 mtp: add opt-in certified row0 logits verifier
3c63d0d mtp: tighten exact candidate output verification
4b8518b mtp: add exact pair output projection verifier
f6051da mtp: default exact pair output verification on CUDA
68166ad mtp: add exact state-barrier decode2 verifier transaction
d504a89 mtp: add decode2 batch FFN stage-diff diagnostics
1fb3c93 mtp: pair exact decode2 attention output
e56b714 mtp: remove staging copies from paired decode2 attention
6a5714b mtp: isolate decode2 routed MoE batch exactness
af5cd0a mtp: batch routed MoE inside exact decode2 FFN
0b99bea mtp: add scalar-order n2 Q8 for exact FFN body batching
a1f4723 mtp: write exact batched FFN HC rows directly
7a20ab7 mtp: write exact FFN prefix rows in place
454f539 mtp: avoid paired Q8 gate-up in exact FFN body
87e5e39 mtp: add exact decode2 Q-path batch replay diagnostic
d627216 mtp: add compressor projection replay diagnostic
ce61a50 mtp: prefer decode2 for optimized exact full-body verifier
6a86411 mtp: serialize accept-gate state in session snapshot
9c99333 Add CUDA output-head verifier microbenchmark
10.6 Bench / CLI / server (8 commits)
d4ab17f cli: allow session-path no-MTP timing baselines
5153bfb proof: report steady-state generation throughput
9466a03 proof: add named budgets and steady-state timing reports
d202148 bench: compare no-MTP and exact MTP throughput
f4ca519 bench: report steady-state gen_tps alongside total
b79c2d7 server: emit token_ids in chat SSE deltas when return_token_ids=true
4698319 server: place token_ids at choice level to match vLLM/llama-benchy
904a0c4 server: snap SSE emission limit to token boundary when return_token_ids

Generated 2026-05-18 from git log $(git merge-base HEAD origin/main)..HEAD on branch pr-prep-2026-05-18 (HEAD 8c4525b). Source CSVs in local/docs/ds4_vmm_landing_merged/; design docs in adjacent ds4_* HTML/markdown files; vendor pin in cuda/mmq/VENDOR.md; operator guides in AGENT.md and docs/{cuda-mtp,proof-harness}/.