ds4 fork: state of the work vs antirez/ds4 main (2026-05-18)

PR-prep snapshot on branch pr-prep-2026-05-18 (HEAD 8c4525b, branched from restore-cublas-and-dispatch). 132 commits ahead of origin/main at c9dd9499. 30,723 lines added, 367 deleted across 55 files. Three pillars: CUDA mmq lift, in-process VMM weight arena + sidecar, generalized engine proof harness.

1. Executive summary

Prefill ceiling

5.88×

PRO 6000 Blackwell ctx=2048: 2193 t/s on the fork (mmq dispatch + in-process VMM arena) vs 373 t/s upstream baseline (cublas + cudaMalloc 4 KiB pages). Compound of about 2.89× from mmq dispatch (373 → 1079 t/s) and about 2.03× from the VMM layout (1079 → 2193 t/s), per the §5.1 table.

Generation throughput

+16.3%

PRO 6000 ctx=2048: 38.0 → 44.2 gen t/s vs upstream. Layered from mmq prefill (+3.9%), mmvq decode (+13.7% cumulative), and graph capture with bidirectional stream sync (+16.3% cumulative) — all default-on. GB10 gen is flat within run-to-run noise (integrated LPDDR5X caps decode regardless of dispatch path).

Sidecar payoff

N−1

The weight server amortises the base upload (~20–70 s, fast NVMe to slow storage) across N concurrent workers (proof harness, profile sweeps, MTP correctness).

Kernel work

12.5 k vendored
+ 7.9 k novel

12.5 k LOC of llama.cpp mmq/mmvq kernels vendored verbatim (or lightly patched) as a matmul platform. 7.9 k LOC of ds4-original CUDA on top: adapter + dispatcher + parity tests, 14 purpose-built MTP verifier kernels, in-process VMM arena, weight-server binary, host-side verifier orchestration.

Proof runner

5-tuple

profile × suite × prompt × budget × contract. Owns weight-server lifecycle, enforces parent-PID exit, emits a single pass/fail verdict.

CUDA MTP

enabled

CUDA speculative decoding path landed end-to-end — top-2 draft, paired Q/KV, exact 2-token verifier, acceptance-history gating, session-snapshot persistence. Ships as experimental: throughput is currently neutral-to-negative vs no-MTP, so it is opt-in and the showcase config remains best-nomtp.

2. In plain English — the pitch

1 — A faster way to do the math

Every layer of the model needs to multiply two big quantised matrices together. ds4 used to do this by expanding the weights to FP16 first, then calling NVIDIA's cuBLAS — a step that taxed both memory and time. The fork replaces it on two fronts: we vendored llama.cpp's hand-tuned matmul family (~12.5 k LOC under cuda/mmq/), which gives us fast kernels for both the matrix-shaped prefill case and the vector-shaped single-token decode case, and we wrote the ds4-side adapter, dispatcher, and a CPU-reference parity harness around it (~3.3 k LOC of original code). Discrete-GPU prefill jumps from ~373 to ~1100 tokens/sec on this work alone, and the same compute platform underpins the +16.3% generation throughput headlined in §1.

Pillar 4.1 (mmq lift)

2 — A better way to lay out 80 GB of weights

GPU math speed depends partly on where the weights live in memory: specifically, whether each tensor's base address sits on a big alignment boundary that the GPU's caches and tile-load hardware can exploit. cudaMalloc packs all ~80 GB of weights into one big chunk at arbitrary internal offsets. The fork uses the CUDA Driver VMM API to allocate each weight tensor at its own 2 MiB-aligned virtual address. That single change lets the matmul kernels coalesce tile loads and hit L2 more reliably, and discrete-GPU prefill jumps about 2× from this layout alone. No setup; single-process runs just get it. (The same alignment also produces a small, deterministic FP32 reduction-order drift on tight-margin tokens; same root cause, documented in misc/cuda-env-vars.md.)

Pillar 4.2 (in-process VMM arena)

3 — A sidecar for when you need more than one process

Some workflows want several ds4 processes running against the same model at the same time — e.g. comparing two configurations head-to-head, or running the proof harness. Without help, each process pays the base-weights upload cost on its own — somewhere between ~20 seconds (fast NVMe) and ~70 seconds (slower storage) per process for the V4 Flash quantised weights. The fork introduces a small ds4_weight_server binary that owns the weights once and shares them with the workers through a Unix socket. Auto-disabled when not needed, so single-process users never notice it.

Pillar 4.3 (weight server)
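
Sharing an allocation with a worker over a Unix socket generally means passing a file descriptor as ancillary data. The sketch below shows that general POSIX technique (SCM_RIGHTS) on a connected AF_UNIX socket. The names send_fd and recv_fd are hypothetical, and the actual ds4_weight_server wire protocol and manifest negotiation are not described in this report, so treat this as an illustration of the transport idea only.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one file descriptor over a connected AF_UNIX socket.
 * Returns 0 on success, -1 on error. (Hypothetical helper name.) */
static int send_fd(int sock, int fd)
{
    assert(sock >= 0 && fd >= 0);
    char byte = 'F'; /* one data byte must accompany the control message */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } ctrl;
    memset(&ctrl, 0, sizeof ctrl);

    struct msghdr msg;
    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof ctrl.buf;

    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;
    c->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &fd, sizeof fd);

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor. Returns the new descriptor or -1. */
static int recv_fd(int sock)
{
    assert(sock >= 0);
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } ctrl;
    memset(&ctrl, 0, sizeof ctrl);

    struct msghdr msg;
    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof ctrl.buf;

    if (recvmsg(sock, &msg, 0) != 1) return -1;
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    if (c == NULL || c->cmsg_level != SOL_SOCKET || c->cmsg_type != SCM_RIGHTS)
        return -1;
    int fd = -1;
    memcpy(&fd, CMSG_DATA(c), sizeof fd);
    return fd;
}
```

The received descriptor is a new number in the worker but refers to the same underlying object, which is what lets one owner process hand the same backing memory to several workers.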

4 — A safety net for moving fast

Every change in this branch could in principle flip a generated token somewhere. The new proof harness boots ds4 in any pair of configurations, feeds them the same prompts, and verifies they produce byte-identical output. If something silently drifts, the harness fails. This is what made it safe to land the perf changes above without anxiety — and what gates any future "let's flip this default on" decision.

Pillar 4.4 (proof harness)
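
The byte-identical check at the heart of the harness can be pictured as a strict buffer comparison. This is a minimal sketch; the function name first_divergence is hypothetical (the real runner lives in tests/ds4_proof.py and is not reproduced here).

```c
#include <assert.h>
#include <stddef.h>

/* Returns -1 when the two outputs are byte-identical, otherwise the
 * offset of the first difference. A length mismatch counts as a
 * difference at the end of the shorter buffer. */
static long first_divergence(const unsigned char *a, size_t na,
                             const unsigned char *b, size_t nb)
{
    assert(a != NULL && b != NULL);
    size_t n = na < nb ? na : nb;
    for (size_t i = 0; i < n; i++) {
        if (a[i] != b[i]) return (long)i;
    }
    return na == nb ? -1 : (long)n;
}
```

Reporting the first differing offset rather than a bare pass/fail makes it easier to map a drift back to the token where two configurations split.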

5 — CUDA speculative decoding, plumbed end-to-end

"Speculative decoding" runs a small drafter model ahead of the main one, then has the main model verify several tokens at once. It's a known way to speed up generation. The CUDA backend had no path for it before; this branch lands the full pipeline behind 14 purpose-built CUDA kernels we wrote ourselves — paired Q8 projections, two-token top-2 candidate verifier, candidate-certification + merge, fused MoE down-sum for two tokens, batched FFN body — plus host-side state-barrier transactions, acceptance-history gating, and session persistence. It works and proves byte-equivalent. Throughput isn't yet a net win on the configurations we measured, so it ships behind opt-in flags as experimental.

Pillar 4.5 (MTP exact verifier)

6 — Two guard rails against subtle bugs

(a) Inside the speculative verifier we still use the old matmul kernel, because the drafter was trained against that kernel's exact rounding behavior — mixing kernels would flip tight-margin tokens and collapse draft acceptance. (b) A single env var picks the matmul strategy and the rest of the dispatcher flows from it, with documented fallbacks if the preferred path fails to initialise.

Pillar 4.6 (correctness gates)

7 — Better measurement

The first generated token is much slower than the rest (the model has to "warm up"). Reporting just one average hides that. The fork adds a steady-state generation throughput column to ds4-bench alongside the total, plus a built-in MTP-vs-no-MTP comparison mode and reproducible CSV-emitting frontier sweeps. The HTTP server also emits per-token IDs in the SSE stream at the right structural level so external benchmark tools (vLLM-style) just work against ds4-server.

Pillar 4.7 (bench / server)

8 — An operator's manual

A new AGENT.md documents every environment variable, the matmul dispatcher behaviour, the weight-server-vs-VMM decision tree, and the safety rules. Plus operator guides for CUDA MTP and the proof harness under docs/, and a vendor-pin file under cuda/mmq/ that records exactly which llama.cpp commit we tracked and how to re-sync from upstream.

Pillar 4.8 (docs & vendor pin)

3. Scope — what changed where

Class	Files	Added	Deleted	Notes
New — vendored matmul kernels	`cuda/mmq/{mmq,mma,vecdotq,common,mmid,quantize,unary,mmvq,ggml-*}.{cuh,cu,h}` + `vendors/cuda.h`	~12,500	0	llama.cpp `mmq`/`mmvq` family, pinned at `5c0e9468`. ~11.3 k verbatim + ~1.2 k patched (mmvq: gated ggml-backend entries, promoted `mul_mat_vec_q_switch_type`)
New — ds4-original adapter in `cuda/mmq/`	`ds4_mmq.{h,cu}`, `ds4_ggml_stubs.{h,cu}`	~1,920	0	Host C ABI, dispatcher entry points, ggml-stub types, context mgmt — what makes the templated kernels callable without the ggml runtime
New — mmq parity & bench harness	`cuda/mmq/test/*`, `tests/mmq_bench_stats.py`, `tests/run_mmq_bench.sh`	~1,670	0	CPU-reference parity for Q8_0/Q2_K/IQ2_XXS/MoE paths; tile-width sweep harness
New — weight server	`tools/ds4_weight_server.cu`	1,802	0	Fully ds4-original: GGUF parser + CUDA Driver VMM host + Unix-socket FD broker + manifest negotiation + session mgmt
New — proof harness	`tests/ds4_proof.py`, `cuda_mtp_proof_matrix.py`, `ds4_weight_server_harness_smoke.py`, `proof/*.json`	2,551	0	generalized 5-tuple runner
New — docs	`AGENT.md`, `docs/cuda-mtp/README.md`, `docs/proof-harness/README.md`, `cuda/mmq/VENDOR.md`, `speed-bench/.../README.md`	1,026	0	operator guides, env-var inventory, vendor pin
Modified — core engine	`ds4.c`	4,495	202	MTP exact verifier, accept-gate state, helpers
Modified — CUDA backend	`ds4_cuda.cu`, `ds4_gpu.h`	3,380	81	~1.6 k novel CUDA: 14 MTP verifier kernels (~690 LOC), VMM arena (~250), Q8→FP16 fallback cache (~300), CUDA graph caches (~160), FD broker (~100), Q8_0 strategy + bandwidth probe (~75), verifier gate (~20). Remainder is mmq/mmvq wiring and refactors.
Modified — bench/CLI/server	`ds4_bench.c`, `ds4_cli.c`, `ds4_server.c`	358	8	steady-state gen_tps, no-MTP baselines, token_ids SSE
Modified — misc	`Makefile`, `.gitignore`, `README.md`, `ds4.h`, `ds4_metal.m`	377	22	build glue + Metal counterparts to GPU API extensions
Total	55 files	30,723	367	132 commits

4. Themes

4.1 CUDA mmq kernel lift — vendored platform + ds4-native dispatch

The previous CUDA backend dispatched every Q8_0 dense matmul through a Q8→FP16 expansion cache plus cublasGemmEx. Profile of the V4 Flash IQ2XXS w2Q2K AProjQ8 SExpQ8 OutQ8 model at ctx=2048 showed the expansion + GEMM as the dominant prefill cost: 373 t/s on PRO 6000 Blackwell. The fix has two layers, and it's important to keep them distinct in the PR conversation:

Coverage matrix

Dispatch site	Quantization	n_tokens	Routed via
Attention projections (Q/K/V/O), shared expert, lm_head	Q8_0	≥ 2 (prefill)	`ds4_mmq_q8_0_dense` (mmq matrix-shaped)
Attention projections decode	Q8_0	= 1	`ds4_mmq_q8_0_dense_vec` (mmvq)
Dense Q4_K (e.g. attn_output_b)	Q4_K	≥ 2	`ds4_mmq_q4_K_dense`
Routed MoE gate & up	IQ2_XXS / Q4_K	≥ 2	`ds4_mmq_{iq2_xxs,q4_K}_moe` (paired API shares Q8_1 act)
Routed MoE down	Q2_K / Q4_K	≥ 2	`ds4_mmq_{q2_K,q4_K}_moe`
Routed MoE decode	IQ2_XXS / Q2_K / Q4_K	= 1, n_expert_used ≤ 8	`ds4_mmq_*_moe_vec` (mmvq)

Dispatcher hierarchy (DS4_CUDA_PREFILL_PATH)

A startup-time strategy probe writes one of three resolved paths into a single dispatch variable, then the per-call hot path is a load of that variable. cuBLAS is initialised regardless of selection because we observed the cuBLAS init triggers driver state that makes mmq ~4× faster on sm_121.

mmvq decode wedge (Step 6)

At n_tokens=1 the matrix-shaped mmq path wastes column tiling on a single output column. We additionally vendor mmvq.{cu,cuh} and wire two decode-only sites:

Knobs: DS4_CUDA_NO_MMVQ_DECODE=1 to opt out; DS4_CUDA_MMVQ_DECODE_MAX_TOKENS=N to extend mmvq into short prefill batches (still bound by the gate above).
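
How these knobs might combine into a routing decision for routed-MoE matmuls is sketched below. The thresholds come from the coverage matrix and the env-var table (mmvq at n_tokens = 1 with n_expert_used ≤ 8, matrix-shaped mmq from DS4_CUDA_MMQ_MOE_MIN_TOKENS = 2 upward, max-tokens clamped to 0–8). The struct, enum, and function names are hypothetical, and the fallback to a legacy path when neither branch applies is an assumption rather than documented behaviour.

```c
#include <assert.h>
#include <stddef.h>

enum route { ROUTE_LEGACY, ROUTE_MMVQ, ROUTE_MMQ };

/* Parsed knob values (hypothetical container). */
struct decode_knobs {
    int no_mmvq_decode;   /* DS4_CUDA_NO_MMVQ_DECODE=1 */
    int mmvq_max_tokens;  /* DS4_CUDA_MMVQ_DECODE_MAX_TOKENS, default 1 */
    int mmq_min_tokens;   /* DS4_CUDA_MMQ_MOE_MIN_TOKENS, default 2 */
};

static enum route pick_moe_route(const struct decode_knobs *k,
                                 int n_tokens, int n_expert_used)
{
    assert(k != NULL && n_tokens >= 1);

    /* Clamp the mmvq cap to the documented 0..8 range. */
    int max_tok = k->mmvq_max_tokens;
    if (max_tok < 0) max_tok = 0;
    if (max_tok > 8) max_tok = 8;

    /* Vector-shaped path: small batches, mmvq not opted out,
     * and no more than 8 experts in use. */
    if (!k->no_mmvq_decode && n_tokens <= max_tok && n_expert_used <= 8)
        return ROUTE_MMVQ;

    /* Matrix-shaped path once there are enough tokens to tile across. */
    if (n_tokens >= k->mmq_min_tokens)
        return ROUTE_MMQ;

    /* Assumption: anything else stays on the pre-existing decode path. */
    return ROUTE_LEGACY;
}
```

With the defaults {0, 1, 2}, a single-token decode goes to mmvq and any batch of two or more goes to mmq; raising the cap to 4 pulls batches of up to four tokens onto mmvq as well.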

CUDA Graph capture+replay (Step 8, default-on; opt out with DS4_CUDA_MOE_GRAPHS=0)

Each kernel sequence in the mmvq routed-MoE decode block and the n_tok=1 dense Q8_0 vec path is captured into a cudaGraphExec_t on first execution with a given (layer-shape, buffer-pointer) tuple. The MoE cache holds 256 entries; the dense Q8_0 cache holds 1024 entries. Replay eliminates ~5–15µs of CPU↔driver round-trip per launch — the dominant overhead at decode where individual kernels are small.

Parity tests

cuda/mmq/test/test_mmq_parity.cu (1,245 LOC) compares every wired mmq shape to a CPU reference at multiple n_tokens and tile widths. The bench harness tests/run_mmq_bench.sh + tests/mmq_bench_stats.py drove the X_max sweep that picked the default tile (X=128 vanilla wins on sm_120 by ~6–20% over X∈{32,64,96}).

4.2 In-process CUDA VMM weight arena

Background: when ds4 loads weights via cudaMalloc, the legacy arena packs all ~80 GB of weights into one large chunk where each tensor sits at an arbitrary 256-byte-aligned internal offset. The ds4_weight_server sidecar avoids this by using the CUDA Driver VMM API (cuMemCreate + cuMemAddressReserve per range), giving each weight tensor its own 2 MiB-aligned virtual address. The fork extracts the same machinery into the in-process path so single-process runs (ds4-bench, ds4-server, one-shot CLI) reach the same prefill ceiling without spawning a sidecar. The chunk-size bisect we ran during this work updated our understanding of why the VMM arena is fast. The original framing ("2 MiB pages reduce TLB pressure") is incomplete: VMM with one large 1792 MiB chunk performs identically to cudaMalloc (~1080 t/s prefill on PRO 6000), even though the cuMemCreate-backed memory is still 2 MiB-paged. The actual differentiator is per-tensor 2 MiB-aligned base addresses: when each weight tensor sits at its own fresh cuMemAddressReserve-handed VA, matmul kernels' tile-load coalescing and L2 spatial-locality patterns improve enough to roughly double prefill. Pack the same VMM-paged memory into one big chunk and the bases land at sub-granularity offsets — the perf advantage disappears.
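
The difference between the two layouts is easiest to see as offset arithmetic. The sketch below places the same tensor sizes back to back with a 256-byte alignment (the packed arena) and with a 2 MiB alignment (per-tensor bases), then counts how many bases land on a 2 MiB boundary. It is only an illustration: the real arena obtains each base from cuMemAddressReserve rather than from one contiguous range, and all names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PACK_ALIGN  256u                  /* packed-arena sub-allocation alignment */
#define VMM_GRANULE (2u * 1024u * 1024u)  /* 2 MiB */

/* Round x up to a multiple of a (a must be a power of two). */
static uint64_t align_up(uint64_t x, uint64_t a)
{
    assert(a != 0 && (a & (a - 1)) == 0);
    return (x + a - 1) & ~(a - 1);
}

/* Place n tensors back to back, each starting on an `align` boundary.
 * Writes the base offsets into base[] and returns the total span. */
static uint64_t layout(const uint64_t *size, size_t n, uint64_t align,
                       uint64_t *base)
{
    uint64_t off = 0;
    for (size_t i = 0; i < n; i++) {
        off = align_up(off, align);
        base[i] = off;
        off += size[i];
    }
    return align_up(off, align);
}

/* Count how many bases sit exactly on an `align` boundary. */
static size_t count_aligned(const uint64_t *base, size_t n, uint64_t align)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        if (base[i] % align == 0) hits++;
    }
    return hits;
}
```

For three tensors of 3 MiB + 512 B, 1000 B, and 5 MiB, the packed layout leaves only the first base on a 2 MiB boundary, while the per-tensor layout aligns all three at the cost of a larger span.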

Mechanism

Why GB10 is neutral and that's fine

On the integrated Spark, weights live in the same LPDDR5X pool as everything else and the per-tensor-base-alignment effect that drives the discrete-GPU win doesn't translate — VMM yields no measurable delta (−0.16% mean across the sweep, no row worse than −0.75%). The earlier worry that integrated GPUs would OOM under VMM was wrong — the weight server has run on GB10 with --reserve-gb 24 for weeks; in-process VMM has the same memory profile.

4.3 CUDA weight server sidecar

A standalone CUDA process that owns the weight allocations and exposes them to one or more ds4 workers through a manifest file. Two transports:

Operating envelope

When to use which (operator decision tree)

Workload	Recommended path	Why
One-shot `./ds4 -p ...`, single `ds4-bench`, single `ds4-server`	In-process VMM arena (auto)	Same per-tensor 2 MiB-aligned base layout as the sidecar, zero setup tax.
Proof harness running N profiles in parallel	Weight server, scope=base	Base upload (~20–70 s, NVMe-dependent) amortised over N workers.
MTP correctness work (base + MTP gguf concurrent)	Weight server, scope=both	Single-allocation fragmentation can OOM even with sufficient free VRAM.
Multi-profile bench sweeps	Weight server, scope=base	Same as proof harness.

4.4 Generalized engine proof harness

Lifecycle ownership

--start-weight-server hands the sidecar's lifecycle to the runner. The runner:

Verdict surface

The JSON report carries a top-level weight_server_validation verdict that automation can gate on. It checks ready state, backend, scope, preflight result, upload telemetry for the requested model scope, parent-PID guard, lock acquisition, shutdown observation, and clean termination. VMM runs additionally check support telemetry, plans, broker startup, and broker request activity. weight_server carries the raw command, manifest path, log path, dry-run preflight, startup time, and cleanup result.
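
One way to read the verdict is as a conjunction of the individual checks, with the VMM-specific ones applied only on VMM runs. The sketch below mirrors the list in the paragraph above; the struct and field names are hypothetical and do not claim to match the JSON keys the runner actually emits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One flag per check named in the report (hypothetical field names). */
struct ws_checks {
    bool ready;
    bool backend_ok;
    bool scope_ok;
    bool preflight_ok;
    bool upload_telemetry_ok;
    bool parent_pid_guard_ok;
    bool lock_acquired;
    bool shutdown_observed;
    bool clean_termination;
    /* Evaluated only for VMM runs. */
    bool vmm_support_telemetry_ok;
    bool vmm_plans_ok;
    bool broker_started;
    bool broker_requests_seen;
};

/* Pass only if every applicable check passed. */
static bool ws_validation_pass(const struct ws_checks *c, bool is_vmm)
{
    assert(c != NULL);
    bool base = c->ready && c->backend_ok && c->scope_ok && c->preflight_ok &&
                c->upload_telemetry_ok && c->parent_pid_guard_ok &&
                c->lock_acquired && c->shutdown_observed &&
                c->clean_termination;
    if (!is_vmm) return base;
    return base && c->vmm_support_telemetry_ok && c->vmm_plans_ok &&
           c->broker_started && c->broker_requests_seen;
}
```

Collapsing everything into one boolean is what lets automation gate on a single field while the raw details stay available for debugging.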

MTP-specific reporting

23cd345 promoted MTP acceptance to a first-class metric so optimisation runs can read it directly from the report instead of grepping logs. 59cd8d2 exposes derived weight artifacts (prebuilt Q8 expansion tables, etc.) so the harness can re-run with the same artifacts the WS owner produced.

4.5 MTP exact verifier path — experimental enablement

35+ commits · +~3 k LOC ds4.c orchestration · 14 novel CUDA kernels (~690 LOC) · CUDA MTP enabled end-to-end · opt-in: throughput neutral-to-negative

The deliverable here is enablement, not a perf win. Prior to this work the CUDA backend had no exact speculative-decoding path; this branch lands the full pipeline — top-2 draft, paired Q/KV projections, fused MoE down-sum, exact 2-token verifier with state-barrier rollback, certified row-0 logits, acceptance-history gating, and session-snapshot persistence of the accept-gate state — behind opt-in env flags. The throughput case is currently neutral to slightly negative vs no-MTP on the workloads we measured, so MTP is shipped as experimental and is not part of the recommended fast baseline.

Novel CUDA kernels we wrote for the verifier

These are not adaptations of vendored code — they are purpose-built for ds4's two-token exact-verification model and do not exist in cuda/mmq/ upstream:

Kernel family	Kernel symbols	~LOC	Job
Top-2 logits (decode pair)	`matmul_q8_0_top2_warp8_kernel`, `matmul_q8_0_top2_logits_n2_warp8_kernel`	~108	Argmax + runner-up for row 0; row-1 logits piped straight to the verifier without a full output projection.
Candidate certification & merge	`matmul_q8_0_candidates_warp8_kernel`, `q8_0_row_group_norms_warp_kernel`, `q8_0_x_group_norms_kernel`, `q8_0_candidate_certify_prune_warp8_kernel`, `q8_0_candidate_certify_merge_kernel`, `q8_0_top2_merge_kernel`	~280	Proves the drafted row-1 token is the row-0 argmax under a derived norm bound, so the row-0 top-2 scan can be skipped on certified pairs; falls back to exact top-2 on miss.
Paired Q8 projections / batched FFN body	`matmul_q8_0_pair_preq_warp8_kernel`, `matmul_q8_0_pair_preq_batch_warp8_kernel`, `matmul_q8_0_hc_expand_preq_warp8_kernel`, `matmul_q8_0_hc_expand_preq_n2_warp8_kernel`, `matmul_q8_0_preq_batch_warp8_kernel`, `matmul_q8_0_preq_n2_warp8_kernel`	~275	Shared Q8_0 activation across gate+up; paired attention output A; HC-row and prefix-row direct writes; scalar-order n2 path for exact FFN body batching across the two verifier tokens.

Plus ds4_gpu_set_mtp_verifier / g_in_mtp_verifier — a 20-LOC thread-local gate that forces the Q8_0 dispatcher onto warp8 for the duration of a verifier call, because mmq's stream-k + MMA FP32 reduction order drifts ~1 ULP/layer from the legacy kernel and the drafter is trained against legacy decoding (analyst measured 0/314 acceptance on GB10 with an mmq verifier active).
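
The report does not spell out the norm bound used by the candidate-certification kernels in the table above. As an illustration of the general idea only, the sketch below assumes a per-group Cauchy-Schwarz bound: a row's logit cannot exceed the sum over groups of (row group norm × activation group norm), so any row whose bound is strictly below the candidate's exact logit cannot be the argmax. All names are hypothetical, and a real kernel would also need slack for floating-point rounding.

```c
#include <assert.h>
#include <stddef.h>

/* Upper bound on |w_r . x| from precomputed per-group L2 norms
 * (assumed bound: Cauchy-Schwarz applied group by group). */
static float logit_upper_bound(const float *row_norm, const float *x_norm,
                               size_t n_groups)
{
    float bound = 0.0f;
    for (size_t g = 0; g < n_groups; g++) {
        bound += row_norm[g] * x_norm[g];
    }
    return bound;
}

/* row_norms is an n_rows x n_groups row-major array.
 * Returns 1 if every non-candidate row is provably below cand_logit,
 * 0 if certification fails and an exact scan is still required. */
static int certify_argmax(const float *row_norms, const float *x_norm,
                          size_t n_rows, size_t n_groups,
                          size_t cand, float cand_logit)
{
    assert(row_norms != NULL && x_norm != NULL && cand < n_rows);
    for (size_t r = 0; r < n_rows; r++) {
        if (r == cand) continue;
        float bound = logit_upper_bound(row_norms + r * n_groups, x_norm, n_groups);
        if (!(bound < cand_logit)) return 0; /* cannot rule this row out */
    }
    return 1;
}
```

A failed certificate is not an error: it only means the cheap proof was inconclusive, which matches the fallback to exact top-2 described in the table.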

The fork's MTP work targets exact (bit-identical to no-MTP) speculative decoding on CUDA, building toward a 2-token verifier. The headline path:

4.6 Correctness gates

(a) Option D — legacy kernels inside the MTP verifier

DS4_CUDA_MTP_VERIFIER_USE_MMQ default unset / 0. The CUDA backend honors ds4_gpu_set_mtp_verifier(1) bracketing by routing all Q8_0 dense matmuls (and routed-MoE dispatch via the same gate) onto the legacy warp8 kernels for the duration of one verifier call. Necessary because mmq's stream-k + MMA FP32 reduction order drifts ~1 ULP/layer from warp8; the drafter is trained against legacy-style decoding, so an mmq verifier flips tight-margin argmax tokens and collapses draft acceptance (analyst measured 0/314 on GB10 with mmq verifier active). Setting the env var to 1 reproduces the broken behavior for bisection.
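
The gate can be pictured as a thread-local flag consulted by the Q8_0 dispatcher. The sketch reuses the names ds4_gpu_set_mtp_verifier and g_in_mtp_verifier from this report, but the signature, the enum, and pick_q8_kernel are assumptions made for illustration, not the code in ds4_cuda.cu.

```c
#include <assert.h>

enum q8_kernel { Q8_WARP8, Q8_MMQ };

/* Thread-local so a verifier call on one thread does not affect others. */
static _Thread_local int g_in_mtp_verifier = 0;

/* Bracket a verifier call: set to 1 before, 0 after. */
static void ds4_gpu_set_mtp_verifier(int on)
{
    assert(on == 0 || on == 1);
    g_in_mtp_verifier = on;
}

/* `resolved` is the path the dispatcher picked at startup.
 * `verifier_use_mmq` mirrors DS4_CUDA_MTP_VERIFIER_USE_MMQ=1, the
 * bisection switch that deliberately disables the guard. */
static enum q8_kernel pick_q8_kernel(enum q8_kernel resolved,
                                     int verifier_use_mmq)
{
    if (g_in_mtp_verifier && !verifier_use_mmq) {
        return Q8_WARP8; /* keep the verifier on legacy rounding */
    }
    return resolved;
}
```

Outside the bracket the resolved path is used unchanged, so the guard costs one flag read per dispatch.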

(b) Dispatcher with explicit downgrade chain

On init failure, mmq downgrades to cublas, cublas downgrades to warp8. Strategy logged once on first dispatch with arch and bandwidth, e.g.

DS4_CUDA_PREFILL_PATH=mmq|cublas|warp8|auto is the modern knob; the legacy DS4_CUDA_USE_MMQ=0 still works and resolves to cublas. DS4_CUDA_PREFILL_PATH takes precedence if both are set.
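
The precedence and downgrade rules above can be summarised in a few lines. This is a minimal sketch assuming the env-var values are passed in as strings (NULL when unset); the enum and function names are hypothetical, and treating an unrecognised DS4_CUDA_PREFILL_PATH value as auto is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum prefill_path { PATH_WARP8, PATH_CUBLAS, PATH_MMQ };

/* Resolve the requested path from the two knobs.
 * prefill_path: value of DS4_CUDA_PREFILL_PATH, or NULL if unset.
 * use_mmq:      value of DS4_CUDA_USE_MMQ, or NULL if unset. */
static enum prefill_path resolve_requested(const char *prefill_path,
                                           const char *use_mmq)
{
    if (prefill_path != NULL) {
        /* The modern knob wins whenever it is set. */
        if (strcmp(prefill_path, "mmq") == 0) return PATH_MMQ;
        if (strcmp(prefill_path, "cublas") == 0) return PATH_CUBLAS;
        if (strcmp(prefill_path, "warp8") == 0) return PATH_WARP8;
        return PATH_MMQ; /* "auto" (or, by assumption, anything else) */
    }
    if (use_mmq != NULL && strcmp(use_mmq, "0") == 0) {
        return PATH_CUBLAS; /* legacy alias */
    }
    return PATH_MMQ; /* default: auto resolves to mmq */
}

/* Apply the downgrade chain: mmq -> cublas -> warp8. */
static enum prefill_path apply_downgrades(enum prefill_path want,
                                          int mmq_ok, int cublas_ok)
{
    assert(want == PATH_WARP8 || want == PATH_CUBLAS || want == PATH_MMQ);
    if (want == PATH_MMQ && !mmq_ok) want = PATH_CUBLAS;
    if (want == PATH_CUBLAS && !cublas_ok) want = PATH_WARP8;
    return want;
}
```

Resolving once at startup and storing the result keeps the per-call hot path down to a single load, as described in §4.1.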

4.7 Bench, observability, server

ds4-bench
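
The steady-state gen_tps column exists because the first generated token is much slower than the rest. A minimal sketch of the distinction, assuming per-token completion timestamps and assuming that "steady state" means "everything after the first token" (the exact window ds4-bench uses is not stated in this report; function names are hypothetical):

```c
#include <assert.h>
#include <stddef.h>

/* t[i] = completion time, in seconds since generation started,
 * of generated token i. */

/* Total throughput: all n tokens over the full elapsed time. */
static double gen_tps_total(const double *t, size_t n)
{
    assert(t != NULL && n >= 1 && t[n - 1] > 0.0);
    return (double)n / t[n - 1];
}

/* Steady-state throughput: drop the first token and the time it took. */
static double gen_tps_steady(const double *t, size_t n)
{
    assert(t != NULL && n >= 2 && t[n - 1] > t[0]);
    return (double)(n - 1) / (t[n - 1] - t[0]);
}
```

For example, with timestamps {0.5, 0.625, 0.75, 0.875, 1.0} the total is 5 t/s while the steady-state figure is 8 t/s, because the first token alone took half of the elapsed second.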

CLI

Server / SSE

4.8 Documentation & vendor pin

5. Performance results

5.1 Headline prefill at ctx=2048 (V4 Flash IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8)

Arch	Upstream baseline (cublas + cudaMalloc)	Fork — mmq + arena (prior default)	Fork — mmq + in-process VMM (current default)	Total speedup
PRO 6000 Blackwell sm_120	~373 t/s	1078.86 t/s	2193.29 t/s	5.88×
GB10 Spark sm_121	~401 t/s	461.24 t/s	460.49 t/s	1.15×

PRO 6000 numbers from local/docs/ds4_vmm_landing_merged/pod_{arena,vmm}.csv; GB10 from the matching gb10_{arena,vmm}.csv. Upstream baseline figures from the AGENT.md dispatch table. Each fork column reports a single ctx=2048 frontier; the full sweeps show the VMM gain holds (1.87× arena→VMM at ctx=32768 on PRO 6000) and GB10 is flat across the sweep (the ×1.15 there is purely the mmq dispatcher, not page layout).
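
The total speedup in the table is the product of the two stage ratios, which a few lines of arithmetic confirm from the PRO 6000 row. The helper name speedup is hypothetical.

```c
#include <assert.h>

/* Ratio of a new throughput to its baseline (both in tokens/sec). */
static double speedup(double baseline_tps, double new_tps)
{
    assert(baseline_tps > 0.0);
    return new_tps / baseline_tps;
}
```

Using the table values, speedup(373.0, 1078.86) is about 2.89 (mmq dispatch), speedup(1078.86, 2193.29) is about 2.03 (arena → VMM), and their product is about 5.88, matching the Total speedup column.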

5.2 Generation throughput vs upstream — PRO 6000

The mmq lift on its own leaves decode roughly unchanged (mmq is matrix-shaped — at n_tokens=1 there's nothing to tile across). The decode gain comes from a separate piece of work: vendoring llama.cpp's mmvq vector-matmul family and routing the n_tok=1 routed-MoE and dense Q8_0 attention projection paths through it (Step 6 of the mmq optimisation plan).

Stage	Dispatch path	Gen t/s @ ctx=2048	vs upstream
Upstream baseline	cuBLAS + Q8→FP16 expansion + legacy fused decode	~38.0	—
mmq prefill, legacy decode	`USE_MMQ=1`, `NO_MMVQ_DECODE=1`	~39.5	+3.9%
mmq + mmvq decode (current default)	`USE_MMQ=1`, mmvq decode on	43.2	+13.7%
+ CUDA graphs with stream sync (current default)	auto, opt-out via `DS4_CUDA_MOE_GRAPHS=0`	44.2	+16.3%

Source: speed-bench on PRO 6000 Blackwell sm_120, CUDA 13.0, V4 Flash IQ2XXS GGUF. The legacy / mmq-only / mmvq-decode rows are from local/docs/ds4_mmq_optimization_session2.html (gen-tokens=128, n=10, p=0.0079 for each pairwise step). The graphs-with-sync row is from the post-fix bench (commits 687c783 + 7967154): ctx=2048 gen 43.31 → 44.20 t/s. On GB10 the same fix doesn't translate into a measurable decode gain: the committed sweep CSV shows gen at ctx=2048 as 14.17 → 14.16 t/s, flat within run-to-run noise. LPDDR5X bandwidth caps GB10 decode regardless of dispatch path. The PRO 6000 graphs gain is also smaller than the pre-revert benchmark because the new pre/post sync brackets serialize stream=0 with g_moe_stream, trading some parallelism for the correctness fix.

5.3 Decode is preserved under the VMM layout switch

The arena→VMM transition is a memory-layout change for the weights, not a kernel change — decode is bandwidth-bound, so VMM should be neutral on it. Verified:

Arch	gen_tps_ss arena	gen_tps_ss VMM	Δ
PRO 6000 (ctx=2048, gen=128)	43.21	43.44	+0.5%
PRO 6000 (ctx=32768, gen=128)	37.77	38.05	+0.7%
GB10 (ctx=2048, gen=64)	14.11	14.12	+0.1%
GB10 (ctx=16384, gen=64)	13.52	13.49	−0.2%

No row regresses by more than about 0.2%, which is within run-to-run noise. VMM is safe to enable as the single-process default for decode-heavy workloads.

5.4 Single-machine speed snapshot at ~12k-token prompt

Single-run CLI numbers for both CUDA targets alongside the existing Mac entries in the project README, with identical settings on each: --ctx 32768 --nothink --temp 0 -n 256, q2 quant, long prompt = first ~40 kB of speed-bench/promessi_sposi.txt (12,461 tokens on the V4 Flash tokenizer). Mac numbers are unchanged from the existing table; the two CUDA rows are the fresh measurements taken from the same build the rest of this report describes (mmq + mmvq + graphs with bidirectional stream sync + in-process VMM arena, all default-on).

Both CUDA rows use the imatrix-tuned q2 variant (...-imatrix.gguf) for apples-to-apples comparison. As a sanity check we also re-ran GB10 against the non-imatrix q2 variant: results landed within 0.3–1.7% of the imatrix numbers across both prefill and gen, i.e. inside run-to-run noise — imatrix vs non-imatrix doesn't change the throughput story at this granularity.

5.5 MTP exact verifier — GB10 full-context sweep (experimental opt-in)

Run	Prefill mean	Gen mean	Gen first frontier	Gen last frontier (38912)
no-MTP (showcase baseline)	342.20 t/s	12.68 t/s	14.00 t/s	11.63 t/s
exact MTP, draft=2 experimental	341.02 t/s	12.62 t/s	12.53 t/s	11.68 t/s

Source: speed-bench/mtp-compare-2026-05-14/{gb10_nomtp,gb10_exact_mtp}.csv. The point of these numbers is to validate that the new CUDA MTP path runs end-to-end and stays exact; throughput is currently neutral-to-negative (behind on the cold frontier from setup tax, marginally ahead at the largest contexts). Ships opt-in until a configuration breaks the wash in MTP's favor.

6. Public surface, env vars, and back-compat

New env-var surface

Variable	Default	Effect
`DS4_CUDA_PREFILL_PATH`	`auto` → mmq	Q8_0 dispatch: mmq / cublas / warp8 / auto. Explicit override.
`DS4_CUDA_USE_MMQ`	unset	Legacy alias: `0` = cublas. Lower precedence than `DS4_CUDA_PREFILL_PATH`.
`DS4_CUDA_MMQ_MOE_MIN_TOKENS`	2	Minimum n_tokens at which routed-MoE uses mmq matrix-shaped path.
`DS4_CUDA_MMQ_X_MAX`	unset (128)	Diagnostic: clip `get_mmq_x_max_host` to N.
`DS4_CUDA_NO_MMVQ_DECODE`	unset	Opt-out of mmvq for n_tok=1 decode (routed-MoE and dense Q8_0).
`DS4_CUDA_MMVQ_DECODE_MAX_TOKENS`	1	Cap on n_tokens routed through mmvq decode branch (0–8).
`DS4_CUDA_MOE_GRAPHS`	ON	CUDA Graph capture/replay for routed-MoE decode and dense Q8_0 vec, with bidirectional stream sync. Opt-out via `0`.
`DS4_CUDA_MTP_VERIFIER_USE_MMQ`	unset / 0	Bisection switch: `1` reproduces the broken mmq-in-verifier behavior.
`DS4_CUDA_VMM_ARENA`	enabled	`0` disables in-process VMM allocator (escape hatch).
`DS4_CUDA_VMM_ARENA_CHUNK_MB`	0 (request-size)	Force a minimum chunk size per cuMemCreate; rarely needed.
`DS4_CUDA_WEIGHT_IPC_MANIFEST`	unset	Worker-side: import weights from the manifest path. Hard-gates in-process VMM off.
`DS4_CUDA_MTP_TOP2` / `DS4_CUDA_MTP_VERIFY_TOP2`	unset	Enable CUDA top-2 draft + verifier shortcut.
`DS4_MTP_CERT_LOGITS` / `DS4_MTP_CERT_LOGITS_SHADOW`	unset	Opt-in row-0 certificate / shadow validator.
`DS4_MTP_STRICT` (= `--quality`)	unset	Force byte-identical target stream behavior.

Public C API additions (ds4_gpu.h)

Back-compat posture

7. Architectural impact diagram

8. Risks, known issues, gaps

R1 — DS4_CUDA_MOE_GRAPHS resolved. The cross-stream race that motivated the b66b5d6 revert is now diagnosed and fixed (commits 687c783 + 7967154). Two race legs across the g_moe_stream / stream=0 boundary — pre-launch input read and post-launch output read — are closed by cudaEventRecord + cudaStreamWaitEvent brackets around every cudaGraphLaunch. Validated on both targets: smoke parity ON vs OFF bit-identical; MTP-active output coherent on GB10 (was previously garbled); ds4-bench gen positive on PRO 6000 in the committed sweep, flat within run-to-run noise on GB10 (LPDDR5X bandwidth-bound). Default flipped back to ON; opt-out via DS4_CUDA_MOE_GRAPHS=0. The sync overhead trades some parallelism for correctness, so the gain is smaller than the pre-revert benchmark and the GB10 gain in particular does not survive into the committed sweep.

Gaps acknowledged

9. Path to upstreaming

#	PR title	Scope	Risk	Reviewer ask
1	cuda: vendor llama.cpp mmq + adapter + parity tests	`cuda/mmq/` + Makefile + `cuda/mmq/VENDOR.md`	low (additive, default off behind `DS4_CUDA_USE_MMQ=1` on the first PR if desired)	vendor-pin policy — licensing already settled: both MIT, ds4 `LICENSE` credits "The ggml authors"
2	cuda: route Q8_0 / Q4_K / IQ2_XXS / Q2_K dispatch through mmq	dispatcher in `ds4_cuda.cu` + env vars + AGENT.md table	medium (perf shift)	bench reproducibility on at least sm_120 + one other arch
3	cuda: in-process VMM weight arena	5 commits, `ds4_cuda.cu` + AGENT.md	low (gated, falls back to legacy on probe failure)	portability across CUDA versions
4	tools: ds4_weight_server + broker + import API	`tools/ds4_weight_server.cu`, `ds4_gpu.h`, AGENT.md	medium (new binary, FD-over-socket protocol)	operator UX, lock file semantics
5	tests: generalized engine proof harness	`tests/ds4_proof.py` + docs/proof-harness	low (test-only)	scope of `contract` + budget naming
6	mtp: exact verifier + acceptance-history gating + session-snapshot accept-state	`ds4.c`, `ds4_cuda.cu`, docs/cuda-mtp	medium (default-on behavior change, but acceptance-gated)	strict-mode equivalence proof
7	cuda: Option D MTP-verifier kernel routing	tiny `ds4.c` bracketing + `ds4_cuda.cu` dispatcher hook	low (default-on)	none beyond #6
8	bench/server: steady-state gen_tps + token_ids SSE + session-path no-MTP baselines	`ds4_bench.c`, `ds4_cli.c`, `ds4_server.c`	low (additive)	SSE wire-format match with vLLM

If antirez prefers a single mega-PR: #1+#2 must stay together (mmq is dead code without the dispatcher), but every other group can be sequenced.

10. Appendix — commits by theme

Generated 2026-05-18 from git log $(git merge-base HEAD origin/main)..HEAD on branch pr-prep-2026-05-18 (HEAD 8c4525b). Source CSVs in local/docs/ds4_vmm_landing_merged/; design docs in adjacent ds4_* HTML/markdown files; vendor pin in cuda/mmq/VENDOR.md; operator guides in AGENT.md and docs/{cuda-mtp,proof-harness}/.