PR-prep snapshot on branch pr-prep-2026-05-18 (HEAD 8c4525b, branched from restore-cublas-and-dispatch).
132 commits ahead of origin/main at c9dd9499.
30,723 lines added, 367 deleted across 55 files. Three pillars: CUDA mmq lift, in-process VMM weight arena + sidecar, generalized engine proof harness.
mmq/mmvq kernels vendored verbatim (or lightly patched) as a matmul platform. 7.9 k LOC of ds4-original CUDA on top: adapter + dispatcher + parity tests, 14 purpose-built MTP verifier kernels, in-process VMM arena, weight-server binary, host-side verifier orchestration. Showcase configuration: best-nomtp.
This branch changes antirez/ds4's CUDA backend in four ways:
(1) replaces the slow matmul path with a fast one,
(2) changes how model weights are laid out in GPU memory so the chip's address-translation cache stops thrashing,
(3) builds a test harness that proves the first two didn't change any generated tokens,
(4) plumbs CUDA speculative decoding end-to-end (correct, but not yet a perf win, so it's opt-in).
The first two compound to 5.88× prefill throughput on the discrete GPU we target. The third is what made it safe to land that fast. The fourth is a future win waiting on amortisation.
Every layer of the model needs to multiply two big quantised matrices together. ds4 used to do this by expanding the weights to FP16 first, then calling NVIDIA's cuBLAS — a step that taxed both memory and time. The fork replaces it on two fronts: we vendored llama.cpp's hand-tuned matmul family (~12.5 k LOC under cuda/mmq/), which gives us fast kernels for both the matrix-shaped prefill case and the vector-shaped single-token decode case, and we wrote the ds4-side adapter, dispatcher, and a CPU-reference parity harness around it (~3.3 k LOC of original code). Discrete-GPU prefill jumps from ~373 to ~1100 tokens/sec on this work alone, and the same compute platform underpins the +16.3% generation throughput headlined in §1.
GPU math speed depends partly on where the weights live in memory — specifically, whether each tensor's base address sits on a big alignment boundary that the GPU's caches and tile-load hardware can exploit. cudaMalloc packs all ~80 GB of weights into one big chunk at arbitrary internal offsets. The fork uses a CUDA driver API to allocate each weight tensor at its own 2 MB-aligned virtual address. That single change lets the matmul kernels coalesce tile loads and hit L2 more reliably — discrete-GPU prefill jumps about 2× from this layout alone. No setup; single-process runs just get it. (The same alignment also produces a small, deterministic FP32 reduction-order drift on tight-margin tokens; same root cause, documented in misc/cuda-env-vars.md.)
Some workflows want several ds4 processes running against the same model at the same time — e.g. comparing two configurations head-to-head, or running the proof harness. Without help, each process pays the base-weights upload cost on its own — somewhere between ~20 seconds (fast NVMe) and ~70 seconds (slower storage) per process for the V4 Flash quantised weights. The fork introduces a small ds4_weight_server binary that owns the weights once and shares them with the workers through a Unix socket. Auto-disabled when not needed, so single-process users never notice it.
Every change in this branch could in principle flip a generated token somewhere. The new proof harness boots ds4 in any pair of configurations, feeds them the same prompts, and verifies they produce byte-identical output. If something silently drifts, the harness fails. This is what made it safe to land the perf changes above without anxiety — and what gates any future "let's flip this default on" decision.
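The byte-identity check described above can be sketched in a few lines. This is an illustrative model, not code from tests/ds4_proof.py: the function names and the per-prompt dictionary layout are assumptions made for the example.

```python
from typing import Dict, List, Optional


def first_divergence(a: bytes, b: bytes) -> Optional[int]:
    """Return the first byte offset where a and b differ, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        # One output is a strict prefix of the other.
        return min(len(a), len(b))
    return None


def check_pair(out_a: Dict[str, bytes], out_b: Dict[str, bytes]) -> List[str]:
    """Compare per-prompt outputs of two configurations; return failure messages.

    The dict-of-bytes layout is hypothetical, chosen only for this sketch.
    """
    failures = []
    for prompt_id in sorted(out_a.keys() | out_b.keys()):
        a, b = out_a.get(prompt_id), out_b.get(prompt_id)
        if a is None or b is None:
            failures.append(f"{prompt_id}: missing in one configuration")
            continue
        offset = first_divergence(a, b)
        if offset is not None:
            failures.append(f"{prompt_id}: outputs diverge at byte {offset}")
    return failures
```

An empty list from `check_pair` means the two configurations agreed on every prompt; any entry is a hard failure, which matches the "if something silently drifts, the harness fails" rule.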
"Speculative decoding" runs a small drafter model ahead of the main one, then has the main model verify several tokens at once. It's a known way to speed up generation. The CUDA backend had no path for it before; this branch lands the full pipeline behind 14 purpose-built CUDA kernels we wrote ourselves — paired Q8 projections, two-token top-2 candidate verifier, candidate-certification + merge, fused MoE down-sum for two tokens, batched FFN body — plus host-side state-barrier transactions, acceptance-history gating, and session persistence. It works and proves byte-equivalent. Throughput isn't yet a net win on the configurations we measured, so it ships behind opt-in flags as experimental.
(a) Inside the speculative verifier we still use the old matmul kernel, because the drafter was trained against that kernel's exact rounding behavior — mixing kernels would flip tight-margin tokens and collapse draft acceptance. (b) A single env var picks the matmul strategy and the rest of the dispatcher flows from it, with documented fallbacks if the preferred path fails to initialise.
The first generated token is much slower than the rest (the model has to "warm up"). Reporting just one average hides that. The fork adds a steady-state generation throughput column to ds4-bench alongside the total, plus a built-in MTP-vs-no-MTP comparison mode and reproducible CSV-emitting frontier sweeps. The HTTP server also emits per-token IDs in the SSE stream at the right structural level so external benchmark tools (vLLM-style) just work against ds4-server.
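To make the difference between the two columns concrete, here is a minimal sketch of how a total and a steady-state rate could be derived from per-token completion times. The exact definition used by ds4-bench is not spelled out here, so the "skip the first token" rule and the function name are assumptions.

```python
from typing import Optional, Sequence, Tuple


def gen_throughput(
    token_done_s: Sequence[float], warmup_tokens: int = 1
) -> Tuple[float, Optional[float]]:
    """Return (total_tps, steady_state_tps).

    token_done_s[i] is the wall-clock time, in seconds since generation
    started, at which token i was finished. warmup_tokens is the number of
    leading tokens excluded from the steady-state figure (assumed to be 1).
    """
    if not token_done_s:
        raise ValueError("need at least one generated token")
    if warmup_tokens < 1:
        raise ValueError("warmup_tokens must be >= 1")
    n = len(token_done_s)
    total_tps = n / token_done_s[-1]
    if n <= warmup_tokens:
        return total_tps, None  # nothing left after dropping the warm-up
    steady_elapsed = token_done_s[-1] - token_done_s[warmup_tokens - 1]
    steady_tps = (n - warmup_tokens) / steady_elapsed
    return total_tps, steady_tps


# A slow first token (2.0 s) followed by four tokens at 0.5 s each.
print(gen_throughput([2.0, 2.5, 3.0, 3.5, 4.0]))  # (1.25, 2.0)
```

The shorter the generation, the larger the gap between the two numbers, which is why a single average distorts comparisons at small `-n`.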
A new AGENT.md documents every environment variable, the matmul dispatcher behaviour, the weight-server-vs-VMM decision tree, and the safety rules. Plus operator guides for CUDA MTP and the proof harness under docs/, and a vendor-pin file under cuda/mmq/ that records exactly which llama.cpp commit we tracked and how to re-sync from upstream.
| Class | Files | Added | Deleted | Notes |
|---|---|---|---|---|
| New — vendored matmul kernels | cuda/mmq/{mmq,mma,vecdotq,common,mmid,quantize,unary,mmvq,ggml-*}.{cuh,cu,h} + vendors/cuda.h | ~12,500 | 0 | llama.cpp mmq/mmvq family, pinned at 5c0e9468. ~11.3 k verbatim + ~1.2 k patched (mmvq: gated ggml-backend entries, promoted mul_mat_vec_q_switch_type) |
| New: ds4-original adapter in cuda/mmq/ | ds4_mmq.{h,cu}, ds4_ggml_stubs.{h,cu} | ~1,920 | 0 | Host C ABI, dispatcher entry points, ggml-stub types, context mgmt; together these make the templated kernels callable without the ggml runtime |
| New — mmq parity & bench harness | cuda/mmq/test/*, tests/mmq_bench_stats.py, tests/run_mmq_bench.sh | ~1,670 | 0 | CPU-reference parity for Q8_0/Q2_K/IQ2_XXS/MoE paths; tile-width sweep harness |
| New — weight server | tools/ds4_weight_server.cu | 1,802 | 0 | Fully ds4-original: GGUF parser + CUDA Driver VMM host + Unix-socket FD broker + manifest negotiation + session mgmt |
| New — proof harness | tests/ds4_proof.py, cuda_mtp_proof_matrix.py, ds4_weight_server_harness_smoke.py, proof/*.json | 2,551 | 0 | generalized 5-tuple runner |
| New — docs | AGENT.md, docs/cuda-mtp/README.md, docs/proof-harness/README.md, cuda/mmq/VENDOR.md, speed-bench/.../README.md | 1,026 | 0 | operator guides, env-var inventory, vendor pin |
| Modified — core engine | ds4.c | 4,495 | 202 | MTP exact verifier, accept-gate state, helpers |
| Modified — CUDA backend | ds4_cuda.cu, ds4_gpu.h | 3,380 | 81 | ~1.6 k novel CUDA: 14 MTP verifier kernels (~690 LOC), VMM arena (~250), Q8→FP16 fallback cache (~300), CUDA graph caches (~160), FD broker (~100), Q8_0 strategy + bandwidth probe (~75), verifier gate (~20). Remainder is mmq/mmvq wiring and refactors. |
| Modified — bench/CLI/server | ds4_bench.c, ds4_cli.c, ds4_server.c | 358 | 8 | steady-state gen_tps, no-MTP baselines, token_ids SSE |
| Modified — misc | Makefile, .gitignore, README.md, ds4.h, ds4_metal.m | 377 | 22 | build glue + Metal counterparts to GPU API extensions |
| Total | 55 files | 30,723 | 367 | 132 commits |
25 commits · ~12.5 k vendored · ~3.6 k ds4-original in this theme · phases 0–8
The previous CUDA backend dispatched every Q8_0 dense matmul through a Q8→FP16 expansion cache plus cublasGemmEx. Profile of the V4 Flash IQ2XXS w2Q2K AProjQ8 SExpQ8 OutQ8 model at ctx=2048 showed the expansion + GEMM as the dominant prefill cost: 373 t/s on PRO 6000 Blackwell. The fix has two layers, and it's important to keep them distinct in the PR conversation:
Vendored layer (llama.cpp, pinned at 5c0e9468). Templated stream-k + tensor-core MMA kernels for 11+ quantisation types: mmq.cuh (4,195 LOC), mma.cuh (1,456), vecdotq.cuh (1,317), common.cuh + ggml-common.h + quantize.{cu,cuh} (~3,770), mmid.{cu,cuh} + unary.cuh (~280). The mmvq pair is the only patched-vendored piece: we gate the ggml-backend entries behind DS4_MMVQ_INCLUDE_GGML_ENTRIES and promote mul_mat_vec_q_switch_type from static so we can call it directly. The templated CUDA is not cherry-pickable: ds4 takes all of it or none of it. Licensing is uncomplicated: llama.cpp is MIT and ds4 is MIT, and the ds4 LICENSE file already carries both copyright lines ("The ds4.c authors" + "The ggml authors"). Re-sync procedure and the upstream-pin commit hash live in cuda/mmq/VENDOR.md.
ds4-original layer. ds4_mmq.{h,cu} + ds4_ggml_stubs.{h,cu} (~1,920 LOC under cuda/mmq/) supply the host C ABI, the ggml-context stubs, and the per-quantisation dispatch entry points. The Q8_0 dispatcher (ds4_cuda_q8_strategy, ~75 LOC in ds4_cuda.cu) probes SM version + memory bandwidth and resolves auto to mmq, with explicit overrides and an init-failure downgrade chain; llama.cpp has no equivalent because it does not have ds4's "one validated default per arch" policy. cuda/mmq/test/test_mmq_parity.cu (1,245 LOC) + iq2_host_tables.h (~105) is our CPU-reference parity sweep across every wired shape and tile width.
| Dispatch site | Quantization | n_tokens | Routed via |
|---|---|---|---|
| Attention projections (Q/K/V/O), shared expert, lm_head | Q8_0 | ≥ 2 (prefill) | ds4_mmq_q8_0_dense (mmq matrix-shaped) |
| Attention projections decode | Q8_0 | = 1 | ds4_mmq_q8_0_dense_vec (mmvq) |
| Dense Q4_K (e.g. attn_output_b) | Q4_K | ≥ 2 | ds4_mmq_q4_K_dense |
| Routed MoE gate & up | IQ2_XXS / Q4_K | ≥ 2 | ds4_mmq_{iq2_xxs,q4_K}_moe (paired API shares Q8_1 act) |
| Routed MoE down | Q2_K / Q4_K | ≥ 2 | ds4_mmq_{q2_K,q4_K}_moe |
| Routed MoE decode | IQ2_XXS / Q2_K / Q4_K | = 1, n_expert_used ≤ 8 | ds4_mmq_*_moe_vec (mmvq) |
A startup-time strategy probe (controlled by DS4_CUDA_PREFILL_PATH) writes one of three resolved paths into a single dispatch variable; the per-call hot path is then just a load of that variable. cuBLAS is initialised regardless of selection because we observed that cuBLAS init triggers driver state that makes mmq ~4× faster on sm_121.
At n_tokens=1 the matrix-shaped mmq path wastes column tiling on a single output column. We additionally vendor mmvq.{cu,cuh} and wire two decode-only sites:
routed_moe_launch: gated on n_tokens * n_expert_used ≤ MMVQ_MAX_BATCH_SIZE (8 on Blackwell). Two separate moe_vec calls preserve the DeepSeek V4 clamp epilogue exactly; the fused moe_pair_vec entry exists but isn't yet wired (fusion applies silu without clamp).
cuda_matmul_q8_0_tensor_labeled: n_tok=1 attention projections are routed via ds4_mmq_q8_0_dense_vec.
Knobs: DS4_CUDA_NO_MMVQ_DECODE=1 to opt out; DS4_CUDA_MMVQ_DECODE_MAX_TOKENS=N to extend mmvq into short prefill batches (still bound by the gate above).
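The routed-MoE gate and its two knobs combine roughly as follows. This is a sketch of the decision only: the function name and the way the environment values are parsed are assumptions, while the batch-size limit and the default cap of 1 token come from the text and the env-var table.

```python
from typing import Mapping

MMVQ_MAX_BATCH_SIZE = 8  # value quoted above for Blackwell


def use_mmvq_routed_moe(
    n_tokens: int, n_expert_used: int, env: Mapping[str, str]
) -> bool:
    """Decide whether a routed-MoE call takes the mmvq decode branch."""
    # Opt-out knob wins over everything else.
    if env.get("DS4_CUDA_NO_MMVQ_DECODE") == "1":
        return False
    # Cap on n_tokens; the documented default is 1 (pure decode).
    max_tokens = int(env.get("DS4_CUDA_MMVQ_DECODE_MAX_TOKENS", "1"))
    if n_tokens > max_tokens:
        return False
    # Raising the cap never bypasses the batch-size gate.
    return n_tokens * n_expert_used <= MMVQ_MAX_BATCH_SIZE


print(use_mmvq_routed_moe(1, 6, {}))                                         # True
print(use_mmvq_routed_moe(2, 6, {"DS4_CUDA_MMVQ_DECODE_MAX_TOKENS": "2"}))  # False
```

The second call shows why the knob is "still bound by the gate": two tokens with six active experts is 12 rows, which exceeds the limit of 8.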
Each kernel sequence in the mmvq routed-MoE decode block and the n_tok=1 dense Q8_0 vec path is captured into a cudaGraphExec_t on first execution with a given (layer-shape, buffer-pointer) tuple. The MoE cache holds 256 entries; the dense Q8_0 cache holds 1024 entries. Replay eliminates ~5–15µs of CPU↔driver round-trip per launch — the dominant overhead at decode where individual kernels are small.
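The capture-once, replay-afterwards behaviour can be modelled with a small keyed cache. This is a host-side illustration in Python rather than the CUDA implementation; the class name is hypothetical, and evicting the oldest entry when the cache is full is an assumption, since the text gives only the capacities.

```python
from collections import OrderedDict
from typing import Callable, Hashable, Tuple


class GraphCache:
    """Bounded cache keyed by (layer shape, buffer pointers)."""

    def __init__(self, capacity: int, capture: Callable[[Hashable], object]):
        self.capacity = capacity
        self._capture = capture          # stands in for stream capture + instantiate
        self._entries: "OrderedDict[Hashable, object]" = OrderedDict()
        self.captures = 0
        self.replays = 0

    def __len__(self) -> int:
        return len(self._entries)

    def launch(self, shape: Tuple[int, ...], buffer_ptrs: Tuple[int, ...]) -> object:
        key = (shape, buffer_ptrs)
        graph = self._entries.get(key)
        if graph is None:
            # First execution with this tuple: pay the capture cost once.
            graph = self._capture(key)
            self.captures += 1
            if len(self._entries) >= self.capacity:
                self._entries.popitem(last=False)  # assumed policy: drop oldest
            self._entries[key] = graph
        else:
            # Same shape and same pointers: replay the cached graph.
            self.replays += 1
        return graph


moe_cache = GraphCache(capacity=256, capture=lambda key: ("graph", key))
for _ in range(3):
    moe_cache.launch((4096, 2048), (0x1000, 0x2000))
print(moe_cache.captures, moe_cache.replays)  # 1 2
```

Including the buffer pointers in the key matters because a captured graph bakes in its kernel arguments; a call with different buffers needs its own entry.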
CUDA graphs are default-on again (re-enabled in 7967154) after the cross-stream race that motivated the b66b5d6 revert was diagnosed and fixed. The race had two legs: (a) captured outputs read by stream=0 without a wait; (b) captured inputs read on g_moe_stream before stream=0 finished writing them. Both are now closed by cudaEventRecord + cudaStreamWaitEvent brackets around every cudaGraphLaunch (commit 687c783). Validated on PRO 6000 (sm_120) and GB10 (sm_121): smoke parity ON vs OFF, MTP-active output coherent on the previously-failing path, ds4-bench gen positive on PRO 6000 (sweep CSV) and flat-within-noise on GB10. Opt-out via DS4_CUDA_MOE_GRAPHS=0.
cuda/mmq/test/test_mmq_parity.cu (1,245 LOC) compares every wired mmq shape to a CPU reference at multiple n_tokens and tile widths. The bench harness tests/run_mmq_bench.sh + tests/mmq_bench_stats.py drove the X_max sweep that picked the default tile (X=128 vanilla wins on sm_120 by ~6–20% over X∈{32,64,96}).
5 commits · +170 LOC ds4_cuda.cu · 2.04× prefill on PRO 6000 · 0% on GB10 (neutral)
Background: when ds4 loads weights via cudaMalloc, the legacy arena packs all ~80 GB of weights into one large chunk where each tensor sits at an arbitrary 256-byte-aligned internal offset. The ds4_weight_server sidecar avoids this by using the CUDA Driver VMM API (cuMemCreate + cuMemAddressReserve per range), giving each weight tensor its own 2 MiB-aligned virtual address. The fork extracts the same machinery into the in-process path so single-process runs (ds4-bench, ds4-server, one-shot CLI) reach the same prefill ceiling without spawning a sidecar.
The chunk-size bisect we ran during this work updated our understanding of why the VMM arena is fast. The original framing ("2 MiB pages reduce TLB pressure") is incomplete: VMM with one large 1792 MiB chunk performs identically to cudaMalloc (~1080 t/s prefill on PRO 6000), even though the cuMemCreate-backed memory is still 2 MiB-paged. The actual differentiator is per-tensor 2 MiB-aligned base addresses: when each weight tensor sits at its own fresh cuMemAddressReserve-handed VA, matmul kernels' tile-load coalescing and L2 spatial-locality patterns improve enough to roughly double prefill. Pack the same VMM-paged memory into one big chunk and the bases land at sub-granularity offsets — the perf advantage disappears.
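The difference between the two layouts is easy to see with address arithmetic alone. The sketch below is a toy model: tensor sizes and the base address are made up, and per-tensor reservations are modelled as consecutive 2 MiB-rounded ranges for simplicity (real cuMemAddressReserve results need not be contiguous).

```python
from typing import List

GRANULARITY = 2 * 1024 * 1024  # 2 MiB, as described above
PACK_ALIGN = 256               # internal alignment of the packed arena


def align_up(n: int, a: int) -> int:
    return (n + a - 1) // a * a


def packed_bases(chunk_base: int, sizes: List[int]) -> List[int]:
    """Tensor base addresses when everything is packed into one chunk."""
    bases, offset = [], 0
    for size in sizes:
        offset = align_up(offset, PACK_ALIGN)
        bases.append(chunk_base + offset)
        offset += size
    return bases


def per_tensor_bases(first_base: int, sizes: List[int]) -> List[int]:
    """Tensor base addresses when each tensor gets its own aligned range."""
    bases, va = [], first_base
    for size in sizes:
        bases.append(va)
        va += align_up(size, GRANULARITY)
    return bases


def aligned_fraction(bases: List[int]) -> float:
    return sum(b % GRANULARITY == 0 for b in bases) / len(bases)


sizes = [3_000_000, 5_000_000, 1_234_567]  # hypothetical tensor sizes in bytes
base = 0x7F00_0000_0000                    # hypothetical 2 MiB-aligned address
print(aligned_fraction(packed_bases(base, sizes)))      # only the first tensor is aligned
print(aligned_fraction(per_tensor_bases(base, sizes)))  # 1.0
```

In the packed layout only the first tensor inherits the chunk's alignment; every later base lands at whatever 256-byte offset the previous tensors left behind, regardless of how the backing memory is paged.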
cuda_vmm_arena_supported(): on first call, probes cuMemGetAllocationGranularity. Hard-gated off when DS4_CUDA_WEIGHT_IPC_MANIFEST is set (the sidecar already provides identical VMM ranges, so running both would double-allocate). Soft-gated off via DS4_CUDA_VMM_ARENA=0.
cuda_vmm_arena_alloc(bytes, label): bump-pointer over the sequence cuMemCreate(CU_MEM_HANDLE_TYPE_NONE) → cuMemAddressReserve → cuMemMap → cuMemSetAccess(PROT_READWRITE). Default chunk size = request size rounded up to granularity, matching the weight server's per-range allocation. Override via DS4_CUDA_VMM_ARENA_CHUNK_MB=N.
The cuda_model_range_ptr_from_fd hot path tries VMM first and falls back to the legacy cuda_model_arena_alloc (cudaMalloc) on probe failure.
Teardown runs cuda_vmm_arenas_release_all from the existing cuda_model_range_release_all path.
Commit 6a89ea5 defaulted mb=0 (chunk = request size). Verified post-fix: 138 chunks, 80.77 GiB allocated for 80.76 GiB raw (0.01% overhead).
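The "request size rounded up to granularity" rule and the quoted overhead figure can be checked with a few lines of arithmetic. The helper names are illustrative, and the 2 MiB granularity is taken from the surrounding text rather than probed.

```python
GRANULARITY = 2 * 1024 * 1024  # 2 MiB; the real value comes from the driver probe


def chunk_bytes(request: int, granularity: int = GRANULARITY) -> int:
    """Size of the chunk created for one request: round up to the granularity."""
    return (request + granularity - 1) // granularity * granularity


def overhead_pct(raw: float, allocated: float) -> float:
    """Padding overhead as a percentage of the raw size."""
    return (allocated - raw) / raw * 100.0


print(chunk_bytes(2 * 1024 * 1024 + 1))          # 4194304: one byte over costs a full extra page
print(round(overhead_pct(80.76, 80.77), 2))      # 0.01, matching the figure above
print(138 * GRANULARITY / 2**20, "MiB worst-case padding for 138 chunks")
```

Each chunk wastes less than one granule, so 138 chunks can never pad by more than 276 MiB in total; the measured figure is far below that bound.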
Sweep data: local/docs/ds4_vmm_landing_merged/{pod,gb10}_{arena,vmm}.csv. PRO 6000 holds 2.04× at ctx=2048 and 1.87× at ctx=32768. The GB10 lines overlap because the integrated LPDDR5X path already operates near the page-table-insensitive limit.
On the integrated Spark, weights live in the same LPDDR5X pool as everything else and the per-tensor-base-alignment effect that drives the discrete-GPU win doesn't translate — VMM yields no measurable delta (−0.16% mean across the sweep, no row worse than −0.75%). The earlier worry that integrated GPUs would OOM under VMM was wrong — the weight server has run on GB10 with --reserve-gb 24 for weeks; in-process VMM has the same memory profile.
12 commits · 1,802 LOC tools/ds4_weight_server.cu · 2 backends (VMM, IPC)
A standalone CUDA process that owns the weight allocations and exposes them to one or more ds4 workers through a manifest file. Two transports:
VMM: allocates with cuMemCreate(CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR) and transfers FDs to workers through a Unix-domain socket broker (SCM_RIGHTS). Workers reconstitute device-mappable ranges via cuMemImportFromShareableHandle.
--scope base|mtp|both: choose which weight set to upload. base for non-MTP work saves ~3 GiB.
--reserve-gb N: leaves headroom for context buffers and transient allocs. Validated: PRO 6000 (96 GiB) needs 8, fails at 12. GB10 (128 GiB) uses 24.
--dry-run: sizes the run and emits a preflight verdict without allocating, so the proof harness can refuse a run that won't fit.
--exit-on-parent-pid: auto-exit if the launcher disappears. Combined with a lock file (/tmp/ds4_weight_server_cuda*.lock) this makes the sidecar safe under harness crashes.
Commit e7f7ce1 bypasses the page cache when loading the GGUF, halving import time on cold runs.
| Workload | Recommended path | Why |
|---|---|---|
| One-shot ./ds4 -p ..., single ds4-bench, single ds4-server | In-process VMM arena (auto) | Same per-tensor 2 MiB-aligned base layout as the sidecar, zero setup tax. |
| Proof harness running N profiles in parallel | Weight server, scope=base | Base upload (~20–70 s, NVMe-dependent) amortised over N workers. |
| MTP correctness work (base + MTP gguf concurrent) | Weight server, scope=both | Single-allocation fragmentation can OOM even with sufficient free VRAM. |
| Multi-profile bench sweeps | Weight server, scope=base | Same as proof harness. |
27 commits · 1,860 LOC tests/ds4_proof.py · 5-tuple data model
A proof run is modeled as profile × suite × prompt × budget × contract:
argmax_generation, mtp_speculative.
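One way to picture the 5-tuple model is as a frozen record plus a cross product over the five axes. The sketch below is an assumption about shape only: the class name, the axis values, and the idea that the runner expands a full cross product are illustrative and not taken from tests/ds4_proof.py.

```python
from dataclasses import dataclass
from itertools import product
from typing import Iterable, List


@dataclass(frozen=True)
class ProofCase:
    """One cell of profile × suite × prompt × budget × contract."""
    profile: str
    suite: str
    prompt: str
    budget: str
    contract: str


def expand(
    profiles: Iterable[str],
    suites: Iterable[str],
    prompts: Iterable[str],
    budgets: Iterable[str],
    contracts: Iterable[str],
) -> List[ProofCase]:
    """Enumerate every combination of the five axes (assumed expansion rule)."""
    return [
        ProofCase(*combo)
        for combo in product(profiles, suites, prompts, budgets, contracts)
    ]


# Placeholder axis values; real names live in the harness configuration.
cases = expand(["profile-a", "profile-b"], ["suite-x"], ["p1", "p2"], ["short"], ["contract-y"])
print(len(cases))  # 4
print(cases[0])
```

Making the record frozen keeps each case hashable, so results can be stored in a dict keyed by the case itself.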
--start-weight-server hands the sidecar's lifecycle to the runner. The runner:
(1) runs the --dry-run preflight and refuses to start if the run won't fit,
(2) launches the sidecar with --exit-on-parent-pid,
(3) sets DS4_CUDA_WEIGHT_IPC_MANIFEST in the worker env after ready.
The JSON report carries a top-level weight_server_validation verdict that automation can gate on. It checks ready state, backend, scope, preflight result, upload telemetry for the requested model scope, parent-PID guard, lock acquisition, shutdown observation, and clean termination. VMM runs additionally check support telemetry, plans, broker startup, and broker request activity. weight_server carries the raw command, manifest path, log path, dry-run preflight, startup time, and cleanup result.
23cd345 promoted MTP acceptance to a first-class metric so optimisation runs can read it directly from the report instead of grepping logs. 59cd8d2 exposes derived weight artifacts (prebuilt Q8 expansion tables, etc.) so the harness can re-run with the same artifacts the WS owner produced.
35+ commits · +~3 k LOC ds4.c orchestration · 14 novel CUDA kernels (~690 LOC) · CUDA MTP enabled end-to-end · opt-in: throughput neutral-to-negative
The deliverable here is enablement, not a perf win. Prior to this work the CUDA backend had no exact speculative-decoding path; this branch lands the full pipeline — top-2 draft, paired Q/KV projections, fused MoE down-sum, exact 2-token verifier with state-barrier rollback, certified row-0 logits, acceptance-history gating, and session-snapshot persistence of the accept-gate state — behind opt-in env flags. The throughput case is currently neutral to slightly negative vs no-MTP on the workloads we measured, so MTP is shipped as experimental and is not part of the recommended fast baseline.
These are not adaptations of vendored code — they are purpose-built for ds4's two-token exact-verification model and do not exist in cuda/mmq/ upstream:
| Kernel family | Kernel symbols | ~LOC | Job |
|---|---|---|---|
| Top-2 logits (decode pair) | matmul_q8_0_top2_warp8_kernel, matmul_q8_0_top2_logits_n2_warp8_kernel | ~108 | Argmax + runner-up for row 0; row-1 logits piped straight to the verifier without a full output projection. |
| Candidate certification & merge | matmul_q8_0_candidates_warp8_kernel, q8_0_row_group_norms_warp_kernel, q8_0_x_group_norms_kernel, q8_0_candidate_certify_prune_warp8_kernel, q8_0_candidate_certify_merge_kernel, q8_0_top2_merge_kernel | ~280 | Proves the drafted row-1 token is the row-0 argmax under a derived norm bound, so the row-0 top-2 scan can be skipped on certified pairs; falls back to exact top-2 on miss. |
| Paired Q8 projections / batched FFN body | matmul_q8_0_pair_preq_warp8_kernel, matmul_q8_0_pair_preq_batch_warp8_kernel, matmul_q8_0_hc_expand_preq_warp8_kernel, matmul_q8_0_hc_expand_preq_n2_warp8_kernel, matmul_q8_0_preq_batch_warp8_kernel, matmul_q8_0_preq_n2_warp8_kernel | ~275 | Shared Q8_0 activation across gate+up; paired attention output A; HC-row and prefix-row direct writes; scalar-order n2 path for exact FFN body batching across the two verifier tokens. |
Plus ds4_gpu_set_mtp_verifier / g_in_mtp_verifier — a 20-LOC thread-local gate that forces the Q8_0 dispatcher onto warp8 for the duration of a verifier call, because mmq's stream-k + MMA FP32 reduction order drifts ~1 ULP/layer from the legacy kernel and the drafter is trained against legacy decoding (analyst measured 0/314 acceptance on GB10 with an mmq verifier active).
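The candidate-certification row in the kernel table mentions a "derived norm bound" without spelling it out. One standard way to build such a certificate is a per-group Cauchy-Schwarz bound, sketched below in plain Python. This is an assumption suggested by the kernel names (row group norms, x group norms), not a description of the actual kernels, and it ignores quantisation entirely.

```python
import math
from typing import List, Sequence


def group_norms(vec: Sequence[float], group: int) -> List[float]:
    """L2 norm of each consecutive block of `group` elements."""
    return [
        math.sqrt(sum(v * v for v in vec[i:i + group]))
        for i in range(0, len(vec), group)
    ]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def certify_argmax(
    weights: Sequence[Sequence[float]], x: Sequence[float], candidate: int, group: int
) -> bool:
    """True only if `candidate` is provably the argmax of weights @ x.

    For every other row, the logit is bounded above by the sum over groups of
    ||w_g|| * ||x_g||. If the candidate's exact logit beats every bound, no
    other row can win and the full scan can be skipped.
    """
    x_norms = group_norms(x, group)
    cand_logit = dot(weights[candidate], x)
    for row, w in enumerate(weights):
        if row == candidate:
            continue
        bound = sum(wn * xn for wn, xn in zip(group_norms(w, group), x_norms))
        if bound >= cand_logit:
            return False  # not certified: caller falls back to the exact scan
    return True


# Certified: the rivals are too small to catch up even in the worst case.
print(certify_argmax([[1, 0, 0, 0], [0.1, 0, 0, 0], [0, 0.2, 0, 0]], [1, 1, 0, 0], 0, group=2))
# Not certified although row 0 really is the argmax: the bound is conservative.
print(certify_argmax([[1, 0], [0.9, 0.9]], [1, 0], 0, group=2))
```

The second call shows the "falls back to exact top-2 on miss" behaviour: a failed certificate never produces a wrong answer, it only forgoes the shortcut.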
The fork's MTP work targets exact (bit-identical to no-MTP) speculative decoding on CUDA, building toward a 2-token verifier. The headline path:
CUDA top-2 output for draft decisions (DS4_CUDA_MTP_TOP2=1).
Verifier shortcut when full logits are not needed (DS4_CUDA_MTP_VERIFY_TOP2=1).
Two-token Q/KV projections batched into one kernel (3cad23c).
Verifier accumulates two-token MoE down in a single pass (6b9fe82).
Default-on for CUDA after parity tests (4b8518b, f6051da).
Proves the drafted row-1 token is the row-0 argmax before skipping the row-0 top-2 scan; uncertified rows fall back to exact top-2 (614b5a8, opt-in via DS4_MTP_CERT_LOGITS=1).
The verifier appears atomic to the rest of the engine; rejected speculations cleanly roll back KV state (68166ad).
Exact speculation only kicks in once recent acceptance exceeds a threshold, so cold prompts don't pay the speculation tax (f72ba58, d71ee5e, aad1401).
Accept-gate state survives --save-cache / --load-cache (6a86411).
The exact verifier replays a full decoder layer for the pair, with in-place HC/prefix-row writes and routed MoE batched inside (a5af40d, af5cd0a, 0b99bea, a1f4723, 7a20ab7, 454f539, ce61a50).
With --mtp-draft 2 over a full-context sweep, exact MTP is effectively tied with no-MTP (342 t/s prefill, 12.6 t/s gen mean), and it pays a ~3.7 s setup tax at gen=128 that the current generation lengths don't amortise. The plumbing is exact and the harness can prove it; the wash is structural: higher draft depths, lower verifier scheduling overhead, or longer generations are the levers to break it. Until that happens the showcase config in benchmarks stays best-nomtp and MTP is reached only by setting DS4_CUDA_MTP_TOP2=1 + DS4_CUDA_MTP_VERIFY_TOP2=1 + --mtp.
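The acceptance-history gating idea in the list above can be sketched as a sliding window over recent draft outcomes. The class name, window size, threshold, and minimum sample count are all hypothetical; the sketch also assumes draft outcomes can be scored while the gate is closed, which the text does not state.

```python
from collections import deque


class AcceptGate:
    """Open the speculation path only when recent drafts were mostly right."""

    def __init__(self, window: int = 32, threshold: float = 0.5, min_samples: int = 8):
        self.threshold = threshold
        self.min_samples = min_samples
        self._history = deque(maxlen=window)  # oldest outcomes fall off automatically

    def record(self, accepted: bool) -> None:
        """Record whether the latest drafted token matched the exact token."""
        self._history.append(bool(accepted))

    def rate(self) -> float:
        return sum(self._history) / len(self._history) if self._history else 0.0

    def should_speculate(self) -> bool:
        # A cold prompt has no history yet, so it never pays the speculation tax.
        if len(self._history) < self.min_samples:
            return False
        return self.rate() > self.threshold


gate = AcceptGate(window=4, threshold=0.5, min_samples=2)
print(gate.should_speculate())  # False: no history yet
gate.record(True)
gate.record(True)
print(gate.should_speculate())  # True: 2/2 accepted
for _ in range(4):
    gate.record(False)
print(gate.should_speculate())  # False: the window now holds only misses
```

Because the window is just a short list of booleans, persisting it alongside a session snapshot is cheap, which fits the accept-gate state surviving --save-cache / --load-cache.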
Two structural decisions defend the fork against silent regressions:
DS4_CUDA_MTP_VERIFIER_USE_MMQ default unset / 0. The CUDA backend honors ds4_gpu_set_mtp_verifier(1) bracketing by routing all Q8_0 dense matmuls (and routed-MoE dispatch via the same gate) onto the legacy warp8 kernels for the duration of one verifier call. Necessary because mmq's stream-k + MMA FP32 reduction order drifts ~1 ULP/layer from warp8; the drafter is trained against legacy-style decoding, so an mmq verifier flips tight-margin argmax tokens and collapses draft acceptance (analyst measured 0/314 on GB10 with mmq verifier active). Setting the env var to 1 reproduces the broken behavior for bisection.
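The bracketing behaviour can be modelled with a thread-local flag and a context manager. This is a Python analogue for illustration: the real gate is ds4_gpu_set_mtp_verifier / g_in_mtp_verifier in CUDA host code, and the helper names below are hypothetical.

```python
import threading
from contextlib import contextmanager

_state = threading.local()  # one flag per thread, like the thread-local gate


def in_mtp_verifier() -> bool:
    return getattr(_state, "active", False)


@contextmanager
def mtp_verifier():
    """Bracket one verifier call; restores the previous state on exit."""
    previous = in_mtp_verifier()
    _state.active = True
    try:
        yield
    finally:
        _state.active = previous


def q8_0_kernel(resolved_path: str, verifier_use_mmq: bool = False) -> str:
    """Pick the Q8_0 kernel for the current call.

    verifier_use_mmq models DS4_CUDA_MTP_VERIFIER_USE_MMQ=1, the bisection
    switch that deliberately reproduces the broken behaviour.
    """
    if in_mtp_verifier() and not verifier_use_mmq:
        return "warp8"
    return resolved_path


print(q8_0_kernel("mmq"))      # mmq
with mtp_verifier():
    print(q8_0_kernel("mmq"))  # warp8
print(q8_0_kernel("mmq"))      # mmq
```

Restoring the previous value in `finally` keeps the gate correct even if the verifier call raises or brackets are nested.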
On init failure, mmq downgrades to cublas, cublas downgrades to warp8. Strategy logged once on first dispatch with arch and bandwidth, e.g.
ds4: CUDA Q8_0 dispatch: mmq (sm_120, 1792 GB/s memory bandwidth) [default]
DS4_CUDA_PREFILL_PATH=mmq|cublas|warp8|auto is the modern knob; the legacy DS4_CUDA_USE_MMQ=0 still works and resolves to cublas. DS4_CUDA_PREFILL_PATH takes precedence if both are set.
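The precedence rule and the downgrade chain fit in a short resolver. This sketch models only what the text states; treating an unrecognised value as auto and collapsing the SM/bandwidth probe to "auto means mmq" are simplifying assumptions, and `init_ok` is a hypothetical stand-in for backend initialisation.

```python
from typing import Callable, Mapping

VALID = ("mmq", "cublas", "warp8")
DOWNGRADE = {"mmq": "cublas", "cublas": "warp8"}  # warp8 is the end of the chain


def requested_path(env: Mapping[str, str]) -> str:
    """Resolve the requested strategy from the two environment knobs."""
    path = env.get("DS4_CUDA_PREFILL_PATH")
    if path is not None:
        # The modern knob takes precedence whenever it is set.
        return path if path in VALID else "mmq"  # "auto" (and unknown) resolve to mmq
    if env.get("DS4_CUDA_USE_MMQ") == "0":
        return "cublas"  # legacy alias
    return "mmq"


def resolve(env: Mapping[str, str], init_ok: Callable[[str], bool]) -> str:
    """Apply the init-failure downgrade chain to the requested strategy."""
    path = requested_path(env)
    while not init_ok(path) and path in DOWNGRADE:
        path = DOWNGRADE[path]
    return path


always = lambda path: True
print(resolve({}, always))                                                         # mmq
print(resolve({"DS4_CUDA_USE_MMQ": "0"}, always))                                  # cublas
print(resolve({"DS4_CUDA_PREFILL_PATH": "mmq", "DS4_CUDA_USE_MMQ": "0"}, always))  # mmq
print(resolve({}, lambda path: path != "mmq"))                                     # cublas
```

Resolving once at startup and storing the result is what keeps the per-call hot path down to a single variable load.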
8 commits · 358 LOC across bench/CLI/server
Frontier sweeps take --ctx-start / --ctx-max / --step-incr and use snapshot save/restore so each step starts from a hot KV.
--gen-tokens N reports both total gen_tps and steady-state gen_tps_ss (commit f4ca519); the steady-state column removes the first-token amortisation that distorted comparisons at short generations.
MTP-vs-no-MTP comparison mode (d202148) and the speed-bench CSVs + plot tool (plot_mtp_compare.py).
Session-path no-MTP baselines (d4ab17f) so MTP development always has a same-prompt no-MTP control.
ds4-server gained a working-directory option for log isolation under parallel proof runs.
return_token_ids on chat completions emits token_ids in SSE deltas (b79c2d7); placed at choice level to match vLLM / llama-benchy (4698319); SSE emission snaps to token boundary so partial UTF-8 doesn't desynchronise (904a0c4).
42 commits touch docs · ~1 k LOC across 5 docs
AGENT.md (286 lines): agent operating guide. Covers the Q8_0 dispatch table, env-var inventory, weight-server vs in-process VMM decision tree, MTP verifier gate, and safety rules.
docs/cuda-mtp/README.md: operator guide for CUDA MTP on GB10/Spark with reproducible bench commands.
docs/proof-harness/README.md: how to run tests/ds4_proof.py, named budgets, weight-server scoping.
cuda/mmq/VENDOR.md: upstream pin (5c0e9468, llama.cpp), file inventory (verbatim vs patched), symbol-resolution table, re-sync procedure.
speed-bench/mtp-compare-2026-05-14/README.md: reproducible bench commands + summary table.
| Arch | Upstream baseline (cublas + cudaMalloc) | Fork: mmq + arena (prior default) | Fork: mmq + in-process VMM (current default) | Total speedup |
|---|---|---|---|---|
| PRO 6000 Blackwell sm_120 | ~373 t/s | 1078.86 t/s | 2193.29 t/s | 5.88× |
| GB10 Spark sm_121 | ~401 t/s | 461.24 t/s | 460.49 t/s | 1.15× |
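The total-speedup column is the product of the two stages, which a short calculation over the table's own numbers confirms. The upstream baselines are the approximate (~) figures from the table, so the results are approximate to the same degree.

```python
from typing import Dict, Tuple

# (upstream baseline, mmq + arena, mmq + in-process VMM) prefill t/s from the table
ROWS: Dict[str, Tuple[float, float, float]] = {
    "PRO 6000 Blackwell sm_120": (373.0, 1078.86, 2193.29),
    "GB10 Spark sm_121": (401.0, 461.24, 460.49),
}


def speedups(baseline: float, mmq_arena: float, mmq_vmm: float) -> Tuple[float, float, float]:
    """Return (mmq lift, VMM layout lift, total); total == mmq lift * VMM lift."""
    return mmq_arena / baseline, mmq_vmm / mmq_arena, mmq_vmm / baseline


for arch, row in ROWS.items():
    mmq_lift, vmm_lift, total = speedups(*row)
    print(f"{arch}: mmq {mmq_lift:.2f}x, VMM {vmm_lift:.2f}x, total {total:.2f}x")
```

On the discrete GPU the two stages multiply (roughly 2.9× from mmq and 2.0× from the layout change); on GB10 the VMM stage is a factor of about 1.0, so the total is the mmq lift alone.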
The mmq lift on its own leaves decode roughly unchanged (mmq is matrix-shaped — at n_tokens=1 there's nothing to tile across). The decode gain comes from a separate piece of work: vendoring llama.cpp's mmvq vector-matmul family and routing the n_tok=1 routed-MoE and dense Q8_0 attention projection paths through it (Step 6 of the mmq optimisation plan).
| Stage | Dispatch path | Gen t/s @ ctx=2048 | vs upstream |
|---|---|---|---|
| Upstream baseline | cuBLAS + Q8→FP16 expansion + legacy fused decode | ~38.0 | — |
| mmq prefill, legacy decode | USE_MMQ=1, NO_MMVQ_DECODE=1 | ~39.5 | +3.9% |
| mmq + mmvq decode (prior default) | USE_MMQ=1, mmvq decode on | 43.2 | +13.7% |
| + CUDA graphs with stream sync (current default) | auto, opt-out via DS4_CUDA_MOE_GRAPHS=0 | 44.2 | +16.3% |
The arena→VMM transition is a memory-layout change for the weights, not a kernel change — decode is bandwidth-bound, so VMM should be neutral on it. Verified:
| Arch | gen_tps_ss arena | gen_tps_ss VMM | Δ |
|---|---|---|---|
| PRO 6000 (ctx=2048, gen=128) | 43.21 | 43.44 | +0.5% |
| PRO 6000 (ctx=32768, gen=128) | 37.77 | 38.05 | +0.7% |
| GB10 (ctx=2048, gen=64) | 14.11 | 14.12 | +0.1% |
| GB10 (ctx=16384, gen=64) | 13.52 | 13.49 | −0.2% |
Single-run CLI numbers for both CUDA targets alongside the existing Mac entries in the project README, with identical settings on each: --ctx 32768 --nothink --temp 0 -n 256, q2 quant, long prompt = first ~40 kB of speed-bench/promessi_sposi.txt (12,461 tokens on the V4 Flash tokenizer). Mac numbers are unchanged from the existing table; the two CUDA rows are the fresh measurements taken from the same build the rest of this report describes (mmq + mmvq + graphs with bidirectional stream sync + in-process VMM arena, all default-on).
| Run | Prefill mean | Gen mean | Gen first frontier | Gen last frontier (38912) |
|---|---|---|---|---|
| no-MTP (showcase baseline) | 342.20 t/s | 12.68 t/s | 14.00 t/s | 11.63 t/s |
| exact MTP, draft=2 experimental | 341.02 t/s | 12.62 t/s | 12.53 t/s | 11.68 t/s |
| Variable | Default | Effect |
|---|---|---|
| DS4_CUDA_PREFILL_PATH | auto → mmq | Q8_0 dispatch: mmq / cublas / warp8 / auto. Explicit override. |
| DS4_CUDA_USE_MMQ | unset | Legacy alias: 0 = cublas. Lower precedence than DS4_CUDA_PREFILL_PATH. |
DS4_CUDA_MMQ_MOE_MIN_TOKENS | 2 | Minimum n_tokens at which routed-MoE uses mmq matrix-shaped path. |
DS4_CUDA_MMQ_X_MAX | unset (128) | Diagnostic: clip get_mmq_x_max_host to N. |
DS4_CUDA_NO_MMVQ_DECODE | unset | Opt-out of mmvq for n_tok=1 decode (routed-MoE and dense Q8_0). |
DS4_CUDA_MMVQ_DECODE_MAX_TOKENS | 1 | Cap on n_tokens routed through mmvq decode branch (0–8). |
DS4_CUDA_MOE_GRAPHS | ON | CUDA Graph capture/replay for routed-MoE decode and dense Q8_0 vec, with bidirectional stream sync. Opt-out via 0. |
DS4_CUDA_MTP_VERIFIER_USE_MMQ | unset / 0 | Bisection switch: 1 reproduces the broken mmq-in-verifier behavior. |
DS4_CUDA_VMM_ARENA | enabled | 0 disables in-process VMM allocator (escape hatch). |
DS4_CUDA_VMM_ARENA_CHUNK_MB | 0 (request-size) | Force a minimum chunk size per cuMemCreate; rarely needed. |
DS4_CUDA_WEIGHT_IPC_MANIFEST | unset | Worker-side: import weights from the manifest path. Hard-gates in-process VMM off. |
| DS4_CUDA_MTP_TOP2 / DS4_CUDA_MTP_VERIFY_TOP2 | unset | Enable CUDA top-2 draft + verifier shortcut. |
| DS4_MTP_CERT_LOGITS / DS4_MTP_CERT_LOGITS_SHADOW | unset | Opt-in row-0 certificate / shadow validator. |
| DS4_MTP_STRICT (= --quality) | unset | Force byte-identical target stream behavior. |
GPU API additions (ds4_gpu.h):
ds4_gpu_tensor_alloc_managed: managed-memory tensor for KV in the unified-memory path.
ds4_gpu_top2_result, ds4_gpu_candidate_cert_result: structured returns for MTP top-2 and certificate verifier.
ds4_gpu_set_model_map_range, ds4_gpu_import_model_ipc_manifest, ds4_gpu_set_model_fd, ds4_gpu_cache_model_range, ds4_gpu_cache_q8_f16_range: weight-server import surface.
ds4_gpu_should_use_managed_kv_cache: KV-cache placement policy hook.
Metal counterparts: ds4_metal.m added 333 lines, mostly to satisfy the GPU API surface.
cuBLAS path deleted (1bbacf6) then restored (961e57a) so cublas remains a working fallback. The dispatcher selects it on mmq init failure and the bench harness can opt into it for A/B.
DS4_CUDA_MOE_GRAPHS resolved. The cross-stream race that motivated the b66b5d6 revert is now diagnosed and fixed (commits 687c783 + 7967154). Two race legs across the g_moe_stream / stream=0 boundary (pre-launch input read and post-launch output read) are closed by cudaEventRecord + cudaStreamWaitEvent brackets around every cudaGraphLaunch. Validated on both targets: smoke parity ON vs OFF bit-identical; MTP-active output coherent on GB10 (was previously garbled); ds4-bench gen positive on PRO 6000 in the committed sweep, flat within run-to-run noise on GB10 (LPDDR5X bandwidth-bound). Default flipped back to ON; opt-out via DS4_CUDA_MOE_GRAPHS=0. The sync overhead trades some parallelism for correctness, so the gain is smaller than the pre-revert benchmark and the GB10 gain in particular does not survive into the committed sweep.
Vendor pin: llama.cpp at 5c0e9468 (2026-05-14). The two upstream files we patched (mmvq.{cu,cuh}) need rebasing on re-sync. cuda/mmq/VENDOR.md documents the procedure but the cost is real; we should plan re-syncs at most quarterly unless an upstream fix is load-bearing.
warp8 dispatch is degenerate on GB10. The fallback runs at 56 t/s on GB10 (vs 458 t/s mmq). It's the right correctness baseline for the MTP verifier (Option D) but anyone manually selecting warp8 outside the verifier on Spark will be very surprised. The dispatcher logs the choice once, which helps, but the env-var docs should flag this more loudly.
The fused moe_pair_vec entry exists but isn't wired (it applies silu without clamp; the V4 Flash clamp epilogue requires the two-call form).
./ds4_test --logprob-vectors ships with 7 pre-existing long_memory_archive failures. We verified by building at the upstream merge-base c9dd9499 (no mmq, no VMM, no graphs, no MTP enhancements, none of our 81 commits applied) and running the same test against the same q2-imatrix GGUF: 7 failures, identical pattern. Most plausibly q2-quantisation noise on a 16k-token needle-in-haystack retrieval task that the official cloud FP API solves but the local q2 doesn't. Not introduced by this PR; upstream territory.
The work is too large for one PR. A natural split:
| # | PR title | Scope | Risk | Reviewer ask |
|---|---|---|---|---|
| 1 | cuda: vendor llama.cpp mmq + adapter + parity tests | cuda/mmq/ + Makefile + cuda/mmq/VENDOR.md | low (additive, default off behind DS4_CUDA_USE_MMQ=1 on the first PR if desired) | vendor-pin policy — licensing already settled: both MIT, ds4 LICENSE credits "The ggml authors" |
| 2 | cuda: route Q8_0 / Q4_K / IQ2_XXS / Q2_K dispatch through mmq | dispatcher in ds4_cuda.cu + env vars + AGENT.md table | medium (perf shift) | bench reproducibility on at least sm_120 + one other arch |
| 3 | cuda: in-process VMM weight arena | 5 commits, ds4_cuda.cu + AGENT.md | low (gated, falls back to legacy on probe failure) | portability across CUDA versions |
| 4 | tools: ds4_weight_server + broker + import API | tools/ds4_weight_server.cu, ds4_gpu.h, AGENT.md | medium (new binary, FD-over-socket protocol) | operator UX, lock file semantics |
| 5 | tests: generalized engine proof harness | tests/ds4_proof.py + docs/proof-harness | low (test-only) | scope of contract + budget naming |
| 6 | mtp: exact verifier + acceptance-history gating + session-snapshot accept-state | ds4.c, ds4_cuda.cu, docs/cuda-mtp | medium (default-on behavior change, but acceptance-gated) | strict-mode equivalence proof |
| 7 | cuda: Option D MTP-verifier kernel routing | tiny ds4.c bracketing + ds4_cuda.cu dispatcher hook | low (default-on) | none beyond #6 |
| 8 | bench/server: steady-state gen_tps + token_ids SSE + session-path no-MTP baselines | ds4_bench.c, ds4_cli.c, ds4_server.c | low (additive) | SSE wire-format match with vLLM |
101831c cuda: vendor llama.cpp's mmq kernel family for fused dequant matmul
06747ea cuda: implement ds4_mmq_q8_0_dense and parity-test against CPU reference
c7e0b8c cuda: add ds4_mmq_q2_K_dense and ds4_mmq_iq2_xxs_dense
0bf8040 cuda: add MoE _id matmul wrappers (ds4_mmq_{q8_0,q2_K,iq2_xxs}_moe)
39d3877 cuda: route Q8_0 dense matmuls through cuda/mmq when DS4_CUDA_USE_MMQ=1
a56e07a cuda: route IQ2_XXS/Q2_K routed-MoE through cuda/mmq when DS4_CUDA_USE_MMQ=1
09b38f2 cuda: gate mmq routed-MoE on n_tokens >= 2 to preserve decode throughput
046f02e docs: AGENT.md and cuda/mmq/VENDOR.md - phase 7 lock and env-var inventory
944482d cuda: flip DS4_CUDA_USE_MMQ to opt-out - cuda/mmq is now the default
298022f tests: sprint 0 bench harness - variance <1% at ctx=2048 on PRO 6000
1bbacf6 cuda: step 1 - delete legacy Q8->FP16 expansion cache and cuBLAS path
af99da7 cuda: step 2 - wire Q4_K dense and MoE through cuda/mmq
387e58d cuda: step 3 - paired moe API to share Q8_1 activation across gate+up
66fa20f cuda: step 4 - mmq_x_max sweep diagnostic hook (default unchanged)
380f9bf cuda: step 6 partial - vendor mmvq.{cu,cuh} and unary.cuh
71ca87d cuda: step 6.a/b - vec ABI + parity-tested impls for mmvq
4425ed0 cuda: step 6.c/d - wire mmvq decode into routed_moe and dense Q8_0
abe2657 cuda: step 8 - CUDA Graph capture+replay for routed-MoE decode
2200c67 cuda: step 8.2 - graphs for dense Q8_0 vec + flip default to ON
b66b5d6 cuda: revert DS4_CUDA_MOE_GRAPHS default to OFF (R1, stopgap)
687c783 cuda: bracket every captured cudaGraphLaunch with pre+post stream sync
7967154 cuda: re-enable DS4_CUDA_MOE_GRAPHS by default after sync-fix validation
8df4b2a cuda: route MTP verifier matmuls to legacy kernels (Option D)
961e57a Revert "cuda: step 1 - delete legacy Q8->FP16 expansion cache and cuBLAS path"
90fec89 cuda: pick Q8_0 strategy by device memory bandwidth at startup
38572f9 cuda: query memoryClockRate via cudaDeviceGetAttribute (CUDA 13 compat)
6658bde cuda: simplify Q8_0 dispatch to mmq-default with explicit overrides

11e37c2 cuda: probe + supported() helper for in-process VMM arena
2e1f4f6 cuda: VMM-backed weight arena allocator + teardown
65e5045 cuda: route fd-cache through the VMM arena when supported
6a89ea5 cuda: VMM arena chunks default to request-size, not 1024 MiB
ba42c13 docs: in-process VMM arena is the default for single-process runs

87d0a60 weight-server: probe VMM backend support
53fbfd4 weight-server: allocate and upload VMM-owned model ranges
71e803d weight-server: broker VMM allocation file descriptors
48d9e49 cuda: import VMM weight ranges from the broker
e7f7ce1 weight-server: use direct I/O for VMM uploads
e7edd96 Bind CUDA fd-cache to its owning model_map
5076c07 cuda: fix merged fd-cache owner declaration
5b48096 Pre-cache MTP model tensors at CUDA startup
414c790 cuda: let weight server own derived verifier artifacts
6f271e9 cuda: import prebuilt Q8 expansion artifacts
c76edf4 docs: add weight-server guidance to AGENT.md
0acb50e proof: add weight server harness smoke test

08d8fa1 proof: generalize engine proof runner
f2f424e proof: add CUDA weight server lifecycle
76cbc9a proof: support scoped weight server imports
5a6d3ed proof: record weight server telemetry
0b1370d proof: guard weight server parent lifetime
7268f68 proof: lock CUDA weight server ownership
cda4b4d proof: reject stale weight manifests
7536533 proof: report weight server validation verdict
015be9a proof: validate VMM weight-server runs
9466a03 proof: add named budgets and steady-state timing reports
5153bfb proof: report steady-state generation throughput
23cd345 proof: report MTP acceptance as a first-class metric
59cd8d2 proof: expose derived weight artifacts in harness
600e2e8 mtp: add CUDA proof matrix
dbeeec9 proof: default shared weights to base without mtp

ec2dd7c cuda: add MTP top2 verifier paths
2141a37 mtp: reduce exact verifier output scheduling
6b9fe82 cuda: fuse MTP verifier MoE down sum for two tokens
3cad23c cuda: batch Q8 pair projections for MTP verifier
c5a0c36 cuda: pair decode Q and KV projections
143623f cuda: write attention output A directly to low layout
0ccbfcf mtp: add shadow check for custom attention output B
2fcdeb6 cuda: keep Q8 pair batching on verifier-sized prefill
0e85718 cuda: bypass sorted MoE gate setup for two-token verify
7eeed1f proof: make no-opt verifier output the fast MTP baseline
4be36f8 mtp: suppress draft probes when speculation is disabled
9b9755a mtp: account full speculative cycle timing
362bb32 mtp: introduce explicit verifier plan results
c6d1487 mtp: add fixed-depth decode3 verifier shadow
5dbed76 mtp: capture two verifier prefix depths
8f3216e mtp: make raw draft state commits explicit
bbf61e5 proof: add verifier v2 shadow profile
462b613 mtp: add opt-in decode3 verifier path
18c3f59 proof: expose verifier v2 active candidate
bda1ebb mtp: isolate batch-first verifier experiments
31cb5e0 proof: add batch-first verifier profile
aad1401 mtp: make exact verification the default
d206b04 mtp: prefer sequential exact suffix verification
d71ee5e mtp: select exact verifier from acceptance history
f72ba58 mtp: gate exact speculation on verified acceptance
e07e014 mtp: trim exact decode2 output-head work
a5af40d mtp: route exact decode2 through a layer entrypoint
3032763 mtp: profile exact decode2 layer replay
9ab947e mtp: add shadowed candidate output logits
614b5a8 mtp: add opt-in certified row0 logits verifier
3c63d0d mtp: tighten exact candidate output verification
4b8518b mtp: add exact pair output projection verifier
f6051da mtp: default exact pair output verification on CUDA
68166ad mtp: add exact state-barrier decode2 verifier transaction
d504a89 mtp: add decode2 batch FFN stage-diff diagnostics
1fb3c93 mtp: pair exact decode2 attention output
e56b714 mtp: remove staging copies from paired decode2 attention
6a5714b mtp: isolate decode2 routed MoE batch exactness
af5cd0a mtp: batch routed MoE inside exact decode2 FFN
0b99bea mtp: add scalar-order n2 Q8 for exact FFN body batching
a1f4723 mtp: write exact batched FFN HC rows directly
7a20ab7 mtp: write exact FFN prefix rows in place
454f539 mtp: avoid paired Q8 gate-up in exact FFN body
87e5e39 mtp: add exact decode2 Q-path batch replay diagnostic
d627216 mtp: add compressor projection replay diagnostic
ce61a50 mtp: prefer decode2 for optimized exact full-body verifier
6a86411 mtp: serialize accept-gate state in session snapshot
9c99333 Add CUDA output-head verifier microbenchmark

d4ab17f cli: allow session-path no-MTP timing baselines
5153bfb proof: report steady-state generation throughput
9466a03 proof: add named budgets and steady-state timing reports
d202148 bench: compare no-MTP and exact MTP throughput
f4ca519 bench: report steady-state gen_tps alongside total
b79c2d7 server: emit token_ids in chat SSE deltas when return_token_ids=true
4698319 server: place token_ids at choice level to match vLLM/llama-benchy
904a0c4 server: snap SSE emission limit to token boundary when return_token_ids