MiniMax-M3 on DeepSWE

Pass@1 of the MiniMax-M3 coding model on the 113-task DeepSWE long-horizon software-engineering benchmark, run provider-direct through the unmodified mini-swe-agent harness.
MiniMax-M3 · 113 tasks · pass@1 / k=1 · mini-swe-agent harness · MiniMax-direct API · air-gapped sandboxes · run 2026-06-02
pass@1 strict (15/113): 13.3%
extended cap (19/113): 16.8%
median agent steps: 325
median output tokens: 80k
median cost / task: $7.48
languages: 5

DeepSWE is a behavior-verified benchmark of 113 real open-source feature requests across five languages. Each task ships hidden tests; a solution scores reward = 1 only if it makes the new-behavior tests pass without breaking the repository's pre-existing suite. This page reports MiniMax-M3's single-attempt (pass@1) performance and where it lands among published models.
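The scoring rule above can be sketched in a few lines. This is an illustrative sketch, not DeepSWE's verifier: the `VerifierResult` type and its field names are hypothetical, and it assumes that "pass" means every hidden new-behavior test passes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerifierResult:
    """Hypothetical summary of one verifier run (field names are illustrative)."""
    new_tests_passed: int
    new_tests_total: int
    base_suite_intact: bool  # True if the pre-existing suite still passes


def reward(result: VerifierResult) -> float:
    """Return 1.0 only if all new-behavior tests pass and the base suite is intact."""
    all_new_pass = (
        result.new_tests_total > 0
        and result.new_tests_passed == result.new_tests_total
    )
    return 1.0 if all_new_pass and result.base_suite_intact else 0.0


# A patch that passes every new test but breaks the base suite scores 0.
print(reward(VerifierResult(10, 10, base_suite_intact=False)))
```

Under this rule a task that passes 99% of its new tests scores the same as one that passes none, which is why the page reports a separate near-miss analysis.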

Headline result

13.3% pass@1 (strict): 15 of 113 tasks solved within the canonical 90-minute working-time budget. This is the directly comparable figure.
16.8% pass@1 (extended): 19 of 113, including 4 passes that the agent reached only after running past 90 minutes of working time (see below). Reported for transparency, not claimed as clean solves.
Why two numbers. To absorb provider rate-limit (429) back-off without losing work, the agent's wall-clock limit was set to 1.5× the 90-minute budget (135 minutes), and the real budget was re-imposed at scoring as working time = wall-clock time − throttle wait. That cleanly credits throttled tasks, but a handful of un-throttled tasks used the slack as genuine extra compute. Four passes (arcane-drift-detection-baselines, dynamodb-toolbox-lazy-recursive-schemas, helm-unified-manifest-stream, httpx-deterministic-cookie-store) had more than 90 minutes of working time, meaning the agent was still editing source past the budget, so the strict figure excludes them.
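The distinction between the two figures can be sketched as follows. This is a minimal illustration, not the run's scoring code: the `Trial` record and its field names are hypothetical, and the two example trials are made up to show one throttled task and one over-budget task.

```python
from dataclasses import dataclass

BUDGET_MIN = 90.0            # canonical working-time budget
WALL_MIN = 1.5 * BUDGET_MIN  # agent wall-clock limit with throttle head-room


@dataclass(frozen=True)
class Trial:
    """Hypothetical per-task record; field names are illustrative."""
    wall_min: float      # elapsed wall-clock minutes for the agent
    throttle_min: float  # logged 429 back-off minutes
    reward: float        # verifier reward, 1.0 or 0.0


def working_min(trial: Trial) -> float:
    """Working time = wall-clock time minus throttle wait."""
    return trial.wall_min - trial.throttle_min


def is_extended_pass(trial: Trial) -> bool:
    """Any verified pass inside the 1.5x wall counts."""
    return trial.reward == 1.0


def is_strict_pass(trial: Trial) -> bool:
    """A verified pass counts only if working time stayed within the budget."""
    return trial.reward == 1.0 and working_min(trial) <= BUDGET_MIN


# Made-up examples: a throttled task and an un-throttled, over-budget task.
throttled = Trial(wall_min=120.0, throttle_min=40.0, reward=1.0)
over_budget = Trial(wall_min=110.0, throttle_min=0.0, reward=1.0)

print(is_strict_pass(throttled), is_strict_pass(over_budget))

# The headline percentages follow from the reported counts.
print(round(100 * 15 / 113, 1), round(100 * 19 / 113, 1))
```

The throttled trial keeps its strict pass because only 80 of its 120 minutes were working time, while the over-budget trial counts only toward the extended figure.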
Read this as: “MiniMax-M3 on DeepSWE via the mini-swe-agent harness, MiniMax-direct API.” An independent measurement, not an official DeepSWE leaderboard entry — see Comparability & limits.

Where M3 lands

DeepSWE's published pass rates (± 95% CI) from deepswe.datacurve.ai, with MiniMax-M3 inserted at its strict rank. M3's row is our independent measurement; all others are Datacurve's reported figures.

Model | 95% CI | pass@1
gpt-5.5 [xhigh] | ± 4% | 70%
gpt-5.4 [xhigh] | ± 5% | 56%
claude-opus-4.7 [max] | ± 5% | 54%
claude-sonnet-4.6 [high] | ± 4% | 32%
gemini-3.5-flash [medium] | ± 4% | 28%
gpt-5.4-mini [xhigh] | ± 4% | 24%
kimi-k2.6 | ± 4% | 24%
mimo-v2.5-pro | ± 4% | 19%
glm-5.1 | ± 4% | 18%
MiniMax-M3 [default] (this run, independent) | n/a | 13.3%
gemini-3.1-pro | ± 3% | 10%
deepseek-v4-pro | ± 2% | 8%
gemini-3-flash | ± 2% | 5%

Methodology

Harness

agent: mini-swe-agent (unmodified loop)
orchestrator: Pier (Harbor task format)
model id: MiniMax-M3
endpoint: MiniMax Anthropic-compatible API (provider-direct)
k: 1 attempt per task (pass@1)
cost limit: unlimited (no per-task token/reasoning cap)

Environment & budget

sandbox: Modal, air-gapped (no internet)
concurrency: ~36 tasks across 3 MiniMax keys
agent budget: 90 min working-time (enforced at scoring)
agent wall: 135 min (1.5×, for throttle head-room)
verifier budget: 30 min wall-clock
reward: 1.0 iff new tests pass & base suite intact

429 handling is the one harness deviation, and it is kept minimal: rate-limited model calls are retried in place (the trajectory is preserved, no work lost), and the exact back-off time is logged and subtracted from wall-clock time so the model is judged against a true 90-minute working budget. DeepSWE publishes no canonical reasoning-token budget, so we run mini-swe-agent at its defaults, following the leaderboard convention of calling each model through its own provider.
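A minimal sketch of retry-in-place with logged back-off, assuming an exponential back-off schedule (the page does not state the schedule actually used). `RateLimited`, `ThrottleLog`, and `call_with_backoff` are hypothetical names for illustration; they are not part of mini-swe-agent or the MiniMax API.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for a provider HTTP 429 response."""


class ThrottleLog:
    """Accumulates back-off time so it can be subtracted at scoring."""

    def __init__(self) -> None:
        self.waited_s = 0.0


def call_with_backoff(
    call: Callable[[], T],
    log: ThrottleLog,
    base_delay_s: float = 1.0,
    max_delay_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Retry the same call until it succeeds, logging measured wait time.

    The caller's state (the agent trajectory) is untouched between retries.
    This sketch retries without a cap; a real harness would likely bound it.
    """
    delay = base_delay_s
    while True:
        try:
            return call()
        except RateLimited:
            started = clock()
            sleep(delay)
            log.waited_s += clock() - started  # measured, not nominal, wait
            delay = min(delay * 2, max_delay_s)
```

Passing `sleep` and `clock` as parameters keeps the sketch testable without real waiting. The accumulated `waited_s` is the quantity that would be subtracted from wall-clock time at scoring.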

Results by language

Language | Tasks | Passed | pass@1
typescript | 35 | 6 | 17.1%
go | 34 | 4 | 11.8%
python | 34 | 3 | 8.8%
rust | 5 | 0 | 0.0%
javascript | 5 | 2 | 40.0%

Strict pass@1 by language. Counts are small per language, so treat these as directional rather than precise.
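To see how wide the uncertainty is at these sample sizes, one option is a Wilson score interval per language. This is a sketch under the assumption that a 95% Wilson interval is an acceptable summary; it is not the method behind the confidence intervals on Datacurve's leaderboard, which this page does not describe. The counts are taken from the table above.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# (passed, tasks) per language, from the strict results table.
strict_by_language = {
    "typescript": (6, 35),
    "go": (4, 34),
    "python": (3, 34),
    "rust": (0, 5),
    "javascript": (2, 5),
}

for lang, (passed, total) in strict_by_language.items():
    lo, hi = wilson_interval(passed, total)
    print(f"{lang:<11} {passed}/{total}  {100 * lo:.1f}% to {100 * hi:.1f}%")
```

For javascript, 2 of 5 gives an interval of roughly 12% to 77%, which illustrates why the 40.0% figure should not be read as a precise estimate.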

Failure modes

Every scored task is classified by outcome. Notably, M3 rarely fails by going silent: it almost always submits a patch (very few empty-patch trials), and most losses are precision misses on the new behavior, not give-ups.

pass: 19 (reward = 1: new behavior verified, base suite intact; this count matches the extended figure, so it includes the 4 over-budget passes)
correctness: 74 (submitted a patch that failed the new-behavior tests)
regression: 8 (patch broke the pre-existing base test suite)
timeout: 12 (ran the full agent budget without converging, and failed)

How close were the misses?

A raw pass/fail rate hides how close a failed attempt came. Each task's hidden suite has a different size (from 1 test to 100+), so we normalize: % of new-behavior tests passing. Of the 94 non-passes:

≥90% passing (near-miss): 35
1–89% passing (partial): 34
0% passing (total miss / build failure): 25

Median across all non-passes: 79% of new tests passing. Many "fails" are a handful of edge-case tests short of a full solve; a separate cluster are total misses where the patch didn't compile against the new suite.
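The normalization and bucketing can be sketched as below. This is an illustration under one assumption: percentages are rounded to whole numbers before bucketing, which matches how they are displayed on this page but is not confirmed as the authors' rule. The function names are hypothetical.

```python
def new_test_pass_pct(passed: int, total: int) -> int:
    """Share of hidden new-behavior tests passing, as a whole percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * passed / total)


def bucket(pct: int) -> str:
    """Map a non-passing task's percentage to the three buckets in the text."""
    if pct >= 90:
        return "near-miss"
    if pct >= 1:
        return "partial"
    return "total miss"


# Percentages as shown on this page for three non-passing tasks.
examples = {
    "fd-deterministic-multi-key-sorting": 98,
    "vulture-persistent-analysis-cache": 79,
    "abs-stepped-slices": 0,
}
for task, pct in examples.items():
    print(f"{task}: {pct}% -> {bucket(pct)}")

# The reported bucket sizes account for every non-pass.
assert 35 + 34 + 25 == 94
```

Normalizing by suite size makes a task with 1 hidden test and a task with 100 comparable on the same 0 to 100 scale.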

Efficiency

Each orange/green dot is one of the 113 tasks: x = the resource used, y = % of new tests it passed (green = scored a strict pass). Blue points are the few peer anchors Datacurve published, plotted at their model-level pass rate. M3 is strikingly token-hungry and iterative — a median 80k output tokens and 325 agent steps per task, well above frontier models that score higher with less.

Score vs. output tokens

[Figure: scatter plot, one point per task. x-axis: median output tokens per task (thousands), 0k to 300k. y-axis: % new tests passing, 0% to 100%. Peer anchors shown: gpt-5.5, gpt-5.4, claude-opus-4.7.]

Score vs. duration

[Figure: scatter plot, one point per task. x-axis: working duration (min, throttle-excluded), 0m to 150m. y-axis: % new tests passing, 0% to 100%. Peer anchors shown: gpt-5.5, gemini-3.5-flash.]

Score vs. synthesized cost (PAYG list price)

[Figure: scatter plot, one point per task. x-axis: synthesized cost per task (USD, PAYG list price), $0 to $30. y-axis: % new tests passing, 0% to 100%. Peer anchors shown: gpt-5.4, gpt-5.5.]

Cost synthesized from token counts at MiniMax-M3's standard list price ($0.60/M input, $0.12/M cache-read, $2.40/M output) — the actual run used a flat-rate subscription. Median $7.48/task; M3's heavy context re-reads dominate the bill even at a low per-token rate. Peer anchors are Datacurve's published per-trial costs.
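The cost synthesis is plain arithmetic over token counts and can be sketched as follows. The list prices are the ones quoted above; the input and cache-read token counts in the example are made up for illustration and are not figures from this run (only the 80k output-token median comes from the page).

```python
# List prices in USD per million tokens, as quoted in the text.
PRICE_PER_M = {"input": 0.60, "cache_read": 0.12, "output": 2.40}


def synthesized_cost_usd(input_tokens: int, cache_read_tokens: int, output_tokens: int) -> float:
    """Pay-as-you-go cost implied by token counts at list price."""
    return (
        input_tokens * PRICE_PER_M["input"]
        + cache_read_tokens * PRICE_PER_M["cache_read"]
        + output_tokens * PRICE_PER_M["output"]
    ) / 1_000_000


# Hypothetical task: input and cache-read counts are illustrative only.
cost = synthesized_cost_usd(
    input_tokens=2_000_000,
    cache_read_tokens=40_000_000,
    output_tokens=80_000,
)
print(f"${cost:.3f}")
```

With these made-up counts the total is $6.192, of which output tokens contribute only $0.192. That is the pattern the paragraph describes: at 80k output tokens the output charge is small, so a multi-dollar bill has to come from input and cache-read volume.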

Peer efficiency points are sparse and partly estimated on Datacurve's page; treat them as rough context, not a like-for-like overlay. Duration uses throttle-excluded working time.

Task explorer

Outcome | Task | Repo | Lang | New tests | Steps | Patch
pass | Harden module loading, cache introspection, and script flags | abs-lang/abs | go | 100% | 326 | 6f / +724
pass | Add ShapeIndex encoding and decoding | golang/geo | go | 100% | 247 | 10f / +1092
pass | Add multiplexed ordered streams over KCP | xtaci/kcp-go | go | 100% | 193 | 4f / +2138
pass | Add go:embed directive support for interpreted packages | traefik/yaegi | go | 100% | 580 | 16f / +1360
pass | Partition report files by launcher and expand report templates | testem/testem | javascript | 100% | 239 | 12f / +825
pass | Add deterministic map conflict detection to Y.Map writes | yjs/yjs | javascript | 100% | 210 | 7f / +575
pass | Add task snapshots, inspection, and diffing to aiomonitor | aio-libs/aiomonitor | python | 100% | 153 | 14f / +1630
pass | Persist the fitted feature schema across evaluate, predict, serve, and export | nidhaloff/igel | python | 100% | 257 | 5f / +1334
pass | Add bidirectional TOML table converters | python-poetry/tomlkit | python | 100% | 398 | 4f / +893
pass | Add dependency-aware async initialization to the container | jeffijoe/awilix | typescript | 100% | 276 | 5f / +1020
pass | Add typed window function builders with OVER clauses | drizzle-team/drizzle-orm | typescript | 100% | 247 | 27f / +1904
pass | Abort pending body reads on shutdown | capricorn86/happy-dom | typescript | 100% | 371 | 6f / +405
pass | Preserve restored query state in persisted snapshots | TanStack/query | typescript | 100% | 326 | 7f / +1151
pass | Format BigQuery pipe syntax queries correctly | sql-formatter-org/sql-formatter | typescript | 100% | 270 | 11f / +432
pass | Add duration-aware sharding to Vitest | vitest-dev/vitest | typescript | 100% | 394 | 24f / +1872
pass (over-budget) | Add drift detection and compliance baselines | getarcaneapp/arcane | go | 100% | 388 | 16f / +2067
pass (over-budget) | Add unified manifest stream output across Helm commands | helm/helm | go | 100% | 599 | 37f / +1413
correctness | Fix isolated Go-side calls for Tengo callables and closures | d5/tengo | go | 100% | 370 | 4f / +531
timeout | Add tube multiplexing to pwntools | Gallopsled/pwntools | python | 100% | 331 | 6f / +2239
pass (over-budget) | Add lazy recursive schemas with DTO and JSON Schema export | dynamodb-toolbox/dynamodb-toolbox | typescript | 100% | 473 | 51f / +1141
pass (over-budget) | Add a deterministic CookieStore with modern Set-Cookie parsing | encode/httpx | typescript | 100% | 317 | 8f / +821
correctness | Add CSS Grid layout to the Box component | vadimdemedes/ink | typescript | 100% | 281 | 5f / +1000
correctness | Add conditional option dependencies to Optique | dahlia/optique | typescript | 100% | 399 | 4f / +1699
regression | Add duration encoding to TableVectorizer | skrub-data/skrub | python | 99% | 306 | 8f / +993
correctness | Add single-active-consumer priority and cancel tracking to virtual transports | celery/kombu | python | 99% | 292 | 3f / +812
correctness | Add deprecation, sunset, and successor headers to FastAPI routes | fastapi/fastapi | python | 98% | 290 | 21f / +1349
correctness | Add an error-accumulating Validated container | dry-python/returns | python | 98% | 197 | 12f / +1302
correctness | Add `\multicolumn` column spans to array-like environments | KaTeX/KaTeX | javascript | 98% | 331 | 7f / +678
correctness | Add deterministic multi-key sorting to fd | sharkdp/fd | rust | 98% | 323 | 6f / +1004
correctness | Add explicit resource management declarations to the parser | meriyah/meriyah | typescript | 98% | 347 | 11f / +1324
correctness | Add multipart response parsing to HTTPX | encode/httpx | python | 98% | 231 | 6f / +789
correctness | Add conditional required attributes to schemas | dynamodb-toolbox/dynamodb-toolbox | typescript | 97% | 406 | 43f / +984
correctness | Add incremental cache controls to Bandit | PyCQA/bandit | python | 97% | 160 | 7f / +1447
correctness | Add interprocedural taint checks for Bandit injection sinks | PyCQA/bandit | python | 96% | 364 | 22f / +1562
correctness | Add XML diff, patch, and merge operations to etree | beevik/etree | go | 96% | 267 | 4f / +2093
correctness | Coalesce qualifying choices into character classes | pest-parser/pest | rust | 96% | 331 | 7f / +845
correctness | Add multi-module memory snapshots to wazero | wazero/wazero | go | 96% | 129 | 11f / +1471
correctness | Restore RichLog follow-state parity and expand reflow behavior | Textualize/textual | python | 96% | 294 | 3f / +436
correctness | Add partial structuring with error recovery to cattrs | python-attrs/cattrs | python | 96% | 394 | 6f / +879
correctness | Add value-based query predicates to Koota | pmndrs/koota | typescript | 95% | 530 | 24f / +1633
correctness | Add HTML document format handling to Dasel | TomWright/dasel | go | 94% | 250 | 9f / +2048
timeout | Add a per-origin circuit breaker to ofetch | unjs/ofetch | typescript | 94% | 123 | 3f / +469
timeout | Add hierarchical evaluation cancellation to Boa | boa-dev/boa | rust | 94% | 552 | 10f / +1324
correctness | Add input key aliases to name mapping | reagento/adaptix | python | 94% | 394 | 10f / +705
correctness | Format CREATE TABLE DDL and add DDL parsing helpers | tconbeer/sqlfmt | python | 94% | 409 | 9f / +1256
correctness | Implement a deterministic IntersectionObserver in Happy DOM | capricorn86/happy-dom | typescript | 93% | 203 | 3f / +837
correctness | Add link format conversion between wiki and markdown syntax | platers/obsidian-linter | typescript | 93% | 277 | 4f / +7678
correctness | Add durability callbacks and wait APIs for sync writes | cockroachdb/pebble | go | 93% | 442 | 11f / +1134
correctness | Add destructuring bindings to Tengo | d5/tengo | go | 93% | 512 | 12f / +1010
correctness | Add JSON Schema refs and dependency keywords | arktypeio/arktype | typescript | 93% | 529 | 19f / +1317
correctness | Validate daemon watch, status, and log lifecycle | jkwill87/mnamer | python | 92% | 187 | 7f / +1797
correctness | Add entity snapshot and rollback APIs to Koota | pmndrs/koota | python | 91% | 279 | 16f / +1421
regression | Add flattened dataclass fields to Mashumaro field options | Fatal1ty/mashumaro | python | 90% | 324 | 5f / +1011
timeout | Add task graph export with JSON, DOT, and text output | go-task/task | go | 90% | 500 | 9f / +1059
correctness | Add JSONPath query APIs to orderedmap and Starlark modules | carvel-dev/ytt | go | 89% | 265 | 8f / +1989
regression | Add rolling min, max, median, and quantile methods | narwhals-dev/narwhals | python | 88% | 430 | 15f / +1828
correctness | Add streaming JSON iteration to HTTPX responses | encode/httpx | python | 88% | 277 | 7f / +983
regression | Add request coalescing to `Runnable` | langchain-ai/langchain | python | 88% | 179 | 5f / +1571
correctness | Add shorthand expansion and compression to the lexer | csstree/csstree | javascript | 87% | 324 | 6f / +1742
correctness | Add RFC 5545 timezone interoperability to dateutil recurrence parsing | dateutil/dateutil | python | 87% | 400 | 4f / +638
correctness | Add composite trait aspects to Koota | pmndrs/koota | typescript | 86% | 526 | 24f / +1790
correctness | Add iterable collection combinators to true-myth | true-myth/true-myth | typescript | 86% | 229 | 8f / +2085
correctness | Implement recursive agent delegation through delegate_task tool calls | baryhuang/claude-code-by-agents | typescript | 86% | 170 | 7f / +2754
regression | Add grouped test phases with synchronized barriers | google/mobly | python | 86% | 378 | 2f / +863
correctness | Add automatic table of contents generation for Obsidian linter | platers/obsidian-linter | typescript | 83% | 303 | 4f / +7696
correctness | Add a persistent analysis cache to Vulture | jendrikseipp/vulture | python | 79% | 290 | 6f / +854
correctness | Add SSE streaming endpoints to HttpApi | Effect-TS/effect | typescript | 79% | 504 | 10f / +919
timeout | Add atomic signal selectors to Kea | keajs/kea | typescript | 79% | 336 | 33f / +1743
timeout | Add method declarations and interface dispatch to Scriggo | open2b/scriggo | go | 77% | 671 | 15f / +866
timeout | Add session bundle recording and replay to IPython | ipython/ipython | python | 76% | 147 | 4f / +973
correctness | Add dead-lettering, TTL, and overflow handling to virtual queues | celery/kombu | python | 73% | 304 | 8f / +641
correctness | Add typed blend range access and blend-if compositing | psd-tools/psd-tools | python | 69% | 268 | 7f / +1084
correctness | Add boundary modes to `@stencil` | numba/numba | python | 69% | 404 | 3f / +930
correctness | Add bail-on-test-failure handling to Testem | testem/testem | javascript | 67% | 334 | 18f / +619
correctness | Add trap coredump generation to wasmi | wasmi-labs/wasmi | rust | 64% | 556 | 19f / +1031
correctness | Add pair-level relation tracking modifiers | pmndrs/koota | typescript | 63% | 443 | 14f / +856
correctness | Add error stack serialization to SuperJSON | flightcontrolhq/superjson | typescript | 63% | 247 | 6f / +647
correctness | Add safe import checkpoints and invariant validation | simonw/sqlite-utils | python | 62% | 291 | 6f / +1391
correctness | Preserve structure needed by stylesheet selectors | noahbald/oxvg | rust | 60% | 472 | 5f / +314
correctness | Add typed variable bindings to Anko | mattn/anko | go | 57% | 356 | 9f / +1277
timeout | Add a deferred mutation buffer to batch entity changes | pmndrs/koota | typescript | 55% | 415 | 12f / +1211
correctness | Add worktree merge conflict handling | go-git/go-git | go | 53% | 417 | 6f / +1435
regression | Add scoped state data to state machine callbacks and history | fgmacedo/python-statemachine | python | 50% | 594 | 14f / +1229
correctness | Complete Kitty keyboard phases and stable fallback key metadata | Textualize/textual | python | 44% | 299 | 8f / +826
correctness | Add scoped per-rule ignore markers to Obsidian Linter | platers/obsidian-linter | typescript | 31% | 220 | 13f / +8747
timeout | Add recursive schema composition to Valibot | open-circle/valibot | typescript | 22% | 429 | 15f / +967
correctness | Add GraphQL incremental delivery with @defer and @stream | graphql-python/gql | python | 17% | 279 | 15f / +1610
timeout | Add async autocomplete options and fetch lifecycle handling | bombshell-dev/clack | typescript | 8% | 72 | 0f / +0
correctness | Add stepped slices for arrays and strings | abs-lang/abs | go | 0% | 288 | 6f / +689
correctness | Add action pinning linting for actions and reusable workflows | rhysd/actionlint | go | 0% | 440 | 9f / +1394
timeout | Add default arguments to Anko function parameters | mattn/anko | go | 0% | 365 | 4f / +384
timeout | Add try/catch error recovery to expr | expr-lang/expr | go | 0% | 404 | 13f / +1645
correctness | Add a checker for broken doc comment links | go-critic/go-critic | go | 0% | 463 | 8f / +642
correctness | Expose accumulated streamed function-call args in SDK surfaces | googleapis/go-genai | go | 0% | 366 | 7f / +1899
correctness | Add retry-aware publishing audit logs | goreleaser/goreleaser | go | 0% | 312 | 21f / +1735
correctness | Add configurable array merge strategies to Helm value coalescing | helm/helm | go | 0% | 310 | 23f / +1788
correctness
Add consistent hash policy support to TrafficPolicy
kgateway-dev/kgateway
go0%368191f / +13415
correctness
Add transparent encryption to dump uploads
liweiyi88/onedump
go0%17216f / +1527
correctness
Add rule evaluation profiling to Rego
open-policy-agent/opa
go0%35216f / +1301
correctness
Reconstruct template strings in partial evaluation output
open-policy-agent/opa
go0%3473f / +358
correctness
Add build-time grammar conflict analysis to participle
alecthomas/participle
go0%3025f / +1456
correctness
Fix PromQL label sorting across typed and untyped values
prometheus/prometheus
go0%4575f / +988
correctness
Add bounded-memory spilling to SCC aggregation
boyter/scc
go0%2176f / +1037
correctness
Preserve ANSI resets during truncation and styling
muesli/termenv
go0%11310f / +1183
correctness
Add policy-based alerting for failures, latency, and SSL expiry
Owloops/updo
go0%22210f / +1445
correctness
Add structured nosec directives for regions and next line
PyCQA/bandit
python0%32110f / +1348
correctness
Add implicit HEAD and automatic OPTIONS responses to FastAPI routes
fastapi/fastapi
python0%3494f / +1722
correctness
Add config file parsing to Cliffy commands
c4spar/cliffy
typescript0%3256f / +1067
correctness
Add keyset cursor pagination to `$find`
eicrud/eicrud
typescript0%51911f / +1396
correctness
Add grouping-set and window-frame SQL helpers
kysely-org/kysely
typescript0%39921f / +1749
regression
Add transactional reload status and rollback tracking to Prometheus
prometheus/prometheus
typescript0%47515f / +1083
regression
Reuse one toolbar across multiple Quill editors
slab/quill
typescript0%3146f / +887
correctness
Add `matchEach` to ts-pattern
gvergnaud/ts-pattern
typescript0%1484f / +1061

"New tests" = % of the task's new-behavior suite that passed. Outcomes tagged "✓ over-budget" are extended-cap passes excluded from the strict figure.

Comparability & limits

Provider-direct is the leaderboard-faithful axis. DeepSWE standardizes the harness (mini-swe-agent, identical prompt and tools) and queries each model through its own provider's API. We ran MiniMax-M3 through MiniMax's own API for exactly that reason.
  • Headline is the strict 90-min figure (13.3%). The 16.8% extended figure includes 4 passes the agent only reached past the budget; we disclose but do not claim them.
  • Single pass@1 run (k=1); no across-seed variance estimate, so no CI on M3's own number.
  • We cannot verify our harness config bit-for-bit against the leaderboard team's, so this is an independent measurement, not an official entry.
  • The cost column is synthesized from token counts at list price (the run used a flat-rate subscription); peer efficiency anchors are sparse and partly estimated from the blog.
  • Reward is the program verifier's: 1 only if the new-behavior tests pass and the base suite stays green. There is no partial credit; the closeness chart above is descriptive only.
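
The scoring rules above can be sketched in a few lines of Python. This is an illustration only, not the DeepSWE verifier or the harness's code: the record type and its field names (`new_passed`, `new_total`, `base_suite_green`, `working_min`) are hypothetical, and the per-task records are placeholders shaped to match the headline counts rather than real run data. It shows the binary reward, the descriptive "New tests" percentage, and how the strict and extended pass@1 figures differ only in whether the 90-minute working-time budget is enforced.

```python
from dataclasses import dataclass
from typing import Iterable

BUDGET_MIN = 90.0  # canonical working-time budget stated on this page


@dataclass(frozen=True)
class TaskResult:
    # All field names are hypothetical, chosen for this illustration.
    new_passed: int          # new-behavior tests that passed
    new_total: int           # size of the new-behavior suite
    base_suite_green: bool   # pre-existing suite still passes
    working_min: float       # wall time minus throttle wait, in minutes


def new_tests_pct(r: TaskResult) -> float:
    """Descriptive closeness only; it never contributes to the reward."""
    return 100.0 * r.new_passed / r.new_total if r.new_total else 0.0


def reward(r: TaskResult) -> int:
    """1 only if every new-behavior test passes and the base suite stays green."""
    all_new_pass = r.new_total > 0 and r.new_passed == r.new_total
    return int(all_new_pass and r.base_suite_green)


def pass_at_1(results: Iterable[TaskResult], strict: bool = True) -> float:
    """Share of tasks solved; strict mode drops passes that exceeded the budget."""
    results = list(results)
    solved = sum(
        1 for r in results
        if reward(r) == 1 and (not strict or r.working_min <= BUDGET_MIN)
    )
    return solved / len(results)


# Placeholder records shaped to match the headline counts (not real per-task data):
# 15 in-budget passes, 4 over-budget passes, 94 failures.
runs = (
    [TaskResult(10, 10, True, 60.0)] * 15
    + [TaskResult(10, 10, True, 110.0)] * 4
    + [TaskResult(8, 10, True, 60.0)] * 94
)
print(f"strict:   {pass_at_1(runs, strict=True):.1%}")   # 15/113
print(f"extended: {pass_at_1(runs, strict=False):.1%}")  # 19/113
```

In this sketch a task with 80% of its new tests passing still scores 0, which mirrors the no-partial-credit rule, and the four over-budget passes count only when `strict=False`.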