MiniMax-M3 on DeepSWE
DeepSWE is a behavior-verified benchmark of 113 real open-source feature requests across five languages. Each task ships hidden tests; a solution scores reward = 1 only if it makes the new-behavior tests pass without breaking the repository's pre-existing suite. This page reports MiniMax-M3's single-attempt (pass@1) performance and where it lands among published models.
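The all-or-nothing reward rule can be written in a few lines. This is an illustrative sketch only: `VerifierResult` and `reward` are hypothetical names, not DeepSWE's verifier schema, and it assumes the verifier reports counts for both the hidden new-behavior tests and the pre-existing suite.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerifierResult:
    """Hypothetical summary of one verifier run (not DeepSWE's real schema)."""
    new_tests_passed: int
    new_tests_total: int
    base_tests_failed: int  # pre-existing tests broken by the patch


def reward(result: VerifierResult) -> float:
    """Return 1.0 only if every new test passes and the base suite is intact."""
    all_new_pass = (
        result.new_tests_total > 0
        and result.new_tests_passed == result.new_tests_total
    )
    base_intact = result.base_tests_failed == 0
    return 1.0 if (all_new_pass and base_intact) else 0.0
```

Under this rule a patch that passes 9 of 10 new tests scores the same as one that passes none, and a patch that passes every new test but breaks one existing test also scores 0.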
Headline result
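MiniMax-M3's strict pass@1 is 13.3% (15 of 113 tasks) under the 90-minute working-time budget. Working time excludes time spent waiting on rate limits:
working = wall − throttle wait.
That cleanly credits throttled tasks, but a handful of un-throttled tasks used the wall-clock slack as genuine extra compute. Four passes (arcane-drift-detection-baselines, dynamodb-toolbox-lazy-recursive-schemas, helm-unified-manifest-stream, httpx-deterministic-cookie-store) accumulated more than 90 min of working time, meaning the agent was still editing source past the budget, so the strict figure excludes them. Counting them gives the extended figure of 16.8% (19 of 113).
Where M3 lands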
DeepSWE's published pass rates (± 95% CI) from deepswe.datacurve.ai, with MiniMax-M3 inserted at its strict rank. M3's row is our independent measurement; all others are Datacurve's reported figures.
| Model | 95% CI | pass@1 |
|---|---|---|
| gpt-5.5 [xhigh] | ± 4% | 70% |
| gpt-5.4 [xhigh] | ± 5% | 56% |
| claude-opus-4.7 [max] | ± 5% | 54% |
| claude-sonnet-4.6 [high] | ± 4% | 32% |
| gemini-3.5-flash [medium] | ± 4% | 28% |
| gpt-5.4-mini [xhigh] | ± 4% | 24% |
| kimi-k2.6 | ± 4% | 24% |
| mimo-v2.5-pro | ± 4% | 19% |
| glm-5.1 | ± 4% | 18% |
| MiniMax-M3 [default] (this run, independent) | n/a | 13.3% |
| gemini-3.1-pro | ± 3% | 10% |
| deepseek-v4-pro | ± 2% | 8% |
| gemini-3-flash | ± 2% | 5% |
Methodology
Harness
- agent: mini-swe-agent (unmodified loop)
- orchestrator: Pier (Harbor task format)
- model id: MiniMax-M3
- endpoint: MiniMax Anthropic-compatible API (provider-direct)
- k: 1 attempt per task (pass@1)
- cost limit: unlimited (no per-task token/reasoning cap)
Environment & budget
- sandbox: Modal, air-gapped (no internet)
- concurrency: ~36 tasks across 3 MiniMax keys
- agent budget: 90 min working-time (enforced at scoring)
- agent wall: 135 min (1.5×, throttle head-room)
- verifier budget: 30 min wall-clock
- reward: 1.0 iff new tests pass & base suite intact
429 handling is the one harness deviation, and it is deliberately minimal: rate-limited model calls are retried in place (the trajectory is preserved and no work is lost), and the exact back-off time is logged and subtracted from wall-clock time so the model is judged against a true 90-minute working budget. DeepSWE publishes no canonical reasoning-token budget, so we run mini-swe-agent at its defaults and call the model through its own provider's API, following the leaderboard convention.
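The throttle accounting described above can be sketched as follows. This is a minimal illustration, not the harness code: `RateLimited`, `call_with_backoff`, `working_time_s`, and `is_within_budget` are hypothetical names, and the exponential back-off schedule is an assumption (the text only says that the back-off time is logged and subtracted).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

AGENT_BUDGET_S = 90 * 60   # 90 min working-time budget
AGENT_WALL_S = 135 * 60    # 135 min wall-clock cap (1.5x the budget)


class RateLimited(Exception):
    """Hypothetical stand-in for an HTTP 429 response from the model API."""


def call_with_backoff(
    call: Callable[[], T],
    throttle_log: list[float],
    base_delay_s: float = 1.0,
    max_retries: int = 8,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry a rate-limited call in place, recording every wait.

    Only this one call is repeated, so the agent's trajectory is untouched.
    """
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RateLimited:
            if attempt == max_retries:
                raise
            delay = base_delay_s * (2 ** attempt)  # assumed schedule
            throttle_log.append(delay)             # logged for later subtraction
            sleep(delay)
    raise AssertionError("unreachable")


def working_time_s(wall_s: float, throttle_log: list[float]) -> float:
    """working = wall - throttle wait."""
    return wall_s - sum(throttle_log)


def is_within_budget(wall_s: float, throttle_log: list[float]) -> bool:
    """A trial counts toward the strict figure only if working time <= 90 min."""
    return working_time_s(wall_s, throttle_log) <= AGENT_BUDGET_S
```

For example, a trial with 100 minutes of wall-clock time and 15 minutes of logged throttle wait has 85 minutes of working time and stays within budget; the same wall-clock time with no throttling does not.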
Results by language
| Language | Tasks | Passed | pass@1 |
|---|---|---|---|
| typescript | 35 | 6 | 17.1% |
| go | 34 | 4 | 11.8% |
| python | 34 | 3 | 8.8% |
| rust | 5 | 0 | 0.0% |
| javascript | 5 | 2 | 40.0% |
Strict pass@1 by language. Counts are small per language, so treat these as directional rather than precise.
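As a check on the arithmetic, the per-language rates and the overall strict figure follow directly from the counts in the table. The snippet below only recomputes them; the variable and function names are ours.

```python
# (tasks, passed) per language, copied from the table above.
counts = {
    "typescript": (35, 6),
    "go": (34, 4),
    "python": (34, 3),
    "rust": (5, 0),
    "javascript": (5, 2),
}


def pass_rate(passed: int, tasks: int) -> float:
    """pass@1 as a percentage, rounded to one decimal place."""
    return round(100 * passed / tasks, 1)


per_language = {lang: pass_rate(p, n) for lang, (n, p) in counts.items()}
total_tasks = sum(n for n, _ in counts.values())
total_passed = sum(p for _, p in counts.values())
overall = pass_rate(total_passed, total_tasks)

print(per_language)
print(total_passed, total_tasks, overall)  # 15 113 13.3
```

With only five tasks in a language, a single additional pass moves that language's rate by 20 points, which is why the rust and javascript rows in particular are best read as directional.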
Failure modes
Every scored task classified by outcome. Notably, M3 rarely fails by going silent — it almost always submits a patch (very few empty-patch trials); most losses are precision misses on the new behavior, not give-ups.
How close were the misses?
A raw pass/fail rate hides how close a failed attempt came. Each task's hidden suite has a different size (from 1 test to 100+), so we normalize each attempt to the percentage of new-behavior tests passing.
Across the 94 non-passes (the 113 tasks minus the 15 strict passes and the 4 over-budget passes), the median is 79% of new tests passing. Many "fails" are a handful of edge-case tests short of a full solve; a separate cluster are total misses where the patch didn't compile against the new suite.
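The normalization is simple enough to sketch. The per-task counts below are made-up placeholders rather than results from this run, and `pct_new_tests_passing` is a hypothetical helper name.

```python
from statistics import median


def pct_new_tests_passing(passed: int, total: int) -> float:
    """Share of a task's new-behavior tests that passed, as a percentage."""
    if total <= 0:
        raise ValueError("a task must ship at least one new-behavior test")
    return 100.0 * passed / total


# Placeholder (passed, total) pairs for failed attempts; suite sizes vary widely.
failed_attempts = [(0, 1), (3, 4), (41, 50), (118, 120), (0, 37)]

closeness = [pct_new_tests_passing(p, t) for p, t in failed_attempts]
print(sorted(closeness))
print(median(closeness))
```

Normalizing per task before taking the median keeps a 100-test suite from outweighing a 1-test suite in the summary.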
Efficiency
Each orange/green dot is one of the 113 tasks: x = the resource used, y = % of new tests it passed (green = scored a strict pass). Blue points are the few peer anchors Datacurve published, plotted at their model-level pass rate. M3 is strikingly token-hungry and iterative — a median 80k output tokens and 325 agent steps per task, well above frontier models that score higher with less.
Score vs. output tokens
Score vs. duration
Score vs. synthesized cost (PAYG list price)
Cost synthesized from token counts at MiniMax-M3's standard list price ($0.60/M input, $0.12/M cache-read, $2.40/M output) — the actual run used a flat-rate subscription. Median $7.48/task; M3's heavy context re-reads dominate the bill even at a low per-token rate. Peer anchors are Datacurve's published per-trial costs.
Peer efficiency points are sparse and partly estimated on Datacurve's page; treat them as rough context, not a like-for-like overlay. Duration uses throttle-excluded working time.
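The cost synthesis amounts to a weighted sum of token counts. The sketch below uses the list prices quoted above; the token counts in the example are invented for illustration and are not taken from any trial, and it assumes the three token categories are billed separately with no overlap.

```python
# List prices in USD per million tokens, as quoted in the text.
PRICE_PER_M = {
    "input": 0.60,
    "cache_read": 0.12,
    "output": 2.40,
}


def synthesized_cost_usd(
    input_tokens: int, cache_read_tokens: int, output_tokens: int
) -> float:
    """Pay-as-you-go cost implied by token counts at list price."""
    return (
        input_tokens * PRICE_PER_M["input"]
        + cache_read_tokens * PRICE_PER_M["cache_read"]
        + output_tokens * PRICE_PER_M["output"]
    ) / 1_000_000


# Invented example: a long trajectory that re-reads a large cached context.
example = synthesized_cost_usd(
    input_tokens=2_000_000,
    cache_read_tokens=40_000_000,
    output_tokens=80_000,
)
print(f"${example:.2f}")
```

In this invented case cache reads make up most of the total even though they carry the lowest per-token price, which is the pattern the text attributes to M3's heavy context re-reads.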
Task explorer
| Outcome | Task | Lang | New tests | Steps | Patch |
|---|---|---|---|---|---|
| pass | Harden module loading, cache introspection, and script flags abs-lang/abs | go | 100% | 326 | 6f / +724 |
| pass | Add ShapeIndex encoding and decoding golang/geo | go | 100% | 247 | 10f / +1092 |
| pass | Add multiplexed ordered streams over KCP xtaci/kcp-go | go | 100% | 193 | 4f / +2138 |
| pass | Add go:embed directive support for interpreted packages traefik/yaegi | go | 100% | 580 | 16f / +1360 |
| pass | Partition report files by launcher and expand report templates testem/testem | javascript | 100% | 239 | 12f / +825 |
| pass | Add deterministic map conflict detection to Y.Map writes yjs/yjs | javascript | 100% | 210 | 7f / +575 |
| pass | Add task snapshots, inspection, and diffing to aiomonitor aio-libs/aiomonitor | python | 100% | 153 | 14f / +1630 |
| pass | Persist the fitted feature schema across evaluate, predict, serve, and export nidhaloff/igel | python | 100% | 257 | 5f / +1334 |
| pass | Add bidirectional TOML table converters python-poetry/tomlkit | python | 100% | 398 | 4f / +893 |
| pass | Add dependency-aware async initialization to the container jeffijoe/awilix | typescript | 100% | 276 | 5f / +1020 |
| pass | Add typed window function builders with OVER clauses drizzle-team/drizzle-orm | typescript | 100% | 247 | 27f / +1904 |
| pass | Abort pending body reads on shutdown capricorn86/happy-dom | typescript | 100% | 371 | 6f / +405 |
| pass | Preserve restored query state in persisted snapshots TanStack/query | typescript | 100% | 326 | 7f / +1151 |
| pass | Format BigQuery pipe syntax queries correctly sql-formatter-org/sql-formatter | typescript | 100% | 270 | 11f / +432 |
| pass | Add duration-aware sharding to Vitest vitest-dev/vitest | typescript | 100% | 394 | 24f / +1872 |
| pass ✓ over-budget | Add drift detection and compliance baselines getarcaneapp/arcane | go | 100% | 388 | 16f / +2067 |
| pass ✓ over-budget | Add unified manifest stream output across Helm commands helm/helm | go | 100% | 599 | 37f / +1413 |
| correctness | Fix isolated Go-side calls for Tengo callables and closures d5/tengo | go | 100% | 370 | 4f / +531 |
| timeout | Add tube multiplexing to pwntools Gallopsled/pwntools | python | 100% | 331 | 6f / +2239 |
| pass ✓ over-budget | Add lazy recursive schemas with DTO and JSON Schema export dynamodb-toolbox/dynamodb-toolbox | typescript | 100% | 473 | 51f / +1141 |
| pass ✓ over-budget | Add a deterministic CookieStore with modern Set-Cookie parsing encode/httpx | typescript | 100% | 317 | 8f / +821 |
| correctness | Add CSS Grid layout to the Box component vadimdemedes/ink | typescript | 100% | 281 | 5f / +1000 |
| correctness | Add conditional option dependencies to Optique dahlia/optique | typescript | 100% | 399 | 4f / +1699 |
| regression | Add duration encoding to TableVectorizer skrub-data/skrub | python | 99% | 306 | 8f / +993 |
| correctness | Add single-active-consumer priority and cancel tracking to virtual transports celery/kombu | python | 99% | 292 | 3f / +812 |
| correctness | Add deprecation, sunset, and successor headers to FastAPI routes fastapi/fastapi | python | 98% | 290 | 21f / +1349 |
| correctness | Add an error-accumulating Validated container dry-python/returns | python | 98% | 197 | 12f / +1302 |
| correctness | Add `\\multicolumn` column spans to array-like environments KaTeX/KaTeX | javascript | 98% | 331 | 7f / +678 |
| correctness | Add deterministic multi-key sorting to fd sharkdp/fd | rust | 98% | 323 | 6f / +1004 |
| correctness | Add explicit resource management declarations to the parser meriyah/meriyah | typescript | 98% | 347 | 11f / +1324 |
| correctness | Add multipart response parsing to HTTPX encode/httpx | python | 98% | 231 | 6f / +789 |
| correctness | Add conditional required attributes to schemas dynamodb-toolbox/dynamodb-toolbox | typescript | 97% | 406 | 43f / +984 |
| correctness | Add incremental cache controls to Bandit PyCQA/bandit | python | 97% | 160 | 7f / +1447 |
| correctness | Add interprocedural taint checks for Bandit injection sinks PyCQA/bandit | python | 96% | 364 | 22f / +1562 |
| correctness | Add XML diff, patch, and merge operations to etree beevik/etree | go | 96% | 267 | 4f / +2093 |
| correctness | Coalesce qualifying choices into character classes pest-parser/pest | rust | 96% | 331 | 7f / +845 |
| correctness | Add multi-module memory snapshots to wazero wazero/wazero | go | 96% | 129 | 11f / +1471 |
| correctness | Restore RichLog follow-state parity and expand reflow behavior Textualize/textual | python | 96% | 294 | 3f / +436 |
| correctness | Add partial structuring with error recovery to cattrs python-attrs/cattrs | python | 96% | 394 | 6f / +879 |
| correctness | Add value-based query predicates to Koota pmndrs/koota | typescript | 95% | 530 | 24f / +1633 |
| correctness | Add HTML document format handling to Dasel TomWright/dasel | go | 94% | 250 | 9f / +2048 |
| timeout | Add a per-origin circuit breaker to ofetch unjs/ofetch | typescript | 94% | 123 | 3f / +469 |
| timeout | Add hierarchical evaluation cancellation to Boa boa-dev/boa | rust | 94% | 552 | 10f / +1324 |
| correctness | Add input key aliases to name mapping reagento/adaptix | python | 94% | 394 | 10f / +705 |
| correctness | Format CREATE TABLE DDL and add DDL parsing helpers tconbeer/sqlfmt | python | 94% | 409 | 9f / +1256 |
| correctness | Implement a deterministic IntersectionObserver in Happy DOM capricorn86/happy-dom | typescript | 93% | 203 | 3f / +837 |
| correctness | Add link format conversion between wiki and markdown syntax platers/obsidian-linter | typescript | 93% | 277 | 4f / +7678 |
| correctness | Add durability callbacks and wait APIs for sync writes cockroachdb/pebble | go | 93% | 442 | 11f / +1134 |
| correctness | Add destructuring bindings to Tengo d5/tengo | go | 93% | 512 | 12f / +1010 |
| correctness | Add JSON Schema refs and dependency keywords arktypeio/arktype | typescript | 93% | 529 | 19f / +1317 |
| correctness | Validate daemon watch, status, and log lifecycle jkwill87/mnamer | python | 92% | 187 | 7f / +1797 |
| correctness | Add entity snapshot and rollback APIs to Koota pmndrs/koota | python | 91% | 279 | 16f / +1421 |
| regression | Add flattened dataclass fields to Mashumaro field options Fatal1ty/mashumaro | python | 90% | 324 | 5f / +1011 |
| timeout | Add task graph export with JSON, DOT, and text output go-task/task | go | 90% | 500 | 9f / +1059 |
| correctness | Add JSONPath query APIs to orderedmap and Starlark modules carvel-dev/ytt | go | 89% | 265 | 8f / +1989 |
| regression | Add rolling min, max, median, and quantile methods narwhals-dev/narwhals | python | 88% | 430 | 15f / +1828 |
| correctness | Add streaming JSON iteration to HTTPX responses encode/httpx | python | 88% | 277 | 7f / +983 |
| regression | Add request coalescing to `Runnable` langchain-ai/langchain | python | 88% | 179 | 5f / +1571 |
| correctness | Add shorthand expansion and compression to the lexer csstree/csstree | javascript | 87% | 324 | 6f / +1742 |
| correctness | Add RFC 5545 timezone interoperability to dateutil recurrence parsing dateutil/dateutil | python | 87% | 400 | 4f / +638 |
| correctness | Add composite trait aspects to Koota pmndrs/koota | typescript | 86% | 526 | 24f / +1790 |
| correctness | Add iterable collection combinators to true-myth true-myth/true-myth | typescript | 86% | 229 | 8f / +2085 |
| correctness | Implement recursive agent delegation through delegate_task tool calls baryhuang/claude-code-by-agents | typescript | 86% | 170 | 7f / +2754 |
| regression | Add grouped test phases with synchronized barriers google/mobly | python | 86% | 378 | 2f / +863 |
| correctness | Add automatic table of contents generation for Obsidian linter platers/obsidian-linter | typescript | 83% | 303 | 4f / +7696 |
| correctness | Add a persistent analysis cache to Vulture jendrikseipp/vulture | python | 79% | 290 | 6f / +854 |
| correctness | Add SSE streaming endpoints to HttpApi Effect-TS/effect | typescript | 79% | 504 | 10f / +919 |
| timeout | Add atomic signal selectors to Kea keajs/kea | typescript | 79% | 336 | 33f / +1743 |
| timeout | Add method declarations and interface dispatch to Scriggo open2b/scriggo | go | 77% | 671 | 15f / +866 |
| timeout | Add session bundle recording and replay to IPython ipython/ipython | python | 76% | 147 | 4f / +973 |
| correctness | Add dead-lettering, TTL, and overflow handling to virtual queues celery/kombu | python | 73% | 304 | 8f / +641 |
| correctness | Add typed blend range access and blend-if compositing psd-tools/psd-tools | python | 69% | 268 | 7f / +1084 |
| correctness | Add boundary modes to `@stencil` numba/numba | python | 69% | 404 | 3f / +930 |
| correctness | Add bail-on-test-failure handling to Testem testem/testem | javascript | 67% | 334 | 18f / +619 |
| correctness | Add trap coredump generation to wasmi wasmi-labs/wasmi | rust | 64% | 556 | 19f / +1031 |
| correctness | Add pair-level relation tracking modifiers pmndrs/koota | typescript | 63% | 443 | 14f / +856 |
| correctness | Add error stack serialization to SuperJSON flightcontrolhq/superjson | typescript | 63% | 247 | 6f / +647 |
| correctness | Add safe import checkpoints and invariant validation simonw/sqlite-utils | python | 62% | 291 | 6f / +1391 |
| correctness | Preserve structure needed by stylesheet selectors noahbald/oxvg | rust | 60% | 472 | 5f / +314 |
| correctness | Add typed variable bindings to Anko mattn/anko | go | 57% | 356 | 9f / +1277 |
| timeout | Add a deferred mutation buffer to batch entity changes pmndrs/koota | typescript | 55% | 415 | 12f / +1211 |
| correctness | Add worktree merge conflict handling go-git/go-git | go | 53% | 417 | 6f / +1435 |
| regression | Add scoped state data to state machine callbacks and history fgmacedo/python-statemachine | python | 50% | 594 | 14f / +1229 |
| correctness | Complete Kitty keyboard phases and stable fallback key metadata Textualize/textual | python | 44% | 299 | 8f / +826 |
| correctness | Add scoped per-rule ignore markers to Obsidian Linter platers/obsidian-linter | typescript | 31% | 220 | 13f / +8747 |
| timeout | Add recursive schema composition to Valibot open-circle/valibot | typescript | 22% | 429 | 15f / +967 |
| correctness | Add GraphQL incremental delivery with @defer and @stream graphql-python/gql | python | 17% | 279 | 15f / +1610 |
| timeout | Add async autocomplete options and fetch lifecycle handling bombshell-dev/clack | typescript | 8% | 72 | 0f / +0 |
| correctness | Add stepped slices for arrays and strings abs-lang/abs | go | 0% | 288 | 6f / +689 |
| correctness | Add action pinning linting for actions and reusable workflows rhysd/actionlint | go | 0% | 440 | 9f / +1394 |
| timeout | Add default arguments to Anko function parameters mattn/anko | go | 0% | 365 | 4f / +384 |
| timeout | Add try/catch error recovery to expr expr-lang/expr | go | 0% | 404 | 13f / +1645 |
| correctness | Add a checker for broken doc comment links go-critic/go-critic | go | 0% | 463 | 8f / +642 |
| correctness | Expose accumulated streamed function-call args in SDK surfaces googleapis/go-genai | go | 0% | 366 | 7f / +1899 |
| correctness | Add retry-aware publishing audit logs goreleaser/goreleaser | go | 0% | 312 | 21f / +1735 |
| correctness | Add configurable array merge strategies to Helm value coalescing helm/helm | go | 0% | 310 | 23f / +1788 |
| correctness | Add consistent hash policy support to TrafficPolicy kgateway-dev/kgateway | go | 0% | 368 | 191f / +13415 |
| correctness | Add transparent encryption to dump uploads liweiyi88/onedump | go | 0% | 172 | 16f / +1527 |
| correctness | Add rule evaluation profiling to Rego open-policy-agent/opa | go | 0% | 352 | 16f / +1301 |
| correctness | Reconstruct template strings in partial evaluation output open-policy-agent/opa | go | 0% | 347 | 3f / +358 |
| correctness | Add build-time grammar conflict analysis to participle alecthomas/participle | go | 0% | 302 | 5f / +1456 |
| correctness | Fix PromQL label sorting across typed and untyped values prometheus/prometheus | go | 0% | 457 | 5f / +988 |
| correctness | Add bounded-memory spilling to SCC aggregation boyter/scc | go | 0% | 217 | 6f / +1037 |
| correctness | Preserve ANSI resets during truncation and styling muesli/termenv | go | 0% | 113 | 10f / +1183 |
| correctness | Add policy-based alerting for failures, latency, and SSL expiry Owloops/updo | go | 0% | 222 | 10f / +1445 |
| correctness | Add structured nosec directives for regions and next line PyCQA/bandit | python | 0% | 321 | 10f / +1348 |
| correctness | Add implicit HEAD and automatic OPTIONS responses to FastAPI routes fastapi/fastapi | python | 0% | 349 | 4f / +1722 |
| correctness | Add config file parsing to Cliffy commands c4spar/cliffy | typescript | 0% | 325 | 6f / +1067 |
| correctness | Add keyset cursor pagination to `$find` eicrud/eicrud | typescript | 0% | 519 | 11f / +1396 |
| correctness | Add grouping-set and window-frame SQL helpers kysely-org/kysely | typescript | 0% | 399 | 21f / +1749 |
| regression | Add transactional reload status and rollback tracking to Prometheus prometheus/prometheus | typescript | 0% | 475 | 15f / +1083 |
| regression | Reuse one toolbar across multiple Quill editors slab/quill | typescript | 0% | 314 | 6f / +887 |
| correctness | Add `matchEach` to ts-pattern gvergnaud/ts-pattern | typescript | 0% | 148 | 4f / +1061 |
"New tests" = % of the task's new-behavior suite that passed. Outcomes tagged "✓ over-budget" are extended-cap passes excluded from the strict figure.
Comparability & limits
- Headline is the strict 90-min figure (13.3%). The 16.8% extended figure includes 4 passes the agent only reached past the budget; we disclose but do not claim them.
- Single pass@1 run (k=1); no across-seed variance estimate, so no CI on M3's own number.
- We cannot bit-verify our harness config against the leaderboard team's, so this is an independent measurement, not an official entry.
- The cost column is synthesized from token counts at list price (the run used a flat-rate subscription); peer efficiency anchors are sparse and partly estimated from Datacurve's page.
- Reward is the program verifier's: new-behavior tests pass and the base suite stays green. No partial credit — the closeness chart above is descriptive only.