MiniMax-M3 on DeepSWE

Pass@1 of the MiniMax-M3 coding model on the 113-task DeepSWE long-horizon software-engineering benchmark, run provider-direct through the unmodified mini-swe-agent harness.

MiniMax-M3 · 113 tasks · pass@1 / k=1 · mini-swe-agent harness · MiniMax-direct API · air-gapped sandboxes · run 2026-06-02

pass@1 strict (15/113): 13.3%
extended cap (19/113): 16.8%
median agent steps: 325
median output tokens: 80k
median cost / task: $7.48


DeepSWE is a behavior-verified benchmark of 113 real open-source feature requests across five languages. Each task ships hidden tests; a solution scores reward = 1 only if it makes the new-behavior tests pass without breaking the repository's pre-existing suite. This page reports MiniMax-M3's single-attempt (pass@1) performance and where it lands among published models.

New: for a deep, descriptive tour of what the 113 tasks actually are — every challenge summarized, plus breakdowns by language, behavior, project, and hidden-test-suite size — see the companion DeepSWE Scenario Catalog.

Headline result

13.3% pass@1 (strict): 15 of 113 tasks solved within the canonical 90-minute working-time budget. This is the directly comparable figure.

16.8% pass@1 (extended): 19 of 113, including 4 passes that the agent only reached after running past 90 minutes of working time (see below). Reported for transparency, not claimed as clean solves.

Why two numbers. To absorb provider rate-limit (429) back-off without losing work, the agent wall was set to 1.5× the 90-minute budget and the real budget was re-imposed at scoring as working = wall − throttle wait. That cleanly credits throttled tasks, but a handful of un-throttled tasks used the slack as genuine extra compute. Four passes (arcane-drift-detection-baselines, dynamodb-toolbox-lazy-recursive-schemas, helm-unified-manifest-stream, httpx-deterministic-cookie-store) had >90 min of working time: the agent was still editing source past the budget, so the strict figure excludes them.
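The scoring rule above can be sketched in a few lines. This is a minimal illustration, not the harness's actual code: the `Trial` record and its field names are hypothetical, and only the 90-minute budget, the 1.5× wall, and the formula working = wall − throttle wait come from the text.

```python
from dataclasses import dataclass

BUDGET_MIN = 90.0             # canonical working-time budget (from the text)
WALL_MIN = 1.5 * BUDGET_MIN   # 135-minute agent wall (from the text)


@dataclass
class Trial:
    # Hypothetical record; field names are illustrative only.
    wall_min: float       # wall-clock minutes the agent ran
    throttle_min: float   # minutes spent waiting on 429 back-off
    reward: int           # 1 if the verifier accepted the patch, else 0


def working_minutes(t: Trial) -> float:
    """Working time = wall-clock time minus logged throttle wait."""
    return t.wall_min - t.throttle_min


def classify(t: Trial) -> str:
    """Label a trial as a strict pass, an extended-only pass, or a fail."""
    if t.reward != 1:
        return "fail"
    return "strict" if working_minutes(t) <= BUDGET_MIN else "extended-only"


if __name__ == "__main__":
    throttled = Trial(wall_min=120.0, throttle_min=40.0, reward=1)
    over_budget = Trial(wall_min=110.0, throttle_min=0.0, reward=1)
    print(classify(throttled), classify(over_budget))
```

A throttled trial that ran 120 minutes of wall-clock time but waited 40 minutes on back-off still counts as strict (80 working minutes), while an un-throttled trial that ran 110 minutes counts only toward the extended figure.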

Read this as: “MiniMax-M3 on DeepSWE via the mini-swe-agent harness, MiniMax-direct API.” An independent measurement, not an official DeepSWE leaderboard entry — see Comparability & limits.

Where M3 lands

DeepSWE's published pass rates (± 95% CI) from deepswe.datacurve.ai, with MiniMax-M3 inserted at its strict rank. M3's row is our independent measurement; all others are Datacurve's reported figures.

Model	95% CI	pass@1
gpt-5.5 [xhigh]	± 4%	70%
gpt-5.4 [xhigh]	± 5%	56%
claude-opus-4.7 [max]	± 5%	54%
claude-sonnet-4.6 [high]	± 4%	32%
gemini-3.5-flash [medium]	± 4%	28%
gpt-5.4-mini [xhigh]	± 4%	24%
kimi-k2.6	± 4%	24%
mimo-v2.5-pro	± 4%	19%
glm-5.1	± 4%	18%
MiniMax-M3 [default] (this run, independent)	n/a	13.3%
gemini-3.1-pro	± 3%	10%
deepseek-v4-pro	± 2%	8%
gemini-3-flash	± 2%	5%

Methodology

Harness

agent: mini-swe-agent (unmodified loop)
orchestrator: Pier (Harbor task format)
model id: MiniMax-M3
endpoint: MiniMax Anthropic-compatible API (provider-direct)
k: 1 attempt per task (pass@1)
cost limit: unlimited — no per-task token/reasoning cap

Environment & budget

sandbox: Modal, air-gapped (no internet)
concurrency: ~36 tasks across 3 MiniMax keys
agent budget: 90 min working-time (enforced at scoring)
agent wall: 135 min (1.5× — throttle head-room)
verifier budget: 30 min wall-clock
reward: 1.0 iff new tests pass & base suite intact
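
The reward rule in the last line is all-or-nothing. A minimal sketch, assuming the verifier exposes counts for the new-behavior tests and a single flag for the base suite (these parameter names are hypothetical, not the verifier's real interface):

```python
def reward(new_passed: int, new_total: int, base_suite_green: bool) -> float:
    """Return 1.0 only if every new-behavior test passes and the base suite is intact.

    Assumption: an empty new-behavior suite never earns reward.
    """
    all_new_pass = new_total > 0 and new_passed == new_total
    return 1.0 if (all_new_pass and base_suite_green) else 0.0


if __name__ == "__main__":
    print(reward(12, 12, True))    # full solve
    print(reward(11, 12, True))    # one new test short: no partial credit
    print(reward(12, 12, False))   # regression in the base suite
```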

429 handling is the one harness deviation, and it is kept deliberately minimal: rate-limited model calls are retried in place (the trajectory is preserved, no work lost), and the exact back-off time is logged and subtracted from wall-clock time so the model is judged against a true 90-minute working budget. DeepSWE publishes no canonical reasoning-token budget, so we run mini-swe-agent at its defaults, in line with the leaderboard convention of calling each model at its own provider.
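
One way to implement retry-in-place with throttle accounting is sketched below. Everything here is an assumption for illustration: `RateLimited` stands in for a provider 429 response, `ThrottleClock` is a hypothetical helper, and the exponential back-off schedule is not taken from the run. The sketch logs the requested delay; a production version would more likely measure elapsed time around the sleep.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for a provider HTTP 429 response."""


class ThrottleClock:
    """Retries rate-limited calls in place and accumulates the back-off time."""

    def __init__(self, sleep: Callable[[float], None] = time.sleep) -> None:
        self.throttle_s = 0.0   # total seconds spent backing off
        self._sleep = sleep     # injectable so tests need not actually sleep

    def call(self, fn: Callable[[], T], base_delay_s: float = 1.0,
             max_delay_s: float = 60.0) -> T:
        delay = base_delay_s
        # Retries until the call succeeds; the agent wall bounds this in practice.
        while True:
            try:
                return fn()
            except RateLimited:
                self._sleep(delay)
                self.throttle_s += delay              # log the back-off
                delay = min(delay * 2, max_delay_s)   # assumed schedule


def working_seconds(wall_s: float, clock: ThrottleClock) -> float:
    """Working time = wall-clock time minus logged throttle wait."""
    return wall_s - clock.throttle_s
```

Because the retry happens inside the same call site, the agent's trajectory is untouched; only the clock used for scoring changes.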

Results by language

Language	Tasks	Passed	pass@1
typescript	35	6	17.1%
go	34	4	11.8%
python	34	3	8.8%
rust	5	0	0.0%
javascript	5	2	40.0%

Chart: strict pass@1 by language. Counts are small per language, so treat these as directional rather than precise.
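
To make "directional rather than precise" concrete, the short sketch below recomputes the per-language rates from the table's counts and shows how many percentage points a single task is worth in each language. The counts are from the table; the helper names are ours.

```python
# (tasks, strict passes) per language, copied from the table above.
COUNTS = {
    "typescript": (35, 6),
    "go": (34, 4),
    "python": (34, 3),
    "rust": (5, 0),
    "javascript": (5, 2),
}


def pass_rate(passed: int, tasks: int) -> float:
    """pass@1 as a percentage, rounded to one decimal place."""
    return round(100 * passed / tasks, 1)


def swing_per_task(tasks: int) -> float:
    """Percentage points that one extra pass (or fail) moves the rate."""
    return round(100 / tasks, 1)


if __name__ == "__main__":
    for lang, (tasks, passed) in COUNTS.items():
        print(f"{lang:<11}{pass_rate(passed, tasks):>5}%  "
              f"one task = {swing_per_task(tasks)} points")
    total_tasks = sum(t for t, _ in COUNTS.values())
    total_passed = sum(p for _, p in COUNTS.values())
    print(f"overall    {pass_rate(total_passed, total_tasks):>5}%")
```

With only five tasks each, a single Rust or JavaScript result moves that language's rate by 20 points, versus about 3 points for the larger groups.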

Failure modes

Every scored task classified by outcome. Notably, M3 rarely fails by going silent — it almost always submits a patch (very few empty-patch trials); most losses are precision misses on the new behavior, not give-ups.

pass: reward = 1; new behavior verified, base suite intact
correctness: submitted a patch that failed the new-behavior tests
regression: patch broke the pre-existing (base) test suite
timeout: ran the full agent budget without converging, and failed

How close were the misses?

A raw pass/fail rate hides how close a failed attempt came. Each task's hidden suite has a different size (from 1 test to 100+), so we normalize: % of new-behavior tests passing. Of the 94 non-passes:

≥90% passing: near-miss (35)
1–89%: partial (34)
0%: total miss / build failure (25)

Median across all non-passes: 79% of new tests passing. Many "fails" are a handful of edge-case tests short of a full solve; a separate cluster are total misses where the patch didn't compile against the new suite.
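
The normalization and bucketing can be sketched as follows. The thresholds match the buckets above; the sample data is made up for illustration, and bucketing on the rounded percentage is our assumption about how edge cases are handled.

```python
from statistics import median


def closeness(new_passed: int, new_total: int) -> int:
    """Percentage of the task's new-behavior tests that passed, rounded."""
    return round(100 * new_passed / new_total)


def bucket(pct: int) -> str:
    """Map a closeness percentage to the three buckets used in the text."""
    if pct >= 90:
        return "near-miss"
    if pct >= 1:
        return "partial"
    return "total-miss"


if __name__ == "__main__":
    # Hypothetical non-passing trials: (new tests passed, new tests total).
    trials = [(9, 10), (1, 2), (0, 7), (19, 20)]
    scores = [closeness(p, n) for p, n in trials]
    print([bucket(s) for s in scores])
    print("median closeness:", median(scores))
```

Normalizing per task keeps a 100-test suite from outweighing a 1-test suite when summarizing how near the failed attempts came.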

Efficiency

Each orange/green dot is one of the 113 tasks: x = the resource used, y = % of new tests it passed (green = scored a strict pass). Blue points are the few peer anchors Datacurve published, plotted at their model-level pass rate. M3 is strikingly token-hungry and iterative — a median 80k output tokens and 325 agent steps per task, well above frontier models that score higher with less.

Chart panels: score vs. output tokens; score vs. duration; score vs. synthesized cost (PAYG list price).

Cost synthesized from token counts at MiniMax-M3's standard list price ($0.60/M input, $0.12/M cache-read, $2.40/M output) — the actual run used a flat-rate subscription. Median $7.48/task; M3's heavy context re-reads dominate the bill even at a low per-token rate. Peer anchors are Datacurve's published per-trial costs.
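
The cost synthesis is a weighted sum of token counts. The sketch below uses the list prices quoted above; the token counts in the example are hypothetical (chosen only to show how cache reads can dominate), and it assumes the three token categories are reported separately and do not overlap.

```python
# List prices in USD per million tokens, as quoted in the text.
PRICE_PER_M = {"input": 0.60, "cache_read": 0.12, "output": 2.40}


def synthesized_cost(input_tokens: int, cache_read_tokens: int,
                     output_tokens: int) -> float:
    """USD cost of one trial at list price, from its token counts."""
    return (
        input_tokens * PRICE_PER_M["input"]
        + cache_read_tokens * PRICE_PER_M["cache_read"]
        + output_tokens * PRICE_PER_M["output"]
    ) / 1_000_000


if __name__ == "__main__":
    # Hypothetical trial: the 80k output tokens match the reported median,
    # but the input and cache-read counts are invented for illustration.
    cost = synthesized_cost(input_tokens=2_000_000,
                            cache_read_tokens=50_000_000,
                            output_tokens=80_000)
    print(f"${cost:.2f}")
```

In this made-up example the 80k output tokens contribute about $0.19, while the cache reads contribute $6.00, which is the pattern the paragraph above describes: repeated context re-reads drive the bill even at a low per-token rate.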

Peer efficiency points are sparse and partly estimated on Datacurve's page; treat them as rough context, not a like-for-like overlay. Duration uses throttle-excluded working time.

Task explorer


Outcome	Task	Lang	New tests	Steps	Patch
pass	Harden module loading, cache introspection, and script flags abs-lang/abs	go	100%	326	6f / +724
pass	Add ShapeIndex encoding and decoding golang/geo	go	100%	247	10f / +1092
pass	Add multiplexed ordered streams over KCP xtaci/kcp-go	go	100%	193	4f / +2138
pass	Add go:embed directive support for interpreted packages traefik/yaegi	go	100%	580	16f / +1360
pass	Partition report files by launcher and expand report templates testem/testem	javascript	100%	239	12f / +825
pass	Add deterministic map conflict detection to Y.Map writes yjs/yjs	javascript	100%	210	7f / +575
pass	Add task snapshots, inspection, and diffing to aiomonitor aio-libs/aiomonitor	python	100%	153	14f / +1630
pass	Persist the fitted feature schema across evaluate, predict, serve, and export nidhaloff/igel	python	100%	257	5f / +1334
pass	Add bidirectional TOML table converters python-poetry/tomlkit	python	100%	398	4f / +893
pass	Add dependency-aware async initialization to the container jeffijoe/awilix	typescript	100%	276	5f / +1020
pass	Add typed window function builders with OVER clauses drizzle-team/drizzle-orm	typescript	100%	247	27f / +1904
pass	Abort pending body reads on shutdown capricorn86/happy-dom	typescript	100%	371	6f / +405
pass	Preserve restored query state in persisted snapshots TanStack/query	typescript	100%	326	7f / +1151
pass	Format BigQuery pipe syntax queries correctly sql-formatter-org/sql-formatter	typescript	100%	270	11f / +432
pass	Add duration-aware sharding to Vitest vitest-dev/vitest	typescript	100%	394	24f / +1872
pass ✓ over-budget	Add drift detection and compliance baselines getarcaneapp/arcane	go	100%	388	16f / +2067
pass ✓ over-budget	Add unified manifest stream output across Helm commands helm/helm	go	100%	599	37f / +1413
correctness	Fix isolated Go-side calls for Tengo callables and closures d5/tengo	go	100%	370	4f / +531
timeout	Add tube multiplexing to pwntools Gallopsled/pwntools	python	100%	331	6f / +2239
pass ✓ over-budget	Add lazy recursive schemas with DTO and JSON Schema export dynamodb-toolbox/dynamodb-toolbox	typescript	100%	473	51f / +1141
pass ✓ over-budget	Add a deterministic CookieStore with modern Set-Cookie parsing encode/httpx	typescript	100%	317	8f / +821
correctness	Add CSS Grid layout to the Box component vadimdemedes/ink	typescript	100%	281	5f / +1000
correctness	Add conditional option dependencies to Optique dahlia/optique	typescript	100%	399	4f / +1699
regression	Add duration encoding to TableVectorizer skrub-data/skrub	python	99%	306	8f / +993
correctness	Add single-active-consumer priority and cancel tracking to virtual transports celery/kombu	python	99%	292	3f / +812
correctness	Add deprecation, sunset, and successor headers to FastAPI routes fastapi/fastapi	python	98%	290	21f / +1349
correctness	Add an error-accumulating Validated container dry-python/returns	python	98%	197	12f / +1302
correctness	Add `\multicolumn` column spans to array-like environments KaTeX/KaTeX	javascript	98%	331	7f / +678
correctness	Add deterministic multi-key sorting to fd sharkdp/fd	rust	98%	323	6f / +1004
correctness	Add explicit resource management declarations to the parser meriyah/meriyah	typescript	98%	347	11f / +1324
correctness	Add multipart response parsing to HTTPX encode/httpx	python	98%	231	6f / +789
correctness	Add conditional required attributes to schemas dynamodb-toolbox/dynamodb-toolbox	typescript	97%	406	43f / +984
correctness	Add incremental cache controls to Bandit PyCQA/bandit	python	97%	160	7f / +1447
correctness	Add interprocedural taint checks for Bandit injection sinks PyCQA/bandit	python	96%	364	22f / +1562
correctness	Add XML diff, patch, and merge operations to etree beevik/etree	go	96%	267	4f / +2093
correctness	Coalesce qualifying choices into character classes pest-parser/pest	rust	96%	331	7f / +845
correctness	Add multi-module memory snapshots to wazero wazero/wazero	go	96%	129	11f / +1471
correctness	Restore RichLog follow-state parity and expand reflow behavior Textualize/textual	python	96%	294	3f / +436
correctness	Add partial structuring with error recovery to cattrs python-attrs/cattrs	python	96%	394	6f / +879
correctness	Add value-based query predicates to Koota pmndrs/koota	typescript	95%	530	24f / +1633
correctness	Add HTML document format handling to Dasel TomWright/dasel	go	94%	250	9f / +2048
timeout	Add a per-origin circuit breaker to ofetch unjs/ofetch	typescript	94%	123	3f / +469
timeout	Add hierarchical evaluation cancellation to Boa boa-dev/boa	rust	94%	552	10f / +1324
correctness	Add input key aliases to name mapping reagento/adaptix	python	94%	394	10f / +705
correctness	Format CREATE TABLE DDL and add DDL parsing helpers tconbeer/sqlfmt	python	94%	409	9f / +1256
correctness	Implement a deterministic IntersectionObserver in Happy DOM capricorn86/happy-dom	typescript	93%	203	3f / +837
correctness	Add link format conversion between wiki and markdown syntax platers/obsidian-linter	typescript	93%	277	4f / +7678
correctness	Add durability callbacks and wait APIs for sync writes cockroachdb/pebble	go	93%	442	11f / +1134
correctness	Add destructuring bindings to Tengo d5/tengo	go	93%	512	12f / +1010
correctness	Add JSON Schema refs and dependency keywords arktypeio/arktype	typescript	93%	529	19f / +1317
correctness	Validate daemon watch, status, and log lifecycle jkwill87/mnamer	python	92%	187	7f / +1797
correctness	Add entity snapshot and rollback APIs to Koota pmndrs/koota	python	91%	279	16f / +1421
regression	Add flattened dataclass fields to Mashumaro field options Fatal1ty/mashumaro	python	90%	324	5f / +1011
timeout	Add task graph export with JSON, DOT, and text output go-task/task	go	90%	500	9f / +1059
correctness	Add JSONPath query APIs to orderedmap and Starlark modules carvel-dev/ytt	go	89%	265	8f / +1989
regression	Add rolling min, max, median, and quantile methods narwhals-dev/narwhals	python	88%	430	15f / +1828
correctness	Add streaming JSON iteration to HTTPX responses encode/httpx	python	88%	277	7f / +983
regression	Add request coalescing to `Runnable` langchain-ai/langchain	python	88%	179	5f / +1571
correctness	Add shorthand expansion and compression to the lexer csstree/csstree	javascript	87%	324	6f / +1742
correctness	Add RFC 5545 timezone interoperability to dateutil recurrence parsing dateutil/dateutil	python	87%	400	4f / +638
correctness	Add composite trait aspects to Koota pmndrs/koota	typescript	86%	526	24f / +1790
correctness	Add iterable collection combinators to true-myth true-myth/true-myth	typescript	86%	229	8f / +2085
correctness	Implement recursive agent delegation through delegate_task tool calls baryhuang/claude-code-by-agents	typescript	86%	170	7f / +2754
regression	Add grouped test phases with synchronized barriers google/mobly	python	86%	378	2f / +863
correctness	Add automatic table of contents generation for Obsidian linter platers/obsidian-linter	typescript	83%	303	4f / +7696
correctness	Add a persistent analysis cache to Vulture jendrikseipp/vulture	python	79%	290	6f / +854
correctness	Add SSE streaming endpoints to HttpApi Effect-TS/effect	typescript	79%	504	10f / +919
timeout	Add atomic signal selectors to Kea keajs/kea	typescript	79%	336	33f / +1743
timeout	Add method declarations and interface dispatch to Scriggo open2b/scriggo	go	77%	671	15f / +866
timeout	Add session bundle recording and replay to IPython ipython/ipython	python	76%	147	4f / +973
correctness	Add dead-lettering, TTL, and overflow handling to virtual queues celery/kombu	python	73%	304	8f / +641
correctness	Add typed blend range access and blend-if compositing psd-tools/psd-tools	python	69%	268	7f / +1084
correctness	Add boundary modes to `@stencil` numba/numba	python	69%	404	3f / +930
correctness	Add bail-on-test-failure handling to Testem testem/testem	javascript	67%	334	18f / +619
correctness	Add trap coredump generation to wasmi wasmi-labs/wasmi	rust	64%	556	19f / +1031
correctness	Add pair-level relation tracking modifiers pmndrs/koota	typescript	63%	443	14f / +856
correctness	Add error stack serialization to SuperJSON flightcontrolhq/superjson	typescript	63%	247	6f / +647
correctness	Add safe import checkpoints and invariant validation simonw/sqlite-utils	python	62%	291	6f / +1391
correctness	Preserve structure needed by stylesheet selectors noahbald/oxvg	rust	60%	472	5f / +314
correctness	Add typed variable bindings to Anko mattn/anko	go	57%	356	9f / +1277
timeout	Add a deferred mutation buffer to batch entity changes pmndrs/koota	typescript	55%	415	12f / +1211
correctness	Add worktree merge conflict handling go-git/go-git	go	53%	417	6f / +1435
regression	Add scoped state data to state machine callbacks and history fgmacedo/python-statemachine	python	50%	594	14f / +1229
correctness	Complete Kitty keyboard phases and stable fallback key metadata Textualize/textual	python	44%	299	8f / +826
correctness	Add scoped per-rule ignore markers to Obsidian Linter platers/obsidian-linter	typescript	31%	220	13f / +8747
timeout	Add recursive schema composition to Valibot open-circle/valibot	typescript	22%	429	15f / +967
correctness	Add GraphQL incremental delivery with @defer and @stream graphql-python/gql	python	17%	279	15f / +1610
timeout	Add async autocomplete options and fetch lifecycle handling bombshell-dev/clack	typescript	8%	72	0f / +0
correctness	Add stepped slices for arrays and strings abs-lang/abs	go	0%	288	6f / +689
correctness	Add action pinning linting for actions and reusable workflows rhysd/actionlint	go	0%	440	9f / +1394
timeout	Add default arguments to Anko function parameters mattn/anko	go	0%	365	4f / +384
timeout	Add try/catch error recovery to expr expr-lang/expr	go	0%	404	13f / +1645
correctness	Add a checker for broken doc comment links go-critic/go-critic	go	0%	463	8f / +642
correctness	Expose accumulated streamed function-call args in SDK surfaces googleapis/go-genai	go	0%	366	7f / +1899
correctness	Add retry-aware publishing audit logs goreleaser/goreleaser	go	0%	312	21f / +1735
correctness	Add configurable array merge strategies to Helm value coalescing helm/helm	go	0%	310	23f / +1788
correctness	Add consistent hash policy support to TrafficPolicy kgateway-dev/kgateway	go	0%	368	191f / +13415
correctness	Add transparent encryption to dump uploads liweiyi88/onedump	go	0%	172	16f / +1527
correctness	Add rule evaluation profiling to Rego open-policy-agent/opa	go	0%	352	16f / +1301
correctness	Reconstruct template strings in partial evaluation output open-policy-agent/opa	go	0%	347	3f / +358
correctness	Add build-time grammar conflict analysis to participle alecthomas/participle	go	0%	302	5f / +1456
correctness	Fix PromQL label sorting across typed and untyped values prometheus/prometheus	go	0%	457	5f / +988
correctness	Add bounded-memory spilling to SCC aggregation boyter/scc	go	0%	217	6f / +1037
correctness	Preserve ANSI resets during truncation and styling muesli/termenv	go	0%	113	10f / +1183
correctness	Add policy-based alerting for failures, latency, and SSL expiry Owloops/updo	go	0%	222	10f / +1445
correctness	Add structured nosec directives for regions and next line PyCQA/bandit	python	0%	321	10f / +1348
correctness	Add implicit HEAD and automatic OPTIONS responses to FastAPI routes fastapi/fastapi	python	0%	349	4f / +1722
correctness	Add config file parsing to Cliffy commands c4spar/cliffy	typescript	0%	325	6f / +1067
correctness	Add keyset cursor pagination to `$find` eicrud/eicrud	typescript	0%	519	11f / +1396
correctness	Add grouping-set and window-frame SQL helpers kysely-org/kysely	typescript	0%	399	21f / +1749
regression	Add transactional reload status and rollback tracking to Prometheus prometheus/prometheus	typescript	0%	475	15f / +1083
regression	Reuse one toolbar across multiple Quill editors slab/quill	typescript	0%	314	6f / +887
correctness	Add `matchEach` to ts-pattern gvergnaud/ts-pattern	typescript	0%	148	4f / +1061

"New tests" = % of the task's new-behavior suite that passed. Outcomes tagged "✓ over-budget" are extended-cap passes excluded from the strict figure.

Comparability & limits

Provider-direct is the leaderboard-faithful axis. DeepSWE standardizes the harness (mini-swe-agent, identical prompt/tools) and hits each model at its own provider. We run MiniMax-M3 through MiniMax's own API for exactly that reason.

Headline is the strict 90-min figure (13.3%). The 16.8% extended figure includes 4 passes the agent only reached past the budget; we disclose but do not claim them.
Single pass@1 run (k=1); no across-seed variance estimate, so no CI on M3's own number.
We cannot bit-verify our harness config against the leaderboard team's, so this is an independent measurement, not an official entry.
The cost column is synthesized from token counts at list price (the run used a flat-rate subscription); peer efficiency anchors are sparse and partly estimated from Datacurve's published figures.
Reward is the program verifier's: new-behavior tests pass and the base suite stays green. No partial credit — the closeness chart above is descriptive only.