The DeepSWE Scenario Catalog
DeepSWE is a set of 113 real software-engineering tasks drawn from 91 open-source projects. Each task hands the agent a working repository and a precise natural-language specification of a feature or fix to implement, then grades the result against a hidden test suite that the agent never sees. This page catalogs that set: the projects, the behaviors exercised, the verification model, and a searchable index of all 113 scenarios.
How a DeepSWE challenge works
Every task follows the same anatomy. The agent is dropped into a real codebase at a fixed commit and must implement a specified change; an automated verifier then runs a two-phase test suite to decide pass/fail.
Task anatomy
Each task ships as a Harbor-format directory:
- instruction.md: the full problem statement handed to the agent (median 273 words; up to 500).
- task.toml: metadata and budgets: a 90-minute agent wall-clock and a 30-minute verifier budget, 2 CPU / 8 GB, internet disabled.
- tests/: the hidden test.patch and the runner the agent is graded on.
- solution/: a reference solution (median 612 lines added) used to author the task, withheld from the agent.
- environment/: the pinned container image for an air-gapped sandbox.
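As a rough illustration of that layout, the sketch below checks a task directory for the five entries listed above. The helper name `missing_entries` is hypothetical, and whether each entry is a file or a directory is assumed from the trailing slash in the listing; this is not part of any Harbor tooling.

```python
from pathlib import Path

# Entry names come from the task anatomy listed above. Treating the
# slash-suffixed names as directories is an assumption of this sketch.
EXPECTED_FILES = ("instruction.md", "task.toml")
EXPECTED_DIRS = ("tests", "solution", "environment")


def missing_entries(task_dir: Path) -> list[str]:
    """Return the expected entries that are absent from task_dir."""
    missing = [name for name in EXPECTED_FILES if not (task_dir / name).is_file()]
    missing += [name + "/" for name in EXPECTED_DIRS if not (task_dir / name).is_dir()]
    return missing
```

An empty return value means the directory has the expected shape; it says nothing about whether the contents are valid.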
The verification model
Grading uses a base + new test split:
- Base tests — the project's existing suite, which must still pass (no regressions).
- New tests — added by the task author to verify the requested behavior. These are the bar the agent must clear and are never revealed to it.
- A task passes only when both phases are green. Across the suite the new-test phase adds ~5,133 assertions (median 38 per task, ranging 1–192).
- Tasks are behavior-verified: the spec describes what to build; many valid implementations pass, so the grader rewards observable behavior, not a diff match.
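The pass rule above can be sketched in a few lines. The `PhaseResult` type and `task_passes` function are hypothetical names chosen for illustration, and treating a phase as green when it has no failures is an assumption; the actual verifier may represent results differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseResult:
    """Outcome of one test phase (hypothetical shape, for illustration)."""
    passed: int
    failed: int

    @property
    def green(self) -> bool:
        # Assumption: a phase is green when nothing in it failed.
        return self.failed == 0


def task_passes(base: PhaseResult, new: PhaseResult) -> bool:
    """A task passes only when both the base and the new phase are green."""
    return base.green and new.green
```

Under this rule a solution that clears every new test but breaks one existing test still fails, which is what makes the base phase a regression guard.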
The shape of the suite
The 113 tasks are deliberately broad — five languages, three change types, and a long tail of one-off projects — so a model can't specialize to a single stack or pattern.
By language
TypeScript, Python and Go are near-evenly weighted (35/34/34); Rust and JavaScript form a smaller tail of 5 tasks each.
By change type
The suite is overwhelmingly feature work (106/113), meaning net-new capability against a real codebase; the remaining 7 tasks are targeted enhancements and bug fixes.
Languages & ecosystems
Each language pulls in its own ecosystem of libraries and frameworks. The tasks touch parsers, web frameworks, ORMs, game-state libraries, CLIs, interpreters, and more.
| Language | Tasks | Representative projects |
|---|---|---|
| TypeScript | 35 | arktype, awilix, clack, claude-code-by-agents, cliffy, drizzle-orm, dynamodb-toolbox, effect, eicrud, happy-dom, httpx, ink, kea, koota, kysely, meriyah, obsidi… |
| Go | 34 | abs, actionlint, anko, arcane, dasel, etree, expr, geo, go-critic, go-genai, go-git, goreleaser, helm, kcp-go, kgateway, onedump, opa, participle, pebble, prome… |
| Python | 34 | adaptix, aiomonitor, bandit, cattrs, dateutil, fastapi, gql, httpx, igel, ipython, kombu, koota, langchain, mashumaro, mnamer, mobly, narwhals, numba, psd-tools… |
| JavaScript | 5 | KaTeX, csstree, testem, yjs… |
| Rust | 5 | boa, fd, oxvg, pest, wasmi… |
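The per-language counts in the table can be turned into shares of the suite with a short calculation. The counts are copied from the table; the function and variable names are illustrative only.

```python
# Task counts per language, copied from the table above.
TASKS_BY_LANGUAGE = {
    "TypeScript": 35,
    "Go": 34,
    "Python": 34,
    "JavaScript": 5,
    "Rust": 5,
}


def language_shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each language's fraction of the total task count."""
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


total_tasks = sum(TASKS_BY_LANGUAGE.values())  # 113
shares = language_shares(TASKS_BY_LANGUAGE)
```

The three large languages together account for 103 of the 113 tasks, so the Rust and JavaScript tail covers the remaining 10.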
Behavior themes
Each task was tagged with the behaviors it exercises (from a fixed vocabulary of 24). The distribution shows where the benchmark concentrates: data shaping, state management, error handling, and parsing lead — reflecting how much of real SWE work is precise data and control flow.
Difficulty & verification depth
Spec complexity
A qualitative read of each instruction's interacting rules and edge cases. 45 tasks are intricate (large config matrices, ordering rules, many error paths); 53 involved; 15 moderate.
Hidden-test suite size
Tasks vary from a handful of focused assertions to 192. 32 tasks carry 60+ hidden tests — these demand broad, systematic coverage to pass.
Source projects
Tasks come from 91 distinct open-source repositories. Most contribute a single task; a handful supply several, each a different feature in the same codebase — testing whether an agent can work repeatedly within one project's conventions.
Projects contributing more than one task
| Repository | Tasks |
|---|---|
| pmndrs/koota | 5 |
| PyCQA/bandit | 3 |
| encode/httpx | 3 |
| platers/obsidian-linter | 3 |
| Textualize/textual | 2 |
| abs-lang/abs | 2 |
| capricorn86/happy-dom | 2 |
| celery/kombu | 2 |
| d5/tengo | 2 |
| dynamodb-toolbox/dynamodb-toolbox | 2 |
| fastapi/fastapi | 2 |
| helm/helm | 2 |
| mattn/anko | 2 |
| open-policy-agent/opa | 2 |
| prometheus/prometheus | 2 |
| testem/testem | 2 |
The remaining 75 projects contribute one task each.
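The figures in this section reconcile with the headline totals, as the arithmetic below shows. The per-project counts are copied from the table above; the variable names are only for illustration.

```python
# Projects contributing more than one task, copied from the table above.
MULTI_TASK_PROJECTS = {
    "pmndrs/koota": 5,
    "PyCQA/bandit": 3,
    "encode/httpx": 3,
    "platers/obsidian-linter": 3,
    "Textualize/textual": 2,
    "abs-lang/abs": 2,
    "capricorn86/happy-dom": 2,
    "celery/kombu": 2,
    "d5/tengo": 2,
    "dynamodb-toolbox/dynamodb-toolbox": 2,
    "fastapi/fastapi": 2,
    "helm/helm": 2,
    "mattn/anko": 2,
    "open-policy-agent/opa": 2,
    "prometheus/prometheus": 2,
    "testem/testem": 2,
}
SINGLE_TASK_PROJECTS = 75  # "The remaining 75 projects contribute one task each."

total_projects = len(MULTI_TASK_PROJECTS) + SINGLE_TASK_PROJECTS            # 16 + 75 = 91
total_tasks = sum(MULTI_TASK_PROJECTS.values()) + SINGLE_TASK_PROJECTS      # 38 + 75 = 113
```

So 16 repositories supply 38 tasks between them, and the 75 single-task repositories supply the rest.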
What the benchmark exercises
Grouping the behavior tags into families shows the kinds of engineering the suite rewards. (Families overlap — a task can belong to several.)
Roughly a third of the suite lives inside a language toolchain — interpreters, parser generators, formatters, and type checkers. Tasks add real language features (stepped slices, destructuring, default arguments, using declarations, methods on user types), grammar-conflict analysis, and character-class optimization. Success means producing parser/AST/runtime behavior that matches a hand-written reference exactly.
Serialization, validation, and format conversion form the single largest behavior cluster. Agents implement (de)serializers, schema validators, and bidirectional format converters: error-stack serialization, recursive/lazy schemas, JSON-Schema $ref/if-then-else, TOML table converters, CSS shorthand expansion, and typed heterogeneous sorting. These reward precise edge-case handling and clean round-tripping.
Database- and query-adjacent work: SQL window-function and grouping builders, JSONPath querying, keyset cursor pagination, three-way Git merge, checkpoint/rollback import safety, snapshot systems, and persistent analysis caches. Many require both a correct in-memory model and a durable on-disk format.
Coordination-heavy scenarios: dependency-ordered async init with rollback, hierarchical evaluation cancellation, single-active-consumer election, stream multiplexing with priority and flow control, circuit breakers, deterministic retry, and durability wait APIs. These stress correct ordering, cleanup, and failure semantics under contention.
HTTP and wire-protocol behavior: implicit HEAD/OPTIONS, deprecation headers, a standards-faithful cookie store, multipart and streaming-JSON parsing, incremental GraphQL delivery, and Server-Sent Events. RFC fidelity and header/precedence rules dominate.
A large slice of the suite is driven by the ECS library koota (aspects, snapshots, deferred mutation buffers, relation-pair tracking, predicates), alongside reactive selectors, restored query state, and functional iterator combinators. Fine-grained change detection is the recurring challenge.
Cross-cutting correctness: interprocedural taint analysis, structured warning-suppression directives, evaluation profiling, drift/compliance detection, transactional config reload, and rich error classification. Often the 'feature' is a diagnostic.
Presentation logic with exacting output: CSS grid layout, IntersectionObserver geometry, ANSI-safe truncation, Kitty-protocol key phases, multicolumn math spans, blend-range masks, and shared-toolbar focus routing.
Full catalog — all 113 challenges
Search and filter every scenario. Click a row to expand the full description of what the agent must build and what the hidden tests check.
| Challenge | Project | Lang | Difficulty | Tests |
|---|---|---|---|---|
Notes & method
How this catalog was built. Per-task facts (project, language, change type, instruction length, reference-solution size) come from each task's manifest.json / task.toml. Every instruction.md was read in full to summarize what the task asks for, what its hidden tests verify, its technical domain, and its behavior tags. Hidden new-test suite sizes are measured from each task's test.patch by the framework-aware parser used in our run analysis.
On test counts. "Tests" is the size of each task's hidden new-test suite. For most tasks this is the exact count of new assertions parsed from the verifier output; for a small number (where the suite is graded by exit code, or where a build failure prevented a clean count) it is an estimate from the number of added test functions, shown in muted text and marked ≈. Two Go/JS tasks use sub-test patterns we don't count granularly and show "—". Difficulty is a qualitative read of each instruction's interacting rules, not a measured quantity.
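The three display cases for the "Tests" column described above (exact count, estimate marked ≈, and "—" for uncounted suites) could be rendered along these lines. The function name and its parameters are hypothetical; this is a sketch of the convention, not the code behind the catalog.

```python
from typing import Optional


def format_test_count(count: Optional[int], estimated: bool = False) -> str:
    """Render a hidden-test count using the catalog's display convention.

    count=None stands for a suite that is not counted granularly.
    estimated=True stands for a count inferred from added test functions.
    """
    if count is None:
        return "—"
    return f"≈{count}" if estimated else str(count)
```

For example, `format_test_count(38)` gives `"38"`, while `format_test_count(12, estimated=True)` gives `"≈12"`.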
Scope. This is a descriptive map of the benchmark itself — it says nothing about any model's score. For a per-model measurement against this same task set, see the companion MiniMax-M3 on DeepSWE results report.
Upstream: deepswe.datacurve.ai.