The DeepSWE Scenario Catalog

A tour of every challenge in the benchmark: what each task asks an agent to build, how it is verified, and how the 113 tasks distribute across languages, domains, and behaviors.

113 tasks · 91 real OSS projects · 5 languages · ~5,133 hidden tests · Harbor format · Modal verifier

113 behavior-verified tasks
91 distinct source repos
5 programming languages
5,133 hidden test cases (sum of per-task new-test suites)
38 median tests / task (max 192)
612 median ref-solution LOC (lines added; max 1647)
273 median instruction words (detailed natural-language specs)
45 'intricate' tasks (multi-rule, many edge cases)

DeepSWE is a set of 113 real software-engineering tasks drawn from 91 open-source projects. Each task hands the agent a working repository and a precise natural-language specification of a feature or fix to implement, then grades the result against a hidden test suite that the agent never sees. This page catalogs that set: the projects, the behaviors exercised, the verification model, and a searchable index of all 113 scenarios.

How a DeepSWE challenge works

Every task follows the same anatomy. The agent is dropped into a real codebase at a fixed commit and must implement a specified change; an automated verifier then runs a two-phase test suite to decide pass/fail.

Task anatomy

Each task ships as a Harbor-format directory:

instruction.md — the full problem statement handed to the agent (median 273 words; up to 500).
task.toml — metadata + budgets: a 90-minute agent wall-clock and a 30-minute verifier budget, 2 CPU / 8 GB, internet disabled.
tests/ — the hidden test.patch + runner the agent is graded on.
solution/ — a reference solution (median 612 lines added) used to author the task, withheld from the agent.
environment/ — the pinned container image for an air-gapped sandbox.
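
As a rough illustration of the layout above, the following Python sketch checks that a task directory contains the five listed entries and records the stated budgets. The names `TaskBudget` and `missing_entries`, and all field names, are hypothetical; this catalog does not show the actual keys used in task.toml.

```python
from dataclasses import dataclass
from pathlib import Path

# Entries listed in the task anatomy; a trailing "/" marks a directory.
REQUIRED_ENTRIES = ("instruction.md", "task.toml", "tests/", "solution/", "environment/")


@dataclass(frozen=True)
class TaskBudget:
    """Budgets stated in the catalog. Field names are hypothetical."""
    agent_minutes: int = 90
    verifier_minutes: int = 30
    cpus: int = 2
    memory_gb: int = 8
    internet: bool = False


def missing_entries(task_dir: Path) -> list[str]:
    """Return the required entries that are absent from task_dir."""
    missing = []
    for entry in REQUIRED_ENTRIES:
        path = task_dir / entry.rstrip("/")
        present = path.is_dir() if entry.endswith("/") else path.is_file()
        if not present:
            missing.append(entry)
    return missing
```

An empty list from `missing_entries` means the directory has the expected shape; it says nothing about whether the contents are valid.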

The verification model

Grading uses a base + new test split:

Base tests — the project's existing suite, which must still pass (no regressions).
New tests — added by the task author to verify the requested behavior. These are the bar the agent must clear and are never revealed to it.
A task passes only when both phases are green. Across the suite the new-test phase adds ~5,133 assertions (median 38 per task, ranging 1–192).
Tasks are behavior-verified: the spec describes what to build; many valid implementations pass, so the grader rewards observable behavior, not a diff match.
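
The pass rule described above can be sketched in a few lines of Python. `PhaseResult` and `task_passes` are hypothetical names for illustration, not the verifier's actual API, and treating "green" as "zero failures" is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseResult:
    """Outcome of one verifier phase. Hypothetical structure."""
    passed: int
    failed: int

    @property
    def green(self) -> bool:
        # Assumption: a phase is green when no test in it failed.
        return self.failed == 0


def task_passes(base: PhaseResult, new: PhaseResult) -> bool:
    """A task passes only when the base phase and the new-test phase are both green."""
    return base.green and new.green


print(task_passes(PhaseResult(120, 0), PhaseResult(38, 0)))  # True
print(task_passes(PhaseResult(120, 0), PhaseResult(37, 1)))  # False: a new test failed
print(task_passes(PhaseResult(119, 1), PhaseResult(38, 0)))  # False: a regression
```

Because the check looks only at test outcomes, two quite different implementations of the same spec would both pass, which is what behavior-verified grading implies.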

The shape of the suite

The 113 tasks are deliberately broad — five languages, three change types, and a long tail of one-off projects — so a model can't specialize to a single stack or pattern.

By language

TypeScript 35 · Go 34 · Python 34 · JavaScript 5 · Rust 5

TypeScript, Python and Go are near-evenly weighted (35/34/34); Rust and JavaScript form a smaller tail.

By change type

feature_request 106 · bugfix and enhancement 7 combined

The suite is overwhelmingly feature work (106/113) — net-new capability against a real codebase — with a few targeted enhancements and bug fixes.

Languages & ecosystems

Each language pulls in its own ecosystem of libraries and frameworks. The tasks touch parsers, web frameworks, ORMs, game-state libraries, CLIs, interpreters, and more.

Language	Tasks	Representative projects
TypeScript	35	arktype, awilix, clack, claude-code-by-agents, cliffy, drizzle-orm, dynamodb-toolbox, effect, eicrud, happy-dom, httpx, ink, kea, koota, kysely, meriyah, obsidi…
Go	34	abs, actionlint, anko, arcane, dasel, etree, expr, geo, go-critic, go-genai, go-git, goreleaser, helm, kcp-go, kgateway, onedump, opa, participle, pebble, prome…
Python	34	adaptix, aiomonitor, bandit, cattrs, dateutil, fastapi, gql, httpx, igel, ipython, kombu, koota, langchain, mashumaro, mnamer, mobly, narwhals, numba, psd-tools…
JavaScript	5	KaTeX, csstree, testem, yjs…
Rust	5	boa, fd, oxvg, pest, wasmi…
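
As a quick arithmetic check on the table above, this Python sketch confirms that the per-language counts sum to 113 and computes each language's share of the suite. The variable names are illustrative only.

```python
# Task counts copied from the language table.
TASKS_BY_LANGUAGE = {
    "TypeScript": 35,
    "Go": 34,
    "Python": 34,
    "JavaScript": 5,
    "Rust": 5,
}

total = sum(TASKS_BY_LANGUAGE.values())

# Share of the suite per language, in percent, rounded to one decimal place.
shares = {lang: round(100 * n / total, 1) for lang, n in TASKS_BY_LANGUAGE.items()}

print(total)   # 113
print(shares)  # {'TypeScript': 31.0, 'Go': 30.1, 'Python': 30.1, 'JavaScript': 4.4, 'Rust': 4.4}
```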

Behavior themes

Each task was tagged with the behaviors it exercises (from a fixed vocabulary of 24). The distribution shows where the benchmark concentrates: data shaping, state management, error handling, and parsing lead — reflecting how much of real SWE work is precise data and control flow.

data-transformation · state-management · error-handling · parsing · cli/config · serialization · observability/metrics · lifecycle/resource-mgmt · interpreter/codegen · networking/http · schema-validation · streaming · concurrency · persistence · query-builder · type-system · caching · layout/rendering · orm/database · collections/iterators · terminal-ui · retry/resilience · security/taint · time/timezone

Difficulty & verification depth

Spec complexity

moderate 15 · involved 53 · intricate 45

A qualitative read of each instruction's interacting rules and edge cases. 45 tasks are intricate (large config matrices, ordering rules, many error paths); 53 involved; 15 moderate.

Hidden-test suite size

Histogram buckets (hidden tests per task): 1–5 · 6–15 · 16–30 · 31–60 · 61–120 · 121+

Tasks vary from a handful of focused assertions to 192. 32 tasks carry 60+ hidden tests — these demand broad, systematic coverage to pass.
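
The suite-size buckets used here can be expressed as a small lookup. The following Python sketch assigns a hidden-test count to its bucket and tallies a histogram; `bucket_for` is a hypothetical helper, and the sample counts are made up for illustration rather than taken from the corpus.

```python
from collections import Counter

# Inclusive bucket bounds matching the labels used in this section.
BUCKETS = [
    (1, 5, "1–5"),
    (6, 15, "6–15"),
    (16, 30, "16–30"),
    (31, 60, "31–60"),
    (61, 120, "61–120"),
]


def bucket_for(test_count: int) -> str:
    """Map a hidden-test count to its histogram bucket label."""
    if test_count < 1:
        raise ValueError("a counted suite has at least one test")
    for low, high, label in BUCKETS:
        if low <= test_count <= high:
            return label
    return "121+"


# Illustrative counts only, not real per-task data.
sample_counts = [1, 38, 38, 75, 192]
histogram = Counter(bucket_for(n) for n in sample_counts)
print(histogram["31–60"])  # 2
print(bucket_for(192))     # 121+
```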

Source projects

Tasks come from 91 distinct open-source repositories. Most contribute a single task; a handful supply several, each a different feature in the same codebase — testing whether an agent can work repeatedly within one project's conventions.

Projects contributing more than one task

Repository	Tasks
`pmndrs/koota`	5
`PyCQA/bandit`	3
`encode/httpx`	3
`platers/obsidian-linter`	3
`Textualize/textual`	2
`abs-lang/abs`	2
`capricorn86/happy-dom`	2
`celery/kombu`	2
`d5/tengo`	2
`dynamodb-toolbox/dynamodb-toolbox`	2
`fastapi/fastapi`	2
`helm/helm`	2
`mattn/anko`	2
`open-policy-agent/opa`	2
`prometheus/prometheus`	2
`testem/testem`	2

The remaining 75 projects contribute one task each.
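
The repository figures above are mutually consistent, as this short Python check shows: 16 multi-task repositories supply 38 tasks, and the other 75 supply one each. Variable names are illustrative.

```python
# Task counts copied from the table of projects contributing more than one task.
MULTI_TASK_REPOS = {
    "pmndrs/koota": 5,
    "PyCQA/bandit": 3,
    "encode/httpx": 3,
    "platers/obsidian-linter": 3,
    "Textualize/textual": 2,
    "abs-lang/abs": 2,
    "capricorn86/happy-dom": 2,
    "celery/kombu": 2,
    "d5/tengo": 2,
    "dynamodb-toolbox/dynamodb-toolbox": 2,
    "fastapi/fastapi": 2,
    "helm/helm": 2,
    "mattn/anko": 2,
    "open-policy-agent/opa": 2,
    "prometheus/prometheus": 2,
    "testem/testem": 2,
}
TOTAL_REPOS = 91

single_task_repos = TOTAL_REPOS - len(MULTI_TASK_REPOS)
tasks_from_multi = sum(MULTI_TASK_REPOS.values())
total_tasks = tasks_from_multi + single_task_repos

print(single_task_repos)  # 75
print(tasks_from_multi)   # 38
print(total_tasks)        # 113
```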

What the benchmark exercises

Grouping the behavior tags into families shows the kinds of engineering the suite rewards. (Families overlap — a task can belong to several.)

Languages, parsers & compilers · 34 tasks

interpreter/codegen · parsing · type-system

Roughly a third of the suite lives inside a language toolchain — interpreters, parser generators, formatters, and type checkers. Tasks add real language features (stepped slices, destructuring, default arguments, using declarations, methods on user types), grammar-conflict analysis, and character-class optimization. Success means producing parser/AST/runtime behavior that matches a hand-written reference exactly.

Data shaping, serialization & schemas · 44 tasks

data-transformation · serialization · schema-validation

The single largest behavior cluster. Agents implement (de)serializers, schema validators, and bidirectional format converters: error-stack serialization, recursive/lazy schemas, JSON-Schema $ref/if-then-else, TOML table converters, CSS shorthand expansion, and typed heterogeneous sorting. These reward precise edge-case handling and clean round-tripping.

Query builders, stores & persistence · 30 tasks

query-builder · orm/database · persistence · caching

Database- and query-adjacent work: SQL window-function and grouping builders, JSONPath querying, keyset cursor pagination, three-way Git merge, checkpoint/rollback import safety, snapshot systems, and persistent analysis caches. Many require both a correct in-memory model and a durable on-disk format.

Concurrency, lifecycle & resilience · 28 tasks

concurrency · lifecycle/resource-mgmt · streaming · retry/resilience

Coordination-heavy scenarios: dependency-ordered async init with rollback, hierarchical evaluation cancellation, single-active-consumer election, stream multiplexing with priority and flow control, circuit breakers, deterministic retry, and durability wait APIs. These stress correct ordering, cleanup, and failure semantics under contention.

Networking & protocols · 12 tasks

networking/http

HTTP and wire-protocol behavior: implicit HEAD/OPTIONS, deprecation headers, a standards-faithful cookie store, multipart and streaming-JSON parsing, incremental GraphQL delivery, and Server-Sent Events. RFC fidelity and header/precedence rules dominate.

State management & collections · 30 tasks

state-management · collections/iterators

A large slice driven by the ECS library koota (aspects, snapshots, deferred mutation buffers, relation-pair tracking, predicates) plus reactive selectors, restored query state, and functional iterator combinators. Fine-grained change detection is the recurring challenge.

Reliability, security & observability · 35 tasks

observability/metrics · error-handling · security/taint

Cross-cutting correctness: interprocedural taint analysis, structured warning-suppression directives, evaluation profiling, drift/compliance detection, transactional config reload, and rich error classification. Often the 'feature' is a diagnostic.

UI, rendering & terminals · 11 tasks

terminal-ui · layout/rendering

Presentation logic with exacting output: CSS grid layout, IntersectionObserver geometry, ANSI-safe truncation, Kitty-protocol key phases, multicolumn math spans, blend-range masks, and shared-toolbar focus routing.
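
The family groupings in this section amount to a mapping from family name to a set of behavior tags. The Python sketch below, with the hypothetical helper `families_for`, shows how one task can land in several families, and why the per-family task counts add up to more than 113.

```python
# Family -> behavior tags, copied from the groupings in this section.
FAMILIES = {
    "Languages, parsers & compilers": {"interpreter/codegen", "parsing", "type-system"},
    "Data shaping, serialization & schemas": {"data-transformation", "serialization", "schema-validation"},
    "Query builders, stores & persistence": {"query-builder", "orm/database", "persistence", "caching"},
    "Concurrency, lifecycle & resilience": {"concurrency", "lifecycle/resource-mgmt", "streaming", "retry/resilience"},
    "Networking & protocols": {"networking/http"},
    "State management & collections": {"state-management", "collections/iterators"},
    "Reliability, security & observability": {"observability/metrics", "error-handling", "security/taint"},
    "UI, rendering & terminals": {"terminal-ui", "layout/rendering"},
}

# Per-family task counts as stated in this section, in the same order.
FAMILY_TASK_COUNTS = [34, 44, 30, 28, 12, 30, 35, 11]


def families_for(task_tags):
    """Return every family that shares at least one tag with the task."""
    tags = set(task_tags)
    return [name for name, family_tags in FAMILIES.items() if tags & family_tags]


# A hypothetical task tagged with both parsing and error-handling.
print(families_for({"parsing", "error-handling"}))
# ['Languages, parsers & compilers', 'Reliability, security & observability']

print(sum(FAMILY_TASK_COUNTS))  # 224, well above 113 because families overlap
```

Under this mapping the tags cli/config and time/timezone fall outside every listed family, so a task carrying only those tags would match none.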

Full catalog — all 113 challenges


Challenge	Project	Lang	Difficulty	Tests

Notes & method

How this catalog was built. Per-task facts (project, language, change type, instruction length, reference-solution size) come from each task's manifest.json / task.toml. Every instruction.md was read in full to summarize what the task asks for, what its hidden tests verify, its technical domain, and its behavior tags. Hidden new-test suite sizes are measured from each task's test.patch by the framework-aware parser used in our run analysis.

On test counts. "Tests" is the size of each task's hidden new-test suite. For most tasks this is the exact count of new assertions parsed from the verifier output; for a small number (where the suite is graded by exit code, or where a build failure prevented a clean count) it is an estimate from the number of added test functions, marked ≈. Two Go/JS tasks use sub-test patterns we don't count granularly and show "—". Difficulty is a qualitative read of each instruction's interacting rules, not a measured quantity.

Scope. This is a descriptive map of the benchmark itself — it says nothing about any model's score. For a per-model measurement against this same task set, see the companion MiniMax-M3 on DeepSWE results report.

Upstream: deepswe.datacurve.ai.

Generated locally from the task corpus · 113 tasks · 91 repositories · ~5,133 hidden tests catalogued.