The DeepSWE Scenario Catalog

A tour of every challenge in the benchmark: what each task asks an agent to build, how it is verified, and how the 113 tasks distribute across languages, domains, and behaviors.
113 tasks · 91 real OSS projects · 5 languages · ~5,133 hidden tests · Harbor format · Modal verifier
  • 113 behavior-verified tasks
  • 91 distinct source repos
  • 5 programming languages
  • 5,133 hidden test cases (sum of per-task new-test suites)
  • 38 median tests per task (max 192)
  • 612 median reference-solution LOC (lines added; max 1647)
  • 273 median instruction words (detailed natural-language specs)
  • 45 'intricate' tasks (multi-rule, many edge cases)

DeepSWE is a set of 113 real software-engineering tasks drawn from 91 open-source projects. Each task hands the agent a working repository and a precise natural-language specification of a feature or fix to implement, then grades the result against a hidden test suite that the agent never sees. This page catalogs that set: the projects, the behaviors exercised, the verification model, and a searchable index of all 113 scenarios.

How a DeepSWE challenge works

Every task follows the same anatomy. The agent is dropped into a real codebase at a fixed commit and must implement a specified change; an automated verifier then runs a two-phase test suite to decide pass/fail.

Task anatomy

Each task ships as a Harbor-format directory:

  • instruction.md — the full problem statement handed to the agent (median 273 words; up to 500).
  • task.toml — metadata + budgets: a 90-minute agent wall-clock and a 30-minute verifier budget, 2 CPU / 8 GB, internet disabled.
  • tests/ — the hidden test.patch + runner the agent is graded on.
  • solution/ — a reference solution (median 612 lines added) used to author the task, withheld from the agent.
  • environment/ — the pinned container image for an air-gapped sandbox.
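The layout above can be checked mechanically. The following is a minimal sketch, not part of the benchmark tooling: it only assumes the five entry names listed above, and treats entries with a trailing slash as directories. The function names and the demo directory are hypothetical.

```python
from pathlib import Path
import tempfile

# Entry names come from the task anatomy list; whether each is a file or a
# directory is inferred from the trailing slash in that list (an assumption).
REQUIRED_FILES = ("instruction.md", "task.toml")
REQUIRED_DIRS = ("tests", "solution", "environment")


def missing_entries(task_dir: Path) -> list[str]:
    """Return the expected entries that are absent from task_dir."""
    missing = [name for name in REQUIRED_FILES if not (task_dir / name).is_file()]
    missing += [name + "/" for name in REQUIRED_DIRS if not (task_dir / name).is_dir()]
    return missing


def make_demo_task(root: Path) -> Path:
    """Build a throwaway task directory with placeholder contents."""
    task = root / "demo-task"
    task.mkdir()
    (task / "instruction.md").write_text("Implement the feature.\n")
    (task / "task.toml").write_text("# metadata and budgets would go here\n")
    for name in ("tests", "solution"):
        (task / name).mkdir()
    return task  # environment/ is deliberately left out


with tempfile.TemporaryDirectory() as tmp:
    DEMO_MISSING = missing_entries(make_demo_task(Path(tmp)))
    print(DEMO_MISSING)  # ['environment/']
```

A check like this only confirms that the directory has the expected shape; it says nothing about whether the contents are valid.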

The verification model

Grading uses a base + new test split:

  • Base tests — the project's existing suite, which must still pass (no regressions).
  • New tests — added by the task author to verify the requested behavior. These are the bar the agent must clear and are never revealed to it.
  • A task passes only when both phases are green. Across the suite the new-test phase adds ~5,133 assertions (median 38 per task, ranging 1–192).
  • Tasks are behavior-verified: the spec describes what to build; many valid implementations pass, so the grader rewards observable behavior, not a diff match.
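The pass rule above reduces to a conjunction over the two phases. A minimal sketch follows, assuming each phase reports counts of passed and failed tests; `PhaseResult` and `task_passes` are hypothetical names, not the verifier's actual interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseResult:
    """Outcome of one verifier phase (hypothetical shape)."""

    passed: int
    failed: int

    @property
    def green(self) -> bool:
        # Assumption: a phase is green when no test in it failed.
        return self.failed == 0


def task_passes(base: PhaseResult, new: PhaseResult) -> bool:
    """A task passes only when both the base and the new phase are green."""
    return base.green and new.green


# The new behavior works, but an existing test regressed: the task fails.
print(task_passes(PhaseResult(passed=410, failed=1), PhaseResult(passed=38, failed=0)))  # False
# Both phases are green: the task passes.
print(task_passes(PhaseResult(passed=411, failed=0), PhaseResult(passed=38, failed=0)))  # True
```

The verdict is all-or-nothing: a single failing base test outweighs any number of passing new tests.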

The shape of the suite

The 113 tasks are deliberately broad — five languages, three change types, and a long tail of one-off projects — so a model can't specialize to a single stack or pattern.

By language

  • TypeScript: 35
  • Go: 34
  • Python: 34
  • JavaScript: 5
  • Rust: 5

TypeScript, Python and Go are near-evenly weighted (35/34/34); Rust and JavaScript form a smaller tail.

By change type

  • feature_request: 106
  • bugfix: 4
  • enhancement: 3

The suite is overwhelmingly feature work (106/113) — net-new capability against a real codebase — with a few targeted enhancements and bug fixes.

Languages & ecosystems

Each language pulls in its own ecosystem of libraries and frameworks. The tasks touch parsers, web frameworks, ORMs, game-state libraries, CLIs, interpreters, and more.

Language | Tasks | Representative projects
TypeScript | 35 | arktype, awilix, clack, claude-code-by-agents, cliffy, drizzle-orm, dynamodb-toolbox, effect, eicrud, happy-dom, httpx, ink, kea, koota, kysely, meriyah, obsidi…
Go | 34 | abs, actionlint, anko, arcane, dasel, etree, expr, geo, go-critic, go-genai, go-git, goreleaser, helm, kcp-go, kgateway, onedump, opa, participle, pebble, prome…
Python | 34 | adaptix, aiomonitor, bandit, cattrs, dateutil, fastapi, gql, httpx, igel, ipython, kombu, koota, langchain, mashumaro, mnamer, mobly, narwhals, numba, psd-tools…
JavaScript | 5 | KaTeX, csstree, testem, yjs…
Rust | 5 | boa, fd, oxvg, pest, wasmi…

Behavior themes

Each task was tagged with the behaviors it exercises (from a fixed vocabulary of 24). The distribution shows where the benchmark concentrates: data shaping, state management, error handling, and parsing lead — reflecting how much of real SWE work is precise data and control flow.

  • data-transformation: 28
  • state-management: 25
  • error-handling: 24
  • parsing: 22
  • cli/config: 17
  • serialization: 16
  • observability/metrics: 13
  • lifecycle/resource-mgmt: 13
  • interpreter/codegen: 12
  • networking/http: 12
  • schema-validation: 12
  • streaming: 10
  • concurrency: 10
  • persistence: 10
  • query-builder: 9
  • type-system: 9
  • caching: 8
  • layout/rendering: 7
  • orm/database: 5
  • collections/iterators: 5
  • terminal-ui: 4
  • retry/resilience: 3
  • security/taint: 3
  • time/timezone: 1

Difficulty & verification depth

Spec complexity

  • moderate: 15
  • involved: 53
  • intricate: 45

A qualitative read of each instruction's interacting rules and edge cases. 45 tasks are intricate (large config matrices, ordering rules, many error paths); 53 involved; 15 moderate.

Hidden-test suite size

  • 1–5 tests: 17 tasks
  • 6–15 tests: 10 tasks
  • 16–30 tests: 23 tasks
  • 31–60 tests: 29 tasks
  • 61–120 tests: 27 tasks
  • 121+ tests: 5 tasks

Tasks vary from a handful of focused assertions to 192. 32 tasks carry more than 60 hidden tests (the 61–120 and 121+ buckets combined); these demand broad, systematic coverage to pass.
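The bucketing and the 32-task figure can be reproduced with a few lines. This is a sketch under stated assumptions: the bucket edges and per-bucket counts are copied from the histogram above, while `bucket_label` is a hypothetical helper, not the parser used to build the catalog.

```python
# Inclusive bucket edges, as shown in the histogram; anything above 120 is "121+".
BUCKETS = [(1, 5), (6, 15), (16, 30), (31, 60), (61, 120)]

# Tasks per bucket, copied from the histogram.
BUCKET_COUNTS = {"1–5": 17, "6–15": 10, "16–30": 23, "31–60": 29, "61–120": 27, "121+": 5}


def bucket_label(n_tests: int) -> str:
    """Map a hidden-test count to its histogram bucket label."""
    if n_tests < 1:
        raise ValueError("a counted suite has at least one test")
    for low, high in BUCKETS:
        if low <= n_tests <= high:
            return f"{low}–{high}"
    return "121+"


print(bucket_label(38))   # 31–60 (the median task)
print(bucket_label(192))  # 121+ (the largest suite)

# Tasks with more than 60 hidden tests: the top two buckets.
OVER_60 = BUCKET_COUNTS["61–120"] + BUCKET_COUNTS["121+"]
# Tasks with a counted suite; the notes say two tasks are not counted granularly.
COUNTED = sum(BUCKET_COUNTS.values())
print(OVER_60, COUNTED)  # 32 111
```

The buckets account for 111 tasks; adding the two tasks that the notes describe as not counted granularly gives the full 113.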

Source projects

Tasks come from 91 distinct open-source repositories. Most contribute a single task; a handful supply several, each a different feature in the same codebase — testing whether an agent can work repeatedly within one project's conventions.

Projects contributing more than one task

Repository | Tasks
pmndrs/koota | 5
PyCQA/bandit | 3
encode/httpx | 3
platers/obsidian-linter | 3
Textualize/textual | 2
abs-lang/abs | 2
capricorn86/happy-dom | 2
celery/kombu | 2
d5/tengo | 2
dynamodb-toolbox/dynamodb-toolbox | 2
fastapi/fastapi | 2
helm/helm | 2
mattn/anko | 2
open-policy-agent/opa | 2
prometheus/prometheus | 2
testem/testem | 2

The remaining 75 projects contribute one task each.
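The repository and task totals can be cross-checked from the table. The sketch below uses only figures stated on this page (91 repositories, 113 tasks, and the multi-task table); the variable names are illustrative.

```python
# Repositories contributing more than one task, copied from the table above.
MULTI_TASK_REPOS = {
    "pmndrs/koota": 5,
    "PyCQA/bandit": 3,
    "encode/httpx": 3,
    "platers/obsidian-linter": 3,
    "Textualize/textual": 2,
    "abs-lang/abs": 2,
    "capricorn86/happy-dom": 2,
    "celery/kombu": 2,
    "d5/tengo": 2,
    "dynamodb-toolbox/dynamodb-toolbox": 2,
    "fastapi/fastapi": 2,
    "helm/helm": 2,
    "mattn/anko": 2,
    "open-policy-agent/opa": 2,
    "prometheus/prometheus": 2,
    "testem/testem": 2,
}

TOTAL_REPOS = 91  # distinct source repositories, as stated above

# Every repository not in the table contributes exactly one task.
SINGLE_TASK_REPOS = TOTAL_REPOS - len(MULTI_TASK_REPOS)
TOTAL_TASKS = sum(MULTI_TASK_REPOS.values()) + SINGLE_TASK_REPOS

print(len(MULTI_TASK_REPOS), SINGLE_TASK_REPOS, TOTAL_TASKS)  # 16 75 113
```

Sixteen repositories supply 38 tasks between them, and the 75 single-task repositories supply the rest.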

What the benchmark exercises

Grouping the behavior tags into families shows the kinds of engineering the suite rewards. (Families overlap — a task can belong to several.)

Languages, parsers & compilers (34 tasks)
Tags: interpreter/codegen, parsing, type-system

Roughly a third of the suite lives inside a language toolchain — interpreters, parser generators, formatters, and type checkers. Tasks add real language features (stepped slices, destructuring, default arguments, using declarations, methods on user types), grammar-conflict analysis, and character-class optimization. Success means producing parser/AST/runtime behavior that matches a hand-written reference exactly.

Data shaping, serialization & schemas (44 tasks)
Tags: data-transformation, serialization, schema-validation

The single largest behavior cluster. Agents implement (de)serializers, schema validators, and bidirectional format converters: error-stack serialization, recursive/lazy schemas, JSON-Schema $ref/if-then-else, TOML table converters, CSS shorthand expansion, and typed heterogeneous sorting. These reward precise edge-case handling and clean round-tripping.

Query builders, stores & persistence (30 tasks)
Tags: query-builder, orm/database, persistence, caching

Database- and query-adjacent work: SQL window-function and grouping builders, JSONPath querying, keyset cursor pagination, three-way Git merge, checkpoint/rollback import safety, snapshot systems, and persistent analysis caches. Many require both a correct in-memory model and a durable on-disk format.

Concurrency, lifecycle & resilience (28 tasks)
Tags: concurrency, lifecycle/resource-mgmt, streaming, retry/resilience

Coordination-heavy scenarios: dependency-ordered async init with rollback, hierarchical evaluation cancellation, single-active-consumer election, stream multiplexing with priority and flow control, circuit breakers, deterministic retry, and durability wait APIs. These stress correct ordering, cleanup, and failure semantics under contention.

Networking & protocols (12 tasks)
Tags: networking/http

HTTP and wire-protocol behavior: implicit HEAD/OPTIONS, deprecation headers, a standards-faithful cookie store, multipart and streaming-JSON parsing, incremental GraphQL delivery, and Server-Sent Events. RFC fidelity and header/precedence rules dominate.

State management & collections (30 tasks)
Tags: state-management, collections/iterators

A large slice driven by the ECS library koota (aspects, snapshots, deferred mutation buffers, relation-pair tracking, predicates) plus reactive selectors, restored query state, and functional iterator combinators. Fine-grained change detection is the recurring challenge.

Reliability, security & observability (35 tasks)
Tags: observability/metrics, error-handling, security/taint

Cross-cutting correctness: interprocedural taint analysis, structured warning-suppression directives, evaluation profiling, drift/compliance detection, transactional config reload, and rich error classification. Often the 'feature' is a diagnostic.

UI, rendering & terminals (11 tasks)
Tags: terminal-ui, layout/rendering

Presentation logic with exacting output: CSS grid layout, IntersectionObserver geometry, ANSI-safe truncation, Kitty-protocol key phases, multicolumn math spans, blend-range masks, and shared-toolbar focus routing.
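Because a task can carry several tags, the size of each family above is the size of a union of tag sets, not a sum of tag counts. The sketch below checks that every family size lies between the largest single tag count and the sum of its tag counts. All numbers are copied from the "Behavior themes" list and the family headings; `overlap_bounds` is a hypothetical helper.

```python
# Per-tag task counts for the tags that appear in a family.
TAG_COUNTS = {
    "data-transformation": 28, "state-management": 25, "error-handling": 24,
    "parsing": 22, "serialization": 16, "observability/metrics": 13,
    "lifecycle/resource-mgmt": 13, "interpreter/codegen": 12,
    "networking/http": 12, "schema-validation": 12, "streaming": 10,
    "concurrency": 10, "persistence": 10, "query-builder": 9,
    "type-system": 9, "caching": 8, "layout/rendering": 7,
    "orm/database": 5, "collections/iterators": 5, "terminal-ui": 4,
    "retry/resilience": 3, "security/taint": 3,
}

# Family name -> (member tags, stated task count).
FAMILIES = {
    "Languages, parsers & compilers": (("interpreter/codegen", "parsing", "type-system"), 34),
    "Data shaping, serialization & schemas": (("data-transformation", "serialization", "schema-validation"), 44),
    "Query builders, stores & persistence": (("query-builder", "orm/database", "persistence", "caching"), 30),
    "Concurrency, lifecycle & resilience": (("concurrency", "lifecycle/resource-mgmt", "streaming", "retry/resilience"), 28),
    "Networking & protocols": (("networking/http",), 12),
    "State management & collections": (("state-management", "collections/iterators"), 30),
    "Reliability, security & observability": (("observability/metrics", "error-handling", "security/taint"), 35),
    "UI, rendering & terminals": (("terminal-ui", "layout/rendering"), 11),
}


def overlap_bounds(tags):
    """Bounds on how many distinct tasks carry at least one of the tags."""
    counts = [TAG_COUNTS[tag] for tag in tags]
    return max(counts), sum(counts)  # full overlap vs. no overlap


for name, (tags, size) in FAMILIES.items():
    lower, upper = overlap_bounds(tags)
    assert lower <= size <= upper, name
    # upper - size = tag assignments absorbed by tasks carrying 2+ family tags
    print(f"{name}: {size} tasks, {upper - size} extra tag assignments")
```

Where a family's size equals the sum of its tag counts, as with the state-management and UI families, no task in it carries two of the family's tags.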

Full catalog — all 113 challenges

Each row lists one scenario; its expanded description covers what the agent must build and what the hidden tests check.

Challenge | Project | Lang | Difficulty | Tests

Notes & method

How this catalog was built. Per-task facts (project, language, change type, instruction length, reference-solution size) come from each task's manifest.json / task.toml. Every instruction.md was read in full to summarize what the task asks for, what its hidden tests verify, its technical domain, and its behavior tags. Hidden new-test suite sizes are measured from each task's test.patch by the framework-aware parser used in our run analysis.

On test counts. "Tests" is the size of each task's hidden new-test suite. For most tasks this is the exact count of new assertions parsed from the verifier output. For a small number (where the suite is graded by exit code, or where a build failure prevented a clean count) it is an estimate from the number of added test functions, shown in muted text and marked as estimated. Two Go/JS tasks use sub-test patterns we don't count granularly and show "—". Difficulty is a qualitative read of each instruction's interacting rules, not a measured quantity.

Scope. This is a descriptive map of the benchmark itself — it says nothing about any model's score. For a per-model measurement against this same task set, see the companion MiniMax-M3 on DeepSWE results report.

Upstream: deepswe.datacurve.ai.

Generated locally from the task corpus · 113 tasks · 91 repositories · ~5,133 hidden tests catalogued.