Published: 2026-04-08. Data note: leaderboard scores below were cross-checked against arena history on 2026-04-07 UTC. Change descriptions come from the iteration logs and tracked submission notes.

Announcement note: Sentient published the Cohort 0 results on 2026-04-15 in “The results are in: Meet the winners of Cohort 0”, listing RETRO / Robert Amanfu in 5th place.

This was my final report for Treasury Reapers, my team name for Sentient Arena Challenge 0: Grounded Reasoning. The challenge asked participants to build agents for OfficeQA, where the agent answers precise financial questions from the U.S. Treasury Bulletin corpus.

Team info

Field	Value
Team name	Treasury Reapers
Team members	Robert Amanfu
Challenge	Sentient Arena Challenge 0: Grounded Reasoning
Public announcement	5th place as RETRO / Robert Amanfu in Sentient’s Cohort 0 results, published 2026-04-15
Best leaderboard score	183.481
Best submission	`21458905` on 2026-04-06
Final stack direction	`goose` harness + custom OfficeQA prompt + skill contracts

What was built

The system is a grounded reasoning agent for OfficeQA over the Treasury Bulletin corpus: a custom Jinja2 prompt template plus skill contracts packaged as markdown files:

bulletin-retriever — rules for selecting the right Treasury Bulletin issue, normalizing fiscal-year boundaries, and preferring retrospective tables over month-by-month expansion
plaintext-table-parser — rules for locating table blocks by caption, splitting fixed-width rows, normalizing labels (Jan/Jan./January), and handling parenthetical negatives
arithmetic-verifier — rules for safe summation, percent change, unit conversion, and avoiding double-counting of subtotals
cross-doc-aggregator — rules for merging data across multiple bulletin files, detecting duplicates, and checking period coverage
answer-writer — mandatory answer-file contract: write early, overwrite if improved, verify before stopping
python-computation — rules for writing self-contained /app/calc.py scripts using only the standard library
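The plaintext-table-parser rules can be illustrated with a minimal sketch. The helper names below (`normalize_month`, `parse_amount`, `split_fixed_width`) are hypothetical and are not taken from the skill files, and splitting on runs of two or more spaces is an assumption about how the fixed-width rows are laid out.

```python
import re

# Hypothetical helpers for illustration; not the actual skill contract text.
MONTHS = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]


def normalize_month(label: str) -> str:
    """Map 'Jan', 'Jan.', or 'January' to the canonical 'January'."""
    key = label.strip().rstrip(".").lower()
    for month in MONTHS:
        # Require at least three letters so 'Ma' or 'Ju' stay ambiguous errors.
        if len(key) >= 3 and month.startswith(key):
            return month.capitalize()
    raise ValueError(f"unrecognized month label: {label!r}")


def parse_amount(cell: str) -> float:
    """Parse '1,234.5' as 1234.5 and the parenthetical '(1,234.5)' as -1234.5."""
    text = cell.strip()
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    value = float(text.replace(",", ""))
    return -value if negative else value


def split_fixed_width(row: str) -> list[str]:
    """Split a plain-text row on runs of 2+ spaces (assumed column separator)."""
    return [cell for cell in re.split(r"\s{2,}", row.strip()) if cell]
```

Splitting on two or more spaces keeps multi-word labels such as "Public debt securities" in one cell, which is why it was chosen for the sketch; real bulletin tables may need column offsets instead.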

A checklist of known failure modes (answer-file hygiene, arithmetic pitfalls, fiscal-year semantics, retrieval heuristics, output formatting, timeout prevention) was injected into the agent’s context alongside the skills.

An EvoSkill-style iteration loop (propose, distill, analyze) mined agent traces into reusable prompt and skill updates.

A few high-value constraints matched to OfficeQA’s actual failure modes outperformed adding more rules:

exact metric-name matching (e.g., “outstanding” requires the outstanding column, not “sales and redemptions”)
correct fiscal-year handling (FY1976 and earlier = July–June; transition quarter = July–September 1976; FY1977 onward = October–September)
anchoring to the right table family before extraction (match caption + metric phrase, not just date labels)
preferring explicit totals over reconstructed sums
using bracket-safe sequence output for list-valued answers
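The fiscal-year constraint can be sketched as a pair of date helpers. This is an illustration under stated assumptions, not the prompt's actual wording: the function names are hypothetical, and the sketch only covers years that follow the July–June or October–September conventions.

```python
from datetime import date

# Transition quarter between the two fiscal-year conventions.
TRANSITION_QUARTER = (date(1976, 7, 1), date(1976, 9, 30))


def fiscal_year_bounds(fy: int) -> tuple[date, date]:
    """Return (first day, last day) of U.S. federal fiscal year `fy`.

    Hypothetical helper: FY1976 and earlier run July-June,
    FY1977 onward runs October-September.
    """
    if fy <= 1976:
        return date(fy - 1, 7, 1), date(fy, 6, 30)
    return date(fy - 1, 10, 1), date(fy, 9, 30)


def fiscal_period_of(day: date) -> str:
    """Label the fiscal period containing `day`, e.g. 'FY1976', 'TQ', 'FY1977'."""
    start, end = TRANSITION_QUARTER
    if start <= day <= end:
        return "TQ"
    if day < start:
        # July-June convention: July onward belongs to the next fiscal year.
        return f"FY{day.year + 1 if day.month >= 7 else day.year}"
    # October-September convention: October onward belongs to the next fiscal year.
    return f"FY{day.year + 1 if day.month >= 10 else day.year}"
```

Treating the transition quarter as its own label, rather than folding it into FY1976 or FY1977, keeps a question about either year from silently picking up three extra months.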

The strongest submission was a simple goose-based bundle. Simpler prompt bundles consistently generalized better than larger ones.

How the work proceeded

The workflow was trace-driven rather than score-driven.

Initial debugging used 5-task local runs and classified failures into categories: no-answer stalls (F1), script crashes (F2), wrong data extraction (F3), wrong computation (F4), wrong file retrieval (F5), timeouts (F6), bash syntax errors (F7), permission denied (F8), and output format mismatches (F9).
Prompts, skill contracts, and the failure-mode checklist were edited in small steps, then targeted tasks were rerun to verify whether a specific failure mode actually moved.
Local models provided debugging signal; leaderboard submissions were the real evaluation (OfficeQA scored server-side with MiniMax M2.5).
EvoSkill’s propose mode generated candidate prompt/skill edits from traces; distill summarized passes and failures into reusable patterns (fiscal_year, sequence_output, debt_table, quoted_metric, etc.).
The first EvoSkill-guided submission (f0b74a08, opencode, 143.655) scored below baseline, but the workflow separated repeatable patterns from isolated failures.
After goose outperformed opencode, the focus shifted to clean staged goose submissions. EvoSkill shifted to distillation and diagnosis.
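The classify-edit-rerun loop described above can be sketched as a small tally over labeled runs. The F1–F9 labels come from the failure categories listed earlier; the function names, the dictionary layout, and the task IDs in the usage note are hypothetical and do not reflect the actual tooling.

```python
from collections import Counter
from enum import Enum
from typing import Optional


class Failure(Enum):
    F1 = "no-answer stall"
    F2 = "script crash"
    F3 = "wrong data extraction"
    F4 = "wrong computation"
    F5 = "wrong file retrieval"
    F6 = "timeout"
    F7 = "bash syntax error"
    F8 = "permission denied"
    F9 = "output format mismatch"


def tally(labels: dict[str, Optional[Failure]]) -> Counter:
    """Count failures per category; tasks labeled None passed."""
    return Counter(f for f in labels.values() if f is not None)


def moved(before: dict[str, Optional[Failure]],
          after: dict[str, Optional[Failure]],
          target: Failure) -> list[str]:
    """Task IDs that showed `target` before an edit and no longer show it after."""
    return sorted(
        task for task, failure in before.items()
        if failure is target and after.get(task) is not target
    )
```

For example, with made-up labels `before = {"task_a": Failure.F3, "task_b": Failure.F6}` and `after = {"task_a": None, "task_b": Failure.F6}`, `moved(before, after, Failure.F3)` returns `["task_a"]`, which is the "did this specific failure mode actually move" check in code form.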

Choosing the right harness and maintaining a solid baseline mattered more than expanding prompts.

Leaderboard progression

Score by submission

Leaderboard score by submission. Blue bars indicate the goose harness; red bars indicate opencode. The dashed gold line marks the best score (183.5).

#	Date	Short ID	Harness	Score	Main change
1	Apr 2	`e82a1479`	opencode	157.567	Early opencode baseline
2	Apr 3	`f0b74a08`	opencode	143.655	EvoSkill-guided opencode pass
3	Apr 3	`c98b59cc`	goose	182.993	First goose submission
4	Apr 4	`23ba4d7a`	opencode	137.799	Page-boundary / chart patch
5	Apr 4	`9e92c9db`	goose	176.498	Safe fixes: FY, units, totals
6	Apr 4	`50d0bbd4`	goose	168.706	HP-filter + debt-table fixes
7	Apr 5	`4f1209e8`	goose	167.863	Reported-values retrieval rule
8	Apr 6	`21458905`	goose	183.481	Best simpler goose bundle
9	Apr 6	`8782928a`	goose	178.462	+metric match +CY guard
10	Apr 6	`9decca84`	goose	178.857	+metric match +softened CY
11	Apr 7	`9aba8eca`	goose	168.202	Metric match + bloated checklist
12	Apr 7	`dfcbc10d`	goose	166.102	Exact resubmit of 183.481 bundle

All leaderboard submissions in chronological order.

Headline numbers

Best score: 183.481 (submission 8 of 12)
Gain over first verified leaderboard baseline: +25.914 (157.567 → 183.481)
Goose mean score across 9 submissions: ~174.6 vs. opencode mean across 3 submissions: ~146.3
First goose submission (182.993) was already within 0.488 of the final best score
Best local 20-task result was 16/20 (80%), but that config scored only 178.462 on the leaderboard

Harness comparison

Score distribution by harness. Individual submission scores shown as scatter points.

Harness	Submissions	Mean	Min	Max
opencode	3	146.3	137.8	157.6
goose	9	174.6	166.1	183.5
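The per-harness summary can be recomputed from the per-submission scores listed earlier. A minimal sketch follows; `SUBMISSIONS` copies the short IDs, harnesses, and scores from the submission table, and `summarize` is an illustrative name.

```python
from statistics import mean

# (short ID, harness, leaderboard score), copied from the submission table.
SUBMISSIONS = [
    ("e82a1479", "opencode", 157.567),
    ("f0b74a08", "opencode", 143.655),
    ("c98b59cc", "goose", 182.993),
    ("23ba4d7a", "opencode", 137.799),
    ("9e92c9db", "goose", 176.498),
    ("50d0bbd4", "goose", 168.706),
    ("4f1209e8", "goose", 167.863),
    ("21458905", "goose", 183.481),
    ("8782928a", "goose", 178.462),
    ("9decca84", "goose", 178.857),
    ("9aba8eca", "goose", 168.202),
    ("dfcbc10d", "goose", 166.102),
]


def summarize(rows):
    """Group scores by harness and return count, mean, min, and max for each."""
    by_harness = {}
    for _, harness, score in rows:
        by_harness.setdefault(harness, []).append(score)
    return {
        harness: {
            "n": len(scores),
            "mean": round(mean(scores), 3),
            "min": min(scores),
            "max": max(scores),
        }
        for harness, scores in by_harness.items()
    }


if __name__ == "__main__":
    for harness, stats in summarize(SUBMISSIONS).items():
        print(harness, stats)
```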

Score trajectory with key events

Score trajectory across all submissions, annotated with key events. The first goose submission (+39 pts) and the page-boundary regression (−45 pts) are the largest single-submission swings.

Non-deterministic variance

Submissions with near-identical configs showed ~7-point variance:

Config family	Scores observed
Goose + safe fiscal/unit/total fixes	176.498, 168.706, 167.863
Goose + simpler bundle	182.993, 183.481
Goose + metric-match additions	178.462, 178.857

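This variance means a 5-point score change between submissions may be noise, not signal. The exact resubmit of the best bundle is a starker case: submission 12 scored 166.102, 17.379 points below submission 8's 183.481.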

Local accuracy vs. leaderboard score

Local accuracy (bars, left axis) vs. leaderboard score (diamonds, right axis). Higher local accuracy did not predict higher leaderboard score.

Local run	Local acc.	Leaderboard	Gap
Qwen baseline (5 tasks)	20%	—	—
DeepSeek + rules (5 tasks)	60%	157.567	—
EvoSkill iter (5 tasks)	60%	143.655	Local held, LB fell
Targeted fixes (20 tasks)	80%	178.462	Local up, LB fell vs 183.5

The 20-task local sample did not represent the full ~246-task hidden pool.
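How little a 20-task sample pins down can be illustrated with a confidence interval on the pass rate. This is a rough sketch under a strong assumption that the source does not confirm: that the local tasks behave like a random sample from the hidden pool. The function name is hypothetical, and the Wilson score interval is used here only because it behaves better than the plain normal approximation at small sample sizes.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% coverage."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


if __name__ == "__main__":
    low, high = wilson_interval(16, 20)
    print(f"16/20 local passes: {low:.1%} to {high:.1%}")
```

For 16 passes out of 20 the interval spans roughly 58% to 92%, so an 80% local result is compatible with a wide range of true pass rates even before any difference between local and leaderboard execution is considered.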

Findings

What helped vs. what hurt

Change	Effect	Evidence
Switching to `goose`	Strong +	Jumped to 182.993; goose mean ~175 vs opencode ~146
Keeping prompts shorter	Strong +	Best score from simpler bundle; more rules → 178.x and 168.x
Exact metric-name matching	Positive	Fixed uid0012 and uid0227
Fiscal-year / totals / units	Positive	Recovered uid0041, uid0127, uid0220
Table-family anchoring	Positive	Turned uid0057 and uid0111 into local passes
EvoSkill distillation	Mixed	First submission regressed; distill loop was high-value for debugging
Over-expanding to many files	Negative	Retrospective tables often better than month-by-month
Calendar-year guard	Negative	Both variants regressed below 183.481
Page-boundary rules	Negative	Fell to 137.799, the worst score and a 45-point regression from the previous submission
Large failure-mode checklist	Negative	Inadvertently packaged 104-line checklist → 168.202
Visual chart questions	Unresolved	uid0030 remained unsolved from text-only corpus

Summary of changes and their observed effects.

OfficeQA patterns

Exact metric identity beats loose topic matching. “Outstanding” vs “sales,” “gross interest” vs “net interest,” and quoted metric names were common failure points.
Treasury time semantics matter. Pre-1977 fiscal years, the 1976 transition quarter, and publication-lag effects are all relevant, but over-applying calendar-year logic hurts.
Fewer files can be better. One retrospective summary table often beat expanding to 12 monthly bulletins.
List outputs are format-sensitive. Sequence-valued answers required bracketed syntax ([v1, v2, v3]) even when the question said “comma-separated.”
Complex prompt bundles were penalized. Targeted, high-confidence fixes outperformed generic rule accumulation.
EvoSkill helped organize, not auto-optimize. The distillation loop that turned traces into skill tags was more useful than the proposal generator.
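The list-output pattern can be sketched as a small formatter. The function name and the `decimals` parameter are hypothetical; the only thing taken from the source is the bracketed `[v1, v2, v3]` shape, and the number formatting the grader expects is not specified there.

```python
from typing import Optional


def format_sequence(values, decimals: Optional[int] = None) -> str:
    """Render values as '[v1, v2, v3]'.

    Strings pass through unchanged so that figures copied from a table keep
    their printed precision; floats are fixed to `decimals` places if given.
    """
    parts = []
    for value in values:
        if isinstance(value, float) and decimals is not None:
            parts.append(f"{value:.{decimals}f}")
        else:
            parts.append(str(value))
    return "[" + ", ".join(parts) + "]"
```

Passing table values through as strings avoids a float round trip that could turn a printed `3.10` into `3.1`.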

Visible task performance across submissions

Visible task results across four submissions. Green = PASS, red = FAIL, gray = not visible in that submission’s trace.

Always passed (across all goose submissions where visible): uid0003, uid0047, uid0065, uid0128, uid0146, uid0164, uid0209, uid0210, uid0111.

Always failed: uid0030 (visual chart), uid0102 (timeout/computation), uid0245 (unknown).

Flipped by specific fixes:

uid0012: FAIL → PASS after metric-name matching rule (submission 9)
uid0227: FAIL → PASS after metric-name matching rule (submissions 9–11)
uid0021: FAIL → PASS intermittently (passed in submissions 10, 11)
uid0199: usually PASS, regressed to FAIL when checklist file was bloated (submission 11)

What EvoSkill specifically contributed

Component	What it did for us	Net value
`propose` loop	Suggested candidate edits after failures	Useful for breadth; needed manual filtering
`distill` loop	Mined passes and failures into abstract skill tags	High value; made debugging faster
Trace-to-skill workflow	Classified misses as retrieval, parsing, arithmetic, or formatting	High value; reduced random tinkering
Direct leaderboard lift	Limited	First EvoSkill submission was worse than baseline

EvoSkill component contributions.

Feedback for Arena

Issue	Impact	Suggestion
Trace access was partial	Missing per-task trajectories blocked deeper debugging	Make trajectories consistently downloadable
Local vs leaderboard opacity	Changes looked strong locally but regressed on hidden pool	Clarify differences between local and leaderboard execution
Harness logging inconsistency	Some runs logged as goose when config said opencode	Make resolved harness explicit in artifacts
Infra failures vs prompt failures	API key issues, rate limits, and stalled containers created noisy signals	Surface config/env failures more clearly
Memory override warnings	Unclear whether warning affected leaderboard eligibility	Clarify origin and scope of the warning

Feedback for Arena organizers.

Bottom line

Key takeaways: (1) use the goose harness; (2) match metrics precisely to the question’s wording; (3) get fiscal-year and table context right; (4) enforce output formatting strictly. Rules added beyond these tended to lower the leaderboard score.

Given another submission round, the 183.481 bundle would stay unchanged. Remaining budget would go toward resubmitting the same config to counter the ~7-point variance. A simple baseline maintained carefully beats continued elaboration.
