Don't build the router yet: a cheap test for monitor ensembles

AI control teams often propose routing between monitors, with one for subtle attacks and another for obvious ones. Before such a router is built, a three-step diagnostic on existing logs can show whether it can help. On AgentDojo, it cannot.

March 21, 2026 · 4 min · Robert Amanfu

Mechanistic interpretability of chain-of-thought prompting

An exploration of mechanistic interpretability from a newbie's perspective

February 2, 2025 · 2 min