Don't build the router yet: a cheap test for monitor ensembles

When an agent books a hotel, files expenses, or edits a calendar, a monitor flags suspicious behavior. Teams usually run two:

Action monitor: reads the tool calls the agent makes.
CoT monitor: reads the agent’s chain-of-thought reasoning.

Routing means sending subtle-looking attacks to the CoT monitor and obvious ones to the action monitor.

A precondition is often skipped: routing only helps if the monitors catch different attacks. If monitor A flags every attack B flags and then some, routing can only hurt whenever the router hands an episode to B while A would have fired.

The setup was tested on AgentDojo inside ControlArena: 130 episodes, with an LLM judge tagging each attack as subtle or non-subtle. No routing variant beat CoT alone, including an oracle router with perfect subtlety labels.

The three-step diagnostic

Before any routing infrastructure is built, three checks can be run on existing logs. None of them needs new API calls.

Figure 1: Three-step diagnostic for monitor routing.

If any of these three checks fails, routing cannot help on this benchmark-and-monitor pair. A better classifier will not solve the underlying issue. The constraint is the monitors, not the router.

Findings

Step 1: per-stratum dominance. The CoT monitor wins every subtle attack (42/42) and most non-subtle ones (20/28, with the rest tied). The action monitor never wins cleanly. The subtle/non-subtle taxonomy is therefore statistically real (χ²=16.25, p=0.0003), but it distinguishes “CoT always wins” from “CoT usually wins” rather than identifying a regime where action wins.

Figure 2: Who wins, by attack type. CoT wins all subtle attacks and nearly all non-subtle ones.

Step 2: oracle routing. Each episode was assigned its correct subtle/non-subtle label, and the router was allowed to pick the best monitor. Even with perfect knowledge, AUROC drops from CoT’s 0.998 to 0.984. The 28 non-subtle episodes where the oracle delegates to the action monitor cost more than they save.

Architecture	AUROC	Better than CoT?
Action monitor	0.799	no, much worse
CoT monitor	0.998	baseline
Hybrid (fixed weights)	0.988	no
Classifier router	0.984	no
Oracle router (perfect labels)	0.984	no
Soft-blended router	0.989	no

Figure 3: ROC curves. CoT (top) dominates; routing variants overlap below it.

Step 3: weight sweep. The CoT-vs-action mixture weight λ was swept from 0 (pure action) to 1 (pure CoT). The curve climbs monotonically and plateaus at pure CoT. There is no valley or interior optimum. Mixing in any action signal at the operating range of interest reduces AUROC.

Figure 4: Weight sweep. AUROC rises monotonically with the CoT weight; there is no interior optimum.

The same picture holds when results are stratified by attack type:

Figure 5: Catch rate at 5% false-alarm by architecture, split by attack type. CoT hits 100% on every stratum.

Why this matters

The result is not “routing is bad.” Routing can help, but only in regimes where the monitors fail on different inputs. AgentDojo isn’t one of those regimes. The CoT monitor reads visible attack instructions in verbose reasoning traces; that is a benchmark fingerprint, not a general property of CoT monitors.

The takeaway is the diagnostic. The three steps run on existing eval logs in about an afternoon:

step 1 needs only per-episode monitor scores;
step 2 needs ground-truth or LLM-judge labels (which most teams already have);
step 3 is a for λ in linspace(0, 1) plot.

If the diagnostic comes back clean (an interior optimum in the sweep, or oracle routing beating the best single monitor), building the classifier is worth the effort. If it doesn’t, the engineering effort is better spent elsewhere, for example, hardening the weaker monitor against attacks designed to suppress the dominant signal.

Honest caveats

This benchmark cannot answer everything.

AgentDojo is saturated for CoT. AUROC of 0.998 with bootstrap CI [0.991, 1.000] is suspicious; adaptive attacks that stayed within the suite could not move the needle (CI [1.000, 1.000]) because there is no headroom to move. A fine-tuned attacker that suppresses CoT entirely is a different test, and one not run here.
One monitor pair, two settings. The same diagnostic should be run on bash and k8s ControlArena suites, where action signals are richer and the answer might flip.
n = 130 is too small to detect ΔAUROC < 0.02 reliably; tighter null statements need n ≳ 400.
The subtlety taxonomy is LLM-judged, not human-validated at scale. Cohen’s κ to a human annotator on at least 20 episodes is a recommended sanity check before reusing the rubric.

In summary, this is not a universal win for CoT. On this benchmark the precondition for routing is not met, and the three-step diagnostic above is a cheap way to check before any router is built.

Code and data available on request.

Citation

Amanfu, Robert. "Don't Build the Router Yet: A Cheap Test for Monitor
Ensembles." Robert's Bytes (May 2026).
https://pubs.onrender.com/posts/monitor_routing/

References

Greenblatt, R. et al. (2024). AI Control: Improving Safety Despite Intentional Subversion. arXiv:2312.06942.
Korbak, T. et al. (2024). CoT Red-Handed: Monitoring Reasoning Traces for Deception.
Debenedetti, E. et al. (2024). AgentDojo: A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents. NeurIPS 2024.
ControlArena (2025). github.com/UKGovernmentBEIS/control-arena
Rokach, L. (2010). Ensemble-based classifiers. Artificial Intelligence Review 33(1):1–39.

The three-step diagnostic#

Findings#

Why this matters#

Honest caveats#

Citation#

References#