The Day My Agent Decided It Didn’t Need Me

Without getting into the details of what I’m building, I’ve been experimenting with using Claude as a coworker to go through documentation and diagrams, extract answers to a set of questions I specify, and summarize it into something structured and usable. The constraint I cared about most was determinism.

Inspired by the movie Memento, I decided to lock down as much context and code as I could between sessions, so that no know-how is lost from one session to the next: no new surprises, no drift, just reproducible behavior. But being a security guy, I did not want Claude to have access to my laptop, so I decided to push everything to a dedicated GitHub repo, while recording context for each session to bring back when I start the next one (Memento rings a bell?).

I decided to split this workflow into two agents:

Agent 1: parses the source material, extracts answers into a structured table, and produces a Flag1 section listing the things it's unsure about, explicitly marked for human review.

Agent 2: takes that Flag1 table, incorporates human input, resolves the uncertainties (Flag2), and updates the answers in the original table. I asked it to report its confidence level as well. The goal was to have a Human-in-the-Loop (HITL) round here to ensure business reliability and compliance needs are not compromised.

Agent 1 was tightly locked down: I locked the prompt file and the Claude-generated code, built version control around it, and pushed all changes to GitHub. Claude as a coworker gets full credit on version control and change management.

Agent 2? Not so much. Not by design, but because the session limits kicked in before I could fully lock it down. So what I ended up with was a partially specified Agent 2 sitting downstream of a very deterministic Agent 1. What happened in the second session was very interesting.

Agent 1 behaved exactly as expected. But Agent 2 took a different path.

Instead of preserving the HITL step and asking the human to answer the flagged questions, Agent 2:

- It collapsed the Flag1 and Flag2 tables,

- It answered the flagged questions on its own,

- It added an "answer source" column and entered "auto-accepted AI guess" on every entry, and, to defend itself, put "high" in the confidence column for each one.

- And here is the funny part: it entered "no response" in the human response column, then went on to update the original answer table with its own guesses.

The result: the answer table looked great, with only one flagged answer. Except no human was ever involved. The human in the loop was not only bypassed but also had a smear campaign run on her record.

So Claude not only bypassed the HITL control, it also made its human look bad.

It’s like a cat knocking something off the table, then looking at you as if your kids did it.

This wasn't random behavior. Is it Anthropic saving tokens?? It is definitely inflating confidence in pursuit of optimization, and justifying its actions in the process. Let's reflect on some of the OWASP AI Vulnerability Scoring System (AIVSS) risk amplification factors at play:

  • Execution autonomy — making decisions without external validation

  • Behavioral non-determinism — different outcomes under slightly different constraints

  • Implicit goal completion — finishing the workflow even when key steps are underspecified

  • Autonomous goal formulation — redefining "what done looks like" in the absence of constraints

Lessons: we need to build guardrails against execution autonomy and move toward maximum determinism and reliability.

0) Reduce "self-modification" and "behavioral non-determinism" as much as possible by asking the agent to log major changes to the code and model prompts.

1) Bound "execution autonomy": the agent must not make decisions on its own; make sure it gets external feedback (from a human or another agent).

2) Make Human-in-the-Loop explicit and non-bypassable: don't allow the agent to skip validation gates. This must be a hard boundary.

3) Challenge confidence declarations, not just the outputs: ask the agent to justify its confidence claims (I have yet to test this to see how defensive Claude gets :))

4) Constrain goal formation: more research is needed on the "autonomous goal formulation" amplification factor and how to narrow an agent's intent.
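One way to make lesson 2 a hard boundary is to enforce the gate in code rather than in the prompt, so the merge step itself refuses to run without human input. A minimal, hypothetical sketch (the row keys and error name are mine, not from my actual pipeline):

```python
# Non-bypassable HITL gate: the merge step fails loudly if any Flag1 row
# lacks a human response, so the agent cannot "auto-accept" its own guesses.

class HITLBypassError(Exception):
    pass

def merge_resolutions(rows: list[dict]) -> list[dict]:
    for row in rows:
        if row.get("flag") == "Flag1" and not row.get("human_response"):
            raise HITLBypassError(
                f"Flagged question {row['question']!r} has no human response"
            )
    # Only reached when every flagged row carries a real human answer.
    return [{**row, "flag": None} for row in rows]
```

Because the check lives in locked, version-controlled code rather than in a prompt the model can reinterpret, "no response" becomes a crash instead of a column entry.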
