Catching refusals — guard rails for autonomous agents
May 2026 · On the detail that exit 0 doesn't mean "all good"
My bug-bounty agent doesn't run in a fancy orchestrator. It runs in a 250-line Bash script that walks a queue, spawns a fresh Claude CLI process per task, and waits for the result. Half of the script is loop logic. The other half is guard rails against the three most common failure modes:
- The agent exits cleanly but didn't produce the artifact.
- The agent refused the task, and that looks like success.
- The usage limit hits, and the run dies mid-flight.
The loop
The skeleton is trivial:
while true; do
next_task=$(pick_first_unfinished_task "$QUEUE_FILE")
[ -z "$next_task" ] && break
prompt_file="$PROMPTS_DIR/$next_task.prompt"
spawn_log="$SPAWN_LOGS_DIR/$next_task.spawn.log"
timeout 1500 claude \
--print \
--permission-mode bypassPermissions \
--model sonnet \
--add-dir "$WORKSPACE" \
< "$prompt_file" \
>> "$spawn_log" 2>&1
cmd_exit=$?
process_result "$next_task" "$cmd_exit" "$spawn_log"
done
"Pick first unfinished" means: read the queue, find the first KH03: … line that doesn't yet have a status/KH03.status file. That makes the loop idempotent — kill the script, restart, it picks up where it left off.
The agent is supposed to write a status/$task_id.status JSON at the end with status, decision, and optionally a lead_file path. That's the output interface.
Failure mode 1: exit 0 without status
The agent terminates with exit code 0, but the file is missing. I treated this as a prompt bug at first — phrase the instruction more strictly, more examples, schemas. Didn't help. Models have bad days, or tool calls time out, or the markdown output looks like JSON but isn't parseable.
The robust fix is sentinel recovery: after a successful exit, check whether the status file exists. If not, scan the leads directory for new files belonging to this task:
if [ ! -f "$STATUS_DIR/$next_id.status" ]; then
leads_after=$(find "$LEADS_DIR" -maxdepth 1 -name "${next_id}-*" | sort)
new_lead=$(comm -13 <(echo "$leads_before") <(echo "$leads_after") | head -1)
if [ -n "$new_lead" ]; then
cat > "$STATUS_DIR/$next_id.status" <
Practically: if the agent wrote a lead file, the run was successful even if the closing status file is missing. Roughly one run in five gets saved by this recovery.
Failure mode 2: refusal looks like success
This is the awkward one. The agent received the prompt, found it problematic, and wrote a justification instead of doing the work — and exits cleanly. From the runner's view: successful run, no output. From the audit's view: task got skipped silently.
The problem is real for bug-bounty audits, because prompts often contain words like "exploit", "bypass", "privilege escalation", "PoC" that are refusal triggers in other contexts. That's understandable; the model has no context that this is an authorized program. But the failure mode must not look like success.
The guard rail: after sentinel recovery (no status, no lead), I tail the spawn log and grep for refusal phrases:
refusal=$(tail -200 "$spawn_log" | grep -iE \
"I can'?t (help|assist|provide|do that|continue|create|generate|write)|\
I cannot (help|assist|provide|do that|continue|create|generate|write)|\
I won'?t (help|be able|do that|create|generate|write)|\
I'?m not (able|going) to (help|assist|create|generate|provide)|\
Anthropic.{0,40}polic|usage polic|acceptable use polic|\
against (my )?guidelines|violates.{0,40}polic|\
harmful (request|content|action)" \
| head -3 | tr '\n' ' ' | head -c 400)
if [ -n "$refusal" ]; then
cat > "$STATUS_DIR/$next_id.status" <
Status blocked is a separate state alongside done and failed. It means: the prompt needs reformulating, this is an engineering task for me, not a model failure. My phone gets a Telegram message with the refusal snippet, and in the morning I decide whether to sharpen the prompt (more program-scope context, explicit "this is authorized BB research") or pull the task from the queue.
The regex is simply a list of phrases I've seen in spawn logs. It will change as new model generations develop new refusal idioms — I maintain it like a test suite.
Failure mode 3: usage limits
My Pro OAuth account has a reset window every few hours. If the account caps mid-run, the CLI returns something like "You hit your limit. Reset at 2am UTC." and exits. From the classic loop's view: task failed, move on — and the next one fails too, and so on, until the queue empties.
Detection is again a spawn-log grep, but inside an inner retry loop:
USAGE_RETRY_MAX=15
USAGE_RETRY_SLEEP=900 # 15 minutes
usage_retry=0
while true; do
timeout 1500 claude --print --model sonnet --add-dir "$WORKSPACE" \
< "$prompt_file" >> "$spawn_log" 2>&1
cmd_exit=$?
if tail -30 "$spawn_log" | \
grep -qi "hit your limit\|usage limit\|reset.*am.*UTC\|reset.*pm.*UTC"; then
usage_retry=$((usage_retry+1))
if [ $usage_retry -le $USAGE_RETRY_MAX ]; then
ping "[$next_id] usage-limit · sleeping 15min (retry $usage_retry/$USAGE_RETRY_MAX)"
: > "$spawn_log"
sleep $USAGE_RETRY_SLEEP
continue
fi
break
fi
[ $cmd_exit -eq 0 ] && break
# … general error retry: 3x with 60s sleep
done
15 retries × 15 minutes gives 3.75 hours of wait tolerance per task. That outlives every limit-reset cycle I've seen. In practice: a run that would have taken 12 hours takes 13 with a limit hit.
Failure mode 4 (bonus): PATH
The runner gets started via nohup bash queue-runner-audit.sh &, often over SSH from my laptop. nohup'd shells are non-interactive; they don't source .bashrc. So the NPM-global bin path is missing, and claude isn't found. The run then dies instantly and silently — which is why this now sits explicitly at the top of the script:
export PATH="$HOME/.local/share/npm-global/bin:/usr/local/bin:/usr/bin:/bin:$HOME/.local/bin:$PATH"
The most embarrassing bug of the whole pipeline and the one with the highest "wondered for three weeks" to "one-line fix" ratio.
What this means for production agents
Exit code 0 only means the process terminated cleanly. Whether the task was actually completed is written in the file system. I treat the actual output (file, JSON, lead) as the success criterion and verify it with sentinel checks — more robust than any stdout parsing.
When the agent refuses, I want to see it on Telegram in the morning. Almost always there's a prompt problem behind it (not enough context, clumsy phrasing), occasionally a genuine scope problem. The dedicated blocked state separates "model refused" from "model crashed" — only that separation lets me measure whether my prompt fixes actually work.
A usage limit is a pause with a known end. The runner sleeps and tries again. Since the retry loop went in, limits simply don't show up as failures in my run stats anymore.
Bash is good enough. I considered moving to a workflow orchestrator (Temporal, Airflow, dagster, even prefect). The argument against was simple: 250 lines of Bash are 250 lines I have completely in my head and can change in 30 seconds. The guard rails above are all written within five minutes of seeing a failure mode once in a run. With a framework that would be a PR.
The full runner is 249 lines, of which ~80 are the three guard rails described here. The audits themselves have since moved to Claude Code — the guard-rail ideas (sentinel, blocked bucket, limit retry) survived the move; only the spawn-log grepping became obsolete.