Show HN: We built a public CTF to stress-test AI agent guardrails

(vault.aport.io)

1 points | by uchibeke 6 hours ago ago

3 comments

$uchibeke 6 hours ago

Since October I've been building APort — an authorization layer that intercepts every AI agent tool call before execution and evaluates it against a versioned policy. The problem I kept running into: internal tests always passed. My test suite maps the space I imagined, which is exactly what an adversarial input tries to escape.
So I built this CTF to find the gaps I couldn't find myself.
A few things we learned before opening it publicly — we spent two weeks breaking it ourselves first:
• Prompt injection worked better than expected. Not because detection was weak, but because we were matching content not intent. Reframing "retrieve the restricted file" as "open the user-requested file" shifted the evaluator's judgment. We fixed this by mapping semantic equivalence — every synonym of a blocked operation routes to the same evaluation path.
• Policy ambiguity was a free pass. Any undefined term in a policy is exploitable. "Don't read sensitive files" left "sensitive" undefined. We moved to explicit default-deny: if the policy doesn't explicitly allow it, it's denied.
• Multi-step chaining went undetected. Our guardrail evaluated each call independently. A denied macro-action split into ten individually-approved micro-actions passed clean. We only caught it by looking at the full session replay. This is the same composability problem as transaction laundering in fintech — each transaction passes compliance, the composed behavior doesn't.
We fixed what we found before launch. Level 5 (full system bypass) hasn't been cracked yet. I'm genuinely uncertain if the architecture has a systemic weakness — that's the point of opening it up.
Runs on a Hetzner VPS, ~$10/month. Levels 1 and 2 are free, no sign-up. Levels 3-5 pay out $500/$1,000/$5,000.
Happy to go deep on the policy engine design, the evaluation architecture, or anything about how the levels were constructed.
$ollybrinkman 5 hours ago

Interesting approach to security testing. One angle we've been exploring: what if the authentication layer itself was the guardrail?
With x402, every API call requires a signed payment. No API keys to steal, no credentials to leak. The economic cost of each call is itself a rate limiter and audit trail.
Not a replacement for proper guardrails, but it eliminates the credential-based attack surface entirely.
[-]
- $uchibeke 5 hours ago
  
  That's a genuinely useful distinction to draw. x402 solves the "who is authorized to make this call" problem: removes credential theft as an attack vector, adds economic friction. APort is trying to solve a different layer: "what is this call actually doing in the context of everything else in the session."
  The multi-step chaining issue from my post still fires even when every call is authenticated and paid for. Ten individually-approved calls, each costing a fraction of a cent, composing into a full exfiltration: each one passes x402, the composed behavior doesn't.
  The AML analogy maps directly: transaction monitoring doesn't care if each payment was legitimate. It cares whether the pattern of payments looks like structuring. x402 is the per-call check. You still need session-level behavioral evaluation on top.
  Genuinely curious how x402 handles replay attacks across sessions ie is the payment the audit trail, or is there preserved session context?