Great writeup.
I hit the same problems on a 590-file Go monorepo and ended up building two "skill files" that address several of your friction points directly.
Your BODY.PEEK bug is the canonical case of "the agent doesn't know what it doesn't know." My approach: CLAUDE:WARN annotations directly above functions in source code.
The agent sees the trap at the moment it touches the function, not after the bug ships.
If email.py had a CLAUDE:WARN saying "messages marked Seen before processing — use BODY.PEEK or data loss on failure", the agent would have caught it while writing the code, not after production.
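To make that concrete, here's a minimal sketch of the annotation style. `fetch_message` is a hypothetical wrapper (the real function name in email.py may differ); the point is that the warning sits directly above the code it protects:

```python
import imaplib

# CLAUDE:WARN: IMAP FETCH with BODY[] marks the message \Seen as a
# side effect. Use BODY.PEEK[] here, or a crash mid-processing loses
# the message -- it looks already handled on the next poll.
def fetch_message(conn: imaplib.IMAP4_SSL, uid: str) -> bytes:
    # BODY.PEEK[] reads the message without setting the \Seen flag
    status, data = conn.uid("FETCH", uid, "(BODY.PEEK[])")
    if status != "OK":
        raise RuntimeError(f"FETCH failed for UID {uid}")
    return data[0][1]
```

Any agent (or human) that opens the function reads the trap before touching it; no round-trip through a root-level doc required.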
Same for architectural drift. Your slash command duplication across cli.py, email.py, telegram.py — locally reasonable, globally messy. A per-directory ASCII schema (skill 2) makes the intended architecture explicit.
An agent reading the dispatch schema sees that all channels must route through commands.py. Without it, it only sees the file it's editing.
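For illustration, a per-directory schema for your dispatch case might look like this (the `handlers/` target is my invention; the channel files are yours):

```text
# commands/CLAUDE.md -- dispatch schema (kept under 50 lines)

  cli.py ──────┐
  email.py ────┼──▶ commands.py ──▶ handlers/
  telegram.py ─┘         │
                         └─ single place for parsing, auth, rate limits

RULE: channels never parse slash commands themselves; they forward
raw input to commands.py. A parser inside a channel file is drift.
```

Ten lines, but it turns "locally reasonable" duplication into a visible rule violation.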
The sub-agent blank context problem: each local CLAUDE.md starts with a protocol pointer to root + 3 mandatory grep commands. Sub-agents inherit the research protocol structurally instead of by luck.
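A sketch of that header, with illustrative grep targets (the exact annotation names are from my setup, not a standard):

```text
# CLAUDE.md (local) -- read the root /CLAUDE.md protocol first.
# Before editing anything in this directory, run:
grep -rn "CLAUDE:WARN" .
grep -rn "CLAUDE:DEPENDS" .
grep -rn "commands.py" -l .
```

A sub-agent spawned into this directory with zero context still lands on the warnings and the dependency map in its first three tool calls.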
The deeper difference from what you describe: your CLAUDE.md is a flat file that grows after each bug. In my pattern, annotations live distributed in the code (CLAUDE:WARN, CLAUDE:SUMMARY, CLAUDE:DEPENDS), an audit skill maintains them periodically, and the per-directory CLAUDE.md stays a 50-line manifest — not a catch-all.
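The audit skill is mostly a collection pass over those tags. A minimal sketch (tag names as above; file layout and helper name are hypothetical):

```python
import re
from pathlib import Path

# Matches e.g. "# CLAUDE:WARN: messages marked Seen before processing"
ANNOTATION = re.compile(r"CLAUDE:(WARN|SUMMARY|DEPENDS):?\s*(.*)")

def collect_annotations(root: str) -> dict[str, list[tuple[int, str, str]]]:
    """Map each source file to its (line, tag, text) annotations,
    so a periodic audit can review them for staleness."""
    found: dict[str, list[tuple[int, str, str]]] = {}
    for path in Path(root).rglob("*.py"):
        for lineno, line in enumerate(path.read_text().splitlines(), 1):
            if m := ANNOTATION.search(line):
                found.setdefault(str(path), []).append(
                    (lineno, m.group(1), m.group(2).strip())
                )
    return found
```

The audit then diffs this inventory against git history: annotations on functions that changed recently are candidates for review, and the per-directory CLAUDE.md never has to absorb them.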
One more pattern I'd add to your workflow: a prod bug is a test failure first. No fix ships without a red test that reproduces it, goes green, and gets logged. (Claude does this autonomously very well.)
The agent is great at writing the fix, but left to itself it will patch the code and move on. Force the cycle: red test → fix → green test → commit.
Your BODY.PEEK bug becomes a regression test that protects you forever, not just a rule in CLAUDE.md that the agent might or might not read next session.
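Sketch of what that regression test looks like, with a mock standing in for the IMAP connection and a hypothetical `fetch_message` wrapper inlined so the test is self-contained:

```python
from unittest.mock import MagicMock

def fetch_message(conn, uid):
    # the fixed implementation: PEEK leaves the \Seen flag alone
    status, data = conn.uid("FETCH", uid, "(BODY.PEEK[])")
    return data[0][1]

def test_fetch_does_not_mark_seen():
    conn = MagicMock()
    conn.uid.return_value = ("OK", [(b"header", b"raw body")])
    fetch_message(conn, "7")
    # red with the old plain BODY[] fetch, green after the fix
    fetch_spec = conn.uid.call_args.args[2]
    assert "PEEK" in fetch_spec
```

Cheap to write, runs without a mail server, and fails loudly if anyone ever "simplifies" the fetch back to BODY[].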
An A/B run on the same prompt and codebase: 58K tokens with the full doc system vs. 73K without, 2 minutes vs. 8, and zero false positives vs. at least one.
https://github.com/hazyhaar/GenAI_patterns