Claude Code: The Architecture, Governance, and Engineering Practices You Didn’t Know About

This article answers the core question:
How does Claude Code actually work under the hood? Why does context get messy, tools stop working, and rules get ignored—even when you write longer prompts? After spending six months and $40 a month on two accounts, I figured out the real system design. Here’s exactly what I learned so you can skip the frustration and turn Claude Code into a reliable engineering partner.

I started using Claude Code like any other chatbot. Write a prompt, get code, move on. Within weeks the context became chaotic, tools multiplied but delivered worse results, and the rules I kept adding were simply ignored. After digging into how Claude Code is built, I realized the problem wasn’t my prompts—it was the architecture itself.

I now view Claude Code as six balanced layers. Strengthen only one layer and the whole system tips over. Write a giant CLAUDE.md and your context pollutes itself. Pile on too many tools and the model can’t choose correctly. Spin up subagents everywhere and state drifts. Skip verification and you’ll never know where it broke.

How Claude Code Actually Runs: The Real Agent Loop Most People Miss

Claude Code’s core is not “answering.” It is a repeating loop:
Collect context → Take action → Verify result → [Finish or loop back]
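The loop is easy to internalize once you sketch it. Here is a minimal toy version in Python (the function names are mine, not Claude Code internals):

```python
# Toy version of the Claude Code loop: collect context, take an action,
# verify, and either finish or feed the failure back into context.
# (Illustrative only; these names are not Claude Code internals.)

def run_agent_loop(task, collect, act, verify, max_iterations=10):
    context = []
    for _ in range(max_iterations):
        context.append(collect(task, context))  # collect context
        result = act(task, context)             # take action
        ok, feedback = verify(result)           # verify result
        if ok:
            return result                       # finish
        context.append(feedback)                # loop back with the failure
    raise RuntimeError("verification never passed")
```

The key detail is the last `append`: a failed verification feeds back into context, which is why bad verification poisons the next iteration's context too.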

Everything flows through five layers:

  • CLAUDE.md
  • Hooks / permissions / sandbox
  • Skills
  • Tools / MCP
  • Memory

I wasted weeks thinking stalls happened because the model wasn't smart enough. Almost every stuck moment was actually bad context, missing verification, or broken control.

Looking at these five layers makes troubleshooting instant. Unstable results? Check context loading order. Automation running wild? Look at the control layer. Long sessions losing quality? Intermediate junk polluted the window—start a fresh session instead of tweaking prompts again.

Here’s the exact loop diagram I reference every time I debug:

Claude Code Under-the-Hood Running Cycle Diagram

Once you see the loop, everything clicks. Most problems stop being “model issues” and become architecture issues you can fix.

MCP vs Plugin vs Tools vs Skills vs Hooks vs Subagents: Clear Boundaries That Prevent Chaos

These six terms are constantly mixed up. Here’s the simple rule I now live by:

  • Tools / MCP = new actions Claude can perform
  • Skills = repeatable workflows and decision frameworks
  • Subagents = isolated execution environments
  • Hooks = forced, deterministic checks and audits
  • Plugins = shareable bundles of the above, portable across projects

That’s it. Nothing else.

The boundary map below saved me dozens of hours of trial and error:

MCP, Plugin, Tools, Skills, Hooks, Subagents Boundary Map

Use the right concept for the job and the system stops fighting you.

Why Your Claude Code Context Always Gets Messy (And the Exact Cost Breakdown)

Most people treat the 200K context window as a simple capacity limit. The real killer is noise, not length. Useful signals drown in irrelevant output.

Here’s the actual token budget breakdown I calculated after weeks of /context monitoring:

200K total window
├── Fixed overhead (~15-20K)
│   ├── System prompt: ~2K
│   ├── All enabled Skill descriptors: ~1-5K
│   ├── MCP server tool definitions: ~10-20K ← biggest hidden killer
│   └── LSP state: ~2-5K
├── Semi-fixed (~5-10K)
│   ├── CLAUDE.md: ~2-5K
│   └── Memory: ~1-2K
└── Dynamic usable (~160-180K)
    ├── Conversation history
    ├── File contents
    └── Tool call results

One typical GitHub MCP server adds 20-30 tools × ~200 tokens each = 4,000-6,000 tokens. Connect five servers and you’ve burned 25,000 tokens (12.5%) before you even start reading code. That number shocked me the first time.
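The arithmetic is worth scripting once, just to internalize it (the per-tool token count below is the rough average quoted above, not a measured value):

```python
# Back-of-the-envelope MCP context overhead, using the rough average
# of ~200 tokens per tool definition cited in the article.

CONTEXT_WINDOW = 200_000
TOKENS_PER_TOOL_DEF = 200

def mcp_overhead(tools_per_server, servers):
    tokens = tools_per_server * TOKENS_PER_TOOL_DEF * servers
    return tokens, 100 * tokens / CONTEXT_WINDOW

tokens, pct = mcp_overhead(tools_per_server=25, servers=5)
print(f"{tokens} tokens, {pct:.1f}% of the window")  # 25000 tokens, 12.5% of the window
```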

Here’s the exact cost pie chart:

Claude Code Real Context Cost Breakdown

My proven context layering strategy

  • Always resident → CLAUDE.md (project contract, build commands, NEVER list)
  • Path-loaded → .claude/rules/ (language or directory rules)
  • On-demand → Skills (workflows and domain knowledge)
  • Isolated → Subagents (heavy exploration)
  • Never in context → Hooks (deterministic scripts and blocks)

Occasional stuff has no business living in every single turn.

Daily context hygiene practices I never skip

  • Keep CLAUDE.md short, hard, and executable (Anthropic’s own official one is only ~2.5K tokens—copy the style).
  • Move big reference docs to Skill supporting files, never into SKILL.md body.
  • Use .claude/rules/ for per-language or per-folder rules.
  • Run /context before long sessions to see what’s eating tokens.
  • Switch tasks? Use /clear. New phase of same task? Use /compact.
  • Write Compact Instructions inside CLAUDE.md so compression keeps what you actually need.

Here’s the layering diagram I follow:

Recommended Context Layering Diagram

The dangerous default compression trap

The built-in summarizer deletes “re-readable” content first—early tool outputs, old file contents, and even architecture decisions disappear. Two hours later you’re debugging a bug that only exists because the model forgot its own earlier decision.

I fixed it permanently with this block in every CLAUDE.md:

## Compact Instructions

When compressing, preserve in priority order:

1. Architecture decisions (NEVER summarize)
2. Modified files and their key changes
3. Current verification status (pass/fail)
4. Open TODOs and rollback notes
5. Tool outputs (can delete, keep pass/fail only)

Even better: before closing a long session, ask Claude to write a HANDOFF.md that contains progress, what worked, dead ends, and next steps. The next fresh session just reads that one file and continues perfectly.
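A skeleton for that handoff file (the section names are my own convention; Claude needs no special format):

```markdown
# HANDOFF

## Progress
- What was completed, with file paths

## Verified
- Which checks passed (tests, lint, manual)

## Dead Ends
- Approaches tried and why they failed

## Next Steps
- Ordered list of what the fresh session should do first
```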

Why Plan Mode is secretly the most powerful feature

Plan Mode separates exploration from execution. Exploration stays read-only. Claude clarifies goals, proposes a concrete plan, then you approve before any file changes happen.

For big refactors, migrations, or cross-module work, this single habit cut my error rate dramatically.

Here are the two Plan Mode diagrams I reference constantly:

Plan Mode Engineering Value
Plan Mode Workflow

Pro tip: open one Claude to draft the plan, then open a second instance as “senior engineer” to review it. AI reviewing AI works shockingly well.

Skills Design: Not Prompt Templates—Real On-Demand Workflows

Skills are loaded progressively: descriptor stays in context, full content loads only when needed. That’s why they feel different from saved prompts.

A good Skill must:

  • Tell the model exactly WHEN to use it (not just what it is)
  • Include complete steps, inputs, outputs, and stop conditions
  • Keep the main SKILL.md short—move big data to supporting files
  • Set disable-model-invocation: true for anything with side effects

Recommended folder structure:

.claude/skills/
└── incident-triage/
    ├── SKILL.md
    ├── runbook.md
    ├── examples.md
    └── scripts/
        └── collect-context.sh

Three Skill types I actually use in production

1. Checklist type (quality gate)

---
name: release-check
description: Use before cutting a release to verify build, version, and smoke test.
---

## Pre-flight (All must pass)
- [ ] cargo build --release passes
- [ ] cargo clippy -- -D warnings clean
- [ ] Version bumped in Cargo.toml
- [ ] CHANGELOG updated
- [ ] kaku doctor passes on clean env

2. Workflow type (standardized risky operations)

---
name: config-migration
description: Migrate config schema. Run only when explicitly requested.
disable-model-invocation: true
---

## Steps
1. Backup: cp ~/.config/kaku/config.toml ~/.config/kaku/config.toml.bak
2. Dry run: kaku config migrate --dry-run
3. Apply after confirmation
4. Verify: kaku doctor all pass

## Rollback
cp ~/.config/kaku/config.toml.bak ~/.config/kaku/config.toml

3. Domain-expert type (structured diagnosis)

---
name: runtime-diagnosis
description: Use when kaku crashes, hangs, or behaves unexpectedly at runtime.
---

## Evidence Collection
1. Run kaku doctor
2. Last 50 lines of logs
3. Plugin state: kaku --list-plugins

## Decision Matrix
| Symptom            | First Check                     |
|--------------------|---------------------------------|
| Crash on startup   | doctor output / Lua syntax      |
| Rendering glitch   | GPU backend / terminal caps     |
| Config not applied | Config path + schema version    |

Short descriptors matter—every Skill steals context tokens. I reduced one from 45 tokens to 9 and immediately felt the difference.
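For a sense of what that trim looks like, here is a hypothetical before/after (token counts are rough estimates, and this is not my actual descriptor):

```yaml
# Before (~40 tokens)
description: Checks everything before you cut a release, including the build,
  the version number, the changelog file, and a smoke test on a clean environment.

# After (~10 tokens)
description: "Pre-release gate: build, version, changelog, smoke test."
```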

Auto-invoke strategy I follow:

  • High frequency (>1× per session) → keep auto-invoke, optimize description
  • Low frequency (<1× per session) → set disable-model-invocation: true, trigger manually
  • Ultra-low (<1× per month) → move to AGENTS.md instead

Avoid these anti-patterns: vague one-line descriptions, giant SKILL.md walls of text, one Skill trying to do review + deploy + debug + docs + incidents, and side-effect Skills that can self-trigger.

Tool Design: Make Claude Choose Correctly Instead of Guessing

Tools for agents are not the same as APIs for humans. The goal is not completeness—it’s making the right choice obvious.

Good tool rules I now enforce:

  • Prefix by system or resource: github_pr_*, jira_issue_*
  • Support response_format: concise or detailed
  • Error messages must teach the model how to fix the issue
  • Merge low-level tools into high-level ones when possible
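Here is what those rules look like in one hypothetical tool definition, following the standard name / description / input_schema shape (the tool name github_pr_list and its fields are illustrative, not a real MCP server's schema):

```json
{
  "name": "github_pr_list",
  "description": "List open pull requests for a repo. Use github_pr_get for one PR's full detail.",
  "input_schema": {
    "type": "object",
    "properties": {
      "repo": { "type": "string", "description": "owner/name, e.g. acme/web" },
      "response_format": {
        "type": "string",
        "enum": ["concise", "detailed"],
        "description": "concise returns number and title only; detailed adds body and reviewers"
      }
    },
    "required": ["repo"]
  }
}
```

The prefix routes the model to the right system, and the description tells it when to reach for the sibling tool instead of guessing.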

I studied how the Claude Code team evolved their own tools. The biggest lesson came from user-interaction tools. They tried three versions:

  1. Add a “question” parameter to existing tools → Claude ignored it
  2. Require special Markdown in output → Claude forgot the format
  3. Create a dedicated AskUserQuestion tool → perfect, because calling it = pause

Here are the evolution diagrams that explain why the third version won:

Tool Design Good vs Bad
Internal Tool Evolution
AskUserQuestion Stability Comparison
Todo Tool Evolution Lesson

When NOT to add another tool:

  • Local shell can already do it reliably
  • Only static knowledge needed
  • Better suited as a Skill workflow
  • You haven’t tested that the schema is model-stable

Hooks: The Deterministic Safety Net You Can’t Trust Claude to Remember

Hooks are not “auto-scripts.” They are the way to pull things Claude should never improvise back into guaranteed execution.

Supported hook points and what belongs there:

Perfect for Hooks

  • Block edits to protected files
  • Auto-format + lint after every Edit
  • Inject dynamic context on SessionStart (branch name, env vars)
  • Send notification when task completes
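A sketch of the protected-file block as a PreToolUse hook. This assumes the hook receives the tool call as JSON on stdin and that exit code 2 rejects the call; verify the exact schema with /hooks on your version:

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          {
            "type": "command",
            "command": "jq -e '.tool_input.file_path | test(\"\\\\.env$|lock$\")' > /dev/null && exit 2 || exit 0"
          }
        ]
      }
    ]
  }
}
```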

Never put in Hooks

  • Complex semantic decisions
  • Long-running business logic
  • Multi-step reasoning

Real example from my Rust + Lua terminal project:

{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit",
        "pattern": "*.rs",
        "hooks": [{ "type": "command", "command": "cargo check 2>&1 | head -30" }]
      },
      {
        "matcher": "Edit",
        "pattern": "*.lua",
        "hooks": [{ "type": "command", "command": "luajit -b $FILE /dev/null 2>&1 | head -10" }]
      }
    ]
  }
}

Over a typical run of 100 edits, those hooks save 30-60 seconds per caught error and prevent entire classes of bugs. Limit output with | head -30 so the hooks don't pollute context themselves.

Here are the hook point and time-saving diagrams:

Hook Points Supported
Hooks Early Error Detection Savings

Subagents: Isolation Is the Real Superpower, Not Parallelism

A subagent is a fresh Claude instance with its own context window, restricted tools, and max turns. Its job is to take heavy, noisy tasks (full codebase scan, test runs, deep reviews) and return only a clean summary.

Must-set constraints every time:

  • tools / disallowedTools
  • model (Haiku for exploration, Opus for reviews)
  • maxTurns
  • isolation: worktree when files will be touched
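A minimal subagent definition along those lines (the agent name and prompt are illustrative; check which frontmatter fields your version supports):

```markdown
---
name: code-scout
description: Explore the codebase and report findings. Read-only.
tools: Read, Grep, Glob
model: haiku
---

Scan the repository for the requested topic. Return a summary of at most
20 lines: relevant files, key functions, and open questions. Never edit files.
```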

Long-running bash? Press Ctrl+B to background it. Subagents work the same way.

Anti-patterns I no longer fall for:

  • Giving subagents the same permissions as main thread
  • Loose output formats
  • Tight dependencies between sub-tasks

Prompt Caching: The Invisible Architecture That Controls Cost and Speed

“Cache Rules Everything Around Me” is literally how Claude Code is engineered. High hit rate = lower cost + higher rate limits.

The prompt order is deliberately built for prefix caching:

  1. System Prompt (locked)
  2. Tool Definitions (locked)
  3. Chat History (dynamic)
  4. Current user input (last)
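A toy model shows why this ordering matters: prefix caching only reuses the unchanged leading span, so one volatile token at the front invalidates everything behind it.

```python
# Toy model of prefix caching: only the shared leading span is reusable.
# (Illustrative; the real cache operates on server-side token prefixes.)

def cached_prefix_len(previous, current):
    n = 0
    for a, b in zip(previous, current):
        if a != b:
            break
        n += 1
    return n

previous = ["system", "tools", "history-1"]
stable = ["system", "tools", "history-1", "new-input"]
timestamped = ["system@10:02", "tools", "history-1", "new-input"]

print(cached_prefix_len(previous, stable))       # 3: the whole old prompt is reused
print(cached_prefix_len(previous, timestamped))  # 0: a timestamp up front kills the cache
```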

Here’s the exact layout diagram:

Prompt Caching Layout for Maximum Hits

Common cache destroyers:

  • Time-stamps in system prompt
  • Random tool order
  • Adding/removing tools mid-session

Compaction flow (the magic that keeps long sessions alive):

Compaction Execution Flow

defer_loading + ToolSearch keeps the prefix stable even with dozens of MCP tools.

Verification Closed Loop: Without This Layer You Don’t Have Engineering

“Claude says it’s done” means nothing. You need proof, rollback path, and audit trail.

Verification layers I require:

  • Lowest: exit codes, lint, typecheck, unit tests
  • Middle: integration tests, screenshot diffs, contract tests, smoke tests
  • Highest: prod logs, metrics, human checklist

Define acceptance criteria in Prompt, Skill, and CLAUDE.md every single time. If you can’t clearly state “how I will know it’s correct,” the task probably shouldn’t be handed to autonomous mode.

High-Frequency Commands That Actually Manage Context and Governance

These commands are how you stay in control instead of letting the system decide:

Context management

  • /context – see exact token breakdown
  • /clear – same problem corrected twice? Start fresh
  • /compact – compress while respecting your Compact Instructions
  • /memory – confirm what CLAUDE.md is really loaded

Governance

  • /mcp – manage servers and their token cost
  • /hooks – control forced entry points
  • /permissions – update allowlist
  • /sandbox – tighten isolation
  • /model – switch Opus/Sonnet/Haiku instantly

Session continuity

  • claude --continue – pick up yesterday’s work
  • claude --continue --fork – try two approaches from same point
  • claude --worktree – isolated git worktree
  • claude -p --output-format json – script-friendly output

Bonus commands I use daily: /simplify, /rewind, /insight, double-ESC to edit last message.

Here’s the full command governance map:

High-Frequency Commands and Governance Map

How to Write a Truly Effective CLAUDE.md (The Project Contract)

CLAUDE.md is not documentation. It is the living contract between you and every Claude instance.

What belongs:

  • Exact build/test/run commands
  • Architecture boundaries
  • Coding conventions
  • Environment gotchas
  • NEVER list
  • Compact Instructions

What never belongs:

  • Long background stories
  • Full API docs
  • Vague principles like “write good code”
  • Anything the model can infer from the repo

My battle-tested template (copy-paste ready):

# Project Contract

## Build And Test
- Install: pnpm install
- Dev: pnpm dev
- Test: pnpm test
- Typecheck: pnpm typecheck
- Lint: pnpm lint

## Architecture Boundaries
- HTTP handlers live in src/http/handlers/
- Domain logic lives in src/domain/
- Never put persistence in handlers

## NEVER
- Modify .env, lockfiles, or CI secrets without approval
- Remove feature flags without searching call sites
- Commit without running tests

## ALWAYS
- Show diff before committing
- Update CHANGELOG for user-facing changes

## Verification
- Backend: make test + make lint
- API: update contract tests
- UI: before/after screenshots

## Compact Instructions
Preserve: architecture decisions, modified files, verification status, TODOs, rollback notes

Pro trick: after every mistake, tell Claude “Update your CLAUDE.md so you don’t make that mistake again.” It maintains its own rules surprisingly well.

Here’s the CLAUDE.md writing guide visual:

CLAUDE.md Writing Guide

Real-World Lessons from My Rust + Lua Terminal Project

When I built an open-source terminal with Rust + Lua during Chinese New Year, the mixed-language setup exposed every typical agent collaboration flaw.

Environment transparency matters more than you think.
I added a doctor command that dumps a structured health report (dependencies, config, state). Claude runs it first—suddenly the "I don't know the environment" errors vanished.
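The report's shape matters more than its contents; mine looks roughly like this (field names and values are illustrative, not kaku's actual output):

```json
{
  "config_path": "~/.config/kaku/config.toml",
  "config_valid": true,
  "plugins": [{ "name": "statusline", "loaded": true }],
  "gpu_backend": "metal",
  "issues": []
}
```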

The mixed-language Hooks configuration shown earlier (cargo check after *.rs edits, a luajit bytecode check after *.lua edits) came straight from this project.

Recommended full engineering layout (copy and trim to your needs):

Project/
├── CLAUDE.md
├── .claude/
│   ├── rules/
│   ├── skills/
│   │   ├── runtime-diagnosis/
│   │   ├── config-migration/
│   │   ├── release-check/
│   │   └── incident-triage/
│   ├── agents/
│   └── settings.json
└── docs/ai/

Common Anti-Patterns to Avoid at a Glance

Here’s the exact anti-pattern map I check against every new project:

Claude Code Common Anti-Patterns Map

One-Click Configuration Health Check

Install the open Skill I published:
npx skills add tw93/claude-health

Then run /health in any session. It scores your CLAUDE.md, rules, skills, hooks, and actual behavior and gives a prioritized fix list.

The Three Stages Every Serious Claude Code User Goes Through

Stage 1: Treat it like ChatGPT
Stage 2: Learn prompts and tools
Stage 3: Focus on governance so the agent runs reliably by itself

Here’s the stage progression diagram:

Claude Code Usage Stage Evolution

The biggest mindset shift: if you can’t clearly define “what done looks like,” the task probably shouldn’t be fully autonomous.

Practical Checklist You Can Use Today

  1. Write a short CLAUDE.md with build commands, NEVER list, and Compact Instructions.
  2. Create at least three Skills (checklist, workflow, diagnosis).
  3. Add PostToolUse Hooks for linting on *.rs and *.lua files.
  4. Use Plan Mode for anything bigger than 30 minutes.
  5. Run /context before every long session.
  6. After any correction, ask Claude to update CLAUDE.md.
  7. For mixed-language projects, add a doctor command.
  8. Run /health weekly.

One-Page Summary (Copy-Paste Reference)

Claude Code = Six Layers
Running loop + Context governance + Concept boundaries + Skills + Tools/Hooks + Subagents + Verification

Core Formula
Short CLAUDE.md + Layered rules + On-demand Skills + Forced Hooks + Isolated Subagents + Explicit Verification

Context Formula
Fixed overhead ≤ 20K + Compact Instructions + HANDOFF.md handoff

Caching Formula
System + Tools locked at front + Dynamic content at back + defer_loading

Verification Formula
Prompt definition + Hook hard checks + Skill decision matrix + CLAUDE.md Definition of Done

Master this and you stop “using AI to write code” and start “governing AI that delivers reliable engineering.”

FAQ – Quick Answers to Common Questions

Why does my Claude Code context always get messy?
Fixed overhead from MCP tools and Skill descriptors eats 15-20K tokens, plus compression deletes important decisions. Fix: layer context properly + add Compact Instructions + use HANDOFF.md.

What’s the real difference between Skills and regular prompts?
Skills load progressively (descriptor stays, content loads on demand) and must include full steps plus stop conditions. Regular prompts dump everything at once and don’t support disable-model-invocation.

Which things should go in Hooks?
Only deterministic actions: post-edit linting, protected-file blocking, session-start environment injection, and completion notifications. Never complex reasoning.

When is Plan Mode actually worth using?
Any complex refactor, migration, or cross-module change. Exploration stays read-only, you approve the plan before any writes happen—error rate drops dramatically.

How do I make Claude maintain its own CLAUDE.md?
After every mistake, say: "Update your CLAUDE.md so you don't make that mistake again." It adds the rule itself, and the bad behavior stops recurring.

What is the real value of Subagents?
Isolation. Heavy tasks (scans, tests, reviews) run in their own window and return only clean summaries. Main context stays clean.

What exactly does the /context command show?
Exact token breakdown—especially MCP server definitions and file reads—so you can spot the hidden fixed-cost killers instantly.

How should I use Hooks in mixed-language projects?
Match by file extension: *.rs triggers cargo check, *.lua triggers a luajit syntax check. Always pipe output through head -30 so hooks don't pollute context.

These are the exact lessons that turned my $40/month experiment into a production-grade engineering workflow. If you implement even half of them, you’ll immediately feel the difference. Drop your own tricks in the comments—I’m still learning too.