The Three-Layer Architecture of AI-Assisted Development: How gstack, Superpowers, and Compound Engineering Work Together
What are the fundamental differences between gstack, Superpowers, and Compound Engineering, and how should developers combine them to build a complete AI-assisted workflow?
These three Claude Code tools—gstack by Garry Tan (54.6K stars), Superpowers by Jesse Vincent (121K stars), and Compound Engineering by Every Inc (11.5K stars)—are not competitors. They operate at three distinct layers of the development stack: decision-making, process structuring, and knowledge accumulation. Most developers make the mistake of installing one tool and assuming they have complete coverage. This article explains why you need all three layers, how they complement each other, and the specific workflow for combining them effectively.
The Restaurant Metaphor: Understanding the Four Responsibilities of AI-Assisted Development
How can we conceptualize the different roles that AI tools must play in software development?
To understand why these three tools are necessary, we need a framework for what AI-assisted development actually requires. I use a restaurant metaphor because it clarifies the separation of concerns that most developers overlook.
Anthropic published an engineering blog post on November 26, 2025, describing effective harnesses for long-running agents. Their architecture consists of an initializer agent that breaks down tasks and subsequent coding agents that execute them. Testing, QA, and specialized agents were noted as future work. I expand this into four concrete responsibilities using the restaurant analogy:
| Responsibility | Restaurant Role | Development Equivalent | Critical Principle |
|---|---|---|---|
| Planning | Head chef decides the menu | Determining what to build and whether it’s worth building | Direction matters more than speed |
| Execution | Kitchen team cooks | Writing code to implement the plan | Follow the plan without deviation |
| Evaluation | Independent food taster checks quality | Verifying that output meets requirements | The maker and checker must be separate |
| Cross-session state | Closing notes pass to the morning shift | Transferring knowledge between tasks | Knowledge must be searchable and reusable |
The core insight from Anthropic’s research: builders who evaluate their own work are systematically overoptimistic. A chef rating their own cooking will always find it delicious. The maker and the checker must be distinct entities. Using this harness architecture, agents autonomously built a complete application with over 200 verifiable features.
Author’s reflection: This separation of maker and checker feels counterintuitive at first. We want to believe we can objectively judge our own work. But the research confirms what experienced engineering managers already know—code review by the original author catches fewer bugs than review by a fresh pair of eyes. The AI equivalent requires explicitly designing evaluation as a separate function, not an afterthought.
gstack: The Decision and Testing Layer
How does gstack ensure you build the right thing, and how does it validate that your implementation works in the real world?
gstack excels at the planning and evaluation responsibilities. It provides specific commands that function as gates—decision points that must be cleared before work proceeds.
The Dual Gate System: Product and Architecture Validation
gstack provides two critical planning commands:
- /plan-ceo-review: Asks “is this worth building?” from a product perspective
- /plan-eng-review: Asks “will this blow up later?” from an architecture perspective
Both gates must pass before execution begins. This dual validation prevents the common failure mode of building the wrong thing beautifully, or building the right thing on a foundation that collapses under scale.
Application scenario: Imagine you’re planning to add a real-time collaboration feature to your SaaS product. Running /plan-ceo-review prompts questions like: “Do your core users actually need real-time editing, or would asynchronous commenting solve the same problem with less complexity? How do competitors approach this, and what’s their maintenance burden?” Meanwhile, /plan-eng-review challenges you: “Your current WebSocket infrastructure handles 100 concurrent connections. This feature requires 10,000. What’s your migration strategy? If two users edit the same text simultaneously, what’s your conflict resolution algorithm? Have you tested it with simulated latency?”
These questions often reveal assumptions that would have caused weeks of rework if discovered after implementation began.
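The dual-gate logic can be sketched in a few lines. This is an illustration of the gating idea only, not gstack's actual implementation; the `Plan` fields and function names are assumptions chosen to mirror the two review questions.

```python
# Sketch of gstack's dual-gate idea: both the product gate and the
# architecture gate must pass before execution begins. Field names and
# gate functions are illustrative, not gstack's real API.
from dataclasses import dataclass


@dataclass
class Plan:
    solves_real_user_need: bool       # CEO-review question: worth building?
    scales_to_projected_load: bool    # eng-review question: will it blow up later?


def ceo_review(plan: Plan) -> bool:
    """Product gate: is this worth building at all?"""
    return plan.solves_real_user_need


def eng_review(plan: Plan) -> bool:
    """Architecture gate: will this create future technical debt?"""
    return plan.scales_to_projected_load


def may_execute(plan: Plan) -> bool:
    # Execution is blocked unless BOTH gates pass.
    return ceo_review(plan) and eng_review(plan)


print(may_execute(Plan(True, False)))  # False: right thing, wrong foundation
print(may_execute(Plan(True, True)))   # True: cleared both gates
```

The point of the conjunction is that neither gate can compensate for the other: a compelling product case does not excuse a foundation that collapses under scale, and vice versa.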
The 95% Confidence Prompt: Inverting the Interview Dynamic
Before running /office-hours, gstack users can employ a specific prompt to clarify requirements:
“I’m about to start this project. Interview me until you have 95% confidence about what I actually want, not what I think I should want.”
This inverts the typical AI interaction pattern. Instead of you prompting the AI, the AI interviews you. Most projects fail not because they were built wrong, but because nobody clarified what to build in the first place.
Application scenario: You’re starting a project to “improve the onboarding flow.” Without clarification, you might assume this means “add more tooltips.” But when the AI interviews you with questions like “What specific drop-off point in your analytics prompted this project?” and “Have you interviewed users who completed onboarding versus those who didn’t?” you discover the real problem is account verification email deliverability, not UI complexity. The AI asking questions uncovers this in minutes; you prompting the AI would have produced a polished tooltip system that solved the wrong problem.
Author’s reflection: The 10x effectiveness claim for AI interviewing versus human prompting initially seemed exaggerated. But reflecting on my own experience, I realize how often I frame prompts based on my existing assumptions. When the AI asks the questions, it has no preconceptions to protect. It can follow threads that my own prompting would never surface because I didn’t know to ask. This dynamic—where the AI’s ignorance becomes its strength—represents a fundamental shift in how we should approach requirements gathering.
Real-World QA: The Independent Taster
The /qa command opens a real browser and interacts with your application like an actual user. It doesn’t check “does the code look correct?”—it verifies “can a user actually complete their task?”
Anthropic’s testing found that explicitly requiring browser-based end-to-end testing significantly improved performance compared to relying solely on code-level checks.
Application scenario: Your code review shows perfect API integration and clean form validation logic. But /qa discovers that when users enable password autofill on mobile Safari, the browser’s native styling obscures the submit button, making registration impossible on iOS. This is an interaction bug that static analysis cannot detect. The independent taster—separate from the implementation logic—catches what the builder missed.
Context Window Tactics and External State
Claude Opus 4.6 offers a 1 million token context window (currently in beta on the Claude Platform). For projects that fit within this window, you can load the complete codebase and documentation in a single pass rather than feeding it piecemeal.
However, Anthropic’s harness architecture still emphasizes external state files—feature-list, claude-progress.txt, and similar—as the primary coordination mechanism, not just raw context. For long-running projects, structured external records prove more sustainable than relying solely on large context windows.
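A minimal version of the external-state pattern looks like this. The record shape and JSON format are assumptions for illustration; Anthropic's description uses plain-text files such as claude-progress.txt, but the principle is the same: coordination state lives on disk, not in the context window.

```python
# Sketch of the external-state idea: a progress file persists coordination
# state across sessions instead of relying on raw context. The JSON record
# shape here is an assumption, not Anthropic's actual file format.
import json
import os
import tempfile


def save_progress(path: str, completed: list, next_task: str) -> None:
    # Written at the end of a session ("closing notes").
    with open(path, "w") as f:
        json.dump({"completed": completed, "next": next_task}, f)


def load_progress(path: str) -> dict:
    # Read at the start of the next session ("morning shift").
    if not os.path.exists(path):
        return {"completed": [], "next": None}  # fresh start
    with open(path) as f:
        return json.load(f)


state_file = os.path.join(tempfile.mkdtemp(), "claude-progress.json")
save_progress(state_file, ["feature-1"], "feature-2")
resumed = load_progress(state_file)  # a later session picks up here
print(resumed["next"])  # feature-2
```

Because the file survives the session, a fresh agent can resume exactly where the last one stopped, regardless of how much of the previous conversation fits in context.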
The Limitation: Great Chef, No Recipe Binder
Garry Tan reports shipping 600,000 lines of production code in 60 days using this setup—10,000 to 20,000 lines per day—while running Y Combinator full-time. These numbers represent his personal experience; individual results will vary. For decision-making and QA, gstack remains the strongest option.
But gstack has a critical gap: it’s like a restaurant with an excellent head chef and a rigorous food taster, but no recipe binder. Nobody documents what went wrong tonight. Tomorrow’s team starts fresh, repeating mistakes that were already solved.
gstack does include /review and /ship commands, creating some overlap with Compound Engineering’s review capabilities. The distinction is one of emphasis, not hard boundaries.
Author’s reflection: I’ve experienced this pattern directly. Using gstack for a three-month project, I noticed we were solving similar architecture problems multiple times. Each new feature triggered debates we’d already resolved. The team had the knowledge, but it was trapped in individual memories rather than being institutionalized. This isn’t a flaw in gstack’s design—it’s simply not what gstack was built to solve. Recognizing this limitation is what led me to explore the next layer.
Superpowers: Structured Process Without Memory
What problem does Superpowers solve, and why is it insufficient for long-term projects?
Superpowers has earned 121K stars, reflecting broad adoption. Its core contribution is upgrading developers from “chatting randomly with AI” to “using AI with a structured process.”
The Process Discipline
Superpowers defines a clear workflow:
Brainstorm → Plan → Execute → Review
This structure transforms development from improvisation to choreography. Like moving from a kitchen where everyone cooks by instinct to one with standardized recipes and prep checklists, this represents genuine progress. Superpowers also includes subagent-driven development with separate specification and code-quality reviewers.
Application scenario: Before Superpowers, your AI interactions might look like: “Hey, can you help me add a payment system? Actually, wait, let’s talk about the database first. No, actually, let’s look at the UI.” Each session wanders. With Superpowers, you commit to the phase structure: first exhaustively brainstorm payment approaches (Stripe, Paddle, custom), then select one and plan implementation details, then execute against that plan, then review against the specification. The discipline prevents mid-implementation scope creep and ensures review actually happens against defined criteria rather than vague “does this look okay?”
The Memory Gap
Author’s reflection: Superpowers was my entry point into structured AI-assisted development, and it genuinely improved my output quality. But after several weeks, I noticed a frustrating pattern. I’d spend a session debugging a complex CORS configuration issue, finally resolve it, and document the solution in the conversation. Three days later, a different feature triggered a similar problem. The AI confidently suggested the same approaches we’d already proven wrong. I had to re-explain: “We tried that last Tuesday. It doesn’t work because of the load balancer configuration.”
This happens because Superpowers doesn’t treat knowledge accumulation as a first-class feature. Each session’s context remains in that session. The next session starts without the lessons from the last one. The process ensures quality within a session, but sessions remain isolated islands.
This limitation—process without memory—is what led me to add Compound Engineering on top of Superpowers.
Compound Engineering: The Knowledge Accumulation Layer
How does Compound Engineering solve the knowledge loss problem, and what makes its “compounding” mechanism distinct from simple documentation?
Compound Engineering (CE) addresses the layer that both gstack and Superpowers neglect: systematic knowledge accumulation that improves over time.
The Five-Phase Cycle
CE’s workflow extends beyond Superpowers’ four phases:
Brainstorm → Plan → Work → Review → Compound
The first four phases resemble Superpowers but execute with greater depth. The fifth phase—/ce:compound—is where CE earns its name.
Research-Driven Planning
Instead of writing plans from scratch in each conversation, CE’s /ce:plan spawns parallel research agents that:
- Dig through your project’s history
- Scan codebase patterns
- Read Git commit logs
Application scenario: You’re adding a “limited-time discount” feature to an e-commerce system. Rather than designing from first principles, CE’s research agents discover that six months ago, a similar feature failed because timezone handling caused discounts to expire early in certain regions. They also find that the current codebase already has a timezone utility in src/utils/time.ts that was built specifically for this previous attempt. Your plan automatically incorporates: “Use existing timezone utility; test against the failure cases documented in commit 7a3f2d.” A new cook designing tomorrow’s menu has read every complaint from the past three months instead of guessing.
Dynamic Reviewer Ensemble
CE’s /ce:review doesn’t rely on a single reviewer saying “looks good.” It runs a dynamic ensemble with:
- Minimum 6 always-on reviewers: correctness, security, performance, testing, maintainability, adversarial testing
- Conditional reviewers activated based on the diff: database changes trigger data consistency reviewers, authentication changes trigger security specialists
Each produces an independent report. This is like having a food critic, health inspector, and customer panel all taste the same dish separately rather than trusting the cook’s self-assessment.
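The selection mechanic can be sketched as a core set plus diff-triggered additions. The reviewer names and path heuristics below are assumptions for illustration; CE's actual activation logic is not documented here.

```python
# Illustrative sketch of a dynamic reviewer ensemble: a fixed core of six
# reviewers, plus conditional reviewers activated by what the diff touches.
# Path prefixes and reviewer names are assumptions, not CE's real logic.
CORE_REVIEWERS = [
    "correctness", "security", "performance",
    "testing", "maintainability", "adversarial",
]

# Hypothetical mapping from touched paths to extra reviewers.
CONDITIONAL = {
    "migrations/": "data-consistency",   # database changes
    "auth/": "security-specialist",      # authentication changes
}


def select_reviewers(changed_paths: list) -> list:
    reviewers = list(CORE_REVIEWERS)  # always-on minimum of six
    for path in changed_paths:
        for prefix, extra in CONDITIONAL.items():
            if path.startswith(prefix) and extra not in reviewers:
                reviewers.append(extra)
    return reviewers


print(select_reviewers(["src/ui/button.tsx"]))             # core six only
print(select_reviewers(["migrations/002_add_index.sql"]))  # core six + data-consistency
```

Each selected reviewer then produces its report independently, which is what keeps the ensemble from converging on the implementer's own blind spots.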
The Compound Phase: Knowledge as Compound Interest
/ce:compound is CE’s defining innovation. After fixing a bug or completing a feature, this command spawns five Phase 1 subagents in parallel:
| Subagent | Function | Output |
|---|---|---|
| Context Analyzer | Traces the conversation, extracts problem type and components involved | Classification tags |
| Solution Extractor | Captures what failed, what succeeded, root cause, final fix | Structured resolution path |
| Related Docs Finder | Searches existing knowledge base for duplicates; updates existing docs rather than creating new ones if similar issues exist | Deduplicated knowledge entry |
| Prevention Strategist | Identifies how to prevent this problem class in future | Prevention checklist |
| Category Classifier | Tags and categorizes the learning for structured retrieval | Searchable metadata |
All five complete, then merge results into docs/solutions/ as structured, categorized, searchable documents.
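The merge-with-deduplication step can be sketched as follows. The data shapes, tag-overlap threshold, and in-memory knowledge base are assumptions made for illustration; CE writes real files under docs/solutions/ and its dedup heuristic is more sophisticated than tag counting.

```python
# Sketch of the compound phase under assumed data shapes: subagent outputs
# merge into one knowledge entry, and (mimicking the Related Docs Finder)
# an existing entry with overlapping tags is updated instead of duplicated.
def compound(subagent_outputs: dict, knowledge_base: list) -> list:
    entry = {
        "tags": subagent_outputs["context_analyzer"],        # classification tags
        "resolution": subagent_outputs["solution_extractor"],
        "prevention": subagent_outputs["prevention_strategist"],
        "category": subagent_outputs["category_classifier"],
    }
    # Dedup: update an existing doc if tags overlap heavily, else append.
    for doc in knowledge_base:
        if len(set(doc["tags"]) & set(entry["tags"])) >= 2:
            doc.update(entry)  # update in place rather than creating a new doc
            return knowledge_base
    knowledge_base.append(entry)
    return knowledge_base


kb = [{"tags": ["node", "async"], "resolution": "pin v2.1.0-patch1"}]
outputs = {
    "context_analyzer": ["node", "async", "race"],
    "solution_extractor": "pin v2.1.0-patch1",
    "prevention_strategist": ["CI matrix across Node versions"],
    "category_classifier": "runtime-compatibility",
}
print(len(compound(outputs, kb)))  # 1: merged into the existing doc, not duplicated
```

The dedup branch is what keeps the knowledge base from degrading into near-identical documents as the same problem class recurs.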
Application scenario: You spend three hours debugging an edge-case runtime compatibility bug—a dependency behaves differently in Node.js 18 versus 20 due to AsyncLocalStorage changes, causing data races that manifest as random 5% request failures with no error logs. After fixing it, you run /ce:compound. The system generates:
```markdown
# Runtime Compatibility: Node.js Version Async Behavior Differences

## Problem Type
Runtime Compatibility / Asynchronous Processing / Data Races

## Components
- Dependency: legacy-data-processor@v2.1.0
- Node.js versions: 18.x (dev) vs 20.x (production)
- Affected: AsyncLocalStorage context propagation

## Symptoms
- Random 5% data loss on requests
- No error logs (silent failure)
- Only reproducible in production (Node 20)
- Appears related to timing/load, not specific user actions

## Failed Approaches
1. Added retry logic → Increased race condition frequency
2. Upgraded dependency to v3.0 → Breaking changes, migration cost prohibitive

## Final Solution
Lock dependency to v2.1.0-patch1, which explicitly handles Node 20 AsyncLocalStorage behavior change

## Prevention
- CI matrix testing across Node 18, 20, 22
- Dependency onboarding checklist includes Node compatibility verification
- Production/dev environment parity enforcement
```
Three weeks later, during a different feature, you encounter “random data loss, no error logs.” The plan-phase research agent retrieves this document: “Similar symptoms detected in docs/solutions/node-async-compatibility.md. Solution: check Node version compatibility with AsyncLocalStorage-dependent dependencies.” What would have been hours of debugging becomes minutes of verification.
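The retrieval step can be approximated with simple keyword matching. The scoring function below is a deliberate simplification and an assumption; CE's actual plan-phase search is richer, but the mechanism is the same: current symptoms are matched against the symptom descriptions of stored solution documents.

```python
# Minimal retrieval sketch: match current symptoms against stored solution
# docs by keyword overlap. The scoring heuristic is an assumption, not
# CE's real search implementation.
def find_related(symptoms: str, docs: list):
    words = set(symptoms.lower().split())
    best, best_score = None, 0
    for doc in docs:
        score = len(words & set(doc["symptoms"].lower().split()))
        if score > best_score:
            best, best_score = doc, score
    return best  # None if nothing overlaps at all


docs = [
    {"path": "docs/solutions/node-async-compatibility.md",
     "symptoms": "random data loss no error logs node 20"},
    {"path": "docs/solutions/cors-load-balancer.md",
     "symptoms": "cors preflight failure behind load balancer"},
]

hit = find_related("random data loss, no error logs", docs)
print(hit["path"])  # docs/solutions/node-async-compatibility.md
```

Even this crude overlap score is enough to surface the right document in the scenario above; the value comes less from clever search than from the fact that the document exists at all.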
Author’s reflection: The first time this happened—when CE automatically surfaced a previous solution and prevented a repeat debugging session—I experienced a genuine shift in how I think about development work. Previously, I accepted that software engineering involves repeating certain classes of mistakes. You forget, you relearn. That’s just the job. But CE demonstrates that this repetition is a solvable coordination problem, not an inherent property of the work. The “compound interest” metaphor is apt in a way I didn’t fully appreciate until experiencing it: the returns don’t just add up, they multiply as the knowledge base becomes more interconnected and searchable.
Linear vs. Exponential Knowledge
Anthropic’s claude-progress.txt represents tonight’s closing notes passed to the morning shift—linear continuity, one session to the next. CE’s docs/solutions/ is the restaurant’s recipe binder that every employee reads on day one and every day after, searchable by anyone, anytime.
Closing notes solve continuity. A recipe binder solves accumulation. One is linear. One is exponential.
Author’s reflection: This distinction clarified something that had bothered me about previous AI-assisted development approaches. I had been treating knowledge transfer as a handoff problem—how do I tell the next session what happened? But the real problem is accumulation—how do I build an organizational memory that improves over time? The linear approach means each new session starts with the previous session’s context, but no more. The exponential approach means each new session starts with the accumulated wisdom of all previous sessions. Over months, this difference becomes transformative.
The Complete Stack: How Three Layers Integrate
How do gstack, Superpowers, and Compound Engineering map to specific development responsibilities, and how should you combine them?
The following table clarifies the responsibility mapping:
| Layer | Primary Tool | Restaurant Analog | Core Value |
|---|---|---|---|
| Decisions (whether to build) | gstack | Head chef sets the menu | Prevents building the wrong thing |
| Planning (how to build) | CE /ce:plan | Researcher reviews past complaints | Designs based on historical experience |
| Execution (actual building) | CE /ce:work | Kitchen team cooks | Implements plan with task tracking |
| Review (built correctly?) | CE /ce:review + gstack /qa | Food critic + inspector + customer panel | Multi-dimensional quality verification |
| Knowledge (remember) | CE /ce:compound | Recipe binder everyone reads | Experience becomes reusable asset |
These tools have different centers of gravity, not hard boundaries. gstack includes review capabilities; CE includes decision support. The overlap is by design—redundancy in critical functions like review is preferable to gaps.
Practical Implementation: Getting Started and Advanced Workflows
What specific steps should you take to implement these tools, whether you’re a beginner or experienced practitioner?
For Beginners: Start with One Framework
If you’re new to AI-assisted development tools, do not attempt to configure all three simultaneously. Multiple skill packs can create process conflicts and command overlaps. Choose either gstack or Compound Engineering as your primary framework:
- Choose Compound Engineering if your priority is long-term knowledge accumulation and you’re working on projects with extended timelines
- Choose gstack if your priority is decision quality and real-world testing, and you’re comfortable managing knowledge through other means
Use your chosen tool for 2-3 complete project cycles. Build familiarity with its command structure, failure modes, and optimal use patterns. Only after achieving fluency should you consider adding a second tool.
For Experienced Users: The Combined Workflow
Once comfortable with individual tools, integrate them as follows:
Phase 1: Requirements Clarification and Decision Validation (gstack-led)
1. Reverse interview for requirement clarification
   - Prompt: “I’m about to start this project. Interview me until you have 95% confidence about what I actually want, not what I think I should want.”
   - Let AI questions surface assumptions you didn’t know you held
2. /office-hours — Describe your proposed build, accept challenges
3. /plan-ceo-review — Product gate: validate that this is worth building
4. /plan-eng-review — Architecture gate: validate that this won’t create future technical debt
Phase 2: Research-Driven Planning and Execution (CE-led)
5. /ce:brainstorm — Explore implementation approaches, converge on specification
6. /ce:plan — Research agents scan project history, codebase patterns, and commit logs; generate detailed implementation plan informed by past experience
7. /ce:work — Execute plan with integrated task tracking
Phase 3: Multi-Dimensional Quality Verification (CE + gstack)
8. /ce:review — Dynamic reviewer ensemble (minimum 6 reviewers + conditionals based on diff)
9. /qa — Real browser, real interaction testing on staging environment
Phase 4: Knowledge Accumulation (CE-led)
10. /ce:compound — Five subagents extract lessons, write structured documents to docs/solutions/
11. Ship and iterate — Next cycle begins with step 1, but step 6 already knows everything learned in this cycle
Author’s reflection: Steps 1-4 ensure you build the right thing. Steps 5-9 ensure you build it well. Step 10 ensures you build faster next time. This sequence addresses the three failure modes I see most often in AI-assisted development: building the wrong thing beautifully, building the right thing poorly, and repeating the same mistakes indefinitely. The workflow is deliberately sequential—each phase gates the next. Skipping gates, particularly the decision gates, undermines the entire system.
Common Pitfalls and Tool Selection Guidance
What mistakes do developers commonly make with these tools, and how should project characteristics influence your selection?
Pitfall 1: Single-Tool Assumption
Mistake: “I’ve installed gstack. That should cover everything.”
Consequence: Strong decision-making and QA, but after three months your team repeats the same debugging patterns, solving identical problems without benefit of previous solutions.
Prevention: Recognize that gstack explicitly does not prioritize knowledge accumulation. If your project timeline exceeds one month, plan for how you’ll capture and reuse learnings.
Pitfall 2: Tool Proliferation Without Integration
Mistake: “More tools means better coverage. I’ll install all three and use whichever command seems relevant.”
Consequence: Command conflicts, contradictory process guidance, and AI confusion about which framework’s rules to follow.
Prevention: Establish clear responsibility boundaries before adding tools. Document which tool “owns” which phase of your workflow.
Pitfall 3: Skipping the Compound Phase
Mistake: “Bug fixed, feature shipped. On to the next task.”
Consequence: Knowledge loss. The three hours spent debugging the Node.js compatibility issue benefit no future session.
Prevention: Make /ce:compound a mandatory step in your definition of done. The 30-second command execution pays dividends across months.
Selection Decision Framework
| Project Characteristics | Recommended Primary | Complementary Tool | Rationale |
|---|---|---|---|
| Short-term prototype (< 2 weeks) | Superpowers | None | Structured process sufficient; knowledge accumulation unnecessary |
| Long-term product (> 3 months) | Compound Engineering | gstack (decision/QA) | Knowledge compounding value increases exponentially with time |
| High-stakes architecture decisions | gstack | CE (execution/knowledge) | Decision quality paramount; mistakes expensive |
| Multi-person team | Compound Engineering | gstack | Knowledge base becomes shared organizational asset |
| Individual side project | gstack or CE | Context-dependent | Personal projects benefit from knowledge accumulation but have lower decision stakes |
Action Checklist / Implementation Steps
- [ ] Assess project timeline: Determine whether knowledge accumulation justifies CE adoption
- [ ] Select primary framework: Choose gstack or CE; avoid simultaneous adoption
- [ ] Establish single-tool fluency: Complete 2-3 project cycles before adding tools
- [ ] Implement requirement clarification: Use the 95% confidence reverse interview prompt before all planning
- [ ] Configure decision gates: Set up /plan-ceo-review and /plan-eng-review as mandatory checkpoints
- [ ] Enable research-driven planning: Ensure /ce:plan has access to project history and commit logs
- [ ] Layer quality verification: Combine CE’s multi-dimensional review with gstack’s browser-based QA
- [ ] Mandate knowledge capture: Make /ce:compound part of your definition of done
- [ ] Maintain knowledge base: Periodically review docs/solutions/ structure for searchability
- [ ] Iterate and refine: Adjust tool boundaries based on your team’s specific friction points
One-Page Overview
| Tool | Core Layer | Key Commands | Solves |
|---|---|---|---|
| gstack | Decision + Testing | /plan-ceo-review, /plan-eng-review, /qa | “Build the right thing” + “Verify in real world” |
| Superpowers | Process Structure | brainstorm → plan → execute → review | Upgrade from random chat to structured workflow |
| Compound Engineering | Knowledge Accumulation | /ce:plan, /ce:review, /ce:compound | Research-driven planning + compounding organizational memory |
Integration principle: Start with one tool, achieve fluency, then layer. gstack ensures correct decisions and real-world validation. CE ensures each project builds on previous learning. Superpowers provides accessible structure for those beginning structured AI-assisted development.
Frequently Asked Questions
Q1: Are these three tools competitors? Do I need to choose just one?
No. They operate at different layers—decisions, process, and knowledge. They complement rather than replace each other. Most experienced users eventually combine them, but beginners should start with one.
Q2: As a solo developer, do I need this entire stack?
Depends on project duration. For prototypes under two weeks, Superpowers’ structure is sufficient. For projects you’ll maintain for months, even solo developers benefit from CE’s knowledge accumulation—unless you enjoy solving the same debugging problems repeatedly.
Q3: What’s the difference between gstack’s /qa and CE’s /ce:review?
/ce:review performs multi-dimensional code-level analysis (correctness, security, performance, maintainability, testing, adversarial). /qa opens a real browser and interacts with your application like a user. They catch different failure classes and should be used sequentially.
Q4: What does “compound” mean in Compound Engineering?
It refers to compound interest, not composition. Each task’s output includes not just code but reusable experience stored in docs/solutions/. The knowledge base grows exponentially more valuable as it accumulates interconnected solutions.
Q5: How do I start using these tools?
Select either Compound Engineering or gstack as your primary framework. Use it exclusively for 2-3 complete projects. Once you’ve internalized its workflow patterns, add the second tool to fill gaps in your current setup. Avoid adopting all three simultaneously.
Q6: Do these tools require specific Claude versions?
Yes. These are Claude Code tools requiring appropriate Claude platform access. Some features (1M token context window) need Claude Opus 4.6 or later. Check each tool’s documentation for specific version requirements.
Q7: Won’t the docs/solutions/ directory become unwieldy over time?
CE’s Related Docs Finder subagent automatically detects duplicates and updates existing documents rather than creating new ones. The Category Classifier ensures consistent tagging for searchability. Periodic human review of the classification scheme remains valuable.
Q8: How do teams prevent command conflicts when using multiple tools?
Establish a “tool responsibility charter” documenting which tool owns which workflow phase. For example: “All decision-phase commands use gstack; all execution and knowledge-phase commands use CE.” Explicit conventions prevent the AI from receiving contradictory instructions.
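One lightweight way to make such a charter machine-checkable is an explicit phase-to-tool mapping consulted before dispatching any command. The phase names and lookup below are illustrative assumptions, not a format either tool defines.

```python
# Hypothetical "tool responsibility charter": an explicit mapping from
# workflow phase to owning tool. Phase names and this format are
# illustrative assumptions, not part of gstack or CE.
CHARTER = {
    "decision": "gstack",     # /office-hours, /plan-ceo-review, /plan-eng-review
    "planning": "ce",         # /ce:brainstorm, /ce:plan
    "execution": "ce",        # /ce:work
    "review": "ce+gstack",    # /ce:review, then /qa
    "knowledge": "ce",        # /ce:compound
}


def owner(phase: str) -> str:
    # Fail loudly on phases nobody owns, so gaps surface immediately.
    if phase not in CHARTER:
        raise ValueError(f"No owner defined for phase: {phase}")
    return CHARTER[phase]


print(owner("decision"))  # gstack
```

Writing the charter down, in whatever format, is the point: an undocumented convention drifts, and the AI cannot follow rules that exist only in team members' heads.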

