Build an AI Agent Company from Scratch: A Complete Guide to 6 Autonomous Agents
Core Question: How can you build and operate an automated system of 6 AI agents from scratch, without relying on complex frameworks like LangChain and without deep programming skills?
With the assistance of an AI coding assistant and without needing to be an expert coder, you can build an automated system consisting of 6 AI agents. This system can autonomously execute tasks such as intelligence scanning, content writing, tweet posting, and data analysis. It holds 10-15 meetings a day, learns from experience, adjusts relationships, and even evolves its speaking style.
This guide will deconstruct the technology stack, data models, core logic, and deployment process of this system, taking you step-by-step through building your own “AI company.”
Chapter 1: The Foundation — Building the Data Loop with 4 Core Tables
Section Core Question: How can the simplest data model support an autonomous AI agent closed-loop system?
Many people jump straight to “autonomous thinking,” but if your agent can’t even process a queued step, what autonomy are we talking about? The entire skeleton of the system consists of just 4 tables. Their relationship forms a perfect closed loop:
Agent proposes an idea → Gets approved and becomes a mission → Breaks down into concrete steps → Execution fires an event → Event triggers a new idea → Back to step one.
This is the loop. It runs forever. That is your “closed loop.”
Core Data Model Breakdown
You need to create the following 4 tables in Supabase. They serve as the single source of truth for the system:
- ops_mission_proposals (Proposals Table)
  - Purpose: Stores all “requests” submitted by agents.
  - Fields: agent_id (initiator), title, status (pending/accepted/rejected), proposed_steps.
  - Scenario: A social media agent wants to post a tweet about AI trends. It submits a proposal, and the system reviews it to decide whether to approve it.
- ops_missions (Missions Table)
  - Purpose: Stores formally approved tasks.
  - Fields: title, status (approved/running/succeeded/failed), created_by.
  - Scenario: Once a proposal is accepted, it becomes a Mission and officially enters the execution queue.
- ops_mission_steps (Execution Steps Table)
  - Purpose: Breaks missions down into concrete, executable steps.
  - Fields: mission_id (associated mission), kind (step type: e.g., draft_tweet/crawl/analyze), status (queued/running/succeeded/failed).
  - Scenario: A “Post Tweet” mission is broken down into multiple steps: “Write Draft,” “Review,” “Post.”
- ops_agent_events (Event Stream Table)
  - Purpose: Records everything happening in the system. It is the foundation for frontend display and log tracking.
  - Fields: agent_id, kind, title, summary, tags[] (tags for categorization and triggering reactions).
Author Reflection: One of my biggest mistakes was allowing triggers, APIs, and reaction matrices to create proposals independently. This meant some proposals bypassed approval. The solution is to establish a single entry point function. No matter where a proposal comes from, it goes through the same function to ensure security and consistency.
Proposal Service: The System’s Single Entry Point
Don’t let agents create tasks at will. You need to build a createProposalAndMaybeAutoApprove function that acts as the system’s customs checkpoint. All proposals must pass the following checks:
// proposal-service.ts — The single entry point for proposal creation
export async function createProposalAndMaybeAutoApprove(sb, input) {
// 1. Check if this agent hit its daily limit
// 2. Check Cap Gates (tweet quota full? too much content today?)
// → If full, reject immediately — no queued step created
// 3. Insert the proposal
// 4. Evaluate auto-approve (low-risk tasks pass automatically)
// 5. If approved → create mission + steps
// 6. Fire an event (so the frontend can see it)
}
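To make that skeleton concrete, here is a minimal sketch of the body, assuming the Supabase JS client, the STEP_KIND_GATES map from the next section, and a hypothetical evaluateAutoApprove helper. The exact column names are assumptions, not the author’s verbatim implementation:

// Minimal sketch: gates first, then insert, then maybe auto-approve.
// evaluateAutoApprove and the exact column names are assumed.
export async function createProposalAndMaybeAutoApprove(sb, input) {
  // Cap gates: reject before anything touches the queue
  for (const step of input.proposed_steps ?? []) {
    const gate = STEP_KIND_GATES[step.kind];
    if (gate) {
      const res = await gate(sb);
      if (!res.ok) return { ok: false, reason: res.reason };
    }
  }
  // Insert the proposal as pending
  const { data: proposal, error } = await sb
    .from("ops_mission_proposals")
    .insert({
      agent_id: input.agent_id,
      title: input.title,
      status: "pending",
      proposed_steps: input.proposed_steps,
    })
    .select()
    .single();
  if (error) return { ok: false, reason: error.message };
  // Auto-approve low-risk step kinds (policy-driven; helper assumed)
  if (await evaluateAutoApprove(sb, input)) {
    await sb.from("ops_mission_proposals").update({ status: "accepted" }).eq("id", proposal.id);
    // ...then create the mission + steps and fire an ops_agent_events record
  }
  return { ok: true, proposal_id: proposal.id };
}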
Cap Gates: Preventing Task Backlog
Imagine your company has a rule: max 8 tweets per day. If you don’t check the quota at the “submit request” step, what happens? The request gets approved, the task gets queued, the executor checks and says “we already posted 8 today” and skips it—but the task is still sitting in the queue.
Check at the proposal entry point: quota full means instant rejection, and no task enters the queue.
const STEP_KIND_GATES = {
write_content: checkWriteContentGate, // check daily content limit
post_tweet: checkPostTweetGate, // check tweet quota
deploy: checkDeployGate, // check deploy policy
};
Taking the tweet gate as an example, the system compares today’s count with the quota:
async function checkPostTweetGate(sb) {
const quota = await getPolicy(sb, "x_daily_quota"); // read from ops_policy table
const todayCount = await countTodayPosted(sb); // count today's posts
if (todayCount >= quota.limit) {
return { ok: false, reason: `Quota full (${todayCount}/${quota.limit})` };
}
return { ok: true };
}
The Policy Table: A Flexible Control Center
Don’t hardcode quotas and feature flags in your code. Store all policies in the ops_policy table using a Key-Value structure:
CREATE TABLE ops_policy (
key TEXT PRIMARY KEY,
value JSONB NOT NULL DEFAULT '{}',
updated_at TIMESTAMPTZ DEFAULT now()
);
Core Policy Examples:
- Auto-approve: Which step kinds can be auto-approved.
- Daily Tweet Quota: E.g., limit set to 8.
- Content Policy: Max number of drafts per day.
Benefit: You can tweak any policy by editing JSON values in the Supabase dashboard—no redeployment needed. System going haywire at 3 AM? Just flip enabled to false.
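Seeding or tweaking policies from a script has the same effect as editing JSON in the dashboard. Here is a sketch, assuming the Supabase JS client; x_daily_quota is the key the tweet gate reads, while the other key names are illustrative guesses:

// Seed a few policies in a setup script. Only x_daily_quota appears in
// the code above; auto_approve and content_policy are assumed key names.
async function seedPolicies(sb) {
  await sb.from("ops_policy").upsert([
    { key: "x_daily_quota", value: { limit: 8 } },
    { key: "auto_approve", value: { enabled: true, kinds: ["draft_tweet", "analyze"] } },
    { key: "content_policy", value: { max_drafts_per_day: 5 } },
  ]);
}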
Chapter 2: System Pulse — Heartbeat and Trigger Mechanisms
Section Core Question: How does a system autonomously evaluate its environment and trigger workflows without continuous human intervention?
The Heartbeat is the pulse of the system. Without it, proposals go unreviewed, triggers go unevaluated, and stuck tasks go unrecovered—the system flatlines. The heartbeat fires every 5 minutes, performing 6 core tasks.
Core Responsibilities of the Heartbeat
The Heartbeat API (usually deployed on Vercel) is called every 5 minutes to complete the following work:
// /api/ops/heartbeat — Vercel API route
export async function GET(req) {
// 1. Evaluate triggers (any conditions met?)
const triggers = await evaluateTriggers(sb, 4000);
// 2. Process reaction queue (do agents need to interact?)
const reactions = await processReactionQueue(sb, 3000);
// 3. Promote insights (any discoveries worth elevating?)
const learning = await promoteInsights(sb);
// 4. Learn from outcomes (how did those tweets perform? write lessons)
const outcomes = await learnFromOutcomes(sb);
// 5. Recover stuck tasks (steps running 30+ min with no progress → mark failed)
const stale = await recoverStaleSteps(sb);
// 6. Recover stuck conversations
const roundtable = await recoverStaleRoundtables(sb);
// Each step is try-catch'd — one failing won't take down the others
// Finally, write an ops_action_runs record (for auditing)
}
A single line of crontab on the VPS triggers it:
*/5 * * * * curl -s -H "Authorization: Bearer $CRON_SECRET" https://your-domain.com/api/ops/heartbeat
Trigger Rules: The Execution Logic of the Heartbeat
The heartbeat calls evaluateTriggers(), but what is it evaluating? Trigger rules. They are rows in an ops_trigger_rules table. Each rule says: “When this condition is true, create a proposal for this agent.”
// What a trigger rule looks like in the database
{
"name": "Tweet high engagement",
"trigger_event": "tweet_high_engagement", // maps to a checker function
"conditions": { "engagement_rate_min": 0.05, "lookback_minutes": 60 },
"action_config": { "target_agent": "growth" },
"cooldown_minutes": 120, // don't fire again for 2 hours
"enabled": true,
"fire_count": 0,
"last_fired_at": null
}
Triggers come in two flavors:
- Reactive Triggers: Respond to something that already happened. For example, a tweet went viral → tell the growth agent to analyze why; a mission failed → tell the brain agent to diagnose it.
- Proactive Triggers: Agents initiate work on their own schedule. For example, the growth agent scans industry signals every 3 hours, and the social agent drafts tweets every 4 hours.
Proactive triggers add randomness to feel natural—each has a skip probability (10-15% chance of “not feeling like it today”), topic rotation, and jitter so agents don’t all fire at the exact same time.
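Here is a minimal sketch of that randomness, assuming each rule row carries skip_probability and jitter_minutes fields (both field names are assumptions):

// Decide whether a proactive trigger fires this heartbeat cycle.
function shouldFireProactive(rule, now = Date.now()) {
  // Cooldown: never fire again inside the cooldown window
  if (rule.last_fired_at) {
    const elapsedMin = (now - new Date(rule.last_fired_at).getTime()) / 60000;
    if (elapsedMin < rule.cooldown_minutes) return { fire: false };
  }
  // Skip probability: 10-15% chance of "not feeling like it today"
  if (Math.random() < (rule.skip_probability ?? 0.1)) return { fire: false };
  // Jitter: a random delay so agents don't all fire at the same moment
  const delayMs = Math.random() * (rule.jitter_minutes ?? 10) * 60000;
  return { fire: true, delayMs };
}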
Reaction Matrix: Agents Responding to Each Other
Triggers create work from conditions. But what about agent-to-agent interactions? When Agent A does something, how does Agent B decide to respond?
That’s the reaction matrix—a JSON policy in ops_policy that defines patterns:
// ops_policy key: 'reaction_matrix'
{
"patterns": [
{
"source": "*", // any agent
"tags": ["mission_failed"], // when a mission fails
"target": "brain", // brain agent reacts
"type": "diagnose", // by diagnosing
"probability": 1.0, // always (100%)
"cooldown": 60 // but not more than once per hour
},
{
"source": "twitter-alt", // when xalt posts
"tags": ["tweet", "posted"], // a tweet
"target": "growth", // growth agent reacts
"type": "analyze", // by analyzing performance
"probability": 0.3, // 30% of the time
"cooldown": 120 // at most once per 2 hours
}
]
}
The Flow:
Agent does something → event gets written to ops_agent_events with tags → event hook checks the reaction matrix → tags match a pattern? → Probability roll + cooldown check → passes? Write to ops_agent_reactions queue → Next heartbeat → processReactionQueue() picks it up → creates a proposal through the standard proposal-service.
Author Reflection: Why a queue instead of reacting immediately? Because reactions go through the same proposal gates—quota checks, auto-approve, cap gates. An agent “reacting” doesn’t mean it bypasses safety. The queue also lets you inspect and debug what’s happening.
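For illustration, the matching step inside the event hook might look like this, assuming the pattern shape above (cooldown bookkeeping is elided):

// Match an event against the reaction matrix; return reactions to enqueue.
function matchReactions(event, matrix) {
  return matrix.patterns
    .filter((p) => {
      const sourceOk = p.source === "*" || p.source === event.agent_id;
      const tagsOk = p.tags.every((t) => (event.tags ?? []).includes(t));
      return sourceOk && tagsOk && Math.random() < p.probability;
    })
    .map((p) => ({ target: p.target, type: p.type, source_event: event.id }));
}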
Three-Layer Architecture Design
At this point, your system has a clear three-layer architecture:
- VPS (The Employee): The agents’ brains + hands (thinking + executing tasks).
- Vercel (The Manager): The agents’ process manager (approving proposals + evaluating triggers + health monitoring).
- Supabase (The Shared Docs): The agents’ shared memory (the single source of truth for all state and data).
Chapter 3: Making Them Talk — The Roundtable Conversation System
Section Core Question: How can independent-working agents generate emergent intelligence through conversation, not just simple instruction execution?
Agents can work now, but they’re like people in separate cubicles—no idea what the others are doing. You need to get them in a room together.
Why Conversations Matter
It’s not just for fun. Conversations are the key mechanism for emergent intelligence in multi-agent systems:
- Information Sync: One agent spots a trending topic, the others have no clue. Conversations make information flow.
- Emergent Decisions: The analyst crunches data, the coordinator synthesizes everyone’s input; this beats any single agent going with their gut.
- Memory Source: Conversations are the primary source for writing lessons learned.
- Drama: Honestly, watching agents argue is way more fun than reading logs. Users love it.
Designing Agent Voices
Each agent needs a “persona”—tone, quirks, signature phrases.
- Boss (Project Manager): Tone: Direct, results-oriented. Quirk: Always asks about progress and deadlines. Line: “Bottom line—where are we on this?”
- Analyst (Data Analyst): Tone: Cautious, data-driven. Quirk: Cites a number every time they speak. Line: “The numbers tell a different story.”
- Hustler (Growth Specialist): Tone: High-energy, action-biased. Quirk: Wants to “try it now” for everything. Line: “Ship it. We’ll iterate.”
- Writer (Content Creator): Tone: Emotional, narrative-focused. Quirk: Turns everything into a “story.” Line: “But what’s the narrative here?”
- Wildcard (Social Media Ops): Tone: Intuitive, lateral thinker. Quirk: Proposes bold ideas. Line: “Hear me out—this is crazy but…”
Voice Definition Example:
const VOICES = {
boss: {
displayName: "Boss",
tone: "direct, results-oriented, slightly impatient",
quirk: "Always asks for deadlines and progress updates",
systemDirective: `You are the project manager.
Speak in short, direct sentences. You care about deadlines,
priorities, and accountability. Cut through fluff quickly.`,
},
// ... other agents
};
Conversation Formats & Scheduling
I designed 16 conversation formats, but beginners only need to focus on 3:
- Standup: The most practical. 4-6 agents, 6-12 turns. The coordinator always speaks first. Purpose: align priorities, surface issues.
- Debate: The most dramatic. 2-3 agents, 6-10 turns. Temperature 0.8 (more creative, more conflict). Purpose: two agents with disagreements face off.
- Watercooler: Surprisingly valuable. 2-3 agents, 2-5 turns. Temperature 0.9 (very casual). Purpose: random chitchat; some of the best insights emerge from casual conversation.
Who speaks first? Not simple round-robin—that’s too mechanical. It uses weighted randomness: agents with higher affinity are more likely to respond to each other, those who spoke recently have lower weights, and a small random jitter is added to simulate a real meeting.
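Here is a minimal sketch of such a weighted pick; the affinity lookup and recency tracking are assumptions about how the data is shaped:

// Pick the next speaker: weight by affinity to the last speaker,
// penalize whoever spoke recently, add a little jitter.
function pickNextSpeaker(agents, lastSpeaker, turn, getAffinity, lastSpokeTurn) {
  const weights = agents
    .filter((a) => a !== lastSpeaker)
    .map((a) => {
      let w = 0.2 + getAffinity(lastSpeaker, a); // affinity pull
      const sinceSpoke = turn - (lastSpokeTurn.get(a) ?? -10);
      if (sinceSpoke < 2) w *= 0.3; // recency penalty
      w *= 0.85 + Math.random() * 0.3; // jitter
      return { a, w };
    });
  if (weights.length === 0) return lastSpeaker;
  let r = Math.random() * weights.reduce((s, x) => s + x.w, 0);
  for (const { a, w } of weights) {
    r -= w;
    if (r <= 0) return a;
  }
  return weights[weights.length - 1].a;
}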
Chapter 4: Making Them Remember — Memory and Learning
Section Core Question: How do you transform unstructured conversations and execution results into structured knowledge that influences future behavior?
Today the agents discuss “weekend posts get low engagement.” Tomorrow they enthusiastically suggest posting more on weekends. Why? Because they have no memory.
5 Types of Memory
The system uses 5 types of memory, each serving a different purpose:
- Insight: A new discovery. E.g., “Users prefer tweets with data.”
- Pattern: Pattern recognition. E.g., “Weekend posts get 30% less engagement.”
- Strategy: A strategy summary. E.g., “Teaser before main post works better.”
- Preference: A preference record. E.g., “Prefers concise titles.”
- Lesson: A lesson learned. E.g., “Long tweets tank read-through rates.”
Memories are stored in the ops_agent_memory table, each with a confidence score (0-1). Low-confidence memories are ignored.
Sources of Memory
Source 1: Conversation Distillation
After each roundtable conversation, the system sends the full conversation history to an LLM to distill memories.
Source 2: Tweet Performance Reviews (Outcome Learning)
This is the core of Phase 2—agents learn from their own work results. The system calculates the median engagement rate as a baseline. Strong performers (>2x median) write a lesson with high confidence. Weak performers (<0.3x median) write a lesson for improvement.
Source 3: Mission Outcomes
Mission succeeded → write a strategy memory. Mission failed → write a lesson memory. All are deduplicated via source_trace_id.
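A sketch of the outcome-learning rule, assuming a batch of recent tweets with engagement rates. The 2x/0.3x thresholds come from the text; the field names, confidence values, and writeMemory helper are assumptions:

function median(xs) {
  const s = [...xs].sort((a, b) => a - b);
  const m = Math.floor(s.length / 2);
  return s.length % 2 ? s[m] : (s[m - 1] + s[m]) / 2;
}

async function learnFromTweetOutcomes(sb, tweets) {
  // Median engagement rate as the baseline
  const baseline = median(tweets.map((t) => t.engagement_rate));
  for (const t of tweets) {
    if (t.engagement_rate > 2 * baseline) {
      await writeMemory(sb, { kind: "lesson", confidence: 0.8,
        content: `This format worked: ${t.summary}`, source_trace_id: t.id });
    } else if (t.engagement_rate < 0.3 * baseline) {
      await writeMemory(sb, { kind: "lesson", confidence: 0.6,
        content: `Avoid this pattern: ${t.summary}`, source_trace_id: t.id });
    }
  }
}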
How Memory Affects Behavior
Having memories isn’t enough; they need to change what the agent does next.
My approach: 30% chance that memory influences topic selection.
This means agents maintain baseline behavior (exploration) 70% of the time, and adjust behavior based on memory (exploitation) 30% of the time.
Author Reflection: Why 30% and not 100%? 100% means agents only do things they have experience with, zero exploration; 0% means memories are useless. 30% achieves memory-influenced but not memory-dependent behavior.
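In code, the split is a single probability gate. A sketch, where pickFromMemories, pickBaselineTopic, and the 0.5 confidence cutoff are assumed details:

// 70% explore (baseline topic rotation), 30% exploit (memory-informed).
function chooseTopic(memories, baselineTopics) {
  const usable = memories.filter((m) => m.confidence >= 0.5); // cutoff assumed
  if (usable.length > 0 && Math.random() < 0.3) {
    return pickFromMemories(usable); // assumed helper
  }
  return pickBaselineTopic(baselineTopics); // assumed helper
}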
Chapter 5: Giving Them Relationships — Dynamic Affinity System
Section Core Question: How can relationships between agents evolve dynamically based on the frequency and quality of their interactions?
6 agents interact for a month, and their relationships are identical to day one—but in a real team, more collaboration builds rapport, and too much arguing strains it.
The Affinity System
Every pair of agents has an affinity value (0.10-0.95). Initial setup intentionally creates a few low-affinity pairs (like Boss vs. Rebel) to generate interesting dramatic conflict.
The Drift Mechanism
After each conversation, the memory distillation LLM call also outputs relationship drift—no extra LLM call cost.
Drift rules are strict:
- Max drift per conversation: ±0.03 (one argument shouldn’t turn colleagues into enemies).
- Affinity floor: 0.10 (they’ll always at least talk to each other).
- Affinity ceiling: 0.95 (even the closest pair keeps some healthy distance).
- Drift log: Keeps the last 20 drift_log entries.
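Honoring those rules is just clamping; a minimal sketch:

// Apply relationship drift: clamp to ±0.03 per conversation,
// keep affinity in [0.10, 0.95], retain the last 20 log entries.
function applyDrift(pair, rawDrift, reason) {
  const drift = Math.max(-0.03, Math.min(0.03, rawDrift));
  pair.affinity = Math.max(0.1, Math.min(0.95, pair.affinity + drift));
  pair.drift_log = [...(pair.drift_log ?? []), { drift, reason, at: new Date().toISOString() }].slice(-20);
  return pair;
}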
How Affinity Affects the System
- Speaker Selection: Agents with higher affinity are more likely to respond to each other.
- Conflict Resolution: Low-affinity pairs get automatically paired for conflict_resolution conversations.
- Mentor Pairing: High affinity + an experience gap → mentoring conversations.
- Conversation Tone: The system adjusts the prompt’s interaction type based on affinity (supportive/neutral/critical/challenge).
Chapter 6: Giving Them Personality — Voice Evolution
Section Core Question: How can an agent’s speaking style automatically evolve based on the professional experience it accumulates?
6 agents have been chatting for a month, and they still talk exactly the same way as day one. But if an agent has accumulated tons of experience with “tweet engagement,” its speaking style should reflect that.
Deriving Personality from Memory
My first instinct was to build a “personality evolution” table—too heavy. The final approach: derive personality dynamically from the existing memory table. No new tables needed.
Before each conversation, the system checks the agent’s memories and calculates how its personality should be adjusted on the fly.
Why rule-driven instead of LLM?
- Deterministic: Rules produce predictable results; no LLM hallucination causing sudden personality shifts.
- Cost: $0. No additional LLM calls.
- Debuggable: When a rule misfires, it’s easy to track down.
Injection Method: Modifiers get injected into the agent’s system prompt. For example, if your social media agent has accumulated 15 lessons about tweet engagement, its prompt now includes “Reference what works in engagement when relevant”—and it’ll naturally bring up engagement strategies in conversations.
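A sketch of that rule-driven derivation, assuming memories carry topic tags; the 15-lesson engagement example mirrors the text, and the other threshold is an assumption:

// Derive prompt modifiers from memory counts: pure rules, no LLM call.
function derivePersonalityModifiers(memories) {
  const mods = [];
  const tagCounts = {};
  for (const m of memories) {
    for (const tag of m.tags ?? []) tagCounts[tag] = (tagCounts[tag] ?? 0) + 1;
  }
  if ((tagCounts["engagement"] ?? 0) >= 15) {
    mods.push("Reference what works in engagement when relevant.");
  }
  if (memories.filter((m) => m.kind === "lesson").length >= 20) { // threshold assumed
    mods.push("Speak with earned confidence; cite past lessons.");
  }
  return mods; // appended to the agent's system prompt before each conversation
}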
Chapter 7: Making It Look Cool — The Frontend Implementation
Section Core Question: How do you build a high-performance frontend that intuitively visualizes complex agent behaviors?
Your backend can be humming perfectly, but if nobody can see it, it might as well not exist. The frontend is not just for display; it’s key to understanding the system’s complexity.
The Stage Page & Virtualization
The Stage page is the system’s main dashboard. It includes multiple components: Signal Feed (virtualized), Mission List, Filters, etc.
Key Technique: Virtualization
Rendering 1,000 events at once would crash the browser. Virtualization means rendering only the ~20 items currently visible and swapping content dynamically as you scroll. This achieves buttery smooth scrolling with 500+ events.
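The article doesn’t name a virtualization library; react-window is one common choice, and a hedged sketch of a virtualized feed looks like this:

// Virtualized signal feed: only the visible rows are mounted.
import { FixedSizeList } from "react-window";

function SignalFeed({ events }: { events: { id: string; title: string }[] }) {
  return (
    <FixedSizeList height={640} width="100%" itemCount={events.length} itemSize={72}>
      {({ index, style }) => <div style={style}>{events[index].title}</div>}
    </FixedSizeList>
  );
}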
The Pixel Art Office
This is the most eye-catching part—6 pixel-art agents in a cyberpunk office:
- Behavior States: Working / Chatting / Grabbing Coffee / Celebrating / Walking around.
- Sky Changes: Day / Dusk / Night (synced to real time).
- Whiteboard: Displays live OPS metrics.
This is visual candy—it doesn’t affect logic, but it’s the first thing that hooks users.
Mission Playback
Click on a mission and replay its execution like a video, watching the steps and events unfold one by one.
Author Reflection: You can absolutely debug the whole system just by looking at data in the Supabase dashboard. But if you want other people to see what your agents are up to, a good-looking frontend is essential.
Chapter 8: Launch Checklist & Operations
Section Core Question: How do you deploy this complex system to a production environment and ensure it runs stably and cost-effectively?
Database Migrations & Seed Scripts
Run SQL migration files in order, then run seed scripts to initialize data (core policies, trigger rules, relationship data).
Key Policy Configuration
At minimum, configure these upon launch:
- Auto-approve: Suggested: enabled.
- Daily Tweet Quota: Suggested: 5 (start conservative).
- Roundtable Policy: Suggested: max 5 conversations/day.
- Initiative Policy: Suggested: disabled initially; turn on once stable.
How Workers Actually Execute Steps
Workers follow a universal loop pattern:
- Check that the worker is enabled and quotas allow work.
- Fetch the next queued step.
- Atomic Claiming (key): Use UPDATE ... WHERE status='queued' to lock the task, preventing multiple workers from duplicating work.
- Execute the work.
- Mark success or failure.
Atomic Claiming: It’s like two people reaching for the last donut—the database ensures only one hand gets it. You don’t need a separate locking service; PostgreSQL handles it natively.
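With the Supabase client, the claim is a conditional update: the status filter guarantees only one concurrent worker can win the row. A sketch (started_at is an assumed column):

// Atomically claim a step: the .eq("status", "queued") filter means only
// one concurrent worker's update can match the row.
async function claimStep(sb, stepId) {
  const { data, error } = await sb
    .from("ops_mission_steps")
    .update({ status: "running", started_at: new Date().toISOString() })
    .eq("id", stepId)
    .eq("status", "queued")
    .select();
  if (error) throw error;
  return data && data.length > 0 ? data[0] : null; // null means another worker got it
}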
VPS Deployment & Environment Variables
Use systemd to manage worker processes for auto-restart on crash. Store all secrets in a .env file with chmod 600 (owner-only read).
Cost Breakdown
- LLM (Claude API): Usage-based, ~$10-20/month.
- VPS (2-core, 4 GB): $8/month fixed (Hetzner recommended).
- Vercel: $0 (free tier).
- Supabase: $0 (free tier).
- Total: $8/month fixed + LLM usage.
If you stick to 3 agents + 3-5 conversations per day, you can keep LLM costs under $5/month.
Conclusion & Reflections
This system isn’t perfect. Agent “free will” is mostly probabilistic uncertainty simulation, not true reasoning. The memory system is structured knowledge extraction, not genuine “understanding.” Relationship drift is small—it takes a long time to see significant changes.
But the system genuinely runs, and genuinely doesn’t need babysitting. 6 agents hold 10+ meetings a day, post tweets, write content, and learn from each other. Sometimes they even “argue” over disagreements—and the next day their affinity actually drops a tiny bit.
Practical Summary / Action Checklist
- Set Up the Environment: Prepare Supabase, Vercel, and VPS accounts.
- Create Core Tables: Establish the 4 core tables (Proposals, Missions, Steps, Events).
- Configure Policies: Set daily limits and auto-approval rules in ops_policy.
- Deploy the Heartbeat: Set up the Heartbeat API on Vercel and configure crontab on the VPS to call it every 5 minutes.
- Start Workers: Launch the roundtable-worker and x-autopost workers on the VPS using systemd.
- Observe & Tweak: Watch the event stream via the frontend Stage page and adjust policies based on what you see.
One-page Summary
| Module | Key Components | Core Logic | Cost/Resources |
|---|---|---|---|
| Data Layer | Supabase | 4 core tables form a closed loop | Free Tier |
| Scheduling Layer | Vercel (Heartbeat) | Evaluates triggers, processes queues every 5 mins | Free Tier |
| Execution Layer | VPS (Workers) | Atomically claims tasks, executes LLM calls | $8/mo (Hetzner) |
| Interaction Layer | Roundtable | 16 conversation formats, weighted speaking, relationship drift | LLM API |
| Intelligence Layer | Memory & Initiative | 5 memory types, 30% influence, experience triggers initiatives | LLM API |
| Presentation Layer | Next.js Frontend | Virtualized lists, pixel art office visualization | Vercel Free Tier |
Frequently Asked Questions (FAQ)
Q: Do I need to know how to code to build this system?
A: In principle, you don’t need to write low-level code from scratch, but you need to understand the logic of the code generated by your AI coding assistant and know how to deploy it to a server.
Q: Why not use the OpenAI Assistants API or LangChain?
A: To maintain maximum control and transparency, avoiding black-box dependencies. Using native PostgreSQL and Node.js workers with a rule engine is lighter and easier to debug.
Q: Will the agents go out of control?
A: The system has strict “Cap Gates” and a “Proposal Service” acting as a single entry point. You can disable specific features or actions in the backend at any time.
Q: How can I reduce LLM calling costs?
A: Reduce conversation frequency, lower the scan frequency of proactive triggers, or use cheaper models (like Haiku) for non-critical tasks.
Q: What specific business scenarios can this system handle?
A: It is primarily designed for digital scenarios like social media operations, content generation, data analysis, and competitor monitoring. You can adapt it to specific domains by modifying trigger rules and Worker logic.
Q: What happens if a Worker crashes?
A: Use the auto-restart feature of systemd, along with the “recover stuck tasks” mechanism in the heartbeat, ensuring the system has self-healing capabilities.
Q: Is relationship drift actually useful?
A: Although the drift magnitude is small, over the long term it leads to significant changes in conversation content and interaction patterns, increasing the realism and observational enjoyment of the system.
