Teaching an AI to Work in Shifts: How Long-Running Agents Keep Projects Alive Across Context Windows
Can a frontier model finish a week-long engineering task when its memory resets every hour?
Yes—if you give it shift notes, a feature checklist, and a reboot script instead of a blank prompt.
What This Post Answers
- ☾ Why do long-running agents forget everything when a new session starts?
- ☾ How does Anthropic’s two-prompt harness (initializer + coder) prevent “groundhog day” in multi-day projects?
- ☾ Which five files, four failure patterns, and three self-tests make the difference between endless loops and shipped code?
- ☾ How do you transplant the same skeleton to research, finance, or hardware workflows without adding new facts?
TL;DR (Executive Summary)
- Break the epic spec into 200+ end-to-end features stored in a JSON checklist; every item stays `false` until Puppeteer (or your domain’s equivalent) says `true`.
- Let an Initializer Agent run once: creates the Git repo, writes `init.sh`, logs the baseline commit.
- Every following shift is handled by a Coding Agent with a canned 7-step startup ritual: pull, read progress, launch services, smoke-test, pick the top `false` feature, code + test, commit + push.
- The checklist and Git log become the “context” that survives window resets; no single session needs to remember more than one feature.
- When every JSON flag flips to `true`, tag a release—no drama, no “I think we’re done” hallucinations.
Why Long-Running Agents Keep Failing Mid-Project
Core question: “Even with 200 k tokens, why does the model stall or re-write half the app after lunch?”
- ☾ Token compaction isn’t perfect. Compaction keeps the prompt small but may discard the exact nuance that explains why a helper function exists.
- ☾ No hand-over protocol. A fresh prompt sees files but not the failed experiments, dead ends, or “don’t touch this” comments.
- ☾ No finish line. Without a visible checklist, the model uses vibe cues (“looks like a chat web app”) to declare victory—often when core flows are still stubbed.
Author’s reflection: We once watched Claude build three different button components in three separate sessions because the second session thought the first “looked too simple.” A single JSON file with a unique id would have prevented the duplication.
The Two-Prompt Harness: Initializer vs. Coding Agent
Core question: “What’s the minimum structure that lets yesterday’s Agent hand off to today’s Agent without confusion?”
| Prompt Role | Runs When | Mission | Key Deliverables |
|---|---|---|---|
| Initializer | Session 1 | Lay the tracks | Git repo, init.sh, feature_list.json, claude-progress.txt, first commit |
| Coder | Session 2…N | One feature, one clean commit | Working code, updated JSON flag, new Git commit, progress line |
Both share the same system prompt and tool set—only the user prompt changes, keeping implementation overhead low.
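The outer loop that strings shifts together can itself be a few lines of shell. Here is a minimal sketch, assuming a headless CLI call such as `claude -p "<prompt>"` that runs one agent session to completion, plus two prompt templates stored at the hypothetical paths `prompts/initializer.md` and `prompts/coder.md`:

```bash
#!/usr/bin/env bash
# harness.sh: run the Initializer once, then Coder shifts until every feature passes.
set -euo pipefail

# First run only: lay the tracks (repo, init.sh, feature_list.json, progress file).
if [ ! -f feature_list.json ]; then
  claude -p "$(cat prompts/initializer.md)"
fi

# Every iteration is one "shift": a fresh context window working on exactly one feature.
until jq -e 'all(.passes)' feature_list.json > /dev/null; do
  claude -p "$(cat prompts/coder.md)"
done

echo "All feature flags are true; tag the release."
```

The loop never tells the Agent what to build; the checklist and the Git history do.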
Five Files That Replace Infinite Context
Core question: “Which artifacts let a blank-slate model reconstruct project state in under 60 seconds?”
1. `feature_list.json`

```json
[
  {
    "id": 3,
    "category": "functional",
    "description": "User can sign up with email and receive confirmation link",
    "steps": [
      "Click 'Sign Up'",
      "Enter email",
      "Submit",
      "Check inbox",
      "Click link",
      "See dashboard"
    ],
    "passes": false
  }
]
```
- ☾ Rule: only the `passes` field may be edited (see the jq sketch after this list).
- ☾ JSON beats Markdown because the model is less tempted to reformat or delete lines.
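To keep that rule easy to follow, the Coder can both pick the next feature and flip its flag through `jq` rather than rewriting the file by hand. A minimal sketch; the temp-file rewrite is an assumption about workflow, not part of the original harness:

```bash
# Step 1: find the first feature that has not passed yet.
NEXT_ID=$(jq '[.[] | select(.passes == false)][0].id' feature_list.json)

# Step 2 (only after the end-to-end test for that feature succeeds): flip its "passes" flag.
jq --argjson id "$NEXT_ID" \
   'map(if .id == $id then .passes = true else . end)' \
   feature_list.json > feature_list.tmp && mv feature_list.tmp feature_list.json
```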
2. `claude-progress.txt`

```
2025-11-22 09:12 init: create-react-app + express scaffold
2025-11-22 09:30 feat-1: add New Chat button, smoke test PASS
```

Append-only, plain English, sub-80-character lines—easy for `tail` and for the model.
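Appending an entry is deliberately boring; a one-liner like the following keeps the timestamp format shown above and stays under 80 characters (the message text is illustrative):

```bash
# One line per shift, newest last; `tail` gives the next Agent its recap.
echo "$(date '+%Y-%m-%d %H:%M') feat-2: wire /chat endpoint, smoke test PASS" >> claude-progress.txt
```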
3. init.sh
#!/usr/bin/env bash
set -e
npm install
npm run dev:frontend & # port 3000
npm run dev:backend & # port 4000
wait
The Coding Agent is instructed to run this at the start of every shift; environment mysteries disappear.
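Because `init.sh` blocks on `wait` while both dev servers run, a practical shift starts it in the background and polls the ports before touching the smoke test. A minimal sketch, assuming the ports used in the script above:

```bash
# Start the environment in the background, logging to a file instead of the prompt.
bash init.sh > init.log 2>&1 &

# Wait until both servers accept connections (any HTTP response counts).
for port in 3000 4000; do
  until curl -s -o /dev/null "http://localhost:$port"; do
    sleep 1
  done
done
echo "Environment up; safe to run the smoke test."
```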
4. `.gitignore`

```
node_modules/
*.log
.env.local
```
Prevents “oops, 200 MB commit” accidents that blow up context.
5. `tests/puppeteer.smoke.js`

```js
// ESM module: top-level await assumes "type": "module" in package.json (or a .mjs extension).
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();

// Core flow only: open the app, start a chat, reach the welcome state.
await page.goto('http://localhost:3000');
await page.click('[data-testid="new-chat-btn"]');
await page.waitForSelector('[data-testid="chat-welcome"]');

console.log('SMOKE PASS'); // the literal marker the Coder must report
await browser.close();
```
Exit code 0 = green light to proceed; non-zero triggers git bisect or rollback.
Author’s reflection: We started with a 30-line smoke test and realized the model would skip it when tired. Shrinking it to 10 lines and mandating “SMOKE PASS” in the output raised adherence from 40 % to 95 %.
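A simple way to enforce both signals, the exit code and the literal marker, is to gate the commit on them. A minimal sketch of that guard (the commit message is a placeholder):

```bash
set -o pipefail   # make the node exit code survive the pipe into tee

if node tests/puppeteer.smoke.js | tee smoke.out \
   && grep -q "SMOKE PASS" smoke.out; then
  git add -A && git commit -m "feat-<id>: <one-line description>" && git push
else
  echo "Smoke test failed: fix or roll back (git revert / git bisect) before committing." >&2
  exit 1
fi
```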
The 7-Step Startup Ritual That Saves Tokens
Core question: “How does the Coder Agent catch up without re-reading the entire repo?”
1. `pwd` → confirm working directory.
2. `git pull --rebase` → fetch last shift’s work.
3. `tail claude-progress.txt` → human-readable recap.
4. `bash init.sh` → start services.
5. `node tests/puppeteer.smoke.js` → fail fast if something broke overnight.
6. `jq '.[] | select(.passes==false) | .id' feature_list.json | head -1` → pick the top undone item.
7. Code, test, commit, push, append one line to `claude-progress.txt`.
Because steps are bash one-liners, the whole ritual consumes <300 tokens yet restores full situational awareness.
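Collected into a single script, steps 1–6 might look like the sketch below; step 7 is the shift’s actual coding work, so it stays with the Agent (the ten-second sleep is a crude stand-in for the port-polling loop shown earlier):

```bash
#!/usr/bin/env bash
# shift_start.sh: steps 1-6 of the startup ritual.
set -eu

pwd                                      # 1. confirm working directory
git pull --rebase                        # 2. fetch last shift's work
tail -n 20 claude-progress.txt           # 3. human-readable recap
bash init.sh > init.log 2>&1 &           # 4. start services in the background
sleep 10                                 #    crude wait for the dev servers
node tests/puppeteer.smoke.js            # 5. fail fast if something broke overnight
jq '.[] | select(.passes==false) | .id' feature_list.json | head -1   # 6. top undone feature id
```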
Failure Modes and Counter-Measures
Core question: “What goes wrong most often, and which file stops it?”
| Failure Mode | Symptom | Fix Embedded in |
|---|---|---|
| Declares victory too soon | UI “looks done” while backend 404s | feature_list.json requires all flags true |
| Leaves bugs behind | Next shift spends an hour fixing dev server | puppeteer.smoke.js must exit 0 before any commit |
| Forgets how to run the app | Agent tries random npm commands | init.sh is read-only; Coder Agent told to run it first |
| Re-implements features | “I didn’t see a button, so I built one” | Git log + progress file show what landed |
Author’s reflection: We once thought stronger “system prompt” reminders would curb premature celebration. They didn’t. A hard JSON gate did.
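One way to make that gate physical rather than rhetorical is a Git pre-commit hook that rejects schema edits and red smoke tests outright. A minimal sketch, assuming the file layout above and that the dev servers are already running (save as `.git/hooks/pre-commit` and make it executable):

```bash
#!/usr/bin/env bash
# Reject the commit if the checklist schema changed or the smoke test is red.
set -euo pipefail

# The set of feature ids must never change; only "passes" values may flip.
EXPECTED_IDS=$(git show HEAD:feature_list.json | jq -c '[.[].id]')
CURRENT_IDS=$(jq -c '[.[].id]' feature_list.json)
if [ "$EXPECTED_IDS" != "$CURRENT_IDS" ]; then
  echo "pre-commit: feature ids changed; only 'passes' flags may be edited." >&2
  exit 1
fi

# No green smoke test, no commit.
node tests/puppeteer.smoke.js || { echo "pre-commit: smoke test failed." >&2; exit 1; }
```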
Case Study: Building a claude.ai Clone in 72 Hours
Core question: “Show me the harness in action—what actually shipped?”
Hour 0 – Initializer Shift
- ☾ Prompt: “Build a clone of claude.ai.”
- ☾ Output:
  - 237 end-to-end features written to `feature_list.json` (all `false`)
  - React + Express boiler-plate, first commit `0a1b2c3`
  - `init.sh` spins up localhost:3000/4000
Hour 4 – Coder Shift #1
- ☾ Selects id-1 “New chat button”
- ☾ Writes component, adds `data-testid`, Puppeteer screenshot PASS
- ☾ Commits feat-1, pushes, logs one-liner to `claude-progress.txt`
Hour 8 – Coder Shift #2
- ☾ Smoke test green
- ☾ Picks id-2 “Type query and receive AI reply”
- ☾ Adds fetch wrapper, /chat endpoint, WebSocket plumbing
- ☾ Puppeteer drives real message; feature flag flips to `true`
… (repeat until) …
Hour 72 – Coder Shift #18
- ☾ Remaining 5 features: theme toggle, mobile nav, 404 page, error boundary, meta tags
- ☾ All tests green; tag `v1.0.0` pushed
- ☾ No human code review until the tag—yet main branch always deployable
Author’s reflection: The surprise wasn’t speed; it was stability. Main branch stayed green because Agents couldn’t “see” the finish line until every measurable box was ticked.
Adapting the Harness Beyond Web Apps
Core question: “Can the same skeleton survive in science, finance, or hardware?”
| Domain | Keep | Swap |
|---|---|---|
| Scientific pipeline | JSON checklist of experiments | init.sh becomes conda env + lab equipment driver |
| Quant model research | checklist → data cleaning → factor → back-test → metric | Puppeteer → PyTorch metric assertion |
| RTL chip design | features = module tests | smoke test = Verilator waveform diff |
Universal rule:
- ☾ One feature = one script that exits 0 or 1 (see the wrapper sketch after this list).
- ☾ Environment cold-starts in <5 min.
- ☾ All artifacts text-based for Git.
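Here is a sketch of what “one feature = one script that exits 0 or 1” can look like outside the browser: a thin wrapper that runs whatever check the domain provides and flips the flag only on success. The `checks/<id>.sh` layout is an assumption for illustration, not part of the original harness:

```bash
#!/usr/bin/env bash
# check_feature.sh <id>: run checks/<id>.sh (Puppeteer, pytest, Verilator diff, ...)
# and mark the feature as passing only if that check exits 0.
set -euo pipefail
ID="$1"

if bash "checks/${ID}.sh"; then
  jq --argjson id "$ID" 'map(if .id == $id then .passes = true else . end)' \
     feature_list.json > feature_list.tmp && mv feature_list.tmp feature_list.json
  echo "feature $ID PASS"
else
  echo "feature $ID FAIL" >&2
  exit 1
fi
```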
One-Page Overview (Print & Pin)
- JSON checklist is the single source of truth—never let the Agent edit descriptions.
- Initializer runs once; every subsequent session is a Coder that follows the 7-step ritual.
- Smoke test must pass before any commit; main branch stays green by construction.
- Git log + progress file = context-window insurance.
- When every `passes` flag is `true`, you’re done—no estimation, no drama.
Action Checklist / Implementation Steps
- ☾ [ ] Write a one-sentence project goal.
- ☾ [ ] Expand it into 50–300 feature rows in JSON; mark all `"passes": false`.
- ☾ [ ] Create the repo; add `.gitignore`, `init.sh`, and the first commit.
- ☾ [ ] Author a 5–15-line smoke test that exits 0 when core flows work.
- ☾ [ ] Compose the Coder prompt with the 7-step bash ritual; save it as a template.
- ☾ [ ] Spin up the Initializer once; push all scaffold files.
- ☾ [ ] Queue as many Coder shifts as needed; require a push before shutdown.
- ☾ [ ] Tag the release only when `jq '.[].passes' feature_list.json | sort -u` prints a single `true` (a one-line gate for this follows the list).
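That final checkbox can double as the release gate itself. A minimal sketch using the exact command from the checklist (the tag name follows the case study):

```bash
# Ship only when the checklist contains no remaining "passes": false entries.
if [ "$(jq '.[].passes' feature_list.json | sort -u)" = "true" ]; then
  git tag v1.0.0 && git push --tags
else
  echo "Not done yet: $(jq '[.[] | select(.passes == false)] | length' feature_list.json) features remaining."
fi
```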
Frequently Asked Questions
- How big can the JSON checklist grow before tokens explode?
  Thousands of lines are fine—Agents read it with `jq` filters, not by loading the entire file into narrative memory.
- What if two features depend on each other?
  List them in dependency order; the Coder always picks the first `false` item, so prerequisites naturally come first.
- Does this work with private GitHub repos?
  Yes, give the Agent a token scoped to repo access; the harness remains identical.
- Can I swap Puppeteer for Cypress or Playwright?
  Absolutely—any tool that returns exit code 0 on success fits the smoke-test slot.
- How do you stop an Agent from editing the JSON schema?
  A strong negative prompt (“Changing keys or deleting rows is strictly forbidden”) plus a regex pre-commit hook, like the sketch in the failure-modes section, if you want belt-and-suspenders.
- Is a database needed for progress tracking?
  Not for state—the JSON flags plus the Git log are enough. Use a database only when the product (not the process) demands it.
- Could several Agents run in parallel?
  Today the harness is serial. To parallelize, shard the checklist by category and assign non-overlapping file sets, then merge back via PR (a jq sharding sketch follows). That’s future work explicitly mentioned in the paper.
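To sketch the sharding idea from that last answer: split the checklist by category so each parallel Agent owns one shard, one branch, and one non-overlapping file set. The shard file naming here is an assumption for illustration:

```bash
# Write one shard per category, e.g. features.functional.json, features.ui.json, ...
for cat in $(jq -r '[.[].category] | unique[]' feature_list.json); do
  jq --arg cat "$cat" '[.[] | select(.category == $cat)]' \
     feature_list.json > "features.${cat}.json"
done
# Each Agent works its own shard and branch; results are merged back into
# feature_list.json via PR.
```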
Author’s closing thought: We used to treat long context like a magic backpack—stuff everything inside and hope the model finds it later. The shift-log system flips that idea: travel light, but leave breadcrumbs anyone can follow. Once the trail is predictable, even a 32 k window can stretch across weeks of real time—and your project ships while the backpack stays empty.
