Teaching an AI to Work in Shifts: How Long-Running Agents Keep Projects Alive Across Context Windows
Can a frontier model finish a week-long engineering task when its memory resets every hour?
Yes—if you give it shift notes, a feature checklist, and a reboot script instead of a blank prompt.
What This Post Answers
- ☾ Why do long-running agents forget everything when a new session starts?
- ☾ How does Anthropic’s two-prompt harness (initializer + coder) prevent “groundhog day” in multi-day projects?
- ☾ Which five files, four failure patterns, and three self-tests make the difference between endless loops and shipped code?
- ☾ How do you transplant the same skeleton to research, finance, or hardware workflows without adding new facts?
TL;DR (Executive Summary)
- Break the epic spec into 200+ end-to-end features stored in a JSON checklist; every item stays `false` until Puppeteer (or your domain’s equivalent) says `true`.
- Let an Initializer Agent run once: creates the Git repo, writes `init.sh`, logs the baseline commit.
- Every following shift is handled by a Coding Agent with a canned 7-step startup ritual: pull, read progress, launch services, smoke-test, pick the top `false` feature, code + test, commit + push.
- The checklist and Git log become the “context” that survives window resets; no single session needs to remember more than one feature.
- When every JSON flag flips to `true`, tag a release—no drama, no “I think we’re done” hallucinations.
Why Long-Running Agents Keep Failing Mid-Project
Core question: “Even with 200 k tokens, why does the model stall or re-write half the app after lunch?”
- ☾ Token compaction isn’t perfect. Compaction keeps the prompt small but may discard the exact nuance that explains why a helper function exists.
- ☾ No hand-over protocol. A fresh prompt sees files but not the failed experiments, dead ends, or “don’t touch this” comments.
- ☾ No finish line. Without a visible checklist, the model uses vibe cues (“looks like a chat web app”) to declare victory—often when core flows are still stubbed.
Author’s reflection: We once watched Claude build three different button components in three separate sessions because the second session thought the first “looked too simple.” A single JSON file with a unique id would have prevented the duplication.
The Two-Prompt Harness: Initializer vs. Coding Agent
Core question: “What’s the minimum structure that lets yesterday’s Agent hand off to today’s Agent without confusion?”
| Prompt Role | Runs When | Mission | Key Deliverables |
|---|---|---|---|
| Initializer | Session 1 | Lay the tracks | Git repo, init.sh, feature_list.json, claude-progress.txt, first commit |
| Coder | Session 2…N | One feature, one clean commit | Working code, updated JSON flag, new Git commit, progress line |
Both share the same system prompt and tool set—only the user prompt changes, keeping implementation overhead low.
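The outer loop that strings shifts together can itself be a few lines of shell. Here is a minimal sketch, assuming a headless CLI call such as `claude -p "<prompt>"` that runs one agent session to completion, plus two prompt templates stored at the hypothetical paths `prompts/initializer.md` and `prompts/coder.md`:

```bash
#!/usr/bin/env bash
# harness.sh: run the Initializer once, then Coder shifts until every feature passes.
set -euo pipefail

# First run only: lay the tracks (repo, init.sh, feature_list.json, progress file).
if [ ! -f feature_list.json ]; then
  claude -p "$(cat prompts/initializer.md)"
fi

# Every iteration is one "shift": a fresh context window working on exactly one feature.
until jq -e 'all(.passes)' feature_list.json > /dev/null; do
  claude -p "$(cat prompts/coder.md)"
done

echo "All feature flags are true; tag the release."
```

The loop never tells the Agent what to build; the checklist and the Git history do.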
Five Files That Replace Infinite Context
Core question: “Which artifacts let a blank-slate model reconstruct project state in under 60 seconds?”
1. `feature_list.json`

```json
[
  {
    "id": 3,
    "category": "functional",
    "description": "User can sign up with email and receive confirmation link",
    "steps": [
      "Click 'Sign Up'",
      "Enter email",
      "Submit",
      "Check inbox",
      "Click link",
      "See dashboard"
    ],
    "passes": false
  }
]
```
- ☾ Rule: only the `passes` field may be edited (see the jq sketch after this list).
- ☾ JSON beats Markdown because the model is less tempted to reformat or delete lines.
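To keep that rule easy to follow, the Coder can both pick the next feature and flip its flag through `jq` rather than rewriting the file by hand. A minimal sketch; the temp-file rewrite is an assumption about workflow, not part of the original harness:

```bash
# Step 1: find the first feature that has not passed yet.
NEXT_ID=$(jq '[.[] | select(.passes == false)][0].id' feature_list.json)

# Step 2 (only after the end-to-end test for that feature succeeds): flip its "passes" flag.
jq --argjson id "$NEXT_ID" \
   'map(if .id == $id then .passes = true else . end)' \
   feature_list.json > feature_list.tmp && mv feature_list.tmp feature_list.json
```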
2. `claude-progress.txt`

```
2025-11-22 09:12 init: create-react-app + express scaffold
2025-11-22 09:30 feat-1: add New Chat button, smoke test PASS
```

Append-only, plain English, sub-80-character lines—easy for `tail` and for the model.
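Appending an entry is deliberately boring; a one-liner like the following keeps the timestamp format shown above and stays under 80 characters (the message text is illustrative):

```bash
# One line per shift, newest last; `tail` gives the next Agent its recap.
echo "$(date '+%Y-%m-%d %H:%M') feat-2: wire /chat endpoint, smoke test PASS" >> claude-progress.txt
```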
3. init.sh
#!/usr/bin/env bash
set -e
npm install
npm run dev:frontend & # port 3000
npm run dev:backend & # port 4000
wait
The Coding Agent is instructed to run this at the start of every shift; environment mysteries disappear.
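Because `init.sh` blocks on `wait` while both dev servers run, a practical shift starts it in the background and polls the ports before touching the smoke test. A minimal sketch, assuming the ports used in the script above:

```bash
# Start the environment in the background, logging to a file instead of the prompt.
bash init.sh > init.log 2>&1 &

# Wait until both servers accept connections (any HTTP response counts).
for port in 3000 4000; do
  until curl -s -o /dev/null "http://localhost:$port"; do
    sleep 1
  done
done
echo "Environment up; safe to run the smoke test."
```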
4. `.gitignore`

```
node_modules/
*.log
.env.local
```
Prevents “oops, 200 MB commit” accidents that blow up context.
5. `tests/puppeteer.smoke.js`

```js
// ESM module: top-level await assumes "type": "module" in package.json (or a .mjs extension).
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();

// Core flow only: open the app, start a chat, reach the welcome state.
await page.goto('http://localhost:3000');
await page.click('[data-testid="new-chat-btn"]');
await page.waitForSelector('[data-testid="chat-welcome"]');

console.log('SMOKE PASS'); // the literal marker the Coder must report
await browser.close();
```
Exit code 0 = green light to proceed; non-zero triggers git bisect or rollback.
Author’s reflection: We started with a 30-line smoke test and realized the model would skip it when tired. Shrinking it to 10 lines and mandating “SMOKE PASS” in the output raised adherence from 40 % to 95 %.
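A simple way to enforce both signals, the exit code and the literal marker, is to gate the commit on them. A minimal sketch of that guard (the commit message is a placeholder):

```bash
set -o pipefail   # make the node exit code survive the pipe into tee

if node tests/puppeteer.smoke.js | tee smoke.out \
   && grep -q "SMOKE PASS" smoke.out; then
  git add -A && git commit -m "feat-<id>: <one-line description>" && git push
else
  echo "Smoke test failed: fix or roll back (git revert / git bisect) before committing." >&2
  exit 1
fi
```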
The 7-Step Startup Ritual That Saves Tokens
Core question: “How does the Coder Agent catch up without re-reading the entire repo?”
1. `pwd` → confirm working directory.
2. `git pull --rebase` → fetch last shift’s work.
3. `tail claude-progress.txt` → human-readable recap.
4. `bash init.sh` → start services.
5. `node tests/puppeteer.smoke.js` → fail fast if something broke overnight.
6. `jq '.[] | select(.passes==false) | .id' feature_list.json | head -1` → pick the top undone item.
7. Code, test, commit, push, append one line to `claude-progress.txt`.
Because steps are bash one-liners, the whole ritual consumes <300 tokens yet restores full situational awareness.
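Collected into a single script, steps 1–6 might look like the sketch below; step 7 is the shift’s actual coding work, so it stays with the Agent (the ten-second sleep is a crude stand-in for the port-polling loop shown earlier):

```bash
#!/usr/bin/env bash
# shift_start.sh: steps 1-6 of the startup ritual.
set -eu

pwd                                      # 1. confirm working directory
git pull --rebase                        # 2. fetch last shift's work
tail -n 20 claude-progress.txt           # 3. human-readable recap
bash init.sh > init.log 2>&1 &           # 4. start services in the background
sleep 10                                 #    crude wait for the dev servers
node tests/puppeteer.smoke.js            # 5. fail fast if something broke overnight
jq '.[] | select(.passes==false) | .id' feature_list.json | head -1   # 6. top undone feature id
```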
Failure Modes and Counter-Measures
Core question: “What goes wrong most often, and which file stops it?”
| Failure Mode | Symptom | Fix Embedded in |
|---|---|---|
| Declares victory too soon | UI “looks done” while backend 404s | feature_list.json requires all flags true |
| Leaves bugs behind | Next shift spends an hour fixing dev server | puppeteer.smoke.js must exit 0 before any commit |
| Forgets how to run the app | Agent tries random npm commands | init.sh is read-only; Coder Agent told to run it first |
| Re-implements features | “I didn’t see a button, so I built one” | Git log + progress file show what landed |
Author’s reflection: We once thought stronger “system prompt” reminders would curb premature celebration. They didn’t. A hard JSON gate did.
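One way to make that gate physical rather than rhetorical is a Git pre-commit hook that rejects schema edits and red smoke tests outright. A minimal sketch, assuming the file layout above and that the dev servers are already running (save as `.git/hooks/pre-commit` and make it executable):

```bash
#!/usr/bin/env bash
# Reject the commit if the checklist schema changed or the smoke test is red.
set -euo pipefail

# The set of feature ids must never change; only "passes" values may flip.
EXPECTED_IDS=$(git show HEAD:feature_list.json | jq -c '[.[].id]')
CURRENT_IDS=$(jq -c '[.[].id]' feature_list.json)
if [ "$EXPECTED_IDS" != "$CURRENT_IDS" ]; then
  echo "pre-commit: feature ids changed; only 'passes' flags may be edited." >&2
  exit 1
fi

# No green smoke test, no commit.
node tests/puppeteer.smoke.js || { echo "pre-commit: smoke test failed." >&2; exit 1; }
```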
Case Study: Building a claude.ai Clone in 72 Hours
Core question: “Show me the harness in action—what actually shipped?”
Hour 0 – Initializer Shift
- ☾ Prompt: “Build a clone of claude.ai.”
- ☾ Output:
  - 237 end-to-end features written to `feature_list.json` (all `false`)
  - React + Express boiler-plate, first commit `0a1b2c3`
  - `init.sh` spins up localhost:3000/4000
Hour 4 – Coder Shift #1
- ☾ Selects id-1 “New chat button”
- ☾ Writes component, adds `data-testid`, Puppeteer screenshot PASS
- ☾ Commits feat-1, pushes, logs one-liner to `claude-progress.txt`
Hour 8 – Coder Shift #2
- ☾ Smoke test green
- ☾ Picks id-2 “Type query and receive AI reply”
- ☾ Adds fetch wrapper, /chat endpoint, WebSocket plumbing
- ☾ Puppeteer drives real message; feature flag flips to `true`
… (repeat until) …
Hour 72 – Coder Shift #18
- ☾ Remaining 5 features: theme toggle, mobile nav, 404 page, error boundary, meta tags
- ☾ All tests green; tag `v1.0.0` pushed
- ☾ No human code review until the tag—yet main branch always deployable
Author’s reflection: The surprise wasn’t speed; it was stability. Main branch stayed green because Agents couldn’t “see” the finish line until every measurable box was ticked.
Adapting the Harness Beyond Web Apps
Core question: “Can the same skeleton survive in science, finance, or hardware?”
| Domain | Keep | Swap |
|---|---|---|
| Scientific pipeline | JSON checklist of experiments | init.sh becomes conda env + lab equipment driver |
| Quant model research | checklist → data cleaning → factor → back-test → metric | Puppeteer → PyTorch metric assertion |
| RTL chip design | features = module tests | smoke test = Verilator waveform diff |
Universal rule:
- ☾ One feature = one script that exits 0 or 1 (see the wrapper sketch after this list).
- ☾ Environment cold-starts in <5 min.
- ☾ All artifacts text-based for Git.
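Here is a sketch of what “one feature = one script that exits 0 or 1” can look like outside the browser: a thin wrapper that runs whatever check the domain provides and flips the flag only on success. The `checks/<id>.sh` layout is an assumption for illustration, not part of the original harness:

```bash
#!/usr/bin/env bash
# check_feature.sh <id>: run checks/<id>.sh (Puppeteer, pytest, Verilator diff, ...)
# and mark the feature as passing only if that check exits 0.
set -euo pipefail
ID="$1"

if bash "checks/${ID}.sh"; then
  jq --argjson id "$ID" 'map(if .id == $id then .passes = true else . end)' \
     feature_list.json > feature_list.tmp && mv feature_list.tmp feature_list.json
  echo "feature $ID PASS"
else
  echo "feature $ID FAIL" >&2
  exit 1
fi
```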
One-Page Overview (Print & Pin)
- JSON checklist is the single source of truth—never let the Agent edit descriptions.
- Initializer runs once; every subsequent session is a Coder that follows the 7-step ritual.
- Smoke test must pass before any commit; main branch stays green by construction.
- Git log + progress file = context-window insurance.
- When every `passes` flag is `true`, you’re done—no estimation, no drama.
Action Checklist / Implementation Steps
- ☾ [ ] Write a one-sentence project goal.
- ☾ [ ] Expand it into 50–300 feature rows in JSON; mark all `"passes": false`.
- ☾ [ ] Create the repo; add `.gitignore`, `init.sh`, and the first commit.
- ☾ [ ] Author a 5–15-line smoke test that exits 0 when core flows work.
- ☾ [ ] Compose the Coder prompt with the 7-step bash ritual; save it as a template.
- ☾ [ ] Spin up the Initializer once; push all scaffold files.
- ☾ [ ] Queue as many Coder shifts as needed; require a push before shutdown.
- ☾ [ ] Tag the release only when `jq '.[].passes' feature_list.json | sort -u` prints a single `true` (a one-line gate for this follows the list).
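That final checkbox can double as the release gate itself. A minimal sketch using the exact command from the checklist (the tag name follows the case study):

```bash
# Ship only when the checklist contains no remaining "passes": false entries.
if [ "$(jq '.[].passes' feature_list.json | sort -u)" = "true" ]; then
  git tag v1.0.0 && git push --tags
else
  echo "Not done yet: $(jq '[.[] | select(.passes == false)] | length' feature_list.json) features remaining."
fi
```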
Frequently Asked Questions
- How big can the JSON checklist grow before tokens explode?
  Thousands of lines are fine—Agents read it with `jq` filters, not by loading the entire file into narrative memory.
- What if two features depend on each other?
  List them in dependency order; the Coder always picks the first `false` item, so prerequisites naturally come first.
- Does this work with private GitHub repos?
  Yes, give the Agent a token scoped to repo access; the harness remains identical.
- Can I swap Puppeteer for Cypress or Playwright?
  Absolutely—any tool that returns exit code 0 on success fits the smoke-test slot.
- How do you stop an Agent from editing the JSON schema?
  A strong negative prompt (“Changing keys or deleting rows is strictly forbidden”) plus a regex pre-commit hook, like the sketch in the failure-modes section, if you want belt-and-suspenders.
- Is a database needed for progress tracking?
  Not for state—the JSON flags plus the Git log are enough. Use a database only when the product (not the process) demands it.
- Could several Agents run in parallel?
  Today the harness is serial. To parallelize, shard the checklist by category and assign non-overlapping file sets, then merge back via PR (a jq sharding sketch follows). That’s future work explicitly mentioned in the paper.
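To sketch the sharding idea from that last answer: split the checklist by category so each parallel Agent owns one shard, one branch, and one non-overlapping file set. The shard file naming here is an assumption for illustration:

```bash
# Write one shard per category, e.g. features.functional.json, features.ui.json, ...
for cat in $(jq -r '[.[].category] | unique[]' feature_list.json); do
  jq --arg cat "$cat" '[.[] | select(.category == $cat)]' \
     feature_list.json > "features.${cat}.json"
done
# Each Agent works its own shard and branch; results are merged back into
# feature_list.json via PR.
```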
Author’s closing thought: We used to treat long context like a magic backpack—stuff everything inside and hope the model finds it later. The shift-log system flips that idea: travel light, but leave breadcrumbs anyone can follow. Once the trail is predictable, even a 32 k window can stretch across weeks of real time—and your project ships while the backpack stays empty.
