Teaching an AI to Work in Shifts: How Long-Running Agents Keep Projects Alive Across Context Windows
Can a frontier model finish a week-long engineering task when its memory resets every hour?
Yes—if you give it shift notes, a feature checklist, and a reboot script instead of a blank prompt.
What This Post Answers
☾ Why do long-running agents forget everything when a new session starts?
☾ How does Anthropic’s two-prompt harness (initializer + coder) prevent “groundhog day” in multi-day projects?
☾ Which five files, four failure patterns, and three self-tests make the difference between endless loops and shipped code?
☾ How do you transplant the same skeleton to research, finance, or hardware workflows without adding new facts?
TL;DR (Executive Summary)
1. Break the epic spec into 200+ end-to-end features stored in a JSON checklist; every item stays false until Puppeteer (or your domain’s equivalent) says true.
2. Let an Initializer Agent run once: it creates the Git repo, writes init.sh, and logs the baseline commit.
3. Every following shift is handled by a Coding Agent with a canned 7-step startup ritual: pull, read progress, launch services, smoke-test, pick the top false feature, code+test, commit+push.
4. The checklist and Git log become the “context” that survives window resets; no single session needs to remember more than one feature.
5. When every JSON flag flips to true, tag a release: no drama, no “I think we’re done” hallucinations.
Why Long-Running Agents Keep Failing Mid-Project
Core question: “Even with 200 k tokens, why does the model stall or re-write half the app after lunch?”
☾ Token compaction isn’t perfect. Compaction keeps the prompt small but may discard the exact nuance that explains why a helper function exists.
☾ No hand-over protocol. A fresh prompt sees files but not the failed experiments, dead ends, or “don’t touch this” comments.
☾ No finish line. Without a visible checklist, the model uses vibe cues (“looks like a chat web app”) to declare victory, often when core flows are still stubbed.
Author’s reflection: We once watched Claude build three different button components in three separate sessions because the second session thought the first “looked too simple.” A single JSON file with a unique id would have prevented the duplication.
The Two-Prompt Harness: Initializer vs. Coding Agent
Core question: “What’s the minimum structure that lets yesterday’s Agent hand off to today’s Agent without confusion?”
The Initializer Agent and the Coding Agent share the same system prompt and tool set; only the user prompt changes, which keeps implementation overhead low.
Five Files That Replace Infinite Context
Core question: “Which artifacts let a blank-slate model reconstruct project state in under 60 seconds?”
1. feature_list.json
☾ Rule: only the passes field may be edited.
☾ JSON beats Markdown because the model is less tempted to reformat or delete lines.
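To make the "only passes may change" rule concrete, here is a minimal Python sketch of a guard that compares the checklist before and after a shift. The row layout is an assumption: the post only names the passes flag, an id, and descriptions, so the exact keys, the integer ids, and the function name illegal_changes are hypothetical.

```python
import copy

# Hypothetical checklist rows. Only "passes" is named explicitly in the post;
# the "id" and "description" keys and their values are assumed for illustration.
BASELINE = [
    {"id": 1, "description": "New chat button", "passes": False},
    {"id": 2, "description": "Type query and receive AI reply", "passes": False},
]


def illegal_changes(before, after):
    """Return human-readable violations of the 'only passes may change' rule."""
    problems = []
    if len(before) != len(after):
        # Rows were added or deleted, so a row-by-row comparison is meaningless.
        problems.append(f"row count changed: {len(before)} -> {len(after)}")
        return problems
    for old, new in zip(before, after):
        if set(old) != set(new):
            problems.append(f"id {old.get('id')}: keys changed")
            continue
        for key in old:
            if key != "passes" and old[key] != new[key]:
                problems.append(f"id {old.get('id')}: field '{key}' was edited")
        if not isinstance(new.get("passes"), bool):
            problems.append(f"id {old.get('id')}: passes is not a boolean")
    return problems


if __name__ == "__main__":
    edited = copy.deepcopy(BASELINE)
    edited[0]["passes"] = True            # allowed: flipping the flag
    edited[1]["description"] = "Rewritten"  # not allowed: editing a description
    for problem in illegal_changes(BASELINE, edited):
        print(problem)
```

A check like this could run in a pre-commit hook or at the end of a shift; an empty list means the Agent touched nothing except the flags.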
2. claude-progress.txt
Append-only, plain English, sub-80-character lines—easy for tail and for the model.
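One way to keep the progress file append-only and short-lined is to route every write through a small helper. The sketch below is an assumption about how that could look, not the harness's actual code; the names append_progress and recap are hypothetical, and the sample log entries are invented.

```python
from pathlib import Path

MAX_LINE = 79  # the post asks for sub-80-character lines


def append_progress(path, message):
    """Append one short line to the progress log without touching earlier lines."""
    line = " ".join(message.split())  # collapse newlines and tabs into single spaces
    if not line:
        raise ValueError("progress entry is empty")
    if len(line) > MAX_LINE:
        raise ValueError(f"progress entry is {len(line)} chars; keep it under 80")
    # Mode "a" only ever adds to the end of the file, so history is preserved.
    with Path(path).open("a", encoding="utf-8") as log:
        log.write(line + "\n")
    return line


def recap(path, count=10):
    """Return the last `count` lines, similar to what `tail` shows."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return lines[-count:]
```

Rejecting long entries instead of truncating them is a design choice in this sketch: a loud failure nudges the Agent to write a tighter summary rather than silently losing the end of the note.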
3. init.sh
The Coding Agent is instructed to run this at the start of every shift; environment mysteries disappear.
4. .gitignore
Prevents “oops, 200 MB commit” accidents that blow up context.
5. tests/puppeteer.smoke.js
Exit code 0 = green light to proceed; non-zero triggers git bisect or rollback.
Author’s reflection: We started with a 30-line smoke test and realized the model would skip it when tired. Shrinking it to 10 lines and mandating “SMOKE PASS” in the output raised adherence from 40 % to 95 %.
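The gate described above (exit code 0 plus a mandatory "SMOKE PASS" marker) can be sketched in a few lines of Python. This is an illustrative wrapper under stated assumptions, not the harness's own code: the function name smoke_gate is hypothetical, and the demo uses a stand-in command so it runs without Node or Puppeteer installed.

```python
import subprocess
import sys


def smoke_gate(command, marker="SMOKE PASS"):
    """Return True only if the command exits 0 AND prints the marker to stdout."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0 and marker in result.stdout


if __name__ == "__main__":
    # Stand-in for the real smoke test; assumed here so the sketch is runnable anywhere.
    demo_command = [sys.executable, "-c", "print('SMOKE PASS')"]
    print("green" if smoke_gate(demo_command) else "red")
```

In a real project the command would presumably be ["node", "tests/puppeteer.smoke.js"]. Requiring both signals means a test that crashes after printing the marker, or exits cleanly without running, still counts as red.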
The 7-Step Startup Ritual That Saves Tokens
Core question: “How does the Coder Agent catch up without re-reading the entire repo?”
1. pwd → confirm working directory.
2. git pull --rebase → fetch last shift’s work.
3. tail claude-progress.txt → human-readable recap.
4. bash init.sh → start services.
5. node tests/puppeteer.smoke.js → fail fast if something broke overnight.
6. jq '.[] | select(.passes==false) | .id' feature_list.json | head -1 → pick top undone item.
7. Code, test, commit, push, append one line to claude-progress.txt.
Because steps are bash one-liners, the whole ritual consumes <300 tokens yet restores full situational awareness.
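For readers who prefer to see the selection step outside of jq, the following Python sketch mirrors the filter in the ritual: walk the checklist in file order and return the id of the first row whose passes flag is false. The sample rows and the function name next_feature_id are assumptions made for illustration.

```python
import json


def next_feature_id(checklist_json):
    """Return the id of the first feature with passes == false, or None if all pass."""
    features = json.loads(checklist_json)
    for feature in features:
        # `is False` matches jq's select(.passes==false): rows with a missing
        # or null flag are skipped rather than treated as unfinished.
        if feature.get("passes") is False:
            return feature["id"]
    return None


if __name__ == "__main__":
    # Hypothetical checklist contents; a real run would read feature_list.json.
    sample = json.dumps([
        {"id": 1, "description": "New chat button", "passes": True},
        {"id": 2, "description": "Type query and receive AI reply", "passes": False},
        {"id": 3, "description": "Theme toggle", "passes": False},
    ])
    print(next_feature_id(sample))
```

Because the choice depends only on file order, two sessions that read the same checklist will always agree on what to work on next.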
Failure Modes and Counter-Measures
Core question: “What goes wrong most often, and which file stops it?”
Author’s reflection: We once thought stronger “system prompt” reminders would curb premature celebration. They didn’t. A hard JSON gate did.
Case Study: Building a claude.ai Clone in 72 Hours
Core question: “Show me the harness in action—what actually shipped?”
Hour 0 – Initializer Shift
☾ Prompt: “Build a clone of claude.ai.”
☾ Output:
– 237 end-to-end features written to feature_list.json (all false)
– React + Express boilerplate, first commit 0a1b2c3
– init.sh spins up localhost:3000/4000
Hour 4 – Coder Shift #1
☾ Selects id-1 “New chat button”
☾ Writes component, adds data-testid, Puppeteer screenshot PASS
☾ Commit feat-1, pushes, logs one-liner to claude-progress.txt
Hour 8 – Coder Shift #2
☾ Smoke test green
☾ Picks id-2 “Type query and receive AI reply”
☾ Adds fetch wrapper, /chat endpoint, WebSocket plumbing
☾ Puppeteer drives real message; feature flag flips to true
… (repeat until) …
Hour 72 – Coder Shift #18
☾ Remaining 5 features: theme toggle, mobile nav, 404 page, error boundary, meta tags
☾ All tests green; tag v1.0.0 pushed
☾ No human code review until tag, yet main branch always deployable
Author’s reflection: The surprise wasn’t speed; it was stability. Main branch stayed green because Agents couldn’t “see” the finish line until every measurable box was ticked.
Adapting the Harness Beyond Web Apps
Core question: “Can the same skeleton survive in science, finance, or hardware?”
Universal rule:
☾ One feature = one script that exits 0 or 1.
☾ Environment cold-starts in <5 min.
☾ All artifacts text-based for Git.
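The "one feature = one script that exits 0 or 1" rule is what makes the skeleton portable, because the harness never needs to know what a check does, only how it exited. A minimal Python sketch of that idea follows; the function name run_checks, the checks mapping, and the demo rows are hypothetical, and the demo commands are stand-ins for real domain scripts.

```python
import subprocess
import sys


def run_checks(features, checks):
    """Return a copy of the checklist with passes set from each check's exit code.

    `checks` maps a feature id to the command (argument list) that verifies it.
    Features without a registered check keep their current flag.
    """
    updated = []
    for feature in features:
        row = dict(feature)  # copy so the caller's checklist is not mutated
        command = checks.get(row["id"])
        if command is not None:
            result = subprocess.run(command, capture_output=True, text=True)
            row["passes"] = result.returncode == 0
        updated.append(row)
    return updated


if __name__ == "__main__":
    # Stand-in checks: in practice each command would be a domain script,
    # for example a backtest, a simulation run, or a hardware self-test.
    demo_features = [
        {"id": 1, "description": "check that succeeds", "passes": False},
        {"id": 2, "description": "check that fails", "passes": False},
    ]
    demo_checks = {
        1: [sys.executable, "-c", "raise SystemExit(0)"],
        2: [sys.executable, "-c", "raise SystemExit(1)"],
    }
    for row in run_checks(demo_features, demo_checks):
        print(row["id"], row["passes"])
```

Swapping Puppeteer for another verifier then amounts to changing the commands in the mapping; the checklist format and the selection logic stay the same.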
One-Page Overview (Print & Pin)
1. JSON checklist is the single source of truth; never let the Agent edit descriptions.
2. Initializer runs once; every next session is a Coder that follows the 7-step ritual.
3. Smoke test must pass before any commit; main branch stays green by construction.
4. Git log + progress file = context window insurance.
5. When every passes flag is true, you’re done: no estimation, no drama.
Action Checklist / Implementation Steps
☾ [ ] Write a one-sentence project goal.
☾ [ ] Expand it into 50–300 feature rows in JSON; mark all "passes": false.
☾ [ ] Create repo; add .gitignore, init.sh, first commit.
☾ [ ] Author a 5–15-line smoke test that exits 0 when core flows work.
☾ [ ] Compose Coder prompt with the 7-step bash ritual; save it as a template.
☾ [ ] Spin up Initializer once; push all scaffold files.
☾ [ ] Queue as many Coder shifts as needed; require push before shutdown.
☾ [ ] Tag release only when jq '.[].passes' feature_list.json | sort -u prints a single true.
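The release condition in the last checkbox can also be expressed as a small Python sketch, which additionally reports how much work remains. The function name release_status and the returned keys are assumptions for illustration; only the passes flag comes from the post.

```python
import json


def release_status(checklist_json):
    """Summarize the checklist: how many features pass and whether tagging is allowed."""
    features = json.loads(checklist_json)
    done = sum(1 for feature in features if feature.get("passes") is True)
    total = len(features)
    # An empty checklist is never releasable; this mirrors the jq pipeline,
    # which prints nothing (not a single `true`) when there are no rows.
    return {"done": done, "total": total, "ready": total > 0 and done == total}


if __name__ == "__main__":
    # Hypothetical checklist contents; a real run would read feature_list.json.
    sample = json.dumps([
        {"id": 1, "passes": True},
        {"id": 2, "passes": False},
    ])
    print(release_status(sample))
```

Treating anything other than a literal true as unfinished keeps the gate strict: a missing or null flag blocks the tag instead of slipping through.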
Frequently Asked Questions
How big can the JSON checklist grow before tokens explode?
Thousands of lines are fine; Agents read it with jq filters, not by loading the entire file into narrative memory.
What if two features depend on each other?
List them in dependency order; the Coder always picks the first false item, so prerequisites naturally come first.
Does this work with private GitHub repos?
Yes, give the Agent a token scoped to repo access; the harness remains identical.
Can I swap Puppeteer for Cypress or Playwright?
Absolutely. Any tool that returns exit code 0 on success fits the smoke-test slot.
How do you stop an Agent from editing the JSON schema?
Strong negative prompt: “Changing keys or deleting rows is strictly forbidden” plus a regex pre-commit hook if you want belt-and-suspenders.
Is a database needed for progress tracking?
Not for state. The JSON flags plus Git log are enough. Use a database only when the product (not the process) demands it.
Could several Agents run in parallel?
Today the harness is serial. To parallelize, shard the checklist by category and assign non-overlapping file sets; merge back via PR. That’s future work explicitly mentioned in the paper.
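As a rough illustration of the sharding idea in the last answer, the Python sketch below groups unfinished features by category so each group could go to a separate Agent. It is speculative: the post does not define a category field, so that key, the "uncategorized" fallback, and the function name shard_by_category are all assumptions, and assigning non-overlapping file sets is left out entirely.

```python
from collections import defaultdict


def shard_by_category(features):
    """Group the ids of unfinished features by their (assumed) category field."""
    shards = defaultdict(list)
    for feature in features:
        if feature.get("passes") is False:
            # "category" is a hypothetical key; rows without it share one bucket.
            shards[feature.get("category", "uncategorized")].append(feature["id"])
    return dict(shards)


if __name__ == "__main__":
    # Hypothetical rows for illustration only.
    demo = [
        {"id": 1, "category": "chat", "passes": True},
        {"id": 2, "category": "chat", "passes": False},
        {"id": 3, "category": "settings", "passes": False},
        {"id": 4, "passes": False},
    ]
    print(shard_by_category(demo))
```

Within each shard the ids keep their checklist order, so the dependency-order convention from the earlier answer still holds per Agent, though cross-category dependencies would need separate handling.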
Author’s closing thought: We used to treat long context like a magic backpack—stuff everything inside and hope the model finds it later. The shift-log system flips that idea: travel light, but leave breadcrumbs anyone can follow. Once the trail is predictable, even a 32 k window can stretch across weeks of real time—and your project ships while the backpack stays empty.

