The autonomous daemon approach is interesting but the problem I kept hitting when building something similar was completion criteria. Claude Code will decide it's "done" when it thinks it's done - which is usually when it's written the code, not when the tests pass or the PR is created. The task description itself needs to specify exactly what done looks like: "done = syntax check passes + no import errors + file written to expected path". Once I started treating completion criteria as a first-class field in task definitions rather than leaving it implicit, the number of tasks that drifted or required manual intervention dropped a lot. The retry logic matters less than you think when the root problem is that the agent successfully completed the wrong thing.
This is exactly right — and it's why acceptanceCriteria is a first-class field in the task schema, not just a description. Every task has an explicit acceptanceCriteria: string[] array that defines what "done" actually means:
    acceptanceCriteria: [
      "All tests pass (pnpm test)",
      "No TypeScript errors (pnpm tsc --noEmit)",
      "File written to src/components/NewFeature.tsx",
      "Completion report posted to inbox"
    ]
When a task launches, those criteria get injected into the agent's prompt context alongside the task description, subtasks, and agent instructions. The agent sees exactly what "done" means before it starts working.
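As a rough sketch of what that injection could look like (the Task shape and the buildPrompt helper here are hypothetical illustrations, not Mission Control's actual API):

```typescript
// Hypothetical task shape; field names are assumptions for illustration.
interface Task {
  title: string;
  description: string;
  acceptanceCriteria: string[];
}

// Build the prompt the agent sees, with "done" spelled out up front.
function buildPrompt(task: Task): string {
  const criteria = task.acceptanceCriteria
    .map((c, i) => `${i + 1}. ${c}`)
    .join("\n");
  return [
    `# Task: ${task.title}`,
    task.description,
    "## Acceptance criteria (ALL must hold before you report done):",
    criteria,
  ].join("\n\n");
}
```

The point is only that the criteria travel with the task into the prompt, so "done" is defined before the agent writes a line of code.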
You're also right that the deeper problem is "successfully completed the wrong thing." Retry logic assumes failure is obvious (exit code ≠ 0), but a task that silently drifts is harder to catch. The /ship-feature command enforces a verification step — runs tests, lints, and typechecks before marking anything complete — which catches a lot of the "it wrote code but nothing actually works" cases.
That said, there's still a gap between "tests pass" and "this actually does what I asked." That's where the human-in-the-loop decisions queue helps — agents can post a decision request like "I implemented X, but the acceptance criteria mention Y. Should I continue?" — but making agents reliably self-evaluate against criteria is still an open problem.
We have been running a lighter-weight version of this for 6 days - a single Claude Code agent that wakes every 2 hours, reads a STATE.md file as its only memory, and decides what to do next (it is currently trying to earn money from scratch: https://dev.to/wpmultitool/my-ai-agent-has-been-trying-to-ma...).
The file-as-persistence approach has been surprisingly effective. Each run, the agent reads what past-self tried, evaluates honestly, and writes conclusions back. What we have found is that the self-evaluation is the hard part, not the task tracking.
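A minimal sketch of that read-evaluate-append loop (the STATE.md entry format and the runCycle helper are my own assumptions, not the authors' exact scheme):

```typescript
import { readFileSync, appendFileSync, existsSync } from "node:fs";

// One wake-up cycle: read what past-self wrote, decide, append conclusions.
// `decide` stands in for the actual agent call.
function runCycle(
  statePath: string,
  decide: (history: string) => string,
): string {
  const history = existsSync(statePath)
    ? readFileSync(statePath, "utf8")
    : "";
  const conclusion = decide(history);
  appendFileSync(
    statePath,
    `\n## ${new Date().toISOString()}\n${conclusion}\n`,
  );
  return conclusion;
}
```

Because the file is append-only, each run can audit every earlier attempt, which is what makes the honest self-evaluation step possible at all.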
One thing that did not work: the agent over-iterated on losing approaches. Added SEO features to a site with zero traffic for 8 consecutive runs. The fix was explicit criteria written into the instructions: if still at $0 after 24 hours of runs, pivot.
Curious whether Mission Control has any mechanism for recognizing when a task should be abandoned vs. retried? That seems like the hardest part of autonomous agent loops.
Update: just shipped the loop detection + decision escalation I mentioned. Here's how it works now:
When you run a "continuous mission" (one-click to execute an entire project), the daemon chains tasks automatically — as each finishes, the next batch dispatches based on dependency order. If an agent fails the same task 3 times in a row, loop detection kicks in and auto-creates a decision in the decisions queue with context about what failed and options (retry with a different approach, skip it, or stop the mission). The human gets an inbox notification and can answer from the UI.
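The 3-strikes escalation could be sketched roughly like this (the Decision shape and onTaskFailure are illustrative names under my own assumptions, not the project's actual code):

```typescript
interface Decision {
  taskId: string;
  context: string;
  options: string[];
}

const MAX_CONSECUTIVE_FAILURES = 3;

// Returns null while plain retries are still allowed; returns a Decision
// for the human queue once the same task has failed three times running.
function onTaskFailure(
  failures: Map<string, number>,
  taskId: string,
  error: string,
): Decision | null {
  const count = (failures.get(taskId) ?? 0) + 1;
  failures.set(taskId, count);
  if (count < MAX_CONSECUTIVE_FAILURES) return null;
  failures.set(taskId, 0); // reset so a human answer restarts the count
  return {
    taskId,
    context: `Failed ${count}x in a row. Last error: ${error}`,
    options: ["retry with a different approach", "skip task", "stop mission"],
  };
}
```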
It also posts a mission completion report to the inbox when everything finishes (or stalls) — task counts, file paths from the work, and a nudge to check the status board for anything left over.
Still not full self-evaluation in the "did I actually make progress?" sense — that's the next frontier. But the mechanical escalation path is wired end-to-end now. Code's on GitHub if you want to poke at it: https://github.com/MeisnerDan/mission-control
Great question — and I think you're right that self-evaluation is the harder problem. Right now, Mission Control's daemon handles the mechanical side: exponential backoff retries (configurable), maxTurns and timeout limits per session to prevent runaway agents, and permanent failure after exhausting retries. But it's blunt.
That said, what MC does have is the plumbing for human escalation — an inbox system where agents can post decision requests, and a decisions queue where questions get surfaced to the human. But that's not wired into the daemon's failure path yet, which is an obvious next step.

I think the real answer is some kind of evaluation step between retries — "did this attempt make meaningful progress, or am I spinning?" — probably by having the agent review its own output against the acceptance criteria before deciding to retry. That's on my radar, but I haven't built it yet.

Curious how you handle it with your STATE.md approach — do you have the agent evaluate its own progress, or do you review manually?
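The exponential backoff mentioned above can be sketched as follows (function and parameter names are assumptions, not MC's actual config):

```typescript
// Delay doubles per attempt, capped so a stuck task doesn't wait forever.
function backoffDelayMs(
  attempt: number,
  baseMs = 1_000,
  maxMs = 60_000,
): number {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}
```

With these defaults the schedule runs 1s, 2s, 4s, 8s, ... up to the 60s ceiling; the cap is what keeps "configurable" from turning into "unbounded".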
I'm wondering if this can use different workspace folders for different projects? I'd like to manage all my side projects through this but I can't put them all in one folder.
Yes. When you're creating the task (or asking Claude Code to populate the mission/subtasks), you can specify the output folder (in the task description or the mission). So far I haven't specified one, and it has just been dropping research reports into a /research folder it made, project outputs into a /projects folder, etc.
Can this take vague ideas, do iterative design with me, and break down tasks to then hand off to agents to build?
I was playing with a very similar project recently that was more focused on a high level input ("Build a new whatever dashboard, <more braindump>") and went back and forth with an agent to clarify and refine. Then broke down into Epics/Stories/Tasks, and then handed those off automatically to build.
The workflow then is iterating on those high level requests. Heavily inspired by the dark factory posts that have been making the rounds recently.
From a glance, it seems like this is designed so that I write all the tasks myself? Does it have any sort of coordination layer to manage git, or otherwise keep agents from stepping on each other?
I've been working on a similar project: https://github.com/BumpyClock/tasque . It tracks tasks (epics, tasks, subtasks) with dependencies between them. So I plan for an hour or so, and when I walk away from my desk the tasks are queued for the agents to code; then I can come back and verify.
Edit: minor note, one additional thing in the skill the tool installs is a directive for the agent to create follow-up tasks for any bugs or refactor opportunities it encounters. I find this lets the agent scratch that itch: when it sees something, instead of getting sidetracked and doing that thing right away, it creates a follow-up task that I can review later and moves on.
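The dependency gating both tools describe can be sketched as a simple readiness check (the DepTask shape here is hypothetical, not tasque's or Mission Control's actual schema):

```typescript
// A task is dispatchable when it isn't done and every dependency is done.
interface DepTask {
  id: string;
  deps: string[];
  done: boolean;
}

function readyTasks(tasks: DepTask[]): DepTask[] {
  const done = new Set(tasks.filter((t) => t.done).map((t) => t.id));
  return tasks.filter((t) => !t.done && t.deps.every((d) => done.has(d)));
}
```

Re-running this after each completion gives the "next batch dispatches in dependency order" behavior without needing a full topological sort up front.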
Could you tell us what makes this different from other agent orchestration software?
Also I’m struggling to understand the significance of the 193 tests. Are these to validate the output of the agents?
If they’re just there to prevent regressions in your code, the size of a test suite is not usually a selling point. In particular, for a product this complicated, 193 is a small number of tests, which either means each test does a lot (probably too much) or you’re lacking coverage. Either way I wouldn’t advertise “193 tests”.
I agree with what you’re saying. However given the reputation of openclaw (and I presume many other vibe coded spaghetti monsters) I appreciate the signal “I care about quality”.
The code is a mix of old- and new-style JS, e.g. function vs. =>.
At a cursory glance the UI has way too many buttons/features, but it probably makes sense once you're in the weeds using it; in fact, it makes more sense the more I look at it.
I have a different viewpoint on what to automate, and I work differently with agents, but I much prefer seeing projects like this on HN to plain product announcements.
Wow bro, more people need to hear about this. I don't have access to Claude Code yet, but I use the free Claude for coding tasks and it's still a headache, so when I get Claude Code I'll use this for sure. Also, why don't you have a landing page that leads to the repo? It could get you more traffic.
Thanks, man! A landing page is definitely on the list — right now I'm focused on getting the core solid first, but you're right that it would help with discoverability. For now the GitHub README is doing double duty as the landing page.
And you don't actually need Claude Code to use it — Mission Control works with any AI agent that can read/write local files. The data layer is just JSON, and the API is token-optimized (~50 tokens per request vs ~5,400 unfiltered, about a 94% reduction) so it's lightweight for any agent to consume. The Eisenhower matrix, Kanban, goal hierarchy, and brain dump all work standalone too. The daemon and agent orchestration just layer on top when you're ready for it.
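A rough sketch of the kind of field projection that keeps agent-facing payloads small (the project helper and field names are illustrative assumptions, not Mission Control's actual schema):

```typescript
// Return only the requested fields of an item, dropping everything else,
// so the agent pays tokens only for what it asked for.
function project<T extends Record<string, unknown>>(
  item: T,
  fields: (keyof T)[],
): Partial<T> {
  const out: Partial<T> = {};
  for (const f of fields) {
    if (f in item) out[f] = item[f];
  }
  return out;
}

// e.g. project(task, ["id", "title", "status"]) instead of the full record
```

Serving projected records like this, rather than whole objects, is one plausible way to get from ~5,400 tokens per request down to ~50.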
Interesting that most of it is markdown
well except the mission control folder
Congrats! Great try!
Thanks, bro. Indeed, building the core is more important: give value to the initial users first, then focus on making it mainstream. Good luck!