Hey HN! I'm Daniel, cofounder of GrowthX and Ben's colleague (who posted it). We have about 20 engineers building AI agents and workflows for companies like Lovable, Webflow, Airbyte. Output is the framework we extracted from that work. It runs our AI infrastructure and we open-sourced it.
We kept hitting the same problems: writing and iterating on prompts at scale, orchestrating API calls that fail unpredictably, tracking costs, testing non-deterministic code, building datasets from production data, organizing repos so coding agents perform well. And every piece of tooling was a different SaaS product that didn't talk to the others.
We built Output around three ideas:
1. Make it easy for devs and coding agents to create and modify workflows in one or a few shots.
Filesystem first: everything your agent needs lives in self-contained folders, so the full context is visible without hunting. TypeScript and Zod provide the first validation layer for whether your workflow is correct.
2. One framework, minimal tooling sprawl.
We got tired of scattering data across SaaS products that don't talk to each other. Prompt files, evals, tracing, cost tracking, credentials all live in one place.
Your data stays on your infrastructure. Under the hood, we built on Temporal for orchestration. It's a hard problem and we weren't going to reinvent the wheel they've perfected. Temporal is open source and self-hostable, or you can use Temporal Cloud. We wrapped it so you don't need to learn Temporal upfront, but the full power is there underneath.
3. A flat learning curve.
Our team is web engineers at different levels. We didn't want anyone to learn Python, five different tools, or the nuances of workflow idempotency before they could ship. We baked in conventions: same folder structure, file names, patterns across every workflow. Advanced features like Temporal primitives, evals, LLM-as-a-judge stay out of the way until you reach for them.
We've been building production workflows this way for over a year.
We extracted it, cleaned it up, and wanted to put it in front of people who'd push on it.
Docs and a video building an HN AI digest newsletter from scratch: https://output.ai
Happy to answer questions.
This looks really interesting - appreciate you sharing. Is it only API key driven or is there a way to try out with a Claude/Anthropic subscription?
Hey dangent! Glad you find it interesting!
So the API keys during setup are entirely optional. They're used in the example workflow that evaluates blog posts for clarity and provides feedback on how to improve.
You're more than free to ignore/delete the example workflow and create your own that doesn't use an LLM, e.g.:
1. Fetching trending HN posts
2. Pulling Reddit posts that match keywords
3. Transforming daily calendar events into an HTML page, etc.
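For instance, the first of those could be a plain fetch step against the public Algolia HN Search API (the endpoint is real; the function names and how this slots into an Output workflow are our sketch):

```typescript
// Sketch of a non-LLM workflow step: pull trending HN posts from the
// public Algolia HN Search API, then format them as a plain-text digest.

interface HnHit {
  title: string;
  url: string | null;
  points: number;
}

export async function fetchTrending(): Promise<HnHit[]> {
  const res = await fetch(
    "https://hn.algolia.com/api/v1/search?tags=front_page"
  );
  if (!res.ok) throw new Error(`HN API returned ${res.status}`);
  const body = (await res.json()) as { hits: HnHit[] };
  return body.hits;
}

// Pure formatting step: deterministic and easy to unit test, no LLM involved.
export function formatDigest(hits: HnHit[]): string {
  return hits
    .map((h, i) => `${i + 1}. ${h.title} (${h.points} points)`)
    .join("\n");
}
```

Splitting the network call from the pure formatting keeps the deterministic part trivially testable.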
And the Claude Code plugins (installed for you) all work with your Anthropic subscription, no problem.
This is awesome!
Looks great. Sharing with my team
The Unicode injection is a real vector, but I keep running into a problem one step before that: how do you even know which MCP servers to trust with tool definitions?
The official MCP Registry is basically a flat list. No verification metadata, no attestation chain. If someone gets a malicious server listed there, Unicode tricks in tool descriptions are almost beside the point. Your agents are already pulling definitions from an unvetted source.
I have been tracking the IETF drafts that try to solve agent discovery and registration. There are about 11 competing ones (ARDP, AID, AINS, agents.txt, etc.). Six expired or are expiring this month, with no renewals filed. The ones still alive do not include any mechanism for cryptographic verification of tool descriptions.
At 500 agents, the question stops being "is this tool description clean" and becomes "should my agent be talking to this server at all." The sanitization work matters, but it is downstream of a trust problem that is currently wide open.
Hey! Ben here (one of the engineers who built this).
This is one reason we made our HTTP framework (@outputai/http) a first-class citizen of the wider framework and our Claude Code plugins.
As you pointed out, there's currently a Cambrian explosion both in new tools/libraries and in the willingness to use them, which poses a systemic security threat when combined with how LLMs function.
So while you're free to use any third-party tool or library you want with Output, we encourage you to roll your own as often as possible, both for the security/control it gives you and for the vertical integration it provides (debugging, cost tracking, evals, etc.).
Do you mind sharing any content from your team's research? I've recently gotten interested in agent/llm attacks and how to protect against them.
Interesting that this came out of 500 agents in production. The hardest part I've seen with agent tool calls is handling partial failures gracefully — the tool returns something but it's incomplete or stale. Do you bake retry/fallback logic into the framework itself or leave that to individual tool implementations?
Oh I can answer that.
So we had a few goals here:
1. Be opinionated on best practices, tools, and libraries
2. Not get in the way of what the developer wants to do
To that end, the core is built on top of Temporal, and our LLM package is a thin wrapper around ai-sdk that adds QoL enhancements (prompt files, tracing, cost tracking, etc.).
So for failures in general, and tool calls specifically, there are two levels of retries.
1. ai-sdk-level tool retries: the library handles tool-call failures by default and will retry if the LLM deems the issue transient; it will never hard-fail when one of its tool calls is unsuccessful (unless you instruct it to).
2. Temporal-level activity retries: our workflows and steps are all configured with a baseline policy to reattempt failed steps. As the developer you can change this: you can make a step never retry, or retry, say, 100 times with exponential backoff.
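To make level 2 concrete (the field names follow Temporal's retry policy; exactly how Output surfaces them per step is our assumption, not confirmed here), exponential backoff with a cap looks like:

```typescript
// Temporal-style retry policy: wait initialInterval before the first
// retry, multiply by backoffCoefficient each attempt, never wait longer
// than maximumInterval, and give up after maximumAttempts.
export const retryPolicy = {
  initialInterval: "1 second",
  backoffCoefficient: 2,
  maximumInterval: "5 minutes",
  maximumAttempts: 100,
};

// Delay before retry attempt n (1-indexed), in milliseconds.
export function backoffMs(
  attempt: number,
  initialMs = 1_000,
  coefficient = 2,
  capMs = 300_000
): number {
  return Math.min(initialMs * coefficient ** (attempt - 1), capMs);
}
```

So attempts 1, 2, 3 wait 1s, 2s, 4s, and by attempt 10 or so the delay has hit the 5-minute cap and stays there until the attempt budget runs out.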
Hope that helps!