I'm using the Agents SDK by OpenAI. It's interoperable with other inference providers, even local models running in LM Studio.
I've used OpenAI from the beginning for all the AI apps I build. Initially there was only the Completions API. Then they introduced the Assistants API, with conversations stored server-side, and later the Responses API. The Assistants API has since been deprecated in favor of the Responses API and the Agents SDK, where sessions and everything else can be stored locally however we want.
So I moved all my old Completions/Responses API projects to the Agents SDK. They feel good and stable. Building chat with the Agents SDK is super easy, and it can stream tool calls and tokens effortlessly.
The Agents SDK takes care of sessions, token tracking, caching, and much more. In my apps it helps me track how many tokens were cached and how many are new. The best part is that it cleans up old tool calls, which saves context, and it also auto-summarises as the chat grows (I might be wrong about the last one).
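The cached-versus-new split mentioned above can be computed from the usage data the API returns. This is a minimal sketch, assuming a usage payload shaped like the Responses API's (`input_tokens`, `input_tokens_details.cached_tokens`, `output_tokens`). The `split_usage` helper and the sample numbers are made up for illustration and are not part of the Agents SDK:

```python
def split_usage(usage: dict) -> dict:
    """Split input tokens into cached and fresh counts.

    Assumes a Responses-API-style usage dict; missing fields count as 0.
    """
    total_in = usage.get("input_tokens", 0)
    details = usage.get("input_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    return {
        "cached_input": cached,
        "new_input": total_in - cached,
        "output": usage.get("output_tokens", 0),
        # Share of the prompt that was served from the provider's cache.
        "cache_hit_ratio": cached / total_in if total_in else 0.0,
    }


# Illustrative numbers only, not real API output.
sample = {
    "input_tokens": 2000,
    "input_tokens_details": {"cached_tokens": 1536},
    "output_tokens": 300,
}
stats = split_usage(sample)
print(stats)
```

Logging these numbers per turn is usually enough to see whether a long chat is actually benefiting from prompt caching.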
I'm building an EdTech platform with a lot of AI learning and evaluation tools, including an isolated compute layer for my students. That led me to create an OSS project, which might have some answers to your original question.
I'm working on an open-source async SAPI for PHP to make PHP convenient for building realtime AI apps (still in alpha and under active development), and I've written a small lesson on using the Agents SDK for AI apps as a way to showcase the framework. If you'd like to see my approach, this lesson is a good place to skim:
Lesson 29: https://php.zeal.ninja/learn/ai-chat
Agents SDK example code used in the lesson above: https://github.com/sibidharan/zealphp/blob/master/examples/a...
I follow this style everywhere in my code. Agents run as separate Python processes, detached from whatever framework the app is built on. They stream over STDIO, and I forward the tokens to the frontend over SSE or WebSockets as needed. It makes for a clean architecture.
Different architectures have different needs. For a simple chat response, SSE is fine; for a complicated, long-running stream, use WebSockets.
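The STDIO-to-SSE pattern described above can be sketched roughly as follows. This is not the code from the linked lesson: the worker here is a stand-in that emits hard-coded events instead of running a real agent, and the event shape (`type`, `text`, `name`) and the helper names `to_sse` and `stream_agent_as_sse` are assumptions made for illustration.

```python
import json
import subprocess
import sys

# Stand-in for the agent process. A real worker would run the agent and
# print one JSON event per line as tokens and tool calls arrive.
WORKER_SOURCE = r'''
import json, sys
events = [
    {"type": "tool_call", "name": "lookup"},
    {"type": "token", "text": "Hel"},
    {"type": "token", "text": "lo"},
    {"type": "done"},
]
for event in events:
    sys.stdout.write(json.dumps(event) + "\n")
    sys.stdout.flush()  # flush per event so the parent sees it immediately
'''


def to_sse(event: dict) -> str:
    """Format one event as a Server-Sent Events frame."""
    return f"event: {event['type']}\ndata: {json.dumps(event)}\n\n"


def stream_agent_as_sse():
    """Spawn the worker and yield an SSE frame for each line on its stdout."""
    proc = subprocess.Popen(
        [sys.executable, "-c", WORKER_SOURCE],
        stdout=subprocess.PIPE,
        text=True,
    )
    try:
        for line in proc.stdout:
            line = line.strip()
            if line:
                yield to_sse(json.loads(line))
    finally:
        proc.stdout.close()
        proc.wait()


frames = list(stream_agent_as_sse())
print("".join(frames), end="")
```

Because the worker only speaks newline-delimited JSON on stdout, the web layer can be PHP, Python, or anything else that can spawn a process, and the same events could be pushed over a WebSocket instead of being framed as SSE.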
This is how I'm doing it. I'm interested to know how others are approaching this.
I try to batch requests and reuse responses wherever possible. LLM API calls are awesome, but if you don't watch your usage, costs sneak up fast.
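One simple way to reuse responses is an exact-match cache in front of the API call. The sketch below is only an illustration of that idea, not the commenter's actual setup: `ResponseCache` is a hypothetical helper, and `fake_model` stands in for a real API call so the example runs offline.

```python
import hashlib
import json


class ResponseCache:
    """In-memory cache keyed on a hash of the model name and prompt."""

    def __init__(self, call_model):
        # call_model is any callable (model, prompt) -> str; hypothetical here.
        self._call_model = call_model
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, model: str, prompt: str) -> str:
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        self._store[key] = self._call_model(model, prompt)
        return self._store[key]


def fake_model(model: str, prompt: str) -> str:
    # Stand-in for a real API call.
    return f"[{model}] answer to: {prompt}"


cache = ResponseCache(fake_model)
prompts = ["What is SSE?", "What is SSE?", "What is a SAPI?"]
answers = [cache.get("example-model", p) for p in prompts]
print(cache.hits, cache.misses)
```

An exact-match cache only helps when identical prompts repeat, so it suits things like fixed evaluation questions better than free-form chat.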
Agreed. Are you doing anything to take advantage of caching?
Anything else you can share about the project?
Mostly multi-model routing and structured outputs.
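The reply doesn't say how the routing or the output validation is done, so the following is only a guess at one common shape: pick a model by a cheap rule, then check the JSON reply against a fixed structure before using it. The model names, the 200-character threshold, and the `Grade` fields are all placeholders.

```python
import json
from dataclasses import dataclass


@dataclass
class Grade:
    score: int
    feedback: str


def pick_model(prompt: str) -> str:
    """Route short prompts to a cheaper model; both names are placeholders."""
    return "small-model" if len(prompt) < 200 else "large-model"


def parse_grade(raw: str) -> Grade:
    """Validate a JSON reply against the Grade shape, failing loudly."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(score, int) or isinstance(score, bool):
        raise ValueError("score must be an integer")
    if not isinstance(data.get("feedback"), str):
        raise ValueError("feedback must be a string")
    return Grade(score=score, feedback=data["feedback"])


# Hard-coded reply standing in for a model response.
reply = '{"score": 8, "feedback": "Clear answer, missing one edge case."}'
grade = parse_grade(reply)
print(pick_model("Grade this short answer."), grade)
```

Rejecting malformed replies at this boundary keeps bad model output from leaking into the rest of the app, whichever model the router picked.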