Steering interpretable language models with concept algebra

(guidelabs.ai)

77 points | by luulinh90s 5 days ago

8 comments

giang_at_glai 5 days ago
Author here.
This post shows “concept algebra” on a language model: inject, suppress, and compose human-understandable concepts at inference time (no retraining, no prompt engineering).
There’s an interactive demo in the post.
Would love feedback on: (1) what steering tasks you’d benchmark, (2) failure cases you’d want to see, (3) whether this kind of compositional control is useful in real products.
Related: https://news.ycombinator.com/item?id=47131225
- anon291 4 days ago
  I would personally like to see some quantification of how good this is compared to just replacing the system prompt of an off-the-shelf 8B-parameter language model.
  The suppression bit is very powerful. I would like to see a quantification of how often a steered 'normal' language model mentions things you asked it to suppress versus how often this one does.
  - giang_at_glai 4 days ago
    We will share a technical write-up soon that addresses both of your questions: (1) steering vs. prompt engineering, and (2) how effectively our steering suppresses undesired generations.
    If you have joined our waitlist, we will notify you as soon as it is available.
- didgeoridoo 4 days ago
  Hi! Have you published the concept dictionary yet? I’m looking into using Steerling to investigate how different moral scenarios elicit various responses in LLMs (using Haidt MFT concepts mostly), and my first few inference runs have been hamstrung by not having a canonical mapping of concepts to IDs. Thanks!
  - luulinh90s 3 days ago
    Hi! Thanks for checking.
    We haven’t published the concept dictionary yet.
    We plan to release it soon, along with other important artifacts.
    - didgeoridoo 3 days ago
      On the waitlist — please announce it there!
AIorNot 3 days ago
How well would this steering work for function calling in an agent, to keep the agent on task or to act as a guardrail?
- luulinh90s 3 days ago
  We haven’t benchmarked our steering for scaffolding function-calling in an agent loop yet (and the model we are using is just a base model), so I can’t give a quantitative claim. But concept-based steering should be a good fit for keeping the agent on task and enforcing behavioral guardrails around tool use.
  In practice, you can treat concepts as soft/hard constraints to bias the agent toward: (1) calling tools only when needed, (2) selecting the right tool/function, or (3) using the correct argument schema.
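  To make the soft/hard distinction concrete, here is a rough, generic sketch. It is not the Steerling API or our steering method. It assumes the agent exposes a per-tool score before choosing, and every name in it (constrain_tool_scores, pick_tool, the tool names, the numbers) is hypothetical. A soft constraint adds a bias to a tool's score, while a hard constraint masks the tool out entirely.

```python
import math

# Hypothetical illustration only: plain score arithmetic, not concept steering
# and not any real library's API.

def constrain_tool_scores(scores, soft_bias=None, hard_block=None):
    """Return adjusted scores.

    soft_bias:  dict of tool -> additive bias (soft constraint, can be outvoted)
    hard_block: set of tools that must never be chosen (hard constraint)
    """
    soft_bias = soft_bias or {}
    hard_block = hard_block or set()
    adjusted = {}
    for tool, score in scores.items():
        if tool in hard_block:
            adjusted[tool] = -math.inf  # masked out, cannot win
        else:
            adjusted[tool] = score + soft_bias.get(tool, 0.0)
    return adjusted


def pick_tool(scores):
    """Pick the highest-scoring tool, or None if every tool is blocked."""
    tool = max(scores, key=scores.get)
    return None if scores[tool] == -math.inf else tool


# Made-up scores for one agent step.
scores = {"search": 1.2, "calculator": 1.0, "delete_file": 1.5, "no_tool": 0.8}

adjusted = constrain_tool_scores(
    scores,
    soft_bias={"calculator": 0.5},   # nudge toward the calculator
    hard_block={"delete_file"},      # never allow this tool
)
chosen = pick_tool(adjusted)
```

  In this toy setup the unconstrained agent would pick delete_file. With the hard block and the soft nudge applied, it picks calculator instead. Actual concept steering acts on the model's internal representations rather than on a score table like this.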