Model Priority & Model Pinning for Multi-Model / Router Mode #17979
bluemoehre started this conversation in Ideas
I'm currently setting up a small "home assistant / work support landscape" for myself:
a primary intelligence (gpt-oss-120b) and a coding assistant (Qwen-3 Coder) that I use in VS Code all day.
But my new AI server still has VRAM left (and it was expensive 😉). So I want to make the most of it by adding occasional services such as TTS or TTI and whatever comes next.
At the moment I'm juggling four Docker containers, starting and stopping the extra services as needed. My custom scripts for doing that have become a bit of a code jungle, and I'm not happy with it.
I then discovered llama-swap, which looked promising, but after digging a bit deeper I realized it would introduce yet another tool and add even more complexity, while still leaving me with three containers anyway.
A few minutes ago I ran into the news about the llama-server router, which is fantastic! It eliminates that extra tool, so that's a win. However, I would still have to run three containers to guarantee that both the main LLM and the coder are always available.
To get rid of most of the complexity in this scenario and collapse everything into a single Docker container, fully flexible LLM/VRAM management would help.
TL;DR
So what about a priority / pinning mechanism with two options:
- Priority: a per-model value that decides which models get swapped out first when VRAM runs low.
- Pinning: a flag that keeps a model loaded at all times (here: the main LLM and the coder).
Such a feature would be a straightforward and elegant fit for home-grown multi-model stacks, because it preserves the conversational flow with the user. I'd much rather have the system say something like:
... than leave the user staring at dead silence while models are swapped back and forth. =D
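To make the idea concrete, here is a minimal sketch of the eviction policy I have in mind. This is not llama-server code; `ModelSlot`, `pick_models_to_unload` and the VRAM numbers are made up for illustration. The idea: pinned models are never unloaded, and among the remaining loaded models the lowest-priority ones are evicted first; if even that cannot free enough VRAM, the router can answer the user right away instead of silently stalling.

```python
from dataclasses import dataclass

@dataclass
class ModelSlot:
    name: str
    vram_gb: float
    priority: int   # higher = more important, evicted last
    pinned: bool    # pinned models are never evicted
    loaded: bool

def pick_models_to_unload(slots: list[ModelSlot], needed_vram_gb: float) -> list[ModelSlot] | None:
    """Pick loaded, unpinned models to evict, lowest priority first.

    Returns the models to unload, or None if evicting every unpinned model
    still would not free enough VRAM (so the request can be rejected with a
    friendly message instead of stalling).
    """
    candidates = sorted(
        (s for s in slots if s.loaded and not s.pinned),
        key=lambda s: s.priority,
    )
    evict, freed = [], 0.0
    for slot in candidates:
        if freed >= needed_vram_gb:
            break
        evict.append(slot)
        freed += slot.vram_gb
    return evict if freed >= needed_vram_gb else None

# Example: main LLM and coder are pinned, occasional services come and go.
slots = [
    ModelSlot("gpt-oss-120b", 64.0, priority=10, pinned=True,  loaded=True),
    ModelSlot("qwen3-coder",  24.0, priority=10, pinned=True,  loaded=True),
    ModelSlot("tts",           6.0, priority=1,  pinned=False, loaded=True),
]
print(pick_models_to_unload(slots, needed_vram_gb=5.0))   # -> [tts slot]
print(pick_models_to_unload(slots, needed_vram_gb=40.0))  # -> None, tell the user instead of stalling
```

With the main LLM and the coder pinned, an occasional TTS or TTI request could only ever displace other occasional services, never the two assistants.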