I like how with Copilot now, I need to consider in VS Code whether to accept a tab-complete, because if it's coming from Copilot it will count against my usage, whereas if it's coming from the IDE tools it will not. So I'm making individual decisions on whether to type something myself or just "use up" some completion budget. Funny to get nickel and dimed like this by one of the biggest companies in the world.
It's a seriously degraded experience from a developer's perspective. Okay, you've finally got one local LLM installed after configuring everything perfectly; what happens when you want to run a second instance? Now you've blown past your VRAM and system RAM limits, and you're stuck with just one.
Furthermore, the model they recommend doesn't quite reach ~gpt-5.4-mini level performance. That quality dip means you may as well just pay for something like Kimi K2.6 via OpenRouter if you want something ~>= Sonnet 4.6 in performance as a backup for when you run out of Anthropic/OpenAI usage.
Your point about caliber/quality is fair, but I have been pretty astonished by some of the newer/better models (Gemma 4 variants, GPT-OSS before that).
However, there isn't much of a memory increase from running multiple sessions in parallel against one model. It's an HTTP server, and other than some caching, it's basically stateless.
Doesn't llama.cpp (or similar) have to evict the KV cache for this, so that performance degrades when running multiple sessions? Or how do you load a model in memory and then use it in multiple sessions? I'm still learning this stuff.
There are so many flags to llama.cpp that I won't try to say anything too strong, but I believe the options related to `--kv-offload` mean you can have the KV cache in GPU VRAM, regular system RAM, paged to disk, etc...
I'm on a Mac with unified memory, so I can't easily benchmark it for you, but I think a PC with 64GB of regular RAM and a 24GB gaming card could swap between multiple sessions without too much pain. The weights could stay resident on the GPU.
On the other hand, I did just dump some Project Gutenberg texts into a prompt, and building that cache in the first place was slower than I thought it would be.
The model is loaded once and can be used for multiple sessions, and even parallel requests.
llama.cpp uses a unified KV cache that is shared between requests (be they happening in parallel or not). As new requests come in, they'll evict no longer referenced branches, then move to evict the least recently used entry, and so on.
If you come back to a session whose cache has been evicted, the prompt will just be processed again. This mostly matters for very long context sessions, but it can still be a problem for you.
So one way to reduce such evictions (and reduce KV cache size significantly as a bonus) is to reduce the number of KV cache checkpoints.
Checkpoints allow you to branch a session at any point without having to recompute it from the start. If you find that you rarely branch a conversation, or if you rely entirely on a coding harness, then setting `ctx-checkpoints` to 0 or 1 will save tons of VRAM and allow more sessions to stay in VRAM. This is especially true for models with very large checkpoints (such as Gemma 4).
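To make the "loaded once, shared by every session" point concrete, here's a minimal sketch in Python. Assumptions on my part: a llama-server instance already running locally on the default port 8080 with a couple of parallel slots enabled (e.g. `--parallel 2`); the session names and prompts are made up. The server holds the single copy of the weights; the two client "sessions" just hit its OpenAI-compatible endpoint concurrently.

```python
# Minimal sketch: two "sessions" sharing one llama-server instance.
# Assumes llama-server is already running locally on port 8080 with enough
# parallel slots. The weights are loaded once by the server; each request
# only adds its own KV cache entries.
import concurrent.futures
import requests

URL = "http://localhost:8080/v1/chat/completions"  # OpenAI-compatible endpoint

def ask(session_name, prompt):
    resp = requests.post(URL, json={
        "model": "local",  # llama-server serves whatever model it was started with
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }, timeout=300)
    resp.raise_for_status()
    return session_name + ": " + resp.json()["choices"][0]["message"]["content"]

prompts = {
    "session-a": "Summarize what a KV cache does, in one sentence.",
    "session-b": "Explain LRU eviction in one sentence.",
}

with concurrent.futures.ThreadPoolExecutor() as pool:
    futures = [pool.submit(ask, name, p) for name, p in prompts.items()]
    for f in concurrent.futures.as_completed(futures):
        print(f.result())
```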
Not sure why you got downvoted. 95% of people should be paying for a subscription. It's far cheaper, far more scalable, and far less hassle.
Local AI only makes sense for a couple of use cases:
Local AI is "cheaper" when you already have the hardware sitting around, like an old MacBook or gaming GPU, or when the API cost (subscriptions will all run out if you churn 24/7) is too high to bear. I'm surprised companies are still selling their old MacBooks to employees when they could be turning them into Beowulf clusters for cheap AI compute on long-running jobs (the cost is just electricity).
If usage-based pricing is killing your vibe, find a cheaper subscription with higher limits. Here's a list of them compared on price-per-request-limit: https://codeberg.org/mutablecc/calculate-ai-cost/src/branch/...
I think you're right about the cost/benefit trade-off in general, but I do wonder how much "compaction" Codex and Claude do is to keep context fresh and how much is to save (them) runtime costs.
If you've got a 1M token context, but they constantly summarize it down to something much smaller, is it really 1M tokens of benefit? With a local model, you can use all 256k tokens on your own terms. However, I don't have any benchmarks to know.
I think you might be a bit confused about compaction? The LLM API endpoint does not do compaction; it's the external agent harness that does it. And the Codex/Claude agents aren't constantly summarizing it down; they generally wait until you're at about 3/4 of the maximum context size.
Compaction doesn't save them money, it just makes it possible for you to continue a session. If you compact a session too many times, besides the fact that the model basically stops being useful, you eventually just cannot do anything else in the session because all the context is taken up by compaction notes. But if you don't compact it, pretty soon the session is completely unusable because it can't output any more tokens. You can disable compaction in those agents if you want to see the difference.
Also, using a lot of context can make the model perform poorly, so compaction can improve results. If you have a much larger context size, it means you have more headroom before the model starts to perform poorly (as it grows closer to max context size). A larger context also lets you do things like handle larger documents or reason over a larger amount of data without having to break it up into subtasks. Eventually we want models' context to get much bigger so we can do more things in a session. (Some research is being done to see if we can get rid of the limit entirely)
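If it helps, the trigger is roughly this. This is a hypothetical sketch of harness-side compaction, not the actual Codex or Claude Code implementation; `count_tokens` and `summarize` are stand-ins for whatever the real harnesses use.

```python
# Hypothetical sketch of harness-side compaction: once the transcript nears
# ~3/4 of the model's context window, older turns are squashed into a single
# "compaction note" so the session can keep going.
MAX_CONTEXT = 200_000                     # model context window, in tokens
COMPACT_AT = int(MAX_CONTEXT * 0.75)      # the ~3/4 threshold mentioned above

def maybe_compact(messages, count_tokens, summarize):
    """messages is a list of {'role': ..., 'content': ...} dicts."""
    total = sum(count_tokens(m["content"]) for m in messages)
    if total < COMPACT_AT:
        return messages                   # plenty of headroom, leave it alone

    # Keep the system prompt and the most recent turns; replace the middle
    # with a summary. Repeat this enough times and the notes themselves eat
    # the context, which is why heavily compacted sessions stop being useful.
    head, middle, recent = messages[:1], messages[1:-6], messages[-6:]
    note = {"role": "system",
            "content": "Summary of earlier conversation: " + summarize(middle)}
    return head + [note] + recent
```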
The LLM API endpoint does do compaction. OpenAI definitely supports server-side compaction, both explicit and automatic, and this is different from what could be implemented purely client-side: https://developers.openai.com/api/docs/guides/compaction (and there were rumors a few months ago on HN about how activation-preserving/latent it is, vs. just summarization). Anthropic as well, in beta (new to me): https://platform.claude.com/docs/en/build-with-claude/compac...
The names for the pieces are confusing, so it's easy to talk past each other. For instance, you're saying "Codex the agent", which isn't a thing now. It's currently GPT-5.5, and at one point it was GPT-5.3-Codex, so when I say "Codex", I mean the macOS "harness". Similar for Claude Code vs Claude Opus/Sonnet.
Anyways, I don't know the specifics well enough to argue with you on anything, but there is a cost for input tokens, and you see/pay it when you use the API directly or through OpenRouter. Maybe you looked at the leaked source for Claude Code and can tell me definitively otherwise, but Anthropic's and OpenAI's incentives for when to compact are not always aligned with the user's, depending on the pricing plan.
I recently set up a Gemma 4 heretic fine-tune on my MacBook, more to prove that I could than anything else, and it is probably around 4o levels of performance imo. Not fit for any real work. That said, the fact that 4o was frontier two years ago and today I can equal it on local hardware, uncensored, is pretty impressive.
> 95% of people should be paying for a subscription.
Subscription plans are the "first hit is free" plans. Real pricing once subscriptions are phased out in a year or two is gonna be orders of magnitude more.
Actually subscription plans will be here indefinitely. The cost of inference will only go down over time, and subscriptions are the end-game for all businesses as it's recurring revenue. Most subscribers don't use all the capacity, and there are limits imposed, so the financials work out. Same basic model as residential internet & mobile phones, but cheaper because there's an order of magnitude (or two) less support and maintenance.
Why are you running 2 instances anyway? If you want that workflow, just rent a few EC2 GPU instances and fire away.
If you're going to rent a few EC2 GPU instances, you might as well funnel things through OpenRouter. Not that many of us have workflows where trusting an LLM provider is a problem but sending the data to EC2 is not.
As for why, why would you not? Sitting around waiting for a single assistant is inefficient use of time; I tend to have more like 4-10 instances running in parallel.
I absolutely see no reason to send company IP, future plans, and current code base to any other company.
I also do not run 10 agents at the same time. There's no way I could keep up with the volume of work from doing that in any meaningful way.
Nobody wants or needs your company IP, future plans, and current code base.
You don't run 10 agents to get more volume of work. You run 10 agents to get better quality work
Does your company self-host everything, though? Many are already in the cloud; why single out LLMs as the one thing not to use the cloud for?
I trust most cloud providers more than most LLM providers, but I still don't trust them much. Anything I can keep safeguarded on premises, I do.
> Not that many of us have workflows where trusting an LLM provider is a problem but sending the data to EC2 is not.
I'd imagine plenty of people have a problem with trusting fly-by-night inference providers or model owners with opt-out policies [1] [2] about training on your data, who would be more than happy to send data to EC2, or even the same models in Amazon Bedrock.
[1]: https://github.blog/news-insights/company-news/updates-to-gi...
[2]: https://help.openai.com/en/articles/5722486-how-your-data-is...
You've got a token addiction.
BTW, LM Studio and a few others are really amazing. They allow you to download models from HF and manage many details before loading them. A mid-range PC with an 8 or 10GB graphics card is already a nice setup for running many models that are really good. You can also run Ollama, which is very simple to use and helps you code in VSCodium with Continue. Pretty nice!
I've tried these small models and they're nowhere near as good as Claude or GPT-5.
The new ones running on a 16GB M1 are maybe GPT-4 level (with decent performance to be fair).
I wonder if it's possible to make some hyper-overtuned model that, say, does nothing but program in Python, and get SOTA-ish performance in that narrow task.
A 24GB Nvidia RTX 3090 Ti is ~2000 euro.
You can get them for half that price on Reddit used. I have a few. You will not get top-tier intelligence out of them. GPT-5.5 and Claude Sonnet/Opus are in an unbelievable tier. Not all problems need that, though. I have a Qwen-based agent write short websites for me to use and it is adequate to the task.
Which is how many months of Claude or Claude + chatgpt when Claude is down? And do you own anything after using those subscriptions? Can you pick and choose from dozens of models and whatever comes next? Can you play video games with your Claude subscription?
Believe me when I say that I want to run local models, and I do. But in my testing, 24 GB doesn't get you much brainpower.
Have you tried the latest qwen3.6 models?
For most of my questions an 8-9B model works great. The upshot is not having ChatGPT/Meta sell my data or target me with random thoughts later.
I let Qwen3.6-27B chew on a bug all last night. It choked at some point and stopped responding (probably a context overflow before pi-coding-agent could compact it). Claude Sonnet 4.6 found and fixed the bug in under 10 minutes.
Qwen3.6 is pretty amazing for a 27B model, but it's not hard to run into its limits. With a Radeon R9700 and unsloth's 6-bit quantization, I get ~20 TPS and 110k context, so it can do a fair bit quickly.
You definitely need to watch it more than a model 100 times larger. But the fact that it runs on 1 GPU and does what it does is insane. Imagine what a 30B model will look like in 6 months or a year.
Inference speed is still slow in a meaningfully different way. The models are good, but not great, and much slower, which for coding means a 2-3 minute task with Claude Code and Opus takes an hour and has a higher chance of being wrong.
It's only slow if you can't afford to run it properly. A lot of people are getting 70-100 tokens per second on 1 gpu.
Not sure what Claude Opus or Sonnet run at. I know when it goes offline it's 0 tokens per second.
We're in the same boat. I would rather have NO LLM than an LLM that collects my data (which you should assume is all of them, unless you've been asleep for the last 20 years).
Fortunately, I don't have to pick one or the other; instead I run Qwen 3.6 35B A3B. It's a bit slow with my 8GB GPU (I'm in the process of getting a bigger one), but again, to me the choice isn't "what's the best I can get", it's "what's the best local I can get".
$20/month cloud plan is definitely better than anything you'll get locally.
Cost is not a reason to go local.
qwen3.6 does a good job locally, except it can take 20-30 minutes to respond to a prompt on a Mac Studio with 32GB of RAM.
Apple Silicon before the M4 does not have matmul instructions, which makes prompt processing very slow. It's quite different on the M5, much like using an Nvidia GPU.
Yea you probably do want to use a GPU for models of that size.
I also wonder what quantization you are using? If you haven't tried other quants, I really would.
This is qwen3.6:27b-coding-nvfp4. It's only an M1. If they ever ship an M5 Studio with 96GB of RAM, that's my next upgrade path for the local LLM experiments.
You can get work done with them if you have a harness that can drive outcomes without needing feedback (I've been building a TDD red-to-green agent harness lately that is very effective if given a good plan upfront). So if you can stand waiting a few days to see results that would take only hours with a model deployed on frontier Nvidia hardware, you can get results this way.
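Roughly, the loop looks like this. This is a hypothetical sketch of such a red-to-green harness, not the actual code being described; `propose_patch` and `apply_patch` are placeholders for however you call the model and edit files.

```python
# Hypothetical red-to-green loop: run the tests, hand failures to the model,
# apply its patch, repeat until green or the attempt budget runs out.
import subprocess

def run_tests():
    proc = subprocess.run(["pytest", "-x", "-q"], capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def red_to_green(plan, propose_patch, apply_patch, max_attempts=20):
    for _ in range(max_attempts):
        green, output = run_tests()
        if green:
            return True                   # done; no human feedback needed on the way
        patch = propose_patch(plan=plan, failing_output=output)
        apply_patch(patch)                # placeholder: edit files on disk
    return False                          # still red after the budget; a human should look
```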
The time delay is the real issue. Much much slower wall clock time.
Local AI does not mean privacy or offline. Claude Code does not run offline. It needs an internet connection.
"./claude-2.1.126-linux-x64
Welcome to Claude Code v2.1.126
Unable to connect to Anthropic services
Failed to connect to api.anthropic.com: ECONNREFUSED
Please check your internet connection and network settings.
Note: Claude Code might not be available in your country. Check supported countries at https://anthropic.com/supported-countries"
Let me also add that most of the services that are "private" will still connect to the internet. LM Studio and many others will try to get a connection. I don't remember a single one that does not connect to their servers and send some kind of information.