But there is nothing more permanent than a quickly hacked together prototype or personal productivity hack that works. There are so many Python (or Perl or Visual Basic) scripts or Excel spreadsheets - created by people who have never been "developers" - which solve in-the-trenches pain points and become indispensable in exactly the way _that_ xkcd shows.
> But where are the professional tools, meant to be used for people who don't want to do vibe-coding, but be heavily assisted by LLMs? Something that is meant to augment the human intellect, not replace it?
Claude Code has absolutely zero features that help me review code or do anything other than vibe-coding and accepting changes as they come in. We need diff comparisons between different executions, a TUI tailored for that kind of work, and more. Claude Code is basically an MVP of that.
Still, I do use Claude Code and Codex daily as there is nothing better out there currently. But they still feel tailored towards vibe-coding instead of professional development.
I really do not want those things in Claude Code - I much prefer choosing my own diff tools etc. and running them in a separate terminal. If they start stuffing too much into the TUI they'd ruin it - if you want all that stuff built in, they have the VS Code integration.
Me neither, hence the stated preference for something completely new and different, a stab in a different direction instead of the same boring iteration on yet another agentic TUI coder.
Are any of them integrated with git? AFAIK, you'd have to instruct them to use git for you if you don't want to do it manually.
Imagine a GUI built around git branches + agents working in those branches + tooling to manage the orchestration and small review points, rather than "here's a chat and tool calling, glhf".
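Not an existing tool, just a hypothetical sketch of that idea in Python: one branch and git worktree per agent, so each works in isolation and review happens per branch. The agent command is a made-up placeholder, not any real CLI's interface.

import subprocess
from pathlib import Path

AGENT_CMD = ["my-agent-cli", "--prompt"]   # hypothetical placeholder; swap in a real agent CLI

def spawn_agent(task: str, branch: str) -> subprocess.Popen:
    worktree = Path("..") / f"wt-{branch}"
    # one isolated checkout per agent, each on its own branch
    subprocess.run(["git", "worktree", "add", str(worktree), "-b", branch], check=True)
    # each agent edits only its own worktree; its work is just `git diff main..<branch>`
    return subprocess.Popen(AGENT_CMD + [task], cwd=worktree)

agents = [
    spawn_agent("add rate limiting to the API", "agent-rate-limit"),
    spawn_agent("fix the flaky auth tests", "agent-auth-tests"),
]
for proc in agents:
    proc.wait()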
Says the person who will find themselves unable to change the software even in the slightest way without having to do large refactors across everything at the same time.
High quality code matters more than ever, would be my argument. The second you let the LLM sneak in some quick hack/patch instead of correctly solving the problem, is the second you invite it to continue doing that always.
I have a feeling this will only supercharge the long established industry practice of new devs or engineering leadership getting recruited and immediately criticising the entire existing tech stack, and pushing for (and often succeeding in) a ground-up rewrite in the language/framework du jour. This is hilariously common in web work, particularly front end web work. I suspect there are industry sectors that're well protected from this, I doubt people writing firmware for fuel injection and engine management systems suffer too much from this, the Javascript/Nodejs/NPM scourge _probably_ hasn't hit the PowerPC or 68K embedded device programming workflow. Yet...
"high quality specifications" have _always_ been a thing that matters.
In my mind, it's somewhat orthogonal to code quality.
Waterfall has always been about "high quality specifications" written by people who never see any code, much less write it. Agile makes specs and code quality somewhat related, but in at least some ways probably drives lower quality code in the pursuit of meeting sprint deadlines and producing testable artefacts at the expense of thoroughness/correctness/quality.
I did, although a long time ago, so maybe I need to try it again. But it still seems to be stuck in a chat-like interface instead of something tailored to software development. Think IDE but better.
It has a new “watch files” mode where you can work interactively. You just code normally but can send commands to the LLM via a special string. It's a great way of interacting with LLMs, if only they were much faster.
If you're interested in much faster LLM coding, GLM 4.6 on Cerebras is pretty mind blowing. It's not quite as smart as the latest Claude and Gemini, but it generates code so fast it's kind of comical if you're used to the other models. Good with Aider since you can keep it on a tighter leash than with a fully agentic tool.
When I think "IDE but better", a Claude Code-like interface is increasingly what I want.
If you babysit every interaction, rather than reviewing a completed unit of work of some size, you're wasting your time second-guessing whether the model will "recover" from stupid mistakes. Sometimes that's warranted, but more often than not it corrects itself faster than you can.
And so it's far more effective to interact with it far more async, where the UI is more for figuring out what it did if something doesn't seem right, than for working live. I have Claude writing a game engine in another window right now, while writing this, and I have no interest in reviewing every little change, because I know the finished change will look nothing like the initial draft (it did just start the demo game right now, though, and it's getting there). So I review no smaller units of change than 30m-1h, often it will be hours, sometimes days, between each time I review the output, when working on something well specified.
If your goal is to edit code and not discuss it aider also supports a watch mode. You can keep adding comments about what you want it to do in a minimal format and it will make changes to the files and you can diff/revert them.
The chat interface is optimal to me because you often are asking questions and seeking guidance or proposals as you are making actual code changes. One reason I do like it is that its default mode of operation is to make a commit for each change it makes. So it is extremely clear what the AI did vs what you did vs what is a hodge podge of both.
As others have mentioned, you can integrate it with your IDE through the watch mode. It's a somewhat crude but still useful way. But I find myself more often than not just running Aider in a terminal under the code editor window and chatting with it about what's in the window.
Seems very much not, if it's still a chat interface :) Figuring out a chat UX is easy compared to something that was created with the LLM filling in some parts from the beginning. I guess I'm searching for something with a different paradigm than just "chat + $Something".
the question is, how do you want to provide instructions for what the AI is to do? You might not like calling it "chat" but somehow you need to communicate that, right? With aider you can write a comment for a function and then instruct it to finish the function inline (see other comments). But unless you just want pure autocomplete based on it guessing things, you need to provide guidance to it somehow.
I don't know exactly, but I guess in a more declarative manner than anything else. Maybe we set goals/milestones/concrete objectives, or similar, rather than imperatively steering it; give it space to experiment yet make it very easy to understand exactly what important tradeoffs it is making.
I find a good compromise on that front is not to use the chat primarily, but to create files like 'ARCHITECTURE.md', 'REQUIREMENTS.md' and put information in there describing how the application works. Then you add those to the chat as context docs. From the chat interface you are then just referring to those instead of describing features willy-nilly. So the nice thing is you are building documentation for the application in a formal sense as part of instructing the LLM.
But that is the typical agentic LLM coder style program I was initially referring to, the kind I was saying we maybe should explore alternatives to. It's too basic and primitive; with some imagination we could do better.
The typical "best practice" for these tools tend to be to ask it something like
"I want you to do feature X. Analyse the code for me and make suggestions how to implement this feature."
Then it will go off and work for a while and typically come back after a bit with some suggestions. Then iterate on those if needed and end with.
"Ok. Now take these decided upon ideas and create a plan for how to implement. And create new tests where appropriate."
Then it will go off and come back with a plan for what to do. And then you send it off with.
"Ok, start implementing."
So sure. You probably can work on this to make it easier to use than with a CLI chat. It would likely be less like an IDE and more like a planning tool you'd use with human colleagues though.
Aider can be a chat interface and it's great for that but you can also use it from your editor by telling it to watch your files.[1]
So you'd write a function name and then tell it to flesh it out.
function factorial(n) // Implement this. AI!
Becomes:
function factorial(n) {
  if (n === 0 || n === 1) {
    return 1;
  } else {
    return n * factorial(n - 1);
  }
}
Last I looked Aider's maintainer has had to focus on other things recently, but aider-ce is a fantastic fork.
I'm really curious to try Mistral's Vibe, but even though I'm a big fanboi I don't want to be tied to just one model. Aider lets you tier your models such that your big, expensive model can do all the thinking and then stuff like code reviews can run through a smaller model. It's a pretty capable tool.
Very much this for me - I really don't get why, given that new models are popping out every month from different providers, people are so happy to sink themselves into provider ecosystems when there are open source alternatives that work with any model.
The main problem with Aider is it isn't agentic enough for a lot of people but to me that's a benefit.
RTX Pro 6000, ends up taking ~66GB when running the MXFP4 native quant with llama-server/llama.cpp and max context, as an example. Guess you could do it with two 5090s with slightly less context, or different software aimed at memory usage efficiency.
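For anyone wondering where the ~66GB comes from, a rough back-of-envelope, assuming this is gpt-oss-120b (which ships natively in MXFP4) at roughly 120B total parameters and ~4.25 bits/weight (4-bit values plus one shared 8-bit scale per 32-value block):

params = 120e9           # gpt-oss-120b total parameters, roughly
bits_per_weight = 4.25   # MXFP4: 4-bit values + a shared 8-bit scale per 32 weights
weights_gb = params * bits_per_weight / 8 / 1e9
print(f"weights alone: ~{weights_gb:.0f} GB")   # ~64 GB; KV cache at max context adds the rest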
I'm sure I'm not the only one that thinks "Vibe CLI" sounds like an unserious tool. I use Claude Code a lot and little of it is what I would consider Vibe Coding.
So people have different definitions of the word, but originally Vibe Coding meant "don't even look at the code".
If you're actually making sure it's legit, it's not vibe coding anymore. It's just... Backseat Coding? ;)
There's a level below that I call Power Coding (like power armor) where you're using a very fast model interactively to make many very small edits. So you're still doing the conceptual work of programming, but outsourcing the plumbing (LLM handles details of syntax and stdlib).
I know tech bros like to come up with fancy words to make trivial things sound fancy, but as long as it's a slop-out process, it's vibe coding. If you're fixing what a bot spits out, that should be a different word … something painful that could've been avoided?
Also, we’re both “people in tech”, we know LLMs can’t conceptualise beyond finding the closest collection of tokens rhyming with your prompt/code. Doesn’t mean it’s good or even correct. So that’s why it’s vibe coding.
The original definition was very different. The main thing with vibe coding is that you don't care about the code. You don't even look at the code. You prompt, test that you got what you wanted, and move on. You can absolutely use cc to vibe code. But you can also use it to ... code based on prompts. Or specs. Or docs. Or whatever else. The difference is if you want / care to look at the code or not.
It sure doesn't feel like it, given how closely I have to babysit Claude Code so that I still recognize the code after it has been left to its own devices for a minute.
Let's say you had a hardware budget of $5,000. What machine would you buy or build to run Devstral Small 2? The HuggingFace page claims it can run on a Mac with 32 GB of memory or an RTX 4090. What kind of tokens per second would you get on each? What about DGX Spark? What about RTX 5090 or Pro series? What about external GPUs on Oculink with a mini PC?
All those choices seem to have very different trade-offs? I hate $5,000 as a budget - not enough to launch you into higher-VRAM RTX Pro cards, too much (for me personally) to just spend on a "learning/experimental" system.
I've personally decided to just rent systems with GPUs from a cloud provider and setup SSH tunnels to my local system. I mean, if I was doing some more HPC/numerical programming (say, similarity search on GPUs :-) ), I could see just taking the hit and spending $15,000 on a workstation with an RTX Pro 6000.
For grins:
Max t/s for this and smaller models? RTX 5090 system. Barely squeezing in for $5,000 today and given ram prices, maybe not actually possible tomorrow.
Max CUDA compatibility, slower t/s? DGX Spark.
Ok with slower t/s, don't care so much about CUDA, and want to run larger models? Strix Halo system with 128gb unified memory, order a framework desktop.
Prefer Macs, might run larger models? M3 Ultra with memory maxed out. Better memory bandwidth; Mac users seem to be quite happy running locally for just messing around.
I ran ollama first because it was easy, but now download source and build llama.cpp on the machine. I don't bother saving a file system between runs on the rented machine, I build llama.cpp every time I start up.
I am usually just running gpt-oss-120b or one of the qwen models. Sometimes gemma? These are mostly "medium" sized in terms of memory requirements - I'm usually trying unquantized models that will easily run on a single 80-ish GB GPU because those are cheap.
I tend to spend $10-$20 a week. But I am almost always prototyping or testing an idea for a specific project that doesn't require me to run 8 hrs/day. I don't use the paid APIs for several reasons but cost-effectiveness is not one of those reasons.
I know you say you don't use the paid APIs, but renting a GPU is something I've been thinking about and I'd be really interested in knowing how this compares with paying by the token. I think gpt-oss-120b is $0.10/input, $0.60/output per million tokens in Azure. In my head this could go a long way, but I haven't used gpt-oss agentically long enough to really understand usage. Just wondering if you know/would be willing to share your typical usage/token spend on that dedicated hardware?
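A rough back-of-envelope using the numbers in this thread ($10-$20/week of rented GPU vs. the $0.10/$0.60 per million token Azure pricing mentioned above); the 70/30 input/output split is only an assumption:

weekly_gpu_spend = 15.00           # midpoint of the $10-$20/week quoted above
input_price = 0.10 / 1_000_000     # $/token for gpt-oss-120b input on Azure (as quoted)
output_price = 0.60 / 1_000_000    # $/token for output
blended = 0.7 * input_price + 0.3 * output_price   # assumed 70/30 input-heavy mix
print(f"~{weekly_gpu_spend / blended / 1e6:.0f}M blended tokens/week")   # ~60M tokens for the same $15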
I don't suppose you have (or would be interested in writing) a blog post about how you set that up? Or maybe a list of links/resources/prompts you used to learn how to get there?
No, I don't blog. But I just followed the docs for starting an instance on lambda.ai and the llama.cpp build instructions. Both are pretty good resources. I had already set up an SSH key with Lambda, and the Lambda OS images are Linux pre-loaded with CUDA libraries on startup.
Here are my lazy notes + a snippet of the history file from the remote instance for a recent setup where I used the web chat interface built into llama.cpp.
I created an instance gpu_1x_gh200 (96 GB on ARM) at lambda.ai.
connected from the terminal on my box at home and set up the ssh tunnel:
ssh -L 22434:127.0.0.1:11434 ubuntu@<ip address of rented machine - can see it on lambda.ai console or dashboard>
Started building llama.cpp from source, history:
21 git clone https://github.com/ggml-org/llama.cpp
22 cd llama.cpp
23 which cmake
24 sudo apt list | grep libcurl
25 sudo apt-get install libcurl4-openssl-dev
26 cmake -B build -DGGML_CUDA=ON
27 cmake --build build --config Release
MISTAKE on 27, SINGLE-THREADED and slow to build see -j 16 below for faster build
28 cmake --build build --config Release -j 16
29 ls
30 ls build
31 find . -name "llama.server"
32 find . -name "llama"
33 ls build/bin/
34 cd build/bin/
35 ls
36 ./llama-server -hf ggml-org/gpt-oss-120b-GGUF -c 0 --jinja
MISTAKE, didn't specify the port number for the llama-server
I switched to qwen3 vl because I need a multimodal model for that day's experiment. Lines 38 and 39 show me not using the right name for the model. I like how llama.cpp can download and run models directly off of huggingface.
Then pointed my browser at http://localhost:22434 on my local box and had the normal browser window where I could upload files and use the chat interface with the model. That also gives you an OpenAI API-compatible endpoint. It was all I needed for what I was doing that day. I spent a grand total of $4 that day doing the setup and running some NLP-oriented prompts for a few hours.
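For completeness, a minimal sketch of hitting that OpenAI-compatible endpoint from Python through the tunnel above. It assumes llama-server was started with a --port matching the tunnel's remote side (11434 in the ssh command), and the model name is only a label since the server has a single model loaded:

from openai import OpenAI

# Talks to llama-server through the SSH tunnel set up above (local port 22434).
client = OpenAI(base_url="http://localhost:22434/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="gpt-oss-120b",   # label only; llama-server serves whatever model it loaded
    messages=[{"role": "user", "content": "Summarize this abstract: ..."}],
)
print(response.choices[0].message.content)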
dual 3090's (24GB each) on 8x+8x pcie has been a really reliable setup for me (with nvlink bridge... even though it's relatively low bandwidth compared to tesla nvlink, it's better than going over pcie!)
48GB of vram and lots of cuda cores, hard to beat this value atm.
If you want to go even further, you can get an 8x V100 32GB server complete with 512GB ram and nvlink switching for $7000 USD from unixsurplus (ebay.com/itm/146589457908) which can run even bigger models and with healthy throughput. You would need 240V power to run that in a home lab environment though.
I've been running local models on an AMD 7800 XT with ollama-rocm. I've had zero technical issues. It's really just that the usefulness of a model fitting in only 16GB of VRAM + 64GB of main RAM is questionable, but that isn't an AMD-specific issue. It was a similar experience running locally with an Nvidia card.
I'm not excited that it's done in python. I've had experience with Aider struggling to display text as fast as the llm is spitting it out, though that was probably 6 months ago now.
Something like GPT-5 mini is a lot cheaper than even Haiku, but when I tried it, in my experience it was so bad it was a waste of time. But it's probably still more than 1/10 the performance of Haiku?
At work, where my employer pays for it, Haiku tends to be the workhorse, with Sonnet or Opus when I see it flailing. On my own budget I'm a lot more cost conscious, so Haiku actually ends up being “the fancy model” and MiniMax M2 the “dumb model”.
Even if it is 10x cheaper and 2x worse it's going to eat up even more tokens spinning its wheels trying to implement things or squash bugs and you may end up spending more because of that. Or at least spending way more of your time.
Is it? The actual SOTA are not amazing at coding, so at least for me there is absolutely no reason to optimize on price at the moment. If I am going to use an LLM for coding it makes little sense to settle for a worse coder.
I dunno. Even pretty weak models can be decently performant, and 9/10 the performance for 1/10 the price means 10x the output, and for a lot of stuff that quality difference doesn't really matter. Considering even SOTA models are trash, slightly worse doesn't really make that much difference.
Fair. Mostly the argument is: if all you need is to iterate on the output to refine it, you get 10x the iterations; even at lesser quality, that's still an aspect to consider. But yes, why bother vibe coding when they make so many mistakes.
Ah, finally! I was checking just a few days ago if they had a Claude Code-like tool as I would much rather give money to a European effort. I'll stop my Pro subscription at Anthropic and switch over and test it out.
Does anyone know where their SWE-bench Verified results are from? I can't find matching results on the leaderboards for their models or the Claude models and they don't provide any links.
I was briefly excited when Mistral Vibe launched and mentions "0 MCP Servers" in its startup screen... but I can't find how to configure any MCP servers. It doesn't respond to the /mcp command, and asking Devstral 2 for help, it thinks MCP is "Model Context Preservation". I'd really like to be able to run my local MCP tools that I wrote in Golang.
I'm team Anthropic with Claude Max & Claude Code, but I'm still excited to see Mistral trying this. Mistral has occasionally saved the day for me when Claude refused an innocuous request, and it's good to have alternatives... even if Mistral / Devstral seems to be far behind the quality of Claude.
Thank you! Finally got it working, had to comment out the mcp_servers line near the top of the config.toml file in ~/.vibe/, before adding my [[mcp_servers]] sections at the end of the file.
Just tried it out via their free API and the Roo Code VSCode extension, and it's impressive. It walked through a data analytics and transformation problem (150,000 dataset entries) that I have been debugging for the past 2 hours.
> Mistral Code is available with enterprise deployments.
> Contact our team to get started.
The competition is much smoother. Where are the subscriptions which would give users the coding agent and the chat for a flat fee and work out of the box?
Very nice that there's a coding cli finally. I have a Mistral Pro account. I hope that it will be included. It's the main reason to have a Pro account tbh.
Open sourcing the TUI is pretty big news actually. Unless I missed something, I had to dig a bit to find it, but I think this is it: https://github.com/mistralai/mistral-vibe
Let's see which company becomes the first to sell "coding appliances": hardware with a model good enough for normal coding.
If Mistral is so permissive they could be the first ones, provided that hardware is then fast/cheap/efficient enough to create a small box that can be placed in an office.
My Macbook Pro with an M4 Pro chip can handle a number of these models (I think it has 16GB of VRAM) with reasonable performance, my bottleneck continuously is the token caps. I assume someone with a much more powerful Mac Studio could run way more than I can, considering they get access to about 96GB of VRAM out of the system RAM iirc.
...so it won't ever happen, it'll require wifi and will only be accessible via the cloud, and you'll have to pay a subscription fee to access the hardware you bought. obviously.
Extremely happy with this release; the previous Devstral was great, but training it for OpenHands crippled the usefulness. Having their own CLI dev tool will hopefully be better.
The original Devstral was a collaboration between All Hands AI (OpenHands) and Mistral [1]. You can use it with other agents but had to transfer over the prompt. Even then, the agents still didn't work that well. I tried it in RooCline and it worked extremely poorly with the tool calls.
They’ll switch to military tech the second it becomes necessary, don’t kid yourself. I’m just glad we have a European alternative for the day the US decides to turn its back on us.
This tech is simply too critical to pretend the military won’t use it. That’s clearer now than ever, especially after the (so far flop-ish) launch of the U.S. military’s own genAI platform.
> I’m just glad we have a European alternative for the day the US decides to turn its back on us
Not sure you've kept up to date; the US has turned its back on most allies so far, including Europe and the EU, and now welcomes previous enemies with open arms.
I've not spent enough time with Mistral Vibe yet for a credible comparison, but given what I know about the underlying models (likely-1T-plus Opus 4.5 compared to the 123B Devstral 2) I'd be shocked if Vibe could out-perform Claude Code for the kinds of things I'm using it for.
I gave it the job of modifying a fairly simple regex replacement and it took a while, over 5 minutes; Claude failed on the same prompt (which surprised me), Codex did a similar job but faster. So all in all not bad!
Off-topic, but it hurts my eyes: I dislike their font choice and the "cool look" of their graphics.
Surprising and good, though: everything, including the graphics, is fixed when I click my "speedreader" button in Brave. So they are doing that "cool look" with CSS.
> Devstral 2 ships under a modified MIT license, while Devstral Small 2 uses Apache 2.0. Both are open-source and permissively licensed to accelerate distributed intelligence.
Uh, the "Modified MIT license" here[0] for Devstral 2 doesn't look particularly permissively licensed (or open-source):
> 2. You are not authorized to exercise any rights under this license if the global consolidated monthly revenue of your company (or that of your employer) exceeds $20 million (or its equivalent in another currency) for the preceding month. This restriction in (b) applies to the Model and any derivatives, modifications, or combined works based on it, whether provided by Mistral AI or by a third party. You may contact Mistral AI (sales@mistral.ai) to request a commercial license, which Mistral AI may grant you at its sole discretion, or choose to use the Model on Mistral AI's hosted services available at https://mistral.ai/.
Personally I really like the normalization of these "Permissively" licensed models that only restrict companies with massive revenues from using them for free.
If you want to use something, and your company makes $240,000,000 in annual revenue, you should probably pay for it.
These are not permissively licensed though; the term "permissive license" has connotations that pretty much everyone who is into FLOSS understands (same with "open source").
I do not mind having a license like that, my gripe is with using the terms "permissive" and "open source" like that because such use dilutes them. I cannot think of any reason to do that aside from trying to dilute the term (especially when some laws, like the EU AI Act, are less restrictive when it comes to open source AIs specifically).
> I do not mind having a license like that, my gripe is with using the terms "permissive" and "open source" like that because such use dilutes them. I cannot think of any reason to do that aside from trying to dilute the term (especially when some laws, like the EU AI Act, are less restrictive when it comes to open source AIs specifically).
Good. In this case, let it be diluted! These extra "restrictions" don't affect normal people at all, and won't even affect any small/medium businesses. I couldn't care less that the term is "diluted" and that makes it harder for those poor, poor megacorporations. They swim in money already, they can deal with it.
We can discuss the exact threshold, but as long as these "restrictions" are so extreme that they only affect huge megacorporations, this is still "permissive" in my book. I will gladly die on this hill.
> Good. In this case, let it be diluted! These extra "restrictions" don't affect normal people at all,
Yes, they do, and the only reason for using the term “open source” for things whose licensing terms flagrantly defy the Open Source Definition is to falsely sell the idea that using the code carries the benefits that are tied to the combination of features in that definition, benefits which are lost with only a subset of those features. The freedom to use the software in commercial services is particularly important to end-users who are not interested in running their own services: it is a guarantee against lock-in, and of whatever longevity they are able to pay to have provided, even if the original creator later has interests that conflict with offering the software as a commercial service.
If this deception wasn't important, there would be no incentive not to use the more honest “source available for limited uses” description.
> I couldn't care less that the term is "diluted" and that makes it harder
It also makes life harder for individuals and small companies, because this is not Open Source. It's incompatible with Open Source, it can't be reused in other Open Source projects.
Terms have meanings. This is not Open Source, and it will never be Open Source.
That's fine, but I don't think you should call it open source or call it MIT or even 'modified MIT.' Call it Mistral license or something along those lines
That's probably better, but Modified MIT is pretty descriptive, I read it as "mostly MIT, but with caveats for extreme cases" which is about right, if you already know what the MIT license entails
Whatever name they come up with for a new license will be less useful, because I'll have to figure out that this is what that is
imo this is a hill people need to stop dying on. Open source means "I can see the source" to most of the world. Wishing it meant "very permissively licensed" to everyone is a lost cause.
And honestly it wasn't a good hill to begin with: if what you are talking about is the license, call it "open license". The source code is out in the open, so it is "open source". This is why the purists have lost ground to practical usage.
> imo this is a hill people need to stop dying on.
As someone who was born and raised on FOSS, and still mostly employed to work on FOSS, I disagree.
Open source is what it is today because it's built by people with a spine who stand tall for their ideals even if it means less money, less industry recognition, lots of unglorious work and lots of other negatives.
It's not purist to believe that what built open source so far should remain open source, and not wanting to dilute that ecosystem with things that aren't open source, yet call themselves open source.
> Open source is what it is today because it's built by people with a spine who stand tall for their ideals even if it means less money, less industry recognition, lots of unglorious work and lots of other negatives.
With all due respect, don't you see the irony in saying "people with a spine who stand tall for their ideals", and then arguing that attaching "restrictions" which only affect the richest megacorporations in the world somehow makes the license not permissive anymore?
What ideals are those exactly? So that megacorporations have the right to use the software without restrictions? And why should we care about that?
Anyone can use the code for whatever purpose they want, in any way they want. I've never been a "rich megacorporation", but I have gone from having zero money to having enough money, and I still think the very same thing about the code I myself release as I did from the beginning, it should be free to be used by anyone, for any purpose.
You should stand up for your ideals, but dying on the hill of what you call your ideals is actually getting in the way of that.
Because instead of making the point "this license isn't as permissive as it could/should be" (easy to understand), instead the point being made is "this isn't real open source", which comes across to most people as just some weird gate-keeping / No True Scotsman kinda thing.
"No True Scotsman" is about specifically about changing the rules to exclude a new example you don't want to permit. The rules haven't changed, and the attempts to violate the requirements aren't new. Proprietary licenses continue to be proprietary. Open Source continues to not allow restrictions on commercial use.
ultimately you have to imbue words with meaning, otherwise it is impossible to have a discussion. what i said about no true scotsman was false, i was just trying to prove a point.
> Open source means "I can see the source" to most of the world
well we don't really want to open that can of worms though, do we?
I don't agree with ceding technical terms to the rest of the world. I'm increasingly told we need to stop calling cancer detection AI "AI" or "ML" because it is not the 'bad AI' and confuses people.
If you are happy that time is being spent quibbling over definitions instead of actually focusing on the ideal, I'm not sure you care about the ideals as much as you say you do.
Who gives a shit what we call "cancer AI", what matters is the result.
Free software to me means GPL and associates, so if that is what Stallman was trying to be a stickler for - it worked.
Open source has a well understood meaning, including licenses like MIT and Apache - but not including "MIT, but only if you make less than $500 million", "MIT, unless you were born on a Wednesday", etc.
Earnestly, what's the concern here? People complain about open source being mostly beneficial to megacorps, if that's the main change (idk I haven't looked too closely) then that's pretty good, no?
They are claiming something is open-source when it isn’t. Regardless of whether you think the deviation from open-source is a good thing or not, you should still be in favour of honesty.
No, according to the commonly accepted definition of open-source.
Whenever anybody tries to claim that a non-commercial license is open-source, it always gets complaints that it is not open-source. This particular word hasn’t been watered down by misuse like so many others.
There is no commonly-accepted definition of open-source that allows commercial restrictions. You do not get to make up your own meaning for words that differs from how other people use it. Open-source does not have commercial restrictions by definition.
Where are you getting this compendium of commonly-accepted definitions?
Looking up open-source in the dictionary does turn up definitions that would allow for commercial restrictions, depending on how you define "free" (a matter that is most certainly up for debate).
"Open-source" isn't a term that emerged organically from conversations between people. It is a term that was very deliberately coined for a specific purpose, defined into existence by an authority. It's a term of art, and its exact definition is available here: https://opensource.org/osd
The term "open-source" exists for the purposes of a particular movement. If you are "for" the misuse and abuse of the term, you not only aren't part of that movement, but you are ignorant about it and fail to understand it— which means you frankly have no place speaking about the meanings of its terminology.
Unless this authority has some ownership over the term and can prevent its misuse (e.g. with lawsuits or similar), it is not actually the authority of the term, and people will continue to use it how they see fit.
Indeed, I am not part of a movement (nor would I want to be) which focuses more on what words are used rather than what actions are taken.
> people will continue to use it how they see fit.
And whenever they do so, this pointless argument will happen. Again, and again, and again. Because that’s not what the word means and your desired redefinition has been consistently and continuously rejected over and over again for decades.
What do you gain from misusing this term? The only thing it does is make you look dishonest and start arguments.
> people will continue to use it how they see fit.
People can also say 2+2=5, and they're wrong. And people will continue to call them out on it. And we will keep doing so, because stopping lets people move the Overton window and try to get away with even more.
They don't have to enforce it, evil megacorps won't risk the legal consequences of using it without talking to Mistral first. In reality they just won't use it.
I am very disappointed that they don't have a coding subscription equivalent to the 200 EUR ChatGPT or Claude one, and that it is only available for Enterprise deployments.
The only thing I found is a pay-as-you-go API, but I wonder if it is any good (and cost-effective) vs Claude et al.
> Devstral 2 is currently offered free via our API. After the free period, the API pricing will be $0.40/$2.00 per million tokens (input/output) for Devstral 2
With pricing so low I don't see any reason why someone would buy a sub for 200 EUR. These days those subs are so much more limited in Claude Code or Cursor than they used to be (or they used to be unlimited). Better to pay as you go, especially when there are days when you probably use AI less or not at all (weekends/holidays etc.), as long as those credits don't expire.
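A rough sketch of that comparison using the quoted Devstral 2 prices; treating 200 EUR as roughly $200 and assuming an 80/20 input/output split for agentic coding traffic, which is only a guess:

budget_usd = 200.0                        # ~200 EUR of subscription money
input_price, output_price = 0.40, 2.00    # $ per million tokens, as quoted above
blended = 0.8 * input_price + 0.2 * output_price   # assumed input-heavy agentic mix
print(f"~{budget_usd / blended:.0f}M blended tokens for ${budget_usd:.0f}")   # ~278M tokens per month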
Looks like another Deepseek distil like the new Ministrals. For every other use case that would be an insult, but for coding that's a great approach given how much lead in coding performance Qwen and Deepseek have on Mistral's internal datasets. The Small 24B seems to have a decent edge on 30BA3B, though it'll be comparatively extremely slow to run.
Pretty good for a 123B model!
(That said I'm not 100% certain I guessed the correct model ID, I asked Mistral here: https://x.com/simonw/status/1998435424847675429)
We are getting to the point that it's not unreasonable to think that "Generate an SVG of a pelican riding a bicycle" could be included in some training data. It would be a great way to ensure an initial thumbs up from a prominent reviewer. It's a good benchmark but it seems like it would be a good idea to include an additional random or unannounced similar test to catch any benchmaxxing.
I wrote about that possibility here: https://simonwillison.net/2025/Nov/13/training-for-pelicans-...
It would be easy to out models that train on the bike pelican, because they would probably suck at the kayaking bumblebee.
So far though, the models good at bike pelican are also good at kayak bumblebee, or whatever other strange combo you can come up with.
So if they are trying to benchmaxx by making SVG generation stronger, that's not really a miss, is it?
That depends on if "SVG generation" is a particularly useful LLM/coding model skill outside of benchmarking. I.e., if they make that stronger with some params that otherwise may have been used for "rust type system awareness" or somesuch, it might be a net loss outside of the benchmarks.
I assume all of the models also have variations on, “how many ‘r’s in strawberry”.
> We are getting to the point that its not unreasonable to think that "Generate an SVG of a pelican riding a bicycle" could be included in some training data.
I may be stupid, but _why_ is this prompt used as a benchmark? I mean, pelicans _can't_ ride a bicycle, so why is it important for "AI" to show that they can (at least visually)?
The "wine glass problem"[0] - and probably others - seems to me to be a lot more relevant...?
[0] https://medium.com/@joe.richardson.iii/the-curious-case-of-t...
The fact that pelicans can't ride bicycles is pretty much the point of the benchmark! Asking an LLM to draw something that's physically impossible means it can't just "get it right" - seeing how different models (especially at different sizes) handle the problem is surprisingly interesting.
Honestly though, the benchmark was originally meant to be a stupid joke.
I only started taking it slightly more seriously about six months ago, when I noticed that the quality of the pelican drawings really did correspond quite closely to how generally good the underlying models were.
If a model draws a really good picture of a pelican riding a bicycle there's a solid chance it will be great at all sorts of other things. I wish I could explain why that was!
If you start here and scroll through and look at the progression of pelican on bicycle images it's honestly spooky how well they match the vibes of the models they represent: https://simonwillison.net/2025/Jun/6/six-months-in-llms/#ai-...
So ever since then I've continued to get models to draw pelicans. I certainly wouldn't suggest anyone take serious decisions on model usage based on my stupid benchmark, but it's a fun first-day initial impression thing and it appears to be a useful signal for which models are worth diving into in more detail.
> If a model draws a really good picture of a pelican riding a bicycle there's a solid chance it will be great at all sorts of other things.
Why?
If I hired a worker that was really good at drawing pelicans riding a bike, it wouldn't tell me anything about his/her other qualities?!
I wish I knew why. I didn't think it would be a useful indicator of model skills at all when I started doing it, but over time the pattern has held that performance on pelican riding a bicycle is a good indicator of performance on other tasks.
a posteriori knowledge. the pelican isn't the point, it's just amusing. the point is that Simon has seen a correlation between this skill and the model's general capabilities.
It's not necessarily the best benchmark, it's a popular one, probably because it's funny.
Yes it's like the wine glass thing.
Also it's kind of got depth. Does it draw the pelican and the bicycle? Can the pelican reach the pedals? How?
I can imagine a really good AI finding a funny or creative or realistic way for the pelican to reach the pedals.
A slightly worse AI will do an OK job, maybe just making the bike small or the legs too long.
An OK AI will draw a pelican on top of a bicycle and just call it a day.
It's not as binary as the wine glass example.
> It's not necessarily the best benchmark, it's a popular one, probably because it's funny.
> Yes it's like the wine glass thing.
No, it's not!
That's part of my point; the wine glass scenario is a _realistic_ scenario. The pelican riding a bike is not. It's a _huge_ difference. Why should we measure intelligence (...) in regards to something that is realistic and something that is unrealistic?
I just don't get it.
If this had any substance then it could be criticized, which is what they're trying to avoid.
How? There's no way for you to verify if they put synthetic data for that into the dataset or not.
but can it recreate the spacejam 1996 website? https://www.spacejam.com/1996/jam.html
in case folks are missing the context
https://news.ycombinator.com/item?id=46183294
That is not a meaningful metric given that we don't live in 1996 and neither do our web standards.
In what year was it meaningful to have pelicans riding bicycles?
SVG is a current standard. Do not be coy just to satisfy your urge to disagree.
The website is live and renders correctly on my Safari mobile: https://www.spacejam.com/1996/
I may have missed something, but where are we saying the website should be recreated with 1996 tech or specs? The model is free to use any modern CSS; there are no technical limitations. So yes, I genuinely think it is a good generalization test, because it is indeed not in the training set, and yet it is an easy task for a human developer.
The point stands. Whether or not the standard is current has no relevance for the ability of the "AI" to produce the requested content. Either it can or can't.
https://news.ycombinator.com/item?id=46183673
> neither do our web standards
I'd be curious about that actually; I feel like W3C specifications (I don't mean browser support of them) rarely deprecate things and precisely try to keep the Web running.
Yes, now please prepare an email template which renders fine in outlook using modern web standards. Write it up if you succeed, front page of HN guaranteed!
The parent comment is a reference to a different story that was on the HN home page yesterday where someone attempted that with Claude.
Yes, and I had a lengthier response in that thread explaining why this isn't a useful metric.
https://news.ycombinator.com/item?id=46183673
I think this benchmark could be slightly misleading for assessing a coding model. But still a very good result.
Yes, SVG is code, but not in the sense of an executable with verifiable inputs and outputs.
But it does have a verifiable output, no more or less than HTML+CSS. Not sure what you mean by "input" -- it's not a function that takes in parameters if that's what you're getting at, but not every app does.
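A tiny illustration of the "verifiable output" point: the SVG can at least be machine-checked for being well-formed XML with an svg root, even if judging the drawing itself still takes human (or vision-model) eyes:

import xml.etree.ElementTree as ET

def looks_like_svg(text: str) -> bool:
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False
    return root.tag.endswith("svg")   # handles the xmlns-qualified tag name

print(looks_like_svg('<svg xmlns="http://www.w3.org/2000/svg"><circle r="5"/></svg>'))  # True
print(looks_like_svg("not markup"))                                                     # False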
I love that we are earnestly contemplating the merits of the pelican benchmark. What a timeline.
It's not even halfway up the list of inane things of the AI hype cycle.
Where did you get the llm tool from?!
He made it: https://github.com/simonw/llm
Cool! I can't find it on the read me, but can it run Qwen locally?
The best way to do that at the moment is using the llm-ollama plugin.
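A minimal sketch, assuming Ollama is running locally with a Qwen model pulled (the exact model tag is just an example) and the plugin installed via `llm install llm-ollama`:

import llm

# Model IDs exposed by the llm-ollama plugin are the local Ollama model names.
model = llm.get_model("qwen2.5-coder")          # e.g. after `ollama pull qwen2.5-coder`
response = model.prompt("Generate an SVG of a pelican riding a bicycle")
print(response.text())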
Skipped the bicycle entirely and upgraded to a sweet motorcycle :)
Looks like a Cybertruck actually!
I was thinking a Warthog
https://www.halopedia.org/Warthog
The Batman motorcycle!
I'm Pelicanman </raspy voice>
Is it really an svg if it’s just embedded base64 of a jpg
You were seeing the base64 image tag output at the bottom. The SVG input is at the top.
Impressive! I'm really excited to leverage this in my gooning sessions!
Less than a year behind the SOTA, faster, and cheaper. I think Mistral is mounting a good recovery. I would not use it yet since it is not the best along any dimension that matters to me (I'm not EU-bound) but it is catching up. I think its closed source competitors are Haiku 4.5 and Gemini 3 Pro Fast (TBA) and whatever ridiculously-named light model OpenAI offers today (GPT 5.1 Codex Max Extra High Fast?)
The OpenAI thing is named Garlic.
(Surely they won't release it like that, right..?)
TIL: https://garlicmodel.com/
That looks like the next flagship rather than the fast distillation, but thanks for sharing.
Lol, someone vibecoded an entire website for OpenAI's model, that's some dedication.
People have been doing this for literally every anticipated model release, and I presume skimming some amount of legitimate interest since their sites end up being top indexed until the actual model is released.
Google should be punishing these sites but presumably it's too narrow of a problem for them to care.
Black SEO in the age of LLMs
It would need outbound links to be SEO
Or at least a profit model. I don't see either on that page but maybe I'm missing something
Every link in the "Legal" tree is a dead end redirecting back to the home page... strange thing to put together without any acknowledgement, unless they spam it on LLM adjacent subreddits for clout/karma?
"GPT, please make me a website about OpenAI's 'Garlic' model."
No, this is comparable to Deepseek-v3.2 even on their highlight task, with significantly worse general ability. And it's priced at 5x of that.
It's open source; the price is up to the provider, and I do not see any on openrouter yet. ~~Given that devstral is much smaller, I can not imagine it will be more expensive, let alone 5x. If anything DeepSeek will be 5x the cost.~~
edit: Mea culpa. I missed the active vs dense difference.
> Given that devstral is much smaller, I can not imagine it will be more expensive
Devstral 2 is 123B dense. DeepSeek is 37B active. It will be slower and more expensive to run inference on this than dsv3, especially considering that dsv3.2 has some goodies that make inference at higher context more efficient than their previous gen.
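Back-of-envelope on why that matters, using the usual ~2 FLOPs per active parameter per generated token rule of thumb (this ignores attention cost, KV cache and batching effects, so it is only a rough ratio):

dense_active = 123e9   # Devstral 2: dense, all parameters active per token
moe_active = 37e9      # DeepSeek V3: ~37B active parameters per token
ratio = (2 * dense_active) / (2 * moe_active)
print(f"~{ratio:.1f}x more compute per generated token for the dense model")   # ~3.3x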
Devstral is purely non-thinking too, so it's very possible it uses fewer tokens (I don't know how DS 3.2 non-thinking compares). It's interesting because Qwen pretty much proved hybrid models work worse than fully separate models.
Deepseek v3.2 is that cheap because its attention mechanism is ridiculously efficient.
Yeah, DeepSeek Sparse Attention. Section 2: https://arxiv.org/abs/2512.02556
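Not the paper's actual mechanism (which selects tokens with a cheap learned indexer; see section 2 of the paper), but a toy sketch of the general top-k sparse attention idea, just to illustrate why attending to k selected keys instead of all of them scales with L*k rather than L^2:

import numpy as np

def topk_sparse_attention(q, k, v, top_k=64):
    # Toy version: pick the top_k keys per query and softmax only over those.
    # Here the selector reuses the full score matrix for simplicity; a real
    # implementation uses a much cheaper indexer so the full L x L scores are never formed.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    idx = np.argpartition(scores, -top_k, axis=-1)[:, -top_k:]
    out = np.zeros_like(q)
    for i in range(q.shape[0]):
        s = scores[i, idx[i]]
        w = np.exp(s - s.max())
        w /= w.sum()
        out[i] = w @ v[idx[i]]
    return out

q = k = v = np.random.randn(128, 64).astype(np.float32)
print(topk_sparse_attention(q, k, v).shape)   # (128, 64)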
I gave Devstral 2 in their CLI a shot and let it run over one of my smaller private projects, about 500 KB of code. I asked it to review the codebase, understand the application's functionality, identify issues, and fix them.
It spent about half an hour, correctly identified what the program did, found two small bugs, fixed them, made some minor improvements, and added two new, small but nice features.
It introduced one new bug, but then fixed it on the first try when I pointed it out.
The changes it made to the code were minimal and localized; unlike some more "creative" models, it didn't randomly rewrite stuff it didn't have to.
It's too early to form a conclusion, but so far, it's looking quite competent.
On what hardware did you run it?
FWIW, it’s free through Mistral right now
and openrouter https://openrouter.ai/mistralai/devstral-2512:free
So I tested the bigger model with my typical standard test queries, which are not so tough, not so easy. They are also ones you wouldn't find extensive training data for. Finally, I have already used them to get answers from gpt-5.1, sonnet 4.5 and gemini 3 ....
Here is what I think about the bigger model: It sits between sonnet 4 and sonnet 4.5. Something like "sonnet 4.3". The response speed was pretty good.
Overall, I can see myself shifting to this for regular day-to-day coding if they can offer it at competitive pricing.
I'll still use sonnet 4.5 or gemini 3 for complex queries, but, for everything else code related, this seems to be pretty good.
Congrats Mistral. You most probably have caught up to the big guys. Not there yet exactly, but, not far now.
Looks interesting, eager to play around with it! Devstral was a neat model when it was released and one of the better ones to run locally for agentic coding. Nowadays I mostly use GPT-OSS-120b for this, so it's gonna be interesting to see if Devstral 2 can replace it.
I'm a bit saddened by the name of the CLI tool, which to me implies the intended usage. "Vibe-coding" is a fun exercise for realizing where models go wrong, but for professional work where you need tight control over the quality, you obviously cannot vibe your way to excellence; hard reviews are required. So not "vibe coding", which is all about unreviewed code and just going with whatever the LLM outputs.
But regardless of that, it seems like everyone and their mother is aiming to fuel the vibe coding frenzy. But where are the professional tools, meant to be used for people who don't want to do vibe-coding, but be heavily assisted by LLMs? Something that is meant to augment the human intellect, not replace it? All the agents seem to focus on handing work off to vibe-coding agents, while what I want is something even more tightly integrated with my tools so I can continue delivering high-quality code I know and control. Where are those tools? None of the existing coding agents apparently aim for this...
Their new CLI agent tool [1] is written in Python unlike similar agents from Anthropic/Google (Typescript/Bun) and OpenAI (Rust). It also appears to have first class ACP support, where ACP is the new protocol from Zed [2].
[1] https://github.com/mistralai/mistral-vibe
[2] https://zed.dev/acp
I did not know A2A had a competitor :(
They're different use cases, ACP is for clients (UIs, interfaces)
> Their new CLI agent tool [1] is written in
This is exactly the CLI I'm referring to, whose name implies it's for playing around with "vibe-coding", instead of helping professional developers produce high quality code. It's the opposite of what I and many others are looking for.
I think that's just the name they picked. I don't mind it. Taking a glance at what it actually does, it just looks like another command line coding assistant/agent similar to Opencode and friends. You can use it for whatever you want not just "vibe coding", including high quality, serious, professional development. You just have to know what you're doing.
>vibe-coding
A surprising amount of programming is building cardboard services or apps that only need to last six months to a year and are then thrown away when temporary business needs change. Execs are constantly clamoring for semi-persistent dashboards and ETL-visualized data that lasts just long enough to rein in the problem and move on to the next fire. Agentic coding is good enough for cardboard services that collapse when they get wet. I wouldn't build an industrial data lake service with it, but you can certainly build cardboard consumers of the data lake.
You are right.
But there is nothing more permanent than a quickly hacked-together prototype or personal productivity hack that works. There are so many Python (or Perl or Visual Basic) scripts or Excel spreadsheets - created by people who have never been "developers" - which solve in-the-trenches pain points and become indispensable in exactly the way _that_ xkcd shows.
> where are the professional tools, meant to be used for people who don't want to do vibe-coding, but be heavily assisted by LLMs?
This is what we're building at Brokk: https://brokk.ai/
Quick intro: https://blog.brokk.ai/introducing-lutz-mode/
> But where are the professional tools, meant to be used for people who don't want to do vibe-coding, but be heavily assisted by LLMs? Something that is meant to augment the human intellect, not replace it?
Claude Code not good enough for ya?
Claude Code has absolutely zero features that help me review code or do anything other than vibe-code and accept changes as they come in. We need diff comparisons between different executions, a TUI tailored for that kind of work, and more. Claude Code is basically an MVP of that.
Still, I do use Claude Code and Codex daily as there is nothing better out there currently. But they still feel tailored towards vibe-coding instead of professional development.
I really do not want those things in Claude Code - I much prefer choosing my own diff tools etc. and running them in a separate terminal. If they start stuffing too much into the TUI they'd ruin it - if you want all that stuff built in, they have the VS Code integration.
Mind elaborating a bit on the diff tool / flow you’re using? Trying to follow along better with what CC is doing
Me neither, hence the stated preference for something completely new and different, a stab in a different direction instead of the same boring iteration on yet another agentic TUI coder.
> Claude Code has absolutely zero features that help me review code
Err, doesn’t it have /review?
What’s wrong with using GIT for reviewing the changes?
Are any of them integrated with git? AFAIK, you'd have to instruct them to use git for you if you don't want to do it manually.
Imagine a GUI built around git branches + agents working in those branches + tooling to manage the orchestration and small review points, rather than "here's a chat and tool calling, glhf".
High quality code is a thing from the past
What matters is high quality specifications including test cases
> High quality code is a thing from the past
Says the person who will find themselves unable to change the software even in the slightest way without having to large refactors across everything at the same time.
High quality code matters more than ever, would be my argument. The second you let the LLM sneak in some quick hack/patch instead of correctly solving the problem, is the second you invite it to continue doing that always.
I dunno...
I have a feeling this will only supercharge the long-established industry practice of new devs or engineering leadership getting recruited, immediately criticising the entire existing tech stack, and pushing for (and often getting) a ground-up rewrite in the language/framework du jour. This is hilariously common in web work, particularly front-end web work. I suspect there are industry sectors that are well protected from this; I doubt people writing firmware for fuel injection and engine management systems suffer too much from it, and the Javascript/Nodejs/NPM scourge _probably_ hasn't hit the PowerPC or 68K embedded device programming workflow. Yet...
"high quality specifications" have _always_ been a thing that matters.
In my mind, it's somewhat orthogonal to code quality.
Waterfall has always been about "high quality specifications" written by people who never see any code, much less write it. Agile makes specs and code quality somewhat related, but in at least some ways probably drives lower-quality code in the pursuit of meeting sprint deadlines and producing testable artefacts at the expense of thoroughness/correctness/quality.
Did you try Aider?
I did, although a long time ago, so maybe I need to try it again. But it still seems to be stuck in a chat-like interface instead of something tailored to software development. Think IDE but better.
It has a new "watch files" mode where you can work interactively. You just code normally but can send commands to the LLM via a special string. It's a great way of interacting with LLMs, if only they were much faster.
If you're interested in much faster LLM coding, GLM 4.6 on Cerebras is pretty mind blowing. It's not quite as smart as the latest Claude and Gemini, but it generates code so fast it's kind of comical if you're used to the other models. Good with Aider since you can keep it on a tighter leash than with a fully agentic tool.
When I think "IDE but better", a Claude Code-like interface is increasingly what I want.
If you babysit every interaction, rather than reviewing a completed unit of work of some size, you're wasting your time second-guessing whether the model will "recover" from stupid mistakes. Sometimes that's warranted, but more often than not it corrects itself faster than you can.
And so it's far more effective to interact with it asynchronously, where the UI is more for figuring out what it did if something doesn't seem right than for working live. I have Claude writing a game engine in another window right now, while writing this, and I have no interest in reviewing every little change, because I know the finished change will look nothing like the initial draft (it did just start the demo game right now, though, and it's getting there). So I review units of change no smaller than 30 minutes to an hour of work; often it will be hours, sometimes days, between each time I review the output when working on something well specified.
If your goal is to edit code and not discuss it aider also supports a watch mode. You can keep adding comments about what you want it to do in a minimal format and it will make changes to the files and you can diff/revert them.
I think Aider is closest to what you want.
The chat interface is optimal to me because you are often asking questions and seeking guidance or proposals as you make actual code changes. One reason I do like it is that its default mode of operation is to make a commit for each change it makes. So it is extremely clear what the AI did vs what you did vs what is a hodgepodge of both.
As others have mentioned, you can integrate it with your IDE through the watch mode. It's a somewhat crude but still useful way. But I find myself more often than not just running Aider in a terminal under the code editor window and chatting with it about what's in the window.
> I think Aider is closest to what you want.
> The chat interface
Seems very much not, if it's still a chat interface :) Figuring out a chat UX is easy compared to designing something that, from the beginning, was built around letting the LLM fill in some parts. I guess I'm searching for something with a different paradigm than just "chat + $Something".
the question is, how do you want to provide instructions for what the AI is to do? You might not like calling it "chat" but somehow you need to communicate that, right? With aider you can write a comment for a function and then instruct it to finish the function inline (see other comments). But unless you just want pure autocomplete based on it guessing things, you need to provide guidance to it somehow.
I don't know exactly, but I guess in a more declarative manner. Maybe we set goals/milestones/concrete objectives, or similar, rather than imperatively steering it; give it space to experiment, yet make it very easy to understand exactly what important tradeoffs it is making.
It's all very fluffy and theoretical, of course.
I find a good compromise on that front is not to use the chat primarily, but to create files like 'ARCHITECTURE.md' and 'REQUIREMENTS.md' and put information in there describing how the application works. Then you add those to the chat as context docs. From the chat interface you are then just referring to those rather than describing features willy-nilly. The nice thing is that you are building documentation for the application in a formal sense as part of instructing the LLM.
But that is the typical agentic LLM coder style program I was initially referring to, saying we maybe should explore other alternatives to. It's too basic and primitive, with some imagination.
The typical "best practice" for these tools tends to be to ask it something like:
"I want you to do feature X. Analyse the code for me and make suggestions how to implement this feature."
Then it will go off and work for a while and typically come back after a bit with some suggestions. You iterate on those if needed and end with:
"Ok. Now take these decided upon ideas and create a plan for how to implement. And create new tests where appropriate."
Then it will go off and come back with a plan for what to do. And then you send it off with:
"Ok, start implementing."
So sure. You probably can work on this to make it easier to use than with a CLI chat. It would likely be less like an IDE and more like a planning tool you'd use with human colleagues though.
Aider can be a chat interface and it's great for that but you can also use it from your editor by telling it to watch your files.[1]
So you'd write a function name and then tell it to flesh it out.
It then becomes a filled-in function once Aider picks up the comment (see the sketch below). Last I looked, Aider's maintainer has had to focus on other things recently, but aider-ce is a fantastic fork. I'm really curious to try Mistral's Vibe, but even though I'm a big fanboi I don't want to be tied to just one model. Aider lets you tier your models such that your big, expensive model does all the thinking and then stuff like code reviews can run through a smaller model. It's a pretty capable tool.
Edit: Fix formatting
[1] https://aider.chat/docs/usage/watch.html
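For illustration only (not from the comment above), here is a minimal sketch of what that watch-files flow can look like, assuming Aider's file-watching mode and the AI!-comment convention from the linked docs; the function itself is a made-up example:

    # utils.py -- a file Aider is watching (started with something like `aider --watch-files`)
    def slugify(title: str) -> str:
        # turn a blog post title into a lowercase, dash-separated, URL-safe slug AI!
        ...

On save, Aider notices the comment ending in "AI!", fills in the function body (and by default commits it), and you can diff or revert the result like any other commit.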
> I don't want to be tied to just one model.
Very much this for me - I really don't get why, given a new models are popping out every month from different providers, people are so happy to sink themselves into provider ecosystems when there are open source alternatives that work with any model.
The main problem with Aider is it isn't agentic enough for a lot of people but to me that's a benefit.
I created a very unprofessional tool, which apparently does what you want!
While True:
0. Context injected automatically. (My repos are small.)
1. I describe a change.
2. LLM proposes a code edit. (Can edit multiple files simultaneously. Only one LLM call required :)
3. I accept/reject the edit.
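Not the parent's actual tool, just a rough sketch of that loop as described above, assuming any OpenAI-compatible endpoint; the model name and the diff-application step are placeholders:

    # Sketch of the describe -> propose -> accept/reject loop described above.
    from pathlib import Path
    from openai import OpenAI  # works against any OpenAI-compatible endpoint

    client = OpenAI()

    def gather_context(repo: Path) -> str:
        # 0. Inject the (small) repo as context automatically.
        return "\n\n".join(f"# {p}\n{p.read_text()}" for p in repo.rglob("*.py"))

    def propose_edit(repo: Path, request: str) -> str:
        # 1-2. One LLM call returns a proposed edit, possibly spanning multiple files.
        resp = client.chat.completions.create(
            model="devstral-2512",  # placeholder model name
            messages=[
                {"role": "system", "content": "Reply with a unified diff only."},
                {"role": "user", "content": gather_context(repo) + "\n\n" + request},
            ],
        )
        return resp.choices[0].message.content

    while True:
        request = input("Describe a change (or 'q' to quit): ")
        if request == "q":
            break
        print(propose_edit(Path("."), request))
        # 3. Accept or reject; actually applying the diff is left out of this sketch.
        if input("Apply it? [y/N] ").lower() == "y":
            print("(apply the diff here, e.g. pipe it to `git apply`)")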
> run locally for agentic coding. Nowadays I mostly use GPT-OSS-120b for this
What kind of hardware do you have to be able to run a performant GPT-OSS-120b locally?
RTX Pro 6000, ends up taking ~66GB when running the MXFP4 native quant with llama-server/llama.cpp and max context, as an example. Guess you could do it with two 5090s with slightly less context, or different software aimed at memory usage efficiency.
That has 96GB GDDR7 ECC, to save people looking it up.
The model is 64GB (int4 native), add 20GB or so for context.
There are many platforms out there that can run it decently.
AMD Strix Halo, Mac platforms, two (or three, if you don't want to lean on extra system RAM) of the new AMD AI Pro R9700 (32 GB of VRAM, $1,200), multi-consumer-GPU setups, etc.
MBP with 128 GB.
I'm sure I'm not the only one that thinks "Vibe CLI" sounds like an unserious tool. I use Claude Code a lot and little of it is what I would consider Vibe Coding.
They're looking for free publicity. "This French company launched a tool that lets you 'vibe' an application into being. Programmers outraged!"
Using LLM's to write code is inherently best for unserious work.
"Not reviewing generated code" is the problem. Not the LLM generated code.
These are the cutting insights I come to HN for.
these are just old senior devs not wanting to accept new changes in the industry.
These are the cutting insights I come to HN for.
If you’re letting Claude write code you’re vibe coding
So people have different definitions of the word, but originally Vibe Coding meant "don't even look at the code".
If you're actually making sure it's legit, it's not vibe coding anymore. It's just... Backseat Coding? ;)
There's a level below that I call Power Coding (like power armor) where you're using a very fast model interactively to make many very small edits. So you're still doing the conceptual work of programming, but outsourcing the plumbing (LLM handles details of syntax and stdlib).
Peer coding?
Maybe common usage is shifting, but Karpathy's "vibe coding" was definitely meant to be a never look at the code, just feel the AI vibes thing.
I know tech bros like to come up with fancy words to make trivial things sound fancy, but as long as it's a slop-out process, it's vibe coding. If you're fixing what a bot spits out, that should be a different word... something painful that could've been avoided?
Also, we’re both “people in tech”, we know LLMs can’t conceptualise beyond finding the closest collection of tokens rhyming with your prompt/code. Doesn’t mean it’s good or even correct. So that’s why it’s vibe coding.
> If you're actually making sure it's legit, it's not vibe coding anymore.
Sorry to disappoint you, but that has also been considered vibe coding. It is just not pejorative.
Pretty sure Karpathy coined the term here: https://x.com/karpathy/status/1886192184808149383
Imo, if you read the code, it's no longer vibecoding.
The original definition was very different. The main thing with vibe coding is that you don't care about the code. You don't even look at the code. You prompt, test that you got what you wanted, and move on. You can absolutely use cc to vibe code. But you can also use it to ... code based on prompts. Or specs. Or docs. Or whatever else. The difference is if you want / care to look at the code or not.
It sure doesn't feel like it, given how closely I have to babysit Claude Code; otherwise I don't recognize the code after it has been left to its own devices for a minute.
It gets pretty close for me, but I usually tell it how I want it done from the get go.
Maybe they are just trying to be funny.
Let's say you had a hardware budget of $5,000. What machine would you buy or build to run Devstral Small 2? The HuggingFace page claims it can run on a Mac with 32 GB of memory or an RTX 4090. What kind of tokens per second would you get on each? What about DGX Spark? What about RTX 5090 or Pro series? What about external GPUs on Oculink with a mini PC?
All those choices seem to have very different trade-offs? I hate $5,000 as a budget - not enough to launch you into higher-VRAM RTX Pro cards, too much (for me personally) to just spend on a "learning/experimental" system.
I've personally decided to just rent systems with GPUs from a cloud provider and setup SSH tunnels to my local system. I mean, if I was doing some more HPC/numerical programming (say, similarity search on GPUs :-) ), I could see just taking the hit and spending $15,000 on a workstation with an RTX Pro 6000.
For grins:
Max t/s for this and smaller models? RTX 5090 system. Barely squeezing in for $5,000 today and given ram prices, maybe not actually possible tomorrow.
Max CUDA compatibility, slower t/s? DGX Spark.
Ok with slower t/s, don't care so much about CUDA, and want to run larger models? Strix Halo system with 128gb unified memory, order a framework desktop.
Prefer Macs, might run larger models? M3 Ultra with memory maxed out. Better memory bandwidth speed, mac users seem to be quite happy running locally for just messing around.
You'll probably find better answers heading off to https://www.reddit.com/r/LocalLLaMA/ for actual benchmarks.
> I've personally decided to just rent systems with GPUs from a cloud provider and setup SSH tunnels to my local system.
That's a good idea!
Curious about this, if you don't mind sharing:
- what's the stack ? (Do you run like llama.cpp on that rented machine?)
- what model(s) do you run there?
- what's your rough monthly cost? (Does it come up much cheaper than if you called the equivalent paid APIs)
I ran Ollama first because it was easy, but now I download the source and build llama.cpp on the machine. I don't bother saving a file system between runs on the rented machine; I build llama.cpp every time I start up.
I am usually just running gpt-oss-120b or one of the Qwen models. Sometimes Gemma? These are mostly "medium" sized in terms of memory requirements - I'm usually trying unquantized models that will easily run on a single 80-ish GB GPU because those are cheap.
I tend to spend $10-$20 a week. But I am almost always prototyping or testing an idea for a specific project that doesn't require me to run 8 hrs/day. I don't use the paid APIs for several reasons but cost-effectiveness is not one of those reasons.
I know you say you don't use the paid APIs, but renting a GPU is something I've been thinking about, and I'd be really interested in knowing how this compares with paying by the token. I think gpt-oss-120b is $0.10 input / $0.60 output per million tokens on Azure. In my head this could go a long way, but I haven't used gpt-oss agentically long enough to really understand usage. Just wondering if you know/would be willing to share your typical usage/token spend on that dedicated hardware?
I don't suppose you have (or would be interested in writing) a blog post about how you set that up? Or maybe a list of links/resources/prompts you used to learn how to get there?
No, I don't blog. But I just followed the docs for starting an instance on lambda.ai and the llama.cpp build instructions. Both are pretty good resources. I had already setup an SSH key with lambda and the lambda OS images are linux pre-loaded with CUDA libraries on startup.
Here are my lazy notes from a recent setup on the remote instance, where I used the web chat interface built into llama.cpp.
I created an instance gpu_1x_gh200 (96 GB on ARM) at lambda.ai.
Connected from a terminal on my box at home and set up the SSH tunnel:
ssh -L 22434:127.0.0.1:11434 ubuntu@<ip address of rented machine - can see it on lambda.ai console or dashboard>
Mistakes from that session: the first llama.cpp build was single-threaded and slow (use -j 16 for a faster build), and I didn't specify the port number for llama-server. I switched to Qwen3 VL because I needed a multimodal model for that day's experiment, and at first I didn't use the right name for the model. I like how llama.cpp can download and run models directly off of Hugging Face. Then I pointed my browser at http://localhost:22434 on my local box and had the normal browser window where I could upload files and use the chat interface with the model. That also gives you an OpenAI-API-compatible endpoint. It was all I needed for what I was doing that day. I spent a grand total of $4 that day doing the setup and running some NLP-oriented prompts for a few hours.
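To make the "OpenAI-API-compatible endpoint" bit concrete, here is a small sketch of my own (not from the comment above) of talking to the tunnelled llama-server from the local box, assuming the 22434 port from the ssh command above and llama-server's /v1 chat route:

    # Talk to the remote llama-server through the local end of the SSH tunnel.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:22434/v1",  # local side of the tunnel set up above
        api_key="not-needed",                  # llama-server ignores the key unless --api-key is set
    )

    resp = client.chat.completions.create(
        model="gpt-oss-120b",  # llama-server serves whichever model it was started with
        messages=[{"role": "user", "content": "Give me three test prompts for an NLP experiment."}],
    )
    print(resp.choices[0].message.content)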
Dual 3090s (24GB each) on 8x+8x PCIe has been a really reliable setup for me (with an NVLink bridge... even though it's relatively low bandwidth compared to Tesla NVLink, it's better than going over PCIe!).
48GB of VRAM and lots of CUDA cores; hard to beat this value atm.
If you want to go even further, you can get an 8x V100 32GB server complete with 512GB of RAM and NVLink switching for $7,000 USD from unixsurplus (ebay.com/itm/146589457908), which can run even bigger models and with healthy throughput. You would need 240V power to run that in a home lab environment though.
V100 is outdated (no bf16, dropped in CUDA 13) and power hungry (8 cards over 3 years of continuous use is about $12k of electricity).
I'd throw a 7900xtx in an AM4 rig with 128gb of ddr4 (which is what I've been using for the past two years)
Fuck nvidia
You know, I haven't even been thinking about those AMD gpus for local llms and it is clearly a blind spot for me.
How is it? I'd guess a bunch of the MoE models actually run well?
I've been running local models on an AMD 7800 XT with ollama-rocm. I've had zero technical issues. It's really just that the usefulness of a model fitting in 16GB of VRAM + 64GB of main RAM is questionable, but that isn't an AMD-specific issue. It was a similar experience running locally with an Nvidia card.
Get a Radeon AI Pro r9700! 32GB of RAM
I'm glad it's not another LLM CLI that uses React. Vibe-cli seems to be built with https://github.com/textualize/textual/
I'm not excited that it's done in Python. I've had experience with Aider struggling to display text as fast as the LLM is spitting it out, though that was probably 6 months ago now.
Python is more than capable of doing that. It’s not an issue of raw execution speed.
https://willmcgugan.github.io/streaming-markdown/
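For what it's worth, here is a tiny sketch of my own (not from the linked post) of the kind of streaming markdown rendering being discussed, using Rich, which Textual builds on; the token stream here is faked:

    # Re-render streamed markdown in place with Rich's Live display.
    import time
    from rich.live import Live
    from rich.markdown import Markdown

    fake_tokens = "# Heading\n\nSome **bold** text, a `code span`, and a list:\n\n- one\n- two\n".split(" ")

    buffer = ""
    with Live(refresh_per_second=30) as live:
        for token in fake_tokens:          # in real use: tokens from the LLM stream
            buffer += token + " "
            live.update(Markdown(buffer))  # re-render the partial document each tick
            time.sleep(0.05)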
Just added it to our inventory. For those of you using Nix:
The repo is updated daily.
This is such a cool project. Thanks for sharing.
10x cheaper price per token than Claude, am I reading it right?
As long as it doesn't mean 10x worse performance, that's a good selling point.
Something like GPT-5 mini is a lot cheaper than even Haiku, but when I tried it, in my experience it was so bad it was a waste of time. It's probably still more than 1/10 the performance of Haiku, though?
In work, where my employer pays for it, Haiku tends to be the workhorse with Sonnet or Opus when I see it flailing. On my own budget I’m a lot more cost conscious, so Haiku actually ends up being “the fancy model” and minimax m2 the “dumb model”.
Even if it is 10x cheaper and 2x worse it's going to eat up even more tokens spinning its wheels trying to implement things or squash bugs and you may end up spending more because of that. Or at least spending way more of your time.
The SWE-bench results place it at a comparable score to other open models and just a few points below the top-notch models, though.
Is it? The actual SOTA are not amazing at coding, so at least for me there is absolutely no reason to optimize on price at the moment. If I am going to use an LLM for coding it makes little sense to settle for a worse coder.
I dunno. Even pretty weak models can be decently performant, and 9/10 the performance for 1/10 the price means 10x the output, and for a lot of stuff that quality difference doesn't really matter. Considering even SOTA models are trash, slightly worse doesn't really make that much difference.
> SOTA models are "trash"
> this model is worse (but cheaper)
> use it to output 10x the amount of trashier trash
You've lost me.
Fair. Mostly the argument is: if all you need is to iterate on output to refine it, you get 10x the iterations; even at lesser quality, that's still an aspect to consider. But yes, why bother vibe coding when they do make so many mistakes.
This is great! I just made an AUR package for it: https://aur.archlinux.org/packages/mistral-vibe
Ah, finally! I was checking just a few days ago if they had a Claude Code-like tool as I would much rather give money to a European effort. I'll stop my Pro subscription at Anthropic and switch over and test it out.
Does anyone know where their SWE-bench Verified results are from? I can't find matching results on the leaderboards for their models or the Claude models and they don't provide any links.
I was briefly excited when Mistral Vibe launched and mentions "0 MCP Servers" in its startup screen... but I can't find how to configure any MCP servers. It doesn't respond to the /mcp command, and asking Devstral 2 for help, it thinks MCP is "Model Context Preservation". I'd really like to be able to run my local MCP tools that I wrote in Golang.
I'm team Anthropic with Claude Max & Claude Code, but I'm still excited to see Mistral trying this. Mistral has occasionally saved the day for me when Claude refused an innocuous request, and it's good to have alternatives... even if Mistral / Devstral seems to be far behind the quality of Claude.
Check this out: https://github.com/mistralai/mistral-vibe?tab=readme-ov-file...
Thank you! Finally got it working, had to comment out the mcp_servers line near the top of the config.toml file in ~/.vibe/, before adding my [[mcp_servers]] sections at the end of the file.
That was very helpful, thanks!
Just tried it out via their free API and the Roo Code VSCode extension, and it's impressive. It walked through a data analytics and transformation problem (150,000 dataset entries) I have been debugging for the past 2 hours.
> Mistral Code is available with enterprise deployments. > Contact our team to get started.
The competition is much smoother. Where are the subscriptions that would give users the coding agent and the chat for a flat fee, working out of the box?..
Very nice that there's a coding cli finally. I have a Mistral Pro account. I hope that it will be included. It's the main reason to have a Pro account tbh.
Open sourcing the TUI is pretty big news actually. Unless I missed something, I had to dig a bit to find it, but I think this is it: https://github.com/mistralai/mistral-vibe
Going to start hacking on this ASAP
Let's see which company becomes the first to sell "coding appliances": hardware with a model good enough for normal coding.
If Mistral is so permissive they could be the first ones, provided that hardware is then fast/cheap/efficient enough to create a small box that can be placed in an office.
Maybe in 5 years.
My MacBook Pro with an M4 Pro chip can handle a number of these models (I think it has 16GB of VRAM) with reasonable performance; my bottleneck is continually the token caps. I assume someone with a much more powerful Mac Studio could run way more than I can, considering they get access to about 96GB of VRAM out of the system RAM iirc.
I bought a framework desktop hoping to do this.
And it can do it, right? I think the AMD AI Max line is the first realistic offering for this type of thing.
The Apple offerings are interesting, but the lack of x86, Linux, and general compatibility makes them a hard sell imo.
my bet is a deepseek box
llm in a box connected via usb is the dream.
...so it won't ever happen, it'll require wifi and will only be accessible via the cloud, and you'll have to pay a subscription fee to access the hardware you bought. obviously.
Extremely happy with this release, the previous Devstral was great but training it for open hands crippled the usefulness. Having their own CLI dev tool will hopefully be better
Can you explain "training it for open hands"? I can't parse the meaning.
The original Devstral was a collaboration between All Hands AI (OpenHands) and Mistral [1]. You could use it with other agents, but you had to carry over the prompt. Even then, the agents still didn't work that well. I tried it in RooCline and it worked extremely poorly with the tool calls.
[1] https://openhands.dev/blog/devstral-a-new-state-of-the-art-o...
I'm so glad Mistral never sold out. We're really lucky to have them in the EU at a time when we're so focused on mil-tech etc.
I don't think it was ever an option, since it had ties with the French government early on (Cédric O) and Macron's party is quite pro-EU.
They let so many important French companies down. So, yes, it could happen despite this beginning.
They’ll switch to military tech the second it becomes necessary, don’t kid yourself. I’m just glad we have a European alternative for the day the US decides to turn its back on us.
This tech is simply too critical to pretend the military won’t use it. That’s clearer now than ever, especially after the (so far flop-ish) launch of the U.S. military’s own genAI platform.
They have already:
- https://helsing.ai/newsroom/helsing-and-mistral-announce-str... - https://sifted.eu/articles/mistral-helsing-defence-ai-action... - Luxembourg army chose Mistral: https://www.forcesoperations.com/la-pepite-francaise-mistral... - French army: https://www.defense.gouv.fr/actualites/ia-defense-sebastien-...
> I’m just glad we have a European alternative for the day the US decides to turn its back on us
Not sure you've kept up to date; the US has turned its back on most allies so far, including Europe and the EU, and now welcomes previous enemies with open arms.
Wow! BLUMPF has really done it this time! Excited to be part of the resistance!
It's not like there aren't already military AI startups in the EU. e.g. Helsing.
> I’m just glad we have a European alternative for the day the US decides to turn its back on us.
They did.
The system prompt and tool prompts for their open source (Apache 2 licensed) Python+Textual+Pydantic CLI tool are fun to read:
core/prompts/cli.md https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/c...
core/prompts/compact.md https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/c...
.../prompts/bash.md https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/c...
.../prompts/grep.md https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/c...
.../prompts/read_file.md https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/c...
.../prompts/write_file.md https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/c...
.../prompts/search_replace.md https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/c...
.../prompts/todo.md https://github.com/mistralai/mistral-vibe/blob/v1.0.4/vibe/c...
Based on your experience with Claude Code, how does Mistral Vibe compare?
I've not spent enough time with Mistral Vibe yet for a credible comparison, but given what I know about the underlying models (likely-1T-plus Opus 4.5 compared to the 123B Devstral 2) I'd be shocked if Vibe could out-perform Claude Code for the kinds of things I'm using it for.
Here's an example of the kinds of things I do with Claude Code now: https://gistpreview.github.io/?b64d5ee40439877eee7c224539452... - that one involved several from-scratch rewrites of the history of an entire Git repo just because I felt like it.
I gave it the job of modifying a fairly simple regex replacement and it took a while, over 5 minutes; Claude failed on the same prompt (which surprised me), and Codex did a similar job but faster. So all in all, not bad!
Off topic, but it hurts my eyes: I dislike their font choice and the "cool look" of their graphics.
The surprising and good part: everything, including the graphics, is fixed when I click my "Speedreader" button in Brave. So they are doing that "cool look" with CSS.
Yeah, it's a bit gimicky. You can hit `esc` and it will revert to the normal page design.
There's a scan-lines effect they apply to everything that's "cool", but it gets old after a minute.
Finally, we can use a European model to replace Claude Code.
Somehow it writes bad React code and skips the linting checks it is prompted to run about half the time. But surprisingly, the Python coding was great!
> Devstral 2 ships under a modified MIT license, while Devstral Small 2 uses Apache 2.0. Both are open-source and permissively licensed to accelerate distributed intelligence.
Uh, the "Modified MIT license" here[0] for Devstral 2 doesn't look particularly permissively licensed (or open-source):
> 2. You are not authorized to exercise any rights under this license if the global consolidated monthly revenue of your company (or that of your employer) exceeds $20 million (or its equivalent in another currency) for the preceding month. This restriction in (b) applies to the Model and any derivatives, modifications, or combined works based on it, whether provided by Mistral AI or by a third party. You may contact Mistral AI (sales@mistral.ai) to request a commercial license, which Mistral AI may grant you at its sole discretion, or choose to use the Model on Mistral AI's hosted services available at https://mistral.ai/.
[0] https://huggingface.co/mistralai/Devstral-2-123B-Instruct-25...
Personally I really like the normalization of these "Permissively" licensed models that only restrict companies with massive revenues from using them for free.
If you want to use something, and your company makes $240,000,000 in annual revenue, you should probably pay for it.
These are not permissively licensed though; the term "permissive license" has connotations that pretty much everyone who is into FLOSS understands (same with "open source").
I do not mind having a license like that, my gripe is with using the terms "permissive" and "open source" like that because such use dilutes them. I cannot think of any reason to do that aside from trying to dilute the term (especially when some laws, like the EU AI Act, are less restrictive when it comes to open source AIs specifically).
> I do not mind having a license like that, my gripe is with using the terms "permissive" and "open source" like that because such use dilutes them. I cannot think of any reason to do that aside from trying to dilute the term (especially when some laws, like the EU AI Act, are less restrictive when it comes to open source AIs specifically).
Good. In this case, let it be diluted! These extra "restrictions" don't affect normal people at all, and won't even affect any small/medium businesses. I couldn't care less that the term is "diluted" and that makes it harder for those poor, poor megacorporations. They swim in money already, they can deal with it.
We can discuss the exact threshold, but as long as these "restrictions" are so extreme that they only affect huge megacorporations, this is still "permissive" in my book. I will gladly die on this hill.
> Good. In this case, let it be diluted! These extra "restrictions" don't affect normal people at all,
Yes, they do, and the only reason for using the term “open source” for things whose licensing terms flagrantly defy the Open Source definition is to falsely sell the idea that using the code carries the benefits tied to the full combination of features in the definition, benefits which are lost with only a subset of those features. The freedom to use the software in commercial services is particularly important to end users who are not interested in running their own services: it is a guarantee against lock-in, and of whatever longevity they are able to pay to have provided, even if the original creator later has interests that conflict with offering the software as a commercial service.
If this deception wasn't important, there would be no incentive not to use the more honest “source available for limited uses” description.
> I couldn't care less that the term is "diluted" and that makes it harder
It also makes life harder for individuals and small companies, because this is not Open Source. It's incompatible with Open Source, it can't be reused in other Open Source projects.
Terms have meanings. This is not Open Source, and it will never be Open Source.
That's fine, but I don't think you should call it open source or call it MIT or even 'modified MIT.' Call it Mistral license or something along those lines
That's probably better, but "Modified MIT" is pretty descriptive. I read it as "mostly MIT, but with caveats for extreme cases", which is about right if you already know what the MIT license entails.
Whatever name they come up with for a new license will be less useful, because I'll have to figure out that this is what that is
imo this is a hill people need to stop dying on. Open source means "I can see the source" to most of the world. Wishing it meant "very permissively licensed" to everyone is a lost cause.
And honestly it wasn't a good hill to begin with: if what you are talking about is the license, call it "open license". The source code is out in the open, so it is "open source". This is why the purists have lost ground to practical usage.
> imo this is a hill people need to stop dying on.
As someone who was born and raised on FOSS, and still mostly employed to work on FOSS, I disagree.
Open source is what it is today because it's built by people with a spine who stand tall for their ideals even if it means less money, less industry recognition, lots of unglorious work and lots of other negatives.
It's not purist to believe that what built open source so far should remain open source, and not wanting to dilute that ecosystem with things that aren't open source, yet call themselves open source.
> Open source is what it is today because it's built by people with a spine who stand tall for their ideals even if it means less money, less industry recognition, lots of unglorious work and lots of other negatives.
With all due respect, don't you see the irony in saying "people with a spine who stand tall for their ideals", and then arguing that attaching "restrictions" which only affect the richest megacorporations in the world somehow makes the license not permissive anymore?
What ideals are those exactly? So that megacorporations have the right to use the software without restrictions? And why should we care about that?
> What ideals are those exactly?
Anyone can use the code for whatever purpose they want, in any way they want. I've never been a "rich megacorporation", but I have gone from having zero money to having enough money, and I still think the very same thing about the code I myself release as I did from the beginning: it should be free to be used by anyone, for any purpose.
You should stand up for your ideals, but dying on the hill of what you call your ideals is actually getting in the way of that.
Because instead of making the point "this license isn't as permissive as it could/should be" (easy to understand), instead the point being made is "this isn't real open source", which comes across to most people as just some weird gate-keeping / No True Scotsman kinda thing.
"No True Scotsman" is about specifically about changing the rules to exclude a new example you don't want to permit. The rules haven't changed, and the attempts to violate the requirements aren't new. Proprietary licenses continue to be proprietary. Open Source continues to not allow restrictions on commercial use.
No, “No True Scotsman” is just about people, not categories like open source.
Good job missing the point.
Though given the stance you are taking in this conversation, I'm not surprised you want to quibble over that.
¯\_(ツ)_/¯
Ultimately you have to imbue words with meaning, otherwise it is impossible to have a discussion. What I said about No True Scotsman was false; I was just trying to prove a point.
What point were you proving?
And back in the day, people incorrectly called it "public domain". That was wrong too.
> if what you are talking about is the license, call it "open license".
If you want to build something proprietary, call it something else. "Open Source" is taken.
> Open source means "I can see the source" to most of the world
well we don't really want to open that can of worms though, do we?
I don't agree with ceding technical terms to the rest of the world. I'm increasingly told we need to stop calling cancer detection AI "AI" or "ML" because it is not the 'bad AI' and confuses people.
I guess I'm okay with being intransigent.
If you are happy that time is being spent quibbling over definitions instead of actually focusing on the ideal, I'm not sure you care about the ideals as much as you say you do.
Who gives a shit what we call "cancer AI", what matters is the result.
I don't think you get access to source in this case. The release is a binary blob.
You're presently illustrating exactly why Stallman et al were such sticklers about "Free Software."
"Open Source" is nebulous. It reasonably works here, for better or worse.
>"Open Source" is nebulous
No it isn't; it is well defined. The only people who find it "nebulous" are people who want the benefits without upholding the obligations.
https://opensource.org/definition-annotated
Free software to me means GPL and associates, so if that is what Stallman was trying to be a stickler for - it worked.
Open source has a well-understood meaning, including licenses like MIT and Apache - but not including "MIT, but only if you make less than $500 million", "MIT, unless you were born on a Wednesday", etc.
MIT and Apache are free software licenses in Stallman's sense, and the FSF has always been clear about it.
Earnestly, what's the concern here? People complain about open source being mostly beneficial to megacorps, if that's the main change (idk I haven't looked too closely) then that's pretty good, no?
They are claiming something is open-source when it isn’t. Regardless of whether you think the deviation from open-source is a good thing or not, you should still be in favour of honesty.
*according to your definition of open-source
No, according to the commonly accepted definition of open-source.
Whenever anybody tries to claim that a non-commercial licenses is open-source, it always gets complaints that it is not open-source. This particular word hasn’t been watered down by misuse like so many others.
There is no commonly-accepted definition of open-source that allows commercial restrictions. You do not get to make up your own meaning for words that differs from how other people use it. Open-source does not have commercial restrictions by definition.
Where are you getting this compendium of commonly-accepted definitions?
Looking up open-source in the dictionary does include definitions that would allow for commercial restrictions, depending on how you define "free" (a matter that is most certainly up for debate).
"Open-source" isn't a term that emerged organically from conversations between people. It is a term that was very deliberately coined for a specific purpose, defined into existence by an authority. It's a term of art, and its exact definition is available here: https://opensource.org/osd
The term "open-source" exists for the purposes of a particular movement. If you are "for" the misuse and abuse of the term, you not only aren't part of that movement, but you are ignorant about it and fail to understand it— which means you frankly have no place speaking about the meanings of its terminology.
yeahhhhhhh, that's not how this works.
Unless this authority has some ownership over the term and can prevent its misuse (e.g. with lawsuits or similar), it is not actually the authority of the term, and people will continue to use it how they see fit.
Indeed, I am not part of a movement (nor would I want to be) which focuses more on what words are used rather than what actions are taken.
> people will continue to use it how they see fit.
And whenever they do so, this pointless argument will happen. Again, and again, and again. Because that’s not what the word means and your desired redefinition has been consistently and continuously rejected over and over again for decades.
What do you gain from misusing this term? The only thing it does is make you look dishonest and start arguments.
> people will continue to use it how they see fit.
People can also say 2+2=5, and they're wrong. And people will continue to call them out on it. And we will keep doing so, because stopping lets people move the Overton window and try to get away with even more.
2+2 is a mathematical concept. Definitions do not need to be agreed upon beyond fundamental axioms.
The same is not true for "open source", which is a purely linguistic construct.
*according to the industry standard definition of Open Source
This kind of thing is how people try to shift the Overton window. No.
"I don't know anything about open source licenses hence I must spread my ignorance everywhere"
Is there some Open Source™ council I am unaware of that bequeaths the open source moniker on certain licenses?
Yes, literally: https://opensource.org/licenses
So if I invent a new license and call it "open source", they will sue me, or...?
Mainly about the dilution of the term. Though TBH I do not think that open source is beneficial mostly to megacorps either.
Mistral has used janky licenses like that a few times in the past. I was hoping the competition from China might have snapped them out of it.
All "Open Source" licenses are to an extent, janky. Obligatory "Stallman was right;" -- If it's not GPL/Free Software, YMMV.
Is such a term even enforceable? How would it be? How could Mistral know how much a company makes if that information isn't public?
They don't have to enforce it, evil megacorps won't risk the legal consequences of using it without talking to Mistral first. In reality they just won't use it.
> Model Size (B tokens)
How is that a measure of model size? It should either be parameter size, activated parameters, or cost per output token.
Looks like a typo because the models line up with reported param sizes.
PSA: a 10x saving is not actually a saving when you have to prompt it 10 times to get the correct solution.
Will definitely try Mistral Vibe with gpt-oss-20b.
I am very disappointed they don't have an equivalent subscription for coding to the 200 EUR ChatGPT or Claude one, and it is only available for Enterprise deployments.
The only thing I found is a pay-as-you-go API, but I wonder if it is any good (and cost-effective) vs Claude et al.
> Devstral 2 is currently offered free via our API. After the free period, the API pricing will be $0.40/$2.00 per million tokens (input/output) for Devstral 2
With pricing so low, I don't see any reason why someone would buy a sub for 200 EUR. These days those subs are much more limited in Claude Code or Cursor than they used to be (or used to be unlimited). Better to pay as you go, especially when there are days when you probably use AI less or not at all (weekends/holidays etc.), as long as those credits don't expire.
True, I just wish I could pay once for code AND the chat, but the chat subscription does not include Code sadly.
At these rates you can afford to pay by the token.
In a figure: Model size (B tokens)?
Did anyone test how up to date its knowledge is?
After querying the model about .NET, it seems that its knowledge comes from around June 2024.
I confirm that. It had no idea how to use Deno v2+.
Looks like another Deepseek distil like the new Ministrals. For every other use case that would be an insult, but for coding that's a great approach given how much lead in coding performance Qwen and Deepseek have on Mistral's internal datasets. The Small 24B seems to have a decent edge on 30BA3B, though it'll be comparatively extremely slow to run.
Can Vibe CLI help me vibe code PRs for when I vibe on the https://github.com/buttplugio/buttplug repo?
You can do anything if you believe.
Yet another CLI.
Why does every AI provider need to have its own tool, instead of contributing to existing tools like Roo Code or Opencode?
My 2ct: Because providers want to make their model run optimally and maybe some of them try to build a moat.
> providers want to make their model run optimally
Because they couldn't do it by contributing to existing opensource tools?
Modified MIT?????
Just call it Mistral License & flush it down