Show HN: Caliper – pass@k reliability testing for Claude Code and Codex skills

(github.com)

3 points | by edonadei 3 days ago

3 comments

$edonadei 3 days ago

FYI, here's a really good article on evaluation. I based much of my research and iteration on it: https://www.anthropic.com/engineering/demystifying-evals-for...
$ocramz 2 days ago

Very nice! Does it also pin the model version and the random seed? That would be crucial for interpreting the scores (IIRC, OpenAI let you do that in the past).
- $edonadei a day ago
  
  Hey, glad you appreciate it! Model version: yes, you can override it at runtime with --model and --judge-model, or set it in the YAML spec.
  For the seed: no, I haven't added it, and I don't think harnesses like Claude Code or Codex support it at the moment. (You're right that the OpenAI API exposes a best-effort seed parameter, though.)
  But TBH I wasn't trying too hard to control the LLM itself. I remember reading some nice work by Thinking Machines on that. Here I decided to focus much more on pass@k, because I'm targeting a "classic" usage that doesn't play with seed or temperature: just the model with sane defaults.
  If you feel that could be useful, please file an issue and I'll consider adding it :)
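
For readers unfamiliar with the metric: pass@k is the probability that at least one of k sampled runs passes. A minimal sketch of the commonly used unbiased estimator (popularized by the HumanEval/Codex paper) is below. This is an illustration only, not necessarily how Caliper computes it, and the function name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n independent runs of which c passed.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    a random size-k subset of the n runs contains no passing run.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failures: every size-k subset includes a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Example: 10 runs, 3 passed.
    for k in (1, 5, 10):
        print(f"pass@{k} = {pass_at_k(10, 3, k):.3f}")
```

Because the estimate averages over all size-k subsets of the n runs, it stays meaningful without a fixed seed, as long as n is comfortably larger than k.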