Nice work. One thing I'd love to see in the benchmark: a breakdown by question type (aggregations vs. multi-hop joins vs. lookups). My guess is the SQL approach pulls ahead hardest on the join-heavy ones, and showing that explicitly would make the "too good to be true" results feel more grounded. Either way, the token efficiency numbers sound intriguing.
It's not explicitly stated in the benchmarks README, good catch.
80% of the benchmark questions are aggregations, 16% are multi-hop, 4% are lookups/subqueries.
Multi-hop is where LLMs struggle the most (hallucinations, partial answers), and aggregations are where you get the most token efficiency, since you skip the pagination you'd need with APIs/MCPs that don't provide filtering.
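To make the pagination point concrete, here's a minimal sketch of the two access patterns, using stdlib sqlite3 as a stand-in for the synced warehouse (table and column names are made up for illustration, not Dinobase's actual schema). The paginated path mirrors an agent working a list-only API: every page of rows lands in the model's context, so tokens scale with row count. The SQL path returns one small number.

```python
import sqlite3

# Toy "charges" table standing in for synced Stripe data.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE charges (id INTEGER, amount_cents INTEGER, status TEXT)")
con.executemany(
    "INSERT INTO charges VALUES (?, ?, ?)",
    [(i, 1000 + i, "succeeded" if i % 10 else "failed") for i in range(1, 501)],
)

def paginated_total(page_size=100):
    # No server-side filter: fetch every page and filter/sum client-side,
    # the way an agent works an API/MCP that only offers list endpoints.
    total, offset = 0, 0
    while True:
        rows = con.execute(
            "SELECT amount_cents, status FROM charges LIMIT ? OFFSET ?",
            (page_size, offset),
        ).fetchall()
        if not rows:
            return total
        total += sum(amount for amount, status in rows if status == "succeeded")
        offset += page_size

def sql_total():
    # One aggregate query, one tiny result: near-constant token cost.
    return con.execute(
        "SELECT SUM(amount_cents) FROM charges WHERE status = 'succeeded'"
    ).fetchone()[0]

assert paginated_total() == sql_total()
```

Same answer either way; the difference is that the paginated version dragged 500 rows through the context window to get it.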
How does it handle schema drift (e.g. a SaaS vendor changes a column)? Does the annotation agent mark breaking changes in some way, or just describe the current state of the world? With that many connections, you'll hit a bunch of weird edge cases, especially with things like Salesforce custom objects.
Right now schema changes create new columns; old columns are not dropped, and I'm working on reconciling them.
The annotation/semantic-layer agent creates a new description of the schema on each sync, which represents the current state, but as of today that includes stale columns; data is not dropped.
I’ll implement automated schema migrations in the next week or so!
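For anyone curious what the additive behavior described above looks like, here's a rough sketch under my own assumptions (sqlite3 as the warehouse, hypothetical table/column names; this is not Dinobase's code). New vendor columns are added, removed or renamed ones are left in place as stale columns:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE contacts (id INTEGER, email TEXT, phone TEXT)")

def additive_sync(table, incoming_columns):
    """Add columns the vendor introduced; never drop the ones it removed."""
    existing = {row[1] for row in con.execute(f"PRAGMA table_info({table})")}
    for name, sqltype in incoming_columns.items():
        if name not in existing:
            con.execute(f"ALTER TABLE {table} ADD COLUMN {name} {sqltype}")
    # Columns in `existing` but missing from `incoming_columns` (e.g. the old
    # half of a rename) are kept as stale columns, matching current behavior.
    return {row[1] for row in con.execute(f"PRAGMA table_info({table})")}

# Vendor renamed `phone` -> `phone_number` and added `lifecycle_stage`.
cols = additive_sync("contacts", {
    "id": "INTEGER", "email": "TEXT",
    "phone_number": "TEXT", "lifecycle_stage": "TEXT",
})
assert "phone" in cols and "phone_number" in cols  # both halves of the rename survive
```

The promised automated migrations would presumably be the step that reconciles `phone` into `phone_number` instead of keeping both.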
Interesting approach, makes a lot of sense. Looks promising!
Thank you, if you happen to try it out or have ideas, feel free to contact me at: em at dinobase dot ai
Could you give more context on the benchmarks included in the repo?
It's an experimental benchmark, I couldn't find any off-the-shelf benchmarks to use this with. There's Spider 2.0 but it's for text-to-SQL. I'm planning to run this [1] next but it's quite expensive.
There are 75 questions, divided into 5 use-case groups: revenue ops, e-commerce, knowledge bases, devops, support.
I then generated a synthetic dataset with data mimicking APIs ranging from Stripe to HubSpot to Shopify to Zendesk, etc.
I expose all the data through Dinobase vs. having one MCP per source, e.g. one MCP for Stripe data, one for HubSpot data, etc.
I tested this with 11 models, ranging from Kimi 2.5 to Claude Opus 4.6.
Finally there's an LLM-as-a-judge that decides if the answer is correct, and I log latency and tokens.
[1] https://arxiv.org/abs/2510.02938
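The harness described above (LLM-as-a-judge plus latency/token logging) can be sketched roughly like this; the function names and stubbed models are my own illustration, not the repo's actual code:

```python
import time

def run_benchmark(question, answer_fn, judge_fn):
    """Time one question, capture token usage, let a judge model grade it."""
    start = time.perf_counter()
    answer, tokens_used = answer_fn(question)    # model under test
    latency = time.perf_counter() - start
    verdict = judge_fn(question, answer)         # LLM-as-a-judge verdict
    return {
        "question": question,
        "correct": verdict == "correct",
        "latency_s": round(latency, 3),
        "tokens": tokens_used,
    }

# Stub both models so the harness runs without any API keys.
result = run_benchmark(
    "What was total revenue last month?",
    answer_fn=lambda q: ("$12,340", 857),
    judge_fn=lambda q, a: "correct",
)
assert result["correct"] and result["tokens"] == 857
```

In the real runs the two lambdas would be replaced by calls to the model under test and the judge model, with the per-question dicts aggregated into the reported accuracy/latency/token numbers.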
Which LLM is best at driving DuckDB currently?
DuckDB's SQL dialect closely follows PostgreSQL, and most coding LLMs have been trained extensively on that.
Of the small models I tested, Qwen 3.5 is the clear winner. Going to larger LLMs, Sonnet and Opus lead the charts.