Agentic eval results — local vs cloud models on tool use, planning, memory, and more.
Can models other than the big cloud APIs actually drive an autonomous agent? Tool calling, multi-step planning, memory retrieval, skill orchestration, error recovery — the basics of an agentic loop. We’re running a growing set of models through the same eval harness to find out.
A note on methodology: This is a qualitative smoke test, not a rigorous scientific evaluation. Each model is run once through the same 24 prompts — there are no repeated trials, no statistical significance tests, and no controlled ablations. The results are directional: useful for spotting capability gaps and failure modes, not for making definitive claims about model rankings.
The harness is Curunir, a Python agentic framework with basic tools (grep, read, write, bash, web fetch), loadable skills, persistent memory, and multiple channels. The model receives tool schemas via JSON function calling, decides which tools to use, executes them, reads results, and loops until it has an answer.
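That loop is the core of the whole eval. A minimal sketch of it, assuming a toy model and toy tools (the names `TOOLS`, `fake_model`, and `agent_loop` are illustrative, not Curunir's actual API):

```python
import json

# Hypothetical tool registry -- stand-ins for the real grep/read/write/bash/fetch tools.
TOOLS = {
    "read": lambda path: f"<contents of {path}>",
    "bash": lambda cmd: f"<output of {cmd}>",
}

def fake_model(messages):
    """Stand-in for a model API call: requests one tool, then answers."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "read", "arguments": {"path": "README.md"}}}
    return {"content": "Done: summarized README.md"}

def agent_loop(prompt, model, max_steps=10):
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_steps):
        reply = model(messages)
        call = reply.get("tool_call")
        if call is None:
            return reply["content"]  # no tool requested: final answer, exit loop
        # Execute the requested tool and feed the result back to the model.
        result = TOOLS[call["name"]](**call["arguments"])
        messages.append({"role": "tool", "content": json.dumps(result)})
    return "<max steps reached>"

print(agent_loop("Summarize the README", fake_model))
```

Every model under test runs exactly this kind of loop; what differs is whether it picks sensible tools, reads the results correctly, and knows when to stop.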
24 prompts across 8 categories:
| Category | What it tests |
|---|---|
| Tool Use Accuracy | Pick the right tool, use it correctly |
| Multi-Step Planning | Decompose complex requests into tool call sequences |
| Memory Retrieval | Find and synthesize info from persistent memory files |
| Instruction Following | Respect constraints (“don’t use tools”, “only answer from this file”) |
| Skill Orchestration | Load and execute multi-step skills (web search, research) |
| Error Recovery | Handle missing files, failed commands, empty results |
| Output Quality | Explain architecture, compare tools, identify design decisions |
| Efficiency | Solve simple questions without unnecessary tool calls |
Same harness, same prompts, same tools, same system prompt. The only variable is the model. Claude Sonnet 4.6 is the baseline — every other model is compared against it.
Each article compares one model against Sonnet 4.6, prompt by prompt. Where they’re equal, where one is better, where they fail.
More comparisons coming as we run additional models through the harness.
Follow-ups to the main articles — performance characterization, reruns, and other supporting pieces that don’t fit the model-vs-Sonnet format.
The eval prompts and harness script are public.