Agentic eval results — local vs cloud models on tool use, planning, memory, and more.
Can models other than the big cloud APIs actually drive an autonomous agent? Tool calling, multi-step planning, memory retrieval, skill orchestration, error recovery — the basics of an agentic loop. We’re running a growing set of models through the same eval harness to find out.
A note on methodology: This is a qualitative smoke test, not a rigorous scientific evaluation. Each model is run once through the same 24 prompts — there are no repeated trials, no statistical significance tests, and no controlled ablations. The results are directional: useful for spotting capability gaps and failure modes, not for making definitive claims about model rankings.
The harness is Curunir, a Python agentic framework with basic tools (grep, read, write, bash, web fetch), loadable skills, persistent memory, and multiple channels. The model receives tool schemas via JSON function calling, decides which tools to use, executes them, reads results, and loops until it has an answer.
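That loop is the core of the whole eval. A minimal sketch of it, assuming a toy model and toy tools (the names `TOOLS`, `fake_model`, and `agent_loop` are illustrative, not Curunir's actual API):

```python
import json

# Hypothetical tool registry -- stand-ins for the real grep/read/write/bash/fetch tools.
TOOLS = {
    "read": lambda path: f"<contents of {path}>",
    "bash": lambda cmd: f"<output of {cmd}>",
}

def fake_model(messages):
    """Stand-in for a model API call: requests one tool, then answers."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "read", "arguments": {"path": "README.md"}}}
    return {"content": "Done: summarized README.md"}

def agent_loop(prompt, model, max_steps=10):
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_steps):
        reply = model(messages)
        call = reply.get("tool_call")
        if call is None:
            return reply["content"]  # no tool requested: final answer, exit loop
        # Execute the requested tool and feed the result back to the model.
        result = TOOLS[call["name"]](**call["arguments"])
        messages.append({"role": "tool", "content": json.dumps(result)})
    return "<max steps reached>"

print(agent_loop("Summarize the README", fake_model))
```

Every model under test runs exactly this kind of loop; what differs is whether it picks sensible tools, reads the results correctly, and knows when to stop.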
24 prompts across 8 categories:
| Category | What it tests |
|---|---|
| Tool Use Accuracy | Pick the right tool, use it correctly |
| Multi-Step Planning | Decompose complex requests into tool call sequences |
| Memory Retrieval | Find and synthesize info from persistent memory files |
| Instruction Following | Respect constraints (“don’t use tools”, “only answer from this file”) |
| Skill Orchestration | Load and execute multi-step skills (web search, research) |
| Error Recovery | Handle missing files, failed commands, empty results |
| Output Quality | Explain architecture, compare tools, identify design decisions |
| Efficiency | Solve simple questions without unnecessary tool calls |
Same harness, same prompts, same tools, same system prompt. The only variable is the model. Claude Sonnet 4.6 is the baseline — every other model is compared against it.
Each article compares one model against Sonnet 4.6, prompt by prompt. Where they’re equal, where one is better, where they fail.
More comparisons coming as we run additional models through the harness.
Follow-ups to the main articles — performance characterization, reruns, and other supporting pieces that don’t fit the model-vs-Sonnet format.
The eval prompts and harness script are public.