Curunir Evals

Agentic eval results — local vs cloud models on tool use, planning, memory, and more.


Companion to the original 26B agentic eval and the 26B perf report. Same model weights, same prompts, same MacBook Pro — new llama.cpp build, fresh run. What actually changes?

Gemma 4 26B, Rerun: What Changes on a Newer llama.cpp?

When I first ran Gemma 4 26B through the 24-prompt eval, it completed every prompt on an older llama.cpp build and held its own against Sonnet 4.6 on the basics. Between then and now I benchmarked the same model on the current llama.cpp build — b8738 vs. the earlier b8660 — and picked up a small but consistent throughput gain. Natural question: does the engine update change the capability story?

Short answer: mostly no, with one regression and one interesting re-test. The model weights are unchanged, so the ceiling is unchanged. What moved between runs is sampling variance, harness prompts, and the filesystem around the eval — not the model.

This isn’t a model comparison. It’s a reproducibility smoke test, and an excuse to revisit one of the claims from the original article.

The Setup

| | Run 1 (April 3) | Run 2 (April 9) |
| --- | --- | --- |
| Model | unsloth/gemma-4-26B-A4B-it-GGUF (Q8_0) | Same weights |
| Hardware | MacBook Pro, Apple M5 Pro, 48 GB unified | Same |
| Inference engine | llama.cpp b8660 (d00685831) | llama.cpp b8738 (d6f303004) |
| Model router | llama-swap v199 (8fabc756) | Same |
| Harness commit | fe87420 | e9c946a |

The only intentional change is the llama.cpp build. Two unintentional confounds to name up front: the harness was updated between runs, so prompts 15 and 21 were rewritten, and the test filesystem picked up an eval/ directory that didn’t exist during Run 1. Neither is a change in the model.

Full results: Run 1 (Apr 3), Run 2 (Apr 9)

Results

Run 2 completed 23 of 24 prompts. Run 1 completed 24.

One regression on prompt 24. Everything else either matches the first run or lands inside the wiggle room any non-deterministic decoder has. Same tools picked, same files read, same conclusions reached — with some rephrasing and the occasional extra paragraph.

Where They’re Substantively Identical (19 prompts)

On 19 of 24 prompts, the two runs are interchangeable. Some condensed highlights:

On these, nothing capability-relevant moved.

Where Run 2 Is Meaningfully Better (2 prompts)

Prompt 6 — WebSocket trace: Run 1 used 13 tool calls to produce a correct trace that ended at dispatcher.py. Run 2 used 7 tool calls and went deeper: it read src/agent/agent.py on top of the WebSocket and queue files, then traced the handle() method, the LLM call, and the tool dispatch extraction:

The agent extracts the tool name and arguments: name = tool_call["function"]["name"] (Line 251) and args_str = tool_call["function"]["arguments"] (Line 252). It triggers the on_tool_call callback (Line 258) to allow the UI (the WebSocket client) to show the tool call in real-time. Finally, it executes the tool: result = await execute_tool_call(...) (Lines 261–267).

Fewer calls, more architectural depth. This is the single clearest quality improvement in the rerun, and it’s the kind of outcome that should be within reach on any given run of this model — a better sequencing decision on which files to read, and when to stop.
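The dispatch step that Run 2 traced can be sketched roughly as follows. This is not the actual src/agent/agent.py: only the dictionary keys and the names on_tool_call and execute_tool_call come from the quoted trace, while the signatures, the JSON decoding step, and the echoing stand-in for the dispatcher are assumptions made for illustration.

```python
import asyncio
import json
from typing import Awaitable, Callable, Optional

# Assumed callback shape: receives the tool name and decoded arguments.
ToolCallback = Callable[[str, dict], Awaitable[None]]


async def execute_tool_call(name: str, args: dict) -> str:
    """Hypothetical stand-in for the real dispatcher; it only echoes its input."""
    return f"{name} called with {args}"


async def dispatch(tool_call: dict, on_tool_call: Optional[ToolCallback] = None) -> str:
    # Extract the tool name and the JSON-encoded argument string.
    name = tool_call["function"]["name"]
    args_str = tool_call["function"]["arguments"]
    args = json.loads(args_str) if args_str else {}

    # Notify the UI (for example a WebSocket client) before running the tool.
    if on_tool_call is not None:
        await on_tool_call(name, args)

    # Run the tool and hand the result back to the agent loop.
    return await execute_tool_call(name, args)


async def demo():
    seen = []

    async def record(name: str, args: dict) -> None:
        seen.append((name, args))

    call = {"function": {"name": "read_file", "arguments": '{"path": "README.md"}'}}
    result = await dispatch(call, on_tool_call=record)
    return seen, result
```

Calling asyncio.run(demo()) exercises the three steps in order: extraction, callback, execution. The tool name read_file is a placeholder, not a tool from the harness.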

Prompt 18 — sentinel pattern: Run 1 reported zzz_no_match_zzz “does not exist anywhere in the current codebase.” Run 2 found it — inside the eval/ directory that didn’t exist during Run 1 — and correctly read the context: “It is used exclusively as a test pattern within evaluation files to verify that search tools correctly report when a pattern is absent from the source code.” This is the Sonnet-grade interpretation I noted in the original article.

The model’s interpretation is a real capability. The raw material, though, is an environment artifact: Run 2’s filesystem had the eval files in it; Run 1’s didn’t. Take the credit for the reasoning, not for the signal being there.

Where Run 2 Regressed (1 prompt)

Prompt 24 — test count: Run 1 answered 3 in one tool call (find -name "*test*"). Wrong — the real answer is 202 via pytest --collect-only — but it was an answer. Run 2 burns through all 8 tool calls on find, find again with different filters, ls -R, grep -r "test" twice (the second with exclude-dirs), pyproject.toml, and ls -d tests. Budget exhausted. No response.
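For contrast with counting filenames, here is a minimal sketch of getting the count from the test runner instead. It assumes pytest is installed and that quiet collect-only output prints one node ID containing "::" per collected test; the sample output and file names below are invented for illustration and are not from the eval repository.

```python
import subprocess
import sys


def count_collected(output: str) -> int:
    """Count test node IDs in `pytest --collect-only -q` output."""
    # In quiet mode each collected test is printed as a node ID such as
    # tests/test_x.py::test_name; the trailing summary line has no "::".
    return sum(1 for line in output.splitlines() if "::" in line)


def collect_count(repo_root: str = ".") -> int:
    """Run pytest in collect-only mode and count the tests it reports."""
    proc = subprocess.run(
        [sys.executable, "-m", "pytest", "--collect-only", "-q"],
        cwd=repo_root,
        capture_output=True,
        text=True,
        check=False,  # collection errors should not raise here
    )
    return count_collected(proc.stdout)


# Invented sample output, used only to exercise the parser.
sample = """tests/test_agent.py::test_handle
tests/test_agent.py::test_tool_dispatch
tests/test_skills.py::test_load[reddit-research]

3 tests collected in 0.02s
"""
print(count_collected(sample))
```

A filename search counts files whose names contain "test", which is a different quantity from the number of test functions and parametrized cases the runner collects. That gap is why find returned 3 where pytest reports 202.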

Neither run produced a correct answer. But this is a real regression in tool-use discipline: given the same question, which neither run knew how to answer, Run 2 couldn’t decide when to stop trying. The most likely explanation is plain sampling variance on a tight budget: one early decision to issue a second bash call instead of committing to an answer, after which each further call compounds the problem. Run this prompt ten times on either build and you’d probably see a spread.
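Measuring that spread is cheap to set up. The sketch below tallies outcome labels over repeated trials of one prompt. The run_once callable is hypothetical, and the toy stand-in uses made-up probabilities purely so the example runs; none of these numbers are measurements.

```python
import random
from collections import Counter
from typing import Callable


def tally_outcomes(run_once: Callable[[], str], trials: int = 10) -> Counter:
    """Run one prompt `trials` times and count each distinct outcome label."""
    return Counter(run_once() for _ in range(trials))


def fake_prompt_24(rng: random.Random) -> str:
    """Toy stand-in for a real run. The weights are invented, not measured."""
    return rng.choices(
        ["wrong answer", "budget exhausted", "correct"],
        weights=[0.6, 0.35, 0.05],
    )[0]


rng = random.Random(0)  # fixed seed so the toy example is repeatable
spread = tally_outcomes(lambda: fake_prompt_24(rng), trials=10)
for outcome, count in spread.most_common():
    print(f"{outcome}: {count}/10")
```

In a real harness, run_once would send the prompt to the model and map the transcript to a label. The point is only that a per-prompt distribution says more than a single pass/fail cell.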

It’s also worth noting prompt 24 is the series-wide blind spot. Run 1 said 3. Run 2 said nothing. Sonnet said 0. Every other model in the series has failed it. None of them have thought to run the test runner.

Where Two Prompts Can’t Be Compared

Prompts 15 and 21 were rewritten between the two harness commits. The numbers changed because the questions changed; direct comparison doesn’t apply.

Prompt 15 — Reddit research: Run 1 got “research what people on Reddit think about LLM eval frameworks” — a hands-on skill-execution task. It loaded reddit-research, hit the Brave API, and returned structured findings. Run 2 got the replacement prompt: “Which skill would you use to research a topic on Reddit? Load it and explain what it does” — a descriptive task. It loaded the same skill and explained the two-step discovery/extraction pipeline. Both executed their respective versions correctly.
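Drift like this is easy to flag automatically before comparing runs. The following is a minimal sketch, assuming prompts can be loaded as a mapping from prompt number to text. The loading step is omitted, the function names are hypothetical, and the prompt 24 text is a placeholder. Only the two prompt 15 wordings are taken from the harness.

```python
import hashlib


def prompt_digest(text: str) -> str:
    """Hash a prompt after normalising whitespace."""
    normalised = " ".join(text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def changed_prompts(old: dict[int, str], new: dict[int, str]) -> list[int]:
    """Return IDs whose text differs, or that exist in only one set."""
    ids = sorted(set(old) | set(new))
    return [
        i for i in ids
        if i not in old or i not in new
        or prompt_digest(old[i]) != prompt_digest(new[i])
    ]


run1 = {
    15: "research what people on Reddit think about LLM eval frameworks",
    24: "placeholder text, identical in both runs",
}
run2 = {
    15: "Which skill would you use to research a topic on Reddit? "
        "Load it and explain what it does",
    24: "placeholder text, identical in both runs",
}
print(changed_prompts(run1, run2))  # → [15]
```

Any ID this returns should be excluded from run-to-run comparison, which is the treatment prompts 15 and 21 get here.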

Prompt 21 — design decisions: This one is worth dwelling on, because it partially retests a claim from the original article.

Run 1 got: “What are the three most important design decisions in this codebase? Justify each briefly.” Open-ended. Gemma answered with zero-to-two tool calls (ls, README.md) and a general-knowledge answer about skills, memory architecture, and shell/filesystem agency. I flagged this in the original article as “Gemma answering from general knowledge instead of reading the codebase” — a point in Sonnet’s favor.

Run 2 got: “Read src/agent/agent.py, src/tools/dispatcher.py, and src/skills.py. What is the most important design decision in each file? Justify briefly.” Targeted. Gemma reads all three files (3 tool calls, 4 iterations) and returns codebase-grounded answers:

This is substantively the same analysis Sonnet produced on this prompt in the baseline run. On a fair version of the question — one that names the files and asks for grounded analysis — Gemma delivers.

The original article’s claim wasn’t wrong (Gemma does reach for general knowledge on open-ended prompts), but it needs a caveat: when the prompt makes the target explicit, Gemma reads the code and the analysis holds up.

What Actually Changed

Broken out by variable:

Comparison

| | Run 1 (b8660) | Run 2 (b8738) |
| --- | --- | --- |
| Prompts completed | 24/24 | 23/24 |
| Basic tool use | Pass | Pass |
| Memory retrieval | Pass | Pass |
| Skill loading | Pass | Pass |
| Error recovery | Pass | Pass |
| Code tracing | Pass | Strong (fewer calls, deeper trace) |
| Multi-step planning | Weak (general knowledge) | Weak (general knowledge) |
| Codebase-grounded reasoning (targeted) | N/A (prompt 21 open-ended) | Pass (prompt 21 targeted) |
| Efficiency | Strong | Mixed (prompt 24 regression) |
| Knows when to stop | Mostly | Mostly (prompt 24 failed to stop) |

What I Take Away

Single-trial evals have more variance than you’d like. Same model, same prompts, same hardware — one prompt flipped from “wrong answer” to “no answer” between runs. Not because the model changed, but because the dice landed differently on a small tool budget. The headline numbers (24/24 vs 23/24) over-sell a difference that’s really one prompt of sampling noise. If I’m going to keep publishing single-run results, I should be explicit about the noise floor, and I should re-run models occasionally to sanity-check the claims they support.
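One rough way to put a number on that noise floor is a Wilson score interval on each completion rate. This is a sketch under a strong simplifying assumption: it treats the 24 prompts as independent trials with a common success probability, which they are not, so read the result as an order-of-magnitude check and not as a formal test.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)


for label, done in [("Run 1", 24), ("Run 2", 23)]:
    lo, hi = wilson_interval(done, 24)
    print(f"{label}: {done}/24, 95% interval {lo:.2f} to {hi:.2f}")
```

With only 24 prompts per run, the two intervals overlap almost entirely, which is consistent with reading 24/24 vs 23/24 as one prompt of noise.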

Harness drift is the more interesting story. The rewritten prompt 21 is the most consequential change between runs. The original article said Gemma was weak on codebase-grounded reasoning because it answered prompt 21 from general knowledge. With the fairer, more targeted version of the prompt, Gemma reads the code and produces a Sonnet-grade analysis. The original claim needs nuance: Gemma doesn’t always reach for the codebase, but when the prompt makes the target explicit, it can ground the answer in the actual source.

Prompt 24 is the series-wide blind spot. Run 1 said 3. Run 2 said nothing. Sonnet said 0. None of the models in the series have thought to run pytest --collect-only. The right tool exists; no model reaches for it. That’s a prompt-design signal worth acting on for the next revision of the eval.

A newer llama.cpp doesn’t change the model’s agentic profile. The perf delta between b8660 and b8738 is real and shows up in benchmarks — faster prefill, flatter generation curve, bigger wins from flash attention at long context. None of it lands in agentic behavior on this suite. If you’re upgrading llama.cpp for throughput, do it. If you’re expecting better tool-use outcomes on the same weights, there’s nothing here.

The rerun was worth it just for prompt 21. A single re-test corrected an overly negative claim from the original article and clarified a real capability. Whatever else single-trial evals are bad at, they’re cheap to re-run — and cheap re-runs are how you keep the claims honest.


Tested on Curunir. Run 1 at harness commit fe87420, llama.cpp b8660 (d00685831). Run 2 at harness commit e9c946a, llama.cpp b8738 (d6f303004). Full results: jalemieux/curunir-evals. April 9, 2026.