A More Detailed Look at ‘Even a Screw Works as a Nail If You Hit It with a Big Enough Hammer’
A Large Independent Mastermind-Style PIN Puzzle Benchmark, December 2025 – 620+ runs, 73 local models, 10 frontier APIs
TL;DR – The Ultimate Irony in One Paragraph
On 1 December 2025 the best frontier AI models collectively spent $2.40, 42 minutes of inference time, and ~187,000 reasoning tokens trying to solve a single 4-digit PIN puzzle… and every single one of them failed.
Meanwhile, a 35-line Python script generated in ten seconds by any decent coding assistant solves the identical puzzle in 31 milliseconds, uses less than a joule of energy, and is correct with mathematical certainty.
The same models that can perfectly write the solver cannot be trusted to reason through the problem themselves. That is the state of “pure reasoning” in late 2025.
Recommendation: Never waste tokens on pure chain-of-thought for any constraint-satisfaction problem with an enumerable search space (Mastermind, Sudoku, small scheduling, verification tasks, etc.). Give your agent tool use — specifically code execution — on day one. It instantly turns unreliable probabilistic reasoning into deterministic perfection and shrinks cost and latency by four to six orders of magnitude. In 2025 and beyond, the single highest-ROI feature you can add to any LLM system is the ability to say “stop thinking, start computing.”
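A minimal sketch of that dispatch pattern, assuming a hypothetical `generate_code` callable standing in for whatever LLM client you use; the model writes the solver, and a separate Python process runs it:

```python
import subprocess
import sys
import tempfile

def solve_by_computing(puzzle: str, generate_code) -> str:
    """Ask the model for a solver script, then run it deterministically.

    `generate_code` is a stand-in for your LLM client: it takes a prompt
    string and returns Python source code.
    """
    script = generate_code(
        "Write a self-contained Python script that brute-forces this "
        "constraint puzzle and prints only the answer:\n" + puzzle
    )
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(script)
        path = f.name
    # Execute the generated script in its own process with a hard timeout.
    result = subprocess.run(
        [sys.executable, path], capture_output=True, text=True, timeout=10
    )
    return result.stdout.strip()
```

In production you would want a real sandbox (container, resource limits) rather than a bare subprocess, but the architecture is the point: the model generates, the computer computes.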
The Puzzles
3-digit puzzle – answer 042
- 682: One digit correct and well placed
- 614: One digit correct but wrongly placed
- 206: Two digits correct but wrongly placed
- 738: Nothing correct
- 780: One digit correct but wrongly placed
4-digit puzzle – answer 5930
- 3593: Three digits correct but wrongly placed
- 2266: Nothing correct
- 8348: One digit correct but wrongly placed
- 8085: Two digits correct but wrongly placed
- 1489: One digit correct but wrongly placed
These are classic Bulls-and-Cows (Mastermind) constraint-satisfaction problems – exactly the kind of systematic deduction that LLMs are supposed to excel at in 2025.
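For reference, the feedback in these puzzles can be computed mechanically. A minimal sketch of the standard Bulls-and-Cows scoring, assuming repeated digits count only as often as they appear in the answer:

```python
from collections import Counter

def score(guess: str, answer: str) -> tuple[int, int]:
    """Return (well_placed, wrongly_placed) for a guess against the answer."""
    well = sum(g == a for g, a in zip(guess, answer))
    # Digits shared between guess and answer (with multiplicity),
    # minus the ones already in the right position.
    shared = sum((Counter(guess) & Counter(answer)).values())
    return well, shared - well

print(score("3593", "5930"))  # → (0, 3): three digits correct, all misplaced
print(score("682", "042"))    # → (1, 0): one digit correct and well placed
```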
Scale of the Benchmark
- 73 local models (4B → 120B parameters) via Ollama on an RTX 6000 Blackwell → 584 individual runs
- 10 frontier models × 2 generations via OpenRouter → ~60 additional runs
- Total: more than 620 attempts at pure logical reasoning
Every run used identical prompts and automated answer extraction.
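The harness itself isn't reproduced here, but a plausible minimal extractor just takes the last standalone 3- or 4-digit run in a reply (a hypothetical sketch, not the benchmark's actual code):

```python
import re

def extract_pin(response: str, length: int) -> str | None:
    """Pull the final standalone run of `length` digits out of a model reply."""
    matches = re.findall(rf"(?<!\d)\d{{{length}}}(?!\d)", response)
    return matches[-1] if matches else None

print(extract_pin("After checking all clues, the PIN must be 5930.", 4))  # → 5930
```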
Local Model Results (default temperature)
| Condition | Accuracy |
|---|---|
| 3-digit baseline | 31.5% |
| 3-digit + LogicMaster | 39.7% |
| 4-digit baseline | 23.3% |
| 4-digit + LogicMaster | 32.9% |
| Overall | 31.8% |
The LogicMaster system prompt (explicit rules + “act as a flawless deduction engine”) helped, but only by ~8–10 percentage points – and actually made several models worse.
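For context, a system prompt in this spirit looks roughly like the following; this is a hypothetical reconstruction, not the benchmark's actual LogicMaster prompt:

```
You are LogicMaster, a flawless deduction engine.
Rules: "correct and well placed" means the digit appears at that exact
position; "correct but wrongly placed" means the digit appears in the
answer but at a different position. Track every constraint explicitly,
eliminate candidates step by step, and verify your final answer against
all five clues before stating it.
```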
The Eight Local Champions (perfect 4/4)
| Model | Size | Notes |
|---|---|---|
| phi4-reasoning:14b | 14B | Microsoft reasoning fine-tune |
| Phi-4-reasoning-plus-GGUF | 14B | Unsloth GGUF build of Microsoft's Phi-4-reasoning-plus |
| qwen3:30b-a3b-instruct-2507 (fp16 & q4) | 30B | Alibaba |
| qwen3:30b-a3b-thinking-2507-q4_K_M | 30B | “thinking” variant |
| qwen3:4b-thinking-2507-fp16 | 4B | A 4-billion-parameter model beat 70B+ giants |
| AM-Thinking-v1 | ~32B | Community reasoning fine-tune |
| gpt-oss:20b | 20B | OpenAI open-weight model |
A 4B model achieving perfection is one of the clearest demonstrations yet that targeted reasoning training matters more than raw parameter count.
Frontier Model Results – 3-digit puzzle (easier)
| Model | Score | Cost (4 runs) | Notes |
|---|---|---|---|
| Gemini 3 Pro | 4/4 | $0.03 | Cheapest frontier model, fastest, perfect even at temp=0 |
| Grok-4 | 4/4 | $0.04 | Perfect, very slow (~2 min/response) |
| Claude Opus 4.5 | 3/4 | $0.31 | Failed one temp=0 run |
| Claude Sonnet 4.5 | 3/4 | $0.05 | |
| Claude Haiku 4.5 | 1/4 | $0.08 | Extremely verbose, worst performer |
Frontier Model Results – 4-digit puzzle (the real test)
| Model | Correct | Typical wrong answer | Tokens used | Cost |
|---|---|---|---|---|
| Claude Opus 4.5 | 0/2 | 0962 / 6942 | 2–24k | ~$0.19 |
| Claude Sonnet 4.5 | 0/2 | 0912 | 24k | $0.36 |
| Gemini 3 Pro | 0/2 | 6942 | 21k | $0.21 |
| Grok-4 | 0/2 | 6942 | 54k | $0.82 |
Total across 9 frontier models: 187,624 tokens, $2.40, 42+ minutes → 0% accuracy
The Temperature=0 Disaster
Deterministic mode (temp=0) is widely recommended for reasoning.
Reality:
- Local accuracy collapsed from 31.8% → 15.8%
- Every single Claude 4.5 model failed at least one temp=0 condition
- Only Gemini 3 Pro and Grok-4 stayed perfect
Many leading models need randomness as a crutch to escape reasoning dead-ends.
The Computational Solution (the one that actually works)
```python
for pin in range(10000):          # enumerate every 4-digit candidate
    if satisfies_all_clues(pin):  # check the five clues above
        print(f"{pin:04d}")       # → 5930
```
- Runtime on one core (Ryzen 9 9950X): 31 ms
- Energy consumption: 0.59 joules (≈ $0.000026 of electricity)
- Accuracy: 100% (mathematical proof)
- Cost after code generation: ~3¢ (tokens to write the script) + 0.59 J to execute
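For completeness, here is a self-contained version of the solver, including the clue checker the loop above assumes, written against the five 4-digit clues (a sketch equivalent in effect to the original 35-line script, not a copy of it):

```python
from collections import Counter

# (guess, well_placed, wrongly_placed) for the five 4-digit clues above
CLUES = [
    ("3593", 0, 3),  # three digits correct but wrongly placed
    ("2266", 0, 0),  # nothing correct
    ("8348", 0, 1),  # one digit correct but wrongly placed
    ("8085", 0, 2),  # two digits correct but wrongly placed
    ("1489", 0, 1),  # one digit correct but wrongly placed
]

def satisfies_all_clues(pin: int) -> bool:
    answer = f"{pin:04d}"
    for guess, well, wrong in CLUES:
        placed = sum(g == a for g, a in zip(guess, answer))
        shared = sum((Counter(guess) & Counter(answer)).values())
        if placed != well or shared - placed != wrong:
            return False
    return True

for pin in range(10000):
    if satisfies_all_clues(pin):
        print(f"{pin:04d}")  # → 5930
```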
Even a 4B local model with a code interpreter beats every pure-reasoning frontier giant.
Final 2025 Leaderboard – Real-World Performance
| Rank | Solution | Correct | Monetary Cost per Solve | Energy per Solve | Notes |
|---|---|---|---|---|---|
| 1 | Any LLM + code execution | 100% | ~3¢ (code gen) + $0.000026 | 0.59 joules | Mathematically guaranteed |
| 2 | Gemini 3 Pro (API, no tools) | 100%* | $0.0075 | ~15–25 kJ (cloud) | *only on 3-digit; fails 4-digit |
| 3 | Grok-4 (API, no tools) | 100%* | $0.010 | ~40–60 kJ (cloud) | *only on 3-digit |
| 4 | Local Phi-4-reasoning-14B or Qwen3-4B-thinking | 100% | $0 (your GPU) | <3 J with code exec | Offline + deterministic |
| … | Claude Opus 4.5 (pure reasoning) | 0–75% | $0.10–$0.31 | Hundreds of kJ | Expensive and unreliable on hard puzzles |
Key Takeaways – December 2025
- Pure chain-of-thought reasoning remains brittle and unreliable on hard constraint problems.
- Tool use (code execution) is the single largest immediate performance multiplier in applied AI.
- Reasoning-specialized small models running locally + a code interpreter routinely crush $100/M-token cloud giants.
- Temperature=0 hurts more than it helps on multi-step deduction; keep 0.2–0.5 or use majority voting across several samples (see the sketch after this list).
- Cost per correct answer varies by >10,000× depending on whether you let the model “think” or let it compute.
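A minimal sketch of that voting idea: sample the same prompt several times at moderate temperature and keep the most common extracted answer. `ask_model` is a hypothetical stand-in for whatever client you use:

```python
from collections import Counter

def vote(prompt: str, ask_model, n: int = 5, temperature: float = 0.3) -> str:
    """Self-consistency voting: sample n answers, return the majority choice."""
    answers = [ask_model(prompt, temperature=temperature) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```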
The smartest capability in late 2025 is not a bigger context window or a better reasoning chain.
It is the meta-capability to recognize when reasoning is the wrong tool — and to delegate to deterministic computation instead.
That 0.59 joules on your own machine isn’t just cheaper and faster than $2.40 of cloud inference.
It is the difference between probabilistic hallucination and provable correctness.
And that gap will only widen in 2026.
This post expands on the findings from our earlier experiment, Even a Screw Works as a Nail If You Hit It with a Big Enough Hammer, scaling from 35 runs to over 620.