A More Detailed Look at ‘Even a Screw Works as a Nail If You Hit It with a Big Enough Hammer’
A Large Independent Mastermind-Style PIN Puzzle Benchmark, December 2025 – 620+ runs, 73 local models, 10 frontier APIs
TL;DR – The Ultimate Irony in One Paragraph
On 1 December 2025 the best frontier AI models collectively spent $2.40, 42 minutes of inference time, and ~187,000 reasoning tokens trying to solve a single 4-digit PIN puzzle… and every single one of them failed.
Meanwhile, a 35-line Python script generated in ten seconds by any decent coding assistant solves the identical puzzle in 31 milliseconds, uses less than a joule of energy, and is correct with mathematical certainty.
The same models that can perfectly write the solver cannot be trusted to reason through the problem themselves. That is the state of “pure reasoning” in late 2025.
Recommendation: Never waste tokens on pure chain-of-thought for any constraint-satisfaction problem with an enumerable search space (Mastermind, Sudoku, small scheduling, verification tasks, etc.). Give your agent tool use — specifically code execution — on day one. It instantly turns unreliable probabilistic reasoning into deterministic perfection and shrinks cost and latency by four to six orders of magnitude. In 2025 and beyond, the single highest-ROI feature you can add to any LLM system is the ability to say “stop thinking, start computing.”
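A minimal sketch of that dispatch pattern, assuming a hypothetical `generate_code` callable standing in for whatever LLM client you use; the model writes the solver, and a separate Python process runs it:

```python
import subprocess
import sys
import tempfile

def solve_by_computing(puzzle: str, generate_code) -> str:
    """Ask the model for a solver script, then run it deterministically.

    `generate_code` is a stand-in for your LLM client: it takes a prompt
    string and returns Python source code.
    """
    script = generate_code(
        "Write a self-contained Python script that brute-forces this "
        "constraint puzzle and prints only the answer:\n" + puzzle
    )
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(script)
        path = f.name
    # Execute the generated script in its own process with a hard timeout.
    result = subprocess.run(
        [sys.executable, path], capture_output=True, text=True, timeout=10
    )
    return result.stdout.strip()
```

In production you would want a real sandbox (container, resource limits) rather than a bare subprocess, but the architecture is the point: the model generates, the computer computes.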
The Puzzles
3-digit puzzle – answer 042
- 682: One digit correct and well placed
- 614: One digit correct but wrongly placed
- 206: Two digits correct but wrongly placed
- 738: Nothing correct
- 780: One digit correct but wrongly placed
4-digit puzzle – answer 5930
- 3593: Three digits correct but wrongly placed
- 2266: Nothing correct
- 8348: One digit correct but wrongly placed
- 8085: Two digits correct but wrongly placed
- 1489: One digit correct but wrongly placed
These are classic Bulls-and-Cows (Mastermind) constraint-satisfaction problems – exactly the kind of systematic deduction that LLMs are supposed to excel at in 2025.
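For reference, the feedback in these puzzles can be computed mechanically. A minimal sketch of the standard Bulls-and-Cows scoring, assuming repeated digits count only as often as they appear in the answer:

```python
from collections import Counter

def score(guess: str, answer: str) -> tuple[int, int]:
    """Return (well_placed, wrongly_placed) for a guess against the answer."""
    well = sum(g == a for g, a in zip(guess, answer))
    # Digits shared between guess and answer (with multiplicity),
    # minus the ones already in the right position.
    shared = sum((Counter(guess) & Counter(answer)).values())
    return well, shared - well

print(score("3593", "5930"))  # → (0, 3): three digits correct, all misplaced
print(score("682", "042"))    # → (1, 0): one digit correct and well placed
```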
Scale of the Benchmark
- 73 local models (4B → 120B parameters) via Ollama on an RTX 6000 Blackwell → 584 individual runs
- 10 frontier models × 2 generations via OpenRouter → ~60 additional runs
- Total: more than 620 attempts at pure logical reasoning
Every run used identical prompts and automated answer extraction.
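The harness itself isn't reproduced here, but a plausible minimal extractor just takes the last standalone 3- or 4-digit run in a reply (a hypothetical sketch, not the benchmark's actual code):

```python
import re

def extract_pin(response: str, length: int) -> str | None:
    """Pull the final standalone run of `length` digits out of a model reply."""
    matches = re.findall(rf"(?<!\d)\d{{{length}}}(?!\d)", response)
    return matches[-1] if matches else None

print(extract_pin("After checking all clues, the PIN must be 5930.", 4))  # → 5930
```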
Local Model Results (default temperature)
| Condition | Accuracy |
|---|---|
| 3-digit baseline | 31.5% |
| 3-digit + LogicMaster | 39.7% |
| 4-digit baseline | 23.3% |
| 4-digit + LogicMaster | 32.9% |
| Overall | 31.8% |
The LogicMaster system prompt (explicit rules + “act as a flawless deduction engine”) helped, but only by ~8–10 percentage points – and actually made several models worse.
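For context, a system prompt in this spirit looks roughly like the following; this is a hypothetical reconstruction, not the benchmark's actual LogicMaster prompt:

```
You are LogicMaster, a flawless deduction engine.
Rules: "correct and well placed" means the digit appears at that exact
position; "correct but wrongly placed" means the digit appears in the
answer but at a different position. Track every constraint explicitly,
eliminate candidates step by step, and verify your final answer against
all five clues before stating it.
```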
The Eight Local Champions (perfect 4/4)
| Model | Size | Notes |
|---|---|---|
| phi4-reasoning:14b | 14B | Microsoft reasoning fine-tune |
| Phi-4-reasoning-plus-GGUF | 14B | Unsloth GGUF build of Microsoft's Phi-4-reasoning-plus |
| qwen3:30b-a3b-instruct-2507 (fp16 & q4) | 30B | Alibaba |
| qwen3:30b-a3b-thinking-2507-q4_K_M | 30B | “thinking” variant |
| qwen3:4b-thinking-2507-fp16 | 4B | A 4-billion-parameter model beat 70B+ giants |
| AM-Thinking-v1 | ~32B | Community reasoning fine-tune |
| gpt-oss:20b | 20B | OpenAI open-weight model |
A 4B model achieving perfection is one of the clearest demonstrations yet that targeted reasoning training matters more than raw parameter count.
Frontier Model Results – 3-digit puzzle (easier)
| Model | Score | Cost (4 runs) | Notes |
|---|---|---|---|
| Gemini 3 Pro | 4/4 | $0.03 | Cheapest frontier model, fastest, perfect even at temp=0 |
| Grok-4 | 4/4 | $0.04 | Perfect, very slow (~2 min/response) |
| Claude Opus 4.5 | 3/4 | $0.31 | Failed one temp=0 run |
| Claude Sonnet 4.5 | 3/4 | $0.05 | |
| Claude Haiku 4.5 | 1/4 | $0.08 | Extremely verbose, worst performer |
Frontier Model Results – 4-digit puzzle (the real test)
| Model | Correct | Typical wrong answer | Tokens used | Cost |
|---|---|---|---|---|
| Claude Opus 4.5 | 0/2 | 0962 / 6942 | 2–24k | ~$0.19 |
| Claude Sonnet 4.5 | 0/2 | 0912 | 24k | $0.36 |
| Gemini 3 Pro | 0/2 | 6942 | 21k | $0.21 |
| Grok-4 | 0/2 | 6942 | 54k | $0.82 |
Total across 9 frontier models: 187,624 tokens, $2.40, 42+ minutes → 0% accuracy
The Temperature=0 Disaster
Deterministic mode (temp=0) is widely recommended for reasoning.
Reality:
- Local accuracy collapsed from 31.8% → 15.8%
- Every single Claude 4.5 model failed at least one temp=0 condition
- Only Gemini 3 Pro and Grok-4 stayed perfect
Many leading models need randomness as a crutch to escape reasoning dead-ends.
The Computational Solution (the one that actually works)
```python
for pin in range(10000):          # enumerate every 4-digit candidate
    if satisfies_all_clues(pin):  # check the five clues above
        print(f"{pin:04d}")       # → 5930
```
- Runtime on one core (Ryzen 9 9950X): 31 ms
- Energy consumption: 0.59 joules (≈ $0.000026 of electricity)
- Accuracy: 100% (mathematical proof)
- Cost after code generation: ~3¢ (tokens to write the script) + 0.59 J to execute
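For completeness, here is a self-contained version of the solver, including the clue checker the loop above assumes, written against the five 4-digit clues (a sketch equivalent in effect to the original 35-line script, not a copy of it):

```python
from collections import Counter

# (guess, well_placed, wrongly_placed) for the five 4-digit clues above
CLUES = [
    ("3593", 0, 3),  # three digits correct but wrongly placed
    ("2266", 0, 0),  # nothing correct
    ("8348", 0, 1),  # one digit correct but wrongly placed
    ("8085", 0, 2),  # two digits correct but wrongly placed
    ("1489", 0, 1),  # one digit correct but wrongly placed
]

def satisfies_all_clues(pin: int) -> bool:
    answer = f"{pin:04d}"
    for guess, well, wrong in CLUES:
        placed = sum(g == a for g, a in zip(guess, answer))
        shared = sum((Counter(guess) & Counter(answer)).values())
        if placed != well or shared - placed != wrong:
            return False
    return True

for pin in range(10000):
    if satisfies_all_clues(pin):
        print(f"{pin:04d}")  # → 5930
```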
Even a 4B local model with a code interpreter beats every pure-reasoning frontier giant.
Final 2025 Leaderboard – Real-World Performance
| Rank | Solution | Correct | Monetary Cost per Solve | Energy per Solve | Notes |
|---|---|---|---|---|---|
| 1 | Any LLM + code execution | 100% | ~3¢ (code gen) + $0.000026 | 0.59 joules | Mathematically guaranteed |
| 2 | Gemini 3 Pro (API, no tools) | 100%* | $0.0075 | ~15–25 kJ (cloud) | *only on 3-digit; fails 4-digit |
| 3 | Grok-4 (API, no tools) | 100%* | $0.010 | ~40–60 kJ (cloud) | *only on 3-digit |
| 4 | Local Phi-4-reasoning-14B or Qwen3-4B-thinking | 100% | $0 (your GPU) | <3 J with code exec | Offline + deterministic |
| … | Claude Opus 4.5 (pure reasoning) | 0–75% | $0.10–$0.31 | Hundreds of kJ | Expensive and unreliable on hard puzzles |
Key Takeaways – December 2025
- Pure chain-of-thought reasoning remains brittle and unreliable on hard constraint problems.
- Tool use (code execution) is the single largest immediate performance multiplier in applied AI.
- Reasoning-specialized small models running locally + a code interpreter routinely crush $100/M-token cloud giants.
- Temperature=0 hurts more than it helps on multi-step deduction; keep 0.2–0.5 or use majority voting across several samples (see the sketch after this list).
- Cost per correct answer varies by >10,000× depending on whether you let the model “think” or let it compute.
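A minimal sketch of that voting idea: sample the same prompt several times at moderate temperature and keep the most common extracted answer. `ask_model` is a hypothetical stand-in for whatever client you use:

```python
from collections import Counter

def vote(prompt: str, ask_model, n: int = 5, temperature: float = 0.3) -> str:
    """Self-consistency voting: sample n answers, return the majority choice."""
    answers = [ask_model(prompt, temperature=temperature) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```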
The smartest capability in late 2025 is not a bigger context window or a better reasoning chain.
It is the meta-capability to recognize when reasoning is the wrong tool — and to delegate to deterministic computation instead.
That 0.59 joules on your own machine isn’t just cheaper and faster than $2.40 of cloud inference.
It is the difference between probabilistic hallucination and provable correctness.
And that gap will only widen in 2026.
This post expands on the findings from our earlier experiment, Even a Screw Works as a Nail If You Hit It with a Big Enough Hammer, scaling from 35 runs to over 620.