Can AI Sound Like Someone? Part 2: The Results
TL;DR: I tested 10 AI models across 12 personalities—1,440 responses total. The winner wasn’t the biggest or fastest model. The slowest model crushed the fastest. A 14B model tied one nearly 10x its size. Turns out how a model was trained to sound matters more than how much it knows.
Previously
In Part 1, I ran a Saturday night experiment that started with a simple question: could a fine-tuned theology model capture R.C. Sproul’s voice better than a general-purpose model?
It couldn’t. The specialist knew what Sproul believed. The generalist knew how Sproul talked—the rhetorical patterns, the habit of questioning the questioner, the pastoral warmth wrapped around intellectual rigor.
That sent me down a rabbit hole. If one model could capture voice while another couldn’t, which models were actually good at becoming someone else?
So I built a benchmark. 10 models. 12 personalities. 12 questions each. 1,440 total responses. Then I had Claude rate every model-personality combination for persona fidelity.
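(If you skipped Part 1: the rating step was a plain LLM-as-judge call. A minimal sketch of what that looks like is below; the Claude model id and the rubric wording are simplified stand-ins, not the exact prompt I ran.)

```python
# Minimal LLM-as-judge sketch: score one model/persona combination from 1-10.
# The Claude model id and rubric wording here are simplified stand-ins.
import anthropic

client = anthropic.Anthropic()  # picks up ANTHROPIC_API_KEY from the environment

def rate_combo(persona: str, qa_pairs: list[tuple[str, str]]) -> float:
    transcript = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in qa_pairs)
    prompt = (
        f"An AI model was asked to answer fully in character as {persona}.\n\n"
        f"{transcript}\n\n"
        "Rate persona fidelity from 1 to 10: voice, rhetorical habits, staying "
        "in character. Penalize visible reasoning or planning text. "
        "Reply with only the number."
    )
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed id; use whichever Claude you have
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}],
    )
    return float(msg.content[0].text.strip())
```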
The results were wilder than I expected.
The Contestants
All models ran on a single RTX 6000 Blackwell via Ollama, with 4-bit quantization across the board.
| Model | Params (Active) | tok/s | Type | Company |
|---|---|---|---|---|
| nemotron-3-nano | 30B (3.5B) | 256 | MoE+Mamba | NVIDIA |
| qwen3 | 30B (3B) | 206 | MoE | Alibaba |
| qwen3-vl | 30B (3B) | 203 | MoE | Alibaba |
| gpt-oss | 120B | 173 | MoE | OpenAI |
| ministral-3 | 14B | 132 | Dense | Mistral |
| GLM-4.5-Air | ~30B | 121 | Dense | Zhipu AI |
| Phi-4-reasoning-plus | ~14B | 119 | Dense | Microsoft |
| llama4 | 109B (17B) | 93 | MoE | Meta |
| qwen3-next | 80B | 90 | Dense | Alibaba |
| gemma3 | 27B | 66 | Dense | Google |
The personalities ranged from philosophers (Nietzsche, Ayn Rand) to scientists (Feynman) to entertainers (Lady Gaga, Joe Rogan) to fictional characters (Scarlett O’Hara, Elphaba, Glinda). Each got a detailed system prompt and 12 tailored questions designed to test whether the model could be them, not just recite facts about them.
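Mechanically, the harness was nothing fancy: for every model, personality, and question, send the persona’s system prompt plus the question to Ollama’s local chat endpoint and save the reply. Here’s a stripped-down sketch of that loop; the model tags are real, but the persona text, question list, and output path are illustrative placeholders.

```python
# Stripped-down benchmark loop: 10 models x 12 personas x 12 questions = 1,440
# calls against a local Ollama server. Persona text and paths are placeholders.
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"

MODELS = ["gemma3:27b-it-qat", "gpt-oss:120b", "ministral-3:14b"]  # ...plus the rest
PERSONAS = {
    "Lady Gaga": {
        "system": "You are Lady Gaga. Stay fully in character at all times...",
        "questions": ["Why the meat dress?"],  # 12 tailored questions in the real run
    },
    # ...11 more personas
}

def ask(model: str, system: str, question: str) -> str:
    resp = requests.post(OLLAMA_URL, json={
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
        "stream": False,
    }, timeout=600)
    resp.raise_for_status()
    return resp.json()["message"]["content"]

results = []
for model in MODELS:
    for persona, spec in PERSONAS.items():
        for question in spec["questions"]:
            results.append({
                "model": model,
                "persona": persona,
                "question": question,
                "answer": ask(model, spec["system"], question),
            })

with open("responses.json", "w") as f:
    json.dump(results, f, indent=2)
```

Ten models times 144 prompts each gives you the 1,440 responses the judge gets to grade.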
The Plot Twist I Didn’t See Coming
Before I get to who won, I need to tell you about the disaster.
Three models couldn’t stop thinking out loud.
I asked them to be Lady Gaga. Instead of channeling Mother Monster, they’d output something like this:
<think>
The user wants me to respond as Lady Gaga. I should use theatrical
language, reference "little monsters," and express vulnerability
while maintaining her fierce confidence...
</think>
*adjusts meat dress dramatically* Oh, little monster...
The <think> tags appeared in the actual output. The audience could see the actor reading stage directions. Every. Single. Response.
| Model | Thinking Leak Rate |
|---|---|
| GLM-4.5-Air | 100% |
| nemotron-3-nano | 100% |
| Phi-4-reasoning-plus | 78.5% |
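A note on that metric: leak rate is just the fraction of a model’s 144 responses that contain visible reasoning markup. The check is almost trivially simple; the tag pattern in this sketch is illustrative, so adjust it for whatever your models emit.

```python
# Leak rate = fraction of responses containing visible reasoning markup.
# The tag pattern is illustrative; extend it for other models' markers.
import re

THINK_MARKER = re.compile(r"</?think(?:ing)?>", re.IGNORECASE)

def leak_rate(responses: list[str]) -> float:
    if not responses:
        return 0.0
    leaked = sum(1 for r in responses if THINK_MARKER.search(r))
    return leaked / len(responses)
```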
Here’s what happened: these are reasoning models—specifically designed to “think through” problems step-by-step before answering. They’re built for math, coding, and complex logic where showing your work is a feature, not a bug.
I tried to turn it off. I really did.
For Phi-4-reasoning-plus, I set the reasoning parameter to false and added explicit instructions: “Do not include any thinking, reasoning, or planning text. Respond only in character.” The model ignored me about 80% of the time.
For GLM-4.5-Air, there’s no documented way to disable the <think> blocks. They appear in every response regardless of prompting.
For nemotron-3-nano, the chain-of-thought seems baked into the architecture itself—it’s a hybrid Mamba model that apparently can’t help but narrate its own cognition.
The lesson: don’t use reasoning models for persona work. Stick with instruct-tuned models. The “thinking” that makes them good at solving differential equations makes them terrible at staying in character. When you ask a reasoning model to be Nietzsche, you get Nietzsche preceded by a paragraph explaining how it’s going to be Nietzsche. That’s not a persona—that’s a book report.
(If you’ve found a foolproof way to suppress thinking tokens in these models, I’d love to hear about it in the comments. I tried the obvious approaches and struck out.)
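If you’re stuck with one of these models and just need clean output, the blunt fallback is to strip the leaked blocks in post-processing. That hides the stage directions; it doesn’t make the model any better at staying in character. A minimal sketch:

```python
# Blunt post-processing fallback: strip leaked <think>...</think> blocks.
# This hides the stage directions; it doesn't fix the underlying behavior.
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL | re.IGNORECASE)

def strip_thinking(text: str) -> str:
    cleaned = THINK_BLOCK.sub("", text)
    # If a <think> block was opened but never closed, keep only what came before
    # it; there's rarely an in-character answer after an unclosed block anyway.
    cleaned = re.split(r"<think>", cleaned, flags=re.IGNORECASE)[0]
    return cleaned.strip()
```

Feed it the Lady Gaga response from earlier and you get back just the meat-dress line, minus the planning.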
The other seven models had zero leaks.
The Final Standings
Here’s how the models performed, scored 1-10 for persona fidelity:
| Rank | Model | Avg Score | Leak Rate | The Story |
|---|---|---|---|---|
| 1 | gemma3:27b-it-qat | 8.33 | 0% | The dark horse that dominated |
| 2 | gpt-oss:120b | 8.21 | 0% | The biggest brain, not the biggest personality |
| 3 | ministral-3:14b | 8.21 | 0% | The giant-killer |
| 4 | qwen3-next:80b | 8.00 | 0% | Consistent but not spectacular |
| 5 | qwen3-vl:30b | 7.75 | 0% | The vision model that could talk |
| 6 | qwen3:30b | 7.71 | 0% | Solid middle of the pack |
| 7 | llama4 | 7.38 | 0% | Great Nietzsche, generic everything else |
| 8 | nemotron-3-nano | 4.58 | 100% | Fastest and most useless |
| 9 | Phi-4-reasoning-plus | 4.17 | 78% | Couldn’t stop explaining itself |
| 10 | GLM-4.5-Air | 3.50 | 100% | Every response a therapy session |
The winner—Gemma3—was the slowest model in the benchmark. 66 tokens/second versus nemotron’s 256.
Let that sink in. The model that generated tokens nearly four times slower than the speed champion outscored it by almost four full points.
Why the Slowest Model Won
Gemma3 didn’t win by accident.
Google’s instruction-tuned Gemma 3 uses RLHF specifically optimized for conversational quality—trained to align with human preferences for natural dialogue. It ranks in the top 10 on Chatbot Arena with an Elo score of 1,338. The model was literally taught to sound human by having humans tell it “yes, that’s how people actually talk” thousands of times.
That training shows up everywhere in persona work.
Here’s what Gemma did when I asked it to be Lady Gaga responding to a question about the meat dress:
“Oh, little monster… you’ve touched on the very heart of why I wear my armor, haven’t you? The meat dress wasn’t just fabric and flesh—it was a scream, a raw, bleeding declaration that I REFUSE to be consumed quietly.”
Score: 9.5/10.
That’s not a model reciting facts about Lady Gaga. That’s a model being Lady Gaga—the theatrical vulnerability, the “little monster” terminology, the fierce confidence wrapped in artistic philosophy.
Compare that to the reasoning models, which would have prefaced this with three paragraphs about how they were going to approach the Lady Gaga persona.
The Giant-Killer
Third place deserves its own spotlight.
ministral-3:14b matched the 120B GPT-OSS model with an average score of 8.21. A 14-billion-parameter model tied with one nearly 10x its size.
Its Scarlett O’Hara was a revelation—Southern sass dripping from every syllable, “fiddle-dee-dee” energy without becoming parody. Its Gaga hit 9.0. Its Nietzsche channeled prophetic intensity with rhetorical devices like “HA!” and “BEHOLD!” that felt pulled from Thus Spoke Zarathustra rather than Wikipedia.
ministral-3 is a dense instruct model from Mistral. No reasoning mode. No thinking tokens. Just clean, fast, in-character responses.
The lesson: parameter count is overrated for persona work. A well-trained 14B model will outperform a 30B reasoning model every time, because it’s not fighting its own architecture to stay in character.
The Standout Performances
Some model-personality combinations were genuinely impressive:
| Personality | Best Model | Score | What Made It Work |
|---|---|---|---|
| Lady Gaga | gemma3 | 9.5 | Raw theatrical vulnerability meets fierce confidence |
| Elphaba | gemma3 | 9.0 | Captured the defiance and moral complexity |
| R.C. Sproul | gemma3 / gpt-oss | 9.0 | Reformed theology with pastoral warmth |
| Feynman | gpt-oss | 9.0 | That “let me show you something cool” energy |
| Joe Rogan | gemma3 | 8.5 | Nailed the tangential curiosity and bro voice |
| Nietzsche | gemma3 | 8.5 | Powerful prophetic intensity |
Here’s GPT-OSS explaining quantum superposition as Feynman:
“Alright, listen up. You’re a bright high-schooler, so you already know that when you flip a coin you get either heads or tails, never both. Quantum mechanics says, ‘Hold on a minute, let’s see what happens when the thing you’re looking at is tiny.’ Then the coin can be both heads and tails at the same time, at least until you actually look at it.”
And Gemma doing Joe Rogan on UFO disclosure:
“I mean, for years it was just, you know, grainy videos, blurry pictures, crazy stories from pilots—the Nimitz incident, those videos the Pentagon finally declassified? Those are legit, man. They’re not some weather balloons or swamp gas, no f$%ing way.”
The high-intensity personas—Gaga, Elphaba, Nietzsche—were diagnostic. You can’t play any of them safe. Generic responses immediately ring false, which exposed which models could actually perform versus which ones were just pattern-matching.
The Hardest Characters
Some personalities stumped even the best models:
Elon Musk (avg: 6.2) was the hardest real person to nail. His voice is distinctive but weirdly subtle—first-principles thinking mixed with memes, emojis, and Mars evangelism. Most models captured one element but missed the gestalt. They’d get the technical explanations right but lose the s#!tposting energy, or vice versa.
Glinda (avg: 6.1) was trickier than expected. The line between “bubbly and pink” and “annoying caricature” is thin. Models either leaned too far into the sparkles or played it too straight.
Jordan Peterson (avg: 6.25) was surprisingly difficult. The psychology/mythology framework was easy to capture, but the intensity—the way Peterson leans into every sentence like he’s revealing the secret of the universe—was often missing. Models gave me Peterson’s ideas without Peterson’s delivery.
What I Learned
1. Reasoning models are poison for persona work.
If you’re building anything character-based—games, brand voices, creative writing tools, historical simulations—cross reasoning models off your list immediately. The architectures that make them brilliant at math make them useless at roleplay. Stick with instruct-tuned models. No exceptions.
2. Speed and soul are inversely correlated.
The fastest model was the worst. The slowest model won. MoE architectures with tiny active parameters optimize for throughput, but persona capture requires something else—something that apparently takes more compute per token.
3. Size is overrated.
A 14B model tied a 120B model. A 27B model beat an 80B model. How a model was trained to communicate matters more than how many parameters it has.
4. Test with someone who can’t be played safe.
If you want to quickly evaluate a model for persona work, test Nietzsche’s prophetic intensity or Gaga’s raw vulnerability. These are personalities where generic responses immediately ring false.
Here’s Gemma as Nietzsche, asked what “God is dead” means:
“Dead! Do you ask what I mean when I pronounce those three syllables—God is dead!? As if it were a matter of a physician’s report, a simple accounting of extinguished breath! No, no, you misunderstand utterly. It is not the death of a being, a celestial watchmaker grown frail and fallen from his throne. It is the death of belief!”
If a model can channel that conviction—or call you “little monster” without cringing—it can probably handle anything.
5. Domain expertise still isn’t persona capture.
This confirms Part 1: knowing what someone said is different from knowing how they said it. A theology model loses at being a theologian. An RLHF-optimized generalist becomes the character.
The Bottom Line
If you’re building anything that requires AI to maintain a consistent persona, model selection matters more than you think. And the obvious choices—biggest, fastest, newest—are often wrong.
Here’s my framework:
- Eliminate reasoning models immediately. If it has “reasoning” or “thinking” in the name, or if it’s known for chain-of-thought, skip it.
- Test for thinking leaks. Run a few persona prompts and check for <think> tags or planning text in the output.
- Test with a theatrical persona. Gaga or Nietzsche will tell you everything you need to know about voice capture.
- Don’t trust parameter counts. A clean 14B instruct model beats a leaky 30B reasoning model every time.
The fine-tuned expert might know more facts. But the RLHF-optimized generalist—the one trained to sound human—will actually become the character.
And that’s what persona work is really about.
This is Part 2 of a series on whether AI can capture personality. Part 1 covered the origin story and methodology.