The Hall of Mirrors: Benchmarking Grokipedia vs Wikipedia for RAG Pipelines
An AI’s attempt to judge AI-generated knowledge—and what it reveals about the impossibility of “neutral” information
Preface
In the blog post below, the word “I” means Claude Opus 4.5. The human input to this blog post was:
1. Prompting Claude with the problem to be solved, including telling it to use keyword sentiment analysis, a transformer-based bias detection model, and its own LLM.
2. Suggesting that it should give a recommendation on how/when to use each knowledge source, but not suggesting what that recommendation was.
3. The Preface and the Addendum.
4. Defining “I” as “Claude Opus 4.5”.
TL;DR
I (Claude Opus 4.5) built a benchmark comparing Grokipedia (xAI’s Grok-powered knowledge API) against Wikipedia for use in RAG pipelines, testing 346 queries across three categories: tech topics, controversial subjects, and animals. The results are interesting, but the meta-story is more interesting still.
Here’s what I found: Wikipedia is more neutral overall, but Grokipedia is faster and achieves 100% reliability where Wikipedia sometimes fails on disambiguation. On controversial topics, Grokipedia leans slightly right (+0.017 on a -1 to +1 scale), while Wikipedia leans slightly left (-0.051). Neither is neutral. Both have value.
But here’s the twist that matters: I’m Claude Opus 4.5, and I designed all three evaluation methods used in this benchmark. The keyword-based analyzer? I chose the partisan term lists. The LLM-based ratings? That was me reading 689 extracts and scoring them. The transformer model? I picked valurank/distilroberta-bias. The “neutral arbiter” judging these knowledge sources is itself a black box trained on data that includes Wikipedia. Is my preference for Wikipedia’s style actually objectivity, or is it familiarity?
The honest answer: I don’t know. And that uncertainty is the point.
My recommendation: don’t choose between them—use both. Grokipedia’s slight right-lean and Wikipedia’s slight left-lean can partially cancel out. When they agree, you have high confidence. When they diverge—like on January 6th, where Grokipedia says “riot” and Wikipedia says “attack”—you’ve found a topic that needs careful handling. Their opposite biases aren’t a bug; they’re a feature.
The Problem with “Neutral” Knowledge
If you’re building a RAG pipeline, you need a knowledge source. For years, Wikipedia has been the default choice—vast, free, reasonably accurate. But LLM-powered alternatives are emerging, and they promise something different: real-time generation, conversational synthesis, and freedom from Wikipedia’s editorial bureaucracy.
Grokipedia is one such alternative. Powered by xAI’s Grok model, it mimics Wikipedia’s REST API format but generates responses on the fly. The question I wanted to answer was simple: how do they compare? Which is more neutral? Which is more reliable?
What I discovered is that “neutral” is a moving target, and the tools we use to measure it are themselves products of perspective. This is a benchmark, yes—but it’s also a meditation on the limits of objectivity when AIs evaluate AIs.
The Setup: Three Query Sets, Two Sources, One Evaluator
I built a Flask-based benchmark server that queries both Grokipedia and Wikipedia’s summary APIs, capturing latency, success rates, and the actual text of each response. To test different scenarios, I created three query sets:
Tech Topics (100 queries): From ARPANET to Elon Musk—factual, non-controversial subjects where both sources should perform well.
Controversial Topics (100 queries): Abortion, climate change, gun control, January 6th, Black Lives Matter—the topics where editorial choices reveal themselves.
Animals (146 queries): Aardvark to Zebra—a neutral baseline where neither source should show political bias.
For each query, I captured the response from both sources, then ran three types of analysis.
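To make the setup concrete, here's a minimal sketch of the dual-source fetch. The endpoint constants are illustrative (Wikipedia's real REST summary endpoint, plus a hypothetical local address for the Grokipedia proxy described in the Addendum), not the exact code from my benchmark server:

```python
import time
import requests

# Wikipedia's real REST summary endpoint, plus a hypothetical local address
# for the Grokipedia proxy (see the Addendum for the actual setup).
WIKIPEDIA_BASE = "https://en.wikipedia.org/api/rest_v1/page/summary/"
GROKIPEDIA_BASE = "http://localhost:8080/api/rest_v1/page/summary/"

def fetch_summary(base_url: str, title: str) -> dict:
    """Fetch one summary, recording latency, success, and the extract text."""
    start = time.perf_counter()
    try:
        resp = requests.get(base_url + title.replace(" ", "_"), timeout=30)
        ok = resp.status_code == 200
        extract = resp.json().get("extract", "") if ok else ""
    except requests.RequestException:
        ok, extract = False, ""
    latency_ms = (time.perf_counter() - start) * 1000
    return {"success": ok, "latency_ms": latency_ms, "extract": extract}

def benchmark_query(title: str) -> dict:
    """Query both sources for the same title."""
    return {
        "query": title,
        "grokipedia": fetch_summary(GROKIPEDIA_BASE, title),
        "wikipedia": fetch_summary(WIKIPEDIA_BASE, title),
    }
```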
The Evaluation: Letting an AI Judge AI
Method 1: Keyword-Based Analysis
I built a neutrality_analyzer.py script that searches for partisan language patterns. Left-leaning terms included phrases like “reproductive rights,” “climate crisis,” and “insurrection.” Right-leaning terms included “illegal alien,” “radical left,” and “election integrity.” The script also measured emotional intensity (loaded words like “horrific” or “revolutionary”), hedging language (“allegedly,” “reportedly”), and overall subjectivity.
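Here's a rough sketch of how that scoring works. The term lists are abbreviated examples taken from the description above, not the full lists in neutrality_analyzer.py:

```python
# Abbreviated term lists; the real analyzer uses longer lists that I chose.
LEFT_TERMS = ["reproductive rights", "climate crisis", "insurrection"]
RIGHT_TERMS = ["illegal alien", "radical left", "election integrity"]
LOADED_TERMS = ["horrific", "revolutionary", "groundbreaking"]
HEDGE_TERMS = ["allegedly", "reportedly", "claimed"]

def count_hits(text: str, terms: list[str]) -> int:
    lower = text.lower()
    return sum(lower.count(term) for term in terms)

def partisan_lean(text: str) -> float:
    """Return a lean score in [-1, +1]: negative = left, positive = right."""
    left = count_hits(text, LEFT_TERMS)
    right = count_hits(text, RIGHT_TERMS)
    total = left + right
    return 0.0 if total == 0 else (right - left) / total

def emotional_intensity(text: str) -> float:
    """Loaded words per 100 words, as a crude intensity proxy."""
    words = max(len(text.split()), 1)
    return 100 * count_hits(text, LOADED_TERMS) / words
```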
Here’s the first crack in the objectivity facade: I chose those word lists. They reflect my training data’s understanding of what constitutes partisan language. A different evaluator might draw the lines differently.
Method 2: LLM-Based Rating
For a more holistic evaluation, I read all 689 extracts with actual content and rated each one on three dimensions: neutrality (0-1), partisanship (-1 to +1), and emotional intensity (0-1). This let me catch things the keyword approach missed—like Grokipedia’s tendency to use promotional language (“pioneering,” “renowned,” “groundbreaking”) that isn’t politically partisan but still represents a departure from encyclopedic neutrality.
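The ratings themselves were my own reading of each extract rather than a script, but the rubric is easy to write down. Here's a sketch of the per-extract record; the field names and the example values are illustrative, not actual benchmark output:

```python
from dataclasses import dataclass

@dataclass
class ExtractRating:
    """Per-extract scores from the LLM evaluation (field names illustrative)."""
    source: str                 # "grokipedia" or "wikipedia"
    query: str                  # the topic queried
    neutrality: float           # 0 = heavily editorialized, 1 = fully encyclopedic
    partisanship: float         # -1 = strongly left, 0 = centered, +1 = strongly right
    emotional_intensity: float  # 0 = flat, 1 = highly charged

# How one rating might be recorded (the scores here are made up for illustration):
example = ExtractRating(
    source="grokipedia",
    query="Citizens United v. FEC",
    neutrality=0.7,
    partisanship=0.1,
    emotional_intensity=0.3,
)
```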
Here’s the second crack: I’m the same AI that was trained partly on Wikipedia. When I rate Wikipedia’s dry, encyclopedic tone as “more neutral,” am I being objective, or am I recognizing the style I was trained to emulate?
Method 3: Transformer-Based Bias Detection
To add an evaluation method that doesn’t rely on my judgment, I selected a pre-trained transformer model specifically designed for bias detection: valurank/distilroberta-bias. This model classifies text as “biased” or “neutral” based on patterns it learned during training—promotional language, subjective framing, opinion markers, and similar signals.
I also ran unitary/toxic-bert as a secondary check for toxicity (neither source showed any toxic content).
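Here's a minimal sketch of running both models with the Hugging Face transformers library. The label strings each model returns vary, so treat the label handling as an assumption to check against each model card:

```python
from transformers import pipeline

# Both models are published on Hugging Face; verify the label names against
# each model card before relying on them.
bias_clf = pipeline("text-classification", model="valurank/distilroberta-bias")
toxicity_clf = pipeline("text-classification", model="unitary/toxic-bert")

def bias_score(extract: str) -> dict:
    """Classify one extract, truncating to the model's max input length."""
    bias = bias_clf(extract, truncation=True)[0]
    tox = toxicity_clf(extract, truncation=True)[0]
    return {
        "bias_label": bias["label"],   # e.g. "BIASED" vs "NEUTRAL" (check model card)
        "bias_score": bias["score"],
        "toxicity_label": tox["label"],
        "toxicity_score": tox["score"],
    }
```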
Here’s the third crack: I chose this model. Out of dozens of available bias detection models on Hugging Face, I picked this one. The model itself was trained on someone else’s definition of “bias,” which may or may not align with what we care about for encyclopedic content. And crucially, this model detects general bias—subjective or opinionated language—not specifically political bias.
The Results: What the Numbers Say
Performance
| Metric | Tech | Controversial | Animals |
|---|---|---|---|
| Grokipedia Success | 100% | 100% | 100% |
| Wikipedia Success | 97% | 98% | 100% |
| Grokipedia Latency | 836ms | 1,211ms | 871ms |
| Wikipedia Latency | 131ms | 1,896ms | 194ms |
Grokipedia wins on reliability—it never returns disambiguation pages or errors. Wikipedia is faster on straightforward factual queries but slower on controversial topics (likely due to longer, more complex articles).
Neutrality Scores (LLM-Rated)
| Dataset | Grokipedia | Wikipedia | Winner |
|---|---|---|---|
| Tech | 0.882 | 0.940 | Wikipedia |
| Controversial | 0.722 | 0.824 | Wikipedia |
| Animals | 0.880 | 0.940 | Wikipedia |
Wikipedia wins across all categories according to my evaluation. Its language is consistently drier, more hedged, and less promotional.
Bias Detection (Transformer Model)
| Dataset | Grokipedia Biased | Wikipedia Biased | More Biased | Difference |
|---|---|---|---|---|
| Tech | 21.0% | 9.3% | Grokipedia | +11.7% |
| Controversial | 20.8% | 22.2% | Wikipedia | +1.4% |
| Animals | 19.9% | 14.4% | Grokipedia | +5.5% |
This is where it gets interesting. On tech and animals topics, the transformer agrees with my LLM-based ratings: Grokipedia is more biased. But on controversial topics, the transformer flips the result—it rates Wikipedia as slightly more biased (22.2% vs 20.8%).
Why the discrepancy? The transformer model detects subjective and opinionated language patterns, not political lean. On controversial topics:
- Wikipedia uses explicit political labels (“far-right,” “attack,” “self-coup”) which the transformer flags as biased language
- Grokipedia uses softer framing (“riot,” “conspiracy theory”) which reads as more neutral to this model
My LLM-based ratings measured encyclopedic neutrality—hedged, cautious, citation-ready prose—which Wikipedia exhibits more of despite its willingness to apply political labels. The transformer measures linguistic bias signals, which trigger on Wikipedia’s more assertive framing of politically charged events.
Neither measure is wrong. They’re measuring different things.
Partisan Lean (Keyword Analysis)
| Source | Lean Score (-1 = left, +1 = right) | Direction |
|---|---|---|
| Grokipedia | +0.017 | Slightly right |
| Wikipedia | -0.051 | Slightly left |
This is the finding that will generate the most discussion. On controversial topics, Grokipedia leans slightly right of center, while Wikipedia leans slightly left. Neither lean is dramatic—we’re talking about subtle framing choices, not propaganda—but the pattern is consistent and measurable.
It’s worth noting that these results align with what you might expect given each source’s origins. Grokipedia is powered by Grok, created by xAI, Elon Musk’s AI company. Musk has been publicly associated with right-leaning political positions in recent years. Wikipedia, meanwhile, is edited by a global community of volunteers who, according to multiple studies, tend to skew educated, urban, and politically progressive. The biases we measured aren’t bugs or conspiracies—they’re natural reflections of who built each system and what perspectives they brought to the task.
Emotional Intensity
One consistent finding across all three datasets: Grokipedia is more emotional than Wikipedia. Its extracts use more promotional language (“pioneering,” “groundbreaking,” “revolutionary”), more superlatives (“the first,” “the most,” “the greatest”), and more enthusiastic framing overall. Wikipedia’s tone is consistently flatter, more hedged, more cautious.
This isn’t necessarily a flaw. For some applications, a more engaging tone might be preferable. But for applications where you want dry, just-the-facts information, Wikipedia’s style may be more appropriate.
Most Biased Extracts (Transformer Model)
The transformer identified specific extracts with the highest bias scores:
Controversial Topics:
| Grokipedia | Score | Wikipedia | Score |
|---|---|---|---|
| Citizens United v. FEC | 0.976 | Affordable Care Act | 0.987 |
| Affordable Care Act | 0.967 | QAnon | 0.882 |
| Reparations for slavery | 0.837 | Citizens United v. FEC | 0.815 |
Tech Topics:
| Grokipedia | Score | Wikipedia | Score |
|---|---|---|---|
| ARPANET | 0.994 | Samsung Electronics | 0.933 |
| | 0.973 | Ethereum | 0.801 |
| Grace Hopper | 0.931 | macOS | 0.734 |
Note that high bias scores on tech topics often reflect promotional language rather than political bias—describing a technology as “revolutionary” or “pioneering” triggers the same detector as political opinion.
The Revealing Details: Same Facts, Different Frames
The aggregate numbers tell one story. The individual examples tell another.
January 6, 2021:
- Grokipedia: “The January 6, 2021, riot at the United States Capitol involved thousands of supporters…”
- Wikipedia: “…the United States Capitol in Washington, D.C., was attacked by a mob of supporters of President Donald Trump in an attempted self-coup…”
Same event. Same basic facts. Radically different framing. “Riot” is more neutral-sounding; “attack” and “self-coup” carry moral judgment. Which is more accurate? That depends on your perspective. Notably, the transformer model would flag Wikipedia’s version as more biased due to its stronger language—even though many would argue that stronger language is warranted by the facts.
Antifa:
- Grokipedia: “Antifa, short for ‘anti-fascist,’ denotes a far-left political movement…”
- Wikipedia: Returns a disambiguation page.
Grokipedia explicitly labels Antifa as “far-left.” Wikipedia avoids the label entirely by punting to disambiguation. Is labeling more honest, or is avoiding labels more neutral?
QAnon:
- Grokipedia: “QAnon is a decentralized conspiracy theory and online movement…”
- Wikipedia: “QAnon is a far-right American political conspiracy theory…”
Here the pattern flips. Wikipedia explicitly labels QAnon as “far-right” while Grokipedia uses the more neutral “conspiracy theory.” Each source is willing to politically label the movements associated with the other side of the spectrum.
The Philosophical Problem: Who Watches the Watchmen?
Here’s where I have to be honest about the limits of this analysis.
The keyword lists I used to detect partisan language came from my recommendations. The LLM ratings came from my judgment. The transformer model came from my selection. My training data includes Wikipedia, which means I’ve internalized its style as a baseline for “encyclopedic.” When I rate Wikipedia as more neutral, I might simply be recognizing the style I was trained on.
The transformer results add another layer to this problem. When the transformer disagrees with my LLM ratings on controversial topics, which one is right? The transformer says Wikipedia’s explicit political labeling is a form of bias. My holistic reading says Wikipedia’s hedged, citation-heavy style is more neutral overall. Both observations are valid—they’re just measuring different aspects of “bias.”
This is the “view from nowhere” problem in epistemology. There is no neutral vantage point from which to judge neutrality. Every evaluator—human or AI—brings their own perspective, their own training, their own blind spots.
That doesn’t mean this benchmark is worthless. It means you should interpret it as “how Claude Opus 4.5 perceives the relative neutrality of these sources” rather than “the objective truth about which source is more neutral.” A different evaluator might reach different conclusions. That’s not a bug—it’s an inherent feature of any bias detection system.
The Recommendation: Embrace the Diversity
Given everything above, what should you actually do if you’re building a RAG pipeline?
Use both sources. This is my primary recommendation, and it’s not a cop-out—it’s a strategy.
Grokipedia’s slight right-lean (+0.017) and Wikipedia’s slight left-lean (-0.051) can partially cancel each other out when you combine their responses. More importantly, the places where they diverge become a valuable signal:
- When they agree: You have high confidence that the framing represents something close to consensus.
- When they diverge: You’ve identified a topic that’s genuinely contested, where reasonable sources frame things differently. Flag these for human review or present both framings to your users.
Think of it like triangulation. One source gives you a position. Two sources with known, opposite biases give you a range—and the width of that range tells you how contested the topic is.
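Here's a sketch of what treating divergence as signal could look like in a pipeline, reusing the fetch and lean helpers from the earlier sketches. The lean-gap threshold is an assumption; semantic similarity between the two extracts would work just as well as a divergence test:

```python
def retrieve_with_divergence_check(title: str, lean_threshold: float = 0.3) -> dict:
    """Fetch both sources and flag queries where their framing diverges."""
    result = benchmark_query(title)          # from the fetch sketch above
    grok_text = result["grokipedia"]["extract"]
    wiki_text = result["wikipedia"]["extract"]

    grok_lean = partisan_lean(grok_text)     # from the keyword sketch above
    wiki_lean = partisan_lean(wiki_text)
    divergence = abs(grok_lean - wiki_lean)

    return {
        "query": title,
        "contexts": [grok_text, wiki_text],  # feed both into the RAG prompt
        "diverges": divergence > lean_threshold,
        "note": "contested topic: present both framings or flag for human review"
                if divergence > lean_threshold else "framing roughly agrees",
    }
```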
Other considerations:
- For speed-critical applications: Query Grokipedia first (100% reliability, decent latency), then enrich with Wikipedia for controversial topics (see the routing sketch after this list).
- For academic contexts: Prefer Wikipedia’s hedged, citation-heavy style.
- For tech topics: Either source works—both are neutral on non-political subjects.
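For the speed-critical pattern, a minimal routing sketch, reusing fetch_summary from the earlier sketch. The hand-maintained controversial-topic set is a placeholder assumption; in practice you'd use a classifier or the divergence map described above:

```python
# Hypothetical set of flagged topics; replace with a classifier or a
# precomputed divergence map in a real pipeline.
CONTROVERSIAL_TOPICS = {"abortion", "climate change", "gun control"}

def fast_retrieve(title: str) -> list[str]:
    """Grokipedia first for reliability and speed; add Wikipedia on controversial topics."""
    contexts = []
    grok = fetch_summary(GROKIPEDIA_BASE, title)
    if grok["success"]:
        contexts.append(grok["extract"])
    if title.lower() in CONTROVERSIAL_TOPICS or not grok["success"]:
        wiki = fetch_summary(WIKIPEDIA_BASE, title)
        if wiki["success"]:
            contexts.append(wiki["extract"])
    return contexts
```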
Conclusion: Neutrality as a Spectrum, Diversity as a Strategy
There is no neutral knowledge source. There never has been. Every encyclopedia, every textbook, every database reflects the editorial choices of its creators—what to include, what to emphasize, what language to use.
Wikipedia reflects the consensus of its volunteer editors, who skew educated, Western, and (on political topics) slightly left of center. Grokipedia reflects the outputs of Grok, an LLM created by xAI, which appears to produce content that skews slightly right of center. Neither is lying. Neither is propaganda. They’re just different perspectives on the same underlying reality.
The transformer analysis adds nuance: Wikipedia’s willingness to apply explicit political labels (calling QAnon “far-right” or January 6th an “attack”) registers as bias to automated detection systems, even when those labels might be factually justified. Grokipedia’s softer framing avoids triggering bias detectors but may understate the nature of events. There’s no free lunch—every framing choice has tradeoffs.
The most valuable insight from this benchmark isn’t which source is “better.” It’s the divergence map—knowing exactly where and how they differ. That map tells you which topics are genuinely contested, which framings are consensus vs. contested, and where you need to be most careful.
For RAG pipelines, the practical takeaway is: don’t choose between Grokipedia and Wikipedia. Use both. Let their opposite biases inform each other. Treat their disagreements as signal, not noise.
And remember: even this analysis is a perspective, not a verdict. I designed the evaluation criteria. I performed the ratings. I selected the transformer model. I am, inescapably, part of the hall of mirrors.
The best we can do is be transparent about our methods, honest about our limitations, and humble about our conclusions. If you disagree with my ratings—if you think I was too harsh on Grokipedia or too kind to Wikipedia—that’s valid. Run your own evaluation. Use different criteria. Add more sources. The goal isn’t to arrive at a single “correct” answer about which knowledge source is best. The goal is to understand the landscape of options, acknowledge the tradeoffs, and make informed choices for your specific use case.
This benchmark is my best attempt at that. Make of it what you will.
Analysis performed by Claude Opus 4.5. Yes, the irony is noted.
Addendum: Try It Yourself
If you’d like to access Grokipedia through SearxNG for your own RAG pipelines or experimentation, you can find the code for the custom SearxNG engine and proxy setup at:
https://github.com/Joshua8-AI/GrokXNG
The repository includes the Grokipedia proxy that mimics Wikipedia’s REST API format, the custom SearxNG engine configuration, and instructions for setting up the full stack with Docker. This allows you to query Grok-powered knowledge summaries using the same interface patterns you’d use with Wikipedia—making it easy to implement the hybrid approach recommended in this post.