As AI chatbots become our go-to tools for research, writing, and problem-solving, one critical question emerges: how often do these systems make things up? When an AI confidently states false information—a phenomenon known as "hallucination"—the consequences can range from embarrassing to dangerous, especially in professional or academic contexts.
A recent experiment shared on Reddit by user BluebirdFront9797, which put the same prompts to three leading AI models, reveals surprising differences in how often they hallucinate, and why raw accuracy numbers don't tell the whole story.
The Testing Methodology: How to Catch an AI in a Lie
The researcher ran the same 1,000 prompts through ChatGPT, Claude, and Perplexity, then put every response through a rigorous three-step verification process. Here's how the hallucination detection system worked:
Step 1: Claim Extraction
An LLM analyzed each AI response and extracted all verifiable factual claims—essentially breaking down the answer into individual statements that could be fact-checked.
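The post doesn't include code, but a minimal sketch of this step might look like the following. It assumes an OpenAI-style chat client; the model name, prompt wording, and the `extract_claims` helper are illustrative, since the researcher doesn't say which LLM or prompt was actually used.

```python
# Claim-extraction sketch. The model name, prompt, and extract_claims helper
# are illustrative; the post does not say which LLM or prompt was used.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

EXTRACTION_PROMPT = (
    "List every verifiable factual claim in the following answer, "
    "one claim per line. Ignore opinions and hedged statements.\n\n{answer}"
)

def extract_claims(answer: str) -> list[str]:
    """Break an AI answer into individually checkable factual claims."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": EXTRACTION_PROMPT.format(answer=answer)}],
    )
    lines = response.choices[0].message.content.splitlines()
    return [line.strip("- ").strip() for line in lines if line.strip()]
```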
Step 2: Web Verification
For each extracted claim, Exa (a web search API) scoured the web for the most relevant authoritative sources that could confirm or contradict the statement.
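A sketch of the retrieval step, assuming Exa's Python SDK (`exa_py`); the exact method, result count, and the claim-as-query strategy are my assumptions, since the post doesn't describe how queries were built.

```python
# Web-verification sketch, assuming Exa's Python SDK (exa_py).
# Using each claim verbatim as the query is an assumption; the post
# does not describe the researcher's query strategy.
from exa_py import Exa

exa = Exa("YOUR_EXA_API_KEY")

def find_sources(claim: str, num_results: int = 3) -> list[tuple[str, str]]:
    """Return (url, text) pairs for sources that may confirm or contradict the claim."""
    results = exa.search_and_contents(claim, num_results=num_results, text=True)
    return [(r.url, r.text) for r in results.results]
```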
Step 3: Verdict Assignment
Another LLM compared each claim against the sources found and assigned one of three verdicts:
- True: Supported by credible sources
- Unsupported: No evidence found
- Conflicting: Contradicted by available sources
Each verdict came with a confidence score to account for ambiguity.
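One way to implement this judging step is to ask an LLM for a structured verdict plus a confidence score, as in the hedged sketch below. The prompt, JSON schema, and model choice are assumptions rather than the researcher's actual setup.

```python
# Verdict-assignment sketch: an LLM compares a claim with retrieved sources
# and returns a verdict plus a confidence score. Prompt wording, JSON schema,
# and model choice are assumptions, not the researcher's actual setup.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "Claim: {claim}\n\nSources:\n{sources}\n\n"
    "Does the claim hold up? Reply with JSON: "
    '{{"verdict": "true" | "unsupported" | "conflicting", "confidence": <0.0-1.0>}}'
)

def judge_claim(claim: str, sources: list[str]) -> dict:
    """Return e.g. {"verdict": "unsupported", "confidence": 0.8}."""
    prompt = JUDGE_PROMPT.format(claim=claim, sources="\n---\n".join(sources))
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```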
The hallucination threshold: As reported in the study, an answer was marked as containing a hallucination if at least one of its claims was either unsupported or conflicting with source material. This is a strict but fair standard—after all, one false statement can undermine an entire response.
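That rule is easy to restate in code: the sketch below simply applies the reported threshold to per-claim verdicts and computes the headline rate over a batch of answers.

```python
# The reported threshold, restated: an answer counts as containing a
# hallucination if at least one of its claims is unsupported or conflicting.
def answer_hallucinated(verdicts: list[str]) -> bool:
    return any(v in ("unsupported", "conflicting") for v in verdicts)

def hallucination_rate(per_answer_verdicts: list[list[str]]) -> float:
    """Fraction of answers flagged under the one-bad-claim rule."""
    flagged = sum(answer_hallucinated(v) for v in per_answer_verdicts)
    return flagged / len(per_answer_verdicts)

# 120 flagged answers out of 1,000 prompts gives 0.12, i.e. ChatGPT's reported 12%.
```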
The Results: Which AI Lies Most?
Out of 1,000 prompts tested, here's how each model performed:
ChatGPT: The Middle Ground
- Hallucination rate: 12% (120 out of 1,000 answers)
- Performance analysis: ChatGPT showed moderate reliability, with roughly one in eight responses containing at least one unverifiable or false claim
Claude: The Worst Performer
- Hallucination rate: 15% (150 out of 1,000 answers)
- Performance analysis: According to the test results, Claude had the highest hallucination rate among the three models, with approximately three in twenty responses containing problematic claims
Perplexity: The Complicated Winner
- Hallucination rate: 3.3% (33 out of 1,000 answers)
- The catch: While Perplexity appeared to be the clear winner with the lowest hallucination rate, the Exa verification system revealed a significant caveat
As the researcher noted, most of Perplexity's "safe" answers were "low-effort copy-paste jobs, generic summaries or stitched quotes." More tellingly, "in the rare cases where it actually tried to generate original content, the hallucination rate exploded."
What This Really Means: The Trade-Off Between Safety and Usefulness
This is where the results become fascinating from a practical standpoint. Perplexity's strategy appears to be risk avoidance rather than genuine accuracy. (This is my interpretation based on the reported findings.)
Think of it this way: If you ask someone a complex question and they simply read back excerpts from a textbook without synthesizing information, they're technically "accurate"—but are they actually helpful? Perplexity seems to have optimized for not being wrong rather than being genuinely insightful.
This creates an important distinction for users:
When you need:
- Quick fact verification → Perplexity's approach may work well
- Original synthesis and analysis → ChatGPT or Claude might be more useful despite higher hallucination rates
- Critical information requiring verification → Always fact-check regardless of the model
The Broader Implications for AI Trust
This experiment highlights several critical considerations for anyone using AI tools:
1. No Model is Fully Reliable
Even the "best" performer (Perplexity at 3.3%) still hallucinated in some responses. This means approximately 1 in 30 answers contained false information—a rate that's unacceptable for critical applications without human verification.
2. The Copy-Paste Problem
Perplexity's low hallucination rate came at the cost of originality. This raises questions about what we actually want from AI: safe regurgitation of existing content, or creative synthesis with higher risk?
3. Context Matters
The 1,000 prompts tested weren't specified in detail, but hallucination rates likely vary significantly based on:
- Topic complexity
- Availability of training data
- Recency of information required
- Whether the question requires reasoning vs. recall
4. Verification is Essential
As reported in this study, even sophisticated AI models make things up regularly. The takeaway? Never use AI-generated information for important decisions without independent verification [LINK: AI fact-checking tools].
Key Takeaways for AI Users
✓ ChatGPT hallucinated in 12% of responses tested—roughly 1 in 8 answers contained false claims
✓ Claude showed the highest hallucination rate at 15%—approximately 3 in 20 responses had issues
✓ Perplexity had only 3.3% hallucinations but achieved this primarily through copy-pasting rather than original content generation
✓ When Perplexity attempted original synthesis, its hallucination rate increased dramatically
✓ All three models showed significant reliability issues, reinforcing that human fact-checking remains essential
What You Should Do Next
Based on these findings, here are actionable steps for working with AI chatbots:
- Choose your tool based on your task: Use Perplexity for straightforward fact-gathering, but consider ChatGPT or Claude when you need creative synthesis (with appropriate fact-checking)
- Implement verification workflows: For any important use case, cross-reference AI outputs with authoritative sources [LINK: source verification methods]; a minimal sketch of such a workflow follows this list
- Be skeptical of confident-sounding claims: AI models don't indicate uncertainty well—they often present hallucinations with the same confidence as facts
- Consider using hallucination detection tools: Technologies like Exa can help identify unsupported claims in AI-generated content
- Stay informed about model updates: These hallucination rates reflect specific versions tested; newer releases may perform differently
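As a hedged illustration of such a verification workflow, here is how the earlier sketches could be composed into a single check. The helper names (`extract_claims`, `find_sources`, `judge_claim`) come from those sketches and are hypothetical, not a published tool.

```python
# Hypothetical end-to-end check composed from the earlier sketches:
# extract claims, retrieve sources, and flag anything unsupported or conflicting.
def verify_answer(answer: str) -> list[dict]:
    report = []
    for claim in extract_claims(answer):
        sources = [text for _, text in find_sources(claim)]
        report.append({"claim": claim, **judge_claim(claim, sources)})
    return report

answer = "Paris is the capital of France and has a population of 40 million."
flags = [r for r in verify_answer(answer) if r["verdict"] != "true"]
print(flags)  # the population claim should come back unsupported or conflicting
```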
The Bottom Line
This experiment, as shared by BluebirdFront9797, provides valuable empirical data about AI reliability. While no model performed perfectly, the results reveal that accuracy and usefulness exist in tension—the safest AI (Perplexity) achieved low hallucination rates partly by avoiding original thought.
For users, the message is clear: AI chatbots are powerful tools, but they're not yet trustworthy enough to use without verification. Understanding each model's strengths and weaknesses—and implementing appropriate safeguards—is essential for anyone relying on AI in their work or research.