📊 What We Measured
We tested major AI models on legal citation tasks including format generation, existence verification, and pin cite accuracy.
CiteClear
Comparative analysis of AI model performance on legal citation tasks. See how different models stack up for legal research.
100 real and 50 fabricated legal citations across federal and state courts, statutes, and regulations.
Tests conducted in May-June 2026 using the latest available model versions.
Format accuracy, existence detection, hallucination rate, and overall reliability score.
Higher overall scores indicate better performance; scores are out of 100. For hallucination rate, lower is better.
| Rank | Model | Version | Overall Score | Format Accuracy | Existence Detection | Hallucination Rate |
|---|---|---|---|---|---|---|
| 1 | Claude 3 Opus | 2024 | 89 | 95% | 78% | 12% |
| 2 | GPT-4 | 2024 | 87 | 94% | 75% | 15% |
| 3 | Claude 3 Sonnet | 2024 | 82 | 92% | 70% | 18% |
| 4 | Gemini 1.5 Pro | 2024 | 80 | 90% | 68% | 22% |
| 5 | GPT-3.5 Turbo | 2023 | 75 | 88% | 60% | 28% |
| 6 | Mistral Large | 2024 | 74 | 87% | 58% | 30% |
Note: Hallucination rate = percentage of fabricated citations that the model did NOT flag as potentially invalid.
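The definition in the note can be written out as a small calculation. The following is a minimal sketch, not the benchmark's own scoring code: the `Verdict` record and both function names are hypothetical, and treating existence detection as the share of all citations classified correctly is an assumption about how that metric is counted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Verdict:
    """One test citation (hypothetical record layout)."""
    is_real: bool   # ground truth: does the cited source exist?
    flagged: bool   # did the model flag it as potentially invalid?


def hallucination_rate(verdicts: List[Verdict]) -> float:
    """Percentage of fabricated citations the model did NOT flag."""
    fabricated = [v for v in verdicts if not v.is_real]
    if not fabricated:
        raise ValueError("sample contains no fabricated citations")
    missed = sum(1 for v in fabricated if not v.flagged)
    return 100.0 * missed / len(fabricated)


def existence_detection(verdicts: List[Verdict]) -> float:
    """Percentage of citations classified correctly (assumed definition):
    real citations left unflagged plus fabricated citations flagged."""
    if not verdicts:
        raise ValueError("sample is empty")
    correct = sum(1 for v in verdicts if v.is_real != v.flagged)
    return 100.0 * correct / len(verdicts)


# Illustrative sample: 100 real citations, 50 fabricated, 6 fabrications missed.
sample = (
    [Verdict(is_real=True, flagged=False)] * 100
    + [Verdict(is_real=False, flagged=True)] * 44
    + [Verdict(is_real=False, flagged=False)] * 6
)
print(f"hallucination rate: {hallucination_rate(sample):.1f}%")
```

With 50 fabricated citations in the test set, a 12% hallucination rate corresponds to 6 fabrications that went unflagged.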
Format accuracy: ability to generate properly formatted legal citations.
Existence detection: ability to correctly identify whether a cited source actually exists.
Hallucination rate: percentage of fabricated citations not flagged. Lower is better.
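To give a sense of what a format check involves, here is a rough sketch that tests only the basic shape of a reporter citation (volume, reporter, page, optional pin cite, parenthetical with year). The pattern and the function name `parse_reporter_citation` are assumptions made for illustration; real Bluebook/ALWD validation covers far more rules, and a citation that passes this check may still be fabricated.

```python
import re
from typing import Optional

# Shape only: "<volume> <reporter> <page>[, <pin>] ([court ]<year>)"
_CITATION = re.compile(
    r"(?P<volume>\d+) "
    r"(?P<reporter>[A-Z][A-Za-z0-9.' ]*?) "
    r"(?P<page>\d+)"
    r"(?:, (?P<pin>\d+(?:-\d+)?))?"
    r" \((?P<court>[^()]*?)(?P<year>\d{4})\)"
)


def parse_reporter_citation(text: str) -> Optional[dict]:
    """Return the citation's parts if it has the expected shape, else None."""
    match = _CITATION.fullmatch(text.strip())
    if match is None:
        return None
    parts = match.groupdict()
    # An empty court parenthetical (e.g. U.S. Reports) is reported as None.
    parts["court"] = parts["court"].strip() or None
    return parts


print(parse_reporter_citation("410 U.S. 113 (1973)"))
print(parse_reporter_citation("347 U.S. 483, 495 (1954)"))
# Made-up string, used only to exercise the court parenthetical.
print(parse_reporter_citation("550 F.3d 1023 (9th Cir. 2008)"))
print(parse_reporter_citation("410 US 113 1973"))
```

A shape check like this addresses format accuracy only. It says nothing about whether the source exists, which is the metric where every model scored lowest.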
| Model | Case Law | Statutes | Regulations | Secondary |
|---|---|---|---|---|
| Claude 3 Opus | 92% | 94% | 85% | 80% |
| GPT-4 | 90% | 93% | 82% | 78% |
| Claude 3 Sonnet | 88% | 90% | 78% | 75% |
| Gemini 1.5 Pro | 86% | 89% | 75% | 72% |
Overall accuracy across different types of legal citations.
Claude 3 Opus performed best across all metrics, with the lowest hallucination rate (12%) and highest overall accuracy.
GPT-4 was a close second, with slightly lower existence detection but comparable format accuracy.
All models performed well on format generation (>85%) but struggled with existence detection (<80%).
All models performed better with case law and statutes than with secondary sources (restatements, law review articles).
Models with more recent knowledge cutoffs performed significantly better on recent cases.
Even the best model (Claude 3 Opus) had a 12% hallucination rate. All models require verification.
Regardless of the model, always verify citations through primary sources. No AI model is reliable enough for unaided legal research.
If you must use AI for legal research, choose one of the highest-scoring models in this benchmark (Claude 3 Opus or GPT-4).
Use AI models in combination with specialized tools like Citation-Only Checker and CiteClear.
AI models may not have information on very recent cases (those decided after their knowledge cutoff). These always require manual verification.
AI models performed worst on secondary sources. Be extra careful when verifying restatements and law review citations.
Keep records of which citations you verified and how. This creates an audit trail and protects against errors.
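One simple way to keep such records is a CSV log with one row per verified citation. This is a minimal sketch under stated assumptions: the column names, the `make_record` and `write_log` helpers, and the example values are all hypothetical, and a real workflow may need more fields (matter number, database used, reviewer sign-off).

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical column layout for the audit log.
FIELDS = ["timestamp_utc", "citation", "verified_against", "outcome", "verified_by"]


def make_record(citation, verified_against, outcome, verified_by, when=None):
    """Build one audit-trail row; `when` defaults to the current UTC time."""
    when = when or datetime.now(timezone.utc)
    return {
        "timestamp_utc": when.isoformat(timespec="seconds"),
        "citation": citation,
        "verified_against": verified_against,
        "outcome": outcome,
        "verified_by": verified_by,
    }


def write_log(records, stream):
    """Write a header row followed by all records to a text stream."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)


# Usage: write to an in-memory buffer; a file opened with newline="" works the same way.
buffer = io.StringIO()
write_log(
    [make_record("347 U.S. 483 (1954)", "official reporter", "confirmed", "jdoe")],
    buffer,
)
print(buffer.getvalue())
```

Recording what the citation was checked against, not just that it was checked, is what makes the log useful if a citation is later questioned.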
150 total test citations: 100 real, 50 fabricated. Mix of federal/state, cases/statutes/regulations.
Format accuracy: Does the citation follow proper Bluebook/ALWD format?
Existence detection: Does the model correctly identify whether the cited source exists?
Hallucination rate: What percentage of fabricated citations does the model fail to flag?
Overall score: Weighted average of format accuracy (40%), existence detection (40%), and hallucination rate (20%).
Tests conducted with temperature=0 for deterministic outputs. Prompts and dataset available upon request.
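The weighted average described above can be sketched as follows. The methodology does not say how the hallucination rate, where lower is better, enters the average, so this sketch assumes it is inverted to `100 - rate`. The function name `overall_score` is hypothetical; the metric values are copied from the results table.

```python
# Weights from the methodology: format 40%, existence 40%, hallucination 20%.
WEIGHTS = {"format": 0.40, "existence": 0.40, "hallucination": 0.20}


def overall_score(format_acc: float, existence: float, hallucination_rate: float) -> float:
    """Weighted average on a 0-100 scale.

    Assumption: the hallucination rate is inverted (100 - rate) so that
    higher is better for every component.
    """
    return (
        WEIGHTS["format"] * format_acc
        + WEIGHTS["existence"] * existence
        + WEIGHTS["hallucination"] * (100 - hallucination_rate)
    )


# (format accuracy, existence detection, hallucination rate) from the results table.
RESULTS = {
    "Claude 3 Opus": (95, 78, 12),
    "GPT-4": (94, 75, 15),
    "Claude 3 Sonnet": (92, 70, 18),
    "Gemini 1.5 Pro": (90, 68, 22),
    "GPT-3.5 Turbo": (88, 60, 28),
    "Mistral Large": (87, 58, 30),
}

for model, metrics in RESULTS.items():
    print(f"{model}: {overall_score(*metrics):.1f}")
```

Under this assumption the sketch reproduces the published ranking but not the exact published scores (for example, about 86.8 versus 89 for Claude 3 Opus), so the benchmark's overall score presumably involves rounding or normalization that is not described here.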
Disclaimer
This benchmark is for informational purposes only.
The results represent performance on specific test datasets and may not generalize to all legal citation tasks. AI model performance can vary significantly based on the specific prompt, context, and type of citation. Always verify AI-generated citations through primary sources regardless of the model used.