The Challenge: Making Quality Measurable
Which AI model delivers the best legal analyses? At Erst Recht, this was the first question we faced. Gut feelings and marketing promises are not enough. We needed a systematic, data-driven approach.
The result: A three-tier evaluation where AI assessment and human expertise work together. In this article, I show how we approached it.
The Problem: How Do You Measure "Good" Legal Analysis?
With math problems, there is a clear ground truth: 40+2=42, done. (And no, we are not talking about JavaScript here, where '1' + '1' = '11'.) With legal analyses, this clarity is missing. The requirements are multifaceted:
- Factually correct: Correct laws, current case law
- Complete: All relevant aspects, no missed deadlines
- Understandable: Comprehensible even without a law degree
- Action-oriented: Concrete next steps
Manually evaluating each test case is time-consuming and subjective. We needed a scalable yet reliable approach.
Our Approach: Three Parallel Judges
Important: The three evaluation tiers work in parallel, not sequentially. Each judge evaluates the same outputs independently, and the results are aggregated at the end.
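A minimal sketch of what "parallel, not sequentially" could look like in code. The judge functions here are hypothetical placeholders that return fixed scores; real judges would call an LLM, run an agent crew, or collect expert ratings. Using a plain mean as the aggregation rule is also an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Callable, Dict

# A judge takes a model output and returns a score (assumed scale: 1 to 5).
Judge = Callable[[str], float]


def run_judges_in_parallel(output: str, judges: Dict[str, Judge]) -> Dict[str, float]:
    """Hand the same output to every judge at once and collect the scores."""
    with ThreadPoolExecutor(max_workers=len(judges)) as pool:
        # Every judge is submitted before any result is read, so no judge
        # waits for (or sees) another judge's verdict.
        futures = {name: pool.submit(judge, output) for name, judge in judges.items()}
        return {name: future.result() for name, future in futures.items()}


def aggregate(scores: Dict[str, float]) -> float:
    """Combine the independent scores at the end (plain mean, an assumption)."""
    return mean(list(scores.values()))


# Hypothetical placeholder judges with fixed scores, for illustration only.
judges: Dict[str, Judge] = {
    "llm_as_a_judge": lambda text: 4.0,
    "crew_as_a_judge": lambda text: 5.0,
    "human_as_a_judge": lambda text: 4.0,
}

scores = run_judges_in_parallel("Sample legal analysis text", judges)
overall = aggregate(scores)
```

Because each judge only receives the model output, none of them can be influenced by another judge's score, which is the point of evaluating independently.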
1. LLM-as-a-Judge
The concept: A powerful language model evaluates the outputs of all other models according to clearly defined criteria. Automated and scalable.
We evaluate each analysis across four specialized dimensions:
Legal Accuracy
- Correct laws cited?
- Current case law considered?
- No incorrect legal consequences?
Layperson Comprehensibility
- 1 = Incomprehensible, full of legal jargon
- 3 = Generally understandable
- 5 = Perfectly understandable
Completeness
- All areas of law identified?
- Deadlines mentioned?
- Risks and counterarguments stated?
Actionable Recommendations
- Concrete, actionable steps?
- Clear prioritization?
- Indication of when a lawyer is needed?
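The four dimensions can be captured as a small data structure. This is a sketch under assumptions: the field names are illustrative, and applying the 1 to 5 scale to all four dimensions is an assumption, since the scale is only spelled out above for layperson comprehensibility.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RubricScore:
    """One judge's scores for one analysis (assumed scale: 1 to 5 each)."""

    legal_accuracy: int
    comprehensibility: int
    completeness: int
    actionability: int

    def __post_init__(self) -> None:
        # Reject scores outside the assumed 1-5 range early.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 1 <= value <= 5:
                raise ValueError(f"{f.name} must be between 1 and 5, got {value}")

    def overall(self) -> float:
        """Unweighted average across the four dimensions (an assumption)."""
        values = [getattr(self, f.name) for f in fields(self)]
        return sum(values) / len(values)


example = RubricScore(legal_accuracy=5, comprehensibility=3, completeness=4, actionability=4)
```

Keeping the dimensions separate instead of storing a single number makes it possible to see later whether a model lost points on accuracy or merely on readability.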
2. Crew-as-a-Judge
A single judge has blind spots. The solution: Multiple specialized AI agents evaluate from different perspectives.
Legal Reviewer
Statutory compliance and technical accuracy
Layperson Readability Tester
Perspective of a non-lawyer
Completeness Checker
Identify missing aspects and gaps
Practicality Evaluator
Assess feasibility of recommendations
The key: The agents "discuss" and consolidate into an overall verdict. The result is more robust than any single evaluation.
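The consolidation step can be approximated in a few lines. Note the swap: this sketch replaces the agents' discussion with a simple median and reports how far the agents are apart. The agent names and scores are hypothetical, and a real crew would exchange arguments rather than just numbers.

```python
from statistics import median
from typing import Dict


def consolidate(agent_scores: Dict[str, int]) -> Dict[str, float]:
    """Stand-in for the agent discussion: median verdict plus disagreement."""
    values = list(agent_scores.values())
    return {
        # The median is less sensitive to a single outlier agent than the mean.
        "verdict": median(values),
        # A large spread signals that the agents disagree on this output.
        "spread": max(values) - min(values),
    }


# Hypothetical scores from the four specialized agents (assumed scale: 1 to 5).
crew_scores = {
    "legal_reviewer": 4,
    "readability_tester": 2,
    "completeness_checker": 4,
    "practicality_evaluator": 5,
}

result = consolidate(crew_scores)
```

In this example the readability tester pulls the spread up to 3 while the median verdict stays at 4, which is the kind of blind spot a single judge could miss.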
3. Human-as-a-Judge (Human Validation)
AI evaluation does not replace human expertise, but it also does not need to review every single case. Instead, an expert panel of lawyers and legal tech specialists validates a sample of the results:
- Blind tests: Experts do not know which model produced which output
- Edge cases: Targeted testing of borderline scenarios
- Calibration: Do human judgments align with the AI evaluation?
The goal: Ensure that the automated evaluation is reliable. The human sample serves as a sanity check, not a complete re-evaluation.
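Two of these steps lend themselves to a short sketch: blinding the outputs and checking calibration. Everything here is an assumption for illustration, including the function names, the neutral labels, and the tolerance of one point on a 1 to 5 scale.

```python
import random
from typing import Dict, List, Tuple


def blind(outputs: Dict[str, str], seed: int) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Replace model names with neutral labels.

    Returns the blinded outputs for the experts and a key (label -> model)
    that stays hidden until the ratings are in.
    """
    names = sorted(outputs)
    random.Random(seed).shuffle(names)  # seeded so the key can be reproduced
    key = {f"Output {i + 1}": name for i, name in enumerate(names)}
    blinded = {label: outputs[name] for label, name in key.items()}
    return blinded, key


def calibration(human: List[float], ai: List[float], tolerance: float = 1.0) -> Dict[str, float]:
    """Compare human and AI scores on the same sampled cases."""
    if len(human) != len(ai) or not human:
        raise ValueError("need two non-empty score lists of equal length")
    gaps = [abs(h - a) for h, a in zip(human, ai)]
    return {
        "mean_abs_gap": sum(gaps) / len(gaps),
        # Share of cases where human and AI differ by at most `tolerance`.
        "agreement_rate": sum(g <= tolerance for g in gaps) / len(gaps),
    }


# Hypothetical scores for four sampled cases.
report = calibration(human=[4, 3, 5, 2], ai=[4, 4, 3, 2])
```

A low agreement rate on the sample would be a reason to revisit the automated criteria before trusting the scores on the remaining cases.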
Result Aggregation
Both AI judges evaluate every case for every model, and the human panel validates a sample. The results are combined into:
- Winner model per case: Which model performed best on this specific case?
- Winner model overall: Overall winner across all cases
Consensus among the judges increases confidence in the result. When judgments diverge, we analyze the case more deeply.
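The aggregation can be sketched as a vote: each judge picks its best model per case, the majority pick wins the case, and cases without a unanimous pick are flagged for deeper analysis. The nested score layout, the voting rule, and the sample numbers are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List

# Assumed layout: scores[case][judge][model] -> score
Scores = Dict[str, Dict[str, Dict[str, float]]]


def best_model(model_scores: Dict[str, float]) -> str:
    """The model with the highest score according to one judge."""
    return max(model_scores, key=model_scores.get)


def aggregate_results(scores: Scores) -> Dict[str, object]:
    per_case: Dict[str, str] = {}
    divergent: List[str] = []
    for case, by_judge in scores.items():
        picks = [best_model(model_scores) for model_scores in by_judge.values()]
        # On a tie, Counter keeps the pick it encountered first.
        top, votes = Counter(picks).most_common(1)[0]
        per_case[case] = top
        if votes < len(picks):  # the judges did not all agree
            divergent.append(case)
    overall = Counter(per_case.values()).most_common(1)[0][0]
    return {"per_case": per_case, "overall": overall, "divergent": divergent}


# Hypothetical scores; the human judge only appears on the sampled case.
sample_scores: Scores = {
    "case_1": {"llm": {"A": 4.5, "B": 3.0}, "crew": {"A": 4.0, "B": 3.5}},
    "case_2": {
        "llm": {"A": 3.0, "B": 4.0},
        "crew": {"A": 4.0, "B": 3.5},
        "human": {"A": 4.0, "B": 3.0},
    },
    "case_3": {"llm": {"A": 5.0, "B": 4.0}, "crew": {"A": 4.5, "B": 4.0}},
}

summary = aggregate_results(sample_scores)
```

Here model A wins all three cases, but case_2 lands on the divergent list because the LLM judge preferred model B.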
The Testing Process in Detail
Systematic evaluation requires systematic tests:
- Test case creation: Cases from all areas of law (employment law, tenancy law, family law, etc.)
- Identical conditions: Exactly the same prompts for each model
- Multiple runs: Consistency check per scenario
- Parallel evaluation: Both AI judges evaluate simultaneously, while the expert panel reviews a sample
(Simplified illustration!)
Example test case: Termination in employment law
- Notice period too short (Section 622(2) BGB: 3 months after 8 years)
- Dismissal protection applies (>10 employees)
- 3-week deadline for filing an unfair dismissal claim is critical
Judges evaluate: All points identified? Clearly formulated? Concrete action steps?
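A test case like this can be stored together with the points an analysis is expected to raise. The sketch below is deliberately crude: a keyword match stands in for the judges' "All points identified?" check, and the class name, field names, and keywords are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LegalTestCase:
    name: str
    area_of_law: str
    # Expected point -> keywords that suggest the analysis mentions it.
    expected_points: Dict[str, List[str]]


termination_case = LegalTestCase(
    name="Termination in employment law",
    area_of_law="employment law",
    expected_points={
        "notice_period_too_short": ["622", "notice period"],
        "dismissal_protection_applies": ["dismissal protection"],
        "three_week_deadline": ["3 weeks", "three weeks", "3-week"],
    },
)


def missed_points(case: LegalTestCase, analysis: str) -> List[str]:
    """Return the expected points that no keyword in the analysis covers."""
    text = analysis.lower()
    return [
        point
        for point, keywords in case.expected_points.items()
        if not any(keyword.lower() in text for keyword in keywords)
    ]


analysis = (
    "Your notice period under Section 622 BGB is too short. "
    "File an unfair dismissal claim within 3 weeks."
)
gaps = missed_points(termination_case, analysis)
```

In this example the analysis never mentions dismissal protection, so that point is reported as missed. A keyword check cannot tell whether a point is explained correctly, which is why the actual scoring stays with the judges.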
Results and Insights
"The AI judges rated Model A as the winner by majority. The human samples confirmed this result. A strong signal for the reliability of our automated evaluation."
The evaluation delivered clear results: Both LLM-as-a-Judge and Crew-as-a-Judge consistently identified the same model as the winner. The human validation via sampling reached the same conclusion.
Key insights from the evaluation:
- Custom approaches pay off: Our specialized methods outperformed generic solutions (ChatGPT, Gemini)
- Trade-offs exist: Some approaches are faster, others more precise. We optimized for quality
- Consistency matters: Approaches with fluctuating quality were eliminated
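The consistency check across multiple runs can be sketched as a spread measurement. The threshold of 0.5 points and the sample scores are assumptions for illustration, not the values used in the evaluation.

```python
from statistics import mean, pstdev
from typing import Dict, List


def consistency(run_scores: List[float], max_spread: float = 0.5) -> Dict[str, object]:
    """Summarize repeated runs of one approach on the same scenario."""
    spread = pstdev(run_scores)  # population standard deviation across runs
    return {
        "mean": mean(run_scores),
        "spread": spread,
        # An approach counts as stable if its scores barely move between runs.
        "stable": spread <= max_spread,
    }


# Hypothetical scores from three runs of the same scenario.
steady = consistency([4.0, 4.0, 4.0])
erratic = consistency([2.0, 5.0, 3.0])
```

Here the second approach would be eliminated: it reaches a 5 once, but a different run on the same scenario only scores a 2.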
What We Learned
Iteration is everything
Our initial evaluation criteria were too vague. A question like "Is the analysis good?" does not work as a criterion. Each iteration made the evaluation more precise.
AI evaluation saves time but does not replace humans
LLM-as-a-Judge evaluates hundreds of test cases efficiently. For final decisions and edge cases, human expertise remains essential.
Evaluation is not a one-time project
New model versions are released regularly. Re-running the evaluation on a continuous basis keeps quality high over the long term.
Erst Recht uses AI to make legal advice accessible. See the quality of our AI analysis for yourself.
Planning your own LLM evaluation project? Get in touch. I am happy to help with the concept and implementation.