Best AI Model for Legal Analysis: Our Method

A three-tier evaluation approach: LLM-as-a-Judge, Crew-as-a-Judge, and human validation. Parallel, not sequential.

The Challenge: Making Quality Measurable

Which AI model delivers the best legal analyses? At Erst Recht, this was the first question we faced. Gut feelings and marketing promises are not enough. We needed a systematic, data-driven approach.

The result: A three-tier evaluation where AI assessment and human expertise work together. In this article, I show how we approached it.

The Problem: How Do You Measure "Good" Legal Analysis?

With math problems, there is a clear ground truth: 40+2=42, done. (And no, we are not talking about JavaScript here, where '1' + '1' = '11'.) With legal analyses, this clarity is missing. The requirements are multifaceted:

Factually correct: Correct laws, current case law
Complete: All relevant aspects, no missed deadlines
Understandable: Comprehensible even without a law degree
Action-oriented: Concrete next steps

Manually evaluating each test case is time-consuming and subjective. We needed a scalable yet reliable approach.

Our Approach: Three Parallel Judges

Important: The three evaluation tiers work in parallel, not sequentially. Each judge evaluates the same outputs independently, and the results are aggregated at the end.

1. LLM-as-a-Judge

The concept: A powerful language model evaluates the outputs of all other models according to clearly defined criteria. Automated and scalable.

We evaluate each analysis across four specialized dimensions:

Legal Accuracy

Correct laws cited?
Current case law considered?
No incorrect legal consequences?

Layperson Comprehensibility

1 = Incomprehensible, full of legal jargon
3 = Generally understandable
5 = Perfectly understandable

Completeness

All areas of law identified?
Deadlines mentioned?
Risks and counterarguments stated?

Actionable Recommendations

Concrete, actionable steps?
Clear prioritization?
Indication of when a lawyer is needed?
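To make the four dimensions concrete, here is a minimal sketch of how such a rubric could be turned into a judge prompt and how the judge's reply could be validated. It is an illustration under assumptions, not the production setup: the names `Dimension`, `RUBRIC`, `build_judge_prompt` and `parse_scores` are hypothetical, the 1 to 5 scale is assumed for all four dimensions (the article only spells it out for comprehensibility), and the call to the judge model itself is left out.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    key: str       # hypothetical machine-readable name
    title: str     # dimension title as used in the article
    guidance: str  # questions or scale the judge should apply


# The four dimensions from the article; keys and the shared 1-5 scale are assumptions.
RUBRIC = [
    Dimension("legal_accuracy", "Legal Accuracy",
              "Correct laws cited? Current case law considered? "
              "No incorrect legal consequences?"),
    Dimension("comprehensibility", "Layperson Comprehensibility",
              "1 = incomprehensible, full of legal jargon; "
              "3 = generally understandable; 5 = perfectly understandable"),
    Dimension("completeness", "Completeness",
              "All areas of law identified? Deadlines mentioned? "
              "Risks and counterarguments stated?"),
    Dimension("actionability", "Actionable Recommendations",
              "Concrete, actionable steps? Clear prioritization? "
              "Indication of when a lawyer is needed?"),
]


def build_judge_prompt(question: str, analysis: str) -> str:
    """Assemble one prompt that asks the judge for a score per dimension."""
    lines = [
        "You are evaluating a legal analysis written for a layperson.",
        "Score each dimension from 1 (worst) to 5 (best).",
        "",
    ]
    for d in RUBRIC:
        lines.append(f"- {d.key} ({d.title}): {d.guidance}")
    keys = ", ".join(f'"{d.key}"' for d in RUBRIC)
    lines += [
        "",
        f"Reply with a JSON object containing exactly the keys {keys}.",
        "",
        "Question:", question,
        "",
        "Analysis:", analysis,
    ]
    return "\n".join(lines)


def parse_scores(reply: str) -> dict[str, int]:
    """Parse the judge's JSON reply and reject missing or out-of-range scores."""
    data = json.loads(reply)
    scores = {}
    for d in RUBRIC:
        value = data[d.key]  # raises KeyError if a dimension is missing
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"invalid score for {d.key}: {value!r}")
        scores[d.key] = value
    return scores
```

Strict parsing matters here: a judge reply that silently drops a dimension would otherwise skew the comparison between models.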

2. Crew-as-a-Judge

A single judge has blind spots. The solution: Multiple specialized AI agents evaluate from different perspectives.

Legal Reviewer

Statutory compliance and technical accuracy

Layperson Readability Tester

Perspective of a non-lawyer

Completeness Checker

Identify missing aspects and gaps

Practicality Evaluator

Assess feasibility of recommendations

The key: The agents "discuss" and consolidate into an overall verdict. The result is more robust than any single evaluation.
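How the agents consolidate their views depends on the agent framework. As a rough stand-in, the sketch below skips the "discussion" entirely and simply takes the median of the individual scores, while keeping the spread so that strong disagreement stays visible. `AgentVerdict` and `consolidate` are hypothetical names, and the 1 to 5 scale is an assumption.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class AgentVerdict:
    role: str   # e.g. "Legal Reviewer"
    score: int  # assumed 1-5 scale
    note: str   # short justification from the agent


def consolidate(verdicts: list[AgentVerdict]) -> dict:
    """Combine the agents' verdicts into one overall verdict.

    This uses a plain median instead of an actual agent discussion.
    """
    if not verdicts:
        raise ValueError("at least one verdict is required")
    scores = [v.score for v in verdicts]
    return {
        "score": median(scores),
        "spread": max(scores) - min(scores),  # large spread = agents disagree
        "notes": {v.role: v.note for v in verdicts},
    }


crew = [
    AgentVerdict("Legal Reviewer", 4, "Statutes cited correctly."),
    AgentVerdict("Layperson Readability Tester", 5, "Easy to follow."),
    AgentVerdict("Completeness Checker", 3, "Filing deadline only mentioned briefly."),
    AgentVerdict("Practicality Evaluator", 4, "Steps are realistic."),
]
overall = consolidate(crew)
```

The median is less sensitive to a single outlier agent than the mean, which is one reason to prefer it in a simple setup like this.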

3. Human-as-a-Judge (Human Validation)

AI evaluation does not replace human expertise, but it also does not need to review every single case. Instead, an expert panel of lawyers and legal tech specialists validates a sample of the results:

Blind tests: Experts do not know which model produced which output
Edge cases: Targeted testing of borderline scenarios
Calibration: Do human judgments align with the AI evaluation?

The goal: Ensure that the automated evaluation is reliable. The human sample serves as a sanity check, not a complete re-evaluation.
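A blind sample and a simple calibration check could look roughly like this. The sketch assumes each output is a dict with `case_id`, `model` and `text`; `blind_sample` and `agreement_rate` are hypothetical helpers, and the agreement rate is just one possible calibration measure.

```python
import random


def blind_sample(outputs: list[dict], sample_size: int, seed: int = 0):
    """Draw a random sample and hide which model produced each output.

    Returns the blinded items for the experts and a separate key that maps
    each blind ID back to (case_id, model) for the later unblinding.
    """
    rng = random.Random(seed)
    chosen = rng.sample(outputs, min(sample_size, len(outputs)))
    blinded, key = [], {}
    for i, item in enumerate(chosen):
        blind_id = f"sample-{i:03d}"
        key[blind_id] = (item["case_id"], item["model"])
        # The model name is deliberately not copied into the blinded item.
        blinded.append({"blind_id": blind_id,
                        "case_id": item["case_id"],
                        "text": item["text"]})
    return blinded, key


def agreement_rate(human_winners: dict[str, str], ai_winners: dict[str, str]) -> float:
    """Share of jointly reviewed cases where humans and AI picked the same winner."""
    shared = human_winners.keys() & ai_winners.keys()
    if not shared:
        raise ValueError("no cases were reviewed by both humans and AI")
    return sum(human_winners[c] == ai_winners[c] for c in shared) / len(shared)
```

A low agreement rate would be a signal to revisit the judge criteria before trusting the automated scores.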

Result Aggregation

The two AI judges evaluate every case for every model, and the human panel reviews a blind sample of those cases. The results are combined into:

Winner model per case: Which model performed best on this specific case?
Winner model overall: Which model performed best across all cases?

Consensus among the judges increases confidence in the result. When judgments diverge, we analyze the case more deeply.
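One simple way to aggregate is a vote: each judge picks its best model per case, the most frequent pick wins the case, and the model with the most case wins is the overall winner. The sketch below assumes scores are stored as judge, then case, then model; the function names are hypothetical and the tie-breaking is deliberately naive.

```python
from collections import Counter

# judge -> case -> model -> score (assumed layout)
Scores = dict[str, dict[str, dict[str, float]]]


def winner_per_case(scores: Scores) -> dict[str, dict]:
    """For each case, let every judge vote for its top-scored model."""
    cases = set()
    for by_case in scores.values():
        cases.update(by_case)
    result = {}
    for case in sorted(cases):
        picks = []
        for by_case in scores.values():
            if case in by_case:  # a judge may have skipped this case
                model_scores = by_case[case]
                picks.append(max(sorted(model_scores), key=model_scores.get))
        winner, votes = Counter(picks).most_common(1)[0]
        # consensus is True only if every judge that saw the case agrees
        result[case] = {"winner": winner, "consensus": votes == len(picks)}
    return result


def overall_winner(per_case: dict[str, dict]) -> str:
    """The model that won the most individual cases."""
    wins = Counter(entry["winner"] for entry in per_case.values())
    return wins.most_common(1)[0][0]


example: Scores = {
    "llm":   {"c1": {"A": 4.5, "B": 3.0}, "c2": {"A": 4.0, "B": 4.2}, "c3": {"A": 4.8, "B": 4.1}},
    "crew":  {"c1": {"A": 4.0, "B": 3.5}, "c2": {"A": 4.4, "B": 4.1}, "c3": {"A": 4.6, "B": 4.0}},
    "human": {"c1": {"A": 5.0, "B": 3.0}},  # humans only review a sample
}
per_case = winner_per_case(example)
needs_review = [case for case, entry in per_case.items() if not entry["consensus"]]
```

Cases that land in `needs_review` are the ones where the judges diverge and a deeper look is warranted.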

The Testing Process in Detail

Systematic evaluation requires systematic tests:

Test case creation: Cases from all areas of law (employment law, tenancy law, family law, etc.)
Identical conditions: Exactly the same prompts for each model
Multiple runs: Consistency check per scenario
Parallel evaluation: The two AI judges evaluate simultaneously
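The steps above can be sketched as a small test harness. This is a simplified illustration: models and judges are modeled as plain Python callables (in practice they would wrap API calls), `run_evaluation` is a hypothetical name, and three runs per scenario is an arbitrary default.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Model = Callable[[str], str]         # prompt -> analysis text
Judge = Callable[[str, str], float]  # (prompt, analysis) -> score


def run_evaluation(prompts: dict[str, str],
                   models: dict[str, Model],
                   judges: dict[str, Judge],
                   runs: int = 3) -> list[dict]:
    """Run every model on every prompt several times and score each output."""
    records = []
    with ThreadPoolExecutor(max_workers=max(1, len(judges))) as pool:
        for case_id, prompt in prompts.items():
            for model_name, model in models.items():
                for run in range(runs):
                    # Identical conditions: every model gets exactly the same prompt.
                    output = model(prompt)
                    # Parallel evaluation: all judges score the same output at once.
                    futures = {name: pool.submit(judge, prompt, output)
                               for name, judge in judges.items()}
                    records.append({
                        "case": case_id,
                        "model": model_name,
                        "run": run,
                        "scores": {name: f.result() for name, f in futures.items()},
                    })
    return records


# Stub models and judges, only to show the call shape.
demo_records = run_evaluation(
    prompts={"termination": "I received a termination notice. Is this lawful?"},
    models={"A": lambda p: "long analysis " * 3, "B": lambda p: "short"},
    judges={"llm": lambda p, out: float(len(out) > 10),
            "crew": lambda p, out: float(len(out) > 20)},
    runs=2,
)
```

Threads are sufficient here because judge calls would mostly wait on network I/O.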

(Simplified illustration!)

Example test case: Termination in employment law

"I have been working at a company with 50 employees for 8 years. Yesterday I received a termination notice with a 4-week notice period effective at the end of the month. Is this lawful? What can I do?"

A correct analysis must identify:

Notice period too short (Section 622(2) BGB: 3 months to the end of a calendar month after 8 years of service)
Dismissal protection applies (>10 employees)
3-week deadline for filing an unfair dismissal claim is critical

Judges evaluate: All points identified? Clearly formulated? Concrete action steps?
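A test case like this can be stored as data together with the points a correct analysis must identify. In the sketch below, `LegalTestCase`, the finding identifiers and the `coverage` helper are hypothetical; deciding which findings an analysis actually contains remains the judge's job, and the helper only does the bookkeeping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LegalTestCase:
    case_id: str
    area: str
    prompt: str
    required_findings: tuple[str, ...]


TERMINATION_CASE = LegalTestCase(
    case_id="employment-termination-01",
    area="employment law",
    prompt=("I have been working at a company with 50 employees for 8 years. "
            "Yesterday I received a termination notice with a 4-week notice "
            "period effective at the end of the month. Is this lawful? "
            "What can I do?"),
    required_findings=(
        "notice_period_too_short",  # Section 622(2) BGB: 3 months after 8 years
        "dismissal_protection",     # applies with more than 10 employees
        "three_week_deadline",      # deadline for filing an unfair dismissal claim
    ),
)


def coverage(case: LegalTestCase, found: set[str]) -> tuple[float, list[str]]:
    """Return the share of required findings that were identified, plus the gaps."""
    missing = [f for f in case.required_findings if f not in found]
    hit = len(case.required_findings) - len(missing)
    return hit / len(case.required_findings), missing


# Example: the judge marked two of the three required points as present.
share, missing = coverage(TERMINATION_CASE,
                          {"notice_period_too_short", "three_week_deadline"})
```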

Results and Insights

"The AI judges rated Model A as the winner by majority. The human samples confirmed this result. A strong signal for the reliability of our automated evaluation."

The evaluation delivered clear results: Both LLM-as-a-Judge and Crew-as-a-Judge consistently identified the same model as the winner. The human validation via sampling reached the same conclusion.

Key insights from the evaluation:

Custom approaches pay off: Our specialized methods outperformed generic solutions (ChatGPT, Gemini)
Trade-offs exist: Some approaches are faster, others more precise. We optimized for quality
Consistency matters: Approaches with fluctuating quality were eliminated
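One way to make "fluctuating quality" measurable is the standard deviation of an approach's scores across repeated runs. The sketch below is an assumption about how such a filter could look: `consistency_report` is a hypothetical name and the threshold of 0.5 points is purely illustrative.

```python
from statistics import mean, pstdev


def consistency_report(run_scores: dict[str, list[float]],
                       max_spread: float = 0.5) -> dict[str, dict]:
    """Flag approaches whose scores vary too much across repeated runs.

    run_scores maps an approach name to its scores from repeated runs of the
    same scenario. max_spread is an illustrative threshold, not a tuned value.
    """
    report = {}
    for name, scores in run_scores.items():
        spread = pstdev(scores)  # population standard deviation of the runs
        report[name] = {
            "mean": mean(scores),
            "spread": spread,
            "keep": spread <= max_spread,
        }
    return report


report = consistency_report({
    "steady":  [4.0, 4.2, 4.1],
    "erratic": [5.0, 2.0, 4.5],
})
```

In this example the erratic approach is flagged even though its best run beats every run of the steady one.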

What We Learned

Iteration is everything

Our initial evaluation criteria were too vague. A question like "Is the analysis good?" does not work. Each iteration made the evaluation more precise.

AI evaluation saves time but does not replace humans

LLM-as-a-Judge evaluates hundreds of test cases efficiently. For final decisions and edge cases, human expertise remains essential.

Evaluation is not a one-time project

New model versions are released regularly. Continuous re-evaluations ensure the best quality over the long term.

Erst Recht uses AI to make legal advice accessible. See the quality of our AI analysis for yourself.


Planning your own LLM evaluation project? Get in touch. I am happy to help with the concept and implementation.
