Docling · RAG · Legal AI · December 30, 2025

From PDF Chaos to Precise Legal AI: How We Structured 1000+ Documents with Docling

How we transformed legal study materials with Docling into a searchable knowledge base for our AI-powered legal advisory platform.

At Erst Recht, we are building an AI-powered legal advisory platform for German law. One of our biggest challenges? Turning thousands of legal PDF documents into a searchable knowledge base that our AI agents can actually use.

The solution: Docling, an open-source document processing toolkit from IBM Research.

The Challenge

We had access to comprehensive legal study materials covering everything from labor law to inheritance law. These PDFs contained:

  • Complex legal content with various document types
  • Different structures depending on the area of law
  • 22 different areas of law, each with its own requirements

The problem: Anyone who has ever tried to extract text from a PDF knows the result. Headings end up in the middle of body text, tables turn into garbled characters, multi-column layouts are read line by line instead of column by column. You cannot simply dump this raw text into a vector database and expect good results. Structure is critical, and legal documents rely on connections that are lost with naive text extraction.

Why Not Just Use OCR?

Our first instinct was to use traditional OCR (Optical Character Recognition). But we quickly realized that OCR solves the wrong problem:

Traditional OCR                | Docling
-------------------------------|--------------------------------------
Converts images → text         | Converts documents → structured data
Loses document structure       | Preserves headings, lists, hierarchy
Treats every page as flat text | Understands reading order
Tables become garbled text     | Tables remain as tables

Docling is not "just better OCR" but a document understanding pipeline. It outputs clean Markdown with the logical structure of the document.

Our Pipeline

Here is how we process a legal PDF:

[Figure: five-stage pipeline diagram. PDF (legal documents) → Docling (IBM Research, conversion) → Markdown (structured output) → Processing (domain logic) → Vector DB (semantic search)]

Step 1: Classification

Before processing, we classify each document by type and area of law.
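
To make this step concrete, here is a minimal sketch of keyword-based classification. It is an illustration only, not the classifier used in production: the keyword lists, the label names, and the `classify` function are all hypothetical, and a real system would cover all 22 areas of law and likely use a more robust signal than substring counts.

```python
from dataclasses import dataclass

# Hypothetical keyword lists, for illustration only.
AREA_KEYWORDS = {
    "labor_law": ("arbeitsrecht", "kündigung", "arbeitsvertrag"),
    "inheritance_law": ("erbrecht", "testament", "erbfolge"),
}
TYPE_KEYWORDS = {
    "case_study": ("fallbeispiel", "lösungsskizze"),
    "outline": ("skript", "übersicht"),
}


@dataclass(frozen=True)
class DocumentLabel:
    area_of_law: str
    doc_type: str


def _best_match(text: str, table: dict[str, tuple[str, ...]]) -> str:
    """Return the label whose keywords occur most often, or 'unknown'."""
    lowered = text.lower()
    scores = {
        label: sum(lowered.count(keyword) for keyword in keywords)
        for label, keywords in table.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def classify(filename: str, first_page_text: str) -> DocumentLabel:
    """Label a document using its filename and a text sample."""
    sample = f"{filename} {first_page_text}"
    return DocumentLabel(
        area_of_law=_best_match(sample, AREA_KEYWORDS),
        doc_type=_best_match(sample, TYPE_KEYWORDS),
    )
```

Calling `classify("erbrecht_skript.pdf", "Übersicht zur gesetzlichen Erbfolge")` would label the file as an inheritance-law outline, while a file with no matching keywords falls back to "unknown" for both fields so it can be routed to manual review.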

Step 2: Docling Conversion

Docling converts PDFs into structured Markdown. For our digital PDFs (which already have text layers), we disable OCR for speed:

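from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Digital PDFs already carry a text layer, so OCR is switched off.
pdf_options = PdfFormatOption(pipeline_options=PdfPipelineOptions(do_ocr=False))
converter = DocumentConverter(format_options={InputFormat.PDF: pdf_options})
result = converter.convert("document.pdf")
markdown = result.document.export_to_markdown()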

The output preserves headings, lists, and structure:

## Anspruchsgrundlagen

### Vertragliche Ansprüche
- Erfüllungsanspruch
- Schadensersatz statt der Leistung

### Gesetzliche Ansprüche
- Deliktsrecht
- Bereicherungsrecht

Step 3: Intelligent Processing

We developed domain-specific processing logic that understands legal document structures. The exact methodology is our secret sauce, but the result is clear: the AI does not just receive isolated text fragments but understands connections.

The key point: Context makes the difference between "I found something" and "I understood the answer."

Step 4: Embeddings & Vector Storage

We generate embeddings with an embedding model and store them in a vector database together with metadata such as area of law and document type. This metadata enables filtered search:

  • Search only within specific areas of law
  • Filter by document type and relevance
  • Intelligently retrieve related content
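
The filtering idea can be sketched without any particular vector database. The example below is a simplified in-memory stand-in, assuming each chunk carries `area_of_law` and `doc_type` metadata; the `Chunk` class, the `filtered_search` function, and the toy two-dimensional embeddings are hypothetical. A real vector database would apply the same filter-then-rank logic through its own query API.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    area_of_law: str
    doc_type: str
    embedding: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(query_embedding, chunks, area_of_law=None, doc_type=None, top_k=3):
    """Apply metadata filters first, then rank the survivors by similarity."""
    candidates = [
        chunk
        for chunk in chunks
        if (area_of_law is None or chunk.area_of_law == area_of_law)
        and (doc_type is None or chunk.doc_type == doc_type)
    ]
    ranked = sorted(
        candidates,
        key=lambda chunk: cosine(query_embedding, chunk.embedding),
        reverse=True,
    )
    return ranked[:top_k]


# Toy data: real embeddings would come from an embedding model.
chunks = [
    Chunk("Kündigungsschutz", "labor_law", "outline", [1.0, 0.0]),
    Chunk("Testament", "inheritance_law", "outline", [0.9, 0.1]),
    Chunk("Abmahnung", "labor_law", "case_study", [0.0, 1.0]),
]
labor_hits = filtered_search([1.0, 0.0], chunks, area_of_law="labor_law")
```

Filtering before ranking keeps a near-identical passage from another area of law (here, the inheritance-law chunk) out of the results, even though its embedding is very close to the query.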

The Results

We processed 22 areas of law and over 1,000 PDF documents. But the numbers are not the decisive factor. The difference in response quality is.

Before: The AI delivered superficial results. It found relevant text passages, but without the necessary context. For complex legal questions, it drew incorrect conclusions because isolated information is not enough.

After: Through intelligent document processing, the AI now understands connections. The results are precise and reliable. This moved the product from "interesting prototype" to "production-ready."

Specifically, the system can now:

  1. Find relevant precedents even for complex questions
  2. Deliver complete answers with the necessary context for informed decisions
  3. Search by area of law for maximum relevance

Key Learnings

"Docling makes it possible to preserve the full structure of documents. This fundamentally improved our RAG quality."

1. Structure > Raw Text

Preserving document structure made our RAG system significantly more accurate. Headings become natural chunk boundaries. Lists remain as lists.
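
To illustrate how headings can serve as chunk boundaries, here is a minimal sketch that splits Docling-style Markdown at each heading and attaches the full heading path to every chunk. It is not the domain-specific processing described in Step 3; `chunk_by_headings` and `MarkdownChunk` are hypothetical names, and the sketch ignores edge cases such as `#` lines inside code blocks.

```python
import re
from dataclasses import dataclass

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


@dataclass
class MarkdownChunk:
    heading_path: tuple[str, ...]  # e.g. ("Anspruchsgrundlagen", "Vertragliche Ansprüche")
    text: str


def chunk_by_headings(markdown: str) -> list[MarkdownChunk]:
    """Split Markdown at headings, keeping each chunk's heading hierarchy."""
    chunks: list[MarkdownChunk] = []
    path: list[tuple[int, str]] = []  # stack of (level, title)
    buffer: list[str] = []

    def flush() -> None:
        body = "\n".join(buffer).strip()
        if body:
            chunks.append(MarkdownChunk(tuple(title for _, title in path), body))
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the chunk that belongs to the previous heading
            level = len(match.group(1))
            # Pop headings at the same or a deeper level before descending.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            buffer.append(line)
    flush()
    return chunks


sample = (
    "## Anspruchsgrundlagen\n\n"
    "### Vertragliche Ansprüche\n- Erfüllungsanspruch\n- Schadensersatz statt der Leistung\n\n"
    "### Gesetzliche Ansprüche\n- Deliktsrecht\n- Bereicherungsrecht\n"
)
sample_chunks = chunk_by_headings(sample)
```

Storing the heading path with each chunk means a retrieved list item such as "Deliktsrecht" still carries the information that it sits under "Gesetzliche Ansprüche" within "Anspruchsgrundlagen".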

2. Context Beats Keyword Matching

A match alone is of little use. Only context makes it valuable. Our processing ensures that the AI does not just find a relevant passage but also receives the surrounding context it needs to interpret it.

3. Skip OCR When Possible

Our PDFs had text layers. Disabling OCR made processing 10x faster with no loss in quality.

4. Domain-Specific Post-Processing

Docling produces clean Markdown. What comes next depends on the domain. For legal documents, we developed post-processing tailored to the specifics of legal argumentation. In other domains such as technical documentation, compliance, or contract management, different approaches would be relevant.

What Comes Next

We are continuously improving our pipeline. Our focus: advanced chunking strategies and support for additional document types such as court decisions and statutory texts.


Erst Recht uses AI to make legal advice accessible. Our multi-agent system analyzes your legal situation and provides actionable recommendations.


Tools Used:

Docling

Working on a similar problem? Preparing large document collections for RAG? Get in touch. I am happy to help with architecture or implementation.