Legal AI Knowledge Base with Docling

How we used Docling to transform legal study materials into a searchable knowledge base for our AI-powered legal advisory platform.

At Erst Recht, we are building an AI-powered legal advisory platform for German law. One of our biggest challenges? Turning thousands of legal PDF documents into a searchable knowledge base that our AI agents can actually use.

The solution: Docling, an open-source document processing toolkit from IBM Research.

The Challenge

We had access to comprehensive legal study materials covering everything from labor law to inheritance law. These PDFs contained:

Complex legal content with various document types
Different structures depending on the area of law
22 different areas of law, each with its own requirements

The problem: anyone who has ever tried to extract text from a PDF knows the result. Headings end up in the middle of body text, tables turn into garbled characters, and multi-column layouts are read straight across the page instead of column by column. You cannot simply dump this raw text into a vector database and expect good results. Structure is critical, and legal documents rely on connections that naive text extraction destroys.

Why Not Just Use OCR?

Our first instinct was to use traditional OCR (Optical Character Recognition). But we quickly realized that OCR solves the wrong problem:

Traditional OCR	Docling
Converts images → text	Converts documents → structured data
Loses document structure	Preserves headings, lists, hierarchy
Treats every page as flat text	Understands reading order
Tables become garbled text	Tables remain as tables

Docling is not "just better OCR" but a document understanding pipeline. It outputs clean Markdown with the logical structure of the document.

Our Pipeline

Here is how we process a legal PDF:

Step 1: Classification

Before processing, we classify each document by type and area of law.
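
How the classification works is not spelled out here, so the following is only a minimal sketch of what such a step could look like, assuming simple keyword counting on the file name and first-page text. The label names, the keyword lists, and the `classify_document` helper are illustrative assumptions, not our production classifier.

```python
from dataclasses import dataclass

# Hypothetical keyword lists; a real setup would cover all areas of law.
AREA_KEYWORDS = {
    "labor_law": ["arbeitsrecht", "kündigung", "arbeitsvertrag"],
    "inheritance_law": ["erbrecht", "testament", "erbfolge"],
}
TYPE_KEYWORDS = {
    "case_study": ["klausurfall", "lösungsskizze"],
    "outline": ["schema", "übersicht"],
}


@dataclass
class DocumentLabel:
    area_of_law: str
    document_type: str


def _best_match(text: str, keyword_map: dict[str, list[str]]) -> str:
    """Return the label whose keywords occur most often, or 'unknown'."""
    lowered = text.lower()
    scores = {
        label: sum(lowered.count(word) for word in words)
        for label, words in keyword_map.items()
    }
    label, score = max(scores.items(), key=lambda item: item[1])
    return label if score > 0 else "unknown"


def classify_document(filename: str, first_page_text: str) -> DocumentLabel:
    """Classify by counting keyword hits in the file name and first page."""
    text = f"{filename} {first_page_text}"
    return DocumentLabel(
        area_of_law=_best_match(text, AREA_KEYWORDS),
        document_type=_best_match(text, TYPE_KEYWORDS),
    )
```

The labels produced here can later be stored as metadata next to each chunk, which is what makes filtering by area of law and document type possible at search time.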

Step 2: Docling Conversion

Docling converts PDFs into structured Markdown. For our digital PDFs (which already have text layers), we disable OCR for speed:

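from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Digital PDFs already carry a text layer, so OCR is switched off.
pdf_options = PdfFormatOption(pipeline_options=PdfPipelineOptions(do_ocr=False))
converter = DocumentConverter(format_options={InputFormat.PDF: pdf_options})
result = converter.convert("document.pdf")
markdown = result.document.export_to_markdown()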

The output preserves headings, lists, and structure:

## Anspruchsgrundlagen

### Vertragliche Ansprüche
- Erfüllungsanspruch
- Schadensersatz statt der Leistung

### Gesetzliche Ansprüche
- Deliktsrecht
- Bereicherungsrecht

Step 3: Intelligent Processing

We developed domain-specific processing logic that understands legal document structures. The exact methodology is our secret sauce, but the result is clear: the AI does not just receive isolated text fragments but understands connections.

The key point: Context makes the difference between "I found something" and "I understood the answer."

Step 4: Embeddings & Vector Storage

We generate embeddings with an LLM embedding model and store everything in a vector database, which enables filtered search:

Search only within specific areas of law
Filter by document type and relevance
Intelligently retrieve related content
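
To illustrate the idea of filtered search, here is a minimal, self-contained sketch. It assumes a toy bag-of-words vector in place of a real embedding model and a plain Python list in place of a vector database. The `Chunk` fields and the `embed`, `cosine`, and `search` helpers are hypothetical names chosen for this example.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    area_of_law: str
    document_type: str


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(chunks, query, area_of_law=None, document_type=None, top_k=3):
    """Filter on metadata first, then rank the remaining chunks by similarity."""
    candidates = [
        c for c in chunks
        if (area_of_law is None or c.area_of_law == area_of_law)
        and (document_type is None or c.document_type == document_type)
    ]
    query_vec = embed(query)
    ranked = sorted(
        candidates,
        key=lambda c: cosine(query_vec, embed(c.text)),
        reverse=True,
    )
    return ranked[:top_k]
```

Calling `search(chunks, "Kündigung durch den Vermieter", area_of_law="tenancy_law")` would only ever rank chunks tagged with that area. Most vector databases offer this kind of metadata filter natively, so in practice the filter would be passed to the database query rather than applied in Python.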

The Results

We processed 22 areas of law and over 1,000 PDF documents. But the numbers are not the decisive factor. The difference in response quality is.

Before: The AI delivered superficial results. It found relevant text passages, but without the necessary context. For complex legal questions, it drew incorrect conclusions because isolated information is not enough.

After: Through intelligent document processing, the AI now understands connections. The results are precise and reliable. This moved the product from "interesting prototype" to "production-ready."

Specifically, the system can now:

Find relevant precedents even for complex questions
Deliver complete answers with the necessary context for informed decisions
Search by area of law for maximum relevance

Key Learnings

"Docling makes it possible to preserve the full structure of documents. This fundamentally improved our RAG quality."

1. Structure > Raw Text

Preserving document structure made our RAG system significantly more accurate. Headings become natural chunk boundaries. Lists remain as lists.

2. Context Beats Keyword Matching

A match alone is useless. Only context makes it valuable. Our processing ensures that the AI does not just find a passage but also understands where it belongs.

3. Skip OCR When Possible

Our PDFs had text layers. Disabling OCR made processing 10x faster with no loss in quality.

4. Domain-Specific Post-Processing

Docling produces clean Markdown. What comes next depends on the domain. For legal documents, we developed post-processing tailored to the specifics of legal argumentation. In other domains such as technical documentation, compliance, or contract management, different approaches would be relevant.

What Comes Next

We are continuously improving our pipeline. Our focus: advanced chunking strategies and support for additional document types such as court decisions and statutory texts.

Erst Recht uses AI to make legal advice accessible. Our multi-agent system analyzes your legal situation and provides actionable recommendations.

Tools Used:

Docling

Working on a similar problem? Preparing large document collections for RAG? Get in touch. I am happy to help with architecture or implementation.
