At Erst Recht, we are building an AI-powered legal advisory platform for German law. One of our biggest challenges: turning more than a thousand legal PDF documents into a searchable knowledge base that our AI agents can actually use.
The solution: Docling, an open-source document processing toolkit from IBM Research.
The Challenge
We had access to comprehensive legal study materials covering everything from labor law to inheritance law. These PDFs contained:
- Complex legal content with various document types
- Different structures depending on the area of law
- 22 different areas of law, each with its own requirements
The problem: Anyone who has ever tried to extract text from a PDF knows the result. Headings end up in the middle of body text, tables turn into garbled characters, multi-column layouts are read line by line instead of column by column. You cannot simply dump this raw text into a vector database and expect good results. Structure is critical, and legal documents rely on connections that are lost with naive text extraction.
Why Not Just Use OCR?
Our first instinct was to use traditional OCR (Optical Character Recognition). But we quickly realized that OCR solves the wrong problem:
| Traditional OCR | Docling |
|---|---|
| Converts images → text | Converts documents → structured data |
| Loses document structure | Preserves headings, lists, hierarchy |
| Treats every page as flat text | Understands reading order |
| Tables become garbled text | Tables remain as tables |
Docling is not "just better OCR" but a document understanding pipeline. It outputs clean Markdown with the logical structure of the document.
Our Pipeline
Here is how we process a legal PDF:
Step 1: Classification
Before processing, we classify each document by type and area of law.
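The classification method itself is not described here. As a minimal sketch, assuming simple keyword rules are enough for a first pass, it could look like the following. The names `AREA_KEYWORDS` and `classify_area` and the keyword lists are hypothetical and only illustrate the idea.

```python
# Hypothetical keyword lists; a real system would need many more terms per
# area of law, or a trained classifier instead of rules.
AREA_KEYWORDS = {
    "labor_law": ["arbeitsvertrag", "kündigung", "arbeitnehmer"],
    "inheritance_law": ["erbe", "testament", "nachlass"],
}


def classify_area(text: str, default: str = "unknown") -> str:
    """Return the area of law whose keywords occur most often in the text."""
    lowered = text.lower()
    scores = {
        area: sum(lowered.count(keyword) for keyword in keywords)
        for area, keywords in AREA_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # Fall back to the default label when no keyword matched at all
    return best if scores[best] > 0 else default
```

The resulting label can travel with the document as metadata, so later steps can filter on it.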
Step 2: Docling Conversion
Docling converts PDFs into structured Markdown. For our digital PDFs (which already have text layers), we disable OCR for speed:
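```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
# Skip OCR: these PDFs already carry a text layer
pdf_options = PdfFormatOption(pipeline_options=PdfPipelineOptions(do_ocr=False))
converter = DocumentConverter(format_options={InputFormat.PDF: pdf_options})
result = converter.convert("document.pdf")
markdown = result.document.export_to_markdown()
```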
The output preserves headings, lists, and structure:
```markdown
## Anspruchsgrundlagen
### Vertragliche Ansprüche
- Erfüllungsanspruch
- Schadensersatz statt der Leistung
### Gesetzliche Ansprüche
- Deliktsrecht
- Bereicherungsrecht
```
Step 3: Intelligent Processing
We developed domain-specific processing logic that understands legal document structures. The exact methodology is our secret sauce, but the result is clear: the AI does not just receive isolated text fragments but understands connections.
The key point: Context makes the difference between "I found something" and "I understood the answer."
Step 4: Embeddings & Vector Storage
We generate embeddings with an embedding model and store them in a vector database together with metadata such as area of law and document type. The metadata enables filtered search:
- Search only within specific areas of law
- Filter by document type and relevance
- Intelligently retrieve related content
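To illustrate filtered search without tying it to a specific vector database, here is a small in-memory sketch. The `Chunk` fields, the `filtered_search` function, and the two-dimensional toy embeddings used to try it out are assumptions for illustration. A production system would delegate both the filter and the similarity ranking to the database.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    embedding: list[float]
    area: str      # e.g. "labor_law"
    doc_type: str  # e.g. "summary"


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(query_embedding, chunks, area=None, doc_type=None, top_k=3):
    """Filter by metadata first, then rank the remaining chunks by similarity."""
    candidates = [
        c for c in chunks
        if (area is None or c.area == area)
        and (doc_type is None or c.doc_type == doc_type)
    ]
    candidates.sort(key=lambda c: cosine(query_embedding, c.embedding), reverse=True)
    return candidates[:top_k]
```

Filtering before ranking keeps chunks from unrelated areas of law out of the results, however similar their embeddings happen to be.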
The Results
We processed 22 areas of law and over 1,000 PDF documents. But the numbers are not the decisive factor. The difference in response quality is.
Before: The AI delivered superficial results. It found relevant text passages, but without the necessary context. For complex legal questions, it drew incorrect conclusions because isolated information is not enough.
After: Through intelligent document processing, the AI now understands connections. The results are precise and reliable. This moved the product from "interesting prototype" to "production-ready."
Specifically, the system can now:
- Find relevant precedents even for complex questions
- Deliver complete answers with the necessary context for informed decisions
- Search by area of law for maximum relevance
Key Learnings
"Docling makes it possible to preserve the full structure of documents. This fundamentally improved our RAG quality."
1. Structure > Raw Text
Preserving document structure made our RAG system significantly more accurate. Headings become natural chunk boundaries. Lists remain as lists.
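A minimal sketch of heading-based chunking, assuming the Markdown uses `#`-style headings like the Docling output shown earlier. The function name `chunk_by_headings` and the `path`/`text` keys are illustrative choices, not part of Docling.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_headings(markdown: str) -> list[dict]:
    """Split Markdown at headings; each chunk keeps its full heading path."""
    chunks, path, body = [], [], []

    def flush():
        # Emit the text collected so far under the current heading path
        text = "\n".join(body).strip()
        if text:
            chunks.append({"path": " > ".join(title for _, title in path), "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level before adding the new one
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks
```

Applied to the sample output above, this yields two chunks, with the paths `Anspruchsgrundlagen > Vertragliche Ansprüche` and `Anspruchsgrundlagen > Gesetzliche Ansprüche`. Each list stays intact and carries its parent headings. The sketch does not handle `#` lines inside fenced code blocks, which it would treat as headings.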
2. Context Beats Keyword Matching
A match alone is useless. Only context makes it valuable. Our processing ensures that the AI does not just find, but also understands.
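One generic way to add context to a match is to return the matched chunk together with its neighbors in document order. This is a common retrieval pattern, not our actual processing logic. The function name `with_neighbors` and the `window` parameter are hypothetical.

```python
def with_neighbors(chunks: list[str], hit_index: int, window: int = 1) -> str:
    """Return the matched chunk together with `window` chunks on each side."""
    start = max(0, hit_index - window)
    end = min(len(chunks), hit_index + window + 1)
    # Join in document order so the model reads the passage as it was written
    return "\n\n".join(chunks[start:end])
```

A larger `window` gives the model more surrounding text at the cost of a longer prompt.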
3. Skip OCR When Possible
Our PDFs had text layers. Disabling OCR made processing 10x faster with no loss in quality.
4. Domain-Specific Post-Processing
Docling produces clean Markdown. What comes next depends on the domain. For legal documents, we developed post-processing tailored to the specifics of legal argumentation. In other domains such as technical documentation, compliance, or contract management, different approaches would be relevant.
What Comes Next
We are continuously working on improving our pipeline. Our focus: advanced chunking strategies and opening up additional document types such as court decisions and statutory texts.
Erst Recht uses AI to make legal advice accessible. Our multi-agent system analyzes your legal situation and provides actionable recommendations.
Working on a similar problem? Preparing large document collections for RAG? Get in touch. I am happy to help with architecture or implementation.