Standardizing Clinical Documents: The New Core Dataset Module “Document”
Advances in natural language processing (NLP) and large language models (LLMs) are making the computer-assisted analysis of clinical texts, such as discharge letters or surgical reports, increasingly important for medical research. To make these unstructured data usable across institutions, the Core Dataset (CDS) module “Document” was developed within the Medical Informatics Initiative (MII). The module formally represents the link between the actual text content and its descriptive metadata.
Technical Foundations and Compatibility
Technically, the module is based on the international standard HL7 FHIR (Fast Healthcare Interoperability Resources), specifically the DocumentReference resource. The modeling aimed for a high degree of compatibility with established German standards, particularly the models of the National Association of Statutory Health Insurance Physicians (KBV) and the ISiK specifications of Gematik. While those specifications primarily support healthcare provision and data exchange with hospital information systems, the CDS module “Document” focuses specifically on secondary use in research.
Structure and Special Features of the Module
For the standardized classification of documents, the module recommends the Clinical Document Classes List (KDL). Other CDS modules, such as “Person” and “Case,” supply the document’s medical context. One notable feature is the integrated NLP status extension, which records a document’s processing status, for example whether it has already been de-identified or annotated.
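To make this structure more tangible, the following Python sketch assembles a minimal FHIR R4-style DocumentReference as a plain dictionary. It is an illustration under stated assumptions, not the module’s actual profile: the KDL code system URL is assumed, the code “XX000000” is a placeholder rather than a real KDL code, and the extension URL, the status value “deidentified,” and the function name build_document_reference are hypothetical. The binding names and values are defined in the published Implementation Guide.

```python
import base64
import json

# Assumed canonical URL of the KDL code system; verify against the Implementation Guide.
KDL_SYSTEM = "http://dvmd.de/fhir/CodeSystem/kdl"
# Hypothetical extension URL used for illustration only, not the module's real one.
NLP_STATUS_URL = "https://example.org/fhir/StructureDefinition/nlp-status"


def build_document_reference(patient_id, encounter_id, kdl_code, nlp_status, text):
    """Return a minimal DocumentReference linking text content to its metadata."""
    return {
        "resourceType": "DocumentReference",
        "status": "current",
        # Processing state of the text (hypothetical extension and value set).
        "extension": [{"url": NLP_STATUS_URL, "valueCode": nlp_status}],
        # Document class, here coded with a placeholder KDL code.
        "type": {"coding": [{"system": KDL_SYSTEM, "code": kdl_code}]},
        # Medical context, as supplied by the "Person" and "Case" modules.
        "subject": {"reference": f"Patient/{patient_id}"},
        "context": {"encounter": [{"reference": f"Encounter/{encounter_id}"}]},
        # The text itself, base64-encoded as FHIR requires for inline attachments.
        "content": [
            {
                "attachment": {
                    "contentType": "text/plain",
                    "data": base64.b64encode(text.encode("utf-8")).decode("ascii"),
                }
            }
        ],
    }


if __name__ == "__main__":
    doc = build_document_reference(
        "p1", "e1", "XX000000", "deidentified", "Example discharge letter text."
    )
    print(json.dumps(doc, indent=2))
```

The sketch keeps the text inline for brevity; an attachment can instead point to the content by URL.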
Development and Governance
The development process was driven by an interdisciplinary team of experts and coordinated by the Core Dataset Taskforce and the MII Interoperability Working Group. For the technical implementation, the profiles were formally defined with tools such as FHIR Shorthand (FSH) and published for users as an Implementation Guide on the “Simplifier” platform.
Cross-Project Collaboration
Requirements arising from the GeMTeX (German Medical Text Corpus) project significantly advanced the module. GeMTeX aims to establish a nationwide corpus of clinical routine texts in Germany. Stakeholders from different contexts collaborated on this effort, with support from the MiHUBx project (Medical Informatics Hub in Saxony, known as “MiHUB” since January 2026) to strengthen interoperability between sites. Synergies with projects from the Network of University Medicine (NUM) are also contributing to the long-term harmonization of data structures.
Importance for the Research Infrastructure
The module is of central importance for the Data Integration Centers and for research, as it forms the foundation for providing text data via the German Portal for Medical Research Data (FDPG). Researchers can thus run targeted queries for patient cohorts for which specific document types are available in a defined processing state, and then apply for data export.
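The idea behind such a cohort query can be sketched in a few lines of Python. This is a simplified in-memory illustration, not how the FDPG executes queries: the function name find_cohort, the extension URL, the status values, and the document type codes “XX000000” and “YY000000” are all hypothetical placeholders.

```python
# Hypothetical extension URL for the NLP status, used for illustration only.
NLP_STATUS_URL = "https://example.org/fhir/StructureDefinition/nlp-status"


def make_doc(patient_id, type_code, nlp_status):
    """Build a stripped-down DocumentReference-like dict for the example."""
    return {
        "resourceType": "DocumentReference",
        "extension": [{"url": NLP_STATUS_URL, "valueCode": nlp_status}],
        "type": {"coding": [{"code": type_code}]},
        "subject": {"reference": f"Patient/{patient_id}"},
    }


def find_cohort(documents, type_code, nlp_status):
    """Return sorted patient references with at least one document of the
    given type in the given processing state."""
    patients = set()
    for doc in documents:
        codes = {c.get("code") for c in doc.get("type", {}).get("coding", [])}
        statuses = {
            ext.get("valueCode")
            for ext in doc.get("extension", [])
            if ext.get("url") == NLP_STATUS_URL
        }
        patient = doc.get("subject", {}).get("reference")
        if patient and type_code in codes and nlp_status in statuses:
            patients.add(patient)
    return sorted(patients)


# Placeholder data: codes and status values are invented for this sketch.
SAMPLE_DOCUMENTS = [
    make_doc("p1", "XX000000", "deidentified"),
    make_doc("p1", "XX000000", "annotated"),
    make_doc("p2", "XX000000", "raw"),
    make_doc("p3", "YY000000", "deidentified"),
]

if __name__ == "__main__":
    print(find_cohort(SAMPLE_DOCUMENTS, "XX000000", "deidentified"))
```

In this toy data set, only the first patient has a document of the requested type in the requested processing state, so the cohort contains that patient alone.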
On May 28, Jakob Faller from University Hospital Erlangen will present the development of the module at the Medical Informatics Europe Conference in Genoa. His contribution is part of the session “Infrastructures and Regulations” from 8:30 a.m. to 10:00 a.m.
Text: Dr. Frank Meineke | Institute for Medical Informatics, Statistics and Epidemiology, Faculty of Medicine, Leipzig University