Advancing the Use of Clinical Texts for Research and Artificial Intelligence
GeMTeX LLM Workshop and Plenary Meeting, June 22–23, 2026
Unstructured clinical documents such as physicians’ notes and discharge summaries contain a wealth of valuable data for medical research. Making these data available for research and artificial intelligence (AI) applications, securely and in compliance with data protection regulations, is one of the main objectives of the GeMTeX project within the German Medical Informatics Initiative (MII).
On June 22 and 23, 2026, GeMTeX project members met at TUM University Hospital Rechts der Isar for an internal Large Language Model (LLM) Workshop and the project’s plenary meeting.
Preparing for the Final Phase of the GeMTeX Project
The plenary meeting focused on the project’s current progress and the preparation of its final phase. Justin Hofenbitzer (TUM University Hospital Rechts der Isar) presented a major milestone: more than 1,000 clinical documents from six university hospitals have now been semantically annotated.
As part of the annotation process, medical students identify relevant information in clinical documents and enrich it with metadata, making the texts machine-readable. These high-quality datasets provide the foundation for research and the development of new methods in clinical natural language processing.
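As a rough illustration of what enriching text with metadata can look like, the sketch below assumes a simple standoff annotation scheme in which each annotation stores character offsets, a label, and optional attributes. The class and function names, the labels, and the sample sentence are hypothetical and do not reflect the actual GeMTeX annotation schema or tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A labeled character span; the end offset is exclusive."""
    start: int
    end: int
    label: str                                     # hypothetical label, e.g. "Medication"
    metadata: dict = field(default_factory=dict)   # optional attributes


def covered_text(text: str, ann: Annotation) -> str:
    """Return the part of the document that the annotation marks."""
    return text[ann.start:ann.end]


def annotate(text: str, phrase: str, label: str, **metadata) -> Annotation:
    """Create an annotation for the first occurrence of `phrase` in `text`."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in text")
    return Annotation(start, start + len(phrase), label, dict(metadata))


# Invented example sentence, not taken from any clinical document.
note = "Patient was started on metformin 500 mg twice daily."
annotations = [
    annotate(note, "metformin", "Medication"),
    annotate(note, "500 mg", "Dosage", unit="mg", value=500),
]

for ann in annotations:
    print(ann.label, covered_text(note, ann), ann.metadata)
```

Keeping the annotations separate from the text leaves the original document unchanged, so several annotation layers can refer to the same offsets.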
Publication of the GeMTeX Text Corpus Moves Closer
The project has also made significant technical progress. Jakob Faller (University Hospital Erlangen) presented the new Core Dataset (CDS) Module “Document”, which was developed in collaboration with the Digital Hub MiHUB.
The CDS module enables clinical text documents processed within GeMTeX to be integrated from the Data Integration Centers into the German Portal for Medical Research Data (FDPG). Both de-identified and semantically annotated documents can therefore be provided in a standardized format for research and clinical use. Initial results of these developments have already been presented at the international Language Resources and Evaluation Conference (LREC) and the Medical Informatics Europe (MIE) conference.
In addition, the GeMTeX team is preparing a first prototype use case in collaboration with the German National Library of Medicine (ZB MED). Ethical approval has already been obtained, and all six participating university hospitals have granted approval through their respective Use and Access Committees. As a result, the text corpus created within the project will soon be available for scientific research projects upon request and under defined conditions.
LLM Workshop Highlights the Potential of AI Language Models for Medicine
On the preceding day, project members discussed current research and practical applications of Large Language Models (LLMs) for processing clinical documents during an internal GeMTeX LLM Workshop. LLMs are AI-based language models that learn from large volumes of text and can, for example, generate coherent text independently.
The workshop covered a range of topics, including:
- Methods for the de-identification of sensitive information
- Automated recognition of clinical entities
- Comparisons between synthetically generated and authentic medical history dialogues
- Structured processing of clinical practice guidelines
- The use of LLM-supported software in research
The workshop demonstrated that language models have considerable potential to support many tasks in clinical text processing. At the same time, it highlighted the importance of rigorous scientific evaluation and robust data protection frameworks—both of which are key objectives of the GeMTeX project.
The final GeMTeX project meeting will take place on October 20, 2026, at the Albertina Library of Leipzig University.
Recently published:
Jakob Faller, Marcel Susky, Noemi Deppenwiese, Justin Hofenbitzer, Christina Lohr, Thomas Ganslandt, Martin Boeker, Frank Meineke. Standardized Information Model for Clinical Texts: The MII Core Data Set Module Document. Stud Health Technol Inform. 2026 May 21;336:1202-1206. DOI: 10.3233/SHTI260389.
Christina Lohr, Marvin Seiferling, Philipp Wiesenbach, Jakob Faller, Christoph Dieterich. The SURROGATOR Framework for Context-Aware Surrogation of Privacy Sensitive Information in Medical Text. Stud Health Technol Inform. 2026 May 21;336:1405-1409. DOI: 10.3233/SHTI260440.
Justin Hofenbitzer, Christina Lohr, Andrea Riedel, Rebekka Kiser, Aliaksandra Shutsko, Abanoub Abdelmalak, Peter Klügl, Jutta Romberg, Sarah Riepenhausen, Miriam Schechner, Jakob Faller, Frank Meineke, Luise Modersohn, Markus Löffler, Juliane Fluck, Udo Hahn, Stefan Schulz, Martin Boeker. Developing the German Medical Text Corpus (GeMTeX): Legal Compliance and Semantic Enrichment. In Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026) (pp. 1571–1584). European Language Resources Association (ELRA). DOI: 10.63317/4eqiegnqbu96.