About GeMTeX

GeMTeX – German Medical Text Corpus

Automatic indexing of medical texts for research

In everyday clinical practice, large amounts of text are produced, such as doctors’ letters and medical reports, which contain valuable information on the background, course and treatment of diseases for healthcare and research. Natural language processing (NLP) programmes could support the work of doctors and researchers on the basis of these texts. However, because free medical text is not standardised, the potential of this treasure trove of data cannot be fully exploited. The structure and language of clinical notes depend heavily on the individuals who write them. In addition, medical language is very different from everyday and scientific language. Clinical texts are characterised by jargon, brevity and economy of language; they are written under time pressure, are incompletely coded and have little structure.

This is where the GeMTeX methodological use case comes in: six Data Integration Centres from the four medical informatics consortia DIFUTURE, HiGHmed, MIRACUM and SMITH are contributing data and methods to make medical texts from patient care available for research projects. The aim is to create the largest medical text corpus in the German language. The office of the SMITH Consortium coordinates the project. The work in GeMTeX builds on the methodological use case PheP/NLP, which the SMITH Consortium implemented from 1 January 2018 to 31 May 2023.

A doctor works at a computer with an electronic patient record in a modern clinic.

GeMTeX Use Case

Creation of a large database for medical research projects as well as for AI models with the aim of clinical application.

Extensive annotation of this corpus: in addition to basic annotation (e.g. diagnoses, medications), deep domain-specific annotation (e.g. pathology, oncology, neurology, cardiology).

Establishment of technical and organisational standards for the mapping of text and annotations with the expansion of the MII core dataset.

Cross-consortium project of the Medical Informatics Initiative with 17 partners from science, IT and healthcare.

Creation of a large collection of German-language medical texts used in everyday patient care

Computer-based natural language processing can be used to build models through machine learning that automatically make information visible in clinical texts.

The use of natural language processing (NLP) thus provides the necessary basis for making text documents usable for medical research. Progress in clinical NLP will depend crucially on specially trained language models that require realistic clinical documents. To realise the full potential of NLP, it is therefore necessary to have access to large amounts of annotated texts from everyday patient care.

Annotated texts are documents that contain additional information through systematic annotations, such as information on diagnoses or medications. The annotations are manually reviewed by physician trainees and serve as a reference for further improvement of the automatic annotation. Information structured in this way can be used with existing data for analysis and statistical modelling.
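As an illustration of how manually reviewed annotations can serve as a reference for automatic annotation, the following minimal Python sketch stores annotations as labelled character spans and scores an automatic annotation against the reference with exact-match precision, recall and F1. The class name `Annotation`, the labels, the sample sentence and the scoring scheme are assumptions made for this example; they are not taken from the GeMTeX annotation guidelines or tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A labelled character span in a document (hypothetical structure)."""
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    label: str   # illustrative labels, e.g. "Diagnosis" or "Medication"


def score(predicted: set, reference: set) -> dict:
    """Exact-match precision, recall and F1 of predicted vs. reference spans."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented sample sentence ("Diagnosis: pneumonia. Therapy: amoxicillin 1000 mg.")
text = "Diagnose: Pneumonie. Therapie: Amoxicillin 1000 mg."


def span(substring: str, label: str) -> Annotation:
    """Build an annotation from the first occurrence of a substring."""
    start = text.index(substring)
    return Annotation(start, start + len(substring), label)


# Reference: annotations after manual review.
reference = {span("Pneumonie", "Diagnosis"), span("Amoxicillin", "Medication")}
# Predicted: output of a hypothetical automatic annotator with one mistake.
predicted = {span("Pneumonie", "Diagnosis"), span("Therapie", "Medication")}

result = score(predicted, reference)
print(result)
```

Here one of the two predicted spans matches the reference, so precision, recall and F1 all come out at 0.5. Real evaluations often also count partially overlapping spans, which this sketch deliberately leaves out.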

The IT infrastructure built during the development and networking phase of the Medical Informatics Initiative (MII) between 2018 and 2022 offers the possibility of making clinical documents accessible on a large scale and enriching them with systematic annotations. The MII method platform GeMTeX aims to solve the two major bottlenecks of current language models: data accessibility and data annotation.

With the consent of the patients, the GeMTeX project collects documents from the electronic patient files (ePA) of the six university medical centres in Munich, Leipzig, Essen, Berlin, Dresden and Erlangen. Using natural language processing, the documents are edited and made available in anonymised form for shared use. This creates a valuable text repertoire for research and development.
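One common way to remove identifying details from text is to replace detected spans with category placeholders. The Python sketch below shows only that replacement step, assuming the identifying spans have already been found by an upstream NLP component. The function `redact`, the helper `find`, the categories and the sample sentence are hypothetical and do not describe the actual GeMTeX pipeline.

```python
def redact(text: str, spans: list) -> str:
    """Replace each (start, end, category) span with a [CATEGORY] placeholder."""
    # Work from the end of the text backwards so earlier offsets stay valid.
    for start, end, category in sorted(spans, reverse=True):
        text = text[:start] + f"[{category}]" + text[end:]
    return text


# Invented sample sentence with a fictitious name
# ("Ms Erika Mustermann was admitted on 03.02.2021.")
letter = "Frau Erika Mustermann wurde am 03.02.2021 aufgenommen."


def find(substring: str, category: str) -> tuple:
    """Stand-in for a detection model: locate a known substring."""
    start = letter.index(substring)
    return (start, start + len(substring), category)


spans = [find("Erika Mustermann", "NAME"), find("03.02.2021", "DATE")]
redacted = redact(letter, spans)
print(redacted)  # Frau [NAME] wurde am [DATE] aufgenommen.
```

Processing the spans from the end of the text keeps the remaining character offsets valid after each replacement. The sketch assumes the spans do not overlap.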

Central structures enable broad enrichment and use of clinical text documents

In its implementation, GeMTeX will create a central technical and organisational structure to collect anonymised texts and process them according to guidelines. GeMTeX thus covers a wide range of annotation tasks. These will be tested, verified and applied on a large scale to create a unique database. This database can be used to train AI models and then test their usefulness in clinical practice. The enriched text documents and models will be made publicly available via the German National Library of Medicine (ZBMED) and the DFG-funded NFDI4Health project.

The GeMTeX Use Case started on 1 June 2023 and is funded by the German Federal Ministry of Education and Research (BMBF) with around seven million euros until 31 August 2026.

GeMTeX Fact Sheet

Status: 02/2024

Participating consortia of the Medical Informatics Initiative