The largest dataset of German-language medical texts is being created | GeMTeX Kick Off Meeting in Leipzig

On June 20 and 21, about 40 members of the GeMTeX project met to start their work on the methods platform of the Medical Informatics Initiative.

The main goal of GeMTeX is to standardize texts from clinical care, such as discharge letters, so that they can be processed by automatic language processing programs and thus serve as a knowledge base for artificial intelligence applications. The project is expected to produce the largest dataset for automated language processing of German-language clinical texts.

The core of the text corpus consists of the document collections of the university hospitals of the Technical University of Munich, Essen, Charité Berlin, Erlangen, Dresden and Leipzig. Documents can be included in the collection only if the patients concerned have consented to the use of their clinical documents for research purposes under the Medical Informatics Initiative’s broad consent.

“We are an open project and would like to add other German language processing groups to the current project constellation,” said Professor Martin Boeker, joint coordinator of the GeMTeX project, at the opening of the event. Professor Markus Löffler, Director of the Institute of Medical Informatics, Statistics and Epidemiology at Leipzig University, has taken on the role of deputy coordinator.

The focus of the kick-off meeting was on getting to know each other and exchanging ideas. In addition, relevant preliminary work on clinical language processing was presented. For example, Florian Borchert from the Hasso Plattner Institute presented the annotated text corpus “GGPOnc”, which was developed in cooperation with the German Cancer Society as part of the guideline programme for cancer diseases.

Annotation – one of the biggest tasks in GeMTeX

To make medical texts from clinical documentation usable for automatic language processing programs, they must first be annotated, i.e. their content must be marked up. Annotation work is therefore at the heart of the GeMTeX project. To this end, medical students at the six participating university hospitals will read medical documents and mark up passages according to defined specifications, using annotation software.

“Annotation is one of the biggest, if not the biggest, task in GeMTeX,” emphasised Luise Modersohn, research associate in the GeMTeX project. She moderated a session in which representatives from various medical fields, such as cardiovascular diseases or drug safety, explained their annotation requirements.

In addition, Professor Stefan Schulz from the Medical University of Graz presented annotation guidelines being developed in the EU project AIDAVA. Annotation guidelines specify which text passages from clinical documents should be coded, and how, so that the coding is uniform and the documents can later be read by automatic language processing programs.

Cooperation for productive collaboration

Another important component of the GeMTeX project is the software that supports the annotation work. For this purpose, Dr Richard Eckart de Castilho, research assistant at the Technical University of Darmstadt, presented the annotation tool “INCEpTION”. INCEpTION will be used to carry out the annotation work at all sites.

In addition to academia and clinics, industry is also involved in GeMTeX. The industry partners Averbis GmbH and ID Information und Dokumentation im Gesundheitswesen GmbH & Co. KGaA demonstrated how their software solutions can support scientific studies and make annotation more effective.

At the end of the meeting, the participants discussed how the GeMTeX text corpora can be transferred to the German National Library of Medicine (ZB MED – Information Centre for Life Sciences) so that they are available for research and care after the end of the project.

The first GeMTeX Closed Meeting will take place on 20-21 November 2023 in Munich.