
How Medical Texts Are Made Usable for AI Models | 5 Questions for… Marie-Sophie Polifka and Sarina Shams, Annotators in the GeMTeX Project
The use of language models and artificial intelligence (AI) is already commonplace in many specialized fields. Doctors could benefit from this as well, but they require adapted language models for medicine that are based on extensive German data sets. The GeMTeX project aims to create one of the largest corpora of German medical texts available. To this end, unstructured texts from clinical documentation will be made accessible for research and AI applications in compliance with data protection regulations. Annotation by student assistants plays a key role: they mark relevant text passages in medical documents and provide metadata on the content, making the texts machine-readable. In this interview, medical students Marie-Sophie Polifka and Sarina Shams discuss their work as annotators in Leipzig and share how the project will benefit their future careers.
What does your day-to-day work as an annotator for GeMTeX look like?
Marie-Sophie Polifka: We read a large number of different medical texts, such as doctors’ letters and medical reports. First, we make any personal data in them unrecognizable. This means we remove any information that could allow conclusions to be drawn about identities. We refer to this process as de-identification. Only then can the documents be used for further processing. Next, we mark specific medical content, such as symptoms, diagnoses, and examinations, in the texts.
Sarina Shams: We use software to annotate the text and classify the information into categories. In this way, we make the data from the documents machine-readable. This medical information can then be used for research and for developing AI models.
What aspects of this work do you find challenging?
Marie-Sophie Polifka: During processing, we repeatedly come across ambiguities. For instance, certain content cannot be clearly assigned to the intended categories. This is often because special cases like these have not yet been taken into account in the guidelines. In such cases, we consult with the team or other working groups before finalizing the document.
I also personally find it difficult to deal with medical abbreviations that I am not yet familiar with. In those cases, I have to look them up on Google.
Sarina Shams: Sometimes, it’s not easy to determine whether certain information could be identifying, even if it only appears indirectly in the text. Some diseases are linked to specific professions. In some documents, there are sections that describe in detail what symptoms occur in different work-related situations. We can’t simply mark the “profession,” as the profession and symptoms are intertwined in the text. We discuss these cases within the team. Whether the information needs to be de-identified depends on the specific case.
Complex texts are an additional challenge for me, as German is not my native language. I need to read some texts multiple times. Fortunately, my team is there to support me when I have language-related questions.
What new or unexpected insights have you encountered in your work?
Sarina Shams: I realized that medical documentation varies greatly from doctor to doctor and from clinic to clinic. This presents a challenge when we want to use medical documents for automation and AI. GeMTeX taught me how important clear, consistent language is in documentation, and it’s something I’d like to implement in my day-to-day work.
Marie-Sophie Polifka: In almost every group meeting, we encounter new special cases that we must include in our guidelines. I probably wouldn’t have noticed many of them on my own. This has shown me how important the multiple-eye principle is in this type of work.
How will this experience help you in your future career?
Marie-Sophie Polifka: A significant part of a doctor’s work on the ward involves writing the medical history of their patients in a letter so that future practitioners are also informed. During our studies, we hardly learn how to write a proper doctor’s letter; we only learn this during clinical internships outside of university. However, when we start our careers, we are expected to excel at this task. In this respect, gaining insight into a large number of doctors’ letters is very helpful.
Sarina Shams: In the GeMTeX project, I learned to read precisely, to understand what I read, and to question it critically. This is, of course, not only helpful when writing medical reports. Later, when communicating with colleagues, I will understand that while we may have different communication styles, we mean the same thing. Additionally, I have gained an understanding of how crucial high-quality data is for research and digital applications in medicine.
The GeMTeX project can advance medical research because…
Sarina Shams: …it helps make large volumes of medical text data usable in a structured, data-protection compliant way. This is an important basis for developing AI applications that can assist doctors in their daily work.
Marie-Sophie Polifka: …certain automation processes are needed in medicine so doctors can spend more time with their patients.