“We have to find a common language”: GeMTeX_MII makes it possible | 5 Questions for… Christina Lohr and Luise Modersohn about the start of the project

Medical texts from routine care contain a large amount of complex data, such as disease progression, diagnoses and treatments, which can be very useful for research and patient care. However, clinical documentation texts often vary widely in structure and content between institutions. This makes them difficult to use for natural language processing, which is the basis for all automated processes and analyses. The German Medical Text Corpus (GeMTeX) project, launched on 1 June, aims to remedy this situation. The method platform of the Medical Informatics Initiative (MII) is preparing a large number of medical texts from various disciplines in such a way that they can be used to build automated language models. For this purpose, the texts are marked according to defined specifications and encoded in a machine-readable way. Personal information is rendered unrecognisable in a multi-stage process. The aim is to create the largest standardised dataset of pseudonymised medical texts in the German language, which can be used for research and patient care across locations.

Christina Lohr and Luise Modersohn are research assistants in the GeMTeX project and are currently completing their PhDs in computational linguistics. In this interview, they talk about the challenges they face in creating the GeMTeX text corpus and how research and clinics can use the standardised text collection in the future.

You have both been involved in computational linguistics and natural language processing for a long time. What fascinates you about the subject?

Christina Lohr: On the one hand, I am fascinated by the fact that language can be very diverse and that there are an incredible number of ways to represent facts, for example the many ways a diagnosis can be described in a medical context.
On the other hand, language is constantly changing and new diseases emerge all the time, so we have to learn how to deal with them and update existing systems. Such complex challenges with clinical content have fascinated me all my life.

Luise Modersohn: It was similar for me. I am a bit of a puzzle freak. In puzzles we are always looking for the optimal solution to a problem. Language is used every day. Everyone should know it! But then it turns out that some things cannot be expressed in the language we all use. We realise: it works for certain cases, but then there are many exceptions. What fascinates me about language is that it is alive, constantly changing and at the same time full of ambiguities. I’m a big fan of puns. I think it’s incredibly cool that a term can have multiple meanings and that puns are possible. Coming from a computer science background, you ask yourself: how can you generalise something like that? It doesn’t really work, but somehow it does. That’s why I’m fascinated by natural language processing.

To what extent can an annotated corpus of medical texts support clinical research and patient care?

Luise Modersohn: There is one answer that always applies: it is important that things are comparable and that we standardise them. On the one hand, this is of course because we are currently automating more and more. For example, Google, Facebook and other companies are developing more and more language processing and AI tools. But they are all developing programs for general use. Clinical language, however, differs from general language in many ways, like a dialect. With GeMTeX or an annotated corpus, we can check certain performance values independently and do not have to rely on the manufacturers’ specifications. This has implications for patient care, where I think automation will increase. So we have to make sure that standardisation ensures the security of the data and therefore the well-being of the patient. But the gap between GeMTeX and direct patient care is relatively wide. The benefit is more for research.

Christina Lohr: There are many vendors of automated diagnostic coding on the market, and some of their products have not been publicly evaluated. Yet these vendors determine what is billed, using language processing tools that run as a black box in the background. Scientists and hospital staff have no way of understanding how the black box works, or how good the service really is for particular datasets. With GeMTeX, we have the opportunity to create a gold standard for comparing the performance of automatic language processing.

As for the research benefits, follow-up projects could build on the annotated GeMTeX corpus to investigate, for example, the extent to which speech module-based documentation software can really make life easier for doctors and nurses. During treatment, documentation could be done directly via voice control, while the GeMTeX language models run in the background. That would be the ideal situation.
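Editor's illustration: the gold-standard comparison Christina Lohr mentions can be pictured as scoring a tool's output against manually annotated spans. The following is a minimal sketch, not the GeMTeX evaluation procedure. The `Span` type, the labels "Diagnosis" and "Medication", the document IDs and the exact-match criterion are all assumptions made for the example.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """One annotated passage (hypothetical schema for this example)."""
    doc_id: str
    start: int   # character offset where the passage begins
    end: int     # character offset where the passage ends (exclusive)
    label: str   # annotation category, e.g. "Diagnosis"


def precision_recall_f1(gold: set[Span], predicted: set[Span]) -> tuple[float, float, float]:
    """Score predictions against gold annotations using exact span-and-label matches."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented toy data: three gold annotations and four tool predictions.
gold = {
    Span("letter-1", 0, 8, "Diagnosis"),
    Span("letter-1", 20, 29, "Medication"),
    Span("letter-2", 5, 14, "Diagnosis"),
}
predicted = {
    Span("letter-1", 0, 8, "Diagnosis"),
    Span("letter-1", 20, 29, "Diagnosis"),   # right span, wrong label
    Span("letter-2", 5, 14, "Diagnosis"),
    Span("letter-2", 30, 35, "Medication"),  # not in the gold standard
}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because the same gold annotations can be applied to any tool's output, this kind of comparison does not depend on figures reported by a vendor. Real evaluations often also count partial overlaps, which this sketch deliberately ignores.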

Luise Modersohn: ChatGPT has become the talk of the town. Such a model or a similar one COULD later be used locally on the GeMTeX data, for example to support the generation of doctor’s letters. Or a medical chatbot similar to ChatGPT could be developed for patients in hospitals who have questions or problems. COULD, because unfortunately the development of these language models requires a lot of data. So we will never get close to something like Google. But that is not our goal. We are talking about medical knowledge, which is a relatively closed domain. Once we have addressed and solved the security issues, voice-controlled AI assistance would be a potential application for which we could use the GeMTeX corpus.

Christina Lohr: You have to remember that a lot of data has to be available to operate such systems. In the clinical context the data exists, but we may not be allowed to use it, or it may not be in a usable state.

This brings us to the topic of clinical text annotation. Individual passages in the text are marked according to certain specifications. These markers help to build language models for artificial intelligence applications.
What are the challenges of annotating medical texts?

Christina Lohr: The biggest challenge is to organise the process. We have annotated a text dataset at the Chair for Computational Linguistics at the Friedrich Schiller University Jena in cooperation with the Jena University Hospital. A group of medical students read texts for us according to certain specifications and marked passages. As we started almost from scratch five or six years ago and there is very little previous work for the German language, we had to set up and develop some of the annotation guidelines ourselves. In the end, we had annotation guidelines with examples and counter-examples, or we took category systems out of the annotation cycle because we noticed that the more requirements there are, the more cognitive effort there is in editing texts. It can take six months to a year before we even know what the requirements look like in detail.

Luise Modersohn: In the end, it all comes down to communication. As computer scientists, we thought about what terms we were interested in and how we defined them. For those who came from computer science, it was also perfectly logical how we formulated it. Then we gave our definitions to the medical students. They looked at us and said, “No, not like that!” We had to find a common language. It was pure communication work. We had to ask ourselves: What are we talking about? What do you mean by this? What do we mean by it? What do we want to express? Through the dialogue between us computer scientists and the medical students, we were able to agree on a common language. This has less to do with medicine or computer science than with communication. It is language in a philosophical sense.

Christina Lohr: To give an example of the pitfalls we encountered: For a long time we didn’t know how to deal with double negations or negations from a linguistic or pathological point of view.

Luise Modersohn: A positive result is not necessarily positive. So: Congratulating someone on testing positive for HIV is pretty mean. It’s something negative, but expressed in a positive way. Abbreviations are also tricky.

Christina Lohr: Or when a lab result is “without result”.

Luise Modersohn: Or five lines of text and the final result is: “We found nothing”.

Christina Lohr: We had many discussions about how to deal with such situations.
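Editor's illustration: one common way to handle the pitfalls described above is to record whether a finding is affirmed or negated as an explicit attribute of the annotation, rather than inferring it from words such as "positive" or "nothing". The sketch below assumes a simple character-offset annotation. The class, the label "Finding", the attribute values and the example sentences are hypothetical and are not taken from the GeMTeX guidelines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A marked passage with an explicit assertion status (hypothetical schema)."""
    start: int      # character offset where the passage begins
    end: int        # character offset where the passage ends (exclusive)
    label: str      # annotation category
    assertion: str  # "affirmed" or "negated", decided by the annotator

    def text(self, document: str) -> str:
        """Return the passage this annotation covers."""
        return document[self.start:self.end]


def annotate(document: str, phrase: str, label: str, assertion: str) -> Annotation:
    """Create an annotation for the first occurrence of `phrase` in `document`."""
    start = document.index(phrase)
    return Annotation(start, start + len(phrase), label, assertion)


# Invented example text, not taken from any clinical corpus.
document = "HIV test positive. No evidence of pneumonia."
annotations = [
    # "positive" here means the finding is present, which is bad news for the patient.
    annotate(document, "HIV", "Finding", "affirmed"),
    # The finding is mentioned in the text but explicitly ruled out.
    annotate(document, "pneumonia", "Finding", "negated"),
]

for annotation in annotations:
    print(annotation.text(document), annotation.label, annotation.assertion)
```

Keeping the assertion status separate from the marked text means the annotator's decision, not the surface wording, is what a language model later learns from.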

It’s amazing the subtleties you come across in a project like this! Back to GeMTeX: What bottlenecks of previous language models can the GeMTeX method platform solve?

Christina Lohr: GeMTeX provides research data for processing German medical language. Although data sets are available, they are often only usable for scientific purposes and under certain conditions. In some cases, scientific textual datasets in medicine can only be used in the context of a very specific project. GeMTeX can fill this gap. Very importantly, GeMTeX builds on the broad patient consent of the Medical Informatics Initiative. This means that we are allowed to use the pseudonymised texts, if we have obtained consent, for exactly these purposes.

Luise Modersohn: Ultimately, it comes back to standardisation. Interdisciplinary research is very popular at the moment, and rightly so. In the past, there were only a few individual computational linguists who were interested in automation. So everyone had a small collection of texts in their hospital that were not allowed to be published – they were used for their own little evaluations. That’s all well and good. But the problem is that we now have differences between hospitals and even between individuals, even for the same type of letter, such as discharge letters. When you only have a very small amount of data, analyses on your own data work very well, but you have never had the opportunity to try it in a large context.

Please complete the following sentence: The GeMTeX project is an enrichment for the German research landscape, because…

Christina Lohr: …with our language models we can develop software for clinical documentation that can make doctors’ work easier.

Luise Modersohn: …individual researchers are no longer sitting alone at their own workstations; instead, the concentrated knowledge from the clinical sites in Germany can be brought together in one project.