SMITH project helps to make medical texts usable for automated analysis
Routine care generates large volumes of medical text that contain valuable information about patients. However, the wording, content and structure of medical documentation can vary greatly between different institutions, making it unusable for digital programmes or analysis across different locations. In the SMITH consortium, the NLP project has addressed this problem by publishing guidelines for preparing medical texts so that they can be used for medical research and care, using methods of natural language processing (NLP).
In all German hospitals, doctors’ letters are written at the time of transfer and discharge to provide information about the patient to the doctors who continue to treat him or her. These letters are an essential part of the medical record and contain information about the reason for the treatment, details of the patient’s medical history, allergies, previous illnesses, family diagnoses, therapies already carried out, medications prescribed to date and also information about further treatment. This information can be of great value not only to the doctors treating the patient, but also to medical research and to the patients themselves.
In order for these medical documentation texts to be used across sites, they must be readable by digital programmes. However, the wording of medical texts is highly dependent on the institution, the medical speciality and also the person who writes them. In addition, they are not standardised either structurally or in terms of content. Automated capture of details in texts such as doctors’ letters and discharge summaries, such as descriptions of medicines, their frequency of use (daily, three times a day) or form of administration (as tablets, drops) is therefore hardly possible without prior preparation.
Automated text analysis methods can make the content of such texts usable for technical information systems as well as for treating doctors and patients. However, this requires that such NLP systems have access to sufficient text material to enable automatic analysis.
With the help of special computer programmes, so-called annotation tools such as Brat or INCEpTION, trained personnel mark text passages in manual steps according to certain content specifications. These markings, also known as annotations, contain clues about the structure and content of the texts and form the basis for statistical models on which modern NLP systems base their analyses. The creation of such annotations is governed by annotation guidelines. During the first funding phase of the Medical Informatics Initiative (MII), several such guidelines were developed for the annotation of German discharge summaries. They focus on the following tasks:
1) Structures of text passages that indicate whether a text passage describes, for example, a salutation, an anamnesis or the administration of medication, or the course of a hospital stay.[1]
2) Person-identifying characteristics or all descriptions that allow conclusions to be drawn about an individual patient and must therefore be subsequently anonymised for data protection reasons (e.g. personal names, address data or dates).[2]
3) Descriptions of key medical categories such as diagnoses, symptoms and findings;[3]
4) Descriptions of medications, including dosage (e.g., 50 mg, 1/2 tablet), frequency (e.g., three times a day), route (e.g., oral or by mouth), duration, and reason;[4]
5) Additional medical categories in terms of content (e.g. descriptions of anatomical locations, medical tests and procedures, treatment methods) and their relations, which relate these categories to each other;
6) Temporal references between categories and their relationships – all references to points in time and the sequence of clinical events – with the aim of being able to automatically map the information contained in the doctor’s letter onto a timeline;
7) Descriptions of the certainty or uncertainty and the exclusion (negation) of statements, for example whether a diagnosis is formulated on a tentative basis or excluded altogether.
The first four of these seven annotation guidelines have now been published at the end of the first phase of the MII. These can be used as a starting point for the cross-consortium project German Medical Textcorpus (GeMTeX). GeMTeX starts in June 2023 and will build a German clinical reference corpus at six university hospitals (Leipzig, TU Munich, Essen, Berlin, Dresden and Erlangen).
_______________ |
[1] Annotation guideline: https://doi.org/10.5281/zenodo.7707756 / Publication: Christina Lohr, Stephanie Luther, Franz Matthies, Luise Modersohn, Danny Ammon, Kutaiba Saleh, Andreas G. Henkel, Michael Kiehntopf, and Udo Hahn: CDA-Compliant Section Annotation of German-Language Discharge Summaries: Guideline Development, Annotation Campaign, Section Classification. In: AMIA Annual Symposium Proceedings 2018, San Francisco, USA, Nov 3-7. |
[2] Annotation guideline: https://doi.org/10.5281/zenodo.7707882 / Publication: Tobias Kolditz, Christina Lohr, Johannes Hellrich, Luise Modersohn, Boris Betz, Michael Kiehntopf, Udo Hahn: Annotating German Clinical Documents for De-Identification (MedInfo 2019 Aug 25-30 Lyon France) [Slides] |
[3] Annotation guideline: https://doi.org/10.5281/zenodo.7707917 / Publication: Christina Lohr, Luise Modersohn, Johannes Hellrich, Tobias Kolditz, Udo Hahn: An Evolutionary Approach to the of Discharge Summaries. In: Studies in Health Technology and Informatics, Vol. 270: Digital Personalized Health and Medicine – Proceedings of MIE 2020 |
[4] Annotation guideline: https://doi.org/10.5281/zenodo.7707947 / Publication: Udo Hahn, Franz Matthies, Christina Lohr, Markus Löffler. 3000PA-Towards a National Reference Corpus of German Clinical Language. In: Studies in Health Technology and Informatics, Vol. 247: Building Continents of Knowledge in Oceans of Data: The Future of Co-Created eHealth – Proceedings of MIE 2018, Gothenburg, Sweden, April 24-26 2018. [Slides] |