SMITH-Use Case PheP: Innovative tools and methods for clinical data evaluation – A summary of five years of project work

Through the work of the Medical Informatics Initiative (MII) and the associated establishment of Data Integration Centres, a unique and rich pool of clinical care data from various areas of healthcare is being made available for research. The pool is unique because the form and content of the data are precisely defined for all participating university hospitals in Germany. To make the data as usable as possible for research and innovative healthcare, the methodological Use Case Phenotyping Pipeline (PheP) was launched in January 2018 as a project of the SMITH Consortium. PheP supports the construction and qualitative enrichment of data and demonstrates how these data can be used in clinical projects. The PheP project will be completed at the end of this month. In its five years of operation, the methodological use case has laid the groundwork for a wide range of other MII projects. PheP has provided tools and methods to use medical data in a privacy-compliant way and to address novel research questions.

Using algorithms for early disease detection

One of the methods used in PheP was phenotyping. Phenotyping is the process of extracting new information from existing data. Certain characteristics or phenotypes of patients from medical records are automatically recognised and grouped. In this way, certain laboratory values and medications can provide clues to other diseases, or suitable patients can be found for participation in a study. The Terminology- and Ontology-based Phenotyping (TOP) junior research group emerged from this part of the project. Based at the University of Leipzig, the junior research group, led by Dr Alexandr Uciteli, has been working since 2021 on a software platform for modelling and executing phenotyping algorithms.
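To make the idea concrete, the following is a minimal sketch of a rule-based phenotyping step. It is not the TOP platform or any PheP algorithm: the record layout, the names `PatientRecord`, `has_suspected_diabetes` and `group_by_phenotype`, and the threshold and drug list are all assumptions chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PatientRecord:
    """Hypothetical, heavily simplified patient record."""
    patient_id: str
    labs: dict[str, float] = field(default_factory=dict)  # lab name -> latest value
    medications: set[str] = field(default_factory=set)    # active ingredient names


# Illustrative values only (assumptions, not clinical guidance).
HBA1C_THRESHOLD_PERCENT = 6.5
ANTIDIABETIC_DRUGS = {"metformin", "insulin glargine"}


def has_suspected_diabetes(record: PatientRecord) -> bool:
    """Flag the phenotype if a lab value or a medication points to it."""
    hba1c = record.labs.get("hba1c_percent")
    lab_hit = hba1c is not None and hba1c >= HBA1C_THRESHOLD_PERCENT
    drug_hit = bool(record.medications & ANTIDIABETIC_DRUGS)
    return lab_hit or drug_hit


def group_by_phenotype(records: list[PatientRecord]) -> dict[str, list[str]]:
    """Sort patient IDs into groups with and without the phenotype."""
    groups: dict[str, list[str]] = {"suspected_diabetes": [], "no_indication": []}
    for record in records:
        key = "suspected_diabetes" if has_suspected_diabetes(record) else "no_indication"
        groups[key].append(record.patient_id)
    return groups


records = [
    PatientRecord("p1", labs={"hba1c_percent": 7.2}),
    PatientRecord("p2", medications={"metformin"}),
    PatientRecord("p3", labs={"hba1c_percent": 5.4}, medications={"ibuprofen"}),
]
groups = group_by_phenotype(records)
```

In this sketch, patients `p1` and `p2` are grouped under the suspected phenotype, one because of a laboratory value and one because of a medication, while `p3` is not. A real phenotyping algorithm would rely on coded terminologies and far more detailed criteria.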

Privacy-compliant on-site analysis of patient data

The PheP project has also established a method for distributed analysis of patient data that complies with data protection regulations: the data is processed anonymously on site, and only results that no longer relate to individual patients leave the premises. All algorithms are developed in advance, distributed to the participating sites and run there. This approach has been used successfully in various national MII tests, the projectathons, and also in the cross-consortium clinical use case POLAR_MI, which focused on drug safety. In the future, distributed analysis will be further facilitated by, among others, the Personal Health Train group initiated as part of PheP.
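The principle of running the same algorithm at every site and sharing only aggregated results can be sketched as follows. This is an illustration under stated assumptions, not the MII or Personal Health Train implementation: the record fields, the function names and the small-count suppression threshold `MIN_CELL_COUNT` are hypothetical.

```python
from __future__ import annotations

from collections import Counter

# Assumed rule: counts below this value are withheld so that small groups
# cannot point to individual patients.
MIN_CELL_COUNT = 5


def age_band(age: int) -> str:
    """Coarse age grouping used instead of exact ages."""
    return "65+" if age >= 65 else "<65"


def run_local_analysis(site_records: list[dict]) -> dict[str, int]:
    """Runs inside a site: counts adverse events per age band.

    Only aggregated counts are returned; patient-level rows never leave.
    """
    counts = Counter(
        age_band(record["age"]) for record in site_records if record["adverse_event"]
    )
    return {band: n for band, n in counts.items() if n >= MIN_CELL_COUNT}


def combine_site_results(site_results: list[dict[str, int]]) -> dict[str, int]:
    """Runs centrally: adds up the aggregated counts from all sites."""
    total: Counter = Counter()
    for result in site_results:
        total.update(result)
    return dict(total)


# Hypothetical local data at two sites.
site_a = (
    [{"age": 70, "adverse_event": True}] * 6
    + [{"age": 40, "adverse_event": True}] * 2
    + [{"age": 50, "adverse_event": False}] * 3
)
site_b = (
    [{"age": 80, "adverse_event": True}] * 5
    + [{"age": 30, "adverse_event": True}] * 7
)

combined = combine_site_results([run_local_analysis(site_a), run_local_analysis(site_b)])
```

Here the central step only ever sees per-site counts. The two younger patients with adverse events at the first site fall below the assumed threshold and are therefore not reported at all.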

Medical texts made available for research

In addition, preliminary work has been done in the PheP Use Case to make medical documents available for research through natural language processing (NLP). Not all data in electronic patient records is structured and unambiguously coded as facts and figures. Free text is often used in the documentation, e.g. for findings or discharge letters. The PheP-NLP project developed new methods for extracting diagnoses, findings, medications and side effects from these texts. For this to work, the largest possible collection of these texts, a text corpus, is required. In a pilot project with the three university hospitals in Aachen, Jena and Leipzig, medical documents from more than 3,000 patients were analysed. The experience gained in this project has led to a new project that is unique in Germany: the German Medical Text Corpus (GeMTeX). It will start in June 2023 and, with 16 partners, including six university hospitals, will build up by far the largest collection of German-language medical texts for NLP research. GeMTeX will thus form the basis for the further development of natural language processing in Germany.