Work package 6: IT Conception and Infrastructure

One of DigiMed Bayern's key strengths is the innovative design and implementation of a centralized digital platform, which not only guarantees secure, data protection-compliant data access, but also provides analytical tools, algorithm-based analysis including machine learning, and knowledge management systems. The use of AI in these areas is well established. In free-text analysis (text mining), it ranges from recognizing words and word groups and assigning them to ontologies, through syntactic and semantic analysis, to the subsequent structuring of information in databases. In image recognition, for example, histological microscopic patterns can be recognized and quantified. As a special application within DigiMed Bayern, WP5.2 envisages establishing and using a high-throughput process in which lasers for the microdissection of plaques originating from WP2 are controlled on the basis of evaluated image data. The prepared plaque areas are then examined by proteomics methods, helping to establish causal links to the clinical progression of stroke patients and, ultimately, to identify new therapeutic approaches.

The collection, analysis and evaluation of medical big data are closely supported by the statistical expertise of the partners. The goal is to establish a largely comprehensive and cooperative digital infrastructure that is both state of the art and feasible.

The collection, processing, merging and integrative analysis of large sets of heterogeneous data, as aimed for in DigiMed Bayern, requires the use of various IT technologies. The aim is to integrate these technologies and create an infrastructure that connects the IT resources available at the partner sites involved, and thereby enables efficient sharing of the data and tools required for the project. Some of the requirements for this infrastructure are also described in WPs 1 to 5. In addition to the infrastructure itself, the DigiMed partners must build up the competence to use it correctly and purposefully, and thus to the benefit of medical care.

There will be close horizontal cooperation between the institutions involved in DigiMed Bayern with regard to data types, volumes, protection, integration and evaluation. At the same time, the interdisciplinary team of computer scientists, bioinformaticians, statisticians, omics experts and clinical scientists has to implement the digital aspects vertically from different perspectives. To meet these requirements, a digital control structure, the so-called governance board, has been set up under the direction of WP6. Each vertical area is assigned at least one qualified representative from every work package, with a defined proportion of working hours. These representatives work on the IT governance structure, support all cooperation partners, and act as the liaison to their own work package or institution. The detailed structure of the governance board and the associated processes are defined in the first six weeks after the start of the project.

In the digital area in particular, "agile" processes (such as the "Scrum" model) have become widely established for coping with highly dynamic projects whose requirements change constantly. A continuous learning and optimization process with closely spaced control points in project management generally leads to more precise planning and better results, i.e. an increase in efficiency, predictability and goal achievement. This model will be applied in WP6 where appropriate.

The primary task is to record the existing hardware and software infrastructure, the data and the use cases in WPs 1 to 5, in order to then jointly develop a solid concept whose implementation also includes the LRZ infrastructure. This also creates a connection to the computing capacity required for the data analysis in WPs 1 to 5.

WP6.1 Analysis of the existing infrastructure / conception of the integrative IT infrastructure

First, the status quo of the existing IT infrastructures of the partners involved is evaluated. To arrive at an early assessment of how well they can be integrated into a higher-level infrastructure, all important questions regarding the current status of IT equipment, data management requirements and the scientists' requirements for an integrative infrastructure are identified and categorized in the first six months. A questionnaire is filled in during interviews with the IT specialists from all partner institutions. These surveys are carried out in close cooperation with WP7 and in line with ethics and data protection requirements. A comprehensive workshop at the beginning of the project lays the foundation. The results feed into the subsequent conception of the infrastructure, which takes into account both the local inventory of resources, data and tools, and infrastructures that are available internationally at other locations or with other cooperation partners.
For the design of the required infrastructure, particular attention is paid to the requirements for analysis and knowledge management software from WP6.3. Conversely, the analysis results from WP6.1 flow into WP6.3, especially with regard to security, data protection, scalability, usability, compatibility of hardware architectures, interfaces, data formats, etc. The process of defining requirements and selection is coordinated and controlled by the LRZ.

The documented results of the conception phase are used for the detailed planning of the construction of the infrastructure in the second project phase. A continuous training program developed by WP6.2 familiarizes the project partners with the infrastructure, and is used to jointly identify suggestions for improvement and further development. Beyond construction and operation, further goals are the transfer of the infrastructure to clinical or clinic-related operation in year 5, and publicly available documentation of it as an exemplary, scalable and transferable infrastructure for P4 medicine with omics data.

WP6.2 Development of the pilot infrastructure, planning and coordination of the data exchange, provision of computing capacity

To make efficient use of the conception phase, and to allow realistic experience to flow into the conception continuously despite the complex analysis of requirements and status quo, the first parts of the integrative infrastructure are developed in parallel, with a focus on "low-hanging fruit". IT resources are set up in such a way that pilot applications on the first infrastructure components can be tested and used as quickly as possible. One such component will be, for example, the database that centrally stores the biochemical, molecular genetic and treatment data of FH patients, or publicly available gene and protein databases with phenotypical associations and ontologies. Until all questions relating to data protection regulations are clarified, only non-critical data will be used for the testing phase. In addition to data storage, data transfer between the institutions and the central data management facilities is also made possible. The need to scale the data transfer infrastructure to defined, high volumes of data is taken into account early in the operating phase.

In addition to the transfer and management of the data, their processing is an essential part of the IT infrastructure. For this purpose, some of the analysis software must be installed on central hardware components that can provide the computing capacity required for the corresponding calculations. For HPC systems, such as those used for large simulations or analyses, a separate application is required that demonstrates the scientific and technical expertise necessary to use the systems. The LRZ supports the responsible scientists in preparing these applications. In addition, tests of the software are carried out on the infrastructure, the results of which flow directly into its design in WP6.1.
Following a process of continuous integration, this pilot infrastructure will be gradually expanded and improved. In this way, the growing possibilities can be tested for suitability by pilot users, and any deficits can be immediately incorporated into the further concept. At the end of the first phase, the knowledge gained from setting up and testing the pilot infrastructure is gathered and integrated into the documentation for WP6.1. This serves to decide on the continuation of pilot components in real operation, or the redesign of sub-areas that may not be sufficiently viable.

In the second phase, the concept of WP6.1 is implemented iteratively and operated for use by the scientists from WPs 1 to 5. At the same time, the transfer to clinical or clinic-related operation independent of DigiMed Bayern, which is planned for year 5, is being prepared.

WP6.3 Analysis, conception and implementation of software solutions for the integrated omics platform and the expert system for digital medicine

The data protection-compliant integration of existing as well as prospective and retrospective data is the basis for further analysis. For this purpose, a comprehensive IT core infrastructure is to be created, which supports the acquisition, processing, integration and analysis of big data. Public data used in the project should be available in compatible formats that can be evaluated digitally and integratively, and should be made accessible and integrated. At the command line or graphical user interface level, it should be possible to analyze data either through complex queries passed to analysis tools, including AI, or manually. The analyzed data, including the selected parameters and results, should be stored in a structured manner. This not only provides documentation, but also makes it possible to quickly repeat the analysis with adapted underlying data records and/or parameters. It should also be possible to annotate analyses, results, data and relations manually with free text and with existing and/or new, flexibly adaptable ontologies. An exemplary use case is a protein entry for which it is visible within the consortium i) which analyses have been carried out, ii) by whom, iii) with which data, iv) with which software, v) with which parameters, vi) under which hypotheses, vii) with which results and insights, and viii) which questions or follow-up activities arise. This is the only way to achieve a collaborative exchange of information between the institutions and people involved, despite the large amounts of data and the combinatorial diversity. This continuous digital axis is essential for high efficiency with regard to the overall project goals, and has the following sub-aspects:

structured, flexible data retrieval including access management;
promotion of efficient and transparent collaboration despite interdisciplinarity;
transparent (live) documentation ranging from activities to innovations, publications and project success;
scientific project management;
creation of an exemplary, scalable and transferable digital infrastructure for P4 medicine;
publicly accessible interactive expert database for biomarkers in atherosclerosis as a sustainable structure.
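
The protein entry use case described above (points i to viii) can be illustrated with a small structured record. The following is only a minimal sketch under stated assumptions: the class name `AnalysisRecord`, the helper `rerun_with`, all field names and the example values are hypothetical and are not part of the DigiMed Bayern platform. It shows how storing parameters alongside results makes an analysis both documented and quickly repeatable with adapted parameters.

```python
from dataclasses import dataclass, asdict, replace
import json


@dataclass(frozen=True)
class AnalysisRecord:
    """Hypothetical record of one analysis attached to a protein entry."""
    protein_id: str           # the entry the analysis belongs to
    analysis: str             # i)    which analysis was carried out
    analyst: str              # ii)   by whom
    datasets: tuple           # iii)  with which data
    software: str             # iv)   with which software
    parameters: dict          # v)    with which parameters
    hypothesis: str           # vi)   under which hypothesis
    results: str              # vii)  results and insights
    follow_ups: tuple = ()    # viii) open questions or follow-up activities

    def to_json(self) -> str:
        """Serialize the record so it can be stored in a structured manner."""
        return json.dumps(asdict(self), sort_keys=True)


def rerun_with(record: AnalysisRecord, **parameter_overrides) -> AnalysisRecord:
    """Return a new, not yet executed record with adapted parameters.

    The original record is left untouched, so the earlier analysis
    stays documented.
    """
    new_parameters = {**record.parameters, **parameter_overrides}
    return replace(record, parameters=new_parameters, results="", follow_ups=())


# Usage with placeholder values (all values are illustrative only).
original = AnalysisRecord(
    protein_id="EXAMPLE_PROTEIN",
    analysis="differential abundance",
    analyst="analyst-a",
    datasets=("example-dataset-1",),
    software="example-tool 1.0",
    parameters={"fdr": 0.05},
    hypothesis="abundance differs between groups",
    results="placeholder result text",
    follow_ups=("placeholder follow-up question",),
)
repeat = rerun_with(original, fdr=0.01)
```

In this sketch the record is immutable, so a repeated analysis always produces a new record rather than overwriting the documentation of the earlier one.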

The largest data volumes arise in the omics technologies in WP5. At the same time, there is already great expertise in integrative analysis with regard to overarching, disease-related issues. In the working groups of Prof. Matthias Mann, similar infrastructures are already being set up and used in parallel projects. The focus here is on the integration of omics data and public databases, as well as on the integration of clinical data, text mining and graph-based databases. This expertise is to be used by DigiMed Bayern. At the same time, especially in the clinical field, the need for data formatting and harmonization has not yet been met. The same applies to the need for user interface-based access, efficient granular management of user rights and data access, the integration of further data, interfaces to analytical tools, and annotation. The knowledge gained will result in documented requirements for the IT infrastructure and for the operation or expansion of the pilot infrastructure. The need for commercial service providers will be documented in a structured manner, and can therefore flow quickly into targeted award procedures. As far as possible, the pilot infrastructure should be used as an already existing building block for the productive systems. Alongside the recording of the current status, the use cases and the resulting requirements, the IT security and data protection concept is drawn up first. Research data is collected in accordance with the TMF "Guide to data protection in medical research projects", including two-stage pseudonymization. Both the research data and the clinical data are pseudonymized before they are prepared or integrated. The further steps in the process are therefore carried out exclusively on pseudonymized data. For external access, secure concepts for the data protection-compliant exchange of data and samples must be developed.
The data protection and security concept will be discussed with the responsible data protection officers and ethics committees, and submitted for assessment.
The requirements and the identified source systems are prioritized, and following an agile process, a first version of the architecture is designed. The software components are subsequently developed or expanded and put into pilot operation.
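
The idea behind the two-stage pseudonymization mentioned above can be sketched as follows. This is a minimal illustration only, assuming keyed hashing (HMAC-SHA256) as the pseudonym function and two secret keys held by two separate parties. Neither the function names nor the choice of HMAC come from the project description or the TMF guide, and a real deployment would rely on dedicated, audited pseudonymization services.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, key: bytes) -> str:
    """Derive a pseudonym from an identifier using a secret key (assumed: HMAC-SHA256)."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def two_stage_pseudonym(patient_id: str, stage1_key: bytes, stage2_key: bytes) -> str:
    """Apply two pseudonymization stages with independent keys.

    Stage 1 would run at the data source, stage 2 at a separate service,
    so that no single party holding only one key can link the final
    pseudonym back to the original identifier.
    """
    first_stage = pseudonymize(patient_id, stage1_key)
    return pseudonymize(first_stage, stage2_key)


# Usage with placeholder values (keys and identifier are illustrative only).
stage1_key = b"placeholder-key-held-by-party-one"
stage2_key = b"placeholder-key-held-by-party-two"
pseudonym = two_stage_pseudonym("example-patient-id", stage1_key, stage2_key)
```

Because the derivation is deterministic, the same patient always maps to the same pseudonym, which allows records from different sources to be merged without exposing the original identifier.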

Prof. Dr. med. Heribert Schunkert

Scientific Director of DigiMed Bayern, Director of the Department of Cardiovascular Diseases at the Deutsches Herzzentrum München

+49 (0) 89 / 1218-4073

wildgruber@dhm.mhn.de

Prof. Dr. Annette Peters

Director of the Institute of Epidemiology, Helmholtz Zentrum München, Germany

+49 (0) 89 / 3187-4566

peters@helmholtz-muenchen.de

Dr. Holger Prokisch

Group Leader, Institute of Neurogenomics, Helmholtz Zentrum München; Group Leader, Institute of Human Genetics, Klinikum rechts der Isar, Technische Universität München

+49 (0) 89 / 3187-2890

prokisch@helmholtz-muenchen.de