Efficient and scalable methods for the integration of large amounts of data as well as knowledge representation and discovery are central challenges of the research program of the Scientific Data Management Research Group. The developed applications are used in various domains (especially biomedicine and digital libraries) to turn heterogeneous data into usable knowledge.
The research plan includes the development of state-of-the-art infrastructures for managing heterogeneous scientific data, extracting knowledge from these data, and discovering new relationships and patterns. These infrastructures facilitate the integration of large and complex data sets into scientific knowledge graphs, support their analysis, and ease cooperation among all actors in the value chains around scientific data. The challenges that the research group is working on include:
- Knowledge graphs that not only encode the meaning and connections of scientific data, but also contain knowledge about provenance, privacy, quality and uncertainty.
- Domain-specific ontologies and link discovery techniques capable of promoting the interoperability of heterogeneous and large scientific data sets in a scalable manner.
- Integration methods for large and heterogeneous scientific data sources, e.g. legacy, structured and unstructured data, static data and continuous data streams.
- Storage and distribution of large-scale scientific data and knowledge graphs.
- Access control methods to enforce privacy regulations for sensitive data.
- Federated query engines for scientific knowledge graphs.
- Data analysis and knowledge discovery methods over scientific knowledge graphs.
The developed infrastructure components are evaluated on a variety of data sets. Scientific data from publications archived in the TIB's databases (e.g. via RADAR or DataCite) are particularly suitable for this purpose. Scientists will be able to use the developed scientific data management infrastructures to sustainably increase the effectiveness and productivity of their research work.
The research group will work on third-party funded projects, both those transferred from the University of Bonn and newly acquired ones. These include in particular:
- iASiS: Integration and analysis of heterogeneous big data for precision medicine and suggested treatments for different types of patients (2017 to 2020).
- BigMedilytics: Big Data for Medical Analytics (2017 to 2020).
- QualiChain: Decentralised Qualifications’ Verification and Management for Learner Empowerment, Education Reengineering and Public Sector Transformation. EU H2020 Research and Innovation Action (RIA) funded project. 2019-2022
- CLARIFY: Cancer Long Survivors Artificial Intelligence Follow Up. EU H2020 Research and Innovation Action (RIA) funded project. 2020-2023
- ImProVIT: Transforming big data into knowledge: for deep immunoprofiling in vaccination, infectious diseases, and transplantation. Project supported by the Ministry for Science and Culture of Lower Saxony. 2019-2022
- PLATOON: Digital PLAtform and analytic TOOls for eNergy. EU H2020 Innovation Action (IA) funded project. 2020-2023
- P4-LUCAT: Personalized medicine for lung cancer treatment: using Big Data-driven approaches for decision support. ERAPerMed JTC2019. 2020-2023.
- NoBIAS: European Training Network (ETN) for the study of methods for detecting, describing, and managing bias in knowledge-driven approaches. 2020-2023
- Ontario: Ontario is an ontology-based, on-demand data integration and semantic enrichment framework over Semantic Data Lakes. Ontario adds a semantic layer on top of the source datasets, which are stored in raw format in a Data Lake. Ontario supports different data models (structured and semi-structured) such as Relational, CSV, TSV, JSON, XML, Document, and Graph. In addition, the following data management systems are supported: MySQL, Postgres, MongoDB, Neo4j, and the distributed file systems Hadoop HDFS and S3. SPARQL is the global query language, and RML mappings are currently supported.
- Falcon: FALCON is an entity and relation linking framework over DBpedia able to identify relations and entities in short texts or questions.
- RDFizer: SDM-RDFizer is an interpreter of mapping rules that allows the transformation of (un)structured data into RDF knowledge graphs. The current version of SDM-RDFizer assumes that mapping rules are defined in the RDF Mapping Language (RML). SDM-RDFizer implements optimized data structures and relational algebra operators that enable efficient execution of RML triples maps even in the presence of big data. SDM-RDFizer is able to process data from heterogeneous data sources (CSV, JSON, RDB, XML). The latest version of SDM-RDFizer, version 4.0, which adds new optimization features for creating very large KGs efficiently, was released in October 2021.
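The idea behind mapping rules that turn tabular data into an RDF knowledge graph can be illustrated with a minimal Python sketch. It does not use the SDM-RDFizer API or real RML syntax: the `MAPPING` dictionary, the `example.org` IRIs, the column names, and the function `apply_mapping` are all hypothetical and chosen only for illustration. The sketch builds one subject IRI per CSV row, emits one triple per mapped column, and uses a set to drop duplicate triples.

```python
import csv
import io

# Hypothetical, simplified mapping rule (an assumption, NOT real RML syntax).
MAPPING = {
    "subject_template": "http://example.org/patient/{id}",
    "class": "http://example.org/Patient",
    "predicate_object": [
        ("http://example.org/hasDiagnosis", "diagnosis"),
        ("http://example.org/hasAge", "age"),
    ],
}

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Made-up sample data; the third row duplicates the first one.
SAMPLE_CSV = "id,diagnosis,age\n1,lung cancer,63\n2,,57\n1,lung cancer,63\n"


def escape_literal(value):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def apply_mapping(csv_text, mapping):
    """Return N-Triples lines for the CSV rows, without duplicates."""
    triples = []
    seen = set()  # an RDF graph is a set of triples, so duplicates are dropped
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = mapping["subject_template"].format(**row)
        candidates = [(subject, RDF_TYPE, f"<{mapping['class']}>")]
        for predicate, column in mapping["predicate_object"]:
            value = row.get(column, "")
            if value:  # empty cells produce no triple
                candidates.append((subject, predicate, f'"{escape_literal(value)}"'))
        for s, p, o in candidates:
            line = f"<{s}> <{p}> {o} ."
            if line not in seen:
                seen.add(line)
                triples.append(line)
    return triples


if __name__ == "__main__":
    for triple in apply_mapping(SAMPLE_CSV, MAPPING):
        print(triple)
```

An engine such as SDM-RDFizer additionally has to handle joins between sources, several input formats, and very large inputs, none of which this sketch attempts.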
Some of the research on these topics takes place within the framework of the Joint Lab Data Science & Open Knowledge.
The Joint Lab will be established together with Leibniz Universität Hannover (LUH), the Faculty of Electrical Engineering and Computer Science and the L3S Research Center of LUH.