Research

Research focus and projects in the field of Scientific Data Management

Efficient and scalable methods for the integration of large amounts of data as well as knowledge representation and discovery are central challenges of the research program of the Scientific Data Management Research Group. The developed applications are used in various domains (especially biomedicine and digital libraries) to turn heterogeneous data into usable knowledge.

The research plan includes the development of state-of-the-art infrastructures for managing heterogeneous scientific data, extracting knowledge from these data, and discovering new relationships and patterns. These infrastructures facilitate the integration of large and complex data sets into scientific knowledge graphs, support their analysis, and ease the cooperation of all actors in value-added chains around scientific data. The challenges that the research group is working on include:

  • Knowledge graphs that not only encode the meaning and connections of scientific data, but also contain knowledge about provenance, privacy, quality and uncertainty.
  • Domain-specific ontologies and link discovery techniques capable of promoting the interoperability of heterogeneous and large scientific data sets in a scalable manner.
  • Integration methods for heterogeneous and extensive scientific data sources, e.g. legacy, structured and unstructured data, static data and continuous data streams.
  • Storage and distribution of extensive scientific data and knowledge graphs.
  • Access control methods to enforce privacy regulations for sensitive data.
  • Federated query engines for scientific knowledge graphs.
  • Data analysis and methods of knowledge discovery through scientific knowledge graphs.

The developed infrastructure components are evaluated on the basis of various data sets. Scientific data from publications archived in the TIB's databases (e.g. via RADAR or DataCite) are particularly suitable for this purpose. Scientists will be able to use the developed scientific data management infrastructures to sustainably increase the effectiveness and productivity of their research work.

Projects

The research group will work on third-party funded projects that have been transferred from the University of Bonn or newly acquired. These include in particular:

  • iASiS: Integration and analysis of heterogeneous big data for precision medicine and suggested treatments for different types of patients (2017 to 2020).
  • BigMedilytics: Big Data for Medical Analytics (2017 to 2020)
  • QualiChain: Decentralised Qualifications’ Verification and Management for Learner Empowerment, Education Reengineering and Public Sector Transformation. EU H2020 Research and Innovation Action (RIA) funded project. 2019-2022
  • CLARIFY: Cancer Long Survivors Artificial Intelligence Follow Up. EU H2020 Research and Innovation Action (RIA) funded project. 2020-2023
  • ImProVIT: Transforming big data into knowledge: for deep immunoprofiling in vaccination, infectious diseases, and transplantation. Project supported by the Minister for Science and Culture in Lower Saxony. 2019-2022
  • PLATOON: Digital PLAtform and analytic TOOls for eNergy. EU H2020 Innovation Action (IA) funded project. 2020-2023
  • P4-LUCAT: Personalized medicine for lung cancer treatment: using Big Data-driven approaches for decision support. ERAPerMed JTC2019. 2020-2023.
  • NoBIAS: European Training Network (ETN) for the study of methods of detecting, describing, and managing bias in knowledge-driven approaches. 2020-2023
  • Knowledge4Hubris: A knowledge graph methodology that combines various types of information from heterogeneous sources into an integrated representation of all data relevant to the tenure of different forms of power.
  • TrustKG: A Framework for Knowledge Graphs based on Semantic Integration, Representation, and Curation of Scientific Data to enable Trustable and Interpretable Knowledge Exploration and Discovery
  • Leibniz Data Manager: A Research Data Management System. LDM is funded by Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) in the LIS Funding Programme e-Research Technologies (grant no. 438302423).

Prototypes

  • Ontario: Ontario is an ontology-based framework for on-demand data integration and semantic enrichment over Semantic Data Lakes. Ontario adds a semantic layer on top of the source datasets, which are stored in raw format in a Data Lake. Ontario supports different data models (structured and semi-structured) such as Relational, CSV, TSV, JSON, XML, Document, and Graph. In addition, the following data management systems are supported: MySQL, Postgres, MongoDB, Neo4j, and the distributed file systems Hadoop HDFS and S3. SPARQL is the global query language, and RML mappings are currently supported.
  • Falcon: FALCON is an entity and relation linking framework over DBpedia able to identify relations and entities in short texts or questions. 
  • RDFizer: SDM-RDFizer is an interpreter of mapping rules that allows the transformation of (un)structured data into RDF knowledge graphs. The current version of SDM-RDFizer assumes that mapping rules are defined in the RDF Mapping Language (RML). SDM-RDFizer implements optimized data structures and relational algebra operators that enable efficient execution of RML triple maps even in the presence of big data. SDM-RDFizer is able to process data from heterogeneous data sources (CSV, JSON, RDB, XML). The latest version, SDM-RDFizer 4.0, was released in October 2021 and adds new optimization features for creating very large knowledge graphs efficiently.
  • Dragoman: Dragoman is an optimized interpreter of mapping rules (defined in RML) that integrates data pre- and post-processing functions, defined according to FnO (Function Ontology), into the transformation of data into RDF knowledge graphs. Dragoman enables users to provide their own function library easily.
  • easyRML: easyRML facilitates the creation of RML mapping rules. easyRML provides a user-friendly interface that enables users to create their mapping rules without having to deal with the syntax of the mapping language. easyRML allows users to upload their ontology and list of data fields to get a better overview of the components of the data integration system while declaring mapping rules.
  • Leibniz Data Manager: The TIB Data Manager prototype was developed to improve the reusability of research data.
  • DeTrusty: a federated query engine. At this stage, only SPARQL endpoints are supported. DeTrusty differs from other query engines through its focus on the explainability and trustworthiness of the query result.
  • Trav-SHACL: a SHACL engine capable of planning the traversal and execution of a shape schema in a way that invalid entities are detected early and needless validations are minimized. Trav-SHACL reorders the shapes in a shape schema for efficient validation and rewrites target and constraint queries for fast detection of invalid entities. The shape schema is validated against an RDF graph accessible via a SPARQL endpoint.
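Several of the prototypes above (RDFizer, Dragoman, easyRML) revolve around mapping rules that turn source records into RDF. The following is only a minimal sketch of that general idea in plain Python. It does not use RML syntax and is not the API of SDM-RDFizer or any other prototype listed here; the mapping format, the `example.org` IRIs, the column names, and the function names are all hypothetical and chosen for illustration.

```python
import csv
import io

# Hypothetical, simplified mapping rule (NOT RML syntax): a subject IRI
# template plus one predicate IRI per source column.
MAPPING = {
    "subject_template": "http://example.org/patient/{id}",
    "predicate_object": {
        "name": "http://xmlns.com/foaf/0.1/name",
        "diagnosis": "http://example.org/vocab/diagnosis",
    },
}


def escape_literal(value):
    """Escape backslashes, quotes, and line breaks for an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def rows_to_ntriples(csv_text, mapping):
    """Apply the mapping to every CSV row; return N-Triples lines without duplicates.

    Simplification: values substituted into the subject template are not
    IRI-encoded, so the sketch assumes IRI-safe identifiers.
    """
    triples = []
    seen = set()  # duplicate triples are emitted only once
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = "<" + mapping["subject_template"].format_map(row) + ">"
        for column, predicate in mapping["predicate_object"].items():
            value = row.get(column)
            if not value:
                continue  # skip empty or missing cells
            triple = f'{subject} <{predicate}> "{escape_literal(value)}" .'
            if triple not in seen:
                seen.add(triple)
                triples.append(triple)
    return triples


SAMPLE_CSV = "id,name,diagnosis\n1,Ada,asthma\n2,Bo,\n1,Ada,asthma\n"

if __name__ == "__main__":
    for line in rows_to_ntriples(SAMPLE_CSV, MAPPING):
        print(line)
```

In this sketch the repeated row and the empty cell produce no additional triples. Real mapping engines additionally handle joins between sources, typed literals, and other source formats.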

Joint Lab Data Science & Open Knowledge

Some of the research on these topics takes place within the framework of the Joint Lab Data Science & Open Knowledge.

The Joint Lab will be established together with Leibniz Universität Hannover (LUH), the Faculty of Electrical Engineering and Computer Science and the L3S Research Center of LUH.

 
