Teaching

Prof. (Univ. Simón Bolívar) Dr. Maria-Esther Vidal teaches Master's courses at the Faculty of Electrical Engineering and Computer Science at Leibniz Universität Hannover and in the BIOMEDAS program.
In addition, the research group supervises laboratory projects and final theses.

Courses at Leibniz Universität Hannover

Summer 2020

  • Introduction to Database Systems (Lecture, Exercises)

Winter 2019/20

  • Data Science & Digital Libraries (Seminar)

Summer 2019

  • Knowledge Engineering and Semantic Web (Lecture, Exercises)

Courses in BIOMEDAS

Summer 2021

Bachelor, Master, and PhD Theses

We offer Bachelor's and Master's theses on the following topics:
a) Knowledge graph creation and data integration
b) Constraint validation over knowledge graphs
c) Federated query processing over knowledge graphs
d) Explainable machine learning over knowledge graphs
e) Research data management and FAIR principles
f) Knowledge extraction and text mining

Open topics:

  • Master Thesis: SHACL Engine
  • Master Thesis: Transparent Polystores

In progress:

  • Fernandes, Luis
  • He, Chenyu

Completed:

  • Iglesias, Enrique: Data Structures for Knowledge Graph Creation at Scale. Master Thesis. 2021-05-12
    Supervisor(s): M.Sc. Samaneh Jozashoori, M.Sc. David Chaves-Fraga, Prof. Dr. Maria-Esther Vidal
    Evaluators: Prof. Dr. Jens Lehmann, Prof. Dr. Sören Auer
    Abstract

    Data has grown exponentially in recent years, and knowledge graphs have gained momentum as data structures for integrating heterogeneous data. This explosion of data has created many opportunities to develop innovative technologies, yet it also draws attention to the lack of standardization in making data available, raising questions about interoperability and data quality. Data complexities like large volume, heterogeneity, and high duplicate rates affect knowledge graph creation. This thesis addresses these issues to scale up knowledge graph creation guided by the RDF Mapping Language (RML). It builds on the assumption that the amount of memory required to create a knowledge graph and the strategies used to execute the RML mapping rules impact the efficiency of the knowledge graph creation process. We propose a two-fold solution to address these two sources of complexity. First, RML mapping rules are reordered so that the most selective mapping rules are evaluated first and non-selective rules are considered last. As a result, the number of triples kept in main memory is reduced. In a second step, an RDF compression strategy and novel operators are made available; they avoid the generation of duplicate RDF triples and reduce the number of comparisons performed between mapping rules during the execution of RML operators. We empirically evaluate the performance of our proposed solution against various testbeds with diverse configurations of data volume, duplicate rates, and heterogeneity. The observed results suggest that our approach optimizes execution time and memory usage compared with the state of the art. Moreover, these outcomes provide evidence of the crucial role of data structures and execution strategies in the scalability of knowledge graph creation processes based on declarative mapping languages.
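The two ideas in the abstract, evaluating the most selective mapping rules first and avoiding duplicate triples, can be illustrated with a small Python sketch. The function names, the selectivity estimate, and the toy rules below are illustrative assumptions, not the thesis implementation:

```python
# Hypothetical sketch: execute mapping rules in order of selectivity,
# deduplicating triples as they are produced (names are illustrative).

def execute_rules(rules, estimate_selectivity, apply_rule):
    """Run the most selective rules first and keep each triple only once."""
    # Fewer expected triples = more selective = evaluated first.
    ordered = sorted(rules, key=estimate_selectivity)
    seen = set()  # one entry per distinct triple, bounding main-memory growth
    for rule in ordered:
        for triple in apply_rule(rule):
            if triple not in seen:  # skip duplicates instead of storing them
                seen.add(triple)
    return seen

# Toy usage: two "rules" producing overlapping triples.
data = {
    "rule_a": [("s1", "p", "o1"), ("s2", "p", "o2")],
    "rule_b": [("s1", "p", "o1"), ("s3", "p", "o3")],  # one duplicate
}
triples = execute_rules(list(data), lambda r: len(data[r]), lambda r: data[r])
print(len(triples))  # 3 distinct triples; the duplicate is discarded
```

Deduplicating at generation time, rather than after materializing all triples, is what keeps the memory footprint proportional to the distinct output rather than to the raw rule output.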

  • Figuera, Mónica: Efficiently Validating Integrity Constraints in SHACL. Master Thesis. 2021-05-05
    Supervisor(s): Prof. Dr. Maria-Esther Vidal, M.Sc. Philipp D. Rohde, Dr. Diego Collarana
    Evaluators: Prof. Dr. Jens Lehmann, Prof. Dr. Maria-Esther Vidal
    Abstract

    We study fundamental aspects of validating schemas defined under the Shapes Constraint Language (SHACL) specification, the W3C recommendation for declaratively representing integrity constraints over RDF knowledge graphs and the reference language in industrial consortia such as the International Data Spaces initiative (IDS). Existing SHACL engines enable the identification of entities that violate integrity constraints. Nevertheless, these approaches do not scale up in the presence of large volumes of data to effectively identify invalidations. We address the problem of efficiently validating integrity constraints in SHACL. To this end, we propose Trav-SHACL, an engine that performs data quality assessment in minimal time by identifying heuristic-based strategies to traverse a shapes schema and by applying optimization techniques to the schema. Our key contributions include (i) Trav-SHACL, a SHACL engine capable of planning the traversal and execution of a shapes schema so that invalid entities are detected early and needless validations are minimized; Trav-SHACL reorders the shapes in a shapes schema for efficient validation and rewrites target and constraint queries for fast detection of invalid entities; (ii) the empirical evaluation of Trav-SHACL on 27 testbeds over the well-known Lehigh University Benchmark (LUBM), executed against knowledge graphs of up to 34M triples. Our experimental results suggest that Trav-SHACL consistently exhibits high performance and reduces validation time by a factor of up to 33.65 compared to the state of the art.
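The traversal idea, validating shapes in an order that surfaces invalid entities early and skipping entities already known to be invalid, can be sketched as follows. The ordering heuristic, the constraint encoding as Python predicates, and the toy data are illustrative assumptions, not the Trav-SHACL implementation:

```python
# Hypothetical sketch of heuristic shape ordering and early invalidation
# (the heuristic, shapes, and data are illustrative, not Trav-SHACL itself).

def plan_order(shapes, constraint_count):
    """Simple heuristic: validate the most constrained shapes first,
    so violations surface as early as possible."""
    return sorted(shapes, key=constraint_count, reverse=True)

def validate(entities, shape_constraints, order):
    invalid = set()
    for shape in order:
        for entity, value in entities.items():
            if entity in invalid:  # needless validations are skipped
                continue
            if not all(check(value) for check in shape_constraints[shape]):
                invalid.add(entity)
    return invalid

shape_constraints = {
    "PersonShape": [lambda v: isinstance(v, str)],
    "StrictShape": [lambda v: isinstance(v, str), lambda v: len(v) > 0],
}
order = plan_order(shape_constraints, lambda s: len(shape_constraints[s]))
entities = {"e1": "Alice", "e2": ""}
print(sorted(validate(entities, shape_constraints, order)))  # ['e2']
```

The earlier an entity is marked invalid, the more downstream constraint checks can be pruned, which is where the performance gain comes from.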

  • Hanasoge, Supreetha: Efficiently Identifying Top k Similar Entities. Master Thesis. 2021-02-10
    Supervisor(s): Prof. Dr. Maria-Esther Vidal
    Evaluators: Prof. Dr. Sören Auer, Prof. Dr. Maria-Esther Vidal
    Abstract

    With the rapid growth of genomic studies, more and more successful research integrates tools and technologies from interdisciplinary sciences. Computational biology, or bioinformatics, is one such field that successfully applies computational tools to capture and transcribe biological data. Specifically, in genomic studies, the detection and analysis of co-occurring mutations is a leading area of study. Concurrently, computer science and information technology have in recent years seen increased interest in association analysis and co-occurrence computation. The traditional method of finding the top similar entities examines every possible pair of entities, which leads to prohibitive quadratic time complexity. Most existing approaches also require a similarity measure and threshold beforehand to retrieve the top similar entities; these parameters are not always easy to tune. An adaptive method can thus have broader applications for identifying the most similar pairs of mutations (or entities in general). This thesis presents an algorithm to efficiently identify the top k similar genetic variants using co-occurrence as the similarity measure. Our approach uses an upper-bound condition to iteratively prune the search space and thereby tackles the quadratic complexity. The empirical evaluations illustrate the behavior of the proposed methods in terms of execution time and accuracy, particularly on large datasets. The experimental studies also explore the impact of parameters like input size and k on the execution time of top-k approaches. The study outcomes suggest that systematically pruning the search space with an adaptive threshold condition optimizes the process of identifying the top similar pairs of entities.
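The pruning idea can be sketched concretely: with co-occurrence as the similarity, |A ∩ B| ≤ min(|A|, |B|) gives an upper bound, so a candidate pair whose bound cannot beat the current k-th best similarity is skipped without being computed. The entity sets and the exact bookkeeping below are illustrative assumptions, not the thesis algorithm:

```python
import heapq
from itertools import combinations

# Hypothetical sketch of top-k pair search with upper-bound pruning
# (entity "sample sets" and the bookkeeping are illustrative).

def cooccurrence(a, b):
    return len(a & b)  # co-occurrence count as the similarity measure

def top_k_pairs(entities, k):
    names = sorted(entities, key=lambda n: len(entities[n]), reverse=True)
    heap = []  # min-heap of (similarity, pair); heap[0][0] is the k-th best
    for i, j in combinations(range(len(names)), 2):
        # Upper bound: |A & B| can never exceed the smaller set size.
        upper = min(len(entities[names[i]]), len(entities[names[j]]))
        if len(heap) == k and upper <= heap[0][0]:
            continue  # prune: this pair cannot enter the top k
        sim = cooccurrence(entities[names[i]], entities[names[j]])
        entry = (sim, (names[i], names[j]))
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif sim > heap[0][0]:
            heapq.heapreplace(heap, entry)
    return sorted(heap, reverse=True)

entities = {"m1": {1, 2, 3}, "m2": {2, 3, 4}, "m3": {9}}
print(top_k_pairs(entities, k=1))  # [(2, ('m1', 'm2'))]
```

The threshold (the current k-th best) adapts as better pairs are found, so no similarity cutoff has to be fixed in advance.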

  • Torabinejad, Mohammad: Normalization Techniques for Improving the Performance of Knowledge Graph Creation Pipelines. Master Thesis. 2020-09-25
    Supervisor(s): Prof. Dr. Maria-Esther Vidal, Samaneh Jozashoori
    Evaluators: Prof. Dr. Sören Auer, Prof. Dr. Maria-Esther Vidal
    Abstract

    With the rapid growth of data on the web, the demand for discovering information within data and, consequently, for exploiting knowledge graphs is rising rapidly. Data integration systems can be of great help in meeting this demand, as they transform data from various sources and of different volumes. To this end, a data integration system utilizes mapping rules – specified in a language like RML – to integrate data collected from various data sources into a knowledge graph. However, large data sources may suffer from various data quality issues, redundancy being one of them. In this regard, the Semantic Web community contributes to Knowledge Engineering with techniques to create knowledge graphs efficiently. The thesis reported in this document tackles the creation of knowledge graphs in the presence of data sources with redundant data, and a novel normalization theory is proposed to solve this problem. This theory covers not only the characteristics of the data sources but also the mapping rules used to integrate the data sources into a knowledge graph. Based on it, three normal forms are proposed, as well as an algorithm for transforming mapping rules and data sources into these normal forms. The performance of the proposed approach is evaluated on different testbeds composed of real-world and synthetic data. The observed results suggest that the proposed techniques can dramatically reduce the execution time of knowledge graph creation. Therefore, the normalization theory of this thesis contributes to the repertoire of tools that facilitate the creation of knowledge graphs at scale.
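One effect a normal form of this kind can have is sketched below: before a mapping rule is executed, the source is projected onto the attributes the rule actually references, and redundant rows are dropped so they cannot yield duplicate triples. The attribute names and rows are illustrative assumptions, not the normal forms defined in the thesis:

```python
# Hypothetical sketch of source normalization: project each source onto the
# attributes referenced by a mapping rule, then deduplicate (data illustrative).

def normalize_source(rows, used_attributes):
    """Rows that differ only in attributes a mapping rule ignores are
    redundant for that rule and would produce duplicate triples."""
    projected = {tuple(row[a] for a in used_attributes) for row in rows}
    return [dict(zip(used_attributes, values)) for values in sorted(projected)]

rows = [
    {"drug": "D1", "target": "T1", "batch": "b-01"},
    {"drug": "D1", "target": "T1", "batch": "b-02"},  # redundant wrt the rule
    {"drug": "D2", "target": "T2", "batch": "b-03"},
]
# A rule mapping only drug -> target ignores `batch`.
print(normalize_source(rows, ["drug", "target"]))
# [{'drug': 'D1', 'target': 'T1'}, {'drug': 'D2', 'target': 'T2'}]
```

The rule then runs over two rows instead of three, and the saving grows with the duplicate rate of the source.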

  • Karim, Farah: Compact Semantic Representations of Observational Data. PhD Thesis. 2020-03-18
    Advisor(s): Prof. Dr. Sören Auer, Prof. Dr. Maria-Esther Vidal
    Examiners: Prof. Dr. Sören Auer, Prof. Vojtêch Svátek
    Abstract

    The Internet of Things (IoT) concept has been widely adopted in several domains to enable devices to interact with each other and perform certain tasks. IoT devices encompass different concepts, e.g., sensors, programs, computers, and actuators. IoT devices observe their surroundings to collect information and communicate with each other to perform mutual tasks. These devices continuously generate observational data streams, which become historical data when the observations are stored. Due to the increase in the number of IoT devices, a large amount of streaming and historical observational data is being produced. Moreover, several ontologies, like the Semantic Sensor Network (SSN) ontology, have been proposed for the semantic annotation of observational data, either streams or historical. The Resource Description Framework (RDF) is a widely adopted data model for semantically describing such datasets. Semantic annotation provides a shared understanding for the processing and analysis of observational data. However, adding semantics further increases the data size, especially when observation values are redundantly sensed by several devices. For example, several sensors can generate observations indicating the same value for relative humidity at a given timestamp and city. Such an observation can be represented in an RDF graph using four RDF triples that describe the observed phenomenon, the unit of measurement, the timestamp, and the coordinates. The RDF triples of an observation are associated with the same subject. Such observations share the same objects in a certain group of properties, i.e., they match star patterns composed of these properties and objects. When the number of subject entities or properties in these star patterns is large, the size of the RDF graph and query processing are negatively impacted; we refer to these star patterns as frequent star patterns.
    This thesis addresses the problem of identifying frequent star patterns in RDF graphs and develops computational methods to identify them and generate a factorized RDF graph in which the number of frequent star patterns is minimized. Furthermore, we apply these factorized RDF representations to historical semantic sensor data described using the SSN ontology and present tabular representations of factorized semantic sensor data in order to exploit Big Data frameworks. In addition, this thesis devises a knowledge-driven approach named DESERT that is able to on-Demand factorizE and Semantically Enrich stReam daTa. We evaluate the performance of the proposed techniques on several RDF graph benchmarks. The outcomes show that our techniques effectively and efficiently detect frequent star patterns, and that the RDF graph size can be reduced by up to 66.56% while the data represented in the original RDF graph is preserved. Moreover, the compact representations reduce the number of RDF triples by at least 53.25% in historical observational data and up to 94.34% in observational data streams. Additionally, query evaluation over historical data reduces query execution time by up to three orders of magnitude. In observational data streams, the size of the data required to answer a query is reduced by 92.53%, reducing the memory required to answer the queries. These results provide evidence that IoT data can be efficiently represented using the proposed compact representations, thus reducing the negative impact that semantic annotations may have on IoT data management.
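The factorization idea can be sketched in miniature: subjects whose star of (property, object) pairs is identical are linked to one surrogate node that carries the shared triples once. The surrogate naming, the `sharesStar` link, and the support threshold are illustrative assumptions, not the thesis representation:

```python
from collections import defaultdict

# Hypothetical sketch of star-pattern factorization (graph, link name, and
# support threshold are illustrative, not the thesis representation).

def factorize(triples, min_support=2):
    stars = defaultdict(set)
    for s, p, o in triples:
        stars[s].add((p, o))
    groups = defaultdict(list)  # identical star pattern -> its subjects
    for s, star in stars.items():
        groups[frozenset(star)].append(s)
    out = []
    for i, (star, subjects) in enumerate(groups.items()):
        if len(subjects) >= min_support:  # frequent star: factor it out once
            node = f"_:star{i}"
            out += [(s, "sharesStar", node) for s in subjects]
            out += [(node, p, o) for p, o in sorted(star)]
        else:
            out += [(s, p, o) for s in subjects for p, o in sorted(star)]
    return out

# Three observations redundantly reporting the same humidity reading.
obs = [(f"obs{i}", p, o) for i in (1, 2, 3)
       for p, o in [("humidity", "80"), ("unit", "percent")]]
compact = factorize(obs)
print(len(obs), "->", len(compact))  # 6 -> 5
```

With n subjects sharing an m-property star, the factorized form stores n + m triples instead of n × m, which is where the reported compression comes from.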

  • Endris, Kemele M.: Federated Query Processing over Heterogeneous Data Sources in a Semantic Data Lake. PhD Thesis. 2020-03-03
    Advisor(s): Prof. Dr. Sören Auer, Prof. Dr. Maria-Esther Vidal
    Examiners: Prof. Dr. Sören Auer, Prof. Dr. Jens Lehmann
    Abstract

    Data provides the basis for emerging scientific and interdisciplinary data-centric applications with the potential of improving the quality of life for citizens. Big Data plays an important role in promoting both manufacturing and scientific development through industrial digitization and emerging interdisciplinary research. Open data initiatives have encouraged the publication of Big Data by exploiting the decentralized nature of the Web, allowing for the availability of heterogeneous data generated and maintained by autonomous data providers. Consequently, the growing volume of data consumed by different applications raises the need for effective data integration approaches able to process large volumes of data represented in different formats, schemas, and models, which may also include sensitive data, e.g., financial transactions, medical procedures, or personal data. Data Lakes are composed of heterogeneous data sources kept in their original formats, which reduces the overhead of materialized data integration. Query processing over Data Lakes requires the semantic description of the data collected from heterogeneous sources. A Data Lake with such semantic annotations is referred to as a Semantic Data Lake. Transforming Big Data into actionable knowledge demands novel and scalable techniques for enabling not only Big Data ingestion and curation into the Semantic Data Lake, but also efficient large-scale semantic data integration, exploration, and discovery. Federated query processing techniques utilize source descriptions to identify relevant data sources and to find efficient execution plans that minimize total execution time and maximize the completeness of answers. Existing federated query processing engines employ coarse-grained description models in which the semantics encoded in the data sources are ignored.
    Such descriptions may lead to the erroneous selection of data sources for a query and to the unnecessary retrieval of data, thus affecting the performance of the query processing engine. In this thesis, we address the problem of federated query processing over heterogeneous data sources in a Semantic Data Lake. First, we tackle the challenge of knowledge representation and propose a novel source description model, RDF Molecule Templates (RDF-MTs), which describe the knowledge available in a Semantic Data Lake in terms of an abstract description of entities belonging to the same semantic concept. Then, we propose a technique for data source selection and query decomposition, the MULDER approach, and query planning and optimization techniques, Ontario, which exploit the characteristics of heterogeneous data sources described using RDF-MTs and provide uniform access to them. We then address the challenge of enforcing the privacy and access control requirements imposed by data providers. We introduce a privacy-aware federated query technique, BOUNCER, able to enforce privacy and access control regulations during query processing over data sources in a Semantic Data Lake. In particular, BOUNCER exploits RDF-MT-based source descriptions to express privacy and access control policies as well as to enforce them automatically during source selection, query decomposition, and planning. Furthermore, BOUNCER implements query decomposition and optimization techniques able to identify query plans over data sources that not only contain the entities relevant to a query but are also regulated by policies that allow access to these entities. Finally, we tackle the problem of interest-based update propagation and co-evolution of data sources, and present a novel approach for interest-based RDF update propagation that consistently maintains full or partial replications of large datasets and deals with co-evolution.
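The role of fine-grained source descriptions in source selection can be sketched as follows: a source is considered for a triple pattern only if its RDF-MT-style description covers the pattern's class and predicate, so semantically irrelevant sources are never contacted. The description encoding and the example capabilities are illustrative assumptions, not the MULDER/Ontario implementation:

```python
# Hypothetical sketch of description-based source selection
# (the description encoding and capabilities are illustrative).

def select_sources(pattern, descriptions):
    """pattern: (class, predicate); descriptions: source -> set of
    (class, predicate) pairs the source can answer."""
    cls, pred = pattern
    return sorted(src for src, caps in descriptions.items()
                  if (cls, pred) in caps)

descriptions = {
    "drugbank": {("Drug", "interactsWith"), ("Drug", "label")},
    "dbpedia":  {("Drug", "label"), ("City", "population")},
}
print(select_sources(("Drug", "interactsWith"), descriptions))  # ['drugbank']
print(select_sources(("Drug", "label"), descriptions))  # ['dbpedia', 'drugbank']
```

A coarse-grained description (e.g., predicate only, without the semantic concept) would select both sources for every `Drug` pattern and trigger the unnecessary retrievals the abstract describes.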

  • Rohde, Philipp D.: Query Optimization Techniques For Scaling Up To Data Variety. Master Thesis. 2019-07-08
    Supervisor(s): M.Sc. Kemele M. Endris
    Evaluators: Prof. Dr. Sören Auer, Prof. Dr. Maria-Esther Vidal
    Abstract

    Even though Data Lakes are efficient in terms of data storage, they increase the complexity of query processing; this can lead to expensive query execution. Hence, novel techniques for generating query execution plans are needed. These techniques must be able to exploit the main characteristics of Data Lakes. Ontario is a federated query engine capable of processing queries over heterogeneous data sources. Ontario uses source descriptions based on RDF Molecule Templates, i.e., an abstract description of the properties belonging to the entities in the unified schema of the data in the Data Lake. This thesis proposes new heuristics tailored to the problem of query processing over heterogeneous data sources, including heuristics specifically designed for certain data models. The proposed heuristics are integrated into the Ontario query optimizer. Ontario is compared to state-of-the-art RDF query engines in order to study the overhead introduced by considering heterogeneity during query processing. The results of the empirical evaluation suggest that there is no significant overhead when considering heterogeneity. Furthermore, the baseline version of Ontario is compared to two different sets of additional heuristics, i.e., heuristics specifically designed for certain data models and heuristics that do not consider the data model. The analysis of the obtained experimental results shows that source-specific heuristics are able to improve query performance. Ontario's optimization techniques are able to generate effective and efficient query plans that can be executed over heterogeneous data sources in a Data Lake.
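One common planning heuristic of this kind can be sketched briefly: triple patterns answered by the same source are grouped into a single subquery that can be pushed down to that source, so joins remain only between groups. The pattern strings and source assignments are illustrative assumptions, not the heuristics evaluated in the thesis:

```python
from collections import defaultdict

# Hypothetical sketch of a source-aware planning heuristic
# (patterns and source assignments are illustrative).

def group_by_source(pattern_sources):
    """pattern_sources: triple pattern -> selected source. Patterns sharing
    a source form one pushed-down subquery; only inter-group joins remain."""
    subqueries = defaultdict(list)
    for pattern, source in pattern_sources.items():
        subqueries[source].append(pattern)
    return dict(subqueries)

plan = group_by_source({
    "?d a Drug":           "drugbank",
    "?d interactsWith ?x": "drugbank",
    "?d label ?l":         "dbpedia",
})
print(len(plan))  # 2 subqueries instead of 3 single-pattern requests
```

Fewer, larger subqueries mean fewer cross-source joins performed by the federated engine, which is the kind of saving the evaluated heuristics target.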
