Teaching
Prof. Dr. Maria-Esther Vidal teaches Bachelor and Master courses at the Faculty of Electrical Engineering and Computer Science at Leibniz Universität Hannover. Additionally, Prof. Vidal and the members of her research team supervise laboratory and final theses.
Courses at Leibniz Universität Hannover
Winter 2022/23
- Scientific Data Management and Knowledge Graph (Seminar)
Summer 2022
- Scientific Data Management (Seminar)
Summer 2020
- Introduction to Database Systems (Lecture, Exercises)
Winter 2019/20
- Data Science & Digital Libraries (Seminar)
Summer 2019
- Knowledge Engineering and Semantic Web (Lecture, Exercises)
Courses in BIOMEDAS
Summer 2022
- Scientific Database Programming (Lecture, Exercises)
Summer 2021
- Responsible Data Management (Lecture, Exercises)
Winter 2020/21
- Introduction to Scientific Databases (Lecture, Exercises)
Theses and Dissertations
Open topics:
A list of open topics can be found on the subpage Open Theses.
In progress:
- Gercke, Julian
- He, Chenyu
Completed:
- Gercke, Julian: Supporting Explainable AI on Semantic Constraint Validation. Master Thesis. 2022-07-12
Supervisor(s): M.Sc. Philipp D. Rohde, Prof. Dr. Maria-Esther Vidal
Evaluators: Prof. Dr. Maria-Esther Vidal, Prof. Dr. Sören Auer
Abstract: A rising number of knowledge graphs is being published through various sources. This enormous amount of linked data strives to give entities a semantic context. Using SHACL, the entities can be validated with respect to their context. At the same time, the increasing usage of AI models in productive systems comes with great responsibility in various areas. Predictive models like linear regression, logistic regression, and tree-based models are still frequently used, as their simple structure allows for interpretability. However, explaining models includes verifying whether the model makes predictions based on human constraints or scientific facts. This work proposes to use the semantic context of the entities in knowledge graphs to validate predictive models with respect to user-defined constraints, thereby providing a theoretical framework for a model-agnostic validation engine based on SHACL. In a second step, the model validation results are summarized for the case of a decision tree and visualized coherently with the model. Finally, the performance of the framework is evaluated based on a Python implementation.
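As an illustration of the kind of SHACL validation such an engine builds on, the minimal sketch below checks an RDF data graph against user-defined shapes. It assumes the pySHACL library; the file names are hypothetical placeholders and the sketch does not reproduce the thesis' model-agnostic engine.

```python
# Minimal sketch: validate an RDF data graph against user-defined SHACL shapes.
# Assumes the pySHACL library; file names are hypothetical placeholders.
from rdflib import Graph
from pyshacl import validate

data_graph = Graph().parse("entities.ttl", format="turtle")       # entities with semantic context
shapes_graph = Graph().parse("constraints.ttl", format="turtle")  # user-defined constraints

conforms, report_graph, report_text = validate(
    data_graph,
    shacl_graph=shapes_graph,
    inference="rdfs",      # enrich the data graph before validation
    abort_on_first=False,  # collect all violations, not only the first one
)

print("Conforms:", conforms)
print(report_text)
```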
- Alom, Hany: A Library for Visualizing SHACL over Knowledge Graphs. Master Thesis. 2022-04-06
Supervisor(s): M.Sc. Philipp D. Rohde
Evaluators: Prof. Dr. Maria-Esther Vidal, Prof. Dr. Sören Auer
Abstract: In a data-driven world, the amount of data currently collected and processed is perhaps the most spectacular result of the digital revolution, and the range of available possibilities has grown and will continue to grow. The Web is full of documents for humans to read, and with the Semantic Web, data can also be understood by machines. The W3C standardized RDF to represent the Web of data as modeled entities and their relations. SHACL was then introduced to express constraints over RDF knowledge graphs as a network of shapes. SHACL networks are usually presented in textual formats. This thesis focuses on visualizing SHACL networks in a 3D space while providing many features for the user to manipulate the graph and retrieve the desired information. Thus, SHACLViewer is presented as a framework for SHACL visualization. In addition, the impact of various parameters like network size, topology, and density is studied. During the study, execution times for different functions are computed; they include loading time, expanding the graph, and highlighting a shape. The observed results reveal the characteristics of the SHACL networks that affect the performance and scalability of SHACLViewer.
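The sketch below only illustrates how the inter-shape network underlying such a visualization could be derived from a shapes file before handing it to a rendering layer. It assumes rdflib, uses a hypothetical file name, and is not the SHACLViewer implementation itself.

```python
# Sketch: derive the inter-shape network from a SHACL shapes file so it can be
# passed to a visualization layer. Assumes rdflib; "shapes.ttl" is hypothetical.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

SH = Namespace("http://www.w3.org/ns/shacl#")

g = Graph().parse("shapes.ttl", format="turtle")

nodes = set(g.subjects(RDF.type, SH.NodeShape))
edges = set()
for shape in nodes:
    for prop in g.objects(shape, SH.property):
        for target in g.objects(prop, SH.node):  # sh:node references another node shape
            edges.add((shape, target))

print(f"{len(nodes)} node shapes, {len(edges)} inter-shape references")
```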
- Fernandes, Luis: Effectively Unveiling Skills in Linked Data. Master Thesis. 2021-09-22
Supervisor(s): Prof. Dr. Maria-Esther Vidal, M.Sc. Ahmad Sakor
Evaluators: Prof. Dr. Jens Lehmann, Prof. Dr. Maria-Esther Vidal
Abstract: In recent times, the ingenuity of humanity and its population has grown considerably, and so have the technology and the data we produce. With each new technology, new skills are necessary to develop it, which also increases the demand from institutions and companies for people who have these new skills. Knowledge graphs are one of those technologies that have opened an endless number of applications. We aim at identifying new skills within Linked Data. To carry out this task, we propose an approach called DiSkill that allows extracting good representations of the nodes of these knowledge graphs and classifying their elements appropriately. This work builds on others that have extracted sets of representations for knowledge graphs. However, in all of them, such models have always been generated considering the entire graph. Our work, on the other hand, applies a strategy that makes use of specific domain knowledge as a first step. DiSkill resorts to an entity linking engine to recognize entities related to the background knowledge in existing knowledge graphs (e.g., DBpedia and Wikidata). Next, DiSkill creates the latent representations of the graph generated by the subgraphs reachable from the linked entities. Lastly, DiSkill relies on existing predictive models and the latent representations to predict which of the entities in the knowledge graph correspond to skills. As part of this thesis, we also evaluate specific configurations of the RDF2Vec strategy within our approach and report results through a set of metrics and across different classification models to judge their quality. We also compare DiSkill with the original RDF2Vec work and demonstrate the considerable improvements that our strategy provides.
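As a rough illustration of the final classification step, the sketch below trains a classifier on precomputed latent entity representations (e.g., RDF2Vec vectors). It assumes scikit-learn and hypothetical embedding and label files; it does not reproduce DiSkill's actual pipeline.

```python
# Sketch of the classification step: given latent entity representations and
# skill/non-skill labels, train a predictive model. Assumes scikit-learn;
# the input files are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X = np.load("entity_embeddings.npy")   # one latent vector per entity (hypothetical file)
y = np.load("entity_is_skill.npy")     # 1 if the entity is a skill, 0 otherwise

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```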
- Iglesias, Enrique: Data Structures for Knowledge Graph Creation at Scale. Master Thesis. 2021-05-12
Supervisor(s): M.Sc. Samaneh Jozashoori, M.Sc. David Chaves-Fraga, Prof. Dr. Maria-Esther Vidal
Evaluators: Prof. Dr. Jens Lehmann, Prof. Dr. Sören Auer
Abstract: Data has grown exponentially in the last years, and knowledge graphs have gained momentum as data structures to integrate heterogeneous data. This explosion of data has created many opportunities to develop innovative technologies. Still, it brings attention to the lack of standardization for making data available, raising questions about interoperability and data quality. Data complexities like large volume, heterogeneity, and high duplicate rates affect knowledge graph creation. This thesis addresses these issues to scale up knowledge graph creation guided by the RDF Mapping Language (RML). This thesis is built on the assumption that the amount of memory required to create a knowledge graph and the strategies utilized to execute the RML mapping rules impact the efficiency of a knowledge graph creation process. We propose a two-fold solution to address these two sources of complexity. First, RML mapping rules are reordered so that the most selective mapping rules are evaluated first while non-selective rules are considered at the end. As a result, the number of triples that are kept in main memory is reduced. In a second step, an RDF compression strategy and novel operators are made available. They avoid the generation of duplicate RDF triples and reduce the number of comparisons between mapping rules during the execution of RML operators. We empirically evaluate the performance of our proposed solution against various testbeds with diverse configurations of data volume, duplicate rates, and heterogeneity. Observed results suggest that our approach optimizes execution times and memory usage when compared with the state of the art. Moreover, these outcomes provide evidence of the crucial role of data structures and execution strategies in the scalability of knowledge graph creation processes using declarative mapping languages.
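The rule-reordering idea can be sketched in a few lines: mapping rules are sorted by an estimated selectivity so that the most selective ones run first and fewer intermediate triples stay in main memory. The data structures and the selectivity estimate below are hypothetical simplifications, not the thesis' implementation.

```python
# Sketch of the rule-ordering idea: evaluate the most selective RML mapping rules
# first so fewer intermediate triples are kept in main memory. Selectivity is
# estimated here as the fraction of source rows a rule actually maps (hypothetical).
def estimated_selectivity(rule, source_rows):
    """Fraction of rows for which the rule produces a triple (lower = more selective)."""
    matched = sum(1 for row in source_rows if rule["matches"](row))
    return matched / max(len(source_rows), 1)

def order_rules(rules, source_rows):
    return sorted(rules, key=lambda r: estimated_selectivity(r, source_rows))

# Usage: rules whose condition filters out most rows are executed first.
rules = [
    {"name": "all_persons",  "matches": lambda row: True},
    {"name": "rare_disease", "matches": lambda row: row.get("disease") == "rare"},
]
rows = [{"disease": "rare"}, {"disease": "common"}, {"disease": "common"}]
print([r["name"] for r in order_rules(rules, rows)])   # rare_disease comes first
```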
- Figuera, Mónica: Efficiently Validating Integrity Constraints in SHACL. Master Thesis. 2021-05-05
Supervisor(s): Prof. Dr. Maria-Esther Vidal, M.Sc. Philipp D. Rohde, Dr. Diego Collarana
Evaluators: Prof. Dr. Jens Lehmann, Prof. Dr. Maria-Esther Vidal
Abstract: We study fundamental aspects related to the validation of schemas defined under the Shapes Constraint Language (SHACL) specification, the W3C recommendation for declaratively representing integrity constraints over RDF knowledge graphs and the reference language in industrial consortia such as the International Data Spaces initiative (IDS). Existing SHACL engines enable the identification of entities that violate integrity constraints. Nevertheless, these approaches do not scale up in the presence of large volumes of data to effectively identify invalidations. We address the problem of efficiently validating integrity constraints in SHACL. To this end, we propose Trav-SHACL, an engine that performs data quality assessment in minimal time by identifying heuristic-based strategies to traverse a shapes schema and applying optimization techniques to the schema. Our key contributions include (i) Trav-SHACL, a SHACL engine capable of planning the traversal and execution of a shapes schema in a way that invalid entities are detected early and needless validations are minimized. Trav-SHACL reorders the shapes in a shapes schema for efficient validation and rewrites target and constraint queries for fast detection of invalid entities; (ii) the empirical evaluation of Trav-SHACL on 27 testbeds over the well-known Lehigh University Benchmark (LUBM) executed against knowledge graphs of up to 34M triples. Our experimental results suggest that Trav-SHACL exhibits consistently high performance and reduces validation time by a factor of up to 33.65 compared to the state of the art.
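The traversal-ordering idea can be illustrated with a small sketch that orders shapes by how many other shapes they reference, so highly constrained shapes are validated early and invalidations propagate quickly. The schema representation is hypothetical; Trav-SHACL's actual heuristics and query rewriting are considerably more involved.

```python
# Sketch of the traversal-ordering idea: visit the shapes schema so that shapes
# with many inter-shape references are validated first. The dictionary below is
# a hypothetical stand-in for a real shapes schema.
def traversal_order(schema):
    """Order shapes by descending number of references to other shapes."""
    return sorted(schema, key=lambda shape: len(schema[shape]), reverse=True)

schema = {                       # shape -> shapes it references via constraints
    "Professor": ["University", "Course"],
    "University": [],
    "Course": ["University"],
}
print(traversal_order(schema))   # ['Professor', 'Course', 'University']
```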
- Hanasoge, Supreetha: Efficiently Identifying Top k Similar Entities. Master Thesis. 2021-02-10
Supervisor(s): Prof. Dr. Maria-Esther Vidal
Evaluators: Prof. Dr. Sören Auer, Prof. Dr. Maria-Esther Vidal
Abstract: With the rapid growth in genomic studies, more and more successful research is being produced that integrates tools and technologies from interdisciplinary sciences. Computational biology, or bioinformatics, is one such field that successfully applies computational tools to capture and transcribe biological data. Specifically, in genomic studies, the detection and analysis of co-occurring mutations is a leading area of study. Concurrently, in recent years, computer science and information technology have seen an increased interest in the areas of association analysis and co-occurrence computation. The traditional method of finding the top similar entities involves examining every possible pair of entities, which leads to prohibitive quadratic time complexity. Most of the existing approaches also require a similarity measure and a threshold beforehand to retrieve the top similar entities. These parameters are not always easy to tune. Heuristically, an adaptive method can have broader applications for identifying the top most similar pairs of mutations (or entities in general). This thesis presents an algorithm to efficiently identify the top-k similar genetic variants using co-occurrence as the similarity measure. Our approach uses an upper bound condition to prune the search space iteratively and tackle the quadratic complexity. The empirical evaluations illustrate the behavior of the proposed methods in terms of execution time and accuracy, particularly on large datasets. The experimental studies also explore the impact of parameters like input size and k on the execution time of top-k approaches. The study outcomes suggest that systematically pruning the search space using an adaptive threshold condition optimizes the process of identifying the top similar pairs of entities.
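The pruning idea can be sketched as follows: since the co-occurrence of two entities can never exceed the smaller of their individual counts, candidate pairs are examined in decreasing order of this bound and the search stops once the bound cannot beat the current k-th best score. The toy data below is hypothetical, and the thesis' adaptive threshold is more elaborate than this sketch.

```python
# Sketch of the top-k idea: min(count(a), count(b)) upper-bounds the co-occurrence
# of a pair, so pairs are examined in descending order of this bound and the search
# stops once the bound drops below the k-th best score found so far.
import heapq
from collections import Counter

def top_k_cooccurring(samples, k):
    """Top-k entity pairs by co-occurrence, pruning with the min-count upper bound."""
    counts = Counter(e for s in samples for e in set(s))
    entities = sorted(counts, key=counts.get, reverse=True)
    top = []  # min-heap of (co-occurrence, pair)
    for j in range(1, len(entities)):
        bound = counts[entities[j]]          # upper bound for every pair ending at j
        if len(top) == k and bound <= top[0][0]:
            break                            # no remaining pair can beat the k-th best
        for i in range(j):
            a, b = entities[i], entities[j]
            co = sum(1 for s in samples if a in s and b in s)
            heapq.heappush(top, (co, (a, b)))
            if len(top) > k:
                heapq.heappop(top)
    return sorted(top, reverse=True)

# Hypothetical toy data: each set is the mutations observed in one sample.
samples = [{"m1", "m2", "m3"}, {"m1", "m2"}, {"m1", "m3"}, {"m2", "m3"}]
print(top_k_cooccurring(samples, k=2))
```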
- Torabinejad, Mohammad: Normalization Techniques for Improving the Performance of Knowledge Graph Creation Pipelines. Master Thesis. 2020-09-25
Supervisor(s): Prof. Dr. Maria-Esther Vidal, M.Sc. Samaneh Jozashoori
Evaluators: Prof. Dr. Sören Auer, Prof. Dr. Maria-Esther Vidal
Abstract: With the rapid growth of data within the web, the demand for discovering information within data and, consequently, for exploiting knowledge graphs rises much more than we might think. Data integration systems can be of great help to meet this demand in that they offer the transformation of data from various sources and with different volumes. To this end, a data integration system takes advantage of mapping rules – specified in a language like RML – to integrate data collected from various data sources into a knowledge graph. However, large data sources may suffer from various data quality issues, redundancy being one of them. Regarding this, the Semantic Web community contributes to Knowledge Engineering with techniques to create knowledge graphs efficiently. The thesis reported in this document tackles the creation of knowledge graphs in the presence of data sources with redundant data, and a novel normalization theory is proposed to solve this problem. This theory covers not only the characteristics of the data sources but also the mapping rules used to integrate the data sources into a knowledge graph. Based on this, three normal forms are proposed, along with an algorithm for transforming mapping rules and data sources into these normal forms. The proposed approach's performance is evaluated in different testbeds composed of real-world data and synthetic data. The observed results suggest that the proposed techniques can dramatically reduce the execution time of knowledge graph creation. Therefore, this thesis's normalization theory contributes to the repertoire of tools that facilitate the creation of knowledge graphs at scale.
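One effect of such a normalization can be sketched in a few lines: before a mapping rule is executed, the source is projected to the attributes the rule actually uses and deduplicated, so each distinct value combination is translated into RDF only once. The sketch assumes pandas, and the file and column names are hypothetical; it is not the algorithm proposed in the thesis.

```python
# Sketch: project a redundant data source to the attributes referenced by one
# mapping rule and drop duplicates, so each distinct combination is mapped once.
# Assumes pandas; file and column names are hypothetical placeholders.
import pandas as pd

source = pd.read_csv("patients.csv")              # possibly highly redundant source
rule_attributes = ["patient_id", "diagnosis"]     # attributes referenced by one mapping rule

normalized = source[rule_attributes].drop_duplicates()
print(f"rows to map: {len(normalized)} (was {len(source)})")
```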
- Karim, Farah: Compact Semantic Representations of Observational Data. PhD Thesis. 2020-03-18
Advisor(s): Prof. Dr. Sören Auer, Prof. Dr. Maria-Esther Vidal
Examiners: Prof. Dr. Sören Auer, Prof. Vojtěch Svátek
Abstract: The Internet of Things (IoT) concept has been widely adopted in several domains to enable devices to interact with each other and perform certain tasks. IoT devices encompass different concepts, e.g., sensors, programs, computers, and actuators. IoT devices observe their surroundings to collect information and communicate with each other in order to perform mutual tasks. These devices continuously generate observational data streams, which become historical data when these observations are stored. Due to the increase in the number of IoT devices, a large amount of streaming and historical observational data is being produced. Moreover, several ontologies, like the Semantic Sensor Network (SSN) Ontology, have been proposed for the semantic annotation of observational data, either streams or historical. The Resource Description Framework (RDF) is a widely adopted data model to semantically describe such datasets. Semantic annotation provides a shared understanding for the processing and analysis of observational data. However, adding semantics further increases the data size, especially when the observation values are redundantly sensed by several devices. For example, several sensors can generate observations indicating the same value for relative humidity at a given timestamp and city. This situation can be represented in an RDF graph using four RDF triples, where observations are represented as triples that describe the observed phenomenon, the unit of measurement, the timestamp, and the coordinates. The RDF triples of an observation are associated with the same subject. Such observations share the same objects in a certain group of properties, i.e., they match star patterns composed of these properties and objects. In case the number of these subject entities or properties in these star patterns is large, the size of the RDF graph and query processing are negatively impacted; we refer to these star patterns as frequent star patterns. This thesis addresses the problem of identifying frequent star patterns in RDF graphs and develops computational methods to identify frequent star patterns and generate a factorized RDF graph where the number of frequent star patterns is minimized. Furthermore, we apply these factorized RDF representations over historical semantic sensor data described using the SSN ontology and present tabular-based representations of factorized semantic sensor data in order to exploit Big Data frameworks. In addition, this thesis devises a knowledge-driven approach named DESERT that is able to on-Demand factorizE and Semantically Enrich stReam daTa. We evaluate the performance of our proposed techniques on several RDF graph benchmarks. The outcomes show that our techniques are able to effectively and efficiently detect frequent star patterns, and the RDF graph size can be reduced by up to 66.56% while the data represented in the original RDF graph is preserved. Moreover, the compact representations are able to reduce the number of RDF triples by at least 53.25% in historical observational data and up to 94.34% in observational data streams. Additionally, query evaluation over historical data reduces query execution time by up to three orders of magnitude. In observational data streams, the size of the data required to answer a query is reduced by 92.53%, reducing the memory space requirements to answer the queries. These results provide evidence that IoT data can be efficiently represented using the proposed compact representations, thus reducing the negative impact that semantic annotations may have on IoT data management.
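The detection step can be sketched as follows: subjects are grouped by the set of property-object pairs they share, and stars shared by many subjects become candidates for factorization. The sketch assumes rdflib and a hypothetical input file; the algorithms proposed in the thesis are more refined.

```python
# Sketch of frequent star pattern detection: group subjects by the set of
# (property, object) pairs they share; star patterns shared by many subjects
# are candidates for factorization. Assumes rdflib; the file is hypothetical.
from collections import Counter
from rdflib import Graph

g = Graph().parse("observations.ttl", format="turtle")

stars = Counter()
for subject in set(g.subjects()):
    star = frozenset((p, o) for p, o in g.predicate_objects(subject))
    stars[star] += 1

for star, n_subjects in stars.most_common(3):
    print(n_subjects, "subjects share a star of", len(star), "property-object pairs")
```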
- Endris, Kemele M.: Federated Query Processing over Heterogeneous Data Sources in a Semantic Data Lake. PhD Thesis. 2020-03-03
Advisor(s): Prof. Dr. Sören Auer, Prof. Dr. Maria-Esther Vidal
Examiners: Prof. Dr. Sören Auer, Prof. Dr. Jens Lehmann
Abstract: Data provides the basis for emerging scientific and interdisciplinary data-centric applications with the potential of improving the quality of life for citizens. Big Data plays an important role in promoting both manufacturing and scientific development through industrial digitization and emerging interdisciplinary research. Open data initiatives have encouraged the publication of Big Data by exploiting the decentralized nature of the Web, allowing for the availability of heterogeneous data generated and maintained by autonomous data providers. Consequently, the growing volume of data consumed by different applications raises the need for effective data integration approaches able to process large volumes of data represented in different formats, schemas, and models, which may also include sensitive data, e.g., financial transactions, medical procedures, or personal data. Data Lakes are composed of heterogeneous data sources in their original format, which reduces the overhead of materialized data integration. Query processing over Data Lakes requires the semantic description of data collected from heterogeneous data sources. A Data Lake with such semantic annotations is referred to as a Semantic Data Lake. Transforming Big Data into actionable knowledge demands novel and scalable techniques for enabling not only Big Data ingestion and curation into the Semantic Data Lake, but also efficient large-scale semantic data integration, exploration, and discovery. Federated query processing techniques utilize source descriptions to find relevant data sources and efficient execution plans that minimize the total execution time and maximize the completeness of answers. Existing federated query processing engines employ a coarse-grained description model where the semantics encoded in data sources are ignored. Such descriptions may lead to the erroneous selection of data sources for a query and the unnecessary retrieval of data, thus affecting the performance of the query processing engine. In this thesis, we address the problem of federated query processing against heterogeneous data sources in a Semantic Data Lake. First, we tackle the challenge of knowledge representation and propose a novel source description model, RDF Molecule Templates, that describes the knowledge available in a Semantic Data Lake. RDF Molecule Templates (RDF-MTs) describe data sources in terms of an abstract description of entities belonging to the same semantic concept. Then, we propose a technique for data source selection and query decomposition, the MULDER approach, and query planning and optimization techniques, Ontario, that exploit the characteristics of heterogeneous data sources described using RDF-MTs and provide uniform access to heterogeneous data sources. We then address the challenge of enforcing privacy and access control requirements imposed by data providers. We introduce a privacy-aware federated query technique, BOUNCER, able to enforce privacy and access control regulations during query processing over data sources in a Semantic Data Lake. In particular, BOUNCER exploits RDF-MT-based source descriptions in order to express privacy and access control policies as well as to enforce them automatically during source selection, query decomposition, and planning. Furthermore, BOUNCER implements query decomposition and optimization techniques able to identify query plans over data sources that not only contain the relevant entities to answer a query but are also regulated by policies that allow access to these relevant entities. Finally, we tackle the problem of interest-based update propagation and the co-evolution of data sources. We present a novel approach for interest-based RDF update propagation that consistently maintains a full or partial replication of large datasets and deals with co-evolution.
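The role of RDF-MTs in source selection can be sketched with a toy example: each source is described by the semantic concepts it contains and their properties, and a triple pattern is routed only to sources whose descriptions mention its predicate. The source names and descriptions below are hypothetical placeholders, not the actual MULDER, Ontario, or BOUNCER implementations.

```python
# Sketch of RDF-MT-based source selection: a predicate is routed only to the
# sources whose molecule templates mention it. Descriptions are hypothetical.
rdf_mts = {
    "drugbank_endpoint": {"Drug": {"name", "interactsWith"}},
    "clinical_csv":      {"Patient": {"name", "diagnosedWith"}},
}

def select_sources(predicate, descriptions):
    return [
        source
        for source, molecules in descriptions.items()
        if any(predicate in properties for properties in molecules.values())
    ]

print(select_sources("interactsWith", rdf_mts))   # ['drugbank_endpoint']
```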
- Rohde, Philipp D.: Query Optimization Techniques For Scaling Up To Data Variety. Master Thesis. 2019-07-08
Supervisor(s): M.Sc. Kemele M. Endris
Evaluators: Prof. Dr. Sören Auer, Prof. Dr. Maria-Esther Vidal
Abstract: Even though Data Lakes are efficient in terms of data storage, they increase the complexity of query processing; this can lead to expensive query execution. Hence, novel techniques for generating query execution plans are demanded. Those techniques have to be able to exploit the main characteristics of Data Lakes. Ontario is a federated query engine capable of processing queries over heterogeneous data sources. Ontario uses source descriptions based on RDF Molecule Templates, i.e., an abstract description of the properties belonging to the entities in the unified schema of the data in the Data Lake. This thesis proposes new heuristics tailored to the problem of query processing over heterogeneous data sources, including heuristics specifically designed for certain data models. The proposed heuristics are integrated into the Ontario query optimizer. Ontario is compared to state-of-the-art RDF query engines in order to study the overhead introduced by considering heterogeneity during query processing. The results of the empirical evaluation suggest that there is no significant overhead when considering heterogeneity. Furthermore, the baseline version of Ontario is compared to two different sets of additional heuristics, i.e., heuristics specifically designed for certain data models and heuristics that do not consider the data model. The analysis of the obtained experimental results shows that source-specific heuristics are able to improve query performance. Ontario optimization techniques are able to generate effective and efficient query plans that can be executed over heterogeneous data sources in a Data Lake.
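One of the simpler heuristics of this kind can be sketched as follows: triple patterns assigned to the same source are grouped into a single subquery, so joins among them are pushed down to that source instead of being evaluated in the federated engine. The structures below are hypothetical, and the actual Ontario optimizer applies richer, data-model-specific rules.

```python
# Sketch of a push-down heuristic: group triple patterns by their assigned source
# so joins between them are evaluated at the source. Structures are hypothetical.
from collections import defaultdict

def group_by_source(assignments):
    """assignments: list of (triple_pattern, source) pairs."""
    subqueries = defaultdict(list)
    for pattern, source in assignments:
        subqueries[source].append(pattern)
    return dict(subqueries)

assignments = [
    ("?drug :name ?n",       "sql_source"),
    ("?drug :interacts ?d2", "sparql_endpoint"),
    ("?d2 :name ?n2",        "sparql_endpoint"),
]
print(group_by_source(assignments))
```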