Teaching

Prof. Dr. Maria-Esther Vidal teaches Bachelor's and Master's courses at the Faculty of Electrical Engineering and Computer Science at Leibniz Universität Hannover. Additionally, Prof. Vidal and the members of her research team supervise laboratory projects and final theses.

Courses at Leibniz Universität Hannover

Summer 2023

  • Scientific Data Management (Seminar)

Winter 2022/23

  • Scientific Data Management and Knowledge Graphs (Lecture, Exercises)

Summer 2022

  • Scientific Data Management (Seminar)

Summer 2020

  • Introduction to Database Systems (Lecture, Exercises)

Winter 2019/20

  • Data Science & Digital Libraries (Seminar)

Summer 2019

  • Knowledge Engineering and Semantic Web (Lecture, Exercises)

Courses in BIOMEDAS

Summer 2022

  • Scientific Database Programming (Lecture, Exercises)

Summer 2021

Theses and Dissertations

Open topics:

A list of open topics can be found on the subpage Open Theses.

Completed:

  • Deshar, Sohan: Efficient Symbolic Learning over Knowledge Graphs. Bachelor Thesis. 2023-12-19
    Supervisor(s): M.Sc. Disha Purohit
    Evaluators: Prof. Dr. Maria-Esther Vidal, Prof. Dr. Sören Auer
    Abstract

    Knowledge Graphs (KGs) are repositories of structured information. Inductive Logic Programming (ILP) can be applied over these KGs to mine logical rules, which can then be used to deduce new information and learn new facts from the KGs. Over the years, many algorithms have been developed for this purpose, almost all of which require the complete KG to be present in main memory at some point during their execution. With the increasing size of KGs, owing to improved knowledge extraction mechanisms, applying these algorithms locally is becoming less and less feasible. Due to their sheer size, many KGs do not even fit into the memory of ordinary computing devices. KGs can, however, be represented in RDF, making them structured and queryable via SPARQL, and thanks to software like OpenLink Virtuoso, they can be hosted on a server as SPARQL endpoints. In light of this fact, this thesis develops an algorithm that overcomes the memory bottleneck of current logical rule mining procedures by using SPARQL endpoints. To that end, the state-of-the-art algorithm AMIE (Galárraga et al., 2013) is taken as a reference to create a new algorithm that mines logical rules over KGs by querying the SPARQL endpoints on which they are hosted, effectively overcoming the aforementioned memory bottleneck and allowing rules to be mined (and, eventually, new information to be deduced) locally.
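
    A minimal sketch (not from the thesis) of the idea of computing rule statistics through an endpoint instead of loading the KG into memory: the support of a candidate rule is obtained with a single COUNT query. The endpoint, prefixes, and predicates are illustrative assumptions.

        # Hedged sketch: estimating the support of a candidate Horn rule
        #   spouse(X, Y) ^ child(X, Z) => child(Y, Z)
        # with one COUNT query against a SPARQL endpoint, so the KG never
        # has to be loaded into main memory. Endpoint and predicates are
        # illustrative, not taken from the thesis.
        from SPARQLWrapper import SPARQLWrapper, JSON

        ENDPOINT = "https://dbpedia.org/sparql"   # hypothetical choice of endpoint

        SUPPORT_QUERY = """
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT (COUNT(*) AS ?support) WHERE {
          ?x dbo:spouse ?y .
          ?x dbo:child  ?z .
          ?y dbo:child  ?z .
        }
        """

        def rule_support(endpoint: str, query: str) -> int:
            client = SPARQLWrapper(endpoint)
            client.setQuery(query)
            client.setReturnFormat(JSON)
            result = client.query().convert()
            return int(result["results"]["bindings"][0]["support"]["value"])

        if __name__ == "__main__":
            print("support:", rule_support(ENDPOINT, SUPPORT_QUERY))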

  • Azizi, Sepideh: Documenting Data Integration using Knowledge Graphs. Master Thesis. 2023-05-24
    Supervisor(s): Prof. Dr. Maria-Esther Vidal
    Evaluators: Prof. Dr. Maria-Esther Vidal, Prof. Dr. Sören Auer
    Abstract

    With the increasing volume of data on the Web and the proliferation of published knowledge graphs, there is a growing need for improved data management and information extraction. However, heterogeneity issues across data sources, i.e., various formats and systems, negatively impact efficient access, management, reuse, and analysis of the data. A data integration system (DIS) provides uniform access to heterogeneous data sources and their relationships; it offers a unified and comprehensive view of the data. DISs resort to mapping rules, expressed in declarative languages like RML, to align data from various sources to classes and properties defined in an ontology. This work defines a knowledge graph where data integration systems are represented as factual statements. The aim of this work is to provide the basis for the integrated analysis of data collected from heterogeneous data silos. The proposed knowledge graph is itself specified as a data integration system that integrates all data integration systems. The proposed solution includes a unified schema, which defines and explains the relationships between all elements in the data integration system DIS=⟨G, S, M, F⟩. The results suggest that factual statements from the proposed knowledge graph improve the understanding of the features that characterize knowledge graphs declaratively defined as data integration systems.
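
    A minimal sketch of how a data integration system DIS=⟨G, S, M, F⟩ could be described with factual RDF statements using rdflib; the vocabulary terms and identifiers are illustrative assumptions, not the unified schema proposed in the thesis.

        # Hedged sketch: representing the components of a data integration
        # system DIS = <G, S, M, F> as factual RDF statements with rdflib,
        # so the system itself becomes queryable. Vocabulary terms are
        # illustrative placeholders.
        from rdflib import Graph, Namespace, RDF

        EX = Namespace("http://example.org/dis/")
        g = Graph()

        dis = EX["DIS1"]
        g.add((dis, RDF.type, EX.DataIntegrationSystem))
        g.add((dis, EX.hasUnifiedSchema, EX["OncologyOntology"]))                     # G
        g.add((dis, EX.hasSource, EX["clinical_records_csv"]))                        # S
        g.add((dis, EX.hasMappingRule, EX["patients_to_Person"]))                     # M
        g.add((EX["patients_to_Person"], EX.usesFunction, EX["normalize_gene_id"]))   # F

        print(g.serialize(format="turtle"))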

  • Rivas, Ariam: Computational and Human-based methods for knowledge discovery over Knowledge Graphs. PhD Thesis. 2023-05-09
    Advisor(s): Prof. Dr. Maria-Esther Vidal
    Examiners: Prof. Dr. Maria-Esther Vidal, Prof. Dr. Sandra Geisler
    Abstract

    The modern world has evolved, accompanied by the huge exploitation of data and information. Daily, increasing volumes of data from various sources and formats are stored, making it challenging to manage and integrate them to discover new knowledge. The appropriate use of data in various sectors of society, such as education, healthcare, e-commerce, and industry, provides advantages for decision support in these areas. However, knowledge discovery becomes challenging since data may come from heterogeneous sources with important information hidden. Thus, new approaches that adapt to the new challenges of knowledge discovery in such heterogeneous data environments are required. The Semantic Web and knowledge graphs (KGs) are becoming increasingly relevant on the road to knowledge discovery. This thesis tackles the problem of knowledge discovery over KGs built from heterogeneous data sources. We provide a neuro-symbolic artificial intelligence system that integrates symbolic and sub-symbolic frameworks to exploit the semantics encoded in a KG and its structure. The symbolic system relies on existing approaches from deductive databases to make explicit the implicit knowledge encoded in a KG. The proposed deductive database DS can derive new statements for ego networks given an abstract target prediction; thus, DS minimizes data sparsity in KGs. In addition, the sub-symbolic system relies on knowledge graph embedding (KGE) models. KGE models are commonly applied in the KG completion task to represent the entities of a KG in a low-dimensional vector space. However, KGE models are known to suffer from data sparsity, and the symbolic system assists in overcoming this limitation. The proposed approach discovers knowledge given a target prediction in a KG and extracts unknown implicit information related to the target prediction. As a proof of concept, we have implemented the neuro-symbolic system on top of a KG for lung cancer to predict polypharmacy treatment effectiveness. The symbolic system implements a deductive system to deduce pharmacokinetic drug-drug interactions encoded as a set of rules in a Datalog program. Additionally, the sub-symbolic system predicts treatment effectiveness using a KGE model, which preserves the KG structure. An ablation study of the components of our approach is conducted, considering state-of-the-art KGE methods. The observed results provide evidence for the benefits of the neuro-symbolic integration of our approach, where the neuro-symbolic system for an abstract target prediction exhibits improved results. The enhancement occurs because the symbolic system increases the prediction capacity of the sub-symbolic system. Moreover, the proposed neuro-symbolic artificial intelligence system is evaluated in Industry 4.0 (I4.0), demonstrating its effectiveness in determining relatedness among standards and analyzing their properties to detect unknown relations in the I4.0 KG. The results achieved allow us to conclude that the proposed neuro-symbolic approach for an abstract target prediction improves the prediction capability of KGE models by minimizing data sparsity in KGs.
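
    A minimal sketch of the symbolic component's idea: one Datalog-style rule applied by forward chaining over a toy fact base to derive implicit drug-drug interaction statements. The predicates, drugs, and rule are illustrative placeholders, not the thesis' actual rules.

        # Hedged sketch of the symbolic component: a single Datalog-style rule
        #   interactsWith(A, B) :- inhibits(A, E), metabolizedBy(B, E)
        # applied by forward chaining over a toy fact base. Predicates and
        # drugs are illustrative placeholders.
        facts = {
            ("inhibits", "drugA", "CYP3A4"),
            ("metabolizedBy", "drugB", "CYP3A4"),
            ("metabolizedBy", "drugC", "CYP2D6"),
        }

        def derive_interactions(facts):
            """Forward-chain the rule once and return the derived statements."""
            inhibits = {(a, e) for p, a, e in facts if p == "inhibits"}
            metabolized = {(b, e) for p, b, e in facts if p == "metabolizedBy"}
            return {("interactsWith", a, b)
                    for a, e1 in inhibits
                    for b, e2 in metabolized
                    if e1 == e2 and a != b}

        print(derive_interactions(facts))   # {('interactsWith', 'drugA', 'drugB')}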

  • Sakor, Ahmad: Knowledge Extraction from Unstructured Data. PhD Thesis. 2023-05-05
    Advisor(s): Prof. Dr. Maria-Esther Vidal
    Examiners: Prof. Dr. Maria-Esther Vidal, Prof. Dr. Ricardo Usbeck
    Abstract

    Data availability is becoming more essential, considering the current growth of web-based data. The data available on the web are represented as unstructured, semi-structured, or structured data. In order to make web-based data available for Natural Language Processing or Data Mining tasks, the data need to be presented as machine-readable data in a structured format. Thus, techniques for addressing the problem of capturing knowledge from unstructured data sources are needed. Knowledge extraction methods are used by the research communities to address this problem; methods that are able to capture knowledge in natural language text and map the extracted knowledge to existing knowledge presented in knowledge graphs (KGs). These knowledge extraction methods include named entity recognition, named entity disambiguation, relation recognition, and relation linking. This thesis addresses the problem of extracting knowledge from unstructured data and discovering patterns in the extracted knowledge. We devise a rule-based approach for entity and relation recognition and linking. The defined approach effectively maps entities and relations within a text to their resources in a target KG. Additionally, it overcomes the challenges of recognizing and linking entities and relations to a specific KG by employing devised catalogs of linguistic and domain-specific rules that state the criteria to recognize entities in a sentence of a particular language, and a deductive database that encodes knowledge in community-maintained KGs. Moreover, we define a neuro-symbolic approach for knowledge extraction tasks in encyclopedic and domain-specific domains; it combines symbolic and sub-symbolic components to overcome the challenges of entity recognition and linking and the limited availability of training data, while maintaining the accuracy of recognizing and linking entities. Additionally, we present a context-aware framework for unveiling semantically related posts in a corpus; it is a knowledge-driven framework that retrieves associated posts effectively. We cast the problem of unveiling semantically related posts in a corpus into the Vertex Coloring Problem. We evaluate the performance of our techniques on several benchmarks related to various domains for knowledge extraction tasks. Furthermore, we apply these methods in real-world scenarios from national and international projects. The outcomes show that our techniques are able to effectively extract knowledge encoded in unstructured data and discover patterns over the extracted knowledge presented as machine-readable data. More importantly, the evaluation results provide evidence of the effectiveness of combining the reasoning capacity of symbolic frameworks with the power of pattern recognition and classification of sub-symbolic models.

  • Sawischa, Sammy: Bias Assessments of Benchmarks for Link Predictions over Knowledge Graphs. Bachelor Thesis. 2023-05-03
    Supervisor(s): M.Sc. Mayra Russo
    Evaluators: Prof. Dr. Maria-Esther Vidal, Jun.-Prof. Dr.-Ing. Alexander Dockhorn
    Abstract

    Link prediction (LP) aims to tackle the challenge of predicting new facts by reasoning over a knowledge graph (KG). Different machine learning architectures have been proposed to solve the task of LP, several of them competing for better performance on a few de-facto benchmarks. The problem of this thesis is the characterization of LP datasets regarding their structural bias properties and their effects on attained performance results. We provide a domain-agnostic framework that assesses the network topology, test leakage bias and sample selection bias in LP datasets. The framework includes SPARQL queries that can be reused in the explorative data analysis of KGs for uncovering unusual patterns. We finally apply our framework to characterize 7 common benchmarks used for assessing the task of LP. In conducted experiments, we use a trained TransE model to show how the two bias types affect prediction results. Our analysis shows problematic patterns in most of the benchmark datasets. Especially critical are the findings regarding the state-of-the-art benchmarks FB15k-237, WN18RR and YAGO3-10.
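
    A minimal sketch of one of the bias checks described above: counting test triples whose entity pair already occurs in the training split in either direction, a simple form of test leakage. The file names and tab-separated triple format are assumptions.

        # Hedged sketch of a simple test-leakage check on a link-prediction
        # benchmark split: count test triples (h, r, t) whose entity pair
        # already appears in the training split in either direction.
        # File names and the tab-separated format are assumptions.
        def load_triples(path):
            with open(path, encoding="utf-8") as f:
                return [tuple(line.rstrip("\n").split("\t")) for line in f if line.strip()]

        def leakage_ratio(train_path="train.txt", test_path="test.txt"):
            train_pairs = set()
            for h, r, t in load_triples(train_path):
                train_pairs.add((h, t))
                train_pairs.add((t, h))
            test = load_triples(test_path)
            leaked = sum(1 for h, r, t in test if (h, t) in train_pairs)
            return leaked / len(test) if test else 0.0

        print(f"leaked test triples: {leakage_ratio():.2%}")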

  • Jozashoori, Samaneh: Semantic Data Integration and Knowledge Graph Creation at Scale. PhD Thesis. 2023-04-13
    Advisor(s): Prof. Dr. Maria-Esther Vidal
    Examiners: Prof. Dr. Maria-Esther Vidal, Prof. Dr. Ernestina Menasalvas
    Abstract

    Contrary to data, knowledge is often abstract. Concrete knowledge can be achieved through the inclusion of semantics in data models, highlighting the role of data integration. The massive growth of data in recent years has increased the demand for scaling up data management techniques; materialized data integration, a.k.a. knowledge graph creation, falls into that category. In this thesis, we investigate efficient methods and techniques for materializing data integration. We formalize the process of materializing data integration and formally define the characteristics of a materialized data integration system that merges data operators and sources. Owing to this formalism, both layers of data integration, i.e., data-level and schema-level integration, are formalized in the context of mapping assertions. We explore optimization opportunities for improving the materialization of data integration systems. We identify three angles, including intra- and inter-mapping assertions, from which the materialization can be improved. Accordingly, we propose source-based, mapping-based, and inter-mapping assertion groups of optimization techniques. We utilize our proposed techniques in three real-world projects and illustrate how applying these optimization techniques contributes to meeting the objectives of the mentioned projects. Furthermore, we study the parameters impacting the performance of the materialization of data integration. Relying on reported parameters and presumably impacting parameters, we build four groups of testbeds. We empirically study the performance of these testbeds in the presence and absence of our proposed techniques, in terms of execution time. We observe that the savings can be up to 75%. Lastly, we contribute to facilitating the process of defining declarative data integration systems. We propose two data operation function signatures in the Function Ontology (FnO). The first set of functions is designed to perform the task of entity alignment by resorting to an entity and relation linking tool. The second library consists of domain-specific functions that align genomic entities by harmonizing their representations. Finally, we introduce a tool equipped with a user interface to facilitate the process of defining declarative mapping rules by allowing users to explore the data sources and the unified schema while defining their correspondences.
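
    A minimal sketch of a source-based optimization in the spirit described above: before executing mapping assertions, the source is projected to the referenced attributes and duplicate rows are removed. The column names, file, and toy assertion are illustrative assumptions.

        # Hedged sketch of a source-based optimization step: project the
        # source down to the attributes the mapping assertions reference and
        # drop duplicate rows, so each RDF triple is generated only once.
        # Column names and the example file are illustrative.
        import pandas as pd

        def prepare_source(csv_path, referenced_columns):
            df = pd.read_csv(csv_path, usecols=referenced_columns)
            return df.drop_duplicates()

        def apply_assertion(df):
            """Toy mapping assertion: each row yields one rdf:type triple."""
            template = "<http://example.org/patient/{pid}>"
            return [(template.format(pid=row.patient_id),
                     "rdf:type",
                     "<http://example.org/ontology/Patient>")
                    for row in df.itertuples(index=False)]

        rows = prepare_source("patients.csv", ["patient_id"])   # hypothetical file
        print(len(apply_assertion(rows)), "triples generated")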

  • Nassimi, Sahar: Entity Linking for the Biomedical Domain. Master Thesis. 2023-03-27
    Supervisor(s): M.Sc. Ahmad Sakor
    Evaluators: Prof. Dr. Maria-Esther Vidal, Prof. Dr. Sören Auer
    Abstract

    Entity linking is the process of detecting mentions of different concepts in text documents and linking them to canonical entities in a target lexicon. However, one of the biggest issues in entity linking is the ambiguity of entity names. Ambiguity is an issue that many text mining tools have yet to address, since different names can refer to the same entity and the same mention can refer to different entities. For instance, search engines that rely on heuristic string matches frequently return irrelevant results, because they are unable to satisfactorily resolve ambiguity. Thus, resolving named entity ambiguity is a crucial step in entity linking. To solve the problem of ambiguity, this work proposes a heuristic method for entity recognition and entity linking over a biomedical knowledge graph that takes into account the semantic similarity of entities in the knowledge graph. Named entity recognition (NER), relation extraction (RE), and relation linking (RL) make up a conventional entity linking (EL) pipeline. We use accuracy as the evaluation metric in this thesis: for each identified relation or entity, the solution consists of identifying the correct one and matching it to its corresponding unique CUI in the knowledge base. Because KBs contain a substantial number of relations and entities, each with only one natural language label, the second phase is directly dependent on the accuracy of the first. The framework developed in this thesis enables the extraction of relations and entities from text and their mapping to the associated CUI in the UMLS knowledge base. This approach derives a new representation of the knowledge base that lends itself to easy comparison. Our idea for selecting the best candidates is to build a graph of relations and determine the shortest path distance using a ranking approach. We test our suggested approach on two well-known benchmarks in the biomedical field and show that our method exceeds the search engine's top result, providing around 4% more accuracy. In general, when it comes to fine-tuning, we notice that entity linking has subjective characteristics, and modifications may be required depending on the task at hand. The performance of the framework is evaluated based on a Python implementation.
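
    A minimal sketch of the candidate-ranking idea: candidate CUIs are ordered by their shortest-path distance to entities already linked in the context, using networkx. The toy graph and CUIs are placeholders, not UMLS content.

        # Hedged sketch of ranking candidate CUIs by graph proximity: the
        # candidate with the smallest shortest-path distance to the entities
        # already linked in the sentence is ranked first. The toy graph and
        # CUIs are placeholders, not UMLS content.
        import networkx as nx

        g = nx.Graph()
        g.add_edges_from([
            ("C0004057", "C0032961"),   # illustrative relation
            ("C0004057", "C0018681"),
            ("C0032961", "C0011849"),
        ])

        def rank_candidates(graph, context_cuis, candidate_cuis):
            def score(candidate):
                dists = [nx.shortest_path_length(graph, candidate, ctx)
                         for ctx in context_cuis
                         if graph.has_node(candidate) and graph.has_node(ctx)
                         and nx.has_path(graph, candidate, ctx)]
                return min(dists) if dists else float("inf")
            return sorted(candidate_cuis, key=score)

        # the candidate closer to the context entity comes first
        print(rank_candidates(g, ["C0018681"], ["C0011849", "C0004057"]))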

  • Gercke, Julian: Supporting Explainable AI on Semantic Constraint Validation. Master Thesis. 2022-07-12
    Supervisor(s): M.Sc. Philipp D. Rohde, Prof. Dr. Maria-Esther Vidal
    Evaluators: Prof. Dr. Maria-Esther Vidal, Prof. Dr. Sören Auer
    Abstract

    A rising number of knowledge graphs is being published through various sources. This enormous amount of linked data strives to give entities a semantic context. Using SHACL, the entities can be validated with respect to their context. On the other hand, the increasing use of AI models in production systems comes with great responsibility in various areas. Predictive models like linear regression, logistic regression, and tree-based models are still frequently used, as their simple structure allows for interpretability. However, explaining models includes verifying whether a model makes predictions based on human constraints or scientific facts. This work proposes to use the semantic context of the entities in knowledge graphs to validate predictive models with respect to user-defined constraints, thereby providing a theoretical framework for a model-agnostic validation engine based on SHACL. In a second step, the model validation results are summarized in the case of a decision tree and visualized model-coherently. Finally, the performance of the framework is evaluated based on a Python implementation.
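
    A minimal sketch of validating entities against a user-defined constraint with pySHACL; the shape and data are toy placeholders, not the constraints or the validation engine developed in the thesis.

        # Hedged sketch: validating entities against a user-defined SHACL
        # shape with pySHACL. Shape and data are minimal placeholders.
        from rdflib import Graph
        from pyshacl import validate

        shapes = Graph().parse(data="""
        @prefix sh: <http://www.w3.org/ns/shacl#> .
        @prefix ex: <http://example.org/> .
        ex:PatientShape a sh:NodeShape ;
            sh:targetClass ex:Patient ;
            sh:property [ sh:path ex:age ;
                          sh:minInclusive 0 ;
                          sh:maxInclusive 120 ] .
        """, format="turtle")

        data = Graph().parse(data="""
        @prefix ex: <http://example.org/> .
        ex:p1 a ex:Patient ; ex:age 42 .
        ex:p2 a ex:Patient ; ex:age 200 .
        """, format="turtle")

        conforms, report_graph, report_text = validate(data, shacl_graph=shapes)
        print(conforms)       # False: ex:p2 violates the age constraint
        print(report_text)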

  • Alom, Hany: A Library for Visualizing SHACL over Knowledge Graphs. Master Thesis. 2022-04-06
    Supervisor(s): M.Sc. Philipp D. Rohde
    Evaluators: Prof. Dr. Maria-Esther Vidal, Prof. Dr. Sören Auer
    Abstract

    In a data-driven world, the amount of data currently collected and processed is perhaps the most spectacular result of the digital revolution, and the range of possibilities available has grown and will continue to grow. The Web is full of documents for humans to read, and with the Semantic Web, data can also be understood by machines. The W3C standardized RDF to represent the Web of data as modeled entities and their relations. SHACL was then introduced to express constraints over RDF knowledge graphs as a network of shapes. SHACL networks are usually presented in textual formats. This thesis focuses on visualizing SHACL networks in a 3D space while providing many features for the user to manipulate the graph and obtain the desired information. Thus, SHACLViewer is presented as a framework for SHACL visualization. In addition, the impact of various parameters like network size, topology, and density is evaluated. During the study, execution times for different functions are computed; they include loading time, expanding the graph, and highlighting a shape. The observed results reveal the characteristics of the SHACL networks that affect the performance and scalability of SHACLViewer.

  • Fernandes, Luis: Effectively Unveiling Skills in Linked Data. Master Thesis. 2021-09-22
    Supervisor(s): Prof. Dr. Maria-Esther Vidal, M.Sc. Ahmad Sakor
    Evaluators: Prof. Dr. Jens Lehmann, Prof. Dr. Maria-Esther Vidal
    Abstract

    In recent times, the ingenuity of humanity and its population have grown considerably, and so have the technology and the data we produce. Each new technology requires new skills to develop it, which increases the demand from institutions and companies for people who have these skills. Knowledge graphs are one of those technologies, opening an endless number of applications. We aim at identifying new skills within Linked Data. To carry out this task, we propose an approach called DiSkill that extracts good representations of the nodes of these knowledge graphs and classifies their elements appropriately. This work builds on others that have extracted representations of knowledge graphs; however, in all of them, such models are generated considering the entire graph. Our work, on the other hand, applies a strategy that makes use of domain-specific knowledge as a first step. DiSkill resorts to an Entity Linking Engine to recognize entities related to the background knowledge in existing knowledge graphs (e.g., DBpedia and Wikidata). Next, DiSkill creates latent representations of the subgraphs reachable from the linked entities. Lastly, DiSkill relies on existing predictive models and the latent representations to predict which of the entities in the knowledge graph correspond to skills. As part of this thesis, we also evaluate specific configurations of the RDF2Vec strategy within our approach and report results through a set of metrics after executing different classification models to judge their quality. We also compare DiSkill with the original RDF2Vec work and demonstrate the considerable improvements that our strategy provides.

  • Iglesias, Enrique: Data Structures for Knowledge Graph Creation at Scale. Master Thesis. 2021-05-12
    Supervisor(s): M.Sc. Samaneh Jozashoori, M.Sc. David Chaves-Fraga, Prof. Dr. Maria-Esther Vidal
    Evaluators: Prof. Dr. Jens Lehmann, Prof. Dr. Sören Auer
    Abstract

    Data has grown exponentially in recent years, and knowledge graphs have gained momentum as data structures to integrate heterogeneous data. This explosion of data has created many opportunities to develop innovative technologies. Still, it brings attention to the lack of standardization for making data available, raising questions about interoperability and data quality. Data complexities like large volume, heterogeneity, and high duplicate rates affect knowledge graph creation. This thesis addresses these issues to scale up knowledge graph creation guided by the RDF Mapping Language (RML). The thesis is built on the assumption that the amount of memory required to create a knowledge graph and the strategies utilized to execute the RML mapping rules impact the efficiency of the knowledge graph creation process. We propose a two-fold solution to address these two sources of complexity. First, RML mapping rules are reordered so that the most selective mapping rules are evaluated first, while non-selective rules are considered at the end. As a result, the number of triples kept in main memory is reduced. In a second step, an RDF compression strategy and novel operators are made available. They avoid the generation of duplicated RDF triples and reduce the number of comparisons between mapping rules during the execution of RML operators. We empirically evaluate the performance of our proposed solution against various testbeds with diverse configurations of data volume, duplicate rates, and heterogeneity. Observed results suggest that our approach optimizes execution time and memory usage compared with the state of the art. Moreover, these outcomes provide evidence of the crucial role of data structures and execution strategies in the scalability of knowledge graph creation processes that use declarative mapping languages.
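
    A minimal sketch of the rule-reordering idea: mapping rules are ordered by an estimate of how many triples they will produce, so the most selective rules run first and fewer intermediate triples stay in main memory. The rule names and estimates are made up.

        # Hedged sketch of the rule-reordering idea: estimate how many triples
        # each RML mapping rule will produce from the size of its logical
        # source, then evaluate the most selective (smallest) rules first.
        # Numbers and rule names are illustrative.
        from collections import namedtuple

        Rule = namedtuple("Rule", ["name", "estimated_triples"])

        rules = [
            Rule("patients_to_Person",   1_200_000),
            Rule("tumor_stage_to_Stage",      5_000),
            Rule("biopsies_to_Sample",       80_000),
        ]

        def reorder_by_selectivity(rules):
            """Most selective rules (fewest estimated triples) first."""
            return sorted(rules, key=lambda r: r.estimated_triples)

        for rule in reorder_by_selectivity(rules):
            print(rule.name, rule.estimated_triples)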

  • Figuera, Mónica: Efficiently Validating Integrity Constraints in SHACL. Master Thesis. 2021-05-05
    Supervisor(s): Prof. Dr. Maria-Esther Vidal, M.Sc. Philipp D. Rohde, Dr. Diego Collarana
    Evaluators: Prof. Dr. Jens Lehmann, Prof. Dr. Maria-Esther Vidal
    Abstract

    We study fundamental aspects related to the validation of schemas defined under the Shapes Constraint Language (SHACL) specification, the W3C recommendation for declaratively representing integrity constraints over RDF knowledge graphs and the reference language in industrial consortia, such as the International Data Spaces initiative (IDS). Existing SHACL engines enable the identification of entities that violate integrity constraints. Nevertheless, these approaches do not scale up in the presence of large volumes of data to effectively identify invalidations. We address the problem of efficiently validating integrity constraints in SHACL. To this end, we propose Trav-SHACL, an engine that performs data quality assessment in minimal time by identifying heuristic-based strategies to traverse a shapes schema and by applying optimization techniques to the schema. Our key contributions include (i) Trav-SHACL, a SHACL engine capable of planning the traversal and execution of a shapes schema in a way that invalid entities are detected early and needless validations are minimized; Trav-SHACL reorders the shapes in a shapes schema for efficient validation and rewrites target and constraint queries for fast detection of invalid entities; and (ii) the empirical evaluation of Trav-SHACL on 27 testbeds over the well-known Lehigh University Benchmark (LUBM), executed against knowledge graphs of up to 34M triples. Our experimental results suggest that Trav-SHACL exhibits high performance and reduces validation time by a factor of up to 33.65 compared to the state of the art.

  • Hanasoge, Supreetha: Efficiently Identifying Top k Similar Entities. Master Thesis. 2021-02-10
    Supervisor(s): Prof. Dr. Maria-Esther Vidal
    Evaluators: Prof. Dr. Sören Auer, Prof. Dr. Maria-Esther Vidal
    Abstract

    With the rapid growth in genomic studies, more and more successful research is being produced that integrates tools and technologies from interdisciplinary sciences. Computational biology, or bioinformatics, is one such field that successfully applies computational tools to capture and transcribe biological data. Specifically, in genomic studies, the detection and analysis of co-occurring mutations is a leading area of study. Concurrently, in recent years, computer science and information technology have seen an increased interest in the areas of association analysis and co-occurrence computation. The traditional method of finding top similar entities involves examining every possible pair of entities, which leads to prohibitive quadratic time complexity. Most of the existing approaches also require a similarity measure and threshold beforehand to retrieve the top similar entities; these parameters are not always easy to tune. Heuristically, an adaptive method can have broader applications for identifying the most similar pairs of mutations (or entities in general). This thesis presents an algorithm to efficiently identify the top k similar genetic variants using co-occurrence as the similarity measure. Our approach uses an upper-bound condition to prune the search space iteratively, tackling the quadratic complexity. The empirical evaluations illustrate the behavior of the proposed method in terms of execution time and accuracy, particularly on large datasets. The experimental studies also explore the impact of various parameters, like input size and k, on the execution time of top-k approaches. The study outcomes suggest that systematic pruning of the search space using an adaptive threshold condition optimizes the process of identifying the top similar pairs of entities.
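
    A minimal sketch of a top-k co-occurrence search with an upper-bound pruning condition, in the spirit of the approach described above; the data, bound, and traversal order are illustrative simplifications.

        # Hedged sketch of top-k co-occurrence search with an upper-bound
        # pruning condition: a pair (a, b) can never co-occur more often than
        # min(count[a], count[b]), so pairs whose bound cannot beat the
        # current k-th best score are skipped without being counted.
        import heapq
        from collections import Counter
        from itertools import combinations

        samples = [                       # toy data: mutations per patient
            {"m1", "m2", "m3"}, {"m1", "m2"}, {"m2", "m3"}, {"m1", "m2", "m4"},
        ]

        def top_k_cooccurring(samples, k=2):
            counts = Counter(m for s in samples for m in s)
            candidates = sorted(combinations(counts, 2),
                                key=lambda p: min(counts[p[0]], counts[p[1]]),
                                reverse=True)
            heap = []                     # min-heap of (co-occurrence, pair)
            for a, b in candidates:
                bound = min(counts[a], counts[b])
                if len(heap) == k and bound <= heap[0][0]:
                    break                 # no remaining pair can enter the top k
                co = sum(1 for s in samples if a in s and b in s)
                heapq.heappush(heap, (co, (a, b)))
                if len(heap) > k:
                    heapq.heappop(heap)
            return sorted(heap, reverse=True)

        print(top_k_cooccurring(samples, k=2))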

  • Torabinejad, Mohammad: Normalization Techniques for Improving the Performance of Knowledge Graph Creation Pipelines. Master Thesis. 2020-09-25
    Supervisor(s): Prof. Dr. Maria-Esther Vidal, M.Sc. Samaneh Jozashoori
    Evaluators: Prof. Dr. Sören Auer, Prof. Dr. Maria-Esther Vidal
    Abstract

    With the rapid growth of data on the web, the demand for discovering information within data and, consequently, exploiting knowledge graphs rises much more than we might think. Data integration systems can be of great help to meet this demand, as they offer transformation of data from various sources and of different volumes. To this end, a data integration system utilizes mapping rules, specified in a language like RML, to integrate data collected from various data sources into a knowledge graph. However, large data sources may suffer from various data quality issues, redundancy being one of them. Regarding this, the Semantic Web community contributes to Knowledge Engineering with techniques to create knowledge graphs efficiently. The thesis reported in this document tackles the creation of knowledge graphs in the presence of data sources with redundant data, and a novel normalization theory is proposed to solve this problem. This theory covers not only the characteristics of the data sources but also the mapping rules used to integrate the data sources into a knowledge graph. Based on this theory, three normal forms are proposed, together with an algorithm for transforming mapping rules and data sources into these normal forms. The proposed approach's performance is evaluated on different testbeds composed of real-world data and synthetic data. The observed results suggest that the proposed techniques can dramatically reduce the execution time of knowledge graph creation. Therefore, this thesis's normalization theory contributes to the repertoire of tools that facilitate the creation of knowledge graphs at scale.

  • Karim, Farah: Compact Semantic Representations of Observational Data. PhD Thesis. 2020-03-18
    Advisor(s): Prof. Dr. Sören Auer, Prof. Dr. Maria-Esther Vidal
    Examiners: Prof. Dr. Sören Auer, Prof. Vojtěch Svátek
    Abstract

    The Internet of Things (IoT) concept has been widely adopted in several domains to enable devices to interact with each other and perform certain tasks. IoT devices encompass different concepts, e.g., sensors, programs, computers, and actuators. IoT devices observe their surroundings to collect information and communicate with each other in order to perform mutual tasks. These devices continuously generate observational data streams, which become historical data when the observations are stored. Due to the increase in the number of IoT devices, a large amount of streaming and historical observational data is being produced. Moreover, several ontologies, like the Semantic Sensor Network (SSN) ontology, have been proposed for the semantic annotation of observational data, either streams or historical. The Resource Description Framework (RDF) is a widely adopted data model to semantically describe datasets. Semantic annotation provides a shared understanding for processing and analyzing observational data. However, adding semantics further increases the data size, especially when observation values are redundantly sensed by several devices. For example, several sensors can generate observations indicating the same value for relative humidity at a given timestamp and city. This situation can be represented in an RDF graph using four RDF triples, where an observation is described by triples stating the observed phenomenon, the unit of measurement, the timestamp, and the coordinates. The RDF triples of an observation are associated with the same subject. Such observations share the same objects in a certain group of properties, i.e., they match star patterns composed of these properties and objects. If the number of subject entities or properties in these star patterns is large, the size of the RDF graph and query processing are negatively impacted; we refer to these star patterns as frequent star patterns. This thesis addresses the problem of identifying frequent star patterns in RDF graphs and develops computational methods to identify frequent star patterns and generate a factorized RDF graph in which the number of frequent star patterns is minimized. Furthermore, we apply these factorized RDF representations over historical semantic sensor data described using the SSN ontology and present tabular-based representations of factorized semantic sensor data in order to exploit Big Data frameworks. In addition, this thesis devises a knowledge-driven approach named DESERT that is able to on-Demand factorizE and Semantically Enrich stReam daTa. We evaluate the performance of our proposed techniques on several RDF graph benchmarks. The outcomes show that our techniques are able to effectively and efficiently detect frequent star patterns, and that the RDF graph size can be reduced by up to 66.56% while the data represented in the original RDF graph are preserved. Moreover, the compact representations reduce the number of RDF triples by at least 53.25% in historical observational data and up to 94.34% in observational data streams. Additionally, query evaluation over historical data reduces query execution time by up to three orders of magnitude. In observational data streams, the size of the data required to answer a query is reduced by 92.53%, reducing the memory space requirements to answer the queries. These results provide evidence that IoT data can be efficiently represented using the proposed compact representations, thus reducing the negative impact that semantic annotations may have on IoT data management.
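
    A minimal sketch of detecting a frequent star pattern with rdflib: subjects that share exactly the same (property, object) pairs are grouped, and groups with more than one subject are candidates for factorization. The observation data are illustrative.

        # Hedged sketch: finding subjects that share the same (property, object)
        # star pattern in an RDF graph with rdflib. Subjects matching a frequent
        # star pattern are the candidates a factorized representation would
        # collapse. Data is illustrative.
        from collections import defaultdict
        from rdflib import Graph

        g = Graph().parse(data="""
        @prefix ex: <http://example.org/> .
        ex:obs1 ex:observedProperty ex:RelativeHumidity ; ex:hasValue 80 ; ex:unit ex:Percent .
        ex:obs2 ex:observedProperty ex:RelativeHumidity ; ex:hasValue 80 ; ex:unit ex:Percent .
        ex:obs3 ex:observedProperty ex:Temperature      ; ex:hasValue 21 ; ex:unit ex:Celsius .
        """, format="turtle")

        stars = defaultdict(set)          # frozen (p, o) set -> subjects matching it
        for s in set(g.subjects()):
            stars[frozenset(g.predicate_objects(s))].add(s)

        for pattern, subjects in stars.items():
            if len(subjects) > 1:         # a frequent star pattern worth factorizing
                print(len(subjects), "subjects share", len(pattern), "triples")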

  • Endris, Kemele M.: Federated Query Processing over Heterogeneous Data Sources in a Semantic Data Lake. PhD Thesis. 2020-03-03
    Advisor(s): Prof. Dr. Sören Auer, Prof. Dr. Maria-Esther Vidal
    Examiners: Prof. Dr. Sören Auer, Prof. Dr. Jens Lehmann
    Abstract

    Data provides the basis for emerging scientific and interdisciplinary data-centric applications with the potential of improving the quality of life for citizens. Big Data plays an important role in promoting both manufacturing and scientific development through industrial digitization and emerging interdisciplinary research. Open data initiatives have encouraged the publication of Big Data by exploiting the decentralized nature of the Web, allowing for the availability of heterogeneous data generated and maintained by autonomous data providers. Consequently, the growing volume of data consumed by different applications raises the need for effective data integration approaches able to process large volumes of data represented in different formats, schemas, and models, which may also include sensitive data, e.g., financial transactions, medical procedures, or personal data. Data Lakes are composed of heterogeneous data sources kept in their original format, which reduces the overhead of materialized data integration. Query processing over Data Lakes requires the semantic description of data collected from heterogeneous data sources. A Data Lake with such semantic annotations is referred to as a Semantic Data Lake. Transforming Big Data into actionable knowledge demands novel and scalable techniques for enabling not only Big Data ingestion and curation into the Semantic Data Lake, but also efficient large-scale semantic data integration, exploration, and discovery. Federated query processing techniques utilize source descriptions to find relevant data sources and efficient execution plans that minimize the total execution time and maximize the completeness of answers. Existing federated query processing engines employ a coarse-grained description model in which the semantics encoded in data sources are ignored. Such descriptions may lead to the erroneous selection of data sources for a query and the unnecessary retrieval of data, thus affecting the performance of the query processing engine. In this thesis, we address the problem of federated query processing against heterogeneous data sources in a Semantic Data Lake. First, we tackle the challenge of knowledge representation and propose a novel source description model, RDF Molecule Templates, that describes the knowledge available in a Semantic Data Lake. RDF Molecule Templates (RDF-MTs) describe data sources in terms of an abstract description of entities belonging to the same semantic concept. Then, we propose a technique for data source selection and query decomposition, the MULDER approach, and query planning and optimization techniques, Ontario, that exploit the characteristics of heterogeneous data sources described using RDF-MTs and provide uniform access to heterogeneous data sources. We then address the challenge of enforcing privacy and access control requirements imposed by data providers. We introduce a privacy-aware federated query technique, BOUNCER, able to enforce privacy and access control regulations during query processing over data sources in a Semantic Data Lake. In particular, BOUNCER exploits RDF-MT-based source descriptions in order to express privacy and access control policies as well as to enforce them automatically during source selection, query decomposition, and planning. Furthermore, BOUNCER implements query decomposition and optimization techniques able to identify query plans over data sources that not only contain the relevant entities to answer a query, but are also regulated by policies that allow accessing these relevant entities. Finally, we tackle the problem of interest-based update propagation and co-evolution of data sources. We present a novel approach for interest-based RDF update propagation that consistently maintains a full or partial replication of large datasets and deals with co-evolution.
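
    A minimal sketch of RDF-MT-style source selection: each source is described by the predicates it can answer, and each triple pattern of a query is routed only to sources whose description covers its predicate. The descriptions and query are toy placeholders, not the actual MULDER or Ontario implementation.

        # Hedged sketch of source selection over source descriptions: a triple
        # pattern is assigned to the sources whose description contains its
        # predicate. Descriptions and the query are toy placeholders.
        SOURCE_DESCRIPTIONS = {
            "clinical_db":  {"ex:hasDiagnosis", "ex:hasAge"},
            "genomics_csv": {"ex:hasMutation", "ex:inGene"},
            "drugs_api":    {"ex:treatedWith", "ex:interactsWith"},
        }

        query_triple_patterns = [
            ("?patient", "ex:hasDiagnosis", "ex:LungCancer"),
            ("?patient", "ex:hasMutation", "?mutation"),
        ]

        def select_sources(patterns, descriptions):
            plan = {}
            for s, p, o in patterns:
                plan[(s, p, o)] = [src for src, preds in descriptions.items() if p in preds]
            return plan

        for pattern, sources in select_sources(query_triple_patterns, SOURCE_DESCRIPTIONS).items():
            print(pattern, "->", sources)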

  • Rohde, Philipp D.: Query Optimization Techniques For Scaling Up To Data Variety. Master Thesis. 2019-07-08
    Supervisor(s): M.Sc. Kemele M. Endris
    Evaluators: Prof. Dr. Sören Auer, Prof. Dr. Maria-Esther Vidal
    Abstract

    Even though Data Lakes are efficient in terms of data storage, they increase the complexity of query processing; this can lead to expensive query execution. Hence, novel techniques for generating query execution plans are demanded. Those techniques have to be able to exploit the main characteristics of Data Lakes. Ontario is a federated query engine capable of processing queries over heterogeneous data sources. Ontario uses source descriptions based on RDF Molecule Templates, i.e., an abstract description of the properties belonging to the entities in the unified schema of the data in the Data Lake. This thesis proposes new heuristics tailored to the problem of query processing over heterogeneous data sources, including heuristics specifically designed for certain data models. The proposed heuristics are integrated into the Ontario query optimizer. Ontario is compared to state-of-the-art RDF query engines in order to study the overhead introduced by considering heterogeneity during query processing. The results of the empirical evaluation suggest that there is no significant overhead when considering heterogeneity. Furthermore, the baseline version of Ontario is compared to two different sets of additional heuristics, i.e., heuristics specifically designed for certain data models and heuristics that do not consider the data model. The analysis of the obtained experimental results shows that source-specific heuristics are able to improve query performance. Ontario's optimization techniques are able to generate effective and efficient query plans that can be executed over heterogeneous data sources in a Data Lake.
