Teaching

Prof. (Universidad Simón Bolívar) Dr. Maria-Esther Vidal teaches at the Faculty of Electrical Engineering and Computer Science of Leibniz Universität Hannover.
In addition, the research group supervises lab projects and theses.

Courses at Leibniz Universität Hannover

Summer 2020

Winter 2019/20

  • Data Science & Digital Libraries (Seminar)

Summer 2019

Theses and Dissertations

We offer Bachelor's and Master's theses on the following topics:
a) Knowledge graph creation and data integration
b) Constraint validation over knowledge graphs
c) Federated query processing over knowledge graphs
d) Explainable machine learning over knowledge graphs
e) Research data management and the FAIR principles
f) Knowledge extraction and text mining

Open:

  • Master's thesis: SHACL Engine
  • Master's thesis: Transparent Polystores

In progress:

  • Fernandes, Luis
  • Figuera, Mónica
  • He, Chenyu
  • Iglesias, Enrique

Completed:

  • Hanasoge, Supreetha: Efficiently Identifying Top k Similar Entities. Master's thesis. 2021-02-10
    Supervisor(s): Prof. Dr. Maria-Esther Vidal
    Evaluators: Prof. Dr. Sören Auer, Prof. Dr. Maria-Esther Vidal
    Abstract

    With the rapid growth of genomic studies, more and more successful research is being produced that integrates tools and technologies from interdisciplinary sciences. Computational biology, or bioinformatics, is one such field that successfully applies computational tools to capture and transcribe biological data. Specifically, in genomic studies, the detection and analysis of co-occurring mutations is a leading area of study. Concurrently, in recent years, computer science and information technology have seen increased interest in the areas of association analysis and co-occurrence computation. The traditional method of finding the top similar entities involves examining every possible pair of entities, which leads to prohibitive quadratic time complexity. Most existing approaches also require a similarity measure and a threshold beforehand to retrieve the top similar entities, and these parameters are not always easy to tune. An adaptive method can therefore have broader applications for identifying the most similar pairs of mutations (or entities in general). This thesis presents an algorithm to efficiently identify the top k similar genetic variants using co-occurrence as the similarity measure. Our approach uses an upper-bound condition to iteratively prune the search space, tackling the quadratic complexity. The empirical evaluations illustrate the behavior of the proposed methods in terms of execution time and accuracy, particularly on large datasets. The experimental studies also explore the impact of parameters such as the input size and k on the execution time of top-k approaches. The outcomes suggest that systematically pruning the search space using an adaptive threshold condition speeds up the identification of the top similar pairs of entities.
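
    The core pruning idea can be sketched in a few lines of Python (a minimal illustration over assumed inputs, not the thesis's implementation): entities are scanned in decreasing order of occurrence count, and since the co-occurrence of a pair can never exceed the smaller of its two occurrence counts, the scan stops as soon as this upper bound can no longer beat the current k-th best score.

    import heapq

    def top_k_similar(occurrences, k):
        # occurrences: entity -> set of samples in which the entity occurs
        entities = sorted(occurrences, key=lambda e: len(occurrences[e]), reverse=True)
        top = []  # min-heap of (score, pair); top[0][0] is the current k-th best score
        for i, a in enumerate(entities):
            # No pair involving a (or any later entity) can beat the k-th best.
            if len(top) == k and len(occurrences[a]) <= top[0][0]:
                break
            for b in entities[i + 1:]:
                bound = min(len(occurrences[a]), len(occurrences[b]))
                if len(top) == k and bound <= top[0][0]:
                    break  # remaining candidates have even smaller bounds
                score = len(occurrences[a] & occurrences[b])  # co-occurrence count
                if len(top) < k:
                    heapq.heappush(top, (score, (a, b)))
                elif score > top[0][0]:
                    heapq.heapreplace(top, (score, (a, b)))
        return sorted(top, reverse=True)

    # Hypothetical input: mutations mapped to the samples they occur in.
    occ = {"m1": {1, 2, 3}, "m2": {2, 3, 4}, "m3": {3}, "m4": {1, 2, 3, 4}}
    print(top_k_similar(occ, k=2))  # -> [(3, ('m4', 'm2')), (3, ('m4', 'm1'))]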

  • Torabinejad, Mohammad: Normalization Techniques for Improving the Performance of Knowledge Graph Creation Pipelines. Master's thesis. 2020-09-25
    Supervisor(s): Prof. Dr. Maria-Esther Vidal, Samaneh Jozashoori
    Evaluators: Prof. Dr. Sören Auer, Prof. Dr. Maria-Esther Vidal
    Abstract

    With the rapid growth of data on the Web, the demand for discovering information within data and, consequently, for exploiting knowledge graphs is rising steadily. Data integration systems can be of great help in meeting this demand, as they transform data from various sources and of different volumes. To this end, a data integration system utilizes mapping rules, specified in a language like RML, to integrate data collected from various data sources into a knowledge graph. However, large data sources may suffer from various data quality issues, redundancy being one of them. The Semantic Web community has therefore contributed techniques to Knowledge Engineering for creating knowledge graphs efficiently. The thesis reported in this document tackles the creation of knowledge graphs in the presence of data sources with redundant data, and proposes a novel normalization theory to solve this problem. This theory covers not only the characteristics of the data sources but also the mapping rules used to integrate the data sources into a knowledge graph. Based on this theory, three normal forms are proposed, together with an algorithm for transforming mapping rules and data sources into these normal forms. The proposed approach’s performance is evaluated in different testbeds composed of real-world and synthetic data. The observed results suggest that the proposed techniques can dramatically reduce the execution time of knowledge graph creation. Therefore, this thesis’s normalization theory contributes to the repertoire of tools that facilitate the creation of knowledge graphs at scale.
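
    As a minimal illustration of why redundancy hurts knowledge graph creation (a sketch over assumed inputs, not the thesis's normal forms): if a mapping rule only references a subset of a source's attributes, projecting the source onto those attributes and deduplicating lets the rule fire once per distinct value combination instead of once per redundant row.

    def project_and_deduplicate(rows, used_attributes):
        # Keep only the attributes the mapping rule references and drop
        # duplicate combinations, so redundant rows yield no duplicate work.
        seen = set()
        for row in rows:
            key = tuple(row[a] for a in used_attributes)
            if key not in seen:
                seen.add(key)
                yield dict(zip(used_attributes, key))

    # A hypothetical source with redundant rows; the (RML-style) rule below
    # only uses 'patient' and 'mutation' to emit ex:hasMutation triples.
    rows = [
        {"patient": "p1", "mutation": "m1", "lab": "A"},
        {"patient": "p1", "mutation": "m1", "lab": "B"},  # redundant for this rule
        {"patient": "p2", "mutation": "m1", "lab": "A"},
    ]
    for r in project_and_deduplicate(rows, ("patient", "mutation")):
        print(f'ex:{r["patient"]} ex:hasMutation ex:{r["mutation"]} .')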

  • Karim, Farah: Compact Semantic Representations of Observational Data. Dissertation. 2020-03-18
    Advisor(s): Prof. Dr. Sören Auer, Prof. Dr. Maria-Esther Vidal
    Examiners: Prof. Dr. Sören Auer, Prof. Vojtěch Svátek
    Abstract

    The Internet of Things (IoT) concept has been widely adopted in several domains to enable devices to interact with each other and perform certain tasks. IoT devices encompass different concepts, e.g., sensors, programs, computers, and actuators. IoT devices observe their surroundings to collect information and communicate with each other in order to perform mutual tasks. These devices continuously generate observational data streams, which become historical data when the observations are stored. Due to the increase in the number of IoT devices, a large amount of streaming and historical observational data is being produced. Moreover, several ontologies, like the Semantic Sensor Network (SSN) Ontology, have been proposed for the semantic annotation of observational data, either streams or historical data. The Resource Description Framework (RDF) is a widely adopted data model for semantically describing datasets. Semantic annotation provides a shared understanding for processing and analyzing observational data. However, adding semantics further increases the data size, especially when the observation values are redundantly sensed by several devices. For example, several sensors can generate observations indicating the same value for relative humidity at a given timestamp and city. This situation can be represented in an RDF graph using four RDF triples, where observations are represented as triples that describe the observed phenomenon, the unit of measurement, the timestamp, and the coordinates. The RDF triples of an observation are associated with the same subject. Such observations share the same objects in a certain group of properties, i.e., they match star patterns composed of these properties and objects. If the number of subject entities or properties in these star patterns is large, the size of the RDF graph and query processing are negatively impacted; we refer to these star patterns as frequent star patterns. This thesis addresses the problem of identifying frequent star patterns in RDF graphs and develops computational methods to detect frequent star patterns and generate a factorized RDF graph in which the number of frequent star patterns is minimized. Furthermore, we apply these factorized RDF representations to historical semantic sensor data described using the SSN ontology and present tabular representations of factorized semantic sensor data in order to exploit Big Data frameworks. In addition, this thesis devises a knowledge-driven approach named DESERT that is able to on-Demand factorizE and Semantically Enrich stReam daTa. We evaluate the performance of our proposed techniques on several RDF graph benchmarks. The outcomes show that our techniques are able to effectively and efficiently detect frequent star patterns, and that the RDF graph size can be reduced by up to 66.56% while the data represented in the original RDF graph is preserved. Moreover, the compact representations are able to reduce the number of RDF triples by at least 53.25% in historical observational data and up to 94.34% in observational data streams. Additionally, query evaluation over historical data shows that query execution time is reduced by up to three orders of magnitude. In observational data streams, the size of the data required to answer a query is reduced by 92.53%, lowering the memory requirements for answering the queries. These results provide evidence that IoT data can be efficiently represented using the proposed compact representations, thus reducing the negative impact that semantic annotations may have on IoT data management.
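
    The factorization idea can be sketched as follows (a toy illustration over assumed data, not the exact representation developed in the thesis): group triples by subject, detect star patterns shared by many subjects, and materialize each shared star only once via a surrogate node that the subjects link to.

    from collections import Counter, defaultdict

    def factorize(triples, min_support=2):
        stars = defaultdict(set)  # subject -> set of (property, object) pairs
        for s, p, o in triples:
            stars[s].add((p, o))
        # Count identical star signatures; a full approach would also mine
        # frequent sub-patterns, this sketch only factors identical stars.
        signatures = Counter(frozenset(star) for star in stars.values())
        factored, surrogate = [], {}
        for s, star in stars.items():
            sig = frozenset(star)
            if signatures[sig] >= min_support:
                # Link the subject to a surrogate node standing for the star.
                node = surrogate.setdefault(sig, f"_:star{len(surrogate)}")
                factored.append((s, "ex:hasObservationValues", node))
            else:
                factored.extend((s, p, o) for p, o in star)
        for sig, node in surrogate.items():  # emit each shared star once
            factored.extend((node, p, o) for p, o in sig)
        return factored

    # Two sensors redundantly reporting the same humidity observation.
    triples = [(s, p, o)
               for s in ("ex:obs1", "ex:obs2")
               for p, o in [("ex:value", "0.7"), ("ex:unit", "ex:Percent"),
                            ("ex:time", "t1"), ("ex:city", "ex:Hannover")]]
    print(len(factorize(triples)))  # 6 triples instead of 8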

  • Endris, Kemele M.: Federated Query Processing over Heterogeneous Data Sources in a Semantic Data Lake. Dissertation. 2020-03-03
    Advisor(s): Prof. Dr. Sören Auer, Prof. Dr. Maria-Esther Vidal
    Examiners: Prof. Dr. Sören Auer, Prof. Dr. Jens Lehmann
    Abstract

    Data provides the basis for emerging scientific and interdisciplinary data-centric applications with the potential of improving the quality of life for citizens. Big Data plays an important role in promoting both manufacturing and scientific development through industrial digitization and emerging interdisciplinary research. Open data initiatives have encouraged the publication of Big Data by exploiting the decentralized nature of the Web, allowing for the availability of heterogeneous data generated and maintained by autonomous data providers. Consequently, the growing volume of data consumed by different applications raises the need for effective data integration approaches able to process large volumes of data represented in different formats, schemas, and models, which may also include sensitive data, e.g., financial transactions, medical procedures, or personal data. Data Lakes are composed of heterogeneous data sources kept in their original format, which reduces the overhead of materialized data integration. Query processing over Data Lakes requires the semantic description of data collected from heterogeneous data sources. A Data Lake with such semantic annotations is referred to as a Semantic Data Lake. Transforming Big Data into actionable knowledge demands novel and scalable techniques for enabling not only Big Data ingestion and curation in the Semantic Data Lake, but also efficient large-scale semantic data integration, exploration, and discovery. Federated query processing techniques utilize source descriptions to find relevant data sources and efficient execution plans that minimize the total execution time and maximize the completeness of answers. Existing federated query processing engines employ a coarse-grained description model in which the semantics encoded in the data sources are ignored. Such descriptions may lead to the erroneous selection of data sources for a query and to the unnecessary retrieval of data, thus affecting the performance of the query processing engine. In this thesis, we address the problem of federated query processing against heterogeneous data sources in a Semantic Data Lake. First, we tackle the challenge of knowledge representation and propose a novel source description model, RDF Molecule Templates, that describes the knowledge available in a Semantic Data Lake. RDF Molecule Templates (RDF-MTs) describe data sources in terms of an abstract description of entities belonging to the same semantic concept. Then, we propose a technique for data source selection and query decomposition, the MULDER approach, and query planning and optimization techniques, Ontario, that exploit the characteristics of heterogeneous data sources described using RDF-MTs and provide uniform access to heterogeneous data sources. We then address the challenge of enforcing privacy and access control requirements imposed by data providers. We introduce a privacy-aware federated query technique, BOUNCER, able to enforce privacy and access control regulations during query processing over data sources in a Semantic Data Lake. In particular, BOUNCER exploits RDF-MT-based source descriptions in order to express privacy and access control policies, as well as to enforce them automatically during source selection, query decomposition, and planning. Furthermore, BOUNCER implements query decomposition and optimization techniques able to identify query plans over data sources that not only contain the relevant entities to answer a query, but are also regulated by policies that allow for accessing these relevant entities. Finally, we tackle the problem of interest-based update propagation and the co-evolution of data sources. We present a novel approach for interest-based RDF update propagation that consistently maintains full or partial replications of large datasets and deals with co-evolution.
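
    The role source descriptions play in source selection can be sketched as follows (simplified, assumed descriptions; actual RDF-MTs additionally capture the semantic class of the entities and the links between sources): a source is only selected for a star-shaped subquery if its description covers all of the subquery's properties.

    def select_sources(star_properties, source_descriptions):
        # Select the sources whose description covers every property of one
        # star-shaped subquery; non-matching sources are pruned up front.
        needed = set(star_properties)
        return [src for src, props in source_descriptions.items()
                if needed <= props]

    # A hypothetical Semantic Data Lake with three heterogeneous sources.
    descriptions = {
        "drugs_rdf":   {"ex:name", "ex:interactsWith"},
        "trials_csv":  {"ex:name", "ex:enrolledIn"},
        "genes_mongo": {"ex:symbol", "ex:locatedOn"},
    }
    # Star pattern: ?d ex:name ?n . ?d ex:interactsWith ?x .
    print(select_sources(["ex:name", "ex:interactsWith"], descriptions))
    # -> ['drugs_rdf']; the other sources are never contacted.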

  • Rohde, Philipp D.: Query Optimization Techniques For Scaling Up To Data Variety. Master's thesis. 2019-07-08
    Supervisor(s): M.Sc. Kemele M. Endris
    Evaluators: Prof. Dr. Sören Auer, Prof. Dr. Maria-Esther Vidal
    Abstract

    Even though Data Lakes are efficient in terms of data storage, they increase the complexity of query processing; this can lead to expensive query execution. Hence, novel techniques for generating query execution plans are needed, and those techniques have to be able to exploit the main characteristics of Data Lakes. Ontario is a federated query engine capable of processing queries over heterogeneous data sources. Ontario uses source descriptions based on RDF Molecule Templates, i.e., an abstract description of the properties belonging to the entities in the unified schema of the data in the Data Lake. This thesis proposes new heuristics tailored to the problem of query processing over heterogeneous data sources, including heuristics specifically designed for certain data models. The proposed heuristics are integrated into the Ontario query optimizer. Ontario is compared to state-of-the-art RDF query engines in order to study the overhead introduced by considering heterogeneity during query processing. The results of the empirical evaluation suggest that there is no significant overhead when considering heterogeneity. Furthermore, the baseline version of Ontario is compared to two different sets of additional heuristics, i.e., heuristics specifically designed for certain data models and heuristics that do not consider the data model. The analysis of the obtained experimental results shows that source-specific heuristics are able to improve query performance. Ontario's optimization techniques are able to generate effective and efficient query plans that can be executed over heterogeneous data sources in a Data Lake.
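
    One family of such source-specific heuristics can be illustrated as follows (a toy example over assumed plan structures, not Ontario's actual optimizer): if two subqueries are answered by the same source and that source can evaluate joins itself (e.g., a SPARQL endpoint), the join is pushed down to the source; across heterogeneous sources, the engine joins the partial results instead.

    def plan_join(left, right):
        # left/right: subqueries annotated with the source that answers them
        # and that source's data model.
        if left["source"] == right["source"] and left["model"] == "sparql":
            # Same SPARQL endpoint: let the source evaluate the join.
            return {"op": "pushed_join", "source": left["source"],
                    "parts": [left, right]}
        # Heterogeneous or join-incapable sources: fetch both result sets
        # and join them inside the federated engine.
        return {"op": "engine_hash_join", "parts": [left, right]}

    sq1 = {"source": "dbpedia", "model": "sparql", "pattern": "?s ex:p ?o"}
    sq2 = {"source": "drugs_csv", "model": "csv", "pattern": "?s ex:q ?v"}
    print(plan_join(sq1, sq2)["op"])  # -> engine_hash_join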

Feedback