is Professor of Computer Science and head of the Database Systems and Data Mining Chair at LMU Munich.
His fundamental research on data mining and database technologies, with application domains in engineering, business, the life sciences, and the humanities, has yielded more than 300 scientific publications so far. He serves on many program committees and scientific boards and is co-chair of the LMU Data Science Lab, the ZD.B Innovation Lab, the Munich School of Data Science @Helmholtz, TUM and LMU (MuDS), and the Elite Master program in Data Science at LMU.
Allocation tasks represent a class of problems where a limited amount of resources must be allocated to a set of entities at each time step. Prominent examples of this task include portfolio optimization and distributing computational workloads across servers. Allocation tasks are typically bound by linear constraints describing practical requirements that have to be strictly fulfilled at all times. In portfolio optimization, for example, investors may be obligated to allocate less than 30% of the funds to a certain industrial sector in any investment period. Such constraints restrict the action space of allowed allocations in intricate ways, which makes learning a policy that avoids constraint violations difficult. In this paper, we propose a new method for constrained allocation tasks based on an autoregressive process that sequentially samples allocations for each entity. In addition, we introduce a novel de-biasing mechanism to counter the initial bias caused by sequential sampling. We demonstrate the superior performance of our approach compared to a variety of Constrained Reinforcement Learning (CRL) methods on three distinct constrained allocation tasks: portfolio optimization, computational workload distribution, and a synthetic allocation benchmark.
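To make the setting concrete, the following minimal numpy sketch sequentially samples one feasible allocation under per-entity caps. It is not the learned autoregressive policy from the paper and omits the de-biasing mechanism; the function name, cap values, and uniform sampling are illustrative assumptions only.

```python
import numpy as np

def sample_feasible_allocation(caps, rng=np.random.default_rng(0)):
    """Sequentially sample an allocation a with sum(a) == 1 and 0 <= a_i <= caps[i].

    Illustrative only: a real policy would condition each step on the state and
    the previously sampled entities; here each fraction is drawn uniformly from
    the interval that keeps the remaining budget allocatable.
    """
    caps = np.asarray(caps, dtype=float)
    assert caps.sum() >= 1.0, "constraints must admit at least one feasible allocation"
    n, remaining = len(caps), 1.0
    alloc = np.zeros(n)
    for i in range(n):
        tail_capacity = caps[i + 1:].sum()        # what later entities can still absorb
        lo = max(0.0, remaining - tail_capacity)  # must not leave an unallocatable rest
        hi = min(caps[i], remaining)
        alloc[i] = remaining if i == n - 1 else rng.uniform(lo, hi)
        remaining -= alloc[i]
    return alloc

caps = [0.3, 0.3, 0.5, 0.4]           # e.g., at most 30% per constrained sector
a = sample_feasible_allocation(caps)
print(a, a.sum())                      # fractions respect the caps and sum to 1
```

The interval [lo, hi] at each step is what keeps sequential sampling feasible: an entity may never take so much that the remaining entities cannot absorb the rest, nor so little that the rest exceeds their combined caps.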
Finding meaningful groups, i.e., clusters, in high-dimensional data such as images or texts without labeled data at hand is an important challenge in data mining. In recent years, deep clustering methods have achieved remarkable results in these tasks. However, most of these methods require the user to specify the number of clusters in advance. This is a major limitation since the number of clusters is typically unknown if labeled data is unavailable. Thus, an area of research has emerged that addresses this problem. Most of these approaches estimate the number of clusters separately from the clustering process. This results in a strong dependency of the clustering result on the quality of the initial embedding. Other approaches are tailored to specific clustering processes, making them hard to adapt to other scenarios. In this paper, we propose UNSEEN, a general framework that, starting from a given upper bound, is able to estimate the number of clusters. To the best of our knowledge, it is the first method that can be easily combined with various deep clustering algorithms. We demonstrate the applicability of our approach by combining UNSEEN with the popular deep clustering algorithms DCN, DEC, and DKM and verify its effectiveness through an extensive experimental evaluation on several image and tabular datasets. Moreover, we perform numerous ablations to analyze our approach and show the importance of its components.
Recent trends in Video Instance Segmentation (VIS) have seen a growing reliance on online methods to model complex and lengthy video sequences. However, the degradation of representation and noise accumulation of the online methods, especially during occlusion and abrupt changes, pose substantial challenges. Transformer-based query propagation provides promising directions at the cost of quadratic memory attention. However, they are susceptible to the degradation of instance features due to the above-mentioned challenges and suffer from cascading effects. The detection and rectification of such errors remain largely underexplored. To this end, we introduce GRAtt-VIS, Gated Residual Attention for Video Instance Segmentation. Firstly, we leverage a Gumbel-Softmax-based gate to detect possible errors in the current frame. Next, based on the gate activation, we rectify degraded features from its past representation. Such a residual configuration alleviates the need for dedicated memory and provides a continuous stream of relevant instance features. Secondly, we propose a novel inter-instance interaction using gate activation as a mask for self-attention. This masking strategy dynamically restricts the unrepresentative instance queries in the self-attention and preserves vital information for long-term tracking. We refer to this novel combination of Gated Residual Connection and Masked Self-Attention as the GRAtt block, which can easily be integrated into the existing propagation-based framework. Further, GRAtt blocks significantly reduce the attention overhead and simplify dynamic temporal modeling. GRAtt-VIS achieves state-of-the-art performance on YouTube-VIS and the highly challenging OVIS dataset, significantly improving over previous methods.
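As a rough illustration of the two ideas named above (a Gumbel-Softmax gate that decides per instance query whether to trust the current frame, and using that gate to mask inter-instance self-attention), here is a small PyTorch sketch. It is not the official GRAtt-VIS block; the module structure, shapes, and the fallback-to-previous-frame rule are assumptions made for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedResidualAttention(nn.Module):
    """Minimal sketch: gated residual rectification plus masked self-attention."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.gate_logits = nn.Linear(dim, 2)   # two classes: keep current / fall back
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, curr, prev):
        # curr, prev: (batch, num_queries, dim) instance queries of the current
        # frame and the propagated queries from the previous frame.
        gate = F.gumbel_softmax(self.gate_logits(curr), tau=1.0, hard=True)  # (B, Q, 2)
        keep = gate[..., :1]                           # 1 -> trust the current frame
        rectified = keep * curr + (1.0 - keep) * prev  # gated residual rectification

        # Use the gate as a mask: queries that fell back are excluded from attention.
        pad_mask = keep.squeeze(-1) < 0.5              # True = ignore this query
        pad_mask = pad_mask & ~pad_mask.all(dim=1, keepdim=True)  # avoid all-masked rows
        out, _ = self.attn(rectified, rectified, rectified, key_padding_mask=pad_mask)
        return rectified + out

block = GatedResidualAttention(dim=64)
curr, prev = torch.randn(2, 10, 64), torch.randn(2, 10, 64)
print(block(curr, prev).shape)   # torch.Size([2, 10, 64])
```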
Large language models (LLMs) excel at retrieving information from lengthy text, but their vision-language counterparts (VLMs) face difficulties with hour-long videos, especially for temporal grounding. Specifically, these VLMs are constrained by frame limitations, often losing essential temporal details needed for accurate event localization in extended video content. We propose ReVisionLLM, a recursive vision-language model designed to locate events in hour-long videos. Inspired by human search strategies, our model initially targets broad segments of interest, progressively revising its focus to pinpoint exact temporal boundaries. Our model can seamlessly handle videos of vastly different lengths, from minutes to hours. We also introduce a hierarchical training strategy that starts with short clips to capture distinct events and progressively extends to longer videos. To our knowledge, ReVisionLLM is the first VLM capable of temporal grounding in hour-long videos, outperforming previous state-of-the-art methods across multiple datasets by a significant margin (+2.6% R1@0.1 on MAD).
Business processes from many domains like manufacturing, healthcare, or business administration suffer from different amounts of uncertainty concerning the execution of individual activities and their order of occurrence. As long as a process is not entirely serial, i.e., there are no forks or decisions to be made along the process execution, we are - in the absence of exhaustive domain knowledge - confronted with the question of whether and in what order activities should be executed or left out for a given case and a desired outcome. As the occurrence or non-occurrence of events has substantial implications for process key performance indicators like throughput times or scrap rate, there is ample need for assessing and modeling that process-inherent uncertainty. We propose a novel way of handling this uncertainty by leveraging the probabilistic mechanisms of Bayesian Networks to model processes from the structural and temporal information given in event log data, and we offer a comprehensive evaluation of uncertainty by modeling cases in their entirety. In a thorough analysis of well-established benchmark datasets, we show that our Process-aware Bayesian Network is capable of answering process queries concerned with any unknown process sequence regarding activities and/or attributes, enhancing the explainability of processes. Our method can infer execution probabilities of activities at different stages and can query probabilities of certain process outcomes. The key benefit of the Process-aware Query System over existing approaches is the ability to deliver probabilistic, case-diagnostic information about the execution of activities via Bayesian inference.
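As a much-simplified illustration of the kind of query such a model answers, the snippet below estimates first-order transition probabilities from a toy event log and scores a possible continuation of a case. A first-order Markov chain is only a stand-in here; the Process-aware Bayesian Network additionally models attributes and temporal information and performs full Bayesian inference. The example log and activity names are made up.

```python
from collections import Counter, defaultdict

# Toy event log: each trace is the ordered list of executed activities of one case.
log = [
    ["register", "check", "approve", "pay"],
    ["register", "check", "reject"],
    ["register", "check", "approve", "pay"],
    ["register", "check", "rework", "check", "approve", "pay"],
]

# Estimate P(next activity | current activity) from the log.
counts = defaultdict(Counter)
for trace in log:
    for a, b in zip(trace, trace[1:] + ["<end>"]):
        counts[a][b] += 1
probs = {a: {b: c / sum(nxt.values()) for b, c in nxt.items()} for a, nxt in counts.items()}

def prob_of_suffix(current, suffix):
    """Probability of observing a given continuation after `current`."""
    p = 1.0
    for nxt in suffix:
        p *= probs.get(current, {}).get(nxt, 0.0)
        current = nxt
    return p

print(probs["check"])                                     # e.g. approve 0.6, reject 0.2, rework 0.2
print(prob_of_suffix("register", ["check", "approve", "pay"]))
```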
Process mining solutions aim to improve performance, save resources, and address bottlenecks in organizations. However, success depends on data quality and availability, and existing analyses often lack diverse data for rigorous testing. To overcome this, we propose an interactive web application tool, extending the GEDI Python framework, which creates event datasets that meet specific (meta-)features. It provides diverse benchmark event data by exploring new regions within the feature space, enhancing the range and quality of process mining analyses. This tool improves evaluation quality and helps uncover correlations between meta-features and metrics, ultimately enhancing solution effectiveness.
The abundance of new approaches in process mining and the diversity of processes in the real world raise the question of this thesis: How can we create benchmarks that reliably measure the impact of event data features on process mining evaluation? Developing benchmarks that employ comprehensive intentional event data (ED) and also consider connections between characteristic ED features, methods, and metrics will support process miners in evaluating methods more efficiently and reliably.
Reliable process information, especially regarding trace durations, is crucial for smooth execution. Without it, maintaining a process becomes costly. While many predictive systems aim to identify inefficiencies, they often focus on individual process instances, missing the global perspective. It is essential not only to detect where delays occur but also to pinpoint specific activity transitions causing them. To address this, we propose CC-HIT (Creating Counterfactuals from High-Impact Transitions), which identifies temporal dependencies across the entire process. By focusing on activity transitions, we provide deeper insights into relational impacts, enabling faster resolution of inefficiencies. CC-HIT highlights the most influential transitions on process performance, offering actionable insights for optimization. We validate this method using the BPIC 2020 dataset, demonstrating its effectiveness compared to existing approaches.
Locating specific moments within long videos (20–120 min) presents a significant challenge, akin to finding a needle in a haystack. Adapting existing short video (5–30 s) grounding methods to this problem yields poor performance. Since most real-life videos, such as those on YouTube and AR/VR, are lengthy, addressing this issue is crucial. Existing methods typically operate in two stages: clip retrieval and grounding. However, this disjoint process limits the retrieval module’s fine-grained event understanding, which is crucial for specific moment detection. We propose RGNet, which deeply integrates clip retrieval and grounding into a single network capable of processing long videos at multiple granularity levels, e.g., clips and frames. Its core component is a novel transformer encoder, RG-Encoder, that unifies the two stages through shared features and mutual optimization. The encoder incorporates a sparse attention mechanism and an attention loss to model both granularities jointly. Moreover, we introduce a contrastive clip sampling technique to closely mimic the long video paradigm during training. RGNet surpasses prior methods, showcasing state-of-the-art performance on long video temporal grounding (LVTG) datasets MAD and Ego4D.
We propose FALCUN, a novel deep batch active learning method that is label- and time-efficient. Our proposed acquisition uses a natural, self-adjusting balance of uncertainty and diversity: It slowly transitions from emphasizing uncertain instances at the decision boundary to emphasizing batch diversity. In contrast, established deep active learning methods often have a fixed weighting of uncertainty and diversity, limiting their effectiveness over diverse data sets exhibiting different characteristics. Moreover, to increase diversity, most methods demand intensive search through a deep neural network’s high-dimensional latent embedding space. This leads to high acquisition times when experts are idle while waiting for the next batch for annotation. We overcome this structural problem by exclusively operating on the low-dimensional probability space, yielding much faster acquisition times without sacrificing label efficiency. In extensive experiments, we show FALCUN’s suitability for diverse use cases, including medical images and tabular data. Compared to state-of-the-art methods like BADGE, CLUE, and AlfaMix, FALCUN consistently excels in quality and speed: while FALCUN is among the fastest methods, it has the highest average label efficiency.
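The following sketch illustrates the general idea of an acquisition that works only on predicted probabilities and gradually shifts from uncertainty to diversity while filling a batch. The margin-based uncertainty, the distance-based diversity term, and the linear schedule are illustrative assumptions; FALCUN's exact scores and weighting differ.

```python
import numpy as np

def select_batch(probs, batch_size):
    """Greedy batch selection operating only on predicted class probabilities."""
    sorted_p = np.sort(probs, axis=1)
    uncertainty = 1.0 - (sorted_p[:, -1] - sorted_p[:, -2])      # small margin -> high uncertainty
    selected = [int(np.argmax(uncertainty))]
    while len(selected) < batch_size:
        beta = len(selected) / max(batch_size - 1, 1)            # 0 -> pure uncertainty, 1 -> pure diversity
        dists = np.linalg.norm(probs[:, None, :] - probs[None, selected, :], axis=2)
        diversity = dists.min(axis=1)                            # distance to closest selected point
        score = (1 - beta) * uncertainty + beta * diversity
        score[selected] = -np.inf
        selected.append(int(np.argmax(score)))
    return selected

probs = np.random.default_rng(1).dirichlet(np.ones(5), size=200)  # fake softmax outputs
print(select_batch(probs, batch_size=10))
```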
Mining data containing density-based clusters is well-established and widespread but faces problems when it comes to systematic and reproducible comparison and evaluation. Although the success of clustering methods hinges on data quality and availability, reproducibly generating suitable data for this setting is not easy, leading to mostly low-dimensional toy datasets being used. To resolve this issue, we propose DENSIRED (DENSIty-based Reproducible Experimental Data), a novel data generator for data containing density-based clusters. It is highly flexible w.r.t. a large variety of properties of the data and produces reproducible datasets in a two-step approach. First, skeletons of the clusters are constructed following a random walk. In the second step, these skeletons are enriched with data samples. DENSIRED enables the systematic generation of data for a robust and reliable analysis of methods aimed toward examining data containing density-connected clusters. In extensive experiments, we analyze the impact of user-defined properties on the generated datasets and the intrinsic dimensionalities of synthesized clusters.
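A bare-bones version of the two-step generation idea (random-walk skeleton, then samples around it) might look as follows; the step size, persistence, and noise radius are arbitrary toy parameters, and the real generator exposes many more controllable properties.

```python
import numpy as np

rng = np.random.default_rng(42)

def random_walk_skeleton(n_nodes=20, dim=2, step=1.0, persistence=0.7):
    """Build a cluster skeleton as a random walk with direction persistence."""
    points = [np.zeros(dim)]
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    for _ in range(n_nodes - 1):
        proposal = rng.normal(size=dim)
        proposal /= np.linalg.norm(proposal)
        direction = persistence * direction + (1 - persistence) * proposal
        direction /= np.linalg.norm(direction)
        points.append(points[-1] + step * direction)
    return np.array(points)

def enrich_skeleton(skeleton, n_samples=500, radius=0.4):
    """Second step: draw data samples around randomly chosen skeleton nodes."""
    idx = rng.integers(0, len(skeleton), size=n_samples)
    return skeleton[idx] + rng.normal(scale=radius, size=(n_samples, skeleton.shape[1]))

skeleton = random_walk_skeleton()
X = enrich_skeleton(skeleton)
print(skeleton.shape, X.shape)   # (20, 2) (500, 2)
```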
Process mining solutions aim at enhancing performance, conserving resources, and alleviating bottlenecks in organizational contexts. However, as in other data mining fields, success hinges on data quality and availability. Existing analyses for process mining solutions lack diverse and ample data for rigorous testing, hindering the generalization of insights. To address this, we propose Generating Event Data with Intentional features, a framework producing event data sets satisfying specific meta-features. Considering the meta-feature space that defines feasible event logs, we observe that existing real-world datasets describe only local areas within the overall space. Hence, our framework aims at providing the capability to generate an event data benchmark which covers unexplored regions. Therefore, our approach leverages a discretization of the meta-feature space to steer generated data towards regions where combinations of meta-features are not yet covered by existing benchmark datasets. Providing a comprehensive data pool enriches process mining analyses, enables methods to capture a wider range of real-world scenarios, and improves evaluation quality. Moreover, it empowers analysts to uncover correlations between meta-features and evaluation metrics, enhancing explainability and solution effectiveness. Experiments demonstrate GEDI’s ability to produce a benchmark of intentional event data sets and robust analyses for process mining tasks.
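The core discretization idea can be illustrated with a few lines of numpy/pandas: bin each meta-feature, mark which cells existing logs fall into, and treat empty cells as generation targets. The meta-feature names, values, and bin count below are made up and are not those used by GEDI.

```python
import numpy as np
import pandas as pd
from itertools import product

# Hypothetical meta-features of existing event logs (names and values are invented).
existing = pd.DataFrame({
    "n_traces":       [500, 1200, 900, 5000],
    "trace_len_mean": [8.2, 15.1, 9.3, 40.7],
    "activity_count": [12, 30, 14, 80],
})

n_bins = 3
edges = {c: np.linspace(existing[c].min(), existing[c].max(), n_bins + 1) for c in existing}

def cell(row):
    """Map a log's meta-feature vector to its discrete cell in the feature space."""
    return tuple(int(np.clip(np.digitize(row[c], edges[c][1:-1]), 0, n_bins - 1)) for c in existing)

covered = {cell(r) for _, r in existing.iterrows()}
all_cells = set(product(range(n_bins), repeat=existing.shape[1]))
targets = sorted(all_cells - covered)        # regions no existing log falls into

print(f"{len(covered)} covered of {len(all_cells)} cells; first targets: {targets[:5]}")
```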
We introduce SA-DQAS in this paper, a novel framework that enhances gradient-based Differentiable Quantum Architecture Search (DQAS) with a self-attention mechanism, aimed at optimizing circuit design for Quantum Machine Learning (QML) challenges. Analogous to a sequence of words in a sentence, a quantum circuit can be viewed as a sequence of placeholders containing quantum gates. In DQAS, each placeholder is treated independently, whereas the self-attention mechanism in SA-DQAS captures relations and dependencies among the operation candidates placed on the placeholders of a circuit. To evaluate and verify our approach, we conduct experiments on job-shop scheduling problems (JSSP), Max-cut problems, and quantum fidelity. Incorporating self-attention improves the stability and performance of the resulting quantum circuits and refines their structural design with higher noise resilience and fidelity. Our research demonstrates the first successful integration of self-attention with DQAS.
In settings where only a budgeted amount of labeled data can be afforded, active learning seeks to devise query strategies for selecting the most informative data points to be labeled, aiming to enhance learning algorithms’ efficiency and performance. Numerous such query strategies have been proposed and compared in the active learning literature. However, the community still lacks standardized benchmarks for comparing the performance of different query strategies. This particularly holds for the combination of query strategies with different learning algorithms into active learning pipelines and examining the impact of the learning algorithm choice. To close this gap, we propose ALPBench, which facilitates the specification, execution, and performance monitoring of active learning pipelines. It has built-in measures to ensure evaluations are done reproducibly, saving exact dataset splits and hyperparameter settings of used algorithms. In total, ALPBench consists of 86 real-world tabular classification datasets and 5 active learning settings, yielding 430 active learning problems. To demonstrate its usefulness and broad compatibility with various learning algorithms and query strategies, we conduct an exemplary study evaluating 9 query strategies paired with 8 learning algorithms in 2 different settings.
Ordered data arises in many areas, e.g., in molecular dynamics and other spatial-temporal trajectories. While data points that are close in this order are related, common dimensionality reduction techniques cannot capture this relation or order. Thus, the information is lost in the low-dimensional representations. We introduce DROPP, which incorporates order into dimensionality reduction by adapting a Gaussian kernel function across the ordered covariances between data points. We find underlying principal components that are characteristic of the process that generated the data. In extensive experiments, we show DROPP’s advantages over other dimensionality reduction techniques on synthetic as well as real-world data sets from molecular dynamics and climate research: The principal components of different data sets that were generated by the same underlying mechanism are very similar to each other. They can, thus, be used for dimensionality reduction with low reconstruction errors along a set of data sets, allowing an explainable visual comparison of different data sets as well as good compression even for unseen data.
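One plausible way to make principal components order-aware, in the spirit of the description above, is to weight pairwise contributions to the covariance by a Gaussian kernel over the index distance; the sketch below does exactly that and is not the exact DROPP formulation.

```python
import numpy as np

def order_weighted_components(X, sigma=5.0, n_components=2):
    """Principal components of an order-weighted covariance matrix.

    X is assumed to be ordered (row i precedes row i+1, e.g. a trajectory).
    Pairs of points close in this order contribute more to the covariance
    via a Gaussian kernel over their index distance.
    """
    T = len(X)
    idx = np.arange(T)
    K = np.exp(-((idx[:, None] - idx[None, :]) ** 2) / (2 * sigma ** 2))
    Xc = X - X.mean(axis=0)
    C = Xc.T @ K @ Xc / K.sum()
    C = (C + C.T) / 2                       # guard against numerical asymmetry
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order[:n_components]], eigvals[order[:n_components]]

rng = np.random.default_rng(0)
X = np.cumsum(rng.normal(size=(300, 6)), axis=0)   # toy ordered trajectory
components, variances = order_weighted_components(X)
print(components.shape, variances)                  # (6, 2) and the leading weighted variances
```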
Recommender systems are a popular and common means to extract relevant information for users. Small and medium-sized enterprises make up a large share of the overall amount of business, yet they are rarely considered when it comes to the demand for recommender systems. Different conditions, such as the small amount of data, lower computational capabilities, and users frequently not possessing an account, require a different and potentially more small-scale recommender system. The requirements regarding quality are similar: high accuracy and high diversity are certainly an advantage. We provide multiple solutions with different variants based solely on information contained in event-based sequences and temporal information. Our code is available on GitHub. We conduct experiments on four different datasets with an increasing number of items to show a possible range for scalability. The promising results show the applicability of these grammar-based recommender system variants and leave the final decision on which recommender to choose to the users and their ultimate goals.
The growing availability of data demands clustering methods that can extract valuable information without requiring costly annotations, especially for large, high-dimensional datasets. This dissertation develops subspace and deep clustering approaches, leveraging methods like the Dip-test of unimodality and Minimum Description Length principle to identify and encode relevant features and clusters automatically, even in complex datasets. By incorporating these techniques into neural networks and refining them through a novel parameter-free approach, the research offers robust clustering tools that perform well without prior knowledge of the number of clusters, all implemented in the open-source package ClustPy. (Shortened).
Deep clustering algorithms have gained popularity as they are able to cluster complex large-scale data, like images. Yet these powerful algorithms require many decisions w.r.t. architecture, learning rate, and other hyperparameters, making it difficult to compare different methods. A comprehensive empirical evaluation of novel clustering methods, however, plays an important role in both scientific and practical applications, as it reveals their individual strengths and weaknesses. Therefore, we introduce ClustPy, a unified framework for benchmarking deep clustering algorithms, and perform a comparison of several fundamental deep clustering methods and some recently introduced ones. We compare these methods on multiple well-known image data sets using different evaluation metrics, perform a sensitivity analysis w.r.t. important hyperparameters, and perform ablation studies, e.g., for different autoencoder architectures and image augmentation. To our knowledge, this is the first in-depth benchmarking of deep clustering algorithms in a unified setting.
Honesty and fairness are essential. Like many skills, practicing these values starts in the classroom. Whether students are examined online or on-site, only by testing their knowledge fairly can educators assess their skills and room for improvement. As online exams increase, we are provided with more data suitable for analysis. Process mining methods such as anomaly detection and trace clustering techniques have been used to identify dishonest behavior in other fields, e.g., fraud detection. In this paper, we investigate collusion detection in online exams as a process mining task. We explore trace ordering for anomaly detection (TOAD) as well as hierarchical agglomerative trace clustering (HATC). Promising preliminary results exemplify how process mining techniques empower teachers in their decision making while, via flexible configuration of parameters, leaving the last word to them.
The analysis of event data is largely influenced by the effective characterization of descriptors. These descriptors serve as the building blocks of our understanding, encapsulating the behavior described within the event data. In light of these considerations, we introduce FEEED (Feature Extraction from Event Data), an extendable tool for event data feature extraction. FEEED represents a significant advancement in event data behavior analysis, offering a range of features to empower analysts and data scientists in their pursuit of insightful, actionable, and understandable event data analysis. What sets FEEED apart is its unique capacity to act as a bridge between the worlds of data mining and process mining. In doing so, it promises to enhance the accuracy, comprehensiveness, and utility of characterizing event data for a diverse range of applications.
Deep clustering algorithms have gained popularity for clustering complex, large-scale data sets, but getting started is difficult because of numerous decisions regarding architecture, optimizer, and other hyperparameters. Theoretical foundations must be known to obtain meaningful results. At the same time, ease of use is necessary for adoption by a broader audience. Therefore, we require a unified framework that allows for easy execution in diverse settings. While this exists for established clustering methods like k-Means and DBSCAN, deep clustering algorithms lack a standard structure, resulting in significant programming overhead. This complicates empirical evaluations, which are essential in both scientific and practical applications. We present a solution to this problem by providing a theoretical background on deep clustering as well as practical implementation techniques and a unified structure with predefined neural networks. For the latter, we use the Python package ClustPy. The aim is to share best practices and facilitate community participation in deep clustering research.
Glass beads were among the most common grave goods in the Early Middle Ages, with an estimated number in the millions. The color, size, shape, and decoration of the beads are diverse, leading to many different archaeological classification systems that depend on the subjective decisions of individual experts. The lack of an agreed-upon expert categorization leads to a pressing problem in archaeology, as the categorization of archaeological artifacts, like glass beads, is important to learn about cultural trends, manufacturing processes, or economic relationships (e.g., trade routes) of historical times. An automated, objective, and reproducible classification system is therefore highly desirable. We present a high-quality data set of images of Early Medieval beads and propose a clustering pipeline to learn a classification system in a data-driven way. The pipeline consists of a novel extension of deep embedded non-redundant clustering to identify multiple, meaningful clusterings of glass bead images. During the cluster analysis, we address several challenges associated with the data and as a result identify high-quality clusterings that overlap with archaeological domain expertise. To the best of our knowledge, this is the first application of non-redundant image clustering to archaeological data.
Portfolio optimization tasks describe sequential decision problems in which the investor’s wealth is distributed across a set of assets. Allocation constraints are used to enforce minimal or maximal investments into particular subsets of assets to control for objectives such as limiting the portfolio’s exposure to a certain sector due to environmental concerns. Although methods for Constrained Reinforcement Learning (CRL) can optimize policies while considering allocation constraints, it can be observed that these general methods yield suboptimal results. In this paper, we propose a novel approach to handle allocation constraints based on a decomposition of the constrained action space into a set of unconstrained allocation problems. In particular, we examine this approach for the case of two constraints. For example, an investor may wish to invest at least a certain percentage of the portfolio into green technologies while limiting the investment in the fossil energy sector. We show that the action space of the task is equivalent to the decomposed action space, and introduce a new Reinforcement Learning (RL) approach, CAOSD, which is built on top of the decomposition. The experimental evaluation on real-world Nasdaq data demonstrates that our approach consistently outperforms state-of-the-art CRL benchmarks for portfolio optimization.
Node classification is one of the core tasks on attributed graphs, but successful graph learning solutions require sufficiently labeled data. To keep annotation costs low, active graph learning focuses on selecting the subset of nodes that maximizes label efficiency. However, deciding which heuristic is best suited for an unlabeled graph to increase label efficiency is a persistent challenge. Existing solutions either neglect aligning the learned model and the sampling method or focus only on limited selection aspects. They are thus sometimes worse than or only on par with random sampling. In this work, we introduce a novel active graph learning approach called DiffusAL, showing significant robustness in diverse settings. Toward better transferability between different graph structures, we combine three independent scoring functions to identify the most informative node samples for labeling in a parameter-free way: i) Model Uncertainty, ii) Diversity Component, and iii) Node Importance computed via graph diffusion heuristics. Most of our calculations for acquisition and training can be pre-processed, making DiffusAL more efficient than approaches combining diverse selection criteria, and similarly fast as simpler heuristics. Our experiments on various benchmark datasets show that, unlike previous methods, our approach significantly outperforms random selection in 100% of all datasets and labeling budgets tested.
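A toy version of combining three parameter-free scores could look as follows; a PageRank power iteration stands in for the graph-diffusion importance, predictive entropy for model uncertainty, and distance to labeled nodes for diversity. DiffusAL's actual scores and their combination differ in detail; all names and parameters below are illustrative.

```python
import numpy as np

def pagerank_importance(A, alpha=0.85, iters=50):
    """Node importance via a simple PageRank power iteration (a stand-in for
    the graph-diffusion heuristic; the exact diffusion used by DiffusAL differs)."""
    n = A.shape[0]
    P = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-12)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - alpha) / n + alpha * P.T @ r
    return r

def combined_scores(A, probs, features, labeled):
    eps = 1e-12
    uncertainty = -(probs * np.log(probs + eps)).sum(axis=1)       # predictive entropy
    importance = pagerank_importance(A)
    d = np.linalg.norm(features[:, None, :] - features[None, labeled, :], axis=2)
    diversity = d.min(axis=1)                                      # far from labeled nodes
    norm = lambda x: (x - x.min()) / (x.max() - x.min() + eps)
    score = norm(uncertainty) * norm(importance) * norm(diversity)  # parameter-free product
    score[labeled] = -np.inf
    return score

rng = np.random.default_rng(0)
n = 50
A = (rng.random((n, n)) < 0.1).astype(float)
A = np.maximum(A, A.T)
np.fill_diagonal(A, 0)
probs = rng.dirichlet(np.ones(4), size=n)
features = rng.normal(size=(n, 8))
print(int(np.argmax(combined_scores(A, probs, features, labeled=[0, 1, 2]))))  # next node to label
```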
Do we need active learning? The rise of strong deep semi-supervised methods raises doubts about the usefulness of active learning in limited labeled data settings. This is caused by results showing that combining semi-supervised learning (SSL) methods with a random selection for labeling can outperform existing active learning (AL) techniques. However, these results are obtained from experiments on well-established benchmark datasets that can overestimate the external validity. Moreover, the literature lacks sufficient research on the performance of active semi-supervised learning methods in realistic data scenarios, leaving a notable gap in our understanding. Therefore, we present three data challenges common in real-world applications: between-class imbalance, within-class imbalance, and between-class similarity. These challenges can hurt SSL performance due to confirmation bias. We conduct experiments with SSL and AL on simulated data challenges and find that random sampling does not mitigate confirmation bias and, in some cases, leads to worse performance than supervised learning. In contrast, we demonstrate that AL can overcome confirmation bias in SSL in these realistic settings. Our results provide insights into the potential of combining active and semi-supervised learning in the presence of common real-world challenges, which is a promising direction for robust methods when learning with limited labeled data in real-world applications.
Clustering heterogeneous data is an ongoing challenge in the data mining community. The most prevalent clustering methods are designed to process datasets with numerical features only, but datasets often consist of mixed numerical and categorical features. This requires new approaches capable of handling both kinds of data types. Further, the most relevant cluster structures are often hidden in only a few features. Thus, another key challenge is to detect those specific features automatically and abandon features not relevant for clustering. This paper proposes the subspace mixed-type clustering algorithm k-SubMix, which tackles both challenges. Its cost function can handle both numerical and categorical features while simultaneously identifying those with the biggest impact on a high-quality clustering result. Unlike other subspace mixed-type clustering methods, k-SubMix preserves inter-cluster comparability, as it is the first mixed-type approach that defines a common subspace for all clusters. Extensive experiments show that k-SubMix outperforms competitive methods and reduces the data’s complexity by a simultaneous dimensionality reduction.
Despite the popularity of density-based clustering, its procedural definition makes it difficult to analyze compared to clustering methods that minimize a loss function. In this paper, we reformulate DBSCAN through a clean objective function by introducing the density-connectivity distance (dc-dist), which captures the essence of density-based clusters by endowing the minimax distance with the concept of density. This novel ultrametric allows us to show that DBSCAN, k-center, and spectral clustering are equivalent in the space given by the dc-dist, despite these algorithms being perceived as fundamentally different in their respective literatures. We also verify that finding the pairwise dc-dists gives DBSCAN clusterings across all epsilon-values, simplifying the problem of parameterizing density-based clustering. We conclude by thoroughly analyzing density-connectivity and its properties – a task that has been elusive thus far in the literature due to the lack of formal tools.
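For small data sets, the dc-dist can be computed directly from mutual reachability distances with a minimax-path relaxation, as in the following numpy sketch (an O(n^3) illustration; practical implementations work on a minimum spanning tree instead):

```python
import numpy as np
from scipy.spatial.distance import cdist

def dc_dist(X, k=5):
    """Pairwise density-connectivity distances (direct toy version).

    mreach(i, j) = max(core_k(i), core_k(j), d(i, j)) endows the metric with
    density; the dc-dist is the minimax path distance over these edges,
    computed here with a Floyd-Warshall-style update.
    """
    D = cdist(X, X)
    core = np.sort(D, axis=1)[:, k]            # distance to the k-th neighbor (column 0 is the point itself)
    dc = np.maximum(D, np.maximum(core[:, None], core[None, :]))
    np.fill_diagonal(dc, 0.0)
    for m in range(len(X)):                    # relax: smallest "largest edge" over paths through m
        dc = np.minimum(dc, np.maximum(dc[:, m:m + 1], dc[m:m + 1, :]))
    return dc

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(4, 0.3, (30, 2))])
dc = dc_dist(X)
# Thresholding dc at an epsilon yields density-connected components as in DBSCAN:
same_as_first = (dc[0] <= 1.0).astype(int)     # 1 = same component as point 0
print(np.bincount(same_as_first))              # roughly [30, 30]
```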
Argumentation is one of society’s foundational pillars, and, sparked by advances in NLP and the vast availability of text data, automated mining of arguments receives increasing attention. A decisive property of arguments is their strength or quality. While there are works on the automated estimation of argument strength, their scope is narrow: they focus on isolated datasets and neglect the interactions with related argument mining tasks, such as argument identification and evidence detection. In this work, we close this gap by approaching argument quality estimation from multiple different angles: grounded on rich results from thorough empirical evaluations, we assess the generalization capabilities of argument quality estimation across diverse domains and the interplay with related argument mining tasks. We find that generalization depends on a sufficient representation of different domains in the training data. In zero-shot transfer and multi-task experiments, we reveal that argument quality is among the more challenging tasks but can improve others.
Financial portfolio managers typically face multi-period optimization tasks such as short-selling or investing at least a particular portion of the portfolio in a specific industry sector. A common approach to tackle these problems is to use constrained Markov decision process (CMDP) methods, which may suffer from sample inefficiency, hyperparameter tuning, and lack of guarantees for constraint violations. In this paper, we propose Action Space Decomposition Based Optimization (ADBO) for optimizing a more straightforward surrogate task that allows actions to be mapped back to the original task. We examine our method on two real-world data portfolio construction tasks. The results show that our new approach consistently outperforms state-of-the-art benchmark approaches for general CMDPs.
Over the last decade, the Dip-test of unimodality has gained increasing interest in the data mining community, as it is a parameter-free statistical test that reliably rates the modality in one-dimensional samples. It returns a so-called Dip-value and a corresponding probability for the sample’s unimodality (Dip-p-value). These two values share a sigmoidal relationship. However, the specific transformation depends on the sample size. Many Dip-based clustering algorithms use bootstrapped look-up tables translating Dip- to Dip-p-values for a limited number of sample sizes. We propose a specifically designed sigmoid function as a substitute for these state-of-the-art look-up tables. This accelerates computation and provides an approximation of the Dip- to Dip-p-value transformation for every single sample size. Further, it is differentiable and can therefore easily be integrated into learning schemes using gradient descent. We showcase this by exploiting our function in a novel subspace clustering algorithm called Dip’n’Sub. We highlight in extensive experiments the various benefits of our proposal.
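The shape of such a transformation can be sketched in a few lines; note that the coefficients a and b below are arbitrary placeholders and not the fitted values from the paper, so the printed numbers are purely illustrative.

```python
import numpy as np

def dip_to_pvalue(dip, n, a=20.0, b=1.0):
    """Map a Dip statistic and sample size to an approximate unimodality p-value
    with a single sigmoid. The scale depends on the sample size via sqrt(n);
    a and b are placeholder coefficients, NOT the fitted values from the paper.
    """
    z = np.sqrt(n) * dip
    return 1.0 / (1.0 + np.exp(a * (z - b)))   # larger Dip -> smaller p -> multimodal

for dip in (0.02, 0.05, 0.10):
    print(dip, round(dip_to_pvalue(dip, n=1000), 3))
```

Because this mapping is a smooth function of the Dip value, it can sit inside gradient-based training, which is exactly what fixed look-up tables prevent.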
When researchers publish new cluster algorithms, they usually demonstrate the strengths of their novel approaches by comparing the algorithms’ performance with existing competitors. However, such studies are likely to be optimistically biased towards the new algorithms, as the authors have a vested interest in presenting their method as favorably as possible in order to increase their chances of getting published. Therefore, the superior performance of newly introduced cluster algorithms is over-optimistic and might not be confirmed in independent benchmark studies performed by neutral and unbiased authors. This problem is known among many researchers, but so far, the different mechanisms leading to over-optimism in cluster algorithm evaluation have never been systematically studied and discussed. Researchers are thus often not aware of the full extent of the problem. We present an illustrative study to illuminate the mechanisms by which authors—consciously or unconsciously—paint their cluster algorithm’s performance in an over-optimistic light. Using the recently published cluster algorithm Rock as an example, we demonstrate how optimization of the used datasets or data characteristics, of the algorithm’s parameters and of the choice of the competing cluster algorithms leads to Rock’s performance appearing better than it actually is. Our study is thus a cautionary tale that illustrates how easy it can be for researchers to claim apparent ‘superiority’ of a new cluster algorithm. This illuminates the vital importance of strategies for avoiding the problems of over-optimism (such as, e.g., neutral benchmark studies), which we also discuss in the article.
Recent transformer-based offline video instance segmentation (VIS) approaches achieve encouraging results and significantly outperform online approaches. However, their reliance on the whole video and the immense computational complexity caused by full spatio-temporal attention limit them in real-life applications such as processing lengthy videos. In this paper, we propose a single-stage transformer-based efficient online VIS framework named InstanceFormer, which is especially suitable for long and challenging videos. We propose three novel components to model short-term and long-term dependency and temporal coherence. First, we propagate the representation, location, and semantic information of prior instances to model short-term changes. Second, we propose a novel memory cross-attention in the decoder, which allows the network to look into earlier instances within a certain temporal window. Finally, we employ a temporal contrastive loss to impose coherence in the representation of an instance across all frames. Memory attention and temporal coherence are particularly beneficial to long-range dependency modeling, including challenging scenarios like occlusion. The proposed InstanceFormer outperforms previous online benchmark methods by a large margin across multiple datasets. Most importantly, InstanceFormer surpasses offline approaches for challenging and long datasets such as YouTube-VIS-2021 and OVIS.
Active learning has the power to significantly reduce the amount of labeled data needed to build strong classifiers. Existing active pseudo-labeling methods show high potential in integrating pseudo-labels within the active learning loop but heavily depend on the prediction accuracy of the model. In this work, we propose VERIPS, an algorithm that significantly outperforms existing pseudo-labeling techniques for active learning. At its core, VERIPS uses a pseudo-label verification mechanism that consists of a second network trained only on data approved by the oracle and helps to discard questionable pseudo-labels. In particular, the verifier model eliminates all pseudo-labels for which it disagrees with the actual task model. VERIPS overcomes the problem of poorly performing initial models, e.g., due to imbalanced or too small initial pools, where previous methods select too many incorrect pseudo-labels and recovery takes a long time or is not possible at all. Moreover, VERIPS is particularly insensitive to parameter choices that existing approaches suffer from.
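The verification step itself is simple to illustrate: train a verifier only on oracle-labeled data and keep a confident pseudo-label only if the verifier agrees with the task model. The sketch below uses plain sklearn classifiers and a single round; VERIPS embeds this mechanism in a full deep active learning loop, and using two different model families here is an assumption to keep the one-round toy meaningful.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Toy pool-based setup: a small labeled pool and a large unlabeled pool.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)
rng = np.random.default_rng(0)
labeled = rng.choice(len(X), size=40, replace=False)
unlabeled = np.setdiff1d(np.arange(len(X)), labeled)

# Task model proposes pseudo-labels; the verifier is trained ONLY on oracle-approved data.
task_model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
verifier = RandomForestClassifier(n_estimators=200, random_state=0).fit(X[labeled], y[labeled])

# Candidate pseudo-labels: confident predictions of the task model.
probs = task_model.predict_proba(X[unlabeled])
confident = unlabeled[probs.max(axis=1) > 0.9]
pseudo = task_model.predict(X[confident])

# Verification: discard every pseudo-label the verifier disagrees with.
agree = verifier.predict(X[confident]) == pseudo
print(f"accepted {agree.sum()} of {len(confident)} pseudo-labels")

# The task model is then retrained on oracle labels plus the accepted pseudo-labels.
task_model.fit(np.vstack([X[labeled], X[confident][agree]]),
               np.concatenate([y[labeled], pseudo[agree]]))
```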
Group anomaly detection, i.e., detecting and predicting abnormal behaviour of entities as a group rather than as individuals, addresses a variety of challenges in spatio-temporal environments such as traffic and transportation systems, smart cities, and geoinformation systems. These environments provide information about a commonly large number of individual entities. Examples of such entities are airplanes and drones, vehicles, and ships, but also people, remote sensors, and any other information source interacting with the environment. However, while point anomaly detection is quite common for revealing the abnormal behaviour of individual entities, the collective behaviour of the individuals as a group remains completely unaddressed. For example, potential for traffic flow optimizations or increased local traffic guideline violations cannot be detected from one single drive, but only by considering the behavior of a group of vehicle drives in an area. With this work-in-progress we elaborate on the potential of group anomaly detection algorithms for spatio-temporal collective behaviour scenarios in smart cities. We describe the group anomaly detection problem in the context of urban planning and demonstrate its effectiveness on a public real-world data set of urban rental bike rides and stations in and around Munich, revealing abnormal groups of rides. This allows optimizing the accessibility of rental bikes for the population and thereby contributes to a sustainable environment.
The goal of the Hyperview challenge is to use Hyperspectral Imaging (HSI) to predict the soil parameters potassium (K), phosphorus pentoxide (P2O5), magnesium (Mg), and the pH value. These are relevant parameters for determining the need for fertilization in agriculture. With this knowledge, fertilizers can be applied in a targeted way rather than prophylactically, which is the current procedure of choice. In this context, we introduce two different approaches to solve this regression task, based on 3D CNNs with Huber loss regression (SpectralNet3D) and on 1D RNNs. Both methods show distinct advantages, with a peak challenge metric score of 0.808 on the provided validation data.
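A minimal PyTorch sketch of the 3D-CNN-with-Huber-loss idea is shown below; layer sizes, patch shapes, and the single regression head for the four targets are assumptions for illustration, not the SpectralNet3D architecture.

```python
import torch
import torch.nn as nn

class TinySpectralNet3D(nn.Module):
    """Toy 3D CNN regressor for hyperspectral patches (bands treated as depth)."""
    def __init__(self, n_targets=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.head = nn.Linear(16, n_targets)   # K, P2O5, Mg, pH

    def forward(self, x):                      # x: (batch, 1, bands, height, width)
        return self.head(self.features(x).flatten(1))

model = TinySpectralNet3D()
x = torch.randn(4, 1, 32, 16, 16)              # toy batch of hyperspectral patches
y = torch.randn(4, 4)                          # toy regression targets
loss = nn.HuberLoss()(model(x), y)             # robust regression loss as named in the abstract
loss.backward()
print(loss.item())
```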
The task of portfolio management is the selection of portfolio allocations for every single time step during an investment period while adjusting the risk-return profile of the portfolio to the investor’s individual level of risk preference. In practice, it can be hard for an investor to quantify their individual risk preference. As an alternative, approximating the risk-return Pareto front allows for the comparison of different optimized portfolio allocations and hence for the selection of the most suitable risk level. Furthermore, an approximation of the Pareto front allows the analysis of the overall risk sensitivity of various investment policies. In this paper, we propose a deep reinforcement learning (RL) based approach in which a single meta agent generates optimized portfolio allocation policies for any level of risk preference in a given interval. Our method is more efficient than previous approaches, as it only requires training a single agent for the full approximate risk-return Pareto front. Additionally, it is more stable in training and only requires per-time-step market risk estimations independent of the policy. Such risk control per time step is a common regulatory requirement, e.g., for insurance companies. We benchmark our meta agent against other state-of-the-art risk-aware RL methods using a realistic environment based on real-world Nasdaq-100 data. Our evaluation shows that the proposed meta agent outperforms various benchmark approaches by generating strategies with better risk-return profiles.
Selecting diverse instances for annotation is one of the key factors of successful active learning strategies. To this end, existing methods often operate on high-dimensional latent representations. In this work, we propose to use the low-dimensional vector of predicted probabilities instead, which can be seamlessly integrated into existing methods. We empirically demonstrate that this considerably decreases the query time, i.e., time to select an instance for annotation, while at the same time improving results. Low query times are relevant for active learning researchers, who use a (fast) oracle for simulated annotation and thus are often constrained by query time. It is also practically relevant when dealing with complex annotation tasks for which only a small pool of skilled domain experts is available for annotation with a limited time budget.
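One simple way to realize this idea is to cluster the predicted probability vectors and query the instance closest to each cluster center, as sketched below; this is a generic illustration of selecting in probability space, not the specific acquisition proposed in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def diverse_queries_from_probs(probs, batch_size, random_state=0):
    """Pick a diverse batch using only the model's predicted probabilities.

    Cluster the low-dimensional probability vectors and query the instance
    closest to each cluster center.
    """
    km = KMeans(n_clusters=batch_size, n_init=10, random_state=random_state).fit(probs)
    picks = []
    for c in range(batch_size):
        members = np.flatnonzero(km.labels_ == c)
        dists = np.linalg.norm(probs[members] - km.cluster_centers_[c], axis=1)
        picks.append(int(members[np.argmin(dists)]))
    return picks

probs = np.random.default_rng(0).dirichlet(np.ones(10), size=5000)   # fake softmax outputs
print(diverse_queries_from_probs(probs, batch_size=8))
```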
Spectral clustering is one of the most advantageous clustering approaches. However, standard Spectral Clustering is sensitive to noisy input data and has a high runtime complexity. Tackling one of these problems often exacerbates the other. As real-world datasets are often large and compromised by noise, we need to improve both robustness and runtime at once. Thus, we propose Spectral Clustering - Accelerated and Robust (SCAR), an accelerated, robustified spectral clustering method. In an iterative approach, we achieve robustness by separating the data into two latent components: cleansed and noisy data. We accelerate the eigendecomposition - the most time-consuming step - based on the Nyström method. We compare SCAR to related recent state-of-the-art algorithms in extensive experiments. SCAR surpasses its competitors in terms of speed and clustering quality on highly noisy data.
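The acceleration part can be sketched as follows: approximate the affinity matrix with Nystroem landmarks, obtain an approximate spectral embedding from an SVD of the landmark features, and cluster that embedding. SCAR's robustness mechanism (iteratively separating cleansed and noisy data) is not reproduced here, and the kernel parameters are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.kernel_approximation import Nystroem
from sklearn.cluster import KMeans

# Approximate the RBF affinity K with landmarks (K ~= Z Z^T) and read an
# approximate spectral embedding off the SVD of Z instead of eigendecomposing
# the full n x n matrix. This is only the acceleration idea, not SCAR itself.
X, y = make_moons(n_samples=3000, noise=0.05, random_state=0)
Z = Nystroem(kernel="rbf", gamma=30.0, n_components=200, random_state=0).fit_transform(X)
U, S, _ = np.linalg.svd(Z, full_matrices=False)     # left singular vectors ~ eigenvectors of K
embedding = U[:, :2]                                 # top-2 approximate eigenvectors
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embedding)
print(labels[:10])
```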
Hartigan’s Dip-test of unimodality has gained increasing interest in unsupervised learning over the past few years. It is free from complex parameterization and does not require an a priori assumed distribution. A useful property is that the resulting Dip-values can be differentiated with respect to a projection axis, which allows finding axes that reveal multimodal structures in the data set. In this paper, we show how to apply the gradient not only with respect to the projection axis but also with respect to the data to improve the cluster structure. By tightly coupling the Dip-test with an autoencoder, we obtain an embedding that clearly separates all clusters in the data set. This method, called DipEncoder, is the basis of a novel deep clustering algorithm. Extensive experiments show that the DipEncoder is highly competitive with state-of-the-art methods.
This thesis addresses the challenges of argumentation in the digital age by applying machine learning methods to automatically identify, retrieve, and evaluate arguments from diverse and often contradictory online sources. The first focus is on argument identification, specifically in heterogeneous text sources and peer reviews, where the relationship between the topic and arguments is crucial, and knowledge transfer across domains is limited. The second focus is on argument retrieval, where machine learning is used to select relevant documents, ensuring comprehensive and non-redundant argument coverage. Finally, the thesis explores the strength or quality of arguments, integrating this concept with other argument mining tasks and evaluating its impact across different text domains and contexts. (Shortened.)
This thesis addresses key challenges in correlation clustering, particularly in high-dimensional datasets, by developing novel methods to evaluate and improve clustering algorithms. The first contribution focuses on defining and deriving internal evaluation criteria for correlation clustering, proposing a new cost function to assess cluster quality based on commonalities among existing algorithms. The second part introduces two innovative strategies for detecting regions of interest (ROIs) in Hough space, improving the robustness of the Hough transform algorithm, and extending it to handle quadratic and periodic correlated clusters. Finally, the thesis explores unifying local and global correlation clustering views and enhancing the resilience of these methods to outliers. (Shortened.)
LUCKe allows any purely distance-based ‘classic’ clustering algorithm to reliably find linear correlation clusters. An elaborate distance matrix based on each point’s local PCA extracts all necessary information from high-dimensional data to declare points of the same, arbitrary-dimensional linear correlation cluster as ‘similar’. For this, the points’ eigensystems are combined with only the relevant information about their position in space. LUCKe allows transferring known benefits from the large field of basic clustering to correlation clustering. Its applicability is shown in extensive experiments with simple representatives of diverse basic clustering approaches.
We introduce OAB, an Open Anomaly Benchmark framework for unsupervised and semi-supervised anomaly detection on image and tabular data sets. It ensures simple reproducibility of existing benchmark results as well as reliable comparability and low-effort extensibility when new anomaly detection algorithms or new data sets are added. While making established methods of the most popular benchmarks easily accessible, OAB generalizes the task of un- and semi-supervised anomaly benchmarking and offers, besides commonly used benchmark data sets, also semantically meaningful real-world anomaly data sets as well as a broad range of traditional and state-of-the-art anomaly detection algorithms. The benefit of OAB for the research community has been demonstrated by reproducing and extending existing benchmarks to new algorithms with very low effort, allowing researchers to focus on the actual algorithm research.
High-quality arguments are an essential part of decision-making. Automatically predicting the quality of an argument is a complex task that recently got much attention in argument mining. However, the annotation effort for this task is exceptionally high. Therefore, we test uncertainty-based active learning (AL) methods on two popular argument-strength data sets to estimate whether sample-efficient learning can be enabled. Our extensive empirical evaluation shows that uncertainty-based acquisition functions cannot surpass the accuracy reached with random acquisition on these data sets.
During different steps in the process of discovering drug candidates for diseases, it can be helpful to identify groups of molecules that share similar properties, i.e., a common overall structural similarity. Existing methods for computing (dis)similarities between chemical structures rely on a priori domain knowledge. Here, we investigate clustering of compounds based on embeddings generated with the recently published Mol2Vec technique, which enables an entirely unsupervised vector representation of compounds. A research question we address in this work is: do existing well-known clustering algorithms, such as k-means or hierarchical clustering methods, yield meaningful clusters on Mol2Vec embeddings? Further, we investigate how far subspace clustering can be utilized to compress the data by reducing the dimensionality of the compounds’ vector representations. Our first experiments on a set of COVID-19 drug candidates reveal that well-established methods yield meaningful clusters. Preliminary results from subspace clusterings indicate that a compression of the vector representations seems viable.
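The kind of analysis described above can be reproduced in a few lines once per-compound embeddings are available; the snippet below uses random vectors as a stand-in for Mol2Vec embeddings and PCA only as a crude proxy for the subspace-based compression question.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Stand-in for Mol2Vec output: one dense vector per compound (random data here;
# in the study these would be Mol2Vec embeddings of actual drug candidates).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(150, 300))

for name, algo in [("k-means", KMeans(n_clusters=5, n_init=10, random_state=0)),
                   ("agglomerative", AgglomerativeClustering(n_clusters=5))]:
    labels = algo.fit_predict(embeddings)
    print(name, round(silhouette_score(embeddings, labels), 3))

# Crude proxy for the compression question: cluster in a lower-dimensional
# projection and compare cluster quality (the paper uses subspace clustering).
reduced = PCA(n_components=20, random_state=0).fit_transform(embeddings)
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(reduced)
print("k-means on 20 PCs", round(silhouette_score(reduced, labels), 3))
```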
Even though most clustering algorithms serve knowledge discovery in fields other than computer science, most of them still require users to be familiar with programming or data mining to some extent. As that often prevents efficient research, we developed an easy-to-use, highly explainable clustering method accompanied by an interactive tool for clustering. It is based on intuitively understandable kNN graphs and the subsequent application of adaptable filters, which can be combined in an ensemble-like and iterative fashion to prune unnecessary or misleading edges. For a first overview of the data, fully automatic predefined filter cascades deliver robust results. A selection of simple filters and combination methods that can be chosen interactively yields very good results on benchmark datasets compared to various algorithms.
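A stripped-down version of this pipeline (kNN graph, one simple edge filter, connected components as clusters) can be written as follows; the percentile-based filter is just one example of the adaptable filters mentioned above and not the tool's actual filter set.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import connected_components

# Build a kNN graph, prune unusually long edges, and read clusters off the
# remaining connected components.
X, _ = make_blobs(n_samples=600, centers=3, cluster_std=0.8, random_state=0)
G = kneighbors_graph(X, n_neighbors=8, mode="distance")       # sparse kNN graph

# Filter: drop edges that are much longer than the typical kNN edge.
threshold = np.percentile(G.data, 90)
G.data[G.data > threshold] = 0
G.eliminate_zeros()

n_clusters, labels = connected_components(G, directed=False)
print(n_clusters, np.bincount(labels)[:10])
```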
When using arbitrarily oriented subspace clustering algorithms, one obtains a partitioning of a given data set and, for each partition, its individual subspace. Since clustering is an unsupervised machine learning task, we may not have “ground truth” labels at our disposal or do not wish to rely on them. What is needed in such cases are internal measures which permit a label-less analysis of the obtained subspace clustering. In this work, we propose methods for revising clusters obtained from arbitrarily oriented correlation clustering algorithms. Initial experiments reveal improvements in the clustering results compared to the original clustering outcome. Our proposed approach is simple and can be applied as a post-processing step on arbitrarily oriented correlation clusterings.
LWDA 2021 is a joint conference of six special interest groups of the German Computer Science Society (GI), addressing research in the areas of knowledge discovery and machine learning, information retrieval, database systems, and knowledge management. The German acronym LWDA stands for ‘Lernen, Wissen, Daten, Analysen’ (Learning, Knowledge, Data, Analytics). Following the tradition of the last years, LWDA 2021 provides a joint forum for experienced and young researchers, to bring insights into recent trends, technologies, and applications and to promote interaction among the special interest groups.
We introduce AnyCORE (Anytime Cluster Outlier REmoval), an algorithm that enables users to detect and remove outliers at any time. The algorithm is based on the idea of MORe++, an approach for outlier detection and removal that iteratively scores and removes 1d-cluster outliers in n-dimensional data sets. In contrast to MORe++, AnyCORE provides continuous responses to its users and converges independently of cluster centers. This allows AnyCORE to perform outlier detection in combination with an arbitrary clustering method that is most suitable for a given data set. We conducted our AnyCORE experiments on synthetic and real-world data sets by benchmarking its variant with k-Means as the underlying clustering method against the traditional batch algorithm version of MORe++. In extensive experiments, we show that AnyCORE is able to compete with the related batch algorithm version.
The combination of clustering with Deep Learning has gained much attention in recent years. Unsupervised neural networks like autoencoders can autonomously learn the essential structures in a data set. This idea can be combined with clustering objectives to learn relevant features automatically. Unfortunately, such approaches are often based on a k-means framework, from which they inherit various assumptions, like spherical-shaped clusters. Another assumption, also found in approaches outside the k-means family, is knowing the number of clusters a priori. In this paper, we present the novel clustering algorithm DipDECK, which can estimate the number of clusters while simultaneously improving a Deep Learning-based clustering objective. Additionally, we can cluster complex data sets without assuming only spherically shaped clusters. Our algorithm works by heavily overestimating the number of clusters in the embedded space of an autoencoder and, based on Hartigan’s Dip-test - a statistical test for unimodality - analyses the resulting micro-clusters to determine which to merge. We show in extensive experiments the various benefits of our method: (1) we achieve competitive results while learning the clustering-friendly representation and number of clusters simultaneously; (2) our method is robust regarding parameters, stable in performance, and allows for more flexibility in the cluster shape; (3) we outperform relevant competitors in the estimation of the number of clusters.
In this work, we perform an extensive investigation of two state-of-the-art (SotA) methods for the task of Entity Alignment in Knowledge Graphs. We first carefully examine the benchmarking process and identify several shortcomings, which make the results reported in the original works not always comparable. Furthermore, we suspect that it is common practice in the community to perform hyperparameter optimization directly on a test set, reducing the informative value of reported performance. Thus, we select a representative sample of benchmarking datasets and describe their properties. We also examine different initializations for entity representations, since they are a decisive factor for model performance. Furthermore, we use a shared train/validation/test split for an appropriate evaluation setting and evaluate all methods on all datasets. In our evaluation, we make several interesting findings. While we observe that most of the time SotA approaches perform better than baselines, they have difficulties when the dataset contains noise, which is the case in most real-life applications. Moreover, in our ablation study, we find that often features of the SotA methods other than those previously assumed are crucial for good performance.
In this work, we focus on the problem of retrieving relevant arguments for a query claim covering diverse aspects. State-of-the-art methods rely on explicit mappings between claims and premises and are thus unable to utilize large available collections of premises without laborious and costly manual annotation. Their diversity approach relies on removing duplicates via clustering, which does not directly ensure that the selected premises cover all aspects. This work introduces a new multi-step approach for the argument retrieval problem. Rather than relying on ground-truth assignments, our approach employs a machine learning model to capture semantic relationships between arguments. Beyond that, it aims to cover diverse facets of the query instead of trying to identify duplicates explicitly. Our empirical evaluation demonstrates that our approach leads to a significant improvement in the argument retrieval task even though it requires less data.
In high-dimensional datasets some dimensions or attributes can be more important than others. Whereas most algorithms neglect one or more dimensions for all points of a dataset, or at least for all points of a certain cluster together, our method KISS (kNN-based Importance Score of Subspaces) detects the most important dimensions for each point individually. It is fully unsupervised and does not depend on distorted multidimensional distance measures. Instead, the $k$ nearest neighbors ($k$NN) in one-dimensional projections of the data points are used to calculate the score for every dimension’s importance. Experiments across a variety of settings show that these scores reflect the structure of the data well. KISS can be used for subspace clustering. What sets it apart from other methods for this task is its runtime, which is linear in the number of dimensions and $O(n \log n)$ in the number of points, as opposed to the quadratic or even exponential runtimes of previous algorithms.
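To make the scoring idea concrete, here is a minimal sketch of a KISS-style per-point dimension score. The aggregation used below (mean 1d kNN distance per dimension, inverted and normalized per point) and the function name kiss_like_scores are illustrative stand-ins, not the published formula.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def kiss_like_scores(X, k=10):
    """Toy per-point, per-dimension importance scores via 1d kNN distances."""
    n, d = X.shape
    raw = np.zeros((n, d))
    for j in range(d):
        proj = X[:, [j]]                           # one-dimensional projection of the data
        nn = NearestNeighbors(n_neighbors=k + 1).fit(proj)
        dist, _ = nn.kneighbors(proj)              # first neighbor is the point itself
        raw[:, j] = dist[:, 1:].mean(axis=1)       # small value = dense 1d neighborhood
    inv = 1.0 / (raw + 1e-12)                      # invert so that higher means more important
    return inv / inv.sum(axis=1, keepdims=True)    # per-point normalization (assumption)

X = np.random.RandomState(0).rand(200, 5)
S = kiss_like_scores(X, k=10)                      # S[i, j]: importance of dimension j for point i
```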
Peer reviewing is a central process in modern research and essential for ensuring the high quality and reliability of published work. At the same time, it is a time-consuming process, and increasing interest in emerging fields often results in a high review workload, especially for senior researchers in these areas. How to cope with this problem is an open question and it is actively discussed across all major conferences. In this work, we propose an Argument Mining based approach for the assistance of editors, meta-reviewers, and reviewers. We demonstrate that the decision process in the field of scientific publications is driven by arguments and that automatic argument identification is helpful in various use cases. One of our findings is that arguments used in the peer-review process differ from arguments in other domains, making the transfer of pre-trained models difficult. Therefore, we provide the community with a new peer-review dataset from different computer science conferences with annotated arguments. In our extensive empirical evaluation, we show that Argument Mining can be used to efficiently extract the most relevant parts from reviews, which are paramount for the publication decision. The process remains interpretable, since the extracted arguments can be highlighted in a review without detaching them from their context.
This thesis explores the connections between clustering and related tasks like subspace clustering, correlation clustering, outlier detection, and data ordering. It introduces novel methods such as the KISS score for subspace clustering, LUCK for correlation clustering, and the ABC algorithm for outlier detection. Additionally, it develops the Circle Index for optimizing data ordering to improve clustering performance. (Shortened.)
Subspace clustering has established itself as a state-of-the-art approach to clustering high-dimensional data. In particular, methods relying on the self-expressiveness property have recently proved especially successful. However, they suffer from two major shortcomings: First, a quadratic-size coefficient matrix is learned directly, preventing these methods from scaling beyond small datasets. Secondly, the trained models are transductive and thus cannot be used to cluster out-of-sample data unseen during training. Instead of learning self-expression coefficients directly, we propose a novel metric learning approach that learns a subspace affinity function using a siamese neural network architecture. Consequently, our model benefits from a constant number of parameters and a constant-size memory footprint, allowing it to scale to considerably larger datasets. In addition, we can formally show that our model is still able to exactly recover subspace clusters given an independence assumption. The siamese architecture in combination with a novel geometric classifier further makes our model inductive, allowing it to cluster out-of-sample data. Additionally, non-linear clusters can be detected by simply adding an autoencoder module to the architecture. The whole model can then be trained end-to-end in a self-supervised manner. This work in progress reports promising preliminary results on the MNIST dataset. In the spirit of reproducible research, we make all code publicly available. In future work we plan to investigate several extensions of our model and to expand the experimental evaluation.
In this work we propose SRE, the first internal evaluation measure for arbitrarily oriented subspace clustering results. For this purpose we present a new perspective on the subspace clustering task: the goal we formalize is to compute a clustering which represents the original dataset by minimizing the reconstruction loss from the obtained subspaces while at the same time minimizing the dimensionality as well as the number of clusters. A fundamental feature of our approach is that it is model-agnostic, i.e., it is independent of the characteristics of any specific subspace clustering method. It is scale invariant and mathematically founded. The experiments show that the SRE score better assesses the quality of an arbitrarily oriented subspace clustering compared to commonly used external evaluation measures.
Peer Kröger
Prof. Dr.
* Former member
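As a rough illustration of the reconstruction-based view formalized in the SRE abstract above, the sketch below represents each cluster by a low-dimensional PCA subspace and trades reconstruction loss off against subspace dimensionality and the number of clusters. The penalty weights alpha and beta and the function name sre_like_score are assumptions for illustration, not the published measure.

```python
import numpy as np
from sklearn.decomposition import PCA

def sre_like_score(X, labels, dims, alpha=0.1, beta=0.1):
    """Lower is better. labels[i]: cluster of point i; dims[c]: subspace dimensionality of cluster c."""
    loss = 0.0
    clusters = np.unique(labels)
    for c in clusters:
        Xc = X[labels == c]
        pca = PCA(n_components=dims[c]).fit(Xc)              # cluster-specific subspace
        Xc_rec = pca.inverse_transform(pca.transform(Xc))    # project onto the subspace and back
        loss += np.sum((Xc - Xc_rec) ** 2)                   # reconstruction loss of the cluster
    # penalize high subspace dimensionality and many clusters (illustrative weighting)
    return loss / len(X) + alpha * sum(dims[c] for c in clusters) + beta * len(clusters)
```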
In the setting of unsupervised machine learning, especially in clustering tasks, the evaluation of either novel algorithms or of a clustering of novel data is challenging. While in the literature the evaluation of new methods is mostly performed on labelled data, there are cases where no labels are at our disposal. In other cases we may not want to trust the “ground truth” labels. In general, there exists a spectrum of so-called internal evaluation measures in the literature, each of which is mostly specialized towards a specific clustering model. The model of arbitrarily oriented subspace clusters is a more recent one. To the best of our knowledge there exist, at the current time, no internal evaluation measures tailored to assessing this particular type of clustering. In this work we present the first internal quality measures for arbitrarily oriented subspace clusterings, namely the normalized projected energy (NPE) and the subspace compactness score (SCS). The results from the experiments show that especially NPE is capable of assessing clusterings by considering archetypical properties of arbitrarily oriented subspace clustering.
Peer Kröger
Prof. Dr.
* Former member
Having data with a high number of features raises the need to detect clusters which exhibit high similarity within subspaces of features. These subspaces can be arbitrarily oriented, which gave rise to arbitrarily-oriented subspace clustering (AOSC) algorithms. Among such algorithms, some are specialized in detecting global clusters that span the entire dataset regardless of any distances, while others are tailored to detecting local clusters. Both of these views (local and global) are obtained separately by the respective algorithms. While, from an algebraic point of view, neither of the two representations can claim to be the true one, it is vital that domain scientists are presented with both views, enabling them to inspect and decide which of the representations is closest to the domain-specific reality. We propose in this work a framework which is capable of detecting locally dense arbitrarily oriented subspace clusters which are embedded within a global one. We are also the first to introduce definitions of locally and globally arbitrarily oriented subspace clusters. Our experiments illustrate that this approach has no significant impact on cluster quality or runtime performance, and enables scientists to no longer be limited exclusively to either the local or the global view.
Peer Kröger
Prof. Dr.
* Former member
Unlabeled data is often abundant in the clinic, making machine learning methods based on semi-supervised learning a good match for this setting. Despite this, they currently receive relatively little attention in the medical image analysis literature. Instead, most practitioners and researchers focus on supervised or transfer learning approaches. The recently proposed MixMatch and FixMatch algorithms have demonstrated promising results in extracting useful representations while requiring very few labels. Motivated by these recent successes, we apply MixMatch and FixMatch in an ophthalmological diagnostic setting and investigate how they fare against standard transfer learning. We find that both algorithms outperform the transfer learning baseline on all fractions of labelled data. Furthermore, our experiments show that Mean Teacher, which is a component of both algorithms, is not needed for our classification problem, as disabling it leaves the outcome unchanged.
Data Mining and Process Mining – is one just a variant of the other, or do worlds separate the two areas from each other? The notions sound so similar, but the contents sometimes look different, so the respective researchers may get confused in their mutual perception, be it as authors or as reviewers. The talk recalls commonalities like model-based supervised and unsupervised learning approaches, and it also sheds light on peculiarities in process data and process mining tasks as seen from a data mining perspective. When considering trace data from event log files as time series, as sequences, or as activity sets, quite different data mining techniques apply and may be extended and improved. A particular example is rare pattern mining, which fills a gap between frequent patterns and outlier detection. The task aims at identifying patterns that occur with low frequency but above single outliers. Structural deficiencies may cause malfunctions or other undesired behavior that gets discarded as outliers in event logs, since it is observed only infrequently. Rare pattern mining may identify these situations, and recent approaches include clustering or ordering non-conformant traces. The talk concludes with some remarks on how to sell process mining papers to the data mining community, and vice versa, in order to improve mutual acceptance and to increase synergies between the fields.
Performance mining from event logs is a central task in managing and optimizing business processes. Established analysis techniques work with only a single timestamp per event. However, when available, time interval information enables a proper analysis of the duration of individual activities as well as of the overall execution runtime. Our novel approach, the performance skyline, considers extended events, including start and end timestamps in log files, aiming at the discovery of events that are crucial to the overall duration of real process executions. As a first contribution, our method derives a geometrical process representation for traces with interval events by using interval-based methods from sequence pattern mining and performance analysis. Secondly, we introduce the performance skyline, which discovers dominating events with respect to a given heuristic, in this case event duration. As a third contribution, we propose three techniques for the statistical analysis of performance skylines and process trace sets, enabling more accurate process discovery, conformance checking, and process enhancement. Experiments on real event logs demonstrate that our contributions are highly suitable for detecting and analyzing the dominant events of a process.
The amount of data increases steadily, and yet most clustering algorithms perform complex computations for every single data point. Furthermore, the Euclidean distance, which is used in most clustering algorithms, is often not the best choice for datasets with arbitrarily shaped clusters or high dimensionality. Based on ABOD, we introduce ABC, the first angle-based clustering method. The algorithm first identifies a small part of the data as border points of clusters based on the angles between their neighbors. Those few border points can, with some adjustments, be clustered with well-known clustering algorithms like hierarchical clustering with single linkage or DBSCAN. Residual points can quickly and easily be assigned to the cluster of their nearest border point, so the overall runtime is heavily reduced while the results improve or remain similar.
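The following sketch illustrates the overall pipeline described above with a deliberately simplified border criterion: instead of the published angle-based score, a point is flagged as a border point when the directions to its k nearest neighbors do not cancel out; the border points are clustered with DBSCAN, and every residual point is assigned to the cluster of its nearest border point. All thresholds and the names border_score and abc_like_clustering are illustrative choices, not the published method.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

def border_score(X, k=15):
    """~0 for interior points (neighbor directions cancel out), ~1 near a cluster border."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    score = np.zeros(len(X))
    for i, neigh in enumerate(idx[:, 1:]):
        vecs = X[neigh] - X[i]
        vecs /= np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-12
        score[i] = np.linalg.norm(vecs.mean(axis=0))
    return score

def abc_like_clustering(X, k=15, border_fraction=0.2, eps=0.5):
    score = border_score(X, k)
    border = score >= np.quantile(score, 1 - border_fraction)    # heuristic border selection
    border_labels = DBSCAN(eps=eps, min_samples=3).fit_predict(X[border])
    nn = NearestNeighbors(n_neighbors=1).fit(X[border])
    _, nearest = nn.kneighbors(X)
    return border_labels[nearest[:, 0]]                           # residual points inherit labels
```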
Using grid-based clustering algorithms on high-dimensional data has the advantage of being able to summarize data points into cells, but usually produces an exponential number of grid cells. In this paper we introduce Grace (using a Grid which is adaptive for clustering), a clustering algorithm which limits the number of cells produced depending on the number of points in the dataset. A non-equidistant grid is constructed based on the distribution of points in one-dimensional projections of the data. A density threshold is automatically deduced from the data and used to detect dense cells, which are later combined into clusters. The adaptive grid structure makes an efficient but still accurate clustering of multidimensional data possible. Experiments with synthetic as well as real-world data sets of various sizes and dimensionalities confirm these properties.
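A minimal sketch of what such an adaptive grid can look like, assuming quantile-based bin edges per dimension and a mean-occupancy density threshold (both simplifications chosen for illustration); only occupied cells are materialized, so the number of cells stays bounded by the number of points.

```python
import numpy as np
from collections import Counter

def adaptive_grid_dense_cells(X, bins_per_dim=8):
    n, d = X.shape
    # non-equidistant edges per dimension, derived from the 1d point distributions
    edges = [np.quantile(X[:, j], np.linspace(0, 1, bins_per_dim + 1)[1:-1]) for j in range(d)]
    cell_ids = np.stack([np.searchsorted(edges[j], X[:, j]) for j in range(d)], axis=1)
    counts = Counter(map(tuple, cell_ids))               # only occupied cells are stored
    threshold = np.mean(list(counts.values()))           # simple data-derived density threshold
    dense = {cell for cell, cnt in counts.items() if cnt > threshold}
    return cell_ids, dense                               # dense cells could then be merged into clusters
```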
As data processing techniques get more and more sophisticated every day, many of us researchers often get lost in the details and subtleties of the algorithms we are developing and far too easily seem to forget to look also at the very first steps of every algorithm: the input of the data. Since there are plenty of library functions for this task, we indeed do not have to think about this part of the pipeline anymore. But maybe we should. All data is stored and loaded into a program in some order. In this vision paper we study how ignoring this order can (1) lead to performance issues and (2) make research results unreproducible. We furthermore examine desirable properties of a data ordering and why current approaches are often not suited to tackle the two mentioned problems.
When facing high-dimensional data streams, clustering algorithms quickly reach the boundaries of their usefulness, as most of these methods are not designed to deal with the curse of dimensionality. Due to the inherent sparsity of high-dimensional data, distances between objects tend to become meaningless, since the distances between any two objects measured in the full-dimensional space tend to become the same for all pairs of objects. In this work, we present a novel oriented subspace clustering algorithm that is able to deal with such issues and detects arbitrarily oriented subspace clusters in high-dimensional data streams. Data streams generally pose the challenge that the data cannot be stored in its entirety, and hence there is a general demand for suitable data handling strategies for clustering algorithms such that the data can be processed within a single scan. We therefore propose the CASHSTREAM algorithm, which unites state-of-the-art stream processing techniques and additionally relies on the Hough transform to detect arbitrarily oriented subspace clusters. Our experiments compare CASHSTREAM to its static counterpart and show that the amount of consumed memory is significantly decreased while there is no loss in terms of runtime.
Peer Kröger
Prof. Dr.
* Former member
In this work, we focus on the problem of entity alignment in Knowledge Graphs (KG) and we report on our experiences when applying a Graph Convolutional Network (GCN) based model for this task. Variants of GCN are used in multiple state-of-the-art approaches, and therefore it is important to understand the specifics and limitations of GCN-based models. Despite serious efforts, we were not able to fully reproduce the results from the original paper, and after a thorough audit of the code provided by the authors, we concluded that their implementation differs from the architecture described in the paper. In addition, several tricks are required to make the model work, and some of them are not very intuitive. We provide an extensive ablation study to quantify the effects these tricks and changes of architecture have on the final performance. Furthermore, we examine current evaluation approaches and systematize available benchmark datasets. We believe that people interested in KG matching might profit from our work, as well as novices entering the field.
This thesis addresses several challenges in social data analytics, focusing on methods for clustering, learning from network data, and analyzing dynamic social data. It introduces novel algorithms for correlation clustering on streaming data, hierarchical clustering for social maps, and user identification based on spatio-temporal mobility patterns. Additionally, the thesis presents various node embedding techniques for learning representations from network topology and proposes a graph neural network model for matching nodes across overlapping graphs. (Shortened.)
The idea of combining the high representational power of deep learning techniques with clustering methods has gained much interest in recent years. Optimizing representation and clustering simultaneously has been shown to have an advantage over optimizing them separately. However, so far all proposed methods have been using a flat clustering strategy, with the true number of clusters known a priori. In this paper, we propose the Deep Embedded Cluster Tree (DeepECT), the first divisive hierarchical embedded clustering method. The cluster tree does not need to know the true number of clusters during optimization. Instead, the level of detail to be analyzed can be chosen afterward and for each sub-tree separately. An optional data-augmentation-based extension allows DeepECT to ignore prior-known invariances of the dataset, such as affine transformations in image data. We evaluate and show the advantages of DeepECT in extensive experiments.
Christian Böhm
Prof. Dr.
* Former member
Collecting spatio-temporal resources is an important goal in many real-world use cases such as finding customers for taxicabs. In this paper, we tackle the resource search problem posed by the GIS Cup 2019 where the objective is to minimize the average search time of taxicabs looking for customers. The main challenge is that the taxicabs may not communicate with each other and the only observation they have is the current time and position. Inspired by radial transit route structures in urban environments, our approach relies on round trips that are used as action space for a downstream reinforcement learning procedure. Our source code is publicly available at https://github.com/Fe18/TripBanditAgent.
Sabrina Friedl
* Former member
MORe++ is a k-Means based outlier removal method working on high-dimensional data. It is simple, efficient and scalable. The core idea is to find local outliers by examining the points of the different k-Means clusters separately. In this way, one-dimensional projections of the data become meaningful and allow one-dimensional outliers to be found easily, which would otherwise be hidden by points of other clusters. MORe++ does not need any input parameters beyond the number of clusters k used for k-Means, and delivers an intuitively accessible degree of outlierness. In extensive experiments it performed well compared to k-Means-- and ORC.
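To illustrate the idea of per-cluster one-dimensional outlier scoring, here is a heavily simplified stand-in: after k-Means, every point receives the maximum of its per-dimension robust z-scores computed within its own cluster. The concrete score below is an illustrative choice, not the published degree of outlierness.

```python
import numpy as np
from sklearn.cluster import KMeans

def per_cluster_1d_outlier_scores(X, k=3, random_state=0):
    labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
    scores = np.zeros(len(X))
    for c in range(k):
        mask = labels == c
        Xc = X[mask]
        med = np.median(Xc, axis=0)
        mad = np.median(np.abs(Xc - med), axis=0) + 1e-12        # per-dimension robust spread
        scores[mask] = np.max(np.abs(Xc - med) / mad, axis=1)    # worst 1d deviation within the cluster
    return labels, scores                                         # large score = likely 1d-cluster-outlier
```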
For a given query object, Reverse k-Nearest Neighbor (RkNN) queries retrieve those objects that have the query object among their k-nearest neighbors. However, computing the k-nearest neighbor sets for all points in a database is expensive in terms of computational costs. Therefore, specific index structures have been invented that apply pruning heuristics aimed at reducing the search space. At present, the state-of-the-art index structure for enabling fast RkNN query processing in general metric spaces is the MRkNNCoP-Tree, which uses linear functions to approximate lower and upper bounds on the k-distances to prune the search space. Storing those linear functions results in additional storage costs in O(n), which might be infeasible in situations where storage space is limited, e.g., on mobile devices. In this work, we present a novel index based on the MRkNNCoP-Tree as well as on recent developments in the field of neural indexing. By learning a single neural network model that approximates the k-nearest neighbor distance bounds for all points in a database, the storage complexity of the proposed index structure is reduced to O(1), while the index is still able to guarantee exact query results. As shown in our experimental evaluations on synthetic and real-world data sets, our approach can significantly reduce the required storage space at the cost of some growth in the refinement sets when relying on exact query processing.
Peer Kröger
Prof. Dr.
* Former member
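The core trick of the learned index described in the abstract above can be sketched as follows: one regression model maps a point's coordinates to its k-distances for k = 1..k_max, and a global offset derived from the training residuals keeps the prediction a conservative upper bound for all database points. The model choice (an MLP), the features, and the offset construction below are simplifications for illustration, not the published index.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.neural_network import MLPRegressor

def train_kdist_bound_model(X, k_max=10):
    """Learn approximate k-distances for all database points, plus a conservative offset."""
    nn = NearestNeighbors(n_neighbors=k_max + 1).fit(X)
    dist, _ = nn.kneighbors(X)
    kdists = dist[:, 1:]                                      # true k-distances for k = 1..k_max
    model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000,
                         random_state=0).fit(X, kdists)
    upper_offset = (kdists - model.predict(X)).max(axis=0)    # largest underestimation per k
    return model, upper_offset

def upper_bound_kdists(model, upper_offset, points):
    # prediction plus offset is >= the true k-distance for every point seen during training,
    # which is exactly the set of database points the index has to bound
    return model.predict(points) + upper_offset
```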
Nowadays, as lots of data is gathered in large volumes and with high velocity, the development of algorithms capable of handling complex data streams in (near) real-time is a major challenge. In this work, we present the algorithm CORRSTREAM, which tackles the problem of detecting arbitrarily oriented subspace clusters in high-dimensional data streams. The proposed method follows a two-phase approach, where the continuous online phase aggregates data points within a proper microcluster structure that stores all necessary information to define a microcluster’s subspace and is generic enough to cope with a variety of offline procedures. Given several such microclusters, the offline phase is able to build a final clustering model which reveals arbitrarily oriented subspaces in which the data tend to cluster. In our experimental evaluation, we show that CORRSTREAM not only has an acceptable throughput but also outperforms its static counterpart algorithms by orders of magnitude in terms of runtime. At the same time, the loss of accuracy is quite small.
Peer Kröger
Prof. Dr.
* Former member
Principal Component Analysis (PCA) is a popular method for linear dimensionality reduction. It is often used to discover hidden correlations or to facilitate the interpretation and visualization of data. However, it is susceptible to outliers: strong outliers can skew the principal components and, as a consequence, lead to a higher reconstruction loss. While there exist several sophisticated approaches to make PCA more robust, we present an approach which is intriguingly simple: we replace the covariance matrix with a so-called coMAD matrix. First experiments show that PCA based on the coMAD matrix is more robust towards outliers.
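A minimal sketch of one reading of this idea: the coMAD matrix is built from medians of products of median-centered coordinates (a comedian-style construction), and the principal directions are its leading eigenvectors. The helper names and the exact centering are assumptions for illustration.

```python
import numpy as np

def comad_matrix(X):
    """Median-based analogue of the covariance matrix (comedian-style construction)."""
    centered = X - np.median(X, axis=0)
    d = X.shape[1]
    C = np.empty((d, d))
    for i in range(d):
        for j in range(d):
            C[i, j] = np.median(centered[:, i] * centered[:, j])
    return C

def comad_pca(X, n_components=2):
    eigvals, eigvecs = np.linalg.eigh(comad_matrix(X))    # the coMAD matrix is symmetric
    order = np.argsort(eigvals)[::-1]                     # sort by decreasing "robust variance"
    return eigvecs[:, order[:n_components]]               # robust principal directions
```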
We introduce a generator for data containing subspace clusters which is accurately tunable and adjustable to the needs of developers. It is available online and allows users to specify a plethora of characteristics the data should contain, while it is simultaneously able to generate meaningful data containing subspace clusters from a minimum of user input.
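As a toy illustration of what such a generator has to produce at its core (the interface, options and defaults below are invented for illustration and much smaller than the described tool): each cluster is compact in a randomly chosen subset of dimensions and spread uniformly in the remaining ones.

```python
import numpy as np

def generate_subspace_clusters(n_clusters=3, points_per_cluster=200, dim=10,
                               subspace_dim=3, noise=0.05, seed=0):
    rng = np.random.default_rng(seed)
    X, y, subspaces = [], [], []
    for c in range(n_clusters):
        subspace = rng.choice(dim, size=subspace_dim, replace=False)   # relevant dimensions
        center = rng.uniform(0, 1, size=subspace_dim)
        pts = rng.uniform(0, 1, size=(points_per_cluster, dim))        # irrelevant dims: uniform noise
        pts[:, subspace] = center + rng.normal(0, noise, size=(points_per_cluster, subspace_dim))
        X.append(pts)
        y.extend([c] * points_per_cluster)
        subspaces.append(sorted(subspace.tolist()))
    return np.vstack(X), np.array(y), subspaces    # data, labels, ground-truth subspaces
```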
Given trend data for different keywords, scientists may want to cluster them in order to detect specific terms which exhibit a similar trending behavior. For this purpose, a periodic regression can be performed on each of the time series. In this work we ask: What if we do not simply cluster the regression models of each time series, but the periodic signal constituents? The impact of such an approach is twofold: first, we would see at the regression level how similar or dissimilar two time series are regarding their periodic models; secondly, we would be able to see similarities between different time series based on single signal constituents, capturing the notion that although time series may differ at the regression level, they may be similar at the constituent level, reflecting other periodic influences. The results of this approach reveal commonalities between time series at the constituent level that are not visible at first glance when looking at their plain regression models.
When it comes to the task of dimensionality reduction, Principal Component Analysis (PCA) is among the most well-known methods. Despite its popularity, PCA is prone to outliers, which can be traced back to the fact that the method relies on a covariance matrix. Even with the variety of sophisticated methods to enhance the robustness of PCA, we provide in this work in progress an approach which is intriguingly simple: the covariance matrix is replaced by a so-called comode matrix. The experiments show that this minor modification significantly reduces the reconstruction loss. In this work we introduce the comode and its relation to the MeanShift algorithm, including its bandwidth parameter, compare it in an experiment against the classic covariance matrix, and evaluate the impact of the bandwidth hyperparameter on the reconstruction error.
Chains connecting two or more different clusters are a well-known problem of clustering algorithms like DBSCAN or Single Linkage Clustering. Since even a small number of points resulting from, e.g., noise can form such a chain and build a bridge between different clusters, the results of the clustering algorithm can become distorted: several disparate clusters are merged into one. This single-link effect is well known, but to the best of our knowledge there are no satisfying solutions for extracting such chains yet. We present a new algorithm detecting not only straight chains between clusters, but also bent and noisy ones. Users are able to choose between eliminating one-dimensional and higher-dimensional chains connecting clusters to recover the underlying cluster structure. Also, the desired straightness can be set by the user. As this paper is an extension of ‘Chain-detection for DBSCAN’, we apply our technique not only in combination with DBSCAN but also with single-link hierarchical clustering. On a real-world dataset containing traffic accidents in Great Britain we were able to detect chains emerging from streets between cities and villages, which had led to clusters composed of diverse villages. Additionally, we analyzed the robustness regarding the variance of chains in synthetic experiments.
As machine learning becomes a more and more important area in Data Science, bringing with it a rise in abstractness and complexity, the desire for explainability rises, too. With our work we aim to gain explainability by focusing on correlation clustering and try to pursue the original goal of different Data Science tasks: extracting knowledge from data. As well-known tools like Fold-It or GeoTime show, gamification is a very powerful approach, and not only for solving tasks which prove more difficult for machines than for humans. We can also gain knowledge from how players proceed when trying to solve such difficult tasks. That is why we developed Straighten it up!, a game in which users try to find the best linear correlations in high-dimensional datasets. Finding arbitrarily oriented subspaces in high-dimensional data is an exponentially complex task due to the number of potential subspaces with regard to the number of dimensions. Nevertheless, linearly correlated points form a simple pattern that is easy for the human eye to track. Straighten it up! gives users an overview of two-dimensional projections of a self-chosen dataset. Users decide which subspace they want to examine first and can draw in arbitrarily many lines fitting the data. An offset inside of which points are assigned to the corresponding line can easily be chosen for every line independently, and users can switch between different projections at any time. We developed a scoring system not only as an incentive, but above all for further examination, based on the density of each cluster, its minimum spanning tree, the size of the offset, and the coverage. By tracking every step of a user we are able to detect common mechanisms and examine differences to state-of-the-art correlation and subspace clustering algorithms, resulting in more comprehensibility.
Artificially generated data sets are present in the experimental sections of many data mining and machine learning publications. One of the reasons to use synthetic data is that scientists can express their understanding of a “ground truth”, having labels and thus an expectation of what an algorithm should be able to detect. This also permits a degree of control to create data sets which either emphasize the strengths of a method or reveal its weaknesses and thus potential targets for improvement. In order to develop methods which detect linearly correlated clusters, generating such artificial clusters is indispensable. This is mostly done with command-line based scripts, which may be tedious since they demand that users ‘visualize’ in their minds how the correlated clusters should look and be positioned within the data space. We present in this work RAIL, a generator for Reproducible Artificial Interactive Linear correlated data. With RAIL, users can add multiple planes to a data space and arbitrarily change the orientation and position of those planes in an interactive fashion. This is achieved by manipulating the parameters describing each of the planes, giving users immediate feedback in real time. With this approach scientists no longer need to imagine their data but can interactively explore and design their own artificial data sets containing linearly correlated clusters. Another convenient feature in this context is that the data is only generated once the users decide that their design phase is completed. If researchers want to share data, a small file is exchanged containing the parameters which describe the clusters, such as their Hessian normal form or the number of points per cluster, instead of sharing several large CSV files.
LUCK allows any distance-based clustering algorithm to be used to find linearly correlated data. For that, a novel distance function is introduced which takes the distribution of the kNN of points into account and corresponds to the probability of two points being part of the same linear correlation. In this work in progress we tested the distance measure with DBSCAN and k-Means, comparing it to the well-known linear correlation clustering algorithms ORCLUS, 4C, COPAC, LMCLUS, and CASH, and received good results for difficult synthetic data sets containing crossing or non-continuous correlations.
As the ordering of data, particularly of graphs, can heavily influence the results of diverse Data Mining tasks performed on it, we introduce the Circle-Index, the first internal quality measure for orderings of graphs. It is based on a circular arrangement of nodes but, in contrast to similar arrangements from, e.g., the field of visual analytics, takes the edge lengths in this arrangement into account. Minimizing the Circle-Index leads to an arrangement which not only offers a simple way to cluster the data using a constrained MinCut in only linear time, but is also visually convincing. We developed the clustering algorithm CirClu, which implements this minimization and MinCut, and compared it with several established clustering algorithms, achieving very good results. Simultaneously, we compared the Circle-Index with several internal quality measures for clusterings. We observed a strong correspondence between the Circle-Index and how well the obtained clusterings match the respective ground truths on diverse real-world datasets.
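The core quantity can be sketched in a few lines: nodes are placed on a unit circle in the given order and the chord lengths of all edges are summed, so orderings that keep connected nodes close together score lower. The exact definition and normalization of the published Circle-Index are not reproduced here; the name circle_index_like marks this as an approximation.

```python
import numpy as np

def circle_index_like(ordering, edges):
    """ordering: sequence of node ids; edges: iterable of (u, v) pairs."""
    n = len(ordering)
    pos = {node: i for i, node in enumerate(ordering)}
    angles = 2 * np.pi * np.arange(n) / n
    xy = np.column_stack([np.cos(angles), np.sin(angles)])   # evenly spaced positions on the circle
    return sum(np.linalg.norm(xy[pos[u]] - xy[pos[v]]) for u, v in edges)

# A path graph scores better when the ordering follows the path.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
assert circle_index_like([0, 1, 2, 3, 4], edges) < circle_index_like([0, 3, 1, 4, 2], edges)
```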
Periodicities are omnipresent: in nature in the cycles of predator and prey populations, in reoccurring patterns of our power consumption over the day, or in the presence of flu diseases over the year. Given the importance of periodicities, we ask: Is there a way to detect periodic correlated clusters hidden in event series? As a work in progress, we propose a method for detecting sinusoidal periodic correlated clusters in event series which relies on parameter space transformation. Our contributions are: providing the first non-linear correlation clustering algorithm for detecting periodic correlated clusters, and providing an explicit model that gives domain experts information on parameters such as amplitude, frequency, phase shift and vertical shift of the detected clusters. Beyond that, we approach the issue of determining an adequate frequency and phase shift of the detected correlations given a frequency and phase-shift boundary.
Peer Kröger
Prof. Dr.
* Former member
In publications describing a clustering method, the chosen hyperparameters are, as far as we can currently observe, in many cases determined empirically. In this work in progress we discuss and propose one approach for systematically exploring hyperparameters and analyzing their effects on the data set. In the context of hyperparameter analysis, we further introduce a modified definition of the term resilience, which here refers to a subset of data points that remains in the same cluster across different hyperparameter settings. In order to analyze relations among different hyperparameters, we further introduce the concept of dynamic intersection computing.
In this work we present Rock, a method where the points roam to their clusters using k-NN. Rock is a draft of an algorithm which is capable of detecting non-convex clusters of arbitrary dimension while delivering representatives for each cluster, similar to, e.g., Mean Shift or k-Means. Applying Rock, points roam to the mean of their k-NN while k is incremented in every step. In this way, outlying points and noise move toward their nearest cluster, while the clusters themselves contract first to their skeletons and then to one representative point each. Our empirical results on synthetic and real data demonstrate that Rock is able to detect clusters on datasets where mode-seeking or density-based approaches do not succeed.
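A compact sketch of the roaming step described above, with a fixed number of iterations instead of a proper convergence criterion and without the final extraction of representatives (both simplifications for illustration):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def rock_like_roaming(X, k_start=3, iterations=15):
    P = X.copy()
    k = k_start
    for _ in range(iterations):
        nn = NearestNeighbors(n_neighbors=k + 1).fit(P)
        _, idx = nn.kneighbors(P)
        P = P[idx[:, 1:]].mean(axis=1)    # every point roams to the mean of its current k-NN
        k += 1                            # k grows in every step, pulling in outlying points
    return P                              # contracted positions; nearby points share a representative
```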
In different research domains, experiments are conducted that aim at the detection of (hyper)linear correlations among multiple features within a given data set. For this purpose several methods exist, one of which is highly robust against noise and detects linearly correlated clusters regardless of any locality assumption. This method is based on parameter space transformation. The currently available parameter-transform based algorithms detect the clusters by explicitly scanning for intersections of functions in parameter space. This approach comes with drawbacks: it is difficult to analyze aspects going beyond the sole intersection of functions, such as, e.g., the area around the intersections, and it is computationally expensive. The work-in-progress method we provide here overcomes the mentioned drawbacks by sampling d-dimensional tuples in data space, generating a (hyper)plane and representing this plane as a single point in parameter space. With this approach we no longer scan for intersection points of functions in parameter space but for dense regions of such parameter vectors. In future work, well-established clustering algorithms can thus be applied in parameter space to detect, e.g., dense regions, modes or hierarchies of linear correlations.
Peer Kröger
Prof. Dr.
* Former member
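A sketch of the sampling step from the abstract above: d sampled points define a (d-1)-dimensional hyperplane whose unit normal and offset form one parameter vector; collecting many such vectors yields a parameter-space point cloud whose dense regions indicate linear correlations. The number of samples, the sign convention and the downstream density analysis are left open in the abstract and are illustrative choices here.

```python
import numpy as np

def sample_hyperplane_parameters(X, n_samples=1000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    params = []
    for _ in range(n_samples):
        pts = X[rng.choice(n, size=d, replace=False)]    # d points span a (d-1)-dim hyperplane
        centered = pts - pts.mean(axis=0)
        _, _, vt = np.linalg.svd(centered)
        normal = vt[-1]                                  # right-singular vector of the smallest singular value
        offset = normal @ pts.mean(axis=0)
        if normal[0] < 0:                                # fix the sign to avoid mirrored duplicates
            normal, offset = -normal, -offset
        params.append(np.append(normal, offset))
    return np.array(params)                              # feed into any density-based clustering
```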
In recent years, the demand has increased for algorithms which provide not only their results but also a certain degree of explainability. In this paper we envision a class of clustering algorithms where users can interact not only with the input or output but can also intervene in the clustering process itself, which we coin with the term process-aware clustering. Further, we sketch the challenges emerging with this type of algorithm, such as the need for adequate measures that evaluate the progression through the computation process of a clustering method. Beyond explainability of how the results are generated, we propose methods aimed at systematically analyzing the hyperparameter space of an algorithm, determining suitable hyperparameters in a more ordered fashion rather than applying a trial-and-error scheme.
Clustering algorithms mostly follow the pipeline of providing input data and hyperparameter values; then the algorithm is executed and the output files are generated or visualized. We provide in our work an early prototype of an interactive density-based clustering tool named DICE, in which users can change the hyperparameter settings and immediately observe the resulting clusters. Further, users can browse through each of the detected clusters and obtain statistics as well as a convex-hull profile for each cluster. DICE also keeps track of the chosen settings, enabling users to review which hyperparameter values have been chosen previously. DICE can be used not only in the scientific context of analyzing data, but also in didactic settings in which students can learn in an exploratory fashion how a density-based clustering algorithm like, e.g., DBSCAN behaves.
Peer Kröger
Prof. Dr.
* Former member