is Professor of Computer Science and head of the Database Systems and Data Mining Chair at LMU Munich.
His fundamental research on data mining and database technologies, with application domains in engineering, business, the life sciences, and the humanities, has yielded more than 300 scientific publications so far. He serves on many program committees and scientific boards and is co-chair of the LMU Data Science Lab, the ZD.B Innovation Lab, the Munich School of Data Science @Helmholtz, TUM and LMU (MuDS), and the Elite Master program in Data Science at LMU.
Allocation tasks represent a class of problems where a limited amount of resources must be allocated to a set of entities at each time step. Prominent examples of this task include portfolio optimization and distributing computational workloads across servers. Allocation tasks are typically bound by linear constraints describing practical requirements that have to be strictly fulfilled at all times. In portfolio optimization, for example, investors may be obligated to allocate less than 30% of the funds to a certain industrial sector in any investment period. Such constraints restrict the action space of allowed allocations in intricate ways, which makes learning a policy that avoids constraint violations difficult. In this paper, we propose a new method for constrained allocation tasks based on an autoregressive process that sequentially samples allocations for each entity. In addition, we introduce a novel de-biasing mechanism to counter the initial bias caused by sequential sampling. We demonstrate the superior performance of our approach compared to a variety of Constrained Reinforcement Learning (CRL) methods on three distinct constrained allocation tasks: portfolio optimization, computational workload distribution, and a synthetic allocation benchmark.
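To make the setting concrete, the following minimal numpy sketch sequentially samples one feasible allocation under per-entity caps. It is not the learned autoregressive policy from the paper and omits the de-biasing mechanism; the function name, cap values, and uniform sampling are illustrative assumptions only.

```python
import numpy as np

def sample_feasible_allocation(caps, rng=np.random.default_rng(0)):
    """Sequentially sample an allocation a with sum(a) == 1 and 0 <= a_i <= caps[i].

    Illustrative only: a real policy would condition each step on the state and
    the previously sampled entities; here each fraction is drawn uniformly from
    the interval that keeps the remaining budget allocatable.
    """
    caps = np.asarray(caps, dtype=float)
    assert caps.sum() >= 1.0, "constraints must admit at least one feasible allocation"
    n, remaining = len(caps), 1.0
    alloc = np.zeros(n)
    for i in range(n):
        tail_capacity = caps[i + 1:].sum()        # what later entities can still absorb
        lo = max(0.0, remaining - tail_capacity)  # must not leave an unallocatable rest
        hi = min(caps[i], remaining)
        alloc[i] = remaining if i == n - 1 else rng.uniform(lo, hi)
        remaining -= alloc[i]
    return alloc

caps = [0.3, 0.3, 0.5, 0.4]           # e.g., at most 30% per constrained sector
a = sample_feasible_allocation(caps)
print(a, a.sum())                      # fractions respect the caps and sum to 1
```

The interval [lo, hi] at each step is what keeps sequential sampling feasible: an entity may never take so much that the remaining entities cannot absorb the rest, nor so little that the rest exceeds their combined caps.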
Finding meaningful groups, i.e., clusters, in high-dimensional data such as images or texts without labeled data at hand is an important challenge in data mining. In recent years, deep clustering methods have achieved remarkable results in these tasks. However, most of these methods require the user to specify the number of clusters in advance. This is a major limitation since the number of clusters is typically unknown if labeled data is unavailable. Thus, an area of research has emerged that addresses this problem. Most of these approaches estimate the number of clusters separately from the clustering process. This results in a strong dependency of the clustering result on the quality of the initial embedding. Other approaches are tailored to specific clustering processes, making them hard to adapt to other scenarios. In this paper, we propose UNSEEN, a general framework that, starting from a given upper bound, is able to estimate the number of clusters. To the best of our knowledge, it is the first method that can be easily combined with various deep clustering algorithms. We demonstrate the applicability of our approach by combining UNSEEN with the popular deep clustering algorithms DCN, DEC, and DKM and verify its effectiveness through an extensive experimental evaluation on several image and tabular datasets. Moreover, we perform numerous ablations to analyze our approach and show the importance of its components.
Recent trends in Video Instance Segmentation (VIS) have seen a growing reliance on online methods to model complex and lengthy video sequences. However, the degradation of representation and noise accumulation of the online methods, especially during occlusion and abrupt changes, pose substantial challenges. Transformer-based query propagation provides promising directions at the cost of quadratic memory attention. However, they are susceptible to the degradation of instance features due to the above-mentioned challenges and suffer from cascading effects. The detection and rectification of such errors remain largely underexplored. To this end, we introduce GRAtt-VIS, Gated Residual Attention for Video Instance Segmentation. Firstly, we leverage a Gumbel-Softmax-based gate to detect possible errors in the current frame. Next, based on the gate activation, we rectify degraded features from its past representation. Such a residual configuration alleviates the need for dedicated memory and provides a continuous stream of relevant instance features. Secondly, we propose a novel inter-instance interaction using gate activation as a mask for self-attention. This masking strategy dynamically restricts the unrepresentative instance queries in the self-attention and preserves vital information for long-term tracking. We refer to this novel combination of Gated Residual Connection and Masked Self-Attention as the GRAtt block, which can easily be integrated into the existing propagation-based framework. Further, GRAtt blocks significantly reduce the attention overhead and simplify dynamic temporal modeling. GRAtt-VIS achieves state-of-the-art performance on YouTube-VIS and the highly challenging OVIS dataset, significantly improving over previous methods.
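As a rough illustration of the two ideas named above (a Gumbel-Softmax gate that decides per instance query whether to trust the current frame, and using that gate to mask inter-instance self-attention), here is a small PyTorch sketch. It is not the official GRAtt-VIS block; the module structure, shapes, and the fallback-to-previous-frame rule are assumptions made for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedResidualAttention(nn.Module):
    """Minimal sketch: gated residual rectification plus masked self-attention."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.gate_logits = nn.Linear(dim, 2)   # two classes: keep current / fall back
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, curr, prev):
        # curr, prev: (batch, num_queries, dim) instance queries of the current
        # frame and the propagated queries from the previous frame.
        gate = F.gumbel_softmax(self.gate_logits(curr), tau=1.0, hard=True)  # (B, Q, 2)
        keep = gate[..., :1]                           # 1 -> trust the current frame
        rectified = keep * curr + (1.0 - keep) * prev  # gated residual rectification

        # Use the gate as a mask: queries that fell back are excluded from attention.
        pad_mask = keep.squeeze(-1) < 0.5              # True = ignore this query
        pad_mask = pad_mask & ~pad_mask.all(dim=1, keepdim=True)  # avoid all-masked rows
        out, _ = self.attn(rectified, rectified, rectified, key_padding_mask=pad_mask)
        return rectified + out

block = GatedResidualAttention(dim=64)
curr, prev = torch.randn(2, 10, 64), torch.randn(2, 10, 64)
print(block(curr, prev).shape)   # torch.Size([2, 10, 64])
```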
Large language models (LLMs) excel at retrieving information from lengthy text, but their vision-language counterparts (VLMs) face difficulties with hour-long videos, especially for temporal grounding. Specifically, these VLMs are constrained by frame limitations, often losing essential temporal details needed for accurate event localization in extended video content. We propose ReVisionLLM, a recursive vision-language model designed to locate events in hour-long videos. Inspired by human search strategies, our model initially targets broad segments of interest, progressively revising its focus to pinpoint exact temporal boundaries. Our model can seamlessly handle videos of vastly different lengths, from minutes to hours. We also introduce a hierarchical training strategy that starts with short clips to capture distinct events and progressively extends to longer videos. To our knowledge, ReVisionLLM is the first VLM capable of temporal grounding in hour-long videos, outperforming previous state-of-the-art methods across multiple datasets by a significant margin (+2.6% R1@0.1 on MAD).
Business processes from many domains like manufacturing, healthcare, or business administration suffer from different amounts of uncertainty concerning the execution of individual activities and their order of occurrence. As long as a process is not entirely serial, i.e., there are no forks or decisions to be made along the process execution, we are - in the absence of exhaustive domain knowledge - confronted with the question of whether and in what order activities should be executed or left out for a given case and a desired outcome. As the occurrence or non-occurrence of events has substantial implications for process key performance indicators like throughput times or scrap rate, there is ample need for assessing and modeling that process-inherent uncertainty. We propose a novel way of handling this uncertainty by leveraging the probabilistic mechanisms of Bayesian Networks to model processes from the structural and temporal information given in event log data, and we offer a comprehensive evaluation of uncertainty by modeling cases in their entirety. In a thorough analysis of well-established benchmark datasets, we show that our Process-aware Bayesian Network is capable of answering process queries concerned with any unknown process sequence regarding activities and/or attributes, enhancing the explainability of processes. Our method can infer execution probabilities of activities at different stages and can query probabilities of certain process outcomes. The key benefit of the Process-aware Query System over existing approaches is the ability to deliver probabilistic, case-diagnostic information about the execution of activities via Bayesian inference.
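As a much-simplified illustration of the kind of query such a model answers, the snippet below estimates first-order transition probabilities from a toy event log and scores a possible continuation of a case. A first-order Markov chain is only a stand-in here; the Process-aware Bayesian Network additionally models attributes and temporal information and performs full Bayesian inference. The example log and activity names are made up.

```python
from collections import Counter, defaultdict

# Toy event log: each trace is the ordered list of executed activities of one case.
log = [
    ["register", "check", "approve", "pay"],
    ["register", "check", "reject"],
    ["register", "check", "approve", "pay"],
    ["register", "check", "rework", "check", "approve", "pay"],
]

# Estimate P(next activity | current activity) from the log.
counts = defaultdict(Counter)
for trace in log:
    for a, b in zip(trace, trace[1:] + ["<end>"]):
        counts[a][b] += 1
probs = {a: {b: c / sum(nxt.values()) for b, c in nxt.items()} for a, nxt in counts.items()}

def prob_of_suffix(current, suffix):
    """Probability of observing a given continuation after `current`."""
    p = 1.0
    for nxt in suffix:
        p *= probs.get(current, {}).get(nxt, 0.0)
        current = nxt
    return p

print(probs["check"])                                     # e.g. approve 0.6, reject 0.2, rework 0.2
print(prob_of_suffix("register", ["check", "approve", "pay"]))
```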
Process mining solutions aim to improve performance, save resources, and address bottlenecks in organizations. However, success depends on data quality and availability, and existing analyses often lack diverse data for rigorous testing. To overcome this, we propose an interactive web application tool, extending the GEDI Python framework, which creates event datasets that meet specific (meta-)features. It provides diverse benchmark event data by exploring new regions within the feature space, enhancing the range and quality of process mining analyses. This tool improves evaluation quality and helps uncover correlations between meta-features and metrics, ultimately enhancing solution effectiveness.
The abundance of new approaches in process mining and the diversity of processes in the real world raise the question of this thesis: How can we create benchmarks that reliably measure the impact of event data features on process mining evaluation? Developing benchmarks that employ comprehensive intentional event data (ED) and also consider connections between characteristic ED features, methods, and metrics will support process miners in evaluating methods more efficiently and reliably.
Reliable process information, especially regarding trace durations, is crucial for smooth execution. Without it, maintaining a process becomes costly. While many predictive systems aim to identify inefficiencies, they often focus on individual process instances, missing the global perspective. It is essential not only to detect where delays occur but also to pinpoint specific activity transitions causing them. To address this, we propose CC-HIT (Creating Counterfactuals from High-Impact Transitions), which identifies temporal dependencies across the entire process. By focusing on activity transitions, we provide deeper insights into relational impacts, enabling faster resolution of inefficiencies. CC-HIT highlights the most influential transitions on process performance, offering actionable insights for optimization. We validate this method using the BPIC 2020 dataset, demonstrating its effectiveness compared to existing approaches.
Locating specific moments within long videos (20–120 min) presents a significant challenge, akin to finding a needle in a haystack. Adapting existing short video (5–30 s) grounding methods to this problem yields poor performance. Since most real-life videos, such as those on YouTube and AR/VR, are lengthy, addressing this issue is crucial. Existing methods typically operate in two stages: clip retrieval and grounding. However, this disjoint process limits the retrieval module’s fine-grained event understanding, which is crucial for specific moment detection. We propose RGNet, which deeply integrates clip retrieval and grounding into a single network capable of processing long videos at multiple granularity levels, e.g., clips and frames. Its core component is a novel transformer encoder, RG-Encoder, that unifies the two stages through shared features and mutual optimization. The encoder incorporates a sparse attention mechanism and an attention loss to model both granularities jointly. Moreover, we introduce a contrastive clip sampling technique to closely mimic the long video paradigm during training. RGNet surpasses prior methods, showcasing state-of-the-art performance on long video temporal grounding (LVTG) datasets MAD and Ego4D.
We propose FALCUN, a novel deep batch active learning method that is label- and time-efficient. Our proposed acquisition uses a natural, self-adjusting balance of uncertainty and diversity: It slowly transitions from emphasizing uncertain instances at the decision boundary to emphasizing batch diversity. In contrast, established deep active learning methods often have a fixed weighting of uncertainty and diversity, limiting their effectiveness over diverse data sets exhibiting different characteristics. Moreover, to increase diversity, most methods demand intensive search through a deep neural network’s high-dimensional latent embedding space. This leads to high acquisition times when experts are idle while waiting for the next batch for annotation. We overcome this structural problem by exclusively operating on the low-dimensional probability space, yielding much faster acquisition times without sacrificing label efficiency. In extensive experiments, we show FALCUN’s suitability for diverse use cases, including medical images and tabular data. Compared to state-of-the-art methods like BADGE, CLUE, and AlfaMix, FALCUN consistently excels in quality and speed: while FALCUN is among the fastest methods, it has the highest average label efficiency.
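The following sketch illustrates the general idea of an acquisition that works only on predicted probabilities and gradually shifts from uncertainty to diversity while filling a batch. The margin-based uncertainty, the distance-based diversity term, and the linear schedule are illustrative assumptions; FALCUN's exact scores and weighting differ.

```python
import numpy as np

def select_batch(probs, batch_size):
    """Greedy batch selection operating only on predicted class probabilities."""
    sorted_p = np.sort(probs, axis=1)
    uncertainty = 1.0 - (sorted_p[:, -1] - sorted_p[:, -2])      # small margin -> high uncertainty
    selected = [int(np.argmax(uncertainty))]
    while len(selected) < batch_size:
        beta = len(selected) / max(batch_size - 1, 1)            # 0 -> pure uncertainty, 1 -> pure diversity
        dists = np.linalg.norm(probs[:, None, :] - probs[None, selected, :], axis=2)
        diversity = dists.min(axis=1)                            # distance to closest selected point
        score = (1 - beta) * uncertainty + beta * diversity
        score[selected] = -np.inf
        selected.append(int(np.argmax(score)))
    return selected

probs = np.random.default_rng(1).dirichlet(np.ones(5), size=200)  # fake softmax outputs
print(select_batch(probs, batch_size=10))
```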
Mining data containing density-based clusters is well-established and widespread but faces problems when it comes to systematic and reproducible comparison and evaluation. Although the success of clustering methods hinges on data quality and availability, reproducibly generating suitable data for this setting is not easy, leading to mostly low-dimensional toy datasets being used. To resolve this issue, we propose DENSIRED (DENSIty-based Reproducible Experimental Data), a novel data generator for data containing density-based clusters. It is highly flexible w.r.t. a large variety of properties of the data and produces reproducible datasets in a two-step approach. First, skeletons of the clusters are constructed following a random walk. In the second step, these skeletons are enriched with data samples. DENSIRED enables the systematic generation of data for a robust and reliable analysis of methods aimed toward examining data containing density-connected clusters. In extensive experiments, we analyze the impact of user-defined properties on the generated datasets and the intrinsic dimensionalities of synthesized clusters.
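A bare-bones version of the two-step generation idea (random-walk skeleton, then samples around it) might look as follows; the step size, persistence, and noise radius are arbitrary toy parameters, and the real generator exposes many more controllable properties.

```python
import numpy as np

rng = np.random.default_rng(42)

def random_walk_skeleton(n_nodes=20, dim=2, step=1.0, persistence=0.7):
    """Build a cluster skeleton as a random walk with direction persistence."""
    points = [np.zeros(dim)]
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    for _ in range(n_nodes - 1):
        proposal = rng.normal(size=dim)
        proposal /= np.linalg.norm(proposal)
        direction = persistence * direction + (1 - persistence) * proposal
        direction /= np.linalg.norm(direction)
        points.append(points[-1] + step * direction)
    return np.array(points)

def enrich_skeleton(skeleton, n_samples=500, radius=0.4):
    """Second step: draw data samples around randomly chosen skeleton nodes."""
    idx = rng.integers(0, len(skeleton), size=n_samples)
    return skeleton[idx] + rng.normal(scale=radius, size=(n_samples, skeleton.shape[1]))

skeleton = random_walk_skeleton()
X = enrich_skeleton(skeleton)
print(skeleton.shape, X.shape)   # (20, 2) (500, 2)
```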
Process mining solutions aim at enhancing performance, conserving resources, and alleviating bottlenecks in organizational contexts. However, as in other data mining fields, success hinges on data quality and availability. Existing analyses for process mining solutions lack diverse and ample data for rigorous testing, hindering the generalization of insights. To address this, we propose Generating Event Data with Intentional features, a framework producing event data sets satisfying specific meta-features. Considering the meta-feature space that defines feasible event logs, we observe that existing real-world datasets describe only local areas within the overall space. Hence, our framework aims at providing the capability to generate an event data benchmark which covers unexplored regions. Therefore, our approach leverages a discretization of the meta-feature space to steer generated data towards regions where combinations of meta-features are not yet covered by existing benchmark datasets. Providing a comprehensive data pool enriches process mining analyses, enables methods to capture a wider range of real-world scenarios, and improves evaluation quality. Moreover, it empowers analysts to uncover correlations between meta-features and evaluation metrics, enhancing explainability and solution effectiveness. Experiments demonstrate GEDI’s ability to produce a benchmark of intentional event data sets and robust analyses for process mining tasks.
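The core discretization idea can be illustrated with a few lines of numpy/pandas: bin each meta-feature, mark which cells existing logs fall into, and treat empty cells as generation targets. The meta-feature names, values, and bin count below are made up and are not those used by GEDI.

```python
import numpy as np
import pandas as pd
from itertools import product

# Hypothetical meta-features of existing event logs (names and values are invented).
existing = pd.DataFrame({
    "n_traces":       [500, 1200, 900, 5000],
    "trace_len_mean": [8.2, 15.1, 9.3, 40.7],
    "activity_count": [12, 30, 14, 80],
})

n_bins = 3
edges = {c: np.linspace(existing[c].min(), existing[c].max(), n_bins + 1) for c in existing}

def cell(row):
    """Map a log's meta-feature vector to its discrete cell in the feature space."""
    return tuple(int(np.clip(np.digitize(row[c], edges[c][1:-1]), 0, n_bins - 1)) for c in existing)

covered = {cell(r) for _, r in existing.iterrows()}
all_cells = set(product(range(n_bins), repeat=existing.shape[1]))
targets = sorted(all_cells - covered)        # regions no existing log falls into

print(f"{len(covered)} covered of {len(all_cells)} cells; first targets: {targets[:5]}")
```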
We introduce SA-DQAS in this paper, a novel framework that enhances gradient-based Differentiable Quantum Architecture Search (DQAS) with a self-attention mechanism, aimed at optimizing circuit design for Quantum Machine Learning (QML) challenges. Analogous to a sequence of words in a sentence, a quantum circuit can be viewed as a sequence of placeholders containing quantum gates. In DQAS, each placeholder is treated independently, whereas the self-attention mechanism in SA-DQAS captures relations and dependencies among the operation candidates placed on the placeholders of a circuit. To evaluate and verify our approach, we conduct experiments on job-shop scheduling problems (JSSP), Max-cut problems, and quantum fidelity. Incorporating self-attention improves the stability and performance of the resulting quantum circuits and refines their structural design with higher noise resilience and fidelity. Our research demonstrates the first successful integration of self-attention with DQAS.
In settings where only a budgeted amount of labeled data can be afforded, active learning seeks to devise query strategies for selecting the most informative data points to be labeled, aiming to enhance learning algorithms’ efficiency and performance. Numerous such query strategies have been proposed and compared in the active learning literature. However, the community still lacks standardized benchmarks for comparing the performance of different query strategies. This particularly holds for the combination of query strategies with different learning algorithms into active learning pipelines and examining the impact of the learning algorithm choice. To close this gap, we propose ALPBench, which facilitates the specification, execution, and performance monitoring of active learning pipelines. It has built-in measures to ensure evaluations are done reproducibly, saving exact dataset splits and hyperparameter settings of used algorithms. In total, ALPBench consists of 86 real-world tabular classification datasets and 5 active learning settings, yielding 430 active learning problems. To demonstrate its usefulness and broad compatibility with various learning algorithms and query strategies, we conduct an exemplary study evaluating 9 query strategies paired with 8 learning algorithms in 2 different settings.
Ordered data arises in many areas, e.g., in molecular dynamics and other spatial-temporal trajectories. While data points that are close in this order are related, common dimensionality reduction techniques cannot capture this relation or order. Thus, the information is lost in the low-dimensional representations. We introduce DROPP, which incorporates order into dimensionality reduction by adapting a Gaussian kernel function across the ordered covariances between data points. We find underlying principal components that are characteristic of the process that generated the data. In extensive experiments, we show DROPP’s advantages over other dimensionality reduction techniques on synthetic as well as real-world data sets from molecular dynamics and climate research: The principal components of different data sets that were generated by the same underlying mechanism are very similar to each other. They can, thus, be used for dimensionality reduction with low reconstruction errors along a set of data sets, allowing an explainable visual comparison of different data sets as well as good compression even for unseen data.
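One plausible way to make principal components order-aware, in the spirit of the description above, is to weight pairwise contributions to the covariance by a Gaussian kernel over the index distance; the sketch below does exactly that and is not the exact DROPP formulation.

```python
import numpy as np

def order_weighted_components(X, sigma=5.0, n_components=2):
    """Principal components of an order-weighted covariance matrix.

    X is assumed to be ordered (row i precedes row i+1, e.g. a trajectory).
    Pairs of points close in this order contribute more to the covariance
    via a Gaussian kernel over their index distance.
    """
    T = len(X)
    idx = np.arange(T)
    K = np.exp(-((idx[:, None] - idx[None, :]) ** 2) / (2 * sigma ** 2))
    Xc = X - X.mean(axis=0)
    C = Xc.T @ K @ Xc / K.sum()
    C = (C + C.T) / 2                       # guard against numerical asymmetry
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order[:n_components]], eigvals[order[:n_components]]

rng = np.random.default_rng(0)
X = np.cumsum(rng.normal(size=(300, 6)), axis=0)   # toy ordered trajectory
components, variances = order_weighted_components(X)
print(components.shape, variances)                  # (6, 2) and the leading weighted variances
```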
Recommender systems are a popular and common means to extract relevant information for users. Small and medium-sized enterprises make up a large share of the overall amount of business, yet they are rarely considered when it comes to the demand for recommender systems. Different conditions, such as the small amount of data, lower computational capabilities, and users frequently not possessing an account, require a different and potentially more small-scale recommender system. The requirements regarding quality are similar: high accuracy and high diversity are certainly an advantage. We provide multiple solutions with different variants based solely on information contained in event-based sequences and temporal information. Our code is available on GitHub. We conduct experiments on four different datasets with an increasing number of items to show a possible range for scalability. The promising results show the applicability of these grammar-based recommender system variants and leave the final decision on which recommender to choose to the users and their ultimate goals.
The growing availability of data demands clustering methods that can extract valuable information without requiring costly annotations, especially for large, high-dimensional datasets. This dissertation develops subspace and deep clustering approaches, leveraging methods like the Dip-test of unimodality and Minimum Description Length principle to identify and encode relevant features and clusters automatically, even in complex datasets. By incorporating these techniques into neural networks and refining them through a novel parameter-free approach, the research offers robust clustering tools that perform well without prior knowledge of the number of clusters, all implemented in the open-source package ClustPy. (Shortened).
Deep clustering algorithms have gained popularity as they are able to cluster complex large-scale data, like images. Yet these powerful algorithms require many decisions w.r.t. architecture, learning rate, and other hyperparameters, making it difficult to compare different methods. A comprehensive empirical evaluation of novel clustering methods, however, plays an important role in both scientific and practical applications, as it reveals their individual strengths and weaknesses. Therefore, we introduce ClustPy, a unified framework for benchmarking deep clustering algorithms, and perform a comparison of several fundamental deep clustering methods and some recently introduced ones. We compare these methods on multiple well-known image data sets using different evaluation metrics, perform a sensitivity analysis w.r.t. important hyperparameters, and perform ablation studies, e.g., for different autoencoder architectures and image augmentation. To our knowledge, this is the first in-depth benchmarking of deep clustering algorithms in a unified setting.
Honesty and fairness are essential. Like many skills, practicing these values starts in the classroom. Whether students are examined online or on-site, only by testing their knowledge fairly can educators assess their skills and room for improvement. As online exams increase, we are provided with more data suitable for analysis. Process mining methods such as anomaly detection and trace clustering techniques have been used to identify dishonest behavior in other fields, e.g., fraud detection. In this paper, we investigate collusion detection in online exams as a process mining task. We explore trace ordering for anomaly detection (TOAD) as well as hierarchical agglomerative trace clustering (HATC). Promising preliminary results exemplify how process mining techniques empower teachers in their decision making while, via flexible configuration of parameters, leaving the last word to them.
The analysis of event data is largely influenced by the effective characterization of descriptors. These descriptors serve as the building blocks of our understanding, encapsulating the behavior described within the event data. In light of these considerations, we introduce FEEED (Feature Extraction from Event Data), an extendable tool for event data feature extraction. FEEED represents a significant advancement in event data behavior analysis, offering a range of features to empower analysts and data scientists in their pursuit of insightful, actionable, and understandable event data analysis. What sets FEEED apart is its unique capacity to act as a bridge between the worlds of data mining and process mining. In doing so, it promises to enhance the accuracy, comprehensiveness, and utility of characterizing event data for a diverse range of applications.
Deep clustering algorithms have gained popularity for clustering complex, large-scale data sets, but getting started is difficult because of numerous decisions regarding architecture, optimizer, and other hyperparameters. Theoretical foundations must be known to obtain meaningful results. At the same time, ease of use is necessary for adoption by a broader audience. Therefore, we require a unified framework that allows for easy execution in diverse settings. While this exists for established clustering methods like k-Means and DBSCAN, deep clustering algorithms lack a standard structure, resulting in significant programming overhead. This complicates empirical evaluations, which are essential in both scientific and practical applications. We present a solution to this problem by providing a theoretical background on deep clustering as well as practical implementation techniques and a unified structure with predefined neural networks. For the latter, we use the Python package ClustPy. The aim is to share best practices and facilitate community participation in deep clustering research.
Glass beads were among the most common grave goods in the Early Middle Ages, with an estimated number in the millions. The color, size, shape, and decoration of the beads are diverse, leading to many different archaeological classification systems that depend on the subjective decisions of individual experts. The lack of an agreed-upon expert categorization leads to a pressing problem in archaeology, as the categorization of archaeological artifacts, like glass beads, is important to learn about cultural trends, manufacturing processes, or economic relationships (e.g., trade routes) of historical times. An automated, objective, and reproducible classification system is therefore highly desirable. We present a high-quality data set of images of Early Medieval beads and propose a clustering pipeline to learn a classification system in a data-driven way. The pipeline consists of a novel extension of deep embedded non-redundant clustering to identify multiple, meaningful clusterings of glass bead images. During the cluster analysis, we address several challenges associated with the data and as a result identify high-quality clusterings that overlap with archaeological domain expertise. To the best of our knowledge, this is the first application of non-redundant image clustering to archaeological data.
Portfolio optimization tasks describe sequential decision problems in which the investor’s wealth is distributed across a set of assets. Allocation constraints are used to enforce minimal or maximal investments into particular subsets of assets to control for objectives such as limiting the portfolio’s exposure to a certain sector due to environmental concerns. Although methods for Constrained Reinforcement Learning (CRL) can optimize policies while considering allocation constraints, it can be observed that these general methods yield suboptimal results. In this paper, we propose a novel approach to handle allocation constraints based on a decomposition of the constrained action space into a set of unconstrained allocation problems. In particular, we examine this approach for the case of two constraints. For example, an investor may wish to invest at least a certain percentage of the portfolio into green technologies while limiting the investment in the fossil energy sector. We show that the action space of the task is equivalent to the decomposed action space, and introduce a new Reinforcement Learning (RL) approach, CAOSD, which is built on top of the decomposition. The experimental evaluation on real-world Nasdaq data demonstrates that our approach consistently outperforms state-of-the-art CRL benchmarks for portfolio optimization.
Node classification is one of the core tasks on attributed graphs, but successful graph learning solutions require sufficiently labeled data. To keep annotation costs low, active graph learning focuses on selecting the subset of nodes that maximizes label efficiency. However, deciding which heuristic is best suited for an unlabeled graph to increase label efficiency is a persistent challenge. Existing solutions either neglect aligning the learned model and the sampling method or focus only on limited selection aspects. They are thus sometimes worse than or only on par with random sampling. In this work, we introduce a novel active graph learning approach called DiffusAL, showing significant robustness in diverse settings. Toward better transferability between different graph structures, we combine three independent scoring functions to identify the most informative node samples for labeling in a parameter-free way: i) Model Uncertainty, ii) Diversity Component, and iii) Node Importance computed via graph diffusion heuristics. Most of our calculations for acquisition and training can be pre-processed, making DiffusAL more efficient than approaches combining diverse selection criteria, and similarly fast as simpler heuristics. Our experiments on various benchmark datasets show that, unlike previous methods, our approach significantly outperforms random selection in 100% of all datasets and labeling budgets tested.
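A toy version of combining three parameter-free scores could look as follows; a PageRank power iteration stands in for the graph-diffusion importance, predictive entropy for model uncertainty, and distance to labeled nodes for diversity. DiffusAL's actual scores and their combination differ in detail; all names and parameters below are illustrative.

```python
import numpy as np

def pagerank_importance(A, alpha=0.85, iters=50):
    """Node importance via a simple PageRank power iteration (a stand-in for
    the graph-diffusion heuristic; the exact diffusion used by DiffusAL differs)."""
    n = A.shape[0]
    P = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-12)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - alpha) / n + alpha * P.T @ r
    return r

def combined_scores(A, probs, features, labeled):
    eps = 1e-12
    uncertainty = -(probs * np.log(probs + eps)).sum(axis=1)       # predictive entropy
    importance = pagerank_importance(A)
    d = np.linalg.norm(features[:, None, :] - features[None, labeled, :], axis=2)
    diversity = d.min(axis=1)                                      # far from labeled nodes
    norm = lambda x: (x - x.min()) / (x.max() - x.min() + eps)
    score = norm(uncertainty) * norm(importance) * norm(diversity)  # parameter-free product
    score[labeled] = -np.inf
    return score

rng = np.random.default_rng(0)
n = 50
A = (rng.random((n, n)) < 0.1).astype(float)
A = np.maximum(A, A.T)
np.fill_diagonal(A, 0)
probs = rng.dirichlet(np.ones(4), size=n)
features = rng.normal(size=(n, 8))
print(int(np.argmax(combined_scores(A, probs, features, labeled=[0, 1, 2]))))  # next node to label
```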
Do we need active learning? The rise of strong deep semi-supervised methods raises doubts about the usefulness of active learning in limited labeled data settings. This is caused by results showing that combining semi-supervised learning (SSL) methods with a random selection for labeling can outperform existing active learning (AL) techniques. However, these results are obtained from experiments on well-established benchmark datasets that can overestimate the external validity. Moreover, the literature lacks sufficient research on the performance of active semi-supervised learning methods in realistic data scenarios, leaving a notable gap in our understanding. Therefore, we present three data challenges common in real-world applications: between-class imbalance, within-class imbalance, and between-class similarity. These challenges can hurt SSL performance due to confirmation bias. We conduct experiments with SSL and AL on simulated data challenges and find that random sampling does not mitigate confirmation bias and, in some cases, leads to worse performance than supervised learning. In contrast, we demonstrate that AL can overcome confirmation bias in SSL in these realistic settings. Our results provide insights into the potential of combining active and semi-supervised learning in the presence of common real-world challenges, which is a promising direction for robust methods when learning with limited labeled data in real-world applications.
Clustering heterogeneous data is an ongoing challenge in the data mining community. The most prevalent clustering methods are designed to process datasets with numerical features only, but datasets often consist of mixed numerical and categorical features. This requires new approaches capable of handling both kinds of data types. Further, the most relevant cluster structures are often hidden in only a few features. Thus, another key challenge is to detect those specific features automatically and abandon features not relevant for clustering. This paper proposes the subspace mixed-type clustering algorithm k-SubMix, which tackles both challenges. Its cost function can handle both numerical and categorical features while simultaneously identifying those with the biggest impact on a high-quality clustering result. Unlike other subspace mixed-type clustering methods, k-SubMix preserves inter-cluster comparability, as it is the first mixed-type approach that defines a common subspace for all clusters. Extensive experiments show that k-SubMix outperforms competitive methods and reduces the data’s complexity by a simultaneous dimensionality reduction.
Despite the popularity of density-based clustering, its procedural definition makes it difficult to analyze compared to clustering methods that minimize a loss function. In this paper, we reformulate DBSCAN through a clean objective function by introducing the density-connectivity distance (dc-dist), which captures the essence of density-based clusters by endowing the minimax distance with the concept of density. This novel ultrametric allows us to show that DBSCAN, k-center, and spectral clustering are equivalent in the space given by the dc-dist, despite these algorithms being perceived as fundamentally different in their respective literatures. We also verify that finding the pairwise dc-dists gives DBSCAN clusterings across all epsilon-values, simplifying the problem of parameterizing density-based clustering. We conclude by thoroughly analyzing density-connectivity and its properties – a task that has been elusive thus far in the literature due to the lack of formal tools.
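For small data sets, the dc-dist can be computed directly from mutual reachability distances with a minimax-path relaxation, as in the following numpy sketch (an O(n^3) illustration; practical implementations work on a minimum spanning tree instead):

```python
import numpy as np
from scipy.spatial.distance import cdist

def dc_dist(X, k=5):
    """Pairwise density-connectivity distances (direct toy version).

    mreach(i, j) = max(core_k(i), core_k(j), d(i, j)) endows the metric with
    density; the dc-dist is the minimax path distance over these edges,
    computed here with a Floyd-Warshall-style update.
    """
    D = cdist(X, X)
    core = np.sort(D, axis=1)[:, k]            # distance to the k-th neighbor (column 0 is the point itself)
    dc = np.maximum(D, np.maximum(core[:, None], core[None, :]))
    np.fill_diagonal(dc, 0.0)
    for m in range(len(X)):                    # relax: smallest "largest edge" over paths through m
        dc = np.minimum(dc, np.maximum(dc[:, m:m + 1], dc[m:m + 1, :]))
    return dc

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(4, 0.3, (30, 2))])
dc = dc_dist(X)
# Thresholding dc at an epsilon yields density-connected components as in DBSCAN:
same_as_first = (dc[0] <= 1.0).astype(int)     # 1 = same component as point 0
print(np.bincount(same_as_first))              # roughly [30, 30]
```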
Argumentation is one of society’s foundational pillars, and, sparked by advances in NLP and the vast availability of text data, automated mining of arguments receives increasing attention. A decisive property of arguments is their strength or quality. While there are works on the automated estimation of argument strength, their scope is narrow: they focus on isolated datasets and neglect the interactions with related argument mining tasks, such as argument identification and evidence detection. In this work, we close this gap by approaching argument quality estimation from multiple different angles: grounded on rich results from thorough empirical evaluations, we assess the generalization capabilities of argument quality estimation across diverse domains and the interplay with related argument mining tasks. We find that generalization depends on a sufficient representation of different domains in the training data. In zero-shot transfer and multi-task experiments, we reveal that argument quality is among the more challenging tasks but can improve others.
Financial portfolio managers typically face multi-period optimization tasks such as short-selling or investing at least a particular portion of the portfolio in a specific industry sector. A common approach to tackle these problems is to use constrained Markov decision process (CMDP) methods, which may suffer from sample inefficiency, hyperparameter tuning, and lack of guarantees for constraint violations. In this paper, we propose Action Space Decomposition Based Optimization (ADBO) for optimizing a more straightforward surrogate task that allows actions to be mapped back to the original task. We examine our method on two real-world data portfolio construction tasks. The results show that our new approach consistently outperforms state-of-the-art benchmark approaches for general CMDPs.
Over the last decade, the Dip-test of unimodality has gained increasing interest in the data mining community, as it is a parameter-free statistical test that reliably rates the modality in one-dimensional samples. It returns a so-called Dip-value and a corresponding probability for the sample’s unimodality (Dip-p-value). These two values share a sigmoidal relationship. However, the specific transformation depends on the sample size. Many Dip-based clustering algorithms use bootstrapped look-up tables translating Dip- to Dip-p-values for a limited number of sample sizes. We propose a specifically designed sigmoid function as a substitute for these state-of-the-art look-up tables. This accelerates computation and provides an approximation of the Dip- to Dip-p-value transformation for every single sample size. Further, it is differentiable and can therefore easily be integrated into learning schemes using gradient descent. We showcase this by exploiting our function in a novel subspace clustering algorithm called Dip’n’Sub. We highlight in extensive experiments the various benefits of our proposal.
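The shape of such a transformation can be sketched in a few lines; note that the coefficients a and b below are arbitrary placeholders and not the fitted values from the paper, so the printed numbers are purely illustrative.

```python
import numpy as np

def dip_to_pvalue(dip, n, a=20.0, b=1.0):
    """Map a Dip statistic and sample size to an approximate unimodality p-value
    with a single sigmoid. The scale depends on the sample size via sqrt(n);
    a and b are placeholder coefficients, NOT the fitted values from the paper.
    """
    z = np.sqrt(n) * dip
    return 1.0 / (1.0 + np.exp(a * (z - b)))   # larger Dip -> smaller p -> multimodal

for dip in (0.02, 0.05, 0.10):
    print(dip, round(dip_to_pvalue(dip, n=1000), 3))
```

Because this mapping is a smooth function of the Dip value, it can sit inside gradient-based training, which is exactly what fixed look-up tables prevent.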
When researchers publish new cluster algorithms, they usually demonstrate the strengths of their novel approaches by comparing the algorithms’ performance with existing competitors. However, such studies are likely to be optimistically biased towards the new algorithms, as the authors have a vested interest in presenting their method as favorably as possible in order to increase their chances of getting published. Therefore, the superior performance of newly introduced cluster algorithms is over-optimistic and might not be confirmed in independent benchmark studies performed by neutral and unbiased authors. This problem is known among many researchers, but so far, the different mechanisms leading to over-optimism in cluster algorithm evaluation have never been systematically studied and discussed. Researchers are thus often not aware of the full extent of the problem. We present an illustrative study to illuminate the mechanisms by which authors—consciously or unconsciously—paint their cluster algorithm’s performance in an over-optimistic light. Using the recently published cluster algorithm Rock as an example, we demonstrate how optimization of the used datasets or data characteristics, of the algorithm’s parameters and of the choice of the competing cluster algorithms leads to Rock’s performance appearing better than it actually is. Our study is thus a cautionary tale that illustrates how easy it can be for researchers to claim apparent ‘superiority’ of a new cluster algorithm. This illuminates the vital importance of strategies for avoiding the problems of over-optimism (such as, e.g., neutral benchmark studies), which we also discuss in the article.
Recent transformer-based offline video instance segmentation (VIS) approaches achieve encouraging results and significantly outperform online approaches. However, their reliance on the whole video and the immense computational complexity caused by full spatio-temporal attention limit them in real-life applications such as processing lengthy videos. In this paper, we propose a single-stage transformer-based efficient online VIS framework named InstanceFormer, which is especially suitable for long and challenging videos. We propose three novel components to model short-term and long-term dependency and temporal coherence. First, we propagate the representation, location, and semantic information of prior instances to model short-term changes. Second, we propose a novel memory cross-attention in the decoder, which allows the network to look into earlier instances within a certain temporal window. Finally, we employ a temporal contrastive loss to impose coherence in the representation of an instance across all frames. Memory attention and temporal coherence are particularly beneficial to long-range dependency modeling, including challenging scenarios like occlusion. The proposed InstanceFormer outperforms previous online benchmark methods by a large margin across multiple datasets. Most importantly, InstanceFormer surpasses offline approaches for challenging and long datasets such as YouTube-VIS-2021 and OVIS.
Active learning has the power to significantly reduce the amount of labeled data needed to build strong classifiers. Existing active pseudo-labeling methods show high potential in integrating pseudo-labels within the active learning loop but heavily depend on the prediction accuracy of the model. In this work, we propose VERIPS, an algorithm that significantly outperforms existing pseudo-labeling techniques for active learning. At its core, VERIPS uses a pseudo-label verification mechanism that consists of a second network trained only on data approved by the oracle and helps to discard questionable pseudo-labels. In particular, the verifier model eliminates all pseudo-labels for which it disagrees with the actual task model. VERIPS overcomes the problem of poorly performing initial models, e.g., due to imbalanced or too small initial pools, where previous methods select too many incorrect pseudo-labels and recovery takes a long time or is not possible at all. Moreover, VERIPS is particularly insensitive to parameter choices that existing approaches suffer from.
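The verification step itself is simple to illustrate: train a verifier only on oracle-labeled data and keep a confident pseudo-label only if the verifier agrees with the task model. The sketch below uses plain sklearn classifiers and a single round; VERIPS embeds this mechanism in a full deep active learning loop, and using two different model families here is an assumption to keep the one-round toy meaningful.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Toy pool-based setup: a small labeled pool and a large unlabeled pool.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)
rng = np.random.default_rng(0)
labeled = rng.choice(len(X), size=40, replace=False)
unlabeled = np.setdiff1d(np.arange(len(X)), labeled)

# Task model proposes pseudo-labels; the verifier is trained ONLY on oracle-approved data.
task_model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
verifier = RandomForestClassifier(n_estimators=200, random_state=0).fit(X[labeled], y[labeled])

# Candidate pseudo-labels: confident predictions of the task model.
probs = task_model.predict_proba(X[unlabeled])
confident = unlabeled[probs.max(axis=1) > 0.9]
pseudo = task_model.predict(X[confident])

# Verification: discard every pseudo-label the verifier disagrees with.
agree = verifier.predict(X[confident]) == pseudo
print(f"accepted {agree.sum()} of {len(confident)} pseudo-labels")

# The task model is then retrained on oracle labels plus the accepted pseudo-labels.
task_model.fit(np.vstack([X[labeled], X[confident][agree]]),
               np.concatenate([y[labeled], pseudo[agree]]))
```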
Group anomaly detection, i.e., detecting and predicting abnormal behaviour of entities as a group rather than as individuals, addresses a variety of challenges in spatio-temporal environments such as traffic and transportation systems, smart cities, and geoinformation systems. These environments provide information about a commonly large number of individual entities. Examples of such entities are airplanes and drones, vehicles, and ships, but also people, remote sensors, and any other information source interacting with the environment. However, while point anomaly detection is quite common for revealing the abnormal behaviour of individual entities, the collective behaviour of the individuals as a group remains completely unaddressed. For example, potential for traffic flow optimizations or increased local traffic guideline violations cannot be detected from one single drive, but only by considering the behavior of a group of vehicle drives in an area. With this work-in-progress we elaborate on the potential of group anomaly detection algorithms for spatio-temporal collective behaviour scenarios in smart cities. We describe the group anomaly detection problem in the context of urban planning and demonstrate its effectiveness on a public real-world data set of urban rental bike rides and stations in and around Munich, revealing abnormal groups of rides. This allows optimizing the accessibility of rental bikes for the population and thereby contributes to a sustainable environment.
The goal of the Hyperview challenge is to use Hyperspectral Imaging (HSI) to predict the soil parameters potassium (K), phosphorus pentoxide (P2O5), magnesium (Mg), and the pH value. These are relevant parameters for determining the need for fertilization in agriculture. With this knowledge, fertilizers can be applied in a targeted way rather than prophylactically, which is the current procedure of choice. In this context, we introduce two different approaches to solve this regression task, based on 3D CNNs with Huber loss regression (SpectralNet3D) and on 1D RNNs. Both methods show distinct advantages, with a peak challenge metric score of 0.808 on the provided validation data.
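A minimal PyTorch sketch of the 3D-CNN-with-Huber-loss idea is shown below; layer sizes, patch shapes, and the single regression head for the four targets are assumptions for illustration, not the SpectralNet3D architecture.

```python
import torch
import torch.nn as nn

class TinySpectralNet3D(nn.Module):
    """Toy 3D CNN regressor for hyperspectral patches (bands treated as depth)."""
    def __init__(self, n_targets=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.head = nn.Linear(16, n_targets)   # K, P2O5, Mg, pH

    def forward(self, x):                      # x: (batch, 1, bands, height, width)
        return self.head(self.features(x).flatten(1))

model = TinySpectralNet3D()
x = torch.randn(4, 1, 32, 16, 16)              # toy batch of hyperspectral patches
y = torch.randn(4, 4)                          # toy regression targets
loss = nn.HuberLoss()(model(x), y)             # robust regression loss as named in the abstract
loss.backward()
print(loss.item())
```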
The task of portfolio management is the selection of portfolio allocations for every single time step during an investment period while adjusting the risk-return profile of the portfolio to the investor’s individual level of risk preference. In practice, it can be hard for an investor to quantify their individual risk preference. As an alternative, approximating the risk-return Pareto front allows for the comparison of different optimized portfolio allocations and hence for the selection of the most suitable risk level. Furthermore, an approximation of the Pareto front allows the analysis of the overall risk sensitivity of various investment policies. In this paper, we propose a deep reinforcement learning (RL) based approach in which a single meta agent generates optimized portfolio allocation policies for any level of risk preference in a given interval. Our method is more efficient than previous approaches, as it only requires training a single agent for the full approximate risk-return Pareto front. Additionally, it is more stable in training and only requires per-time-step market risk estimations independent of the policy. Such risk control per time step is a common regulatory requirement, e.g., for insurance companies. We benchmark our meta agent against other state-of-the-art risk-aware RL methods using a realistic environment based on real-world Nasdaq-100 data. Our evaluation shows that the proposed meta agent outperforms various benchmark approaches by generating strategies with better risk-return profiles.
Selecting diverse instances for annotation is one of the key factors of successful active learning strategies. To this end, existing methods often operate on high-dimensional latent representations. In this work, we propose to use the low-dimensional vector of predicted probabilities instead, which can be seamlessly integrated into existing methods. We empirically demonstrate that this considerably decreases the query time, i.e., time to select an instance for annotation, while at the same time improving results. Low query times are relevant for active learning researchers, who use a (fast) oracle for simulated annotation and thus are often constrained by query time. It is also practically relevant when dealing with complex annotation tasks for which only a small pool of skilled domain experts is available for annotation with a limited time budget.
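One simple way to realize this idea is to cluster the predicted probability vectors and query the instance closest to each cluster center, as sketched below; this is a generic illustration of selecting in probability space, not the specific acquisition proposed in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def diverse_queries_from_probs(probs, batch_size, random_state=0):
    """Pick a diverse batch using only the model's predicted probabilities.

    Cluster the low-dimensional probability vectors and query the instance
    closest to each cluster center.
    """
    km = KMeans(n_clusters=batch_size, n_init=10, random_state=random_state).fit(probs)
    picks = []
    for c in range(batch_size):
        members = np.flatnonzero(km.labels_ == c)
        dists = np.linalg.norm(probs[members] - km.cluster_centers_[c], axis=1)
        picks.append(int(members[np.argmin(dists)]))
    return picks

probs = np.random.default_rng(0).dirichlet(np.ones(10), size=5000)   # fake softmax outputs
print(diverse_queries_from_probs(probs, batch_size=8))
```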
Spectral clustering is one of the most advantageous clustering approaches. However, standard Spectral Clustering is sensitive to noisy input data and has a high runtime complexity. Tackling one of these problems often exacerbates the other. As real-world datasets are often large and compromised by noise, we need to improve both robustness and runtime at once. Thus, we propose Spectral Clustering - Accelerated and Robust (SCAR), an accelerated, robustified spectral clustering method. In an iterative approach, we achieve robustness by separating the data into two latent components: cleansed and noisy data. We accelerate the eigendecomposition - the most time-consuming step - based on the Nyström method. We compare SCAR to related recent state-of-the-art algorithms in extensive experiments. SCAR surpasses its competitors in terms of speed and clustering quality on highly noisy data.
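The acceleration part can be sketched as follows: approximate the affinity matrix with Nystroem landmarks, obtain an approximate spectral embedding from an SVD of the landmark features, and cluster that embedding. SCAR's robustness mechanism (iteratively separating cleansed and noisy data) is not reproduced here, and the kernel parameters are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.kernel_approximation import Nystroem
from sklearn.cluster import KMeans

# Approximate the RBF affinity K with landmarks (K ~= Z Z^T) and read an
# approximate spectral embedding off the SVD of Z instead of eigendecomposing
# the full n x n matrix. This is only the acceleration idea, not SCAR itself.
X, y = make_moons(n_samples=3000, noise=0.05, random_state=0)
Z = Nystroem(kernel="rbf", gamma=30.0, n_components=200, random_state=0).fit_transform(X)
U, S, _ = np.linalg.svd(Z, full_matrices=False)     # left singular vectors ~ eigenvectors of K
embedding = U[:, :2]                                 # top-2 approximate eigenvectors
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embedding)
print(labels[:10])
```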
Hartigan’s Dip-test of unimodality has gained increasing interest in unsupervised learning over the past few years. It is free from complex parameterization and does not require an a priori assumed distribution. A useful property is that the resulting Dip-values can be differentiated with respect to a projection axis, which allows finding axes that reveal multimodal structures in the data set. In this paper, we show how to apply the gradient not only with respect to the projection axis but also with respect to the data to improve the cluster structure. By tightly coupling the Dip-test with an autoencoder, we obtain an embedding that clearly separates all clusters in the data set. This method, called DipEncoder, is the basis of a novel deep clustering algorithm. Extensive experiments show that the DipEncoder is highly competitive with state-of-the-art methods.
This thesis addresses the challenges of argumentation in the digital age by applying machine learning methods to automatically identify, retrieve, and evaluate arguments from diverse and often contradictory online sources. The first focus is on argument identification, specifically in heterogeneous text sources and peer reviews, where the relationship between the topic and arguments is crucial, and knowledge transfer across domains is limited. The second focus is on argument retrieval, where machine learning is used to select relevant documents, ensuring comprehensive and non-redundant argument coverage. Finally, the thesis explores the strength or quality of arguments, integrating this concept with other argument mining tasks and evaluating its impact across different text domains and contexts. (Shortened.)
This thesis addresses key challenges in correlation clustering, particularly in high-dimensional datasets, by developing novel methods to evaluate and improve clustering algorithms. The first contribution focuses on defining and deriving internal evaluation criteria for correlation clustering, proposing a new cost function to assess cluster quality based on commonalities among existing algorithms. The second part introduces two innovative strategies for detecting regions of interest (ROIs) in Hough space, improving the robustness of the Hough transform algorithm, and extending it to handle quadratic and periodic correlated clusters. Finally, the thesis explores unifying local and global correlation clustering views and enhancing the resilience of these methods to outliers. (Shortened.)
LUCKe allows any purely distance-based ‘classic’ clustering algorithm to reliably find linear correlation clusters. An elaborate distance matrix based on each point’s local PCA extracts all necessary information from high-dimensional data to declare points of the same, arbitrary-dimensional linear correlation cluster as ‘similar’. For this, the points’ eigensystems are combined with only the relevant information about their position in space. LUCKe allows transferring known benefits from the large field of basic clustering to correlation clustering. Its applicability is shown in extensive experiments with simple representatives of diverse basic clustering approaches.
We introduce OAB, an Open Anomaly Benchmark framework for unsupervised and semi-supervised anomaly detection on image and tabular data sets. It ensures simple reproducibility of existing benchmark results as well as reliable comparability and low-effort extensibility when new anomaly detection algorithms or new data sets are added. While making established methods of the most popular benchmarks easily accessible, OAB generalizes the task of un- and semi-supervised anomaly benchmarking and offers, besides commonly used benchmark data sets, also semantically meaningful real-world anomaly data sets as well as a broad range of traditional and state-of-the-art anomaly detection algorithms. The benefit of OAB for the research community has been demonstrated by reproducing and extending existing benchmarks to new algorithms with very low effort, allowing researchers to focus on the actual algorithm research.
High-quality arguments are an essential part of decision-making. Automatically predicting the quality of an argument is a complex task that recently got much attention in argument mining. However, the annotation effort for this task is exceptionally high. Therefore, we test uncertainty-based active learning (AL) methods on two popular argument-strength data sets to estimate whether sample-efficient learning can be enabled. Our extensive empirical evaluation shows that uncertainty-based acquisition functions cannot surpass the accuracy reached with random acquisition on these data sets.
During different steps in the process of discovering drug candidates for diseases, it can be helpful to identify groups of molecules that share similar properties, i.e., a common overall structural similarity. Existing methods for computing (dis)similarities between chemical structures rely on a priori domain knowledge. Here, we investigate clustering of compounds based on embeddings generated with the recently published Mol2Vec technique, which enables an entirely unsupervised vector representation of compounds. A research question we address in this work is: do existing well-known clustering algorithms, such as k-means or hierarchical clustering methods, yield meaningful clusters on Mol2Vec embeddings? Further, we investigate how far subspace clustering can be utilized to compress the data by reducing the dimensionality of the compounds’ vector representations. Our first experiments on a set of COVID-19 drug candidates reveal that well-established methods yield meaningful clusters. Preliminary results from subspace clusterings indicate that a compression of the vector representations seems viable.
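The kind of analysis described above can be reproduced in a few lines once per-compound embeddings are available; the snippet below uses random vectors as a stand-in for Mol2Vec embeddings and PCA only as a crude proxy for the subspace-based compression question.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Stand-in for Mol2Vec output: one dense vector per compound (random data here;
# in the study these would be Mol2Vec embeddings of actual drug candidates).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(150, 300))

for name, algo in [("k-means", KMeans(n_clusters=5, n_init=10, random_state=0)),
                   ("agglomerative", AgglomerativeClustering(n_clusters=5))]:
    labels = algo.fit_predict(embeddings)
    print(name, round(silhouette_score(embeddings, labels), 3))

# Crude proxy for the compression question: cluster in a lower-dimensional
# projection and compare cluster quality (the paper uses subspace clustering).
reduced = PCA(n_components=20, random_state=0).fit_transform(embeddings)
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(reduced)
print("k-means on 20 PCs", round(silhouette_score(reduced, labels), 3))
```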
Even though most clustering algorithms serve knowledge discovery in fields other than computer science, most of them still require users to be familiar with programming or data mining to some extent. As that often prevents efficient research, we developed an easy-to-use, highly explainable clustering method accompanied by an interactive tool for clustering. It is based on intuitively understandable kNN graphs and the subsequent application of adaptable filters, which can be combined in an ensemble-like and iterative fashion to prune unnecessary or misleading edges. For a first overview of the data, fully automatic predefined filter cascades deliver robust results. A selection of simple filters and combination methods that can be chosen interactively yields very good results on benchmark datasets compared to various algorithms.
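A stripped-down version of this pipeline (kNN graph, one simple edge filter, connected components as clusters) can be written as follows; the percentile-based filter is just one example of the adaptable filters mentioned above and not the tool's actual filter set.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import connected_components

# Build a kNN graph, prune unusually long edges, and read clusters off the
# remaining connected components.
X, _ = make_blobs(n_samples=600, centers=3, cluster_std=0.8, random_state=0)
G = kneighbors_graph(X, n_neighbors=8, mode="distance")       # sparse kNN graph

# Filter: drop edges that are much longer than the typical kNN edge.
threshold = np.percentile(G.data, 90)
G.data[G.data > threshold] = 0
G.eliminate_zeros()

n_clusters, labels = connected_components(G, directed=False)
print(n_clusters, np.bincount(labels)[:10])
```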
When using arbitrarily oriented subspace clustering algorithms, one obtains a partitioning of a given data set and, for each partition, its individual subspace. Since clustering is an unsupervised machine learning task, we may not have “ground truth” labels at our disposal or do not wish to rely on them. What is needed in such cases are internal measures which permit a label-less analysis of the obtained subspace clustering. In this work, we propose methods for revising clusters obtained from arbitrarily oriented correlation clustering algorithms. Initial experiments reveal improvements in the clustering results compared to the original clustering outcome. Our proposed approach is simple and can be applied as a post-processing step on arbitrarily oriented correlation clusterings.
LWDA 2021 is a joint conference of six special interest groups of the German Computer Science Society (GI), addressing research in the areas of knowledge discovery and machine learning, information retrieval, database systems, and knowledge management. The German acronym LWDA stands for ‘Lernen, Wissen, Daten, Analysen’ (Learning, Knowledge, Data, Analytics). Following the tradition of the last years, LWDA 2021 provides a joint forum for experienced and young researchers, to bring insights into recent trends, technologies, and applications and to promote interaction among the special interest groups.
We introduce AnyCORE (Anytime Cluster Outlier REmoval), an algorithm that enables users to detect and remove outliers at any time. The algorithm is based on the idea of MORe++, an approach for outlier detection and removal that iteratively scores and removes 1d-cluster outliers in n-dimensional data sets. In contrast to MORe++, AnyCORE provides continuous responses to its users and converges independently of cluster centers. This allows AnyCORE to perform outlier detection in combination with an arbitrary clustering method that is most suitable for a given data set. We conducted our AnyCORE experiments on synthetic and real-world data sets by benchmarking its variant with k-Means as the underlying clustering method against the traditional batch algorithm version of MORe++. In extensive experiments, we show that AnyCORE is able to compete with the related batch algorithm version.
The combination of clustering with Deep Learning has gained much attention in recent years. Unsupervised neural networks like autoencoders can autonomously learn the essential structures in a data set. This idea can be combined with clustering objectives to learn relevant features automatically. Unfortunately, such approaches are often based on a k-means framework, from which they inherit various assumptions, like spherical-shaped clusters. Another assumption, also found in approaches outside the k-means family, is knowing the number of clusters a priori. In this paper, we present the novel clustering algorithm DipDECK, which can estimate the number of clusters while simultaneously improving a Deep Learning-based clustering objective. Additionally, we can cluster complex data sets without assuming only spherically shaped clusters. Our algorithm works by heavily overestimating the number of clusters in the embedded space of an autoencoder and, based on Hartigan’s Dip-test - a statistical test for unimodality - analyses the resulting micro-clusters to determine which to merge. We show in extensive experiments the various benefits of our method: (1) we achieve competitive results while learning the clustering-friendly representation and number of clusters simultaneously; (2) our method is robust regarding parameters, stable in performance, and allows for more flexibility in the cluster shape; (3) we outperform relevant competitors in the estimation of the number of clusters.
In this work, we perform an extensive investigation of two state-of-the-art (SotA) methods for the task of Entity Alignment in Knowledge Graphs. We first carefully examine the benchmarking process and identify several shortcomings, which make the results reported in the original works not always comparable. Furthermore, we suspect that it is common practice in the community to perform hyperparameter optimization directly on a test set, reducing the informative value of reported performance. Thus, we select a representative sample of benchmarking datasets and describe their properties. We also examine different initializations for entity representations, since they are a decisive factor for model performance. Furthermore, we use a shared train/validation/test split for an appropriate evaluation setting and evaluate all methods on all datasets. In our evaluation, we make several interesting findings. While we observe that most of the time SotA approaches perform better than baselines, they have difficulties when the dataset contains noise, which is the case in most real-life applications. Moreover, in our ablation study, we find that often features of the SotA methods other than those previously assumed are crucial for good performance.
In this work, we focus on the problem of retrieving relevant arguments for a query claim covering diverse aspects. State-of-the-art methods rely on explicit mappings between claims and premises and are thus unable to utilize large available collections of premises without laborious and costly manual annotation. Their diversity approach relies on removing duplicates via clustering, which does not directly ensure that the selected premises cover all aspects. This work introduces a new multi-step approach for the argument retrieval problem. Rather than relying on ground-truth assignments, our approach employs a machine learning model to capture semantic relationships between arguments. Beyond that, it aims to cover diverse facets of the query instead of trying to identify duplicates explicitly. Our empirical evaluation demonstrates that our approach leads to a significant improvement in the argument retrieval task even though it requires less data.
In high-dimensional datasets some dimensions or attributes can be more important than others. Whereas most algorithms neglect one or more dimensions for all points of a dataset, or at least for all points of a certain cluster together, our method KISS (kNN-based Importance Score of Subspaces) detects the most important dimensions for each point individually. It is fully unsupervised and does not depend on distorted multidimensional distance measures. Instead, the $k$ nearest neighbors ($k$NN) in one-dimensional projections of the data points are used to calculate the score for every dimension’s importance. Experiments across a variety of settings show that these scores reflect the structure of the data well. KISS can be used for subspace clustering. What sets it apart from other methods for this task is its runtime, which is linear in the number of dimensions and $O(n \log n)$ in the number of points, as opposed to the quadratic or even exponential runtimes of previous algorithms.
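To make the scoring idea concrete, here is a minimal sketch of a KISS-style per-point dimension score. The aggregation used below (mean 1d kNN distance per dimension, inverted and normalized per point) and the function name kiss_like_scores are illustrative stand-ins, not the published formula.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def kiss_like_scores(X, k=10):
    """Toy per-point, per-dimension importance scores via 1d kNN distances."""
    n, d = X.shape
    raw = np.zeros((n, d))
    for j in range(d):
        proj = X[:, [j]]                           # one-dimensional projection of the data
        nn = NearestNeighbors(n_neighbors=k + 1).fit(proj)
        dist, _ = nn.kneighbors(proj)              # first neighbor is the point itself
        raw[:, j] = dist[:, 1:].mean(axis=1)       # small value = dense 1d neighborhood
    inv = 1.0 / (raw + 1e-12)                      # invert so that higher means more important
    return inv / inv.sum(axis=1, keepdims=True)    # per-point normalization (assumption)

X = np.random.RandomState(0).rand(200, 5)
S = kiss_like_scores(X, k=10)                      # S[i, j]: importance of dimension j for point i
```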
Peer reviewing is a central process in modern research and essential for ensuring the high quality and reliability of published work. At the same time, it is a time-consuming process, and increasing interest in emerging fields often results in a high review workload, especially for senior researchers in these areas. How to cope with this problem is an open question and it is actively discussed across all major conferences. In this work, we propose an Argument Mining based approach for the assistance of editors, meta-reviewers, and reviewers. We demonstrate that the decision process in the field of scientific publications is driven by arguments and that automatic argument identification is helpful in various use cases. One of our findings is that arguments used in the peer-review process differ from arguments in other domains, making the transfer of pre-trained models difficult. Therefore, we provide the community with a new peer-review dataset from different computer science conferences with annotated arguments. In our extensive empirical evaluation, we show that Argument Mining can be used to efficiently extract the most relevant parts from reviews, which are paramount for the publication decision. The process remains interpretable, since the extracted arguments can be highlighted in a review without detaching them from their context.
This thesis explores the connections between clustering and related tasks like subspace clustering, correlation clustering, outlier detection, and data ordering. It introduces novel methods such as the KISS score for subspace clustering, LUCK for correlation clustering, and the ABC algorithm for outlier detection. Additionally, it develops the Circle Index for optimizing data ordering to improve clustering performance. (Shortened.)
Subspace clustering has established itself as a state-of-the-art approach to clustering high-dimensional data. In particular, methods relying on the self-expressiveness property have recently proved especially successful. However, they suffer from two major shortcomings: First, a quadratic-size coefficient matrix is learned directly, preventing these methods from scaling beyond small datasets. Secondly, the trained models are transductive and thus cannot be used to cluster out-of-sample data unseen during training. Instead of learning self-expression coefficients directly, we propose a novel metric learning approach that learns a subspace affinity function using a siamese neural network architecture. Consequently, our model benefits from a constant number of parameters and a constant-size memory footprint, allowing it to scale to considerably larger datasets. In addition, we can formally show that our model is still able to exactly recover subspace clusters given an independence assumption. The siamese architecture in combination with a novel geometric classifier further makes our model inductive, allowing it to cluster out-of-sample data. Additionally, non-linear clusters can be detected by simply adding an autoencoder module to the architecture. The whole model can then be trained end-to-end in a self-supervised manner. This work in progress reports promising preliminary results on the MNIST dataset. In the spirit of reproducible research, we make all code publicly available. In future work we plan to investigate several extensions of our model and to expand the experimental evaluation.
In this work we propose SRE, the first internal evaluation measure for arbitrarily oriented subspace clustering results. For this purpose we present a new perspective on the subspace clustering task: the goal we formalize is to compute a clustering which represents the original dataset by minimizing the reconstruction loss from the obtained subspaces while at the same time minimizing the dimensionality as well as the number of clusters. A fundamental feature of our approach is that it is model-agnostic, i.e., it is independent of the characteristics of any specific subspace clustering method. It is scale invariant and mathematically founded. The experiments show that the SRE score better assesses the quality of an arbitrarily oriented subspace clustering compared to commonly used external evaluation measures.
Peer Kröger
Prof. Dr.
* Former member
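As a rough illustration of the reconstruction-based view formalized in the SRE abstract above, the sketch below represents each cluster by a low-dimensional PCA subspace and trades reconstruction loss off against subspace dimensionality and the number of clusters. The penalty weights alpha and beta and the function name sre_like_score are assumptions for illustration, not the published measure.

```python
import numpy as np
from sklearn.decomposition import PCA

def sre_like_score(X, labels, dims, alpha=0.1, beta=0.1):
    """Lower is better. labels[i]: cluster of point i; dims[c]: subspace dimensionality of cluster c."""
    loss = 0.0
    clusters = np.unique(labels)
    for c in clusters:
        Xc = X[labels == c]
        pca = PCA(n_components=dims[c]).fit(Xc)              # cluster-specific subspace
        Xc_rec = pca.inverse_transform(pca.transform(Xc))    # project onto the subspace and back
        loss += np.sum((Xc - Xc_rec) ** 2)                   # reconstruction loss of the cluster
    # penalize high subspace dimensionality and many clusters (illustrative weighting)
    return loss / len(X) + alpha * sum(dims[c] for c in clusters) + beta * len(clusters)
```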
In the setting of unsupervised machine learning, especially in clustering tasks, the evaluation of either novel algorithms or of a clustering of novel data is challenging. While in the literature the evaluation of new methods is mostly performed on labelled data, there are cases where no labels are at our disposal. In other cases we may not want to trust the “ground truth” labels. In general, there exists a spectrum of so-called internal evaluation measures in the literature, each of which is mostly specialized towards a specific clustering model. The model of arbitrarily oriented subspace clusters is a more recent one. To the best of our knowledge there exist, at the current time, no internal evaluation measures tailored to assessing this particular type of clustering. In this work we present the first internal quality measures for arbitrarily oriented subspace clusterings, namely the normalized projected energy (NPE) and the subspace compactness score (SCS). The results from the experiments show that especially NPE is capable of assessing clusterings by considering archetypical properties of arbitrarily oriented subspace clustering.
Peer Kröger
Prof. Dr.
* Former member
Having data with a high number of features raises the need to detect clusters which exhibit high similarity within subspaces of features. These subspaces can be arbitrarily oriented, which gave rise to arbitrarily-oriented subspace clustering (AOSC) algorithms. Among such algorithms, some are specialized in detecting global clusters that span the entire dataset regardless of any distances, while others are tailored to detecting local clusters. Both of these views (local and global) are obtained separately by the respective algorithms. While, from an algebraic point of view, neither of the two representations can claim to be the true one, it is vital that domain scientists are presented with both views, enabling them to inspect and decide which of the representations is closest to the domain-specific reality. We propose in this work a framework which is capable of detecting locally dense arbitrarily oriented subspace clusters which are embedded within a global one. We are also the first to introduce definitions of locally and globally arbitrarily oriented subspace clusters. Our experiments illustrate that this approach has no significant impact on cluster quality or runtime performance, and enables scientists to no longer be limited exclusively to either the local or the global view.
Peer Kröger
Prof. Dr.
* Former member
Unlabeled data is often abundant in the clinic, making machine learning methods based on semi-supervised learning a good match for this setting. Despite this, they currently receive relatively little attention in the medical image analysis literature. Instead, most practitioners and researchers focus on supervised or transfer learning approaches. The recently proposed MixMatch and FixMatch algorithms have demonstrated promising results in extracting useful representations while requiring very few labels. Motivated by these recent successes, we apply MixMatch and FixMatch in an ophthalmological diagnostic setting and investigate how they fare against standard transfer learning. We find that both algorithms outperform the transfer learning baseline on all fractions of labelled data. Furthermore, our experiments show that Mean Teacher, which is a component of both algorithms, is not needed for our classification problem, as disabling it leaves the outcome unchanged.
Data Mining and Process Mining – is one just a variant of the other, or do worlds separate the two areas from each other? The notions sound so similar, but the contents sometimes look different, so the respective researchers may get confused in their mutual perception, be it as authors or as reviewers. The talk recalls commonalities like model-based supervised and unsupervised learning approaches, and it also sheds light on peculiarities in process data and process mining tasks as seen from a data mining perspective. When considering trace data from event log files as time series, as sequences, or as activity sets, quite different data mining techniques apply and may be extended and improved. A particular example is rare pattern mining, which fills a gap between frequent patterns and outlier detection. The task aims at identifying patterns that occur with low frequency but above single outliers. Structural deficiencies may cause malfunctions or other undesired behavior that gets discarded as outliers in event logs, since it is observed only infrequently. Rare pattern mining may identify these situations, and recent approaches include clustering or ordering non-conformant traces. The talk concludes with some remarks on how to sell process mining papers to the data mining community, and vice versa, in order to improve mutual acceptance and to increase synergies between the fields.
Performance mining from event logs is a central task in managing and optimizing business processes. Established analysis techniques work with only a single timestamp per event. However, when available, time interval information enables a proper analysis of the duration of individual activities as well as of the overall execution runtime. Our novel approach, the performance skyline, considers extended events, including start and end timestamps in log files, aiming at the discovery of events that are crucial to the overall duration of real process executions. As a first contribution, our method derives a geometrical process representation for traces with interval events by using interval-based methods from sequence pattern mining and performance analysis. Secondly, we introduce the performance skyline, which discovers dominating events with respect to a given heuristic, in this case event duration. As a third contribution, we propose three techniques for the statistical analysis of performance skylines and process trace sets, enabling more accurate process discovery, conformance checking, and process enhancement. Experiments on real event logs demonstrate that our contributions are highly suitable for detecting and analyzing the dominant events of a process.
The amount of data increases steadily, and yet most clustering algorithms perform complex computations for every single data point. Furthermore, the Euclidean distance, which is used in most clustering algorithms, is often not the best choice for datasets with arbitrarily shaped clusters or high dimensionality. Based on ABOD, we introduce ABC, the first angle-based clustering method. The algorithm first identifies a small part of the data as border points of clusters based on the angles between their neighbors. Those few border points can, with some adjustments, be clustered with well-known clustering algorithms like hierarchical clustering with single linkage or DBSCAN. Residual points can quickly and easily be assigned to the cluster of their nearest border point, so the overall runtime is heavily reduced while the results improve or remain similar.
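The following sketch illustrates the overall pipeline described above with a deliberately simplified border criterion: instead of the published angle-based score, a point is flagged as a border point when the directions to its k nearest neighbors do not cancel out; the border points are clustered with DBSCAN, and every residual point is assigned to the cluster of its nearest border point. All thresholds and the names border_score and abc_like_clustering are illustrative choices, not the published method.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

def border_score(X, k=15):
    """~0 for interior points (neighbor directions cancel out), ~1 near a cluster border."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    score = np.zeros(len(X))
    for i, neigh in enumerate(idx[:, 1:]):
        vecs = X[neigh] - X[i]
        vecs /= np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-12
        score[i] = np.linalg.norm(vecs.mean(axis=0))
    return score

def abc_like_clustering(X, k=15, border_fraction=0.2, eps=0.5):
    score = border_score(X, k)
    border = score >= np.quantile(score, 1 - border_fraction)    # heuristic border selection
    border_labels = DBSCAN(eps=eps, min_samples=3).fit_predict(X[border])
    nn = NearestNeighbors(n_neighbors=1).fit(X[border])
    _, nearest = nn.kneighbors(X)
    return border_labels[nearest[:, 0]]                           # residual points inherit labels
```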
Using grid-based clustering algorithms on high-dimensional data has the advantage of being able to summarize data points into cells, but usually produces an exponential number of grid cells. In this paper we introduce Grace (using a Grid which is adaptive for clustering), a clustering algorithm which limits the number of cells produced depending on the number of points in the dataset. A non-equidistant grid is constructed based on the distribution of points in one-dimensional projections of the data. A density threshold is automatically deduced from the data and used to detect dense cells, which are later combined into clusters. The adaptive grid structure makes an efficient but still accurate clustering of multidimensional data possible. Experiments with synthetic as well as real-world data sets of various sizes and dimensionalities confirm these properties.
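A minimal sketch of what such an adaptive grid can look like, assuming quantile-based bin edges per dimension and a mean-occupancy density threshold (both simplifications chosen for illustration); only occupied cells are materialized, so the number of cells stays bounded by the number of points.

```python
import numpy as np
from collections import Counter

def adaptive_grid_dense_cells(X, bins_per_dim=8):
    n, d = X.shape
    # non-equidistant edges per dimension, derived from the 1d point distributions
    edges = [np.quantile(X[:, j], np.linspace(0, 1, bins_per_dim + 1)[1:-1]) for j in range(d)]
    cell_ids = np.stack([np.searchsorted(edges[j], X[:, j]) for j in range(d)], axis=1)
    counts = Counter(map(tuple, cell_ids))               # only occupied cells are stored
    threshold = np.mean(list(counts.values()))           # simple data-derived density threshold
    dense = {cell for cell, cnt in counts.items() if cnt > threshold}
    return cell_ids, dense                               # dense cells could then be merged into clusters
```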
As data processing techniques get more and more sophisticated every day, many of us researchers often get lost in the details and subtleties of the algorithms we are developing and far too easily seem to forget to look also at the very first steps of every algorithm: the input of the data. Since there are plenty of library functions for this task, we indeed do not have to think about this part of the pipeline anymore. But maybe we should. All data is stored and loaded into a program in some order. In this vision paper we study how ignoring this order can (1) lead to performance issues and (2) make research results unreproducible. We furthermore examine desirable properties of a data ordering and why current approaches are often not suited to tackle the two mentioned problems.
When facing high-dimensional data streams, clustering algorithms quickly reach the boundaries of their usefulness, as most of these methods are not designed to deal with the curse of dimensionality. Due to the inherent sparsity of high-dimensional data, distances between objects tend to become meaningless, since the distances between any two objects measured in the full-dimensional space tend to become the same for all pairs of objects. In this work, we present a novel oriented subspace clustering algorithm that is able to deal with such issues and detects arbitrarily oriented subspace clusters in high-dimensional data streams. Data streams generally pose the challenge that the data cannot be stored in its entirety, and hence there is a general demand for suitable data handling strategies for clustering algorithms such that the data can be processed within a single scan. We therefore propose the CASHSTREAM algorithm, which unites state-of-the-art stream processing techniques and additionally relies on the Hough transform to detect arbitrarily oriented subspace clusters. Our experiments compare CASHSTREAM to its static counterpart and show that the amount of consumed memory is significantly decreased while there is no loss in terms of runtime.
Peer Kröger
Prof. Dr.
* Former member
In this work, we focus on the problem of entity alignment in Knowledge Graphs (KG) and we report on our experiences when applying a Graph Convolutional Network (GCN) based model for this task. Variants of GCN are used in multiple state-of-the-art approaches, and therefore it is important to understand the specifics and limitations of GCN-based models. Despite serious efforts, we were not able to fully reproduce the results from the original paper, and after a thorough audit of the code provided by the authors, we concluded that their implementation differs from the architecture described in the paper. In addition, several tricks are required to make the model work, and some of them are not very intuitive. We provide an extensive ablation study to quantify the effects these tricks and changes of architecture have on the final performance. Furthermore, we examine current evaluation approaches and systematize available benchmark datasets. We believe that people interested in KG matching might profit from our work, as well as novices entering the field.
This thesis addresses several challenges in social data analytics, focusing on methods for clustering, learning from network data, and analyzing dynamic social data. It introduces novel algorithms for correlation clustering on streaming data, hierarchical clustering for social maps, and user identification based on spatio-temporal mobility patterns. Additionally, the thesis presents various node embedding techniques for learning representations from network topology and proposes a graph neural network model for matching nodes across overlapping graphs. (Shortened.)
The idea of combining the high representational power of deep learning techniques with clustering methods has gained much interest in recent years. Optimizing representation and clustering simultaneously has been shown to have an advantage over optimizing them separately. However, so far all proposed methods have been using a flat clustering strategy, with the true number of clusters known a priori. In this paper, we propose the Deep Embedded Cluster Tree (DeepECT), the first divisive hierarchical embedded clustering method. The cluster tree does not need to know the true number of clusters during optimization. Instead, the level of detail to be analyzed can be chosen afterward and for each sub-tree separately. An optional data-augmentation-based extension allows DeepECT to ignore prior-known invariances of the dataset, such as affine transformations in image data. We evaluate and show the advantages of DeepECT in extensive experiments.
Christian Böhm
Prof. Dr.
* Former member
Collecting spatio-temporal resources is an important goal in many real-world use cases such as finding customers for taxicabs. In this paper, we tackle the resource search problem posed by the GIS Cup 2019 where the objective is to minimize the average search time of taxicabs looking for customers. The main challenge is that the taxicabs may not communicate with each other and the only observation they have is the current time and position. Inspired by radial transit route structures in urban environments, our approach relies on round trips that are used as action space for a downstream reinforcement learning procedure. Our source code is publicly available at https://github.com/Fe18/TripBanditAgent.
Sabrina Friedl
* Former member
MORe++ is a k-Means based outlier removal method working on high-dimensional data. It is simple, efficient and scalable. The core idea is to find local outliers by examining the points of the different k-Means clusters separately. In this way, one-dimensional projections of the data become meaningful and allow one-dimensional outliers to be found easily, which would otherwise be hidden by points of other clusters. MORe++ does not need any input parameters beyond the number of clusters k used for k-Means, and delivers an intuitively accessible degree of outlierness. In extensive experiments it performed well compared to k-Means-- and ORC.
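To illustrate the idea of per-cluster one-dimensional outlier scoring, here is a heavily simplified stand-in: after k-Means, every point receives the maximum of its per-dimension robust z-scores computed within its own cluster. The concrete score below is an illustrative choice, not the published degree of outlierness.

```python
import numpy as np
from sklearn.cluster import KMeans

def per_cluster_1d_outlier_scores(X, k=3, random_state=0):
    labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
    scores = np.zeros(len(X))
    for c in range(k):
        mask = labels == c
        Xc = X[mask]
        med = np.median(Xc, axis=0)
        mad = np.median(np.abs(Xc - med), axis=0) + 1e-12        # per-dimension robust spread
        scores[mask] = np.max(np.abs(Xc - med) / mad, axis=1)    # worst 1d deviation within the cluster
    return labels, scores                                         # large score = likely 1d-cluster-outlier
```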
For a given query object, Reverse k-Nearest Neighbor (RkNN) queries retrieve those objects that have the query object among their k-nearest neighbors. However, computing the k-nearest neighbor sets for all points in a database is expensive in terms of computational costs. Therefore, specific index structures have been invented that apply pruning heuristics aimed at reducing the search space. At present, the state-of-the-art index structure for enabling fast RkNN query processing in general metric spaces is the MRkNNCoP-Tree, which uses linear functions to approximate lower and upper bounds on the k-distances to prune the search space. Storing those linear functions results in additional storage costs in O(n), which might be infeasible in situations where storage space is limited, e.g., on mobile devices. In this work, we present a novel index based on the MRkNNCoP-Tree as well as on recent developments in the field of neural indexing. By learning a single neural network model that approximates the k-nearest neighbor distance bounds for all points in a database, the storage complexity of the proposed index structure is reduced to O(1), while the index is still able to guarantee exact query results. As shown in our experimental evaluations on synthetic and real-world data sets, our approach can significantly reduce the required storage space at the cost of some growth in the refinement sets when relying on exact query processing.
Peer Kröger
Prof. Dr.
* Former member
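The core trick of the learned index described in the abstract above can be sketched as follows: one regression model maps a point's coordinates to its k-distances for k = 1..k_max, and a global offset derived from the training residuals keeps the prediction a conservative upper bound for all database points. The model choice (an MLP), the features, and the offset construction below are simplifications for illustration, not the published index.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.neural_network import MLPRegressor

def train_kdist_bound_model(X, k_max=10):
    """Learn approximate k-distances for all database points, plus a conservative offset."""
    nn = NearestNeighbors(n_neighbors=k_max + 1).fit(X)
    dist, _ = nn.kneighbors(X)
    kdists = dist[:, 1:]                                      # true k-distances for k = 1..k_max
    model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000,
                         random_state=0).fit(X, kdists)
    upper_offset = (kdists - model.predict(X)).max(axis=0)    # largest underestimation per k
    return model, upper_offset

def upper_bound_kdists(model, upper_offset, points):
    # prediction plus offset is >= the true k-distance for every point seen during training,
    # which is exactly the set of database points the index has to bound
    return model.predict(points) + upper_offset
```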
Nowadays, as lots of data is gathered in large volumes and with high velocity, the development of algorithms capable of handling complex data streams in (near) real-time is a major challenge. In this work, we present the algorithm CORRSTREAM, which tackles the problem of detecting arbitrarily oriented subspace clusters in high-dimensional data streams. The proposed method follows a two-phase approach, where the continuous online phase aggregates data points within a proper microcluster structure that stores all necessary information to define a microcluster’s subspace and is generic enough to cope with a variety of offline procedures. Given several such microclusters, the offline phase is able to build a final clustering model which reveals arbitrarily oriented subspaces in which the data tend to cluster. In our experimental evaluation, we show that CORRSTREAM not only has an acceptable throughput but also outperforms its static counterpart algorithms by orders of magnitude in terms of runtime. At the same time, the loss of accuracy is quite small.
Peer Kröger
Prof. Dr.
* Former member
Principal Component Analysis (PCA) is a popular method for linear dimensionality reduction. It is often used to discover hidden correlations or to facilitate the interpretation and visualization of data. However, it is susceptible to outliers: strong outliers can skew the principal components and, as a consequence, lead to a higher reconstruction loss. While there exist several sophisticated approaches to make PCA more robust, we present an approach which is intriguingly simple: we replace the covariance matrix with a so-called coMAD matrix. First experiments show that PCA based on the coMAD matrix is more robust towards outliers.
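A minimal sketch of one reading of this idea: the coMAD matrix is built from medians of products of median-centered coordinates (a comedian-style construction), and the principal directions are its leading eigenvectors. The helper names and the exact centering are assumptions for illustration.

```python
import numpy as np

def comad_matrix(X):
    """Median-based analogue of the covariance matrix (comedian-style construction)."""
    centered = X - np.median(X, axis=0)
    d = X.shape[1]
    C = np.empty((d, d))
    for i in range(d):
        for j in range(d):
            C[i, j] = np.median(centered[:, i] * centered[:, j])
    return C

def comad_pca(X, n_components=2):
    eigvals, eigvecs = np.linalg.eigh(comad_matrix(X))    # the coMAD matrix is symmetric
    order = np.argsort(eigvals)[::-1]                     # sort by decreasing "robust variance"
    return eigvecs[:, order[:n_components]]               # robust principal directions
```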
We introduce a generator for data containing subspace clusters which is accurately tunable and adjustable to the needs of developers. It is available online and allows users to specify a plethora of characteristics the data should contain, while it is simultaneously able to generate meaningful data containing subspace clusters from a minimum of user input.
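As a toy illustration of what such a generator has to produce at its core (the interface, options and defaults below are invented for illustration and much smaller than the described tool): each cluster is compact in a randomly chosen subset of dimensions and spread uniformly in the remaining ones.

```python
import numpy as np

def generate_subspace_clusters(n_clusters=3, points_per_cluster=200, dim=10,
                               subspace_dim=3, noise=0.05, seed=0):
    rng = np.random.default_rng(seed)
    X, y, subspaces = [], [], []
    for c in range(n_clusters):
        subspace = rng.choice(dim, size=subspace_dim, replace=False)   # relevant dimensions
        center = rng.uniform(0, 1, size=subspace_dim)
        pts = rng.uniform(0, 1, size=(points_per_cluster, dim))        # irrelevant dims: uniform noise
        pts[:, subspace] = center + rng.normal(0, noise, size=(points_per_cluster, subspace_dim))
        X.append(pts)
        y.extend([c] * points_per_cluster)
        subspaces.append(sorted(subspace.tolist()))
    return np.vstack(X), np.array(y), subspaces    # data, labels, ground-truth subspaces
```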
Given trend data for different keywords, scientists may want to cluster them in order to detect specific terms which exhibit a similar trending behavior. For this purpose, a periodic regression can be performed on each of the time series. In this work we ask: What if we do not simply cluster the regression models of each time series, but the periodic signal constituents? The impact of such an approach is twofold: first, we would see at the regression level how similar or dissimilar two time series are regarding their periodic models; secondly, we would be able to see similarities between different time series based on single signal constituents, capturing the notion that although time series may differ at the regression level, they may be similar at the constituent level, reflecting other periodic influences. The results of this approach reveal commonalities between time series at the constituent level that are not visible at first glance when looking at their plain regression models.
When it comes to the task of dimensionality reduction, Principal Component Analysis (PCA) is among the most well-known methods. Despite its popularity, PCA is prone to outliers, which can be traced back to the fact that the method relies on a covariance matrix. Even with the variety of sophisticated methods to enhance the robustness of PCA, we provide in this work in progress an approach which is intriguingly simple: the covariance matrix is replaced by a so-called comode matrix. The experiments show that this minor modification significantly reduces the reconstruction loss. In this work we introduce the comode and its relation to the MeanShift algorithm, including its bandwidth parameter, compare it in an experiment against the classic covariance matrix, and evaluate the impact of the bandwidth hyperparameter on the reconstruction error.
Chains connecting two or more different clusters are a well-known problem of clustering algorithms like DBSCAN or Single Linkage Clustering. Since even a small number of points resulting from, e.g., noise can form such a chain and build a bridge between different clusters, the results of the clustering algorithm can become distorted: several disparate clusters are merged into one. This single-link effect is well known, but to the best of our knowledge there are no satisfying solutions for extracting such chains yet. We present a new algorithm detecting not only straight chains between clusters, but also bent and noisy ones. Users are able to choose between eliminating one-dimensional and higher-dimensional chains connecting clusters to recover the underlying cluster structure. Also, the desired straightness can be set by the user. As this paper is an extension of ‘Chain-detection for DBSCAN’, we apply our technique not only in combination with DBSCAN but also with single-link hierarchical clustering. On a real-world dataset containing traffic accidents in Great Britain we were able to detect chains emerging from streets between cities and villages, which had led to clusters composed of diverse villages. Additionally, we analyzed the robustness regarding the variance of chains in synthetic experiments.
As machine learning becomes a more and more important area in Data Science, bringing with it a rise in abstractness and complexity, the desire for explainability rises, too. With our work we aim to gain explainability by focusing on correlation clustering and try to pursue the original goal of different Data Science tasks: extracting knowledge from data. As well-known tools like Fold-It or GeoTime show, gamification is a very powerful approach, and not only for solving tasks which prove more difficult for machines than for humans. We can also gain knowledge from how players proceed when trying to solve such difficult tasks. That is why we developed Straighten it up!, a game in which users try to find the best linear correlations in high-dimensional datasets. Finding arbitrarily oriented subspaces in high-dimensional data is an exponentially complex task due to the number of potential subspaces with regard to the number of dimensions. Nevertheless, linearly correlated points form a simple pattern that is easy for the human eye to track. Straighten it up! gives users an overview of two-dimensional projections of a self-chosen dataset. Users decide which subspace they want to examine first and can draw in arbitrarily many lines fitting the data. An offset inside of which points are assigned to the corresponding line can easily be chosen for every line independently, and users can switch between different projections at any time. We developed a scoring system not only as an incentive, but above all for further examination, based on the density of each cluster, its minimum spanning tree, the size of the offset, and the coverage. By tracking every step of a user we are able to detect common mechanisms and examine differences to state-of-the-art correlation and subspace clustering algorithms, resulting in more comprehensibility.
Artificially generated data sets are present in the experimental sections of many data mining and machine learning publications. One of the reasons to use synthetic data is that scientists can express their understanding of a “ground truth”, having labels and thus an expectation of what an algorithm should be able to detect. This also permits a degree of control to create data sets which either emphasize the strengths of a method or reveal its weaknesses and thus potential targets for improvement. In order to develop methods which detect linearly correlated clusters, generating such artificial clusters is indispensable. This is mostly done with command-line based scripts, which may be tedious since they demand that users ‘visualize’ in their minds how the correlated clusters should look and be positioned within the data space. We present in this work RAIL, a generator for Reproducible Artificial Interactive Linear correlated data. With RAIL, users can add multiple planes to a data space and arbitrarily change the orientation and position of those planes in an interactive fashion. This is achieved by manipulating the parameters describing each of the planes, giving users immediate feedback in real time. With this approach scientists no longer need to imagine their data but can interactively explore and design their own artificial data sets containing linearly correlated clusters. Another convenient feature in this context is that the data is only generated once the users decide that their design phase is completed. If researchers want to share data, a small file is exchanged containing the parameters which describe the clusters, such as their Hessian normal form or the number of points per cluster, instead of sharing several large CSV files.
LUCK allows any distance-based clustering algorithm to be used to find linearly correlated data. For that, a novel distance function is introduced which takes the distribution of the kNN of points into account and corresponds to the probability of two points being part of the same linear correlation. In this work in progress we tested the distance measure with DBSCAN and k-Means, comparing it to the well-known linear correlation clustering algorithms ORCLUS, 4C, COPAC, LMCLUS, and CASH, and received good results for difficult synthetic data sets containing crossing or non-continuous correlations.
As the ordering of data, particularly of graphs, can heavily influence the results of diverse Data Mining tasks performed on it, we introduce the Circle-Index, the first internal quality measure for orderings of graphs. It is based on a circular arrangement of nodes but, in contrast to similar arrangements from, e.g., the field of visual analytics, takes the edge lengths in this arrangement into account. Minimizing the Circle-Index leads to an arrangement which not only offers a simple way to cluster the data using a constrained MinCut in only linear time, but is also visually convincing. We developed the clustering algorithm CirClu, which implements this minimization and MinCut, and compared it with several established clustering algorithms, achieving very good results. Simultaneously, we compared the Circle-Index with several internal quality measures for clusterings. We observed a strong correspondence between the Circle-Index and how well the obtained clusterings match the respective ground truths on diverse real-world datasets.
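The core quantity can be sketched in a few lines: nodes are placed on a unit circle in the given order and the chord lengths of all edges are summed, so orderings that keep connected nodes close together score lower. The exact definition and normalization of the published Circle-Index are not reproduced here; the name circle_index_like marks this as an approximation.

```python
import numpy as np

def circle_index_like(ordering, edges):
    """ordering: sequence of node ids; edges: iterable of (u, v) pairs."""
    n = len(ordering)
    pos = {node: i for i, node in enumerate(ordering)}
    angles = 2 * np.pi * np.arange(n) / n
    xy = np.column_stack([np.cos(angles), np.sin(angles)])   # evenly spaced positions on the circle
    return sum(np.linalg.norm(xy[pos[u]] - xy[pos[v]]) for u, v in edges)

# A path graph scores better when the ordering follows the path.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
assert circle_index_like([0, 1, 2, 3, 4], edges) < circle_index_like([0, 3, 1, 4, 2], edges)
```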
Periodicities are omnipresent: in nature in the cycles of predator and prey populations, in reoccurring patterns of our power consumption over the day, or in the presence of flu diseases over the year. Given the importance of periodicities, we ask: Is there a way to detect periodic correlated clusters hidden in event series? As a work in progress, we propose a method for detecting sinusoidal periodic correlated clusters in event series which relies on parameter space transformation. Our contributions are: providing the first non-linear correlation clustering algorithm for detecting periodic correlated clusters, and providing an explicit model that gives domain experts information on parameters such as amplitude, frequency, phase shift and vertical shift of the detected clusters. Beyond that, we approach the issue of determining an adequate frequency and phase shift of the detected correlations given a frequency and phase-shift boundary.
Peer Kröger
Prof. Dr.
* Former member
In publications describing a clustering method, the chosen hyperparameters are, as far as we can currently observe, in many cases determined empirically. In this work in progress we discuss and propose one approach for systematically exploring hyperparameters and analyzing their effects on the data set. In the context of hyperparameter analysis, we further introduce a modified definition of the term resilience, which here refers to a subset of data points that remains in the same cluster across different hyperparameter settings. In order to analyze relations among different hyperparameters, we further introduce the concept of dynamic intersection computing.
In this work we present Rock, a method where the points roam to their clusters using k-NN. Rock is a draft of an algorithm which is capable of detecting non-convex clusters of arbitrary dimension while delivering representatives for each cluster, similar to, e.g., Mean Shift or k-Means. Applying Rock, points roam to the mean of their k-NN while k is incremented in every step. In this way, outlying points and noise move toward their nearest cluster, while the clusters themselves contract first to their skeletons and then to one representative point each. Our empirical results on synthetic and real data demonstrate that Rock is able to detect clusters on datasets where mode-seeking or density-based approaches do not succeed.
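A compact sketch of the roaming step described above, with a fixed number of iterations instead of a proper convergence criterion and without the final extraction of representatives (both simplifications for illustration):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def rock_like_roaming(X, k_start=3, iterations=15):
    P = X.copy()
    k = k_start
    for _ in range(iterations):
        nn = NearestNeighbors(n_neighbors=k + 1).fit(P)
        _, idx = nn.kneighbors(P)
        P = P[idx[:, 1:]].mean(axis=1)    # every point roams to the mean of its current k-NN
        k += 1                            # k grows in every step, pulling in outlying points
    return P                              # contracted positions; nearby points share a representative
```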
In different research domains, experiments are conducted that aim at the detection of (hyper)linear correlations among multiple features within a given data set. For this purpose several methods exist, one of which is highly robust against noise and detects linearly correlated clusters regardless of any locality assumption. This method is based on parameter space transformation. The currently available parameter-transform based algorithms detect the clusters by explicitly scanning for intersections of functions in parameter space. This approach comes with drawbacks: it is difficult to analyze aspects going beyond the sole intersection of functions, such as, e.g., the area around the intersections, and it is computationally expensive. The work-in-progress method we provide here overcomes the mentioned drawbacks by sampling d-dimensional tuples in data space, generating a (hyper)plane and representing this plane as a single point in parameter space. With this approach we no longer scan for intersection points of functions in parameter space but for dense regions of such parameter vectors. In future work, well-established clustering algorithms can thus be applied in parameter space to detect, e.g., dense regions, modes or hierarchies of linear correlations.
Peer Kröger
Prof. Dr.
* Former member
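A sketch of the sampling step from the abstract above: d sampled points define a (d-1)-dimensional hyperplane whose unit normal and offset form one parameter vector; collecting many such vectors yields a parameter-space point cloud whose dense regions indicate linear correlations. The number of samples, the sign convention and the downstream density analysis are left open in the abstract and are illustrative choices here.

```python
import numpy as np

def sample_hyperplane_parameters(X, n_samples=1000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    params = []
    for _ in range(n_samples):
        pts = X[rng.choice(n, size=d, replace=False)]    # d points span a (d-1)-dim hyperplane
        centered = pts - pts.mean(axis=0)
        _, _, vt = np.linalg.svd(centered)
        normal = vt[-1]                                  # right-singular vector of the smallest singular value
        offset = normal @ pts.mean(axis=0)
        if normal[0] < 0:                                # fix the sign to avoid mirrored duplicates
            normal, offset = -normal, -offset
        params.append(np.append(normal, offset))
    return np.array(params)                              # feed into any density-based clustering
```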
In recent years, the demand has increased for algorithms which provide not only their results but also a certain degree of explainability. In this paper we envision a class of clustering algorithms where users can interact not only with the input or output but can also intervene in the clustering process itself, which we coin with the term process-aware clustering. Further, we sketch the challenges emerging with this type of algorithm, such as the need for adequate measures that evaluate the progression through the computation process of a clustering method. Beyond explainability of how the results are generated, we propose methods aimed at systematically analyzing the hyperparameter space of an algorithm, determining suitable hyperparameters in a more ordered fashion rather than applying a trial-and-error scheme.
Clustering algorithms mostly follow the pipeline of providing input data and hyperparameter values; then the algorithm is executed and the output files are generated or visualized. We provide in our work an early prototype of an interactive density-based clustering tool named DICE, in which users can change the hyperparameter settings and immediately observe the resulting clusters. Further, users can browse through each of the detected clusters and obtain statistics as well as a convex-hull profile for each cluster. DICE also keeps track of the chosen settings, enabling users to review which hyperparameter values have been chosen previously. DICE can be used not only in the scientific context of analyzing data, but also in didactic settings in which students can learn in an exploratory fashion how a density-based clustering algorithm like, e.g., DBSCAN behaves.
Peer Kröger
Prof. Dr.
* Former member