holds the Chair of Applied Statistics in Social Sciences, Economics and Business at LMU Munich.
He conducts research in advanced regression analysis, focusing on generalized additive models and generalized mixed models. His work aims to refine statistical methods for complex data, enhancing their application in various scientific fields.
Uncertainty in machine learning models is a timely and vast field of research. In supervised learning, uncertainty can already occur in the first stage of the training process, the annotation phase. This scenario is particularly evident when some instances cannot be definitively classified. In other words, there is inevitable ambiguity in the annotation step and hence, not necessarily a ‘ground truth’ associated with each instance. The main idea of this work is to drop the assumption of a ground truth label and instead embed the annotations into a multidimensional space. This embedding is derived from the empirical distribution of annotations in a Bayesian setup, modeled via a Dirichlet-Multinomial framework. We estimate the model parameters and posteriors using a stochastic Expectation Maximization algorithm with Markov Chain Monte Carlo steps. The methods developed in this paper readily extend to various situations where multiple annotators independently label instances. To showcase the generality of the proposed approach, we apply it to three benchmark datasets for image classification and Natural Language Inference. Besides the embeddings, we can investigate the resulting correlation matrices, which reflect the semantic similarities of the original classes very well for all three datasets.
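To make the Dirichlet-Multinomial idea concrete, the following is a minimal sketch of the conjugate update only. It assumes a fixed, known concentration vector `alpha`, which is a simplification: the abstract describes estimating the parameters with stochastic EM and MCMC steps, which this sketch does not do. The function name and the example vote counts are hypothetical.

```python
import numpy as np


def dirichlet_posterior_mean(votes, alpha):
    """Posterior mean of class probabilities under a Dirichlet prior.

    votes: (n_instances, n_classes) array of annotation counts.
    alpha: (n_classes,) Dirichlet concentration parameters
           (assumed known here; an illustrative simplification).
    """
    votes = np.asarray(votes, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    # Conjugacy: Multinomial counts + Dirichlet(alpha) -> Dirichlet(alpha + counts)
    posterior = votes + alpha
    # Normalise each row to obtain the posterior mean on the simplex
    return posterior / posterior.sum(axis=1, keepdims=True)


# Hypothetical vote counts: 3 instances, 3 classes, 10 annotators each
votes = np.array([[8, 1, 1], [4, 5, 1], [0, 0, 10]])
embedding = dirichlet_posterior_mean(votes, alpha=np.ones(3))
```

Each row of `embedding` is a point on the probability simplex, so an ambiguous instance such as the second one sits between classes rather than being forced onto a single label.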
In this work, we analyze the uncertainty that is inherently present in the labels used for supervised machine learning in natural language inference (NLI). In cases where multiple annotations per instance are available, neither the majority vote nor the frequency of individual class votes is a trustworthy representation of the labeling uncertainty. We propose modeling the votes via a Bayesian mixture model to recover the data-generating process, i.e., the “true” latent classes, and thus gain insight into the class variations. This will enable a better understanding of the confusion happening during the annotation process. We also assess the stability of the proposed estimation procedure by systematically varying the numbers of i) instances and ii) labels. Thereby, we observe that few instances with many labels can predict the latent class borders reasonably well, while the estimation fails for many instances with only a few labels. This leads us to conclude that multiple labels are a crucial building block for properly analyzing label uncertainty.
This dissertation focuses on dynamic networks in the Social Sciences, examining methods and applications in network modeling. Part two provides an overview of modeling frameworks for dynamic networks, including applications in studying COVID-19 infections using social connectivity as covariates. In part three, the dissertation introduces a Signed Exponential Random Graph Model (SERGM) for signed networks and a bipartite variant of the Temporal Exponential Random Graph Model (TERGM) to study co-inventorship in patents. Part four concludes with models for event networks, including a Relational Event Model for Spurious Events (REMSE) to manage false-discovery rates in event data. (Shortened).
As early as March 2020, the authors of this letter started to work on surveillance data to obtain a clearer picture of the pandemic’s dynamic. This letter outlines the lessons learned during this peculiar time, emphasizing the benefits that better data collection, management, and communication processes would bring to the table. We further want to promote nuanced data analyses as a vital element of general political discussion as opposed to drawing conclusions from raw data, which are often flawed in epidemiological surveillance data, and therefore underline the overall need for statistics to play a more central role in public discourse.
Maximilian Weigert
* Former member
As relational event models are an increasingly popular model for studying relational structures, the reliability of large-scale event data collection becomes more and more important. Automated or human-coded events often suffer from non-negligible false-discovery rates in event identification. Moreover, most sensor data are primarily based on actors’ spatial proximity for predefined time windows; hence, the observed events could relate either to a social relationship or random co-location. Both examples imply spurious events that may bias estimates and inference. We propose the Relational Event Model for Spurious Events (REMSE), an extension to existing approaches for interaction data. The model provides a flexible solution for modeling data while controlling for spurious events. Estimation of our model is carried out in an empirical Bayesian approach via data augmentation. Based on a simulation study, we investigate the properties of the estimation procedure. To demonstrate its usefulness in two distinct applications, we apply this model to data on combat events from the Syrian civil war and to student co-location data. Results from the simulation and the applications identify the REMSE as a suitable approach to modeling relational event data in the presence of spurious events.
Estimation of latent network flows is a common problem in statistical network analysis. The typical setting is that we know the margins of the network, that is, in- and outdegrees, but the flows are unobserved. In this article, we develop a mixed regression model to estimate network flows in a bike-sharing network if only the hourly differences of in- and outdegrees at bike stations are known. We also include exogenous covariates such as weather conditions. Two different parameterizations of the model are considered to estimate (a) the whole network flow and (b) the network margins only. The model parameters are estimated via an iterative penalized maximum likelihood approach. This is exemplified by modelling network flows in the Vienna bike-sharing system. To evaluate our modelling approach, we conduct our analyses under different distributional assumptions while also accounting for the provider’s interventions to keep the estimation error low. Furthermore, a simulation study is conducted to show the performance of the model. For practical purposes, it is crucial to predict when and at which station there is a lack or an excess of bikes. For this application, our model proves well suited, providing quite accurate predictions.
Over the course of the COVID-19 pandemic, Generalized Additive Models (GAMs) have been successfully employed on numerous occasions to obtain vital data-driven insights. In this article we further substantiate the success story of GAMs, demonstrating their flexibility by focusing on three relevant pandemic-related issues. First, we examine the interdependency among infections in different age groups, concentrating on school children. In this context, we derive the setting under which parameter estimates are independent of the (unknown) case-detection ratio, which plays an important role in COVID-19 surveillance data. Second, we model the incidence of hospitalizations, for which data is only available with a temporal delay. We illustrate how correcting for this reporting delay through a nowcasting procedure can be naturally incorporated into the GAM framework as an offset term. Third, we propose a multinomial model for the weekly occupancy of intensive care units (ICU), where we distinguish between the number of COVID-19 patients, other patients and vacant beds. With these three examples, we aim to showcase the practical and ‘off-the-shelf’ applicability of GAMs to gain new insights from real-world data.
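As a hedged sketch of how a reporting-delay correction could enter a GAM as an offset, consider the following formulation. The notation is an assumption made for illustration and is not taken from the article: $C_t$ is the number of hospitalizations for date $t$ reported up to the current day $T$, $\lambda_t$ is the expected final count, $\pi_t(T-t)$ is the fraction of cases reported within $T-t$ days (treated as known or separately estimated), $s(\cdot)$ is a smooth function of time, and $x_t$ are covariates.

```latex
\begin{align}
\mathbb{E}[C_t] &= \lambda_t \, \pi_t(T - t), \\
\log \mathbb{E}[C_t] &= \underbrace{s(t) + x_t^\top \beta}_{\log \lambda_t}
                      + \underbrace{\log \pi_t(T - t)}_{\text{offset}} .
\end{align}
```

Under these assumptions the incomplete recent counts can be modeled directly, because the log reporting fraction shifts the linear predictor by a fixed amount instead of requiring the counts themselves to be rescaled.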
Maximilian Weigert
* Former member
In the past decades, the growing amount of network data has led to many novel statistical models. In this paper we consider so-called geometric networks. Typical examples are road networks or other infrastructure networks. Nevertheless, the neurons or the blood vessels in a human body can also be interpreted as a geometric network embedded in a three-dimensional space. A network-specific metric, rather than the Euclidean metric, is usually used in all these applications, making the analyses of network data challenging. We consider network-based point processes, and our task is to estimate the intensity (or density) of the process which allows us to detect high- and low-intensity regions of the underlying stochastic processes. Available routines that tackle this problem are commonly based on kernel smoothing methods. This paper uses penalized spline smoothing and extends this toward smooth intensity estimation on geometric networks. Furthermore, our approach easily allows incorporating covariates, enabling us to respect the network geometry in a regression model framework. Several data examples and a simulation study show that penalized spline-based intensity estimation on geometric networks is a numerically stable and efficient tool. Furthermore, it also allows estimating linear and smooth covariate effects, distinguishing our approach from already existing methodologies.
Mixture models are probabilistic models aimed at uncovering and representing latent subgroups within a population. In the realm of network data analysis, the latent subgroups of nodes are typically identified by their connectivity behaviour, with nodes behaving similarly belonging to the same community. In this context, mixture modelling is pursued through stochastic blockmodelling. We consider stochastic blockmodels and some of their variants and extensions from a mixture modelling perspective. We also explore some of the main classes of estimation methods available and propose an alternative approach based on the reformulation of the blockmodel as a graphon. In addition to the discussion of inferential properties and estimating procedures, we focus on the application of the models to several real-world network datasets, showcasing the advantages and pitfalls of different approaches.
Since the primary mode of respiratory virus transmission is person-to-person interaction, we are required to reconsider physical interaction patterns to mitigate the number of people infected with COVID-19. While research has shown that non-pharmaceutical interventions (NPI) had an evident impact on national mobility patterns, we investigate the relative regional mobility behaviour to assess the effect of human movement on the spread of COVID-19. In particular, we explore the impact of human mobility and social connectivity derived from Facebook activities on the weekly rate of new infections in Germany between 3 March and 22 June 2020. Our results confirm that reduced social activity lowers the infection rate, accounting for regional and temporal patterns. The extent of social distancing, quantified by the percentage of people staying put within a federal administrative district, has an overall negative effect on the incidence of infections. Additionally, our results show spatial infection patterns based on geographical as well as social distances.
The presence of unobserved node-specific heterogeneity in exponential random graph models (ERGM) is a general concern, with respect to both model validity and estimation instability. We, therefore, include node-specific random effects in the ERGM that account for unobserved heterogeneity in the network. This leads to a mixed model with parametric as well as random coefficients, labelled as mixed ERGM. Estimation is carried out by iterating between approximate pseudolikelihood estimation for the random effects and maximum likelihood estimation for the remaining parameters in the model. This approach provides a stable algorithm, which allows fitting nodal heterogeneity effects even for large-scale networks. We also propose model selection based on the Akaike Information Criterion to check for node-specific heterogeneity.
Accurate and interpretable forecasting models predicting spatially and temporally fine-grained changes in the numbers of intrastate conflict casualties are of crucial importance for policymakers and international non-governmental organizations (NGOs). Using a count data approach, we propose a hierarchical hurdle regression model to address the corresponding prediction challenge at the monthly PRIO-grid level. More precisely, we model the intensity of local armed conflict at a specific point in time as a three-stage process. Stages one and two of our approach estimate whether we will observe any casualties at the country- and grid-cell-level, respectively, while stage three applies a regression model for truncated data to predict the number of such fatalities conditional upon the previous two stages. Within this modeling framework, we focus on the role of governmental arms imports as a processual factor allowing governments to intensify or deter from fighting. We further argue that a grid cell’s geographic remoteness is bound to moderate the effects of these military buildups. Out-of-sample predictions corroborate the effectiveness of our parsimonious and theory-driven model, which enables full transparency combined with accuracy in the forecasting process.
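The three-stage structure described above can be sketched as a product of two hurdle probabilities and a conditional mean. This is an illustrative sketch only: the zero-truncated Poisson in stage three is an assumption (the abstract says only "a regression model for truncated data"), and the function names and inputs are hypothetical stand-ins for the fitted stage models.

```python
import math


def truncated_poisson_mean(lam):
    """Mean of a zero-truncated Poisson with rate lam > 0 (assumed stage-3 model)."""
    return lam / (1.0 - math.exp(-lam))


def expected_fatalities(p_country, p_cell, lam):
    """Unconditional expected count for one grid cell and month.

    p_country: P(any casualties in the country), stage one.
    p_cell:    P(any casualties in the cell | casualties in the country), stage two.
    lam:       rate of the zero-truncated Poisson for positive counts, stage three.
    """
    return p_country * p_cell * truncated_poisson_mean(lam)


# Hypothetical inputs for a single grid cell
prediction = expected_fatalities(p_country=0.6, p_cell=0.1, lam=2.0)
```

Separating the stages this way keeps each component interpretable: a low prediction can be traced back to either a low probability of conflict or a low conditional intensity.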
We propose a novel tie-oriented model for longitudinal event network data. The generating mechanism is assumed to be a multivariate Poisson process that governs the onset and repetition of yearly observed events with two separate intensity functions. We apply the model to a network obtained from the yearly dyadic number of international deliveries of combat aircraft between 1950 and 2017. Based on the trade gravity approach, we identify economic and political factors impeding or promoting the number of transfers. Extensive dynamics as well as country heterogeneities require the specification of semiparametric time-varying effects as well as random effects. Our findings reveal strong heterogeneous as well as time-varying effects of endogenous and exogenous covariates on the onset and repetition of aircraft trade events.
Given the growing number of available tools for modeling dynamic networks, the choice of a suitable model becomes central. The goal of this survey is to provide an overview of tie-oriented dynamic network models. The survey is focused on introducing binary network models with their corresponding assumptions, advantages, and shortfalls. The models are divided according to generating processes, operating in discrete and continuous time. First, we introduce the temporal exponential random graph model (TERGM) and the separable TERGM (STERGM), both being time-discrete models. These models are then contrasted with continuous process models, focusing on the relational event model (REM). We additionally show how the REM can handle time-clustered observations, that is, continuous-time data observed at discrete time points. Besides the discussion of theoretical properties and fitting procedures, we specifically focus on the application of the models to two networks that represent international arms transfers and email exchange, respectively. The data allow us to demonstrate the applicability and interpretation of the network models.
Prior work has determined domain similarity using text-based features of a corpus. However, when using pre-trained word embeddings, the underlying text corpus might not be accessible anymore. Therefore, we propose the CCA measure, a new measure of domain similarity based directly on the dimension-wise correlations between corresponding embedding spaces. Our results suggest that an inherent notion of domain can be captured this way, as we are able to reproduce our findings for different domain comparisons for English, German, Spanish and Czech as well as in cross-lingual comparisons. We further find a threshold at which the CCA measure indicates that two corpora come from the same domain in a monolingual setting by applying permutation tests. By evaluating the usability of the CCA measure in a domain adaptation application, we also show that it can be used to determine which corpora are more similar to each other in a cross-domain sentiment detection task.
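To illustrate how a correlation-based similarity between two embedding spaces might be computed, the sketch below averages the canonical correlations between two embedding matrices. This is one plausible reading of a "CCA measure" and the exact definition in the paper may differ. It assumes the rows of both matrices refer to the same vocabulary items and that both matrices have full column rank; all names are hypothetical.

```python
import numpy as np


def canonical_correlations(X, Y):
    """Canonical correlations between two embedding matrices.

    X, Y: (n_words, dim) arrays whose rows refer to the same vocabulary
          items (an assumption of this sketch); full column rank is assumed.
    """
    # Centre each dimension so that inner products correspond to covariances
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # Orthonormal bases for the two column spaces
    Qx, _ = np.linalg.qr(X)
    Qy, _ = np.linalg.qr(Y)
    # Singular values of Qx^T Qy are the canonical correlations
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)


def similarity_score(X, Y):
    """Mean canonical correlation, used here as an illustrative similarity score."""
    return float(canonical_correlations(X, Y).mean())


# Synthetic check: a linear re-mapping of the same space versus unrelated noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
A = np.eye(5) + 0.1 * rng.normal(size=(5, 5))
same_space = similarity_score(X, X @ A)
unrelated = similarity_score(X, rng.normal(size=(200, 5)))
```

Because canonical correlations are invariant to invertible linear maps of either space, `same_space` is close to 1 while `unrelated` is much smaller, which is the property that makes such a score usable without access to the underlying corpora.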