
Research Group Frauke Kreuter



Prof. Dr. Frauke Kreuter, Principal Investigator, Social Data Science and AI

Frauke Kreuter holds the Chair of Statistics and Data Science at LMU Munich.

Her research spans all matters of social data science, ranging from fairness in automated decision-making and the use of new data sources in the social sciences to multiple imputation methods, survey methodology, social NLP, and statistical training.

Team members @MCML

PostDocs

Dr. Malte Schierholz, Social Data Science and AI

PhD Students

Sarah Ball, Social Data Science and AI
Jacob Beck, Social Data Science and AI
Lisa Bondo Andersen, Social Data Science and AI
Unai Fischer Abaigar, Social Data Science and AI
Olga Kononykhina, Social Data Science and AI
Ailin Liu, Social Data Science and AI
Bolei Ma, Social Data Science and AI
Patrick Schenk, Social Data Science and AI
Jan Simson, Social Data Science and AI
Anna Steinberg, Social Data Science and AI
Leah von der Heyde, Social Data Science and AI

Recent News @MCML

23.02.2025
Digdeep Podcast: Why Do We Have Reason for Digital Optimism, Alex Mrozek?

09.02.2025
Digdeep Podcast: DeepSeek, OpenAI & Co – What’s Next for AI, Prof. Kristian Kersting?

23.01.2025
Digdeep Podcast: Live From Work Awesome Berlin With Metaplan and Burger King: Companies Are Spaces for Discourse

20.01.2025
Sarah Ball's Research Featured on the LLM Cybersecurity Podcast

07.01.2025
Digdeep Podcast: Goodbye 2024, Hello 2025!

Publications @MCML

2025


[36]
J. Simson, F. Draxler, S. Mehr and C. Kern.
Preventing Harmful Data Practices by Using Participatory Input to Navigate the Machine Learning Multiverse.
CHI 2025 - Conference on Human Factors in Computing Systems. Yokohama, Japan, Apr 26-May 01, 2025. To be published.
Abstract

In light of inherent trade-offs regarding fairness, privacy, interpretability and performance, as well as normative questions, the machine learning (ML) pipeline needs to be made accessible for public input, critical reflection and engagement of diverse stakeholders. In this work, we introduce a participatory approach to gather input from the general public on the design of an ML pipeline. We show how people’s input can be used to navigate and constrain the multiverse of decisions during both model development and evaluation. We highlight that central design decisions should be democratized rather than “optimized” to acknowledge their critical impact on the system’s output downstream. We describe the iterative development of our approach and its exemplary implementation on a citizen science platform. Our results demonstrate how public participation can inform critical design decisions along the model-building pipeline and combat widespread lazy data practices.

MCML Authors

Prof. Dr. Christoph Kern, Social Data Science and AI Lab


[35]
S. Ball, S. Allmendinger, F. Kreuter and N. Kühl.
Human Preferences in Large Language Model Latent Space: A Technical Analysis on the Reliability of Synthetic Data in Voting Outcome Prediction.
Preprint (Feb. 2025). arXiv
Abstract

Generative AI (GenAI) is increasingly used in survey contexts to simulate human preferences. While many research endeavors evaluate the quality of synthetic GenAI data by comparing model-generated responses to gold-standard survey results, fundamental questions about the validity and reliability of using LLMs as substitutes for human respondents remain. Our study provides a technical analysis of how demographic attributes and prompt variations influence latent opinion mappings in large language models (LLMs) and evaluates their suitability for survey-based predictions. Using 14 different models, we find that LLM-generated data fails to replicate the variance observed in real-world human responses, particularly across demographic subgroups. In the political space, persona-to-party mappings exhibit limited differentiation, resulting in synthetic data that lacks the nuanced distribution of opinions found in survey data. Moreover, we show that prompt sensitivity can significantly alter outputs for some models, further undermining the stability and predictiveness of LLM-based simulations. As a key contribution, we adapt a probe-based methodology that reveals how LLMs encode political affiliations in their latent space, exposing the systematic distortions introduced by these models. Our findings highlight critical limitations in AI-generated survey data, urging caution in its use for public opinion research, social science experimentation, and computational behavioral modeling.
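
The probe-based methodology mentioned in the abstract can be pictured with a small sketch: train a linear classifier on hidden states and test whether party affiliation is linearly decodable. Everything below (file names, layer choice, shapes) is an illustrative assumption, not the authors' code.

```python
# Hypothetical linear probe on LLM activations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumed pre-extracted hidden states for N persona prompts at one layer:
# activations: (N, d_model); party_labels: (N,) integer party IDs.
activations = np.load("activations_layer20.npy")   # hypothetical file
party_labels = np.load("party_labels.npy")         # hypothetical file

X_train, X_test, y_train, y_test = train_test_split(
    activations, party_labels, test_size=0.2, random_state=0
)

# If a simple linear classifier can read the party off the activations,
# the model encodes that affiliation in its latent space.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```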

MCML Authors

Prof. Dr. Frauke Kreuter, Social Data Science and AI


[34]
B. Ma, Y. Li, W. Zhou, Z. Gong, Y. J. Liu, K. Jasinskaja, A. Friedrich, J. Hirschberg, F. Kreuter and B. Plank.
Pragmatics in the Era of Large Language Models: A Survey on Datasets, Evaluation, Opportunities and Challenges.
Preprint (Feb. 2025). arXiv
Abstract

Understanding pragmatics-the use of language in context-is crucial for developing NLP systems capable of interpreting nuanced language use. Despite recent advances in language technologies, including large language models, evaluating their ability to handle pragmatic phenomena such as implicatures and references remains challenging. To advance pragmatic abilities in models, it is essential to understand current evaluation trends and identify existing limitations. In this survey, we provide a comprehensive review of resources designed for evaluating pragmatic capabilities in NLP, categorizing datasets by the pragmatics phenomena they address. We analyze task designs, data collection methods, evaluation approaches, and their relevance to real-world applications. By examining these resources in the context of modern language models, we highlight emerging trends, challenges, and gaps in existing benchmarks. Our survey aims to clarify the landscape of pragmatic evaluation and guide the development of more comprehensive and targeted benchmarks, ultimately contributing to more nuanced and context-aware NLP models.

MCML Authors

Yang Janet Liu, AI and Computational Linguistics
Prof. Dr. Frauke Kreuter, Social Data Science and AI
Prof. Dr. Barbara Plank, AI and Computational Linguistics


[33]
C. Wu, B. Ma, Y. Liu, Z. Zhang, N. Deng, Y. Li, B. Chen, Y. Zhang, B. Plank and Y. Xue.
M-ABSA: A Multilingual Dataset for Aspect-Based Sentiment Analysis.
Preprint (Feb. 2025). arXiv
Abstract

Aspect-based sentiment analysis (ABSA) is a crucial task in information extraction and sentiment analysis, aiming to identify aspects with associated sentiment elements in text. However, existing ABSA datasets are predominantly English-centric, limiting the scope for multilingual evaluation and research. To bridge this gap, we present M-ABSA, a comprehensive dataset spanning 7 domains and 21 languages, making it the most extensive multilingual parallel dataset for ABSA to date. Our primary focus is on triplet extraction, which involves identifying aspect terms, aspect categories, and sentiment polarities. The dataset is constructed through an automatic translation process with human review to ensure quality. We perform extensive experiments using various baselines to assess performance and compatibility on M-ABSA. Our empirical findings highlight that the dataset enables diverse evaluation tasks, such as multilingual and multi-domain transfer learning, and large language model evaluation, underscoring its inclusivity and its potential to drive advancements in multilingual ABSA research.

MCML Authors

Prof. Dr. Barbara Plank, AI and Computational Linguistics


[32]
J. Beck, L. M. Kemeter, K. Dürrbeck, M. H. I. Abdalla and F. Kreuter.
Toward Integrating ChatGPT Into Satellite Image Annotation Workflows: A Comparison of Label Quality and Costs of Human and Automated Annotators.
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 18 (Jan. 2025). DOI
Abstract

High-quality annotations are a critical success factor for machine learning (ML) applications. To achieve this, we have traditionally relied on human annotators, navigating the challenges of limited budgets and the varying task-specific expertise, costs, and availability. Since the emergence of Large Language Models (LLMs), their popularity for generating automated annotations has grown, extending possibilities and complexity of designing an efficient annotation strategy. Increasingly, computer vision capabilities have been integrated into general-purpose LLMs like ChatGPT. This raises the question of how effectively LLMs can be used in satellite image annotation tasks and how they compare to traditional annotator types. This study presents a comprehensive investigation and comparison of various human and automated annotators for image classification. We evaluate the feasibility and economic competitiveness of using the ChatGPT4-V model for a complex land usage annotation task and compare it with alternative human annotators. A set of satellite images is annotated by a domain expert and 15 additional human and automated annotators, differing in expertise and costs. Our analyses examine the annotation quality loss between the expert and other annotators. This comparison is conducted through (1) descriptive analyses, (2) fitting linear probability models, and (3) comparing F1-scores. Ultimately, we simulate annotation strategies where samples are split according to an automatically assigned certainty score. Routing low-certainty images to human annotators can cut total annotation costs by over 50% with minimal impact on label quality. We discuss implications regarding the economic competitiveness of annotation strategies, prompt engineering and the task-specificity of expertise.
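
The routing simulation at the end of the abstract is easy to picture: send images the automated annotator is uncertain about to humans and compare costs. A minimal sketch with made-up costs, threshold, and certainty scores rather than the paper's actual figures:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
certainty = rng.uniform(0, 1, n)        # assumed per-image certainty scores
llm_cost, human_cost = 0.01, 0.50       # assumed per-label costs

threshold = 0.4                          # route images below this to humans
to_human = certainty < threshold

total_cost = to_human.sum() * human_cost + (~to_human).sum() * llm_cost
baseline_cost = n * human_cost           # all-human annotation
print(f"hybrid cost: {total_cost:.2f} vs all-human: {baseline_cost:.2f} "
      f"({100 * (1 - total_cost / baseline_cost):.0f}% saved)")
```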

MCML Authors

Prof. Dr. Frauke Kreuter, Social Data Science and AI


[31]
S. Eckman, B. Ma, C. Kern, R. Chew, B. Plank and F. Kreuter.
Correcting Annotator Bias in Training Data: Population-Aligned Instance Replication (PAIR).
Preprint (Jan. 2025). arXiv
Abstract

Models trained on crowdsourced labels may not reflect broader population views when annotator pools are not representative. Since collecting representative labels is challenging, we propose Population-Aligned Instance Replication (PAIR), a method to address this bias through statistical adjustment. Using a simulation study of hate speech and offensive language detection, we create two types of annotators with different labeling tendencies and generate datasets with varying proportions of the types. Models trained on unbalanced annotator pools show poor calibration compared to those trained on representative data. However, PAIR, which duplicates labels from underrepresented annotator groups to match population proportions, significantly reduces bias without requiring new data collection. These results suggest statistical techniques from survey research can help align model training with target populations even when representative annotator pools are unavailable. We conclude with three practical recommendations for improving training data quality.
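
As a rough illustration of the replication step described above (not the authors' implementation), the following sketch duplicates labels from underrepresented annotator groups until group shares match assumed population targets; column names and proportions are invented:

```python
import pandas as pd

def pair_replicate(labels: pd.DataFrame, target_shares: dict) -> pd.DataFrame:
    """labels has one row per annotation with an 'annotator_group' column."""
    shares = labels["annotator_group"].value_counts(normalize=True)
    # Scale each group toward its target population share, keeping the most
    # overrepresented group as the reference at replication factor 1.
    factors = {g: target_shares[g] / shares[g] for g in target_shares}
    ref = min(factors.values())
    parts = []
    for g, f in factors.items():
        group_rows = labels[labels["annotator_group"] == g]
        reps = max(1, round(f / ref))    # integer replication factor
        parts.append(pd.concat([group_rows] * reps, ignore_index=True))
    return pd.concat(parts, ignore_index=True)

# Example: the annotator pool is 80/20 but the target population is 50/50.
df = pd.DataFrame({"annotator_group": ["A"] * 80 + ["B"] * 20,
                   "label": [0] * 80 + [1] * 20})
balanced = pair_replicate(df, {"A": 0.5, "B": 0.5})
print(balanced["annotator_group"].value_counts(normalize=True))  # ~50/50
```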

MCML Authors

Prof. Dr. Christoph Kern, Social Data Science and AI Lab
Prof. Dr. Barbara Plank, AI and Computational Linguistics
Prof. Dr. Frauke Kreuter, Social Data Science and AI


[30]
U. Fischer Abaigar, C. Kern and J. Perdomo.
The Value of Prediction in Identifying the Worst-Off.
Preprint (Jan. 2025). arXiv
Abstract

Machine learning is increasingly used in government programs to identify and support the most vulnerable individuals, prioritizing assistance for those at greatest risk over optimizing aggregate outcomes. This paper examines the welfare impacts of prediction in equity-driven contexts, and how they compare to other policy levers, such as expanding bureaucratic capacity. Through mathematical models and a real-world case study on long-term unemployment amongst German residents, we develop a comprehensive understanding of the relative effectiveness of prediction in surfacing the worst-off. Our findings provide clear analytical frameworks and practical, data-driven tools that empower policymakers to make principled decisions when designing these systems.

MCML Authors

Unai Fischer Abaigar, Social Data Science and AI
Prof. Dr. Christoph Kern, Social Data Science and AI Lab


[29]
O. Kononykhina, M. Schierholz and F. Kreuter.
The Impact of Question Framing on the Precision of Automatic Occupation Coding.
Preprint (Jan. 2025). arXiv
Abstract

Occupational data play a vital role in research, official statistics, and policymaking, yet their collection and accurate classification remain a persistent challenge. This study investigates the effects of occupational question wording on data variability and the performance of automatic coding tools. Through a series of survey experiments conducted and replicated in Germany, we tested two widely-used occupational question formats: one focusing on ‘job title’ (Berufsbezeichnung) and another on ‘occupational tasks’ (berufliche Tätigkeit). Our analysis reveals that automatic coding tools, such as CASCOT and OccuCoDe, exhibit significant sensitivity to the form and origin of the data. Specifically, these tools performed more efficiently when coding responses to the job title question format compared to the occupational task format. Additionally, we found that including examples of main tasks and duties in the questions led respondents to provide more detailed but less linguistically diverse responses. This reduced diversity may negatively affect the precision of automatic coding. These findings highlight the importance of tailoring automatic coding tools to the specific structure and origin of the data they are applied to. We emphasize the need for further research to optimize question design and coding tools for greater accuracy and applicability in occupational data collection.

MCML Authors

Olga Kononykhina, Social Data Science and AI
Dr. Malte Schierholz, Social Data Science and AI
Prof. Dr. Frauke Kreuter, Social Data Science and AI


2024


[28]
U. Fischer Abaigar, C. Kern, N. Barda and F. Kreuter.
Bridging the gap: Towards an expanded toolkit for AI-driven decision-making in the public sector.
Government Information Quarterly 41.4 (Dec. 2024). DOI
Abstract

AI-driven decision-making systems are becoming instrumental in the public sector, with applications spanning areas like criminal justice, social welfare, financial fraud detection, and public health. While these systems offer great potential benefits to institutional decision-making processes, such as improved efficiency and reliability, these systems face the challenge of aligning machine learning (ML) models with the complex realities of public sector decision-making. In this paper, we examine five key challenges where misalignment can occur, including distribution shifts, label bias, the influence of past decision-making on the data side, as well as competing objectives and human-in-the-loop on the model output side. Our findings suggest that standard ML methods often rely on assumptions that do not fully account for these complexities, potentially leading to unreliable and harmful predictions. To address this, we propose a shift in modeling efforts from focusing solely on predictive accuracy to improving decision-making outcomes. We offer guidance for selecting appropriate modeling frameworks, including counterfactual prediction and policy learning, by considering how the model estimand connects to the decision-maker’s utility. Additionally, we outline technical methods that address specific challenges within each modeling approach. Finally, we argue for the importance of external input from domain experts and stakeholders to ensure that model assumptions and design choices align with real-world policy objectives, taking a step towards harmonizing AI and public sector objectives.

MCML Authors

Unai Fischer Abaigar, Social Data Science and AI
Prof. Dr. Christoph Kern, Social Data Science and AI Lab
Prof. Dr. Frauke Kreuter, Social Data Science and AI


[27]
B. Ma, B. Yoztyurk, A.-C. Haensch, X. Wang, M. Herklotz, F. Kreuter, B. Plank and M. Aßenmacher.
Algorithmic Fidelity of Large Language Models in Generating Synthetic German Public Opinions: A Case Study.
Preprint (Dec. 2024). arXiv
Abstract

In recent research, large language models (LLMs) have been increasingly used to investigate public opinions. This study investigates the algorithmic fidelity of LLMs, i.e., the ability to replicate the socio-cultural context and nuanced opinions of human participants. Using open-ended survey data from the German Longitudinal Election Studies (GLES), we prompt different LLMs to generate synthetic public opinions reflective of German subpopulations by incorporating demographic features into the persona prompts. Our results show that Llama performs better than other LLMs at representing subpopulations, particularly when there is lower opinion diversity within those groups. Our findings further reveal that the LLM performs better for supporters of left-leaning parties like The Greens and The Left compared to other parties, and matches the least with the right-party AfD. Additionally, the inclusion or exclusion of specific variables in the prompts can significantly impact the models’ predictions. These findings underscore the importance of aligning LLMs to more effectively model diverse public opinions while minimizing political biases and enhancing robustness in representativeness.

MCML Authors

Xinpeng Wang, AI and Computational Linguistics
Prof. Dr. Frauke Kreuter, Social Data Science and AI
Prof. Dr. Barbara Plank, AI and Computational Linguistics
Dr. Matthias Aßenmacher, Statistical Learning and Data Science


[26]
B. Ma, X. Wang, T. Hu, A.-C. Haensch, M. A. Hedderich, B. Plank and F. Kreuter.
The Potential and Challenges of Evaluating Attitudes, Opinions, and Values in Large Language Models.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI
Abstract

Recent advances in Large Language Models (LLMs) have sparked wide interest in validating and comprehending the human-like cognitive-behavioral traits LLMs may capture and convey. These cognitive-behavioral traits include typically Attitudes, Opinions, Values (AOVs). However, measuring AOVs embedded within LLMs remains opaque, and different evaluation methods may yield different results. This has led to a lack of clarity on how different studies are related to each other and how they can be interpreted. This paper aims to bridge this gap by providing a comprehensive overview of recent works on the evaluation of AOVs in LLMs. Moreover, we survey related approaches in different stages of the evaluation pipeline in these works. By doing so, we address the potential and challenges with respect to understanding the model, human-AI alignment, and downstream application in social sciences. Finally, we provide practical insights into evaluation methods, model enhancement, and interdisciplinary collaboration, thereby contributing to the evolving landscape of evaluating AOVs in LLMs.

MCML Authors

Xinpeng Wang, AI and Computational Linguistics
Dr. Michael Hedderich, AI and Computational Linguistics
Prof. Dr. Barbara Plank, AI and Computational Linguistics
Prof. Dr. Frauke Kreuter, Social Data Science and AI


[25]
C. Kern, R. Bach, H. Mautner and F. Kreuter.
When Small Decisions Have Big Impact: Fairness Implications of Algorithmic Profiling Schemes.
ACM Journal on Responsible Computing (Nov. 2024). DOI
Abstract

Algorithmic profiling is increasingly used in the public sector with the hope of allocating limited public resources more effectively and objectively. One example is the prediction-based profiling of job seekers to guide the allocation of support measures by public employment services. However, empirical evaluations of potential side-effects such as unintended discrimination and fairness concerns are rare in this context. We systematically compare and evaluate statistical models for predicting job seekers’ risk of becoming long-term unemployed concerning subgroup prediction performance, fairness metrics, and vulnerabilities to data analysis decisions. Focusing on Germany as a use case, we evaluate profiling models under realistic conditions using large-scale administrative data. We show that despite achieving high prediction performance on average, profiling models can be considerably less accurate for vulnerable social subgroups. In this setting, different classification policies can have very different fairness implications. We therefore call for rigorous auditing processes before such models are put to practice.

MCML Authors

Prof. Dr. Christoph Kern, Social Data Science and AI Lab
Prof. Dr. Frauke Kreuter, Social Data Science and AI


[24]
X. Wang, C. Hu, B. Ma, P. Röttger and B. Plank.
Look at the Text: Instruction-Tuned Language Models are More Robust Multiple Choice Selectors than You Think.
COLM 2024 - Conference on Language Modeling. Philadelphia, PA, USA, Oct 07-09, 2024. PDF
Abstract

Multiple choice questions (MCQs) are commonly used to evaluate the capabilities of large language models (LLMs). One common way to evaluate the model response is to rank the candidate answers based on the log probability of the first token prediction. An alternative way is to examine the text output. Prior work has shown that first token probabilities lack robustness to changes in MCQ phrasing, and that first token probabilities do not match text answers for instruction-tuned models. Therefore, in this paper, we investigate the robustness of text answers. We show that the text answers are more robust to question perturbations than the first token probabilities, when the first token answers mismatch the text answers. The difference in robustness increases as the mismatch rate becomes greater. As the mismatch reaches over 50%, the text answer is more robust to option order changes than the debiased first token probabilities using state-of-the-art debiasing methods such as PriDe. Our findings provide further evidence for the benefits of text answer evaluation over first token probability evaluation.

MCML Authors

Xinpeng Wang, AI and Computational Linguistics
Prof. Dr. Barbara Plank, AI and Computational Linguistics


[23]
L. von der Heyde, A.-C. Haensch and A. Wenz.
United in Diversity? Contextual Biases in LLM-Based Predictions of the 2024 European Parliament Elections.
Preprint (Sep. 2024). arXiv
Abstract

Large language models (LLMs) are perceived by some as having the potential to revolutionize social science research, considering their training data includes information on human attitudes and behavior. If these attitudes are reflected in LLM output, LLM-generated ‘synthetic samples’ could be used as a viable and efficient alternative to surveys of real humans. However, LLM-synthetic samples might exhibit coverage bias due to training data and fine-tuning processes being unrepresentative of diverse linguistic, social, political, and digital contexts. In this study, we examine to what extent LLM-based predictions of public opinion exhibit context-dependent biases by predicting voting behavior in the 2024 European Parliament elections using a state-of-the-art LLM. We prompt GPT-4-Turbo with anonymized individual-level background information, varying prompt content and language, ask the LLM to predict each person’s voting behavior, and compare the weighted aggregates to the real election results. Our findings emphasize the limited applicability of LLM-synthetic samples to public opinion prediction. We show that (1) the LLM-based prediction of future voting behavior largely fails, (2) prediction accuracy is unequally distributed across national and linguistic contexts, and (3) improving LLM predictions requires detailed attitudinal information about individuals for prompting. In investigating the contextual differences of LLM-based predictions of public opinion, our research contributes to the understanding and mitigation of biases and inequalities in the development of LLMs and their applications in computational social science.
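
The prompting setup described above can be sketched roughly as follows. The persona wording, covariates, and client usage are illustrative assumptions; the study's exact prompts and pipeline are not reproduced here.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def persona_prompt(row: dict) -> str:
    # Invented template: real persona prompts encode many more covariates.
    return (
        f"You are a {row['age']}-year-old {row['gender']} from {row['country']} "
        f"with {row['education']} education. "
        "Which party would you vote for in the 2024 European Parliament "
        "election? Answer with the party name only."
    )

respondent = {"age": 42, "gender": "woman", "country": "Germany",
              "education": "vocational"}  # illustrative covariates

reply = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": persona_prompt(respondent)}],
)
print(reply.choices[0].message.content)
# Per-respondent predictions are then weighted, aggregated by country,
# and compared to the official election results.
```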

MCML Authors

Leah von der Heyde, Social Data Science and AI


[22]
X. Wang, B. Ma, C. Hu, L. Weber-Genzel, P. Röttger, F. Kreuter, D. Hovy and B. Plank.
My Answer is C: First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models.
ACL 2024 - Findings of the 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI
Abstract

The open-ended nature of language generation makes the evaluation of autoregressive large language models (LLMs) challenging. One common evaluation approach uses multiple-choice questions to limit the response space. The model is then evaluated by ranking the candidate answers by the log probability of the first token prediction. However, first-tokens may not consistently reflect the final response output, due to models’ diverse response styles such as starting with ‘Sure’ or refusing to answer. Consequently, first-token evaluation is not indicative of model behaviour when interacting with users. But by how much? We evaluate how aligned first-token evaluation is with the text output along several dimensions, namely final option choice, refusal rate, choice distribution and robustness under prompt perturbation. Our results show that the two approaches are severely misaligned on all dimensions, reaching mismatch rates over 60%. Models heavily fine-tuned on conversational or safety data are especially impacted. Crucially, models remain misaligned even when we increasingly constrain prompts, i.e., force them to start with an option letter or example template. Our findings i) underscore the importance of inspecting the text output as well and ii) caution against relying solely on first-token evaluation.
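
The two evaluation modes being compared can be illustrated with a small, assumption-laden sketch (placeholder model, two-option question); any Hugging Face causal LM would behave analogously:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder, not one of the models studied
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Question: Is the sky blue?\nA. yes\nB. no\nAnswer:"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]          # next-token logits

# First-token evaluation: rank options by the log probability of their letter.
option_ids = [tok.encode(" A")[0], tok.encode(" B")[0]]
first_token_choice = "AB"[int(torch.argmax(logits[option_ids]))]

# Text evaluation: generate and inspect the actual answer string.
out = model.generate(**inputs, max_new_tokens=10, do_sample=False)
text_answer = tok.decode(out[0, inputs["input_ids"].shape[1]:])

print("first-token choice:", first_token_choice)
print("text answer:", text_answer)
# The paper's point: these two modes can disagree, especially for
# instruction-tuned models that open with phrases like "Sure, ...".
```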

MCML Authors

Xinpeng Wang, AI and Computational Linguistics
Dr. Leon Weber-Genzel (former member)
Prof. Dr. Frauke Kreuter, Social Data Science and AI
Prof. Dr. Barbara Plank, AI and Computational Linguistics


[21]
B. Ma.
Evaluating Lexical Aspect with Large Language Models.
CMCL @ACL 2024 - Workshop on Cognitive Modeling and Computational Linguistics at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024). Bangkok, Thailand, Aug 11-16, 2024. DOI
Abstract

In this study, we explore the proficiency of large language models (LLMs) in understanding two key lexical aspects: duration (durative/stative) and telicity (telic/atelic). Through experiments on datasets featuring sentences, verbs, and verb positions, we prompt the LLMs to identify aspectual features of verbs in sentences. Our findings reveal that certain LLMs, particularly those closed-source ones, are able to capture information on duration and telicity, albeit with some performance variations and weaker results compared to the baseline. By employing prompts at three levels (sentence-only, sentence with verb, and sentence with verb and its position), we demonstrate that integrating verb information generally enhances performance in aspectual feature recognition, though it introduces instability. We call for future research to look deeper into methods aimed at optimizing LLMs for aspectual feature comprehension.

MCML Authors

Bolei Ma, Social Data Science and AI

[20]
A. Dimmelmeier, H. Doll, M. Schierholz, E. Kormanyos, M. Fehr, B. Ma, J. Beck, A. Fraser and F. Kreuter.
Informing climate risk analysis using textual information - A research agenda.
ClimateNLP @ACL 2024 - 1st Workshop on Natural Language Processing Meets Climate Change at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024). Bangkok, Thailand, Aug 11-16, 2024. DOI
Abstract

We present a research agenda focused on efficiently extracting, assuring quality, and consolidating textual company sustainability information to address urgent climate change decision-making needs. Starting from the goal to create integrated FAIR (Findable, Accessible, Interoperable, Reusable) climate-related data, we identify research needs pertaining to the technical aspects of information extraction as well as to the design of the integrated sustainability datasets that we seek to compile. Regarding extraction, we leverage technological advancements, particularly in large language models (LLMs) and Retrieval-Augmented Generation (RAG) pipelines, to unlock the underutilized potential of unstructured textual information contained in corporate sustainability reports. In applying these techniques, we review key challenges, which include the retrieval and extraction of CO2 emission values from PDF documents, especially from unstructured tables and graphs therein, and the validation of automatically extracted data through comparisons with human-annotated values. We also review how existing use cases and practices in climate risk analytics relate to choices of what textual information should be extracted and how it could be linked to existing structured data.

MCML Authors

Dr. Malte Schierholz, Social Data Science and AI
Prof. Dr. Alexander Fraser, Data Analytics & Statistics
Prof. Dr. Frauke Kreuter, Social Data Science and AI


[19]
S. Eckman, B. Plank and F. Kreuter.
Position: Insights from Survey Methodology can Improve Training Data.
ICML 2024 - 41st International Conference on Machine Learning. Vienna, Austria, Jul 21-27, 2024. URL
Abstract

Whether future AI models are fair, trustworthy, and aligned with the public’s interests rests in part on our ability to collect accurate data about what we want the models to do. However, collecting high-quality data is difficult, and few AI/ML researchers are trained in data collection methods. Recent research in data-centric AI has shown that higher quality training data leads to better performing models, making this the right moment to introduce AI/ML researchers to the field of survey methodology, the science of data collection. We summarize insights from the survey methodology literature and discuss how they can improve the quality of training and feedback data. We also suggest collaborative research ideas into how biases in data collection can be mitigated, making models more accurate and human-centric.

MCML Authors

Prof. Dr. Barbara Plank, AI and Computational Linguistics
Prof. Dr. Frauke Kreuter, Social Data Science and AI


[18]
U. Fischer Abaigar, C. Kern and F. Kreuter.
The Missing Link: Allocation Performance in Causal Machine Learning.
ICML 2024 - Workshop Humans, Algorithmic Decision-Making and Society: Modeling Interactions and Impact at the 41st International Conference on Machine Learning. Vienna, Austria, Jul 21-27, 2024. arXiv URL
Abstract

Automated decision-making (ADM) systems are being deployed across a diverse range of critical problem areas such as social welfare and healthcare. Recent work highlights the importance of causal ML models in ADM systems, but implementing them in complex social environments poses significant challenges. Research on how these challenges impact the performance in specific downstream decision-making tasks is limited. Addressing this gap, we make use of a comprehensive real-world dataset of jobseekers to illustrate how the performance of a single CATE model can vary significantly across different decision-making scenarios and highlight the differential influence of challenges such as distribution shifts on predictions and allocations.

MCML Authors

Unai Fischer Abaigar, Social Data Science and AI
Prof. Dr. Christoph Kern, Social Data Science and AI Lab
Prof. Dr. Frauke Kreuter, Social Data Science and AI


[17]
J. Simson, A. Fabris and C. Kern.
Unveiling the Blindspots: Examining Availability and Usage of Protected Attributes in Fairness Datasets.
EWAF 2024 - 3rd European Workshop on Algorithmic Fairness. Mainz, Germany, Jul 01-03, 2024. PDF
Abstract

This work examines the representation of protected attributes across tabular datasets used in algorithmic fairness research. Drawing from international human rights and anti-discrimination laws, we compile a set of protected attributes and investigate both their availability and usage in the literature. Our analysis reveals a significant underrepresentation of certain attributes in datasets that is exacerbated by a strong focus on race and sex in dataset usage. We identify a geographical bias towards the Global North, particularly North America, potentially limiting the applicability of fairness detection and mitigation strategies in less-represented regions. The study exposes critical blindspots in fairness research, highlighting the need for a more inclusive and representative approach to data collection and usage in the field. We propose a shift away from a narrow focus on a small number of datasets and advocate for initiatives aimed at sourcing more diverse and representative data.

MCML Authors

Prof. Dr. Christoph Kern, Social Data Science and AI Lab


[16]
L. von der Heyde, A.-C. Haensch and A. Wenz.
Vox Populi, Vox AI? Using Language Models to Estimate German Public Opinion.
Preprint (Jul. 2024). arXiv
Abstract

The recent development of large language models (LLMs) has spurred discussions about whether LLM-generated ‘synthetic samples’ could complement or replace traditional surveys, considering their training data potentially reflects attitudes and behaviors prevalent in the population. A number of mostly US-based studies have prompted LLMs to mimic survey respondents, with some of them finding that the responses closely match the survey data. However, several contextual factors related to the relationship between the respective target population and LLM training data might affect the generalizability of such findings. In this study, we investigate the extent to which LLMs can estimate public opinion in Germany, using the example of vote choice. We generate a synthetic sample of personas matching the individual characteristics of the 2017 German Longitudinal Election Study respondents. We ask the LLM GPT-3.5 to predict each respondent’s vote choice and compare these predictions to the survey-based estimates on the aggregate and subgroup levels. We find that GPT-3.5 does not predict citizens’ vote choice accurately, exhibiting a bias towards the Green and Left parties. While the LLM captures the tendencies of ’typical’ voter subgroups, such as partisans, it misses the multifaceted factors swaying individual voter choices. By examining the LLM-based prediction of voting behavior in a new context, our study contributes to the growing body of research about the conditions under which LLMs can be leveraged for studying public opinion. The findings point to disparities in opinion representation in LLMs and underscore the limitations in applying them for public opinion estimation.

MCML Authors

Leah von der Heyde, Social Data Science and AI


[15]
P. Resnik, B. Ma, A. Hoyle, P. Goel, R. Sarkar, M. Gearing, A.-C. Haensch and F. Kreuter.
TOPCAT: Topic-Oriented Protocol for Content Analysis of Text – A Preliminary Study.
NLP+CSS @NAACL 2024 - 6th Workshop on Natural Language Processing and Computational Social Science at the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024). Mexico City, Mexico, Jun 16-21, 2024. URL
Abstract

Identifying constructs in text data is a labor-intensive task in social science research. Despite the potential richness of open-ended survey responses, the complexity of analyzing them often leads researchers to underutilize or ignore them entirely. While topic modeling offers a technological solution, qualitative researchers may remain skeptical of its rigor. In this paper, we introduce TOPCAT: Topic-Oriented Protocol for Content Analysis of Text, a systematic approach that integrates off-the-shelf topic modeling with human decision-making and curation. Our method aims to provide a viable solution for topicalizing open-ended responses in survey research, ensuring both efficiency and trustworthiness. We present the TOPCAT protocol, define an evaluation process, and demonstrate its effectiveness using open-ended responses from a U.S. survey on COVID-19 impact. Our findings suggest that TOPCAT enables efficient and rigorous qualitative analysis, offering a promising avenue for future research in this domain. Furthermore, our findings challenge the adequacy of expert coding schemes as “gold” standards, emphasizing the subjectivity inherent in qualitative content interpretation.

MCML Authors

Prof. Dr. Frauke Kreuter, Social Data Science and AI


[14]
J. Simson, A. Fabris and C. Kern.
Lazy Data Practices Harm Fairness Research.
ACM FAccT 2024 - 7th ACM Conference on Fairness, Accountability, and Transparency. Rio de Janeiro, Brazil, Jun 03-06, 2024. DOI
Abstract

Data practices shape research and practice on fairness in machine learning (fair ML). Critical data studies offer important reflections and critiques for the responsible advancement of the field. In this work, we present a comprehensive analysis of fair ML datasets, demonstrating how unreflective yet common practices hinder the reach and reliability of algorithmic fairness findings. We systematically study protected information encoded in tabular datasets and their usage in 280 experiments across 142 publications. Our analyses identify three main areas of concern: (1) a lack of representation for certain protected attributes in both data and evaluations, (2) the widespread exclusion of minorities during data preprocessing, and (3) a lack of transparency about consequential yet overlooked dataset processing choices. We further note additional factors, such as limitations in publicly available data, privacy considerations and a general lack of awareness that further contribute to these issues. Through exemplary analyses on the usage of popular datasets, we demonstrate how opaque data choices significantly impact minorities, fairness metrics, and the resulting model comparison. To address these challenges, we propose a set of recommendations for data usage in fairness research centered on transparency and responsible inclusion. This study underscores the need for a critical reevaluation of data practices in fair ML and offers directions to improve both the sourcing and usage of datasets.

MCML Authors

Prof. Dr. Christoph Kern, Social Data Science and AI Lab


[13]
J. Simson, F. Pfisterer and C. Kern.
One Model Many Scores: Using Multiverse Analysis to Prevent Fairness Hacking and Evaluate the Influence of Model Design Decisions.
ACM FAccT 2024 - 7th ACM Conference on Fairness, Accountability, and Transparency. Rio de Janeiro, Brazil, Jun 03-06, 2024. DOI
Abstract

A vast number of systems across the world use algorithmic decision making (ADM) to (partially) automate decisions that have previously been made by humans. The downstream effects of ADM systems critically depend on the decisions made during a systems’ design, implementation, and evaluation, as biases in data can be mitigated or reinforced along the modeling pipeline. Many of these decisions are made implicitly, without knowing exactly how they will influence the final system. To study this issue, we draw on insights from the field of psychology and introduce the method of multiverse analysis for algorithmic fairness. In our proposed method, we turn implicit decisions during design and evaluation into explicit ones and demonstrate their fairness implications. By combining decisions, we create a grid of all possible “universes” of decision combinations. For each of these universes, we compute metrics of fairness and performance. Using the resulting dataset, one can investigate the variability and robustness of fairness scores and see how and which decisions impact fairness. We demonstrate how multiverse analyses can be used to better understand fairness implications of design and evaluation decisions using an exemplary case study of predicting public health care coverage for vulnerable populations. Our results highlight how decisions regarding the evaluation of a system can lead to vastly different fairness metrics for the same model. This is problematic, as a nefarious actor could optimise or “hack” a fairness metric to portray a discriminating model as fair merely by changing how it is evaluated. We illustrate how a multiverse analysis can help to address this issue.
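
The mechanics of a multiverse analysis are compact: enumerate the Cartesian product of the explicit pipeline decisions and score each resulting "universe". A bare-bones sketch with invented decision options and a stubbed evaluation function:

```python
from itertools import product

# Invented pipeline decisions; the paper's actual decision grid differs.
decisions = {
    "threshold_rule": ["0.5_fixed", "group_calibrated"],
    "missing_data":   ["drop_rows", "impute_mean"],
    "eval_subset":    ["all", "exclude_minors"],
}

def evaluate(universe: dict) -> float:
    """Stub: train/evaluate the model under these decisions and return a
    fairness metric (e.g., a demographic parity difference)."""
    ...

universes = [dict(zip(decisions, combo)) for combo in product(*decisions.values())]
print(f"{len(universes)} universes from {len(decisions)} decisions")
# results = {tuple(u.values()): evaluate(u) for u in universes}
# The spread of fairness scores across universes shows how much a single
# reported score depends on evaluation choices ("fairness hacking").
```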

MCML Authors

Prof. Dr. Christoph Kern, Social Data Science and AI Lab


[12]
S. Ball, F. Kreuter and N. Panickssery.
Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models.
Preprint (Jun. 2024). arXiv
Abstract

Conversational large language models are trained to refuse to answer harmful questions. However, emergent jailbreaking techniques can still elicit unsafe outputs, presenting an ongoing challenge for model alignment. To better understand how different jailbreak types circumvent safeguards, this paper analyses model activations on different jailbreak inputs. We find that it is possible to extract a jailbreak vector from a single class of jailbreaks that works to mitigate jailbreak effectiveness from other semantically-dissimilar classes. This may indicate that different kinds of effective jailbreaks operate via a similar internal mechanism. We investigate a potential common mechanism of harmfulness feature suppression, and find evidence that effective jailbreaks noticeably reduce a model’s perception of prompt harmfulness. These findings offer actionable insights for developing more robust jailbreak countermeasures and lay the groundwork for a deeper, mechanistic understanding of jailbreak dynamics in language models.
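
The abstract does not spell out how the jailbreak vector is computed; one common recipe for such steering vectors, shown here purely as an assumed sketch, is a difference of mean activations between jailbreak and plain prompts:

```python
import numpy as np

# Assumed pre-extracted residual-stream activations at one layer,
# each of shape (num_prompts, d_model). File names are hypothetical.
acts_jailbreak = np.load("acts_jailbreak_layer16.npy")
acts_plain = np.load("acts_plain_layer16.npy")

# Difference-of-means direction separating jailbroken from plain prompts.
jailbreak_vector = acts_jailbreak.mean(axis=0) - acts_plain.mean(axis=0)
np.save("jailbreak_vector.npy", jailbreak_vector)

# Subtracting this vector from activations at generation time (via a forward
# hook) is then tested for whether it mitigates other jailbreak classes.
```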

MCML Authors

Prof. Dr. Frauke Kreuter, Social Data Science and AI


[11]
B. Ma, E. Nie, S. Yuan, H. Schmid, M. Färber, F. Kreuter and H. Schütze.
ToPro: Token-Level Prompt Decomposition for Cross-Lingual Sequence Labeling Tasks.
EACL 2024 - 18th Conference of the European Chapter of the Association for Computational Linguistics. St. Julians, Malta, Mar 17-22, 2024. URL
Abstract

Prompt-based methods have been successfully applied to multilingual pretrained language models for zero-shot cross-lingual understanding. However, most previous studies primarily focused on sentence-level classification tasks, and only a few considered token-level labeling tasks such as Named Entity Recognition (NER) and Part-of-Speech (POS) tagging. In this paper, we propose Token-Level Prompt Decomposition (ToPro), which facilitates the prompt-based method for token-level sequence labeling tasks. The ToPro method decomposes an input sentence into single tokens and applies one prompt template to each token. Our experiments on multilingual NER and POS tagging datasets demonstrate that ToPro-based fine-tuning outperforms Vanilla fine-tuning and Prompt-Tuning in zero-shot cross-lingual transfer, especially for languages that are typologically different from the source language English. Our method also attains state-of-the-art performance when employed with the mT5 model. Besides, our exploratory study in multilingual large language models shows that ToPro performs much better than the current in-context learning method. Overall, the performance improvements show that ToPro could potentially serve as a novel and simple benchmarking method for sequence labeling tasks.
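
The core ToPro decomposition is simple to picture: instead of one prompt per sentence, emit one prompt per token. A toy sketch with an assumed template (real implementations use the model's tokenizer and task-specific label sets):

```python
def topro_prompts(sentence: str) -> list[str]:
    """One prompt per token; whitespace split stands in for real tokenization."""
    tokens = sentence.split()
    return [
        f'Sentence: "{sentence}"\nWhat is the entity label of "{tok}"?'
        for tok in tokens
    ]

for p in topro_prompts("Munich is in Germany"):
    print(p, end="\n\n")
# Each prompt is answered independently, yielding one label per token,
# which is how the method handles sequence labeling tasks (NER, POS).
```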

MCML Authors

Prof. Dr. Frauke Kreuter, Social Data Science and AI
Prof. Dr. Hinrich Schütze, Computational Linguistics


[10]
J. Beck, S. Eckman, B. Ma, R. Chew and F. Kreuter.
Order Effects in Annotation Tasks: Further Evidence of Annotation Sensitivity.
UncertaiNLP @EACL 2024 - 1st Workshop on Uncertainty-Aware NLP at the 18th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2024). St. Julians, Malta, Mar 17-22, 2024. URL
Abstract

The data-centric revolution in AI has revealed the importance of high-quality training data for developing successful AI models. However, annotations are sensitive to annotator characteristics, training materials, and to the design and wording of the data collection instrument. This paper explores the impact of observation order on annotations. We find that annotators’ judgments change based on the order in which they see observations. We use ideas from social psychology to motivate hypotheses about why this order effect occurs. We believe that insights from social science can help AI researchers improve data and model quality.

MCML Authors

Prof. Dr. Frauke Kreuter, Social Data Science and AI


[9]
E. Nie, S. Yuan, B. Ma, H. Schmid, M. Färber, F. Kreuter and H. Schütze.
Decomposed Prompting: Unveiling Multilingual Linguistic Structure Knowledge in English-Centric Large Language Models.
Preprint (Feb. 2024). arXiv
Abstract

Despite the predominance of English in their training data, English-centric Large Language Models (LLMs) like GPT-3 and LLaMA display a remarkable ability to perform multilingual tasks, raising questions about the depth and nature of their cross-lingual capabilities. This paper introduces the decomposed prompting approach to probe the linguistic structure understanding of these LLMs in sequence labeling tasks. Diverging from the single text-to-text prompt, our method generates for each token of the input sentence an individual prompt which asks for its linguistic label. We assess our method on the Universal Dependencies part-of-speech tagging dataset for 38 languages, utilizing both English-centric and multilingual LLMs. Our findings show that decomposed prompting surpasses the iterative prompting baseline in efficacy and efficiency under zero- and few-shot settings. Further analysis reveals the influence of evaluation methods and the use of instructions in prompts. Our multilingual investigation shows that English-centric language models perform better on average than multilingual models. Our study offers insights into the multilingual transferability of English-centric LLMs, contributing to the understanding of their multilingual linguistic knowledge.

MCML Authors

Prof. Dr. Frauke Kreuter, Social Data Science and AI
Prof. Dr. Hinrich Schütze, Computational Linguistics


2023


[8]
Z. Zhang, H. Yang, B. Ma, D. Rügamer and E. Nie.
Baby's CoThought: Leveraging Large Language Models for Enhanced Reasoning in Compact Models.
CoNLL 2023 - BabyLM Challenge at 27th Conference on Computational Natural Language Learning. Singapore, Dec 06-10, 2023. DOI GitHub
Abstract

Large Language Models (LLMs) demonstrate remarkable performance on a variety of natural language understanding (NLU) tasks, primarily due to their in-context learning ability. This ability could be applied to building babylike models, i.e. models at small scales, improving training efficiency. In this paper, we propose a ‘CoThought’ pipeline, which efficiently trains smaller ‘baby’ language models (BabyLMs) by leveraging the Chain of Thought prompting of LLMs. Our pipeline restructures a dataset of less than 100M in size using GPT-3.5-turbo, transforming it into task-oriented, human-readable texts that are comparable to the school texts for language learners. The BabyLM is then pretrained on this restructured dataset in a RoBERTa fashion. In evaluations across 4 benchmarks, our BabyLM outperforms the vanilla RoBERTa in 10 linguistic, NLU, and question-answering tasks by more than 3 points, showing a superior ability to extract contextual information. These results suggest that compact LMs pretrained on small, LLM-restructured data can better understand tasks and achieve improved performance.

MCML Authors

Prof. Dr. David Rügamer, Statistics, Data Science and Machine Learning


[7]
T. Kaufmann, S. Ball, J. Beck, E. Hüllermeier and F. Kreuter.
On the challenges and practices of reinforcement learning from real human feedback.
HLDM @ECML-PKDD 2023 - 1st Workshop on Hybrid Human-Machine Learning and Decision Making at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD 2023). Turin, Italy, Sep 18-22, 2023. DOI
Abstract

Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning (RL) that does not require an engineered reward function but instead learns from human feedback. Due to its increasing popularity, various authors have studied how to learn an accurate reward model from only few samples, making optimal use of this feedback. Because of the cost and complexity of user studies, however, this research is often conducted with synthetic human feedback. Such feedback can be generated by evaluating behavior based on ground-truth rewards which are available for some benchmark tasks. While this setting can help evaluate some aspects of RLHF, it differs from practical settings in which synthetic feedback is not available. Working with real human feedback brings additional challenges that cannot be observed with synthetic feedback, including fatigue, inter-rater inconsistencies, delay, misunderstandings, and modality-dependent difficulties. We describe and discuss some of these challenges together with current practices and opportunities for further research in this paper.

MCML Authors

Timo Kaufmann, Artificial Intelligence and Machine Learning
Prof. Dr. Eyke Hüllermeier, Artificial Intelligence and Machine Learning
Prof. Dr. Frauke Kreuter, Social Data Science and AI


[6]
B. Ma, E. Nie, H. Schmid and H. Schütze.
Is Prompt-Based Finetuning Always Better than Vanilla Finetuning? Insights from Cross-Lingual Language Understanding.
KONVENS 2023 - 19th Conference on Natural Language Processing. Ingolstadt, Germany, Sep 18-22, 2023. URL
Abstract

Multilingual pretrained language models (MPLMs) have demonstrated substantial performance improvements in zero-shot cross-lingual transfer across various natural language understanding tasks by finetuning MPLMs on task-specific labelled data of a source language (e.g. English) and evaluating on a wide range of target languages. Recent studies show that prompt-based finetuning surpasses regular finetuning in few-shot scenarios. However, the exploration of prompt-based learning in multilingual tasks remains limited. In this study, we propose the PROFIT pipeline to investigate the cross-lingual capabilities of Prompt-based Finetuning. We conduct comprehensive experiments on diverse cross-lingual language understanding tasks (sentiment classification, paraphrase identification, and natural language inference) and empirically analyze the variation trends of prompt-based finetuning performance in cross-lingual transfer across different few-shot and full-data settings. Our results reveal the effectiveness and versatility of prompt-based finetuning in cross-lingual language understanding. Our findings indicate that prompt-based finetuning outperforms vanilla finetuning in full-data scenarios and exhibits greater advantages in few-shot scenarios, with different performance patterns dependent on task types. Additionally, we analyze underlying factors such as language similarity and pretraining data size that impact the cross-lingual performance of prompt-based finetuning. Overall, our work provides valuable insights into the cross-lingual prowess of prompt-based finetuning.

MCML Authors

Prof. Dr. Hinrich Schütze, Computational Linguistics


[5]
M. Trappmann, G.-C. Haas, S. Malich, F. Keusch, S. Bähr, F. Kreuter and S. Schwarz.
Augmenting survey data with digital trace data: Is there a threat to panel retention?
Journal of Survey Statistics and Methodology 11.3 (Jun. 2023). DOI
Abstract

Linking digital trace data to existing panel survey data may increase the overall analysis potential of the data. However, producing linked products often requires additional engagement from survey participants through consent or participation in additional tasks. Panel operators may worry that such additional requests may backfire and lead to lower panel retention, reducing the analysis potential of the data. To examine these concerns, we conducted an experiment in the German PASS panel survey after wave 11. Three quarters of panelists (n = 4,293) were invited to install a research app and to provide sensor data over a period of 6 months, while one quarter (n = 1,428) did not receive an invitation. We find that the request to install a smartphone app and share data significantly decreases panel retention in the wave immediately following the invitation by 3.3 percentage points. However, this effect wears off and is no longer significant in the second and third waves after the invitation. We conclude that researchers who run panel surveys have to take moderate negative effects on retention into account but that the potential gain likely outweighs these moderate losses.

MCML Authors

Prof. Dr. Frauke Kreuter, Social Data Science and AI


[4]
I. Ziegler, B. Ma, B. Bischl, E. Dorigatti and B. Schubert.
Proteasomal cleavage prediction: state-of-the-art and future directions.
Preprint (2023). DOI GitHub
Abstract

Epitope vaccines are a promising approach for precision treatment of pathogens, cancer, autoimmune diseases, and allergies. Effectively designing such vaccines requires accurate proteasomal cleavage prediction to ensure that the epitopes included in the vaccine trigger an immune response. The performance of proteasomal cleavage predictors has been steadily improving over the past decades owing to increasing data availability and methodological advances. In this review, we summarize the current proteasomal cleavage prediction landscape and, in light of recent progress in the field of deep learning, develop and compare a wide range of recent architectures and techniques, including long short-term memory (LSTM), transformers, and convolutional neural networks (CNN), as well as four different denoising techniques. All open-source cleavage predictors re-trained on our dataset performed within two AUC percentage points. Our comprehensive deep learning architecture benchmark improved performance by 1.7 AUC percentage points, while closed-source predictors performed considerably worse. We found that a wide range of architectures and training regimes all result in very similar performance, suggesting that the specific modeling approach employed has a limited impact on predictive performance compared to the specifics of the dataset employed. We speculate that the noise and implicit nature of data acquisition techniques used for training proteasomal cleavage prediction models and the complexity of biological processes of the antigen processing pathway are the major limiting factors. While biological complexity can be tackled by more data and, to a lesser extent, better models, noise and randomness inherently limit the maximum achievable predictive performance.

MCML Authors

Prof. Dr. Bernd Bischl, Statistical Learning and Data Science


2022


[3]
I. Ziegler, B. Ma, E. Nie, B. Bischl, D. Rügamer, B. Schubert and E. Dorigatti.
What cleaves? Is proteasomal cleavage prediction reaching a ceiling?
LMRL @NeurIPS 2022 - Workshop on Learning Meaningful Representations of Life at the 36th Conference on Neural Information Processing Systems (NeurIPS 2022). New Orleans, LA, USA, Nov 28-Dec 09, 2022. URL
Abstract

Epitope vaccines are a promising direction to enable precision treatment for cancer, autoimmune diseases, and allergies. Effectively designing such vaccines requires accurate prediction of proteasomal cleavage in order to ensure that the epitopes in the vaccine are presented to T cells by the major histocompatibility complex (MHC). While direct identification of proteasomal cleavage in vitro is cumbersome and low throughput, it is possible to implicitly infer cleavage events from the termini of MHC-presented epitopes, which can be detected in large amounts thanks to recent advances in high-throughput MHC ligandomics. Inferring cleavage events in such a way provides an inherently noisy signal which can be tackled with new developments in the field of deep learning that supposedly make it possible to learn predictors from noisy labels. Inspired by such innovations, we sought to modernize proteasomal cleavage predictors by benchmarking a wide range of recent methods, including LSTMs, transformers, CNNs, and denoising methods, on a recently introduced cleavage dataset. We found that increasing model scale and complexity appeared to deliver limited performance gains, as several methods reached about 88.5% AUC on C-terminal and 79.5% AUC on N-terminal cleavage prediction. This suggests that the noise and/or complexity of proteasomal cleavage and the subsequent biological processes of the antigen processing pathway are the major limiting factors for predictive performance rather than the specific modeling approach used. While biological complexity can be tackled by more data and better models, noise and randomness inherently limit the maximum achievable predictive performance.

MCML Authors

Prof. Dr. Bernd Bischl, Statistical Learning and Data Science
Prof. Dr. David Rügamer, Statistics, Data Science and Machine Learning


[2]
K. E. Riehm, E. Badillo Goicoechea, F. M. Wang, E. Kim, L. R. Aldridge, C. P. Lupton-Smith, R. Presskreischer, T.-H. Chang, S. LaRocca, F. Kreuter and E. A. Stuart.
Association of Non-Pharmaceutical Interventions to Reduce the Spread of SARS-CoV-2 With Anxiety and Depressive Symptoms: A Multi-National Study of 43 Countries.
International Journal of Public Health 67 (Mar. 2022). DOI
Abstract

Objectives: To examine the association of non-pharmaceutical interventions (NPIs) with anxiety and depressive symptoms among adults and determine if these associations varied by gender and age.
Methods: We combined survey data from 16,177,184 adults from 43 countries who participated in the daily COVID-19 Trends and Impact Survey via Facebook with time-varying NPI data from the Oxford COVID-19 Government Response Tracker between 24 April 2020 and 20 December 2020. Using logistic regression models, we examined the association of [1] overall NPI stringency and [2] seven individual NPIs (school closures, workplace closures, cancellation of public events, restrictions on the size of gatherings, stay-at-home requirements, restrictions on internal movement, and international travel controls) with anxiety and depressive symptoms.
Results: More stringent implementation of NPIs was associated with a higher odds of anxiety and depressive symptoms, albeit with very small effect sizes. Individual NPIs had heterogeneous associations with anxiety and depressive symptoms by gender and age.
Conclusion: Governments worldwide should be prepared to address the possible mental health consequences of stringent NPI implementation with both universal and targeted interventions for vulnerable groups.

MCML Authors

Prof. Dr. Frauke Kreuter, Social Data Science and AI


[1]
R. Valliant, J. A. Dever, F. Kreuter and G. Zipf.
Package ‘PracTools’.
2022. URL
Abstract

Functions and datasets to support Valliant, Dever, and Kreuter (2018), ‘Practical Tools for Designing and Weighting Survey Samples’. Contains functions for sample size calculation for survey samples using stratified or clustered one-, two-, and three-stage sample designs, and single-stage audit sample designs. Functions are included that will group geographic units accounting for distances apart and measures of size. Other functions compute variance components for multistage designs and sample sizes in two-phase designs. A number of example data sets are included.
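
PracTools itself is an R package, so none of the code below comes from it; purely as a flavor of the kind of computation it supports (it provides analogous functions such as nProp()), here is the textbook sample-size formula for estimating a proportion, with a finite population correction:

```python
import math

def sample_size_proportion(p: float, e: float, N: int | None = None,
                           z: float = 1.96) -> int:
    """n for estimating a proportion p within margin of error e (95% CI)."""
    n0 = z**2 * p * (1 - p) / e**2          # infinite-population size
    if N is not None:
        n0 = n0 / (1 + (n0 - 1) / N)        # finite population correction
    return math.ceil(n0)

print(sample_size_proportion(p=0.5, e=0.03))           # ~1068
print(sample_size_proportion(p=0.5, e=0.03, N=5000))   # smaller with FPC
```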

MCML Authors

Prof. Dr. Frauke Kreuter, Social Data Science and AI