
Research Group Frauke Kreuter



Prof. Dr. Frauke Kreuter, Principal Investigator, Social Data Science and AI

Frauke Kreuter holds the Chair of Statistics and Data Science at LMU Munich.

Her research spans all matters of social data science, ranging from fairness in automated decision-making and the use of new data sources in the social sciences to multiple imputation methods, survey methodology, social NLP, and statistical training.

Team members @MCML

PostDocs

Dr. Malte Schierholz, Social Data Science and AI

PhD Students

Sarah Ball, Social Data Science and AI
Jacob Beck, Social Data Science and AI
Lisa Bondo Andersen, Social Data Science and AI
Unai Fischer Abaigar, Social Data Science and AI
Olga Kononykhina, Social Data Science and AI
Ailin Liu, Social Data Science and AI
Bolei Ma, Social Data Science and AI
Patrick Schenk, Social Data Science and AI
Jan Simson, Social Data Science and AI
Anna Steinberg, Social Data Science and AI
Leah von der Heyde, Social Data Science and AI

Recent News @MCML

23.02.2025
Digdeep Podcast: Why Do We Have Reason for Digital Optimism, Alex Mrozek?

09.02.2025
Digdeep Podcast: DeepSeek, OpenAI & Co – What’s Next for AI, Prof. Kristian Kersting?

23.01.2025
Digdeep Podcast: Live From Work Awesome Berlin With Metaplan and Burger King: Companies Are Spaces for Discourse

20.01.2025
Sarah Ball's Research Featured on the LLM Cybersecurity Podcast

07.01.2025
Digdeep Podcast: Goodbye 2024, Hello 2025!

Publications @MCML

2025


[36]
J. Simson, F. Draxler, S. Mehr and C. Kern.
Preventing Harmful Data Practices by Using Participatory Input to Navigate the Machine Learning Multiverse.
CHI 2025 - Conference on Human Factors in Computing Systems. Yokohama, Japan, Apr 26-May 01, 2025. To be published.
Abstract

In light of inherent trade-offs regarding fairness, privacy, interpretability and performance, as well as normative questions, the machine learning (ML) pipeline needs to be made accessible for public input, critical reflection and engagement of diverse stakeholders. In this work, we introduce a participatory approach to gather input from the general public on the design of an ML pipeline. We show how people’s input can be used to navigate and constrain the multiverse of decisions during both model development and evaluation. We highlight that central design decisions should be democratized rather than “optimized” to acknowledge their critical impact on the system’s output downstream. We describe the iterative development of our approach and its exemplary implementation on a citizen science platform. Our results demonstrate how public participation can inform critical design decisions along the model-building pipeline and combat widespread lazy data practices.

MCML Authors

Prof. Dr. Christoph Kern, Social Data Science and AI Lab


[35]
S. Ball, S. Allmendinger, F. Kreuter and N. Kühl.
Human Preferences in Large Language Model Latent Space: A Technical Analysis on the Reliability of Synthetic Data in Voting Outcome Prediction.
Preprint (Feb. 2025). arXiv
Abstract

Generative AI (GenAI) is increasingly used in survey contexts to simulate human preferences. While many research endeavors evaluate the quality of synthetic GenAI data by comparing model-generated responses to gold-standard survey results, fundamental questions about the validity and reliability of using LLMs as substitutes for human respondents remain. Our study provides a technical analysis of how demographic attributes and prompt variations influence latent opinion mappings in large language models (LLMs) and evaluates their suitability for survey-based predictions. Using 14 different models, we find that LLM-generated data fails to replicate the variance observed in real-world human responses, particularly across demographic subgroups. In the political space, persona-to-party mappings exhibit limited differentiation, resulting in synthetic data that lacks the nuanced distribution of opinions found in survey data. Moreover, we show that prompt sensitivity can significantly alter outputs for some models, further undermining the stability and predictiveness of LLM-based simulations. As a key contribution, we adapt a probe-based methodology that reveals how LLMs encode political affiliations in their latent space, exposing the systematic distortions introduced by these models. Our findings highlight critical limitations in AI-generated survey data, urging caution in its use for public opinion research, social science experimentation, and computational behavioral modeling.
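
The probe-based methodology mentioned in the abstract can be pictured with a small sketch: train a linear classifier on hidden states and test whether party affiliation is linearly decodable. Everything below (file names, layer choice, shapes) is an illustrative assumption, not the authors' code.

```python
# Hypothetical linear probe on LLM activations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumed pre-extracted hidden states for N persona prompts at one layer:
# activations: (N, d_model); party_labels: (N,) integer party IDs.
activations = np.load("activations_layer20.npy")   # hypothetical file
party_labels = np.load("party_labels.npy")         # hypothetical file

X_train, X_test, y_train, y_test = train_test_split(
    activations, party_labels, test_size=0.2, random_state=0
)

# If a simple linear classifier can read the party off the activations,
# the model encodes that affiliation in its latent space.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```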

MCML Authors

Prof. Dr. Frauke Kreuter, Social Data Science and AI


[34]
B. Ma, Y. Li, W. Zhou, Z. Gong, Y. J. Liu, K. Jasinskaja, A. Friedrich, J. Hirschberg, F. Kreuter and B. Plank.
Pragmatics in the Era of Large Language Models: A Survey on Datasets, Evaluation, Opportunities and Challenges.
Preprint (Feb. 2025). arXiv
Abstract

Understanding pragmatics-the use of language in context-is crucial for developing NLP systems capable of interpreting nuanced language use. Despite recent advances in language technologies, including large language models, evaluating their ability to handle pragmatic phenomena such as implicatures and references remains challenging. To advance pragmatic abilities in models, it is essential to understand current evaluation trends and identify existing limitations. In this survey, we provide a comprehensive review of resources designed for evaluating pragmatic capabilities in NLP, categorizing datasets by the pragmatics phenomena they address. We analyze task designs, data collection methods, evaluation approaches, and their relevance to real-world applications. By examining these resources in the context of modern language models, we highlight emerging trends, challenges, and gaps in existing benchmarks. Our survey aims to clarify the landscape of pragmatic evaluation and guide the development of more comprehensive and targeted benchmarks, ultimately contributing to more nuanced and context-aware NLP models.

MCML Authors

Yang Janet Liu, AI and Computational Linguistics
Prof. Dr. Frauke Kreuter, Social Data Science and AI
Prof. Dr. Barbara Plank, AI and Computational Linguistics


[33]
C. Wu, B. Ma, Y. Liu, Z. Zhang, N. Deng, Y. Li, B. Chen, Y. Zhang, B. Plank and Y. Xue.
M-ABSA: A Multilingual Dataset for Aspect-Based Sentiment Analysis.
Preprint (Feb. 2025). arXiv
Abstract

Aspect-based sentiment analysis (ABSA) is a crucial task in information extraction and sentiment analysis, aiming to identify aspects with associated sentiment elements in text. However, existing ABSA datasets are predominantly English-centric, limiting the scope for multilingual evaluation and research. To bridge this gap, we present M-ABSA, a comprehensive dataset spanning 7 domains and 21 languages, making it the most extensive multilingual parallel dataset for ABSA to date. Our primary focus is on triplet extraction, which involves identifying aspect terms, aspect categories, and sentiment polarities. The dataset is constructed through an automatic translation process with human review to ensure quality. We perform extensive experiments using various baselines to assess performance and compatibility on M-ABSA. Our empirical findings highlight that the dataset enables diverse evaluation tasks, such as multilingual and multi-domain transfer learning, and large language model evaluation, underscoring its inclusivity and its potential to drive advancements in multilingual ABSA research.

MCML Authors

Prof. Dr. Barbara Plank, AI and Computational Linguistics


[32]
J. Beck, L. M. Kemeter, K. Dürrbeck, M. H. I. Abdalla and F. Kreuter.
Toward Integrating ChatGPT Into Satellite Image Annotation Workflows: A Comparison of Label Quality and Costs of Human and Automated Annotators.
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 18 (Jan. 2025). DOI
Abstract

High-quality annotations are a critical success factor for machine learning (ML) applications. To achieve this, we have traditionally relied on human annotators, navigating the challenges of limited budgets and the varying task-specific expertise, costs, and availability. Since the emergence of Large Language Models (LLMs), their popularity for generating automated annotations has grown, extending possibilities and complexity of designing an efficient annotation strategy. Increasingly, computer vision capabilities have been integrated into general-purpose LLMs like ChatGPT. This raises the question of how effectively LLMs can be used in satellite image annotation tasks and how they compare to traditional annotator types. This study presents a comprehensive investigation and comparison of various human and automated annotators for image classification. We evaluate the feasibility and economic competitiveness of using the ChatGPT4-V model for a complex land usage annotation task and compare it with alternative human annotators. A set of satellite images is annotated by a domain expert and 15 additional human and automated annotators, differing in expertise and costs. Our analyses examine the annotation quality loss between the expert and other annotators. This comparison is conducted through (1) descriptive analyses, (2) fitting linear probability models, and (3) comparing F1-scores. Ultimately, we simulate annotation strategies where samples are split according to an automatically assigned certainty score. Routing low-certainty images to human annotators can cut total annotation costs by over 50% with minimal impact on label quality. We discuss implications regarding the economic competitiveness of annotation strategies, prompt engineering and the task-specificity of expertise.
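
The routing simulation at the end of the abstract is easy to picture: send images the automated annotator is uncertain about to humans and compare costs. A minimal sketch with made-up costs, threshold, and certainty scores rather than the paper's actual figures:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
certainty = rng.uniform(0, 1, n)        # assumed per-image certainty scores
llm_cost, human_cost = 0.01, 0.50       # assumed per-label costs

threshold = 0.4                          # route images below this to humans
to_human = certainty < threshold

total_cost = to_human.sum() * human_cost + (~to_human).sum() * llm_cost
baseline_cost = n * human_cost           # all-human annotation
print(f"hybrid cost: {total_cost:.2f} vs all-human: {baseline_cost:.2f} "
      f"({100 * (1 - total_cost / baseline_cost):.0f}% saved)")
```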

MCML Authors

Prof. Dr. Frauke Kreuter, Social Data Science and AI


[31]
S. Eckman, B. Ma, C. Kern, R. Chew, B. Plank and F. Kreuter.
Correcting Annotator Bias in Training Data: Population-Aligned Instance Replication (PAIR).
Preprint (Jan. 2025). arXiv
Abstract

Models trained on crowdsourced labels may not reflect broader population views when annotator pools are not representative. Since collecting representative labels is challenging, we propose Population-Aligned Instance Replication (PAIR), a method to address this bias through statistical adjustment. Using a simulation study of hate speech and offensive language detection, we create two types of annotators with different labeling tendencies and generate datasets with varying proportions of the types. Models trained on unbalanced annotator pools show poor calibration compared to those trained on representative data. However, PAIR, which duplicates labels from underrepresented annotator groups to match population proportions, significantly reduces bias without requiring new data collection. These results suggest statistical techniques from survey research can help align model training with target populations even when representative annotator pools are unavailable. We conclude with three practical recommendations for improving training data quality.
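
As a rough illustration of the replication step described above (not the authors' implementation), the following sketch duplicates labels from underrepresented annotator groups until group shares match assumed population targets; column names and proportions are invented:

```python
import pandas as pd

def pair_replicate(labels: pd.DataFrame, target_shares: dict) -> pd.DataFrame:
    """labels has one row per annotation with an 'annotator_group' column."""
    shares = labels["annotator_group"].value_counts(normalize=True)
    # Scale each group toward its target population share, keeping the most
    # overrepresented group as the reference at replication factor 1.
    factors = {g: target_shares[g] / shares[g] for g in target_shares}
    ref = min(factors.values())
    parts = []
    for g, f in factors.items():
        group_rows = labels[labels["annotator_group"] == g]
        reps = max(1, round(f / ref))    # integer replication factor
        parts.append(pd.concat([group_rows] * reps, ignore_index=True))
    return pd.concat(parts, ignore_index=True)

# Example: the annotator pool is 80/20 but the target population is 50/50.
df = pd.DataFrame({"annotator_group": ["A"] * 80 + ["B"] * 20,
                   "label": [0] * 80 + [1] * 20})
balanced = pair_replicate(df, {"A": 0.5, "B": 0.5})
print(balanced["annotator_group"].value_counts(normalize=True))  # ~50/50
```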

MCML Authors

Prof. Dr. Christoph Kern, Social Data Science and AI Lab
Prof. Dr. Barbara Plank, AI and Computational Linguistics
Prof. Dr. Frauke Kreuter, Social Data Science and AI


[30]
U. Fischer Abaigar, C. Kern and J. Perdomo.
The Value of Prediction in Identifying the Worst-Off.
Preprint (Jan. 2025). arXiv
Abstract

Machine learning is increasingly used in government programs to identify and support the most vulnerable individuals, prioritizing assistance for those at greatest risk over optimizing aggregate outcomes. This paper examines the welfare impacts of prediction in equity-driven contexts, and how they compare to other policy levers, such as expanding bureaucratic capacity. Through mathematical models and a real-world case study on long-term unemployment amongst German residents, we develop a comprehensive understanding of the relative effectiveness of prediction in surfacing the worst-off. Our findings provide clear analytical frameworks and practical, data-driven tools that empower policymakers to make principled decisions when designing these systems.

MCML Authors

Unai Fischer Abaigar, Social Data Science and AI
Prof. Dr. Christoph Kern, Social Data Science and AI Lab


[29]
O. Kononykhina, M. Schierholz and F. Kreuter.
The Impact of Question Framing on the Precision of Automatic Occupation Coding.
Preprint (Jan. 2025). arXiv
Abstract

Occupational data play a vital role in research, official statistics, and policymaking, yet their collection and accurate classification remain a persistent challenge. This study investigates the effects of occupational question wording on data variability and the performance of automatic coding tools. Through a series of survey experiments conducted and replicated in Germany, we tested two widely-used occupational question formats: one focusing on ‘job title’ (Berufsbezeichnung) and another on ‘occupational tasks’ (berufliche Tätigkeit). Our analysis reveals that automatic coding tools, such as CASCOT and OccuCoDe, exhibit significant sensitivity to the form and origin of the data. Specifically, these tools performed more efficiently when coding responses to the job title question format compared to the occupational task format. Additionally, we found that including examples of main tasks and duties in the questions led respondents to provide more detailed but less linguistically diverse responses. This reduced diversity may negatively affect the precision of automatic coding. These findings highlight the importance of tailoring automatic coding tools to the specific structure and origin of the data they are applied to. We emphasize the need for further research to optimize question design and coding tools for greater accuracy and applicability in occupational data collection.

MCML Authors

Olga Kononykhina, Social Data Science and AI
Dr. Malte Schierholz, Social Data Science and AI
Prof. Dr. Frauke Kreuter, Social Data Science and AI


2024


[28]
U. Fischer Abaigar, C. Kern, N. Barda and F. Kreuter.
Bridging the gap: Towards an expanded toolkit for AI-driven decision-making in the public sector.
Government Information Quarterly 41.4 (Dec. 2024). DOI
Abstract

AI-driven decision-making systems are becoming instrumental in the public sector, with applications spanning areas like criminal justice, social welfare, financial fraud detection, and public health. While these systems offer great potential benefits to institutional decision-making processes, such as improved efficiency and reliability, these systems face the challenge of aligning machine learning (ML) models with the complex realities of public sector decision-making. In this paper, we examine five key challenges where misalignment can occur, including distribution shifts, label bias, the influence of past decision-making on the data side, as well as competing objectives and human-in-the-loop on the model output side. Our findings suggest that standard ML methods often rely on assumptions that do not fully account for these complexities, potentially leading to unreliable and harmful predictions. To address this, we propose a shift in modeling efforts from focusing solely on predictive accuracy to improving decision-making outcomes. We offer guidance for selecting appropriate modeling frameworks, including counterfactual prediction and policy learning, by considering how the model estimand connects to the decision-maker’s utility. Additionally, we outline technical methods that address specific challenges within each modeling approach. Finally, we argue for the importance of external input from domain experts and stakeholders to ensure that model assumptions and design choices align with real-world policy objectives, taking a step towards harmonizing AI and public sector objectives.

MCML Authors

Unai Fischer Abaigar, Social Data Science and AI
Prof. Dr. Christoph Kern, Social Data Science and AI Lab
Prof. Dr. Frauke Kreuter, Social Data Science and AI


[27]
B. Ma, B. Yoztyurk, A.-C. Haensch, X. Wang, M. Herklotz, F. Kreuter, B. Plank and M. Aßenmacher.
Algorithmic Fidelity of Large Language Models in Generating Synthetic German Public Opinions: A Case Study.
Preprint (Dec. 2024). arXiv
Abstract

In recent research, large language models (LLMs) have been increasingly used to investigate public opinions. This study investigates the algorithmic fidelity of LLMs, i.e., the ability to replicate the socio-cultural context and nuanced opinions of human participants. Using open-ended survey data from the German Longitudinal Election Studies (GLES), we prompt different LLMs to generate synthetic public opinions reflective of German subpopulations by incorporating demographic features into the persona prompts. Our results show that Llama performs better than other LLMs at representing subpopulations, particularly when there is lower opinion diversity within those groups. Our findings further reveal that the LLM performs better for supporters of left-leaning parties like The Greens and The Left compared to other parties, and matches the least with the right-party AfD. Additionally, the inclusion or exclusion of specific variables in the prompts can significantly impact the models’ predictions. These findings underscore the importance of aligning LLMs to more effectively model diverse public opinions while minimizing political biases and enhancing robustness in representativeness.

MCML Authors

Xinpeng Wang, AI and Computational Linguistics
Prof. Dr. Frauke Kreuter, Social Data Science and AI
Prof. Dr. Barbara Plank, AI and Computational Linguistics
Dr. Matthias Aßenmacher, Statistical Learning and Data Science


[26]
B. Ma, X. Wang, T. Hu, A.-C. Haensch, M. A. Hedderich, B. Plank and F. Kreuter.
The Potential and Challenges of Evaluating Attitudes, Opinions, and Values in Large Language Models.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI
Abstract

Recent advances in Large Language Models (LLMs) have sparked wide interest in validating and comprehending the human-like cognitive-behavioral traits LLMs may capture and convey. These cognitive-behavioral traits include typically Attitudes, Opinions, Values (AOVs). However, measuring AOVs embedded within LLMs remains opaque, and different evaluation methods may yield different results. This has led to a lack of clarity on how different studies are related to each other and how they can be interpreted. This paper aims to bridge this gap by providing a comprehensive overview of recent works on the evaluation of AOVs in LLMs. Moreover, we survey related approaches in different stages of the evaluation pipeline in these works. By doing so, we address the potential and challenges with respect to understanding the model, human-AI alignment, and downstream application in social sciences. Finally, we provide practical insights into evaluation methods, model enhancement, and interdisciplinary collaboration, thereby contributing to the evolving landscape of evaluating AOVs in LLMs.

MCML Authors

Xinpeng Wang, AI and Computational Linguistics
Dr. Michael Hedderich, AI and Computational Linguistics
Prof. Dr. Barbara Plank, AI and Computational Linguistics
Prof. Dr. Frauke Kreuter, Social Data Science and AI


[25]
C. Kern, R. Bach, H. Mautner and F. Kreuter.
When Small Decisions Have Big Impact: Fairness Implications of Algorithmic Profiling Schemes.
ACM Journal on Responsible Computing (Nov. 2024). DOI
Abstract

Algorithmic profiling is increasingly used in the public sector with the hope of allocating limited public resources more effectively and objectively. One example is the prediction-based profiling of job seekers to guide the allocation of support measures by public employment services. However, empirical evaluations of potential side-effects such as unintended discrimination and fairness concerns are rare in this context. We systematically compare and evaluate statistical models for predicting job seekers’ risk of becoming long-term unemployed concerning subgroup prediction performance, fairness metrics, and vulnerabilities to data analysis decisions. Focusing on Germany as a use case, we evaluate profiling models under realistic conditions using large-scale administrative data. We show that despite achieving high prediction performance on average, profiling models can be considerably less accurate for vulnerable social subgroups. In this setting, different classification policies can have very different fairness implications. We therefore call for rigorous auditing processes before such models are put to practice.

MCML Authors

Prof. Dr. Christoph Kern, Social Data Science and AI Lab
Prof. Dr. Frauke Kreuter, Social Data Science and AI


[24]
X. Wang, C. Hu, B. Ma, P. Röttger and B. Plank.
Look at the Text: Instruction-Tuned Language Models are More Robust Multiple Choice Selectors than You Think.
COLM 2024 - Conference on Language Modeling. Philadelphia, PA, USA, Oct 07-09, 2024. PDF
Abstract

Multiple choice questions (MCQs) are commonly used to evaluate the capabilities of large language models (LLMs). One common way to evaluate the model response is to rank the candidate answers based on the log probability of the first token prediction. An alternative way is to examine the text output. Prior work has shown that first token probabilities lack robustness to changes in MCQ phrasing, and that first token probabilities do not match text answers for instruction-tuned models. Therefore, in this paper, we investigate the robustness of text answers. We show that the text answers are more robust to question perturbations than the first token probabilities, when the first token answers mismatch the text answers. The difference in robustness increases as the mismatch rate becomes greater. As the mismatch reaches over 50%, the text answer is more robust to option order changes than the debiased first token probabilities using state-of-the-art debiasing methods such as PriDe. Our findings provide further evidence for the benefits of text answer evaluation over first token probability evaluation.

MCML Authors

Xinpeng Wang, AI and Computational Linguistics
Prof. Dr. Barbara Plank, AI and Computational Linguistics


[23]
L. von der Heyde, A.-C. Haensch and A. Wenz.
United in Diversity? Contextual Biases in LLM-Based Predictions of the 2024 European Parliament Elections.
Preprint (Sep. 2024). arXiv
Abstract

Large language models (LLMs) are perceived by some as having the potential to revolutionize social science research, considering their training data includes information on human attitudes and behavior. If these attitudes are reflected in LLM output, LLM-generated ‘synthetic samples’ could be used as a viable and efficient alternative to surveys of real humans. However, LLM-synthetic samples might exhibit coverage bias due to training data and fine-tuning processes being unrepresentative of diverse linguistic, social, political, and digital contexts. In this study, we examine to what extent LLM-based predictions of public opinion exhibit context-dependent biases by predicting voting behavior in the 2024 European Parliament elections using a state-of-the-art LLM. We prompt GPT-4-Turbo with anonymized individual-level background information, varying prompt content and language, ask the LLM to predict each person’s voting behavior, and compare the weighted aggregates to the real election results. Our findings emphasize the limited applicability of LLM-synthetic samples to public opinion prediction. We show that (1) the LLM-based prediction of future voting behavior largely fails, (2) prediction accuracy is unequally distributed across national and linguistic contexts, and (3) improving LLM predictions requires detailed attitudinal information about individuals for prompting. In investigating the contextual differences of LLM-based predictions of public opinion, our research contributes to the understanding and mitigation of biases and inequalities in the development of LLMs and their applications in computational social science.
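
The prompting setup described above can be sketched roughly as follows. The persona wording, covariates, and client usage are illustrative assumptions; the study's exact prompts and pipeline are not reproduced here.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def persona_prompt(row: dict) -> str:
    # Invented template: real persona prompts encode many more covariates.
    return (
        f"You are a {row['age']}-year-old {row['gender']} from {row['country']} "
        f"with {row['education']} education. "
        "Which party would you vote for in the 2024 European Parliament "
        "election? Answer with the party name only."
    )

respondent = {"age": 42, "gender": "woman", "country": "Germany",
              "education": "vocational"}  # illustrative covariates

reply = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": persona_prompt(respondent)}],
)
print(reply.choices[0].message.content)
# Per-respondent predictions are then weighted, aggregated by country,
# and compared to the official election results.
```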

MCML Authors

Leah von der Heyde, Social Data Science and AI


[22]
X. Wang, B. Ma, C. Hu, L. Weber-Genzel, P. Röttger, F. Kreuter, D. Hovy and B. Plank.
My Answer is C: First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models.
ACL 2024 - Findings of the 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI
Abstract

The open-ended nature of language generation makes the evaluation of autoregressive large language models (LLMs) challenging. One common evaluation approach uses multiple-choice questions to limit the response space. The model is then evaluated by ranking the candidate answers by the log probability of the first token prediction. However, first-tokens may not consistently reflect the final response output, due to models’ diverse response styles such as starting with ‘Sure’ or refusing to answer. Consequently, first-token evaluation is not indicative of model behaviour when interacting with users. But by how much? We evaluate how aligned first-token evaluation is with the text output along several dimensions, namely final option choice, refusal rate, choice distribution and robustness under prompt perturbation. Our results show that the two approaches are severely misaligned on all dimensions, reaching mismatch rates over 60%. Models heavily fine-tuned on conversational or safety data are especially impacted. Crucially, models remain misaligned even when we increasingly constrain prompts, i.e., force them to start with an option letter or example template. Our findings i) underscore the importance of inspecting the text output as well and ii) caution against relying solely on first-token evaluation.
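
The two evaluation modes being compared can be illustrated with a small, assumption-laden sketch (placeholder model, two-option question); any Hugging Face causal LM would behave analogously:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder, not one of the models studied
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Question: Is the sky blue?\nA. yes\nB. no\nAnswer:"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]          # next-token logits

# First-token evaluation: rank options by the log probability of their letter.
option_ids = [tok.encode(" A")[0], tok.encode(" B")[0]]
first_token_choice = "AB"[int(torch.argmax(logits[option_ids]))]

# Text evaluation: generate and inspect the actual answer string.
out = model.generate(**inputs, max_new_tokens=10, do_sample=False)
text_answer = tok.decode(out[0, inputs["input_ids"].shape[1]:])

print("first-token choice:", first_token_choice)
print("text answer:", text_answer)
# The paper's point: these two modes can disagree, especially for
# instruction-tuned models that open with phrases like "Sure, ...".
```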

MCML Authors

Xinpeng Wang, AI and Computational Linguistics
Dr. Leon Weber-Genzel (former member)
Prof. Dr. Frauke Kreuter, Social Data Science and AI
Prof. Dr. Barbara Plank, AI and Computational Linguistics


[21]
B. Ma.
Evaluating Lexical Aspect with Large Language Models.
CMCL @ACL 2024 - Workshop on Cognitive Modeling and Computational Linguistics at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024). Bangkok, Thailand, Aug 11-16, 2024. DOI
Abstract

In this study, we explore the proficiency of large language models (LLMs) in understanding two key lexical aspects: duration (durative/stative) and telicity (telic/atelic). Through experiments on datasets featuring sentences, verbs, and verb positions, we prompt the LLMs to identify aspectual features of verbs in sentences. Our findings reveal that certain LLMs, particularly those closed-source ones, are able to capture information on duration and telicity, albeit with some performance variations and weaker results compared to the baseline. By employing prompts at three levels (sentence-only, sentence with verb, and sentence with verb and its position), we demonstrate that integrating verb information generally enhances performance in aspectual feature recognition, though it introduces instability. We call for future research to look deeper into methods aimed at optimizing LLMs for aspectual feature comprehension.

MCML Authors

Bolei Ma, Social Data Science and AI

[20]
A. Dimmelmeier, H. Doll, M. Schierholz, E. Kormanyos, M. Fehr, B. Ma, J. Beck, A. Fraser and F. Kreuter.
Informing climate risk analysis using textual information - A research agenda.
ClimateNLP @ACL 2024 - 1st Workshop on Natural Language Processing Meets Climate Change at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024). Bangkok, Thailand, Aug 11-16, 2024. DOI
Abstract

We present a research agenda focused on efficiently extracting, assuring quality, and consolidating textual company sustainability information to address urgent climate change decision-making needs. Starting from the goal to create integrated FAIR (Findable, Accessible, Interoperable, Reusable) climate-related data, we identify research needs pertaining to the technical aspects of information extraction as well as to the design of the integrated sustainability datasets that we seek to compile. Regarding extraction, we leverage technological advancements, particularly in large language models (LLMs) and Retrieval-Augmented Generation (RAG) pipelines, to unlock the underutilized potential of unstructured textual information contained in corporate sustainability reports. In applying these techniques, we review key challenges, which include the retrieval and extraction of CO2 emission values from PDF documents, especially from unstructured tables and graphs therein, and the validation of automatically extracted data through comparisons with human-annotated values. We also review how existing use cases and practices in climate risk analytics relate to choices of what textual information should be extracted and how it could be linked to existing structured data.

MCML Authors

Dr. Malte Schierholz, Social Data Science and AI
Prof. Dr. Alexander Fraser, Data Analytics & Statistics
Prof. Dr. Frauke Kreuter, Social Data Science and AI


[19]
S. Eckman, B. Plank and F. Kreuter.
Position: Insights from Survey Methodology can Improve Training Data.
ICML 2024 - 41st International Conference on Machine Learning. Vienna, Austria, Jul 21-27, 2024. URL
Abstract

Whether future AI models are fair, trustworthy, and aligned with the public’s interests rests in part on our ability to collect accurate data about what we want the models to do. However, collecting high-quality data is difficult, and few AI/ML researchers are trained in data collection methods. Recent research in data-centric AI has shown that higher quality training data leads to better performing models, making this the right moment to introduce AI/ML researchers to the field of survey methodology, the science of data collection. We summarize insights from the survey methodology literature and discuss how they can improve the quality of training and feedback data. We also suggest collaborative research ideas into how biases in data collection can be mitigated, making models more accurate and human-centric.

MCML Authors

Prof. Dr. Barbara Plank, AI and Computational Linguistics
Prof. Dr. Frauke Kreuter, Social Data Science and AI


[18]
U. Fischer Abaigar, C. Kern and F. Kreuter.
The Missing Link: Allocation Performance in Causal Machine Learning.
ICML 2024 - Workshop Humans, Algorithmic Decision-Making and Society: Modeling Interactions and Impact at the 41st International Conference on Machine Learning. Vienna, Austria, Jul 21-27, 2024. arXiv URL
Abstract

Automated decision-making (ADM) systems are being deployed across a diverse range of critical problem areas such as social welfare and healthcare. Recent work highlights the importance of causal ML models in ADM systems, but implementing them in complex social environments poses significant challenges. Research on how these challenges impact the performance in specific downstream decision-making tasks is limited. Addressing this gap, we make use of a comprehensive real-world dataset of jobseekers to illustrate how the performance of a single CATE model can vary significantly across different decision-making scenarios and highlight the differential influence of challenges such as distribution shifts on predictions and allocations.

MCML Authors

Unai Fischer Abaigar, Social Data Science and AI
Prof. Dr. Christoph Kern, Social Data Science and AI Lab
Prof. Dr. Frauke Kreuter, Social Data Science and AI


[17]
J. Simson, A. Fabris and C. Kern.
Unveiling the Blindspots: Examining Availability and Usage of Protected Attributes in Fairness Datasets.
EWAF 2024 - 3rd European Workshop on Algorithmic Fairness. Mainz, Germany, Jul 01-03, 2024. PDF
Abstract

This work examines the representation of protected attributes across tabular datasets used in algorithmic fairness research. Drawing from international human rights and anti-discrimination laws, we compile a set of protected attributes and investigate both their availability and usage in the literature. Our analysis reveals a significant underrepresentation of certain attributes in datasets that is exacerbated by a strong focus on race and sex in dataset usage. We identify a geographical bias towards the Global North, particularly North America, potentially limiting the applicability of fairness detection and mitigation strategies in less-represented regions. The study exposes critical blindspots in fairness research, highlighting the need for a more inclusive and representative approach to data collection and usage in the field. We propose a shift away from a narrow focus on a small number of datasets and advocate for initiatives aimed at sourcing more diverse and representative data.

MCML Authors

Prof. Dr. Christoph Kern, Social Data Science and AI Lab


[16]
L. von der Heyde, A.-C. Haensch and A. Wenz.
Vox Populi, Vox AI? Using Language Models to Estimate German Public Opinion.
Preprint (Jul. 2024). arXiv
Abstract

The recent development of large language models (LLMs) has spurred discussions about whether LLM-generated ‘synthetic samples’ could complement or replace traditional surveys, considering their training data potentially reflects attitudes and behaviors prevalent in the population. A number of mostly US-based studies have prompted LLMs to mimic survey respondents, with some of them finding that the responses closely match the survey data. However, several contextual factors related to the relationship between the respective target population and LLM training data might affect the generalizability of such findings. In this study, we investigate the extent to which LLMs can estimate public opinion in Germany, using the example of vote choice. We generate a synthetic sample of personas matching the individual characteristics of the 2017 German Longitudinal Election Study respondents. We ask the LLM GPT-3.5 to predict each respondent’s vote choice and compare these predictions to the survey-based estimates on the aggregate and subgroup levels. We find that GPT-3.5 does not predict citizens’ vote choice accurately, exhibiting a bias towards the Green and Left parties. While the LLM captures the tendencies of ’typical’ voter subgroups, such as partisans, it misses the multifaceted factors swaying individual voter choices. By examining the LLM-based prediction of voting behavior in a new context, our study contributes to the growing body of research about the conditions under which LLMs can be leveraged for studying public opinion. The findings point to disparities in opinion representation in LLMs and underscore the limitations in applying them for public opinion estimation.

MCML Authors

Leah von der Heyde, Social Data Science and AI


[15]
P. Resnik, B. Ma, A. Hoyle, P. Goel, R. Sarkar, M. Gearing, A.-C. Haensch and F. Kreuter.
TOPCAT: Topic-Oriented Protocol for Content Analysis of Text – A Preliminary Study.
NLP+CSS @NAACL 2024 - 6th Workshop on Natural Language Processing and Computational Social Science at the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024). Mexico City, Mexico, Jun 16-21, 2024. URL
Abstract

Identifying constructs in text data is a labor-intensive task in social science research. Despite the potential richness of open-ended survey responses, the complexity of analyzing them often leads researchers to underutilize or ignore them entirely. While topic modeling offers a technological solution, qualitative researchers may remain skeptical of its rigor. In this paper, we introduce TOPCAT: Topic-Oriented Protocol for Content Analysis of Text, a systematic approach that integrates off-the-shelf topic modeling with human decision-making and curation. Our method aims to provide a viable solution for topicalizing open-ended responses in survey research, ensuring both efficiency and trustworthiness. We present the TOPCAT protocol, define an evaluation process, and demonstrate its effectiveness using open-ended responses from a U.S. survey on COVID-19 impact. Our findings suggest that TOPCAT enables efficient and rigorous qualitative analysis, offering a promising avenue for future research in this domain. Furthermore, our findings challenge the adequacy of expert coding schemes as “gold” standards, emphasizing the subjectivity inherent in qualitative content interpretation.

MCML Authors

Prof. Dr. Frauke Kreuter, Social Data Science and AI


[14]
J. Simson, A. Fabris and C. Kern.
Lazy Data Practices Harm Fairness Research.
ACM FAccT 2024 - 7th ACM Conference on Fairness, Accountability, and Transparency. Rio de Janeiro, Brazil, Jun 03-06, 2024. DOI
Abstract

Data practices shape research and practice on fairness in machine learning (fair ML). Critical data studies offer important reflections and critiques for the responsible advancement of the field. In this work, we present a comprehensive analysis of fair ML datasets, demonstrating how unreflective yet common practices hinder the reach and reliability of algorithmic fairness findings. We systematically study protected information encoded in tabular datasets and their usage in 280 experiments across 142 publications. Our analyses identify three main areas of concern: (1) a lack of representation for certain protected attributes in both data and evaluations, (2) the widespread exclusion of minorities during data preprocessing, and (3) a lack of transparency about consequential yet overlooked dataset processing choices. We further note additional factors, such as limitations in publicly available data, privacy considerations and a general lack of awareness that further contribute to these issues. Through exemplary analyses on the usage of popular datasets, we demonstrate how opaque data choices significantly impact minorities, fairness metrics, and the resulting model comparison. To address these challenges, we propose a set of recommendations for data usage in fairness research centered on transparency and responsible inclusion. This study underscores the need for a critical reevaluation of data practices in fair ML and offers directions to improve both the sourcing and usage of datasets.

MCML Authors

Prof. Dr. Christoph Kern, Social Data Science and AI Lab


[13]
J. Simson, F. Pfisterer and C. Kern.
One Model Many Scores: Using Multiverse Analysis to Prevent Fairness Hacking and Evaluate the Influence of Model Design Decisions.
ACM FAccT 2024 - 7th ACM Conference on Fairness, Accountability, and Transparency. Rio de Janeiro, Brazil, Jun 03-06, 2024. DOI
Abstract

A vast number of systems across the world use algorithmic decision making (ADM) to (partially) automate decisions that have previously been made by humans. The downstream effects of ADM systems critically depend on the decisions made during a systems’ design, implementation, and evaluation, as biases in data can be mitigated or reinforced along the modeling pipeline. Many of these decisions are made implicitly, without knowing exactly how they will influence the final system. To study this issue, we draw on insights from the field of psychology and introduce the method of multiverse analysis for algorithmic fairness. In our proposed method, we turn implicit decisions during design and evaluation into explicit ones and demonstrate their fairness implications. By combining decisions, we create a grid of all possible “universes” of decision combinations. For each of these universes, we compute metrics of fairness and performance. Using the resulting dataset, one can investigate the variability and robustness of fairness scores and see how and which decisions impact fairness. We demonstrate how multiverse analyses can be used to better understand fairness implications of design and evaluation decisions using an exemplary case study of predicting public health care coverage for vulnerable populations. Our results highlight how decisions regarding the evaluation of a system can lead to vastly different fairness metrics for the same model. This is problematic, as a nefarious actor could optimise or “hack” a fairness metric to portray a discriminating model as fair merely by changing how it is evaluated. We illustrate how a multiverse analysis can help to address this issue.
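
The mechanics of a multiverse analysis are compact: enumerate the Cartesian product of the explicit pipeline decisions and score each resulting "universe". A bare-bones sketch with invented decision options and a stubbed evaluation function:

```python
from itertools import product

# Invented pipeline decisions; the paper's actual decision grid differs.
decisions = {
    "threshold_rule": ["0.5_fixed", "group_calibrated"],
    "missing_data":   ["drop_rows", "impute_mean"],
    "eval_subset":    ["all", "exclude_minors"],
}

def evaluate(universe: dict) -> float:
    """Stub: train/evaluate the model under these decisions and return a
    fairness metric (e.g., a demographic parity difference)."""
    ...

universes = [dict(zip(decisions, combo)) for combo in product(*decisions.values())]
print(f"{len(universes)} universes from {len(decisions)} decisions")
# results = {tuple(u.values()): evaluate(u) for u in universes}
# The spread of fairness scores across universes shows how much a single
# reported score depends on evaluation choices ("fairness hacking").
```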

MCML Authors

Prof. Dr. Christoph Kern, Social Data Science and AI Lab


[12]
S. Ball, F. Kreuter and N. Panickssery.
Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models.
Preprint (Jun. 2024). arXiv
Abstract

Conversational large language models are trained to refuse to answer harmful questions. However, emergent jailbreaking techniques can still elicit unsafe outputs, presenting an ongoing challenge for model alignment. To better understand how different jailbreak types circumvent safeguards, this paper analyses model activations on different jailbreak inputs. We find that it is possible to extract a jailbreak vector from a single class of jailbreaks that works to mitigate jailbreak effectiveness from other semantically-dissimilar classes. This may indicate that different kinds of effective jailbreaks operate via a similar internal mechanism. We investigate a potential common mechanism of harmfulness feature suppression, and find evidence that effective jailbreaks noticeably reduce a model’s perception of prompt harmfulness. These findings offer actionable insights for developing more robust jailbreak countermeasures and lay the groundwork for a deeper, mechanistic understanding of jailbreak dynamics in language models.
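
The abstract does not spell out how the jailbreak vector is computed; one common recipe for such steering vectors, shown here purely as an assumed sketch, is a difference of mean activations between jailbreak and plain prompts:

```python
import numpy as np

# Assumed pre-extracted residual-stream activations at one layer,
# each of shape (num_prompts, d_model). File names are hypothetical.
acts_jailbreak = np.load("acts_jailbreak_layer16.npy")
acts_plain = np.load("acts_plain_layer16.npy")

# Difference-of-means direction separating jailbroken from plain prompts.
jailbreak_vector = acts_jailbreak.mean(axis=0) - acts_plain.mean(axis=0)
np.save("jailbreak_vector.npy", jailbreak_vector)

# Subtracting this vector from activations at generation time (via a forward
# hook) is then tested for whether it mitigates other jailbreak classes.
```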

MCML Authors

Prof. Dr. Frauke Kreuter, Social Data Science and AI


[11]
B. Ma, E. Nie, S. Yuan, H. Schmid, M. Färber, F. Kreuter and H. Schütze.
ToPro: Token-Level Prompt Decomposition for Cross-Lingual Sequence Labeling Tasks.
EACL 2024 - 18th Conference of the European Chapter of the Association for Computational Linguistics. St. Julians, Malta, Mar 17-22, 2024. URL
Abstract

Prompt-based methods have been successfully applied to multilingual pretrained language models for zero-shot cross-lingual understanding. However, most previous studies primarily focused on sentence-level classification tasks, and only a few considered token-level labeling tasks such as Named Entity Recognition (NER) and Part-of-Speech (POS) tagging. In this paper, we propose Token-Level Prompt Decomposition (ToPro), which facilitates the prompt-based method for token-level sequence labeling tasks. The ToPro method decomposes an input sentence into single tokens and applies one prompt template to each token. Our experiments on multilingual NER and POS tagging datasets demonstrate that ToPro-based fine-tuning outperforms Vanilla fine-tuning and Prompt-Tuning in zero-shot cross-lingual transfer, especially for languages that are typologically different from the source language English. Our method also attains state-of-the-art performance when employed with the mT5 model. Besides, our exploratory study in multilingual large language models shows that ToPro performs much better than the current in-context learning method. Overall, the performance improvements show that ToPro could potentially serve as a novel and simple benchmarking method for sequence labeling tasks.
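
The core ToPro decomposition is simple to picture: instead of one prompt per sentence, emit one prompt per token. A toy sketch with an assumed template (real implementations use the model's tokenizer and task-specific label sets):

```python
def topro_prompts(sentence: str) -> list[str]:
    """One prompt per token; whitespace split stands in for real tokenization."""
    tokens = sentence.split()
    return [
        f'Sentence: "{sentence}"\nWhat is the entity label of "{tok}"?'
        for tok in tokens
    ]

for p in topro_prompts("Munich is in Germany"):
    print(p, end="\n\n")
# Each prompt is answered independently, yielding one label per token,
# which is how the method handles sequence labeling tasks (NER, POS).
```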

MCML Authors

Prof. Dr. Frauke Kreuter, Social Data Science and AI
Prof. Dr. Hinrich Schütze, Computational Linguistics


[10]
J. Beck, S. Eckman, B. Ma, R. Chew and F. Kreuter.
Order Effects in Annotation Tasks: Further Evidence of Annotation Sensitivity.
UncertaiNLP @EACL 2024 - 1st Workshop on Uncertainty-Aware NLP at the 18th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2024). St. Julians, Malta, Mar 17-22, 2024. URL
Abstract

The data-centric revolution in AI has revealed the importance of high-quality training data for developing successful AI models. However, annotations are sensitive to annotator characteristics, training materials, and to the design and wording of the data collection instrument. This paper explores the impact of observation order on annotations. We find that annotators’ judgments change based on the order in which they see observations. We use ideas from social psychology to motivate hypotheses about why this order effect occurs. We believe that insights from social science can help AI researchers improve data and model quality.

MCML Authors

Prof. Dr. Frauke Kreuter, Social Data Science and AI


[9]
E. Nie, S. Yuan, B. Ma, H. Schmid, M. Färber, F. Kreuter and H. Schütze.
Decomposed Prompting: Unveiling Multilingual Linguistic Structure Knowledge in English-Centric Large Language Models.
Preprint (Feb. 2024). arXiv
Abstract

Despite the predominance of English in their training data, English-centric Large Language Models (LLMs) like GPT-3 and LLaMA display a remarkable ability to perform multilingual tasks, raising questions about the depth and nature of their cross-lingual capabilities. This paper introduces the decomposed prompting approach to probe the linguistic structure understanding of these LLMs in sequence labeling tasks. Diverging from the single text-to-text prompt, our method generates for each token of the input sentence an individual prompt which asks for its linguistic label. We assess our method on the Universal Dependencies part-of-speech tagging dataset for 38 languages, utilizing both English-centric and multilingual LLMs. Our findings show that decomposed prompting surpasses the iterative prompting baseline in efficacy and efficiency under zero- and few-shot settings. Further analysis reveals the influence of evaluation methods and the use of instructions in prompts. Our multilingual investigation shows that English-centric language models perform better on average than multilingual models. Our study offers insights into the multilingual transferability of English-centric LLMs, contributing to the understanding of their multilingual linguistic knowledge.

MCML Authors

Prof. Dr. Frauke Kreuter, Social Data Science and AI
Prof. Dr. Hinrich Schütze, Computational Linguistics


2023


[8]
Z. Zhang, H. Yang, B. Ma, D. Rügamer and E. Nie.
Baby's CoThought: Leveraging Large Language Models for Enhanced Reasoning in Compact Models.
CoNLL 2023 - BabyLM Challenge at 27th Conference on Computational Natural Language Learning. Singapore, Dec 06-10, 2023. DOI GitHub
Abstract

Large Language Models (LLMs) demonstrate remarkable performance on a variety of natural language understanding (NLU) tasks, primarily due to their in-context learning ability. This ability could be applied to building babylike models, i.e. models at small scales, improving training efficiency. In this paper, we propose a ‘CoThought’ pipeline, which efficiently trains smaller ‘baby’ language models (BabyLMs) by leveraging the Chain of Thought prompting of LLMs. Our pipeline restructures a dataset of less than 100M in size using GPT-3.5-turbo, transforming it into task-oriented, human-readable texts that are comparable to the school texts for language learners. The BabyLM is then pretrained on this restructured dataset in a RoBERTa fashion. In evaluations across 4 benchmarks, our BabyLM outperforms the vanilla RoBERTa in 10 linguistic, NLU, and question-answering tasks by more than 3 points, showing a superior ability to extract contextual information. These results suggest that compact LMs pretrained on small, LLM-restructured data can better understand tasks and achieve improved performance.

MCML Authors

Prof. Dr. David Rügamer, Statistics, Data Science and Machine Learning


[7]
T. Kaufmann, S. Ball, J. Beck, E. Hüllermeier and F. Kreuter.
On the challenges and practices of reinforcement learning from real human feedback.
HLDM @ECML-PKDD 2023 - 1st Workshop on Hybrid Human-Machine Learning and Decision Making at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD 2023). Turin, Italy, Sep 18-22, 2023. DOI
Abstract

Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning (RL) that does not require an engineered reward function but instead learns from human feedback. Due to its increasing popularity, various authors have studied how to learn an accurate reward model from only few samples, making optimal use of this feedback. Because of the cost and complexity of user studies, however, this research is often conducted with synthetic human feedback. Such feedback can be generated by evaluating behavior based on ground-truth rewards which are available for some benchmark tasks. While this setting can help evaluate some aspects of RLHF, it differs from practical settings in which synthetic feedback is not available. Working with real human feedback brings additional challenges that cannot be observed with synthetic feedback, including fatigue, inter-rater inconsistencies, delay, misunderstandings, and modality-dependent difficulties. We describe and discuss some of these challenges together with current practices and opportunities for further research in this paper.

MCML Authors

Timo Kaufmann, Artificial Intelligence and Machine Learning
Prof. Dr. Eyke Hüllermeier, Artificial Intelligence and Machine Learning
Prof. Dr. Frauke Kreuter, Social Data Science and AI


[6]
B. Ma, E. Nie, H. Schmid and H. Schütze.
Is Prompt-Based Finetuning Always Better than Vanilla Finetuning? Insights from Cross-Lingual Language Understanding.
KONVENS 2023 - 19th Conference on Natural Language Processing. Ingolstadt, Germany, Sep 18-22, 2023. URL
Abstract

Multilingual pretrained language models (MPLMs) have demonstrated substantial performance improvements in zero-shot cross-lingual transfer across various natural language understanding tasks by finetuning MPLMs on task-specific labelled data of a source language (e.g. English) and evaluating on a wide range of target languages. Recent studies show that prompt-based finetuning surpasses regular finetuning in few-shot scenarios. However, the exploration of prompt-based learning in multilingual tasks remains limited. In this study, we propose the PROFIT pipeline to investigate the cross-lingual capabilities of Prompt-based Finetuning. We conduct comprehensive experiments on diverse cross-lingual language understanding tasks (sentiment classification, paraphrase identification, and natural language inference) and empirically analyze the variation trends of prompt-based finetuning performance in cross-lingual transfer across different few-shot and full-data settings. Our results reveal the effectiveness and versatility of prompt-based finetuning in cross-lingual language understanding. Our findings indicate that prompt-based finetuning outperforms vanilla finetuning in full-data scenarios and exhibits greater advantages in few-shot scenarios, with different performance patterns dependent on task types. Additionally, we analyze underlying factors such as language similarity and pretraining data size that impact the cross-lingual performance of prompt-based finetuning. Overall, our work provides valuable insights into the cross-lingual prowess of prompt-based finetuning.

MCML Authors

Prof. Dr. Hinrich Schütze, Computational Linguistics


[5]
M. Trappmann, G.-C. Haas, S. Malich, F. Keusch, S. Bähr, F. Kreuter and S. Schwarz.
Augmenting survey data with digital trace data: Is there a threat to panel retention?
Journal of Survey Statistics and Methodology 11.3 (Jun. 2023). DOI
Abstract

Linking digital trace data to existing panel survey data may increase the overall analysis potential of the data. However, producing linked products often requires additional engagement from survey participants through consent or participation in additional tasks. Panel operators may worry that such additional requests may backfire and lead to lower panel retention, reducing the analysis potential of the data. To examine these concerns, we conducted an experiment in the German PASS panel survey after wave 11. Three quarters of panelists (n = 4,293) were invited to install a research app and to provide sensor data over a period of 6 months, while one quarter (n = 1,428) did not receive an invitation. We find that the request to install a smartphone app and share data significantly decreases panel retention in the wave immediately following the invitation by 3.3 percentage points. However, this effect wears off and is no longer significant in the second and third waves after the invitation. We conclude that researchers who run panel surveys have to take moderate negative effects on retention into account but that the potential gain likely outweighs these moderate losses.

MCML Authors

Prof. Dr. Frauke Kreuter, Social Data Science and AI


[4]
I. Ziegler, B. Ma, B. Bischl, E. Dorigatti and B. Schubert.
Proteasomal cleavage prediction: state-of-the-art and future directions.
Preprint (2023). DOI GitHub
Abstract

Epitope vaccines are a promising approach for precision treatment of pathogens, cancer, autoimmune diseases, and allergies. Effectively designing such vaccines requires accurate proteasomal cleavage prediction to ensure that the epitopes included in the vaccine trigger an immune response. The performance of proteasomal cleavage predictors has been steadily improving over the past decades owing to increasing data availability and methodological advances. In this review, we summarize the current proteasomal cleavage prediction landscape and, in light of recent progress in the field of deep learning, develop and compare a wide range of recent architectures and techniques, including long short-term memory (LSTM), transformers, and convolutional neural networks (CNN), as well as four different denoising techniques. All open-source cleavage predictors re-trained on our dataset performed within two AUC percentage points. Our comprehensive deep learning architecture benchmark improved performance by 1.7 AUC percentage points, while closed-source predictors performed considerably worse. We found that a wide range of architectures and training regimes all result in very similar performance, suggesting that the specific modeling approach employed has a limited impact on predictive performance compared to the specifics of the dataset employed. We speculate that the noise and implicit nature of data acquisition techniques used for training proteasomal cleavage prediction models and the complexity of biological processes of the antigen processing pathway are the major limiting factors. While biological complexity can be tackled by more data and, to a lesser extent, better models, noise and randomness inherently limit the maximum achievable predictive performance.

MCML Authors

Prof. Dr. Bernd Bischl, Statistical Learning and Data Science


2022


[3]
I. Ziegler, B. Ma, E. Nie, B. Bischl, D. Rügamer, B. Schubert and E. Dorigatti.
What cleaves? Is proteasomal cleavage prediction reaching a ceiling?
LMRL @NeurIPS 2022 - Workshop on Learning Meaningful Representations of Life at the 36th Conference on Neural Information Processing Systems (NeurIPS 2022). New Orleans, LA, USA, Nov 28-Dec 09, 2022. URL
Abstract

Epitope vaccines are a promising direction to enable precision treatment for cancer, autoimmune diseases, and allergies. Effectively designing such vaccines requires accurate prediction of proteasomal cleavage in order to ensure that the epitopes in the vaccine are presented to T cells by the major histocompatibility complex (MHC). While direct identification of proteasomal cleavage in vitro is cumbersome and low throughput, it is possible to implicitly infer cleavage events from the termini of MHC-presented epitopes, which can be detected in large amounts thanks to recent advances in high-throughput MHC ligandomics. Inferring cleavage events in such a way provides an inherently noisy signal which can be tackled with new developments in the field of deep learning that supposedly make it possible to learn predictors from noisy labels. Inspired by such innovations, we sought to modernize proteasomal cleavage predictors by benchmarking a wide range of recent methods, including LSTMs, transformers, CNNs, and denoising methods, on a recently introduced cleavage dataset. We found that increasing model scale and complexity appeared to deliver limited performance gains, as several methods reached about 88.5% AUC on C-terminal and 79.5% AUC on N-terminal cleavage prediction. This suggests that the noise and/or complexity of proteasomal cleavage and the subsequent biological processes of the antigen processing pathway are the major limiting factors for predictive performance rather than the specific modeling approach used. While biological complexity can be tackled by more data and better models, noise and randomness inherently limit the maximum achievable predictive performance.

MCML Authors

Prof. Dr. Bernd Bischl, Statistical Learning and Data Science
Prof. Dr. David Rügamer, Statistics, Data Science and Machine Learning


[2]
K. E. Riehm, E. Badillo Goicoechea, F. M. Wang, E. Kim, L. R. Aldridge, C. P. Lupton-Smith, R. Presskreischer, T.-H. Chang, S. LaRocca, F. Kreuter and E. A. Stuart.
Association of Non-Pharmaceutical Interventions to Reduce the Spread of SARS-CoV-2 With Anxiety and Depressive Symptoms: A Multi-National Study of 43 Countries.
International Journal of Public Health 67 (Mar. 2022). DOI
Abstract

Objectives: To examine the association of non-pharmaceutical interventions (NPIs) with anxiety and depressive symptoms among adults and determine if these associations varied by gender and age.
Methods: We combined survey data from 16,177,184 adults from 43 countries who participated in the daily COVID-19 Trends and Impact Survey via Facebook with time-varying NPI data from the Oxford COVID-19 Government Response Tracker between 24 April 2020 and 20 December 2020. Using logistic regression models, we examined the association of [1] overall NPI stringency and [2] seven individual NPIs (school closures, workplace closures, cancellation of public events, restrictions on the size of gatherings, stay-at-home requirements, restrictions on internal movement, and international travel controls) with anxiety and depressive symptoms.
Results: More stringent implementation of NPIs was associated with a higher odds of anxiety and depressive symptoms, albeit with very small effect sizes. Individual NPIs had heterogeneous associations with anxiety and depressive symptoms by gender and age.
Conclusion: Governments worldwide should be prepared to address the possible mental health consequences of stringent NPI implementation with both universal and targeted interventions for vulnerable groups.

MCML Authors

Prof. Dr. Frauke Kreuter, Social Data Science and AI


[1]
R. Valliant, J. A. Dever, F. Kreuter and G. Zipf.
Package ‘PracTools’.
2022. URL
Abstract

Functions and datasets to support Valliant, Dever, and Kreuter (2018), ‘Practical Tools for Designing and Weighting Survey Samples’. Contains functions for sample size calculation for survey samples using stratified or clustered one-, two-, and three-stage sample designs, and single-stage audit sample designs. Functions are included that will group geographic units accounting for distances apart and measures of size. Other functions compute variance components for multistage designs and sample sizes in two-phase designs. A number of example data sets are included.
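
PracTools itself is an R package, so none of the code below comes from it; purely as a flavor of the kind of computation it supports (it provides analogous functions such as nProp()), here is the textbook sample-size formula for estimating a proportion, with a finite population correction:

```python
import math

def sample_size_proportion(p: float, e: float, N: int | None = None,
                           z: float = 1.96) -> int:
    """n for estimating a proportion p within margin of error e (95% CI)."""
    n0 = z**2 * p * (1 - p) / e**2          # infinite-population size
    if N is not None:
        n0 = n0 / (1 + (n0 - 1) / N)        # finite population correction
    return math.ceil(n0)

print(sample_size_proportion(p=0.5, e=0.03))           # ~1068
print(sample_size_proportion(p=0.5, e=0.03, N=5000))   # smaller with FPC
```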

MCML Authors

Prof. Dr. Frauke Kreuter, Social Data Science and AI