#p_plank

MWG+25

BlackboxNLP-2025 MIB Shared Task: Exploring Ensemble Strategies for Circuit Localization Methods

BlackboxNLP @EMNLP 2025

#p_plank #p_schuetze

BMP+25

The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It

EMNLP 2025

#p_plank

CLK+25

Threading the Needle: Reweaving Chain-of-Thought Reasoning to Explain Human Label Variation

EMNLP 2025

#p_plank

DMS+25

Reason to Rote: Rethinking Memorization in Reasoning

EMNLP 2025

#p_plank

HCP+25

LiTEx: A Linguistic Taxonomy of Explanations for Understanding Within-Label Variation in Natural Language Inference

EMNLP 2025

#p_plank

LCB+25

PERSEVAL: A Framework for Perspectivist Classification Evaluation

EMNLP 2025

#p_plank

TPF25

RACQUET: Unveiling the Dangers of Overlooked Referential Ambiguity in Visual LLMs

EMNLP 2025

#p_plank

WML+25

M-ABSA: A Multilingual Dataset for Aspect-Based Sentiment Analysis

EMNLP 2025

#p_kreuter #p_plank #p_schuetze

LBB+25

Make Every Letter Count: Building Dialect Variation Dictionaries From Monolingual Corpora

Findings @EMNLP 2025

#p_plank

LWK+25a

Tracing Multilingual Factual Knowledge Acquisition in Pretraining

Findings @EMNLP 2025

#p_plank #p_schuetze

ZCP+25

MAKIEval: A Multilingual Automatic WiKidata-Based Framework for Cultural Awareness Evaluation for LLMs

Findings @EMNLP 2025

#p_hedderich #p_plank

ZPL+25

What Media Frames Reveal About Stance: A Dataset and Study About Memes in Climate Change Discourse

Findings @EMNLP 2025

#p_plank

ZHK+25a

Evaluating Large Language Models for Cross-Lingual Retrieval

Findings @EMNLP 2025

#p_plank

LCP+25a

LeWiDi-2025 at NLPerspectives: The Third Edition of the Learning With Disagreements Shared Task

LeWiDi @EMNLP 2025

#p_plank

EMK+25

Aligning NLP Models With Target Population Perspectives Using PAIR: Population-Aligned Instance Replication

NLPerspectives @EMNLP 2025

#p_kern #p_kreuter #p_plank

BWP+25

Standard-to-Dialect Transfer Trends Differ Across Text and Speech: A Case Study on Intent and Topic Classification in German Dialects

Preprint (Oct. 2025)

#p_plank

HCP+25a

Agree, Disagree, Explain: Decomposing Human Label Variation in NLI Through the Lens of Explanations

Preprint (Oct. 2025)

#p_plank

MCS+25

Too Open for Opinion? Embracing Open-Endedness in Large Language Models for Social Simulation

Preprint (Oct. 2025)

#p_kreuter #p_plank

OMP+25

If Probable, Then Acceptable? Understanding Conditional Acceptability Judgments in Large Language Models

Preprint (Oct. 2025)

#p_plank

RPB+25

BoN Appetit Team at LeWiDi-2025: Best-of-N Test-Time Scaling Can Not Stomach Annotation Disagreements (Yet)

Preprint (Oct. 2025)

#p_plank

WJP+25

Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort

Preprint (Oct. 2025)

#p_plank

XTP25

From Noise to Signal to Selbstzweck: Reframing Human Label Variation in the Era of Post-Training in NLP

Preprint (Oct. 2025)

#p_plank

BBB+25

LLMs Instead of Human Judges? A Large Scale Empirical Study Across 20 NLP Evaluation Tasks

ACL 2025

#p_plank

ELP+25

Probing LLMs for Multilingual Discourse Generalization Through a Unified Label Set

ACL 2025

#p_hedderich #p_plank

HWZ+25

What's the Difference? Supporting Users in Identifying the Effects of Prompt and Model Changes Through Token Patterns

ACL 2025

#p_hedderich #p_plank

MLZ+25

Pragmatics in the Era of Large Language Models: A Survey on Datasets, Evaluation, Opportunities and Challenges

ACL 2025

#p_kreuter #p_plank

MYH+25

Algorithmic Fidelity of Large Language Models in Generating Synthetic German Public Opinions: A Case Study

ACL 2025

#p_bischl #p_kreuter #p_plank

MWP25

Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models

ACL 2025

#p_plank

SFP25

Do LLMs Give Psychometrically Plausible Responses in Educational Assessments?

BEA @ACL 2025

#p_plank

BFT25

Analyzing the Effect of Linguistic Similarity on Cross-Lingual Transfer: Tasks and Experimental Setups Matter

Findings @ACL 2025

#p_plank

CPK+25

A Rose by Any Other Name: LLM-Generated Explanations Are Good Proxies for Human Explanations to Collect Label Distributions on NLI

Findings @ACL 2025

#p_plank

GAB+25

Revisiting Active Learning Under (Human) Label Variation

Preprint (Jul. 2025)

#p_bischl #p_kauermann #p_plank

BWF+25

A Multi-Dialectal Dataset for German Dialect ASR and Dialect-to-Standard Speech Translation

Preprint (Jun. 2025)

#p_plank

CLP+25

Evaluation Should Not Ignore Variation: On the Impact of Reference Set Choice on Summarization Metrics

Preprint (Jun. 2025)

#p_plank

JSH+25

MultiplEYE: Creating a Multilingual Eye-Tracking-While-Reading Corpus

ETRA 2025

#p_plank

EDM+25

Grokking ExPLAIND: Unifying Model, Data, and Training Attribution to Study Model Behavior

Preprint (May 2025)

#p_hedderich #p_plank

SDH+25

Languages in Multilingual Speech Foundation Models Align Both Phonetically and Semantically

Preprint (May 2025)

#p_plank

WWL+25

Refusal Direction Is Universal Across Safety-Aligned Languages

Preprint (May 2025)

#p_plank #p_schuetze

SP25

Dialetto, Ma Quanto Dialetto? Transcribing and Evaluating Dialects on a Continuum

Findings @NAACL 2025

#p_plank

MES+25

Lost in Inference: Rediscovering the Role of Natural Language Inference for Large Language Models

NAACL 2025

#p_plank

Bla25

Beyond 'Noisy' Text: How (And Why) to Process Dialect Data

W-NUT @NAACL 2025

#p_plank

WHR+25

Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation

ICLR 2025

#p_plank

MZR+25

Enabling Systematic Generalization in Abstract Spatial Reasoning Through Meta-Learning for Compositionality

Preprint (Apr. 2025)

#p_plank

SWZ+25

Think Before Refusal: Triggering Safety Reflection in LLMs to Mitigate False Refusal Behavior

Preprint (Mar. 2025)

#p_navab #p_plank

LFP25

Mind the Uncertainty in Human Disagreement: Evaluating Discrepancies Between Model Predictions and Human Responses in VQA

AAAI 2025

#p_plank #p_seidl

FMB+25

Using Natural Language Processing to Analyse Text Data in Behavioural Science

Nature Reviews Psychology 4. Feb. 2025

#p_feuerriegel #p_plank

XSE+25

Better Aligned With Survey Respondents or Training Data? Unveiling Political Leanings of LLMs on U.S. Supreme Court Cases

Preprint (Feb. 2025)

#p_plank

LKB+25

Cross-Dialect Information Retrieval: Information Access in Low-Resource and High-Variance Languages

COLING 2025

#p_plank

MBP25

Evaluating Pixel Language Models on Non-Standardized Languages

COLING 2025

#p_plank

BKP25

Add Noise, Tasks, or Layers? MaiNLP at the VarDial 2025 Shared Task on Norwegian Dialectal Slot and Intent Detection

VarDial @COLING 2025

#p_plank

KBP25

Improving Dialectal Slot and Intent Detection With Auxiliary Tasks: A Multi-Dialectal Bavarian Case Study

VarDial @COLING 2025

#p_plank

LPP+25

Neural Text Normalization for Luxembourgish Using Real-Life Variation Data

VarDial @COLING 2025

#p_plank

ZLW+24

FinerCut: Finer-Grained Interpretable Layer Pruning for Large Language Models

Compression Workshop @NeurIPS 2024

#p_bischl #p_plank

BCF+24a

PERSEID - Perspectivist Irony Detection: A CALAMITA Challenge

CLiC-It 2024

#p_plank

BCW+24

Data Augmentation Through Back-Translation for Stereotypes and Irony Detection

CLiC-It 2024

#p_plank

FPS+24

GFG - Gender-Fair Generation: A CALAMITA Challenge

CLiC-It 2024

#p_plank

LAS+24

GDTB: Genre Diverse Data for English Shallow Discourse Parsing Across Modalities, Text Types, and Domains

EMNLP 2024

#p_plank

MP24b

Liar, Liar, Logical Mire: A Benchmark for Suppositional Reasoning in Large Language Models

EMNLP 2024

#p_plank

BCL+24

I’m Sure You’re a Real Scholar Yourself: Exploring Ironic Content Generation by Large Language Models

Findings @EMNLP 2024

#p_plank

CWP+24

'Seeing the Big Through the Small': Can LLMs Approximate Human Judgment Distributions on NLI From a Few Explanations?

Findings @EMNLP 2024

#p_plank

MWH+24

The Potential and Challenges of Evaluating Attitudes, Opinions, and Values in Large Language Models

Findings @EMNLP 2024

#p_hedderich #p_kreuter #p_plank

SLF+24

To Know or Not to Know? Analyzing Self-Consistency of Large Language Models Under Ambiguity

Findings @EMNLP 2024

#p_plank

WZP+24

MultiClimate: Multimodal Stance Detection on Climate Change Videos

NLP4PI @EMNLP 2024

#p_plank

MP24a

Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models--A Survey

COLM 2024

#p_plank

WHM24

Look at the Text: Instruction-Tuned Language Models Are More Robust Multiple Choice Selectors Than You Think

COLM 2024

#p_kreuter #p_plank

BKP+24

MaiBaam Annotation Guidelines

Preprint (Oct. 2024)

#p_plank

CWM+24

Understanding When Tree of Thoughts Succeeds: Larger Models Excel in Generation, Not Discrimination

Preprint (Oct. 2024)

#p_hedderich #p_plank

BPS+24

What Do Dialect Speakers Want? A Survey of Attitudes Towards Language Technology for German Dialects

ACL 2024

#p_plank #p_schuetze

MP24

Comparing Inferential Strategies of Humans and Large Language Models in Deductive Reasoning

ACL 2024

#p_plank

WPM+24

VariErr NLI: Separating Annotation Error From Human Label Variation

ACL 2024

#p_plank

XTI+24

Through the Lens of Split Vote: Exploring Disagreement, Difficulty and Calibration in Legal Case Outcome Classification

ACL 2024

#p_plank

ZPB24

CLIMATELI: Evaluating Entity Linking on Climate Change Data

ClimateNLP @ACL 2024

#p_plank

WMH+24

My Answer Is C: First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models

Findings @ACL 2024

#p_kreuter #p_plank

EPK24

Position: Insights From Survey Methodology Can Improve Training Data

ICML 2024

#p_kreuter #p_plank

ZSP+24a

MaiNLP at SemEval-2024 Task 1: Analyzing Source Language Selection in Cross-Lingual Textual Relatedness

SemEval @NAACL 2024

#p_plank

BKB+24

MaiBaam: A Multi-Dialectal Bavarian Universal Dependency Treebank

LREC-COLING 2024

#p_plank #p_schuetze

MP24c

IndirectQA: Understanding Indirect Answers to Implicit Polar Questions in French and Spanish

LREC-COLING 2024

#p_plank

PSS+24

Sebastian, Basti, Wastl?! Recognizing Named Entities in Bavarian Dialectal Data

LREC-COLING 2024

#p_plank

WJG+24

Slot and Intent Detection Resources for Bavarian and Lithuanian: Assessing Translations vs Natural Queries to Digital Assistants

LREC-COLING 2024

#p_plank

ZWH+24

Constructions Are So Difficult That Even Large Language Models Get Them Right for the Wrong Reasons

LREC-COLING 2024

#p_plank #p_schuetze

GHA+24

More Labels or Cases? Assessing Label Variation in Natural Language Inference

UnImplicit 2024

#p_bischl #p_plank

PSL+24

Different Tastes of Entities: Investigating Human Label Variation in Named Entity Annotations

UnImplicit 2024

#p_plank

ABP24

Exploring the Robustness of Task-Oriented Dialogue Systems for Colloquial German Varieties

EACL 2024

#p_plank

BFP+24

Interpreting Predictive Probabilities: Model Confidence or Human Label Variation?

EACL 2024

#p_plank

ZGK+24

NNOSE: Nearest Neighbor Occupational Skill Extraction

EACL 2024

#p_plank

ZGP24

Entity Linking in the Job Market Domain

Findings @EACL 2024

#p_plank

SPP+24

EEVEE: An Easy Annotation Tool for Natural Language Processing

LAW @EACL 2024

#p_plank

WLA+24a

Donkii: Characterizing and Detecting Errors in Instruction-Tuning Datasets

LAW @EACL 2024

#p_plank

ZWS+23

LoHoRavens: A Long-Horizon Language-Conditioned Benchmark for Robotic Tabletop Manipulation

Robot Learning @NeurIPS 2023

#p_plank #p_schuetze

GBA+23

What Comes Next? Evaluating Uncertainty in Neural Text Generators Against Human Production Variability

EMNLP 2023

#p_plank

LMG+23

Establishing Trustworthiness: Rethinking Tasks and Model Evaluation

EMNLP 2023

#p_plank

WP23

ACTOR: Active Learning With Annotator-Specific Classification Heads to Embrace Human Label Variation

EMNLP 2023

#p_plank

XTI+23

From Dissonance to Insights: Dissecting Disagreements in Rationale Construction for Case Outcome Classification

EMNLP 2023

#p_plank

MGP+23

Subspace Chronicles: How Linguistic Information Emerges, Shifts and Interacts During Language Model Training

Findings @EMNLP 2023

#p_plank

WP23a

ActiveAED: A Human in the Loop Improves Annotation Error Detection

Findings @ACL 2023

#p_plank

BDI+23

Uncertainty in Natural Language Generation: From Theory to Applications

Preprint (Jul. 2023)

#p_plank

BSP23b

A Survey of Corpora for Germanic Low-Resource Languages and Dialects

NoDaLiDa 2023

#p_plank #p_schuetze

WWS+23a

How to Distill Your BERT: An Empirical Study on the Impact of Weight Initialisation and Distillation Objectives

EACL 2023

#p_plank #p_schuetze

BSP23a

Does Manipulating Tokenization Aid Cross-Lingual Transfer? A Study on POS Tagging for Non-Standardized Languages

VarDial @EACL 2023

#p_plank #p_schuetze

BAP+22

Stop Measuring Calibration When Humans Disagree

EMNLP 2022

#p_plank

BMZ+22

Evidence > Intuition: Transferability Estimation for Encoder Selection

EMNLP 2022

#p_plank

MVP22

Spectral Probing

EMNLP 2022

#p_plank

Pla22

The 'Problem' of Human Label Variation: On Ground Truth in Data, Modeling and Evaluation

EMNLP 2022

#p_plank

BP22a

CrossRE: A Cross-Domain Dataset for Relation Extraction

Findings @EMNLP 2022

#p_plank

UBM+22

Experimental Standards for Deep Learning in Natural Language Processing Research

Findings @EMNLP 2022

#p_plank
