15.01.2025

TruthQuest – A New Benchmark for AI Reasoning

MCML Research Insight - With Philipp Mondorf and Barbara Plank

In their recent work, "Liar, Liar, Logical Mire: A Benchmark for Suppositional Reasoning in Large Language Models," our Junior Member Philipp Mondorf and our PI Barbara Plank tackle a fascinating question: How well do AI systems handle complex reasoning tasks?

«We introduce TruthQuest, a benchmark for suppositional reasoning based on the principles of knights and knaves puzzles.»


Philipp Mondorf

MCML Junior Member

To answer this question, the paper introduces TruthQuest, a benchmark designed to evaluate large language models (LLMs) using a classic logic-puzzle framework: knights and knaves. In these puzzles, knights always tell the truth, while knaves always lie. The goal, and the challenge, is to deduce each character's identity from their statements. Solving them requires a type of reasoning that goes beyond straightforward deduction: it demands the ability to explore hypothetical scenarios and infer truth or falsehood from their logical implications.
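To make the puzzle mechanics concrete, here is a minimal brute-force solver (a sketch, not code from the paper): it enumerates every knight/knave assignment and keeps those under which each knight's statement comes out true and each knave's comes out false. The two-character puzzle encoded below is a standard textbook example, not one of TruthQuest's instances.

```python
from itertools import product

def solve(names, statements):
    """Return all assignments (True = knight, False = knave) consistent
    with every character's statement.

    `statements` maps each name to a function that takes the full
    assignment dict and returns whether that character's statement is
    true under it. A knight's statement must be true; a knave's false.
    """
    solutions = []
    for values in product([True, False], repeat=len(names)):
        assign = dict(zip(names, values))
        if all(assign[n] == stmt(assign) for n, stmt in statements.items()):
            solutions.append(assign)
    return solutions

# Hypothetical puzzle:
# A says: "B is a knave."   B says: "A and I are both knaves."
puzzle = {
    "A": lambda a: not a["B"],
    "B": lambda a: not a["A"] and not a["B"],
}
print(solve(["A", "B"], puzzle))  # → [{'A': True, 'B': False}]
```

The suppositional step the paper highlights is visible here: B cannot be a knight (a knight cannot truthfully call himself a knave), so B's statement must be false, which forces A to be a knight.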

The authors rigorously tested prominent LLMs, including models from the Llama series (Llama 2 and Llama 3) and the Mixtral family, on puzzles of varying complexity. These puzzles included statements ranging from straightforward self-references to intricate logical equivalences and implications. The study reveals that even the most advanced models face significant challenges when reasoning through these puzzles, particularly as the complexity increases. Performance was measured not only by accuracy but also through error analysis, uncovering patterns in how models reason and where they fail. Errors ranged from basic misunderstandings about truth and lies to struggles with deducing the implications of potentially false statements. While some models showed promise with advanced techniques like chain-of-thought prompting, their accuracy dropped sharply with more characters and more intricate logical relationships.

«Our benchmark presents problems of varying complexity, considering both the number of characters and the types of logical statements involved.»


Philipp Mondorf

MCML Junior Member

Under zero-shot conditions, where models must solve puzzles without prior example-based guidance, performance was close to random guessing for most LLMs, even the larger ones. However, Llama 3-70B exhibited superior performance compared to others in its class, particularly when aided by advanced prompting techniques like chain-of-thought (CoT). This technique breaks the reasoning process into explicit steps, helping the model tackle simpler puzzles with marked improvements in accuracy.
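The difference between the two prompting conditions can be sketched as follows. This is an illustrative reconstruction, not the paper's exact prompt wording: a zero-shot prompt asks directly for the answer, while a chain-of-thought prompt invites the model to work through the hypothetical cases first.

```python
# Hypothetical puzzle text and prompt templates for illustration only;
# the actual TruthQuest prompts may be phrased differently.
PUZZLE = (
    "On an island, knights always tell the truth and knaves always lie.\n"
    "A says: 'B is a knave.' B says: 'A and I are both knaves.'\n"
    "Who is a knight and who is a knave?"
)

# Zero-shot: the model must answer with no guidance on how to reason.
zero_shot_prompt = PUZZLE + "\nAnswer:"

# Chain-of-thought: explicitly ask for step-by-step suppositional reasoning.
cot_prompt = (
    PUZZLE
    + "\nLet's think step by step: suppose each character is a knight,"
    " check their statement for contradictions, then conclude."
)

print(cot_prompt)
```

The CoT variant nudges the model to enumerate suppositions explicitly, which is exactly the kind of hypothetical-scenario exploration these puzzles demand.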

For puzzles involving more characters or complex logical structures, the models struggled universally. Errors multiplied when tasks required analyzing the implications of false statements or balancing multiple hypothetical scenarios. For instance, when dealing with puzzles featuring five or six characters, even Llama 3-70B, the top-performing model, saw a steep drop in accuracy.

An instance of the knights & knaves puzzle. By reasoning about the characters’ statements and their truthfulness, it is possible to deduce that Greeny and Bluey must be knights, while Pinky is a knave.

Error analysis offered fascinating insights into the capabilities and limitations of current LLMs. Lower-performing models, such as Llama 2-7B, displayed a diverse range of reasoning flaws, from failing to recognize the basic distinction between truth and lies to misinterpreting logical operators. In contrast, higher-performing models like Llama 3-70B primarily struggled with deducing the implications of potentially false statements.

Why focus on such puzzles? Knights and knaves problems, despite their simplicity, expose fundamental capabilities, or limitations, of AI systems in reasoning through uncertainty and contradiction. These skills are crucial for real-world applications, from autonomous decision-making systems to conversational AI, where navigating ambiguity is a daily challenge.

«TruthQuest evaluates whether models can hypothesize, evaluate, and conclude: a more human-like reasoning approach.»


Philipp Mondorf

MCML Junior Member

Additionally, the benchmark illuminates the current state of AI reasoning in a novel way. Unlike many standard datasets that test pre-defined deductive steps, TruthQuest evaluates whether models can hypothesize, evaluate, and conclude: a more human-like reasoning approach.

The journey doesn’t end here. The authors suggest several exciting directions for future research:

  • Expanding the benchmark to include puzzles with multiple valid solutions or no solution at all, adding another layer of complexity.
  • Introducing new character types with unique truth-telling behaviors, such as “logicians” who always reason correctly or “politicians” who never do, to further challenge models.
  • Developing even more sophisticated prompting techniques, such as Tree-of-Thoughts or Graph-of-Thoughts, to assist models in navigating complex reasoning scenarios.

Explore the full paper, published at EMNLP 2024, to see how prominent LLMs performed and where they fell short! The quest for AI that truly understands logic continues.

P. Mondorf and B. Plank.
Liar, Liar, Logical Mire: A Benchmark for Suppositional Reasoning in Large Language Models.
EMNLP 2024 - Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024.
Abstract

Knights and knaves problems represent a classic genre of logical puzzles where characters either tell the truth or lie. The objective is to logically deduce each character’s identity based on their statements. The challenge arises from the truth-telling or lying behavior, which influences the logical implications of each statement. Solving these puzzles requires not only direct deductions from individual statements, but the ability to assess the truthfulness of statements by reasoning through various hypothetical scenarios. As such, knights and knaves puzzles serve as compelling examples of suppositional reasoning. In this paper, we introduce TruthQuest, a benchmark for suppositional reasoning based on the principles of knights and knaves puzzles. Our benchmark presents problems of varying complexity, considering both the number of characters and the types of logical statements involved. Evaluations on TruthQuest show that large language models like Llama 3 and Mixtral-8x7B exhibit significant difficulties solving these tasks. A detailed error analysis of the models’ output reveals that lower-performing models exhibit a diverse range of reasoning errors, frequently failing to grasp the concept of truth and lies. In comparison, more proficient models primarily struggle with accurately inferring the logical implications of potentially false statements.

MCML Authors
Philipp Mondorf

AI and Computational Linguistics

Barbara Plank

Prof. Dr.

AI and Computational Linguistics



