Belief in the Machine: LM Epistemological Reasoning Leaderboard

Investigating Epistemological Blind Spots of Language Models

As language models (LMs) become integral to fields like healthcare, law, and journalism, their ability to differentiate between fact, belief, and knowledge is essential for reliable decision-making. This leaderboard presents results from our study that systematically evaluates the epistemological reasoning capabilities of 24 modern LMs, including:

  • DeepSeek's R1
  • OpenAI's o1
  • Google's Gemini 2 Flash
  • Anthropic's Claude 3.7 Sonnet
  • Meta's Llama 3.3 70B

The evaluation uses a new benchmark consisting of 13,000 questions across 13 tasks that test how well models understand and reason about truth, belief, and knowledge.

Key Findings

  1. While LMs achieve 86% accuracy on factual scenarios, performance drops sharply on false scenarios, particularly in belief-related tasks
  2. LMs struggle to recognize and affirm personal beliefs, especially when those beliefs contradict factual data
  3. LMs process first-person and third-person beliefs differently, performing better on third-person tasks (80.7% accuracy) than on first-person tasks (54.4%)
  4. LMs lack a robust understanding of the factive nature of knowledge (that knowledge inherently requires the known statement to be true)
  5. LMs often rely on linguistic cues for fact-checking rather than applying deeper reasoning

Citation

@article{suzgun2024beliefmachine,
      title={Belief in the Machine: Investigating Epistemological Blind Spots of Language Models}, 
      author={Mirac Suzgun and Tayfun Gur and Federico Bianchi and Daniel E. Ho and Thomas Icard and Dan Jurafsky and James Zou},
      year={2024},
      eprint={2410.21195},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.21195}, 
}

View full paper on arXiv | View code on GitHub

Model Performance Comparison

Select filters to customize the leaderboard view:

1. Select Datasets: choose which datasets to display as columns
2. Select Conditions: choose which conditions to filter by
3. Select Models: choose which models to include in the table
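
As a rough sketch of what these filters do under the hood (the DataFrame schema, column names, and accuracy numbers below are assumptions for illustration, not the app's actual code or results), selecting datasets, conditions, and models amounts to simple row filtering plus a pivot over a per-model results table:

```python
import pandas as pd

# Hypothetical results table; column names and numbers are illustrative only.
results = pd.DataFrame({
    "model": ["o1", "o1", "Claude 3.7 Sonnet", "Claude 3.7 Sonnet"],
    "task": ["Direct Fact Verification", "First-person Belief",
             "Direct Fact Verification", "First-person Belief"],
    "condition": ["true", "false", "true", "false"],
    "accuracy": [0.95, 0.62, 0.93, 0.58],
})

def filter_leaderboard(df, tasks=None, conditions=None, models=None):
    """Apply the three leaderboard filters: datasets (tasks), conditions, and models."""
    if tasks is not None:
        df = df[df["task"].isin(tasks)]
    if conditions is not None:
        df = df[df["condition"].isin(conditions)]
    if models is not None:
        df = df[df["model"].isin(models)]
    # One row per model, one column per selected task.
    return df.pivot_table(index="model", columns="task", values="accuracy")

print(filter_leaderboard(results, conditions=["false"]))
```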

Model Performance Leaderboard

About the Benchmark

The benchmark used in this study consists of 13,000 questions across 13 tasks designed to test epistemological reasoning, including:

  • Direct Fact Verification: Testing if models can verify basic factual statements
  • First-person & Third-person Belief: Evaluating how models understand beliefs from different perspectives
  • Belief Attribution: Testing if models can correctly attribute beliefs to individuals
  • Knowledge Attribution: Testing if models understand that knowledge requires truth

The benchmark evaluates models under both true and false conditions to assess how well they understand the relationship between truth, belief, and knowledge.
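
To make the true/false-condition setup concrete, here is a minimal scoring sketch (the record fields, prompt wording, and grading rule are assumptions for illustration, not the benchmark's released format): each question carries a task label, a condition flag, and a gold answer, and accuracy is tallied separately per task and per condition.

```python
from collections import defaultdict

# Hypothetical record format; field names and wording are assumptions for this sketch,
# not the benchmark's actual schema.
items = [
    {"task": "Direct Fact Verification", "condition": "true",
     "question": "Is the following statement true? ...", "gold": "yes"},
    {"task": "First-person Belief", "condition": "false",
     "question": "I believe that ... Do I believe that ...?", "gold": "yes"},
]

def evaluate(answer_fn, items):
    """Compute accuracy separately for each (task, condition) pair."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        key = (item["task"], item["condition"])
        total[key] += 1
        if answer_fn(item["question"]).strip().lower() == item["gold"]:
            correct[key] += 1
    return {key: correct[key] / total[key] for key in total}

# Stand-in "model" that always answers "yes", just to show the scoring loop.
print(evaluate(lambda question: "yes", items))
```

The headline numbers in the Key Findings section are aggregates of this kind, broken down by task and by whether the underlying statement is true or false.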