arxiv:2505.18931

Can Large Language Models Infer Causal Relationships from Real-World Text?

Published on May 25
· Submitted by amanchadha on May 29
Abstract

A benchmark for assessing LLMs' ability to infer causal relationships from real-world texts highlights significant challenges, revealing common pitfalls in handling implicit information and long-range connections.

AI-generated summary

Understanding and inferring causal relationships from texts is a core aspect of human cognition and is essential for advancing large language models (LLMs) towards artificial general intelligence. Existing work primarily focuses on synthetically generated texts involving simple causal relationships that are explicitly stated, which fails to reflect the complexities of real-world tasks. In this paper, we investigate whether LLMs are capable of inferring causal relationships from real-world texts. We develop a benchmark drawn from real-world academic literature which includes texts that are diverse in length, complexity of relationships (levels of explicitness, number of events, and number of causal relationships), and domains and sub-domains. To the best of our knowledge, our benchmark is the first real-world dataset for this task. Our experiments with state-of-the-art LLMs on the proposed benchmark reveal significant challenges, with the best-performing model achieving an average F1 score of only 0.477. Analysis reveals common pitfalls: difficulty handling implicitly stated information, distinguishing relevant causal factors from surrounding contextual details, and connecting causally relevant information spread across lengthy textual passages. By systematically characterizing these deficiencies, our benchmark offers targeted insights for further research into advancing LLM causal reasoning.
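For intuition about what an F1 score over causal graphs measures, here is a minimal sketch that scores predicted cause -> effect edges against a reference graph. This is an illustrative assumption: the paper's actual evaluation uses semantic, LLM-judged matching rather than the exact-match edge comparison below, and the example edges are invented.

```python
# Hedged sketch: edge-level F1 between a predicted and a reference causal
# graph, with exact string matching as a simplifying assumption.

def edge_f1(predicted: set[tuple[str, str]], gold: set[tuple[str, str]]) -> float:
    """F1 over directed (cause, effect) edges."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)            # edges present in both graphs
    precision = tp / len(predicted)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative (invented) edges:
pred = {("interest rates", "investment"), ("investment", "gdp")}
gold = {("interest rates", "investment"), ("inflation", "interest rates")}
print(edge_f1(pred, gold))  # one shared edge out of two on each side -> 0.5
```

A score of 0.477 on this kind of measure means roughly half of the causal structure is being missed or hallucinated, even by the best model.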

Community


The paper introduces ReCAST, the first benchmark to evaluate LLMs' ability to infer complex, realistic causal graphs from long-form, real-world academic texts, revealing that state-of-the-art models fail to exhibit robust causal reasoning under such conditions. Specifics below:

  • Introduction of ReCAST: The first benchmark explicitly designed to evaluate LLMs’ ability to construct complex causal graphs from long-form, real-world texts (primarily economics literature), addressing gaps in prior synthetic and shallow-text benchmarks.

  • Realistic Dataset Pipeline: A rigorous 3-stage pipeline (collection, annotation, post-processing) produces high-fidelity graph-text pairs, incorporating human expert annotation, LLM-aided normalization, and strict formatting standards to ensure semantic accuracy and reproducibility.

  • Causal Graph Grounding from Narrative Text: Unlike prior work focusing on pairwise or sentence-level causality, ReCAST demands the extraction of multi-node causal networks from unstructured, naturally-written academic texts—a significantly harder and more authentic challenge.

  • LLM-as-a-Judge Evaluation Framework: A novel automated evaluation method where LLMs assess generated graphs using semantic similarity, abstraction alignment, and relational correctness, allowing for nuanced grading beyond rigid structural or token-based comparisons.

  • Degree of Confounding as Difficulty Metric: Introduction of a quantitative metric for implicitness, measuring how many graph nodes are not explicitly mentioned in the text, which serves as a predictor of LLM failure and a unique axis of analysis.

  • Name-Assisted Graph Construction Ablation: A diagnostic setup where LLMs are given all ground-truth node names, isolating causal reasoning from entity extraction. Results show only marginal improvement, confirming causal inference as the primary performance bottleneck.

  • Extensive Multi-Factor Error Analysis: Empirical study shows that LLMs fail due to inability to integrate dispersed information, abstract causal pathways, or avoid hallucinations—especially when causal relationships are implicit, multi-hop, or embedded in domain-specific jargon.
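The "degree of confounding" idea above can be made concrete with a toy sketch: count the fraction of ground-truth graph nodes whose names never appear verbatim in the source text. This is an assumption-laden simplification; the paper's metric may use semantic rather than substring matching, and the example text and node names are invented.

```python
# Hedged sketch of an implicitness metric in the spirit of "degree of
# confounding": the share of graph nodes not explicitly mentioned in the
# text. Verbatim substring matching is a simplifying assumption.

def implicitness(nodes: list[str], text: str) -> float:
    lowered = text.lower()
    unmentioned = [n for n in nodes if n.lower() not in lowered]
    return len(unmentioned) / len(nodes) if nodes else 0.0

# Illustrative (invented) example:
text = "Rising interest rates dampen investment, which slows GDP growth."
nodes = ["interest rates", "investment", "GDP growth", "monetary policy"]
print(implicitness(nodes, text))  # "monetary policy" is implicit -> 0.25
```

Under such a metric, a higher score means more of the gold graph must be inferred rather than extracted, which the paper reports as a predictor of LLM failure.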
