Masking in Multi-hop QA: An Analysis of How Language Models Perform with Context Permutation
Abstract
Encoder-decoder models outperform causal decoder-only models in multi-hop question answering, benefiting from bi-directional attention over permuted search results.
Multi-hop Question Answering (MHQA) adds layers of complexity to question answering, making it more challenging. When Language Models (LMs) are prompted with multiple search results, they are tasked not only with retrieving relevant information but also with performing multi-hop reasoning across the information sources. Although LMs perform well on traditional question-answering tasks, the causal mask can hinder their capacity to reason across complex contexts. In this paper, we explore how LMs respond to multi-hop questions by permuting search results (retrieved documents) under various configurations. Our study reveals the following findings: 1) encoder-decoder models, such as those in the Flan-T5 family, generally outperform causal decoder-only LMs on MHQA tasks, despite being significantly smaller; 2) altering the order of gold documents reveals distinct trends in both Flan-T5 models and fine-tuned decoder-only models, with optimal performance observed when the document order aligns with the reasoning chain order; 3) enhancing causal decoder-only models with bi-directional attention by modifying the causal mask can effectively boost their end performance. Beyond these findings, we conduct a thorough investigation of the distribution of LM attention weights in the context of MHQA. Our experiments reveal that attention weights tend to peak at higher values when the resulting answer is correct. We leverage this finding to heuristically improve LMs' performance on this task. Our code is publicly available at https://github.com/hwy9855/MultiHopQA-Reasoning.
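To make the two interventions concrete, here is a minimal sketch (not the authors' implementation; the helper names, prompt template, and example documents are illustrative) of permuting retrieved documents in the prompt and of relaxing the causal mask so that context tokens attend bi-directionally, prefix-LM style:

```python
import itertools

import torch


def build_prompt(question: str, documents: list[str]) -> str:
    """Concatenate one permutation of the search results ahead of the question."""
    context = "\n\n".join(f"Document {i + 1}: {doc}" for i, doc in enumerate(documents))
    return f"{context}\n\nQuestion: {question}\nAnswer:"


def prefix_lm_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend): the first `prefix_len`
    tokens (the documents and the question) attend to each other in both
    directions, while later (generated) tokens remain strictly causal."""
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    mask[:prefix_len, :prefix_len] = True  # lift causality inside the prefix
    return mask


# Enumerate document orders, e.g. to test whether the order matching the
# reasoning chain performs best (finding 2 above). With a real model,
# prefix_len would be the token length of the prompt.
docs = ["The Louvre is located in Paris.", "Paris is the capital of France."]
for perm in itertools.permutations(docs):
    prompt = build_prompt("In which country is the Louvre?", list(perm))
    print(prompt, end="\n---\n")
```

In HF-style decoder implementations, a boolean mask like this would typically be expanded to shape (batch, 1, seq_len, seq_len) and converted to additive form before being passed to the attention layers.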
Community
As large language models increasingly power AI assistants, their ability to reason across multiple pieces of evidence is a crucial yet unresolved challenge. In real-world retrieval-augmented generation (RAG) scenarios, simply changing the order of supporting documents can mean the difference between success and failure. In this paper, we rigorously analyse how various types of LMs tackle multi-hop question answering when the context order and placement are changed, revealing which architectures and strategies excel, and why. Most notably, we show that identifying and leveraging models' attention patterns can unlock significant gains, offering practical guidance for building smarter, more reliable multi-hop QA systems.
Paper: https://arxiv.org/abs/2505.11754
Code: https://github.com/hwy9855/MultiHopQA-Reasoning
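As a sketch of how the attention-pattern finding could be turned into a selection heuristic, one can generate answers under several document orders and keep the one whose attention distribution peaks highest. The scoring rule below is an assumed simplification, not necessarily the authors' exact method; see the repository for details.

```python
import torch


def peak_attention_score(attentions: tuple) -> float:
    """attentions: per-layer tensors of shape (batch, heads, q_len, k_len),
    as returned by a Hugging Face model called with output_attentions=True.
    Scores a run by the maximum attention weight in the final layer
    (an assumed proxy; the paper reports higher peaks for correct answers)."""
    return attentions[-1].max().item()


def pick_best_answer(candidates: list[dict]) -> str:
    """candidates: [{"answer": str, "attentions": tuple of Tensors}, ...],
    one entry per document permutation. Keeps the answer from the run whose
    attention peaks highest."""
    best = max(candidates, key=lambda c: peak_attention_score(c["attentions"]))
    return best["answer"]
```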
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Adapting Large Language Models for Multi-Domain Retrieval-Augmented-Generation (2025)
- Bidirectional LMs are Better Knowledge Memorizers? A Benchmark for Real-world Knowledge Injection (2025)
- Order Independence With Finetuning (2025)
- Unleashing the Power of LLMs in Dense Retrieval with Query Likelihood Modeling (2025)
- Can LLMs reason over extended multilingual contexts? Towards long-context evaluation beyond retrieval and haystacks (2025)
- RankAlign: A Ranking View of the Generator-Validator Gap in Large Language Models (2025)
- Understanding and Improving Information Preservation in Prompt Compression for LLMs (2025)