arxiv:2505.11754

Masking in Multi-hop QA: An Analysis of How Language Models Perform with Context Permutation

Published on May 16 · Submitted by hwy9855 on May 21

Abstract

Encoder-decoder models such as Flan-T5 outperform much larger causal decoder-only models on multi-hop question answering; performance depends on how retrieved documents are ordered, and decoder-only models improve when the causal mask is relaxed to allow bi-directional attention over the context.

AI-generated summary

Multi-hop Question Answering (MHQA) adds a layer of complexity to question answering: when Language Models (LMs) are prompted with multiple search results, they are tasked not only with retrieving relevant information but also with reasoning across the information sources. Although LMs perform well on traditional question-answering tasks, the causal mask can hinder their capacity to reason across complex contexts. In this paper, we explore how LMs respond to multi-hop questions by permuting search results (retrieved documents) under various configurations. Our study reveals the following findings: 1) encoder-decoder models, such as those in the Flan-T5 family, generally outperform causal decoder-only LMs on MHQA tasks despite being significantly smaller; 2) altering the order of gold documents reveals distinct trends in both Flan-T5 models and fine-tuned decoder-only models, with optimal performance when the document order aligns with the reasoning-chain order; 3) enhancing causal decoder-only models with bi-directional attention by modifying the causal mask can effectively boost their end performance. Beyond these findings, we conduct a thorough investigation of the distribution of LM attention weights in the context of MHQA. Our experiments reveal that attention weights tend to peak at higher values when the resulting answer is correct, and we leverage this finding to heuristically improve LMs' performance on this task. Our code is publicly available at https://github.com/hwy9855/MultiHopQA-Reasoning.
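To make the permutation setup concrete, below is a minimal sketch (ours, not the authors' code) of the core manipulation: permute the order of retrieved documents in the prompt and query the model under each ordering. The Flan-T5 checkpoint, prompt template, and toy two-hop question are illustrative assumptions.

```python
from itertools import permutations

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-large"  # assumed checkpoint; the paper studies the Flan-T5 family
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Toy two-hop question: answering it requires chaining both documents.
question = "Who directed the film that won Best Picture in 1998?"
documents = [
    "Titanic won the Academy Award for Best Picture in 1998.",
    "Titanic was directed by James Cameron.",
]

# Query the model under every ordering of the (gold) documents and compare answers.
for order in permutations(range(len(documents))):
    context = "\n".join(
        f"Document {i + 1}: {documents[j]}" for i, j in enumerate(order)
    )
    prompt = f"{context}\nQuestion: {question}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=16)
    print(order, "->", tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Finding 2) above predicts that the ordering in which documents follow the reasoning chain (award first, director second) should be the most reliable.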

Community

Paper author and submitter (hwy9855):

As large language models increasingly power AI assistants, their ability to reason across multiple pieces of evidence is a crucial yet unresolved challenge. In real-world retrieval-augmented generation (RAG) scenarios, simply changing the order of supporting documents can mean the difference between success and failure. In this paper, we rigorously analyse how various types of LMs tackle multi-hop question answering when the context order and placement are changed, revealing which architectures and strategies excel—and why. Most notably, we show that identifying and leveraging models’ attention patterns can unlock significant gains, offering practical guidance for building smarter, more reliable multi-hop QA systems.
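As a rough illustration of how the attention-peak observation might be operationalised (a sketch under our own assumptions, not the heuristic implemented in the paper's repository): generate an answer for each candidate document ordering, record the peak cross-attention weight placed on the context during generation, and keep the answer whose generation attends most sharply.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-large"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def answer_with_peak_attention(prompt: str):
    """Generate an answer plus the peak cross-attention weight over the input."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=16,
            output_attentions=True,
            return_dict_in_generate=True,
        )
    answer = tokenizer.decode(out.sequences[0], skip_special_tokens=True)
    # out.cross_attentions: one entry per generated token, each a tuple of
    # per-layer tensors of shape (batch, heads, 1, input_len).
    peak = max(layer.max().item() for step in out.cross_attentions for layer in step)
    return answer, peak

docs = [
    "Titanic won the Academy Award for Best Picture in 1998.",
    "Titanic was directed by James Cameron.",
]
question = "Who directed the film that won Best Picture in 1998?"
prompts = [
    "\n".join(docs) + f"\nQuestion: {question}\nAnswer:",
    "\n".join(reversed(docs)) + f"\nQuestion: {question}\nAnswer:",
]

# Keep the answer whose generation shows the sharpest attention peak, following
# the observation that higher peaks correlate with correct answers.
best_answer, best_peak = max(
    (answer_with_peak_attention(p) for p in prompts), key=lambda pair: pair[1]
)
print(best_answer, best_peak)
```

Whether the raw maximum over all layers and heads is the right statistic is itself a design choice; the paper analyses the attention-weight distribution in far more detail.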

Paper: https://arxiv.org/abs/2505.11754
Code: https://github.com/hwy9855/MultiHopQA-Reasoning

