---
license: mit
language:
- en
base_model:
- answerdotai/ModernBERT-base
pipeline_tag: token-classification
tags:
- token classification
- hallucination detection
- transformers
---

# LettuceDetect: Hallucination Detection Model

<p align="center">
  <img src="https://github.com/KRLabsOrg/LettuceDetect/blob/main/assets/lettuce_detective.png?raw=true" alt="LettuceDetect Logo" width="400"/>
</p>

**Model Name:** lettucedect-base-modernbert-en-v1  
**Organization:** KRLabsOrg  
**Github:** https://github.com/KRLabsOrg/LettuceDetect

## Overview

LettuceDetect is a transformer-based model for hallucination detection on context and answer pairs, designed for Retrieval-Augmented Generation (RAG) applications. The model is built on **ModernBERT**, chosen for its extended context support of up to **8192 tokens**. This long-context capability is critical for tasks where long, detailed documents must be processed to accurately determine whether an answer is supported by the provided context.

**This is our base model, built on ModernBERT-base.**

## Model Details

- **Architecture:** ModernBERT (base) with extended context support (up to 8192 tokens)
- **Task:** Token Classification / Hallucination Detection
- **Training Dataset:** RAGTruth
- **Language:** English

## How It Works

The model is trained to identify tokens in the answer text that are not supported by the given context. During inference, the model returns token-level predictions, which are then aggregated into spans. This allows users to see exactly which parts of the answer are considered hallucinated.
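As an illustration, the aggregation step can be thought of as merging runs of consecutive tokens whose hallucination probability exceeds a threshold into character-level spans. The sketch below is illustrative only, not the library's actual implementation; the function name, the 0.5 threshold, and the offset-mapping input are assumptions:

```python
# Illustrative sketch of token-to-span aggregation (not the library's
# actual code). Assumes per-token hallucination probabilities plus the
# (start, end) character offsets of each answer token, e.g. obtained from
# a Hugging Face tokenizer called with return_offsets_mapping=True.

def aggregate_spans(token_probs, offsets, threshold=0.5):
    """Merge consecutive tokens flagged as hallucinated into character spans."""
    spans = []
    current = None
    for prob, (start, end) in zip(token_probs, offsets):
        if prob >= threshold:
            if current is None:  # open a new span
                current = {"start": start, "end": end, "confidence": prob}
            else:  # extend the running span
                current["end"] = end
                current["confidence"] = max(current["confidence"], prob)
        elif current is not None:  # supported token: close the open span
            spans.append(current)
            current = None
    if current is not None:
        spans.append(current)
    return spans
```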

## Usage

### Installation

Install the `lettucedetect` package:

```bash
pip install lettucedetect
```

### Using the model

```python
from lettucedetect.models.inference import HallucinationDetector

# For a transformer-based approach:
detector = HallucinationDetector(
    method="transformer", model_path="KRLabsOrg/lettucedect-base-modernbert-en-v1"
)

contexts = ["France is a country in Europe. The capital of France is Paris. The population of France is 67 million.",]
question = "What is the capital of France? What is the population of France?"
answer = "The capital of France is Paris. The population of France is 69 million."

# Get span-level predictions indicating which parts of the answer are considered hallucinated.
predictions = detector.predict(context=contexts, question=question, answer=answer, output_format="spans")
print("Predictions:", predictions)

# Predictions: [{'start': 31, 'end': 71, 'confidence': 0.9944414496421814, 'text': ' The population of France is 69 million.'}]
```
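Because each predicted span carries character offsets into the answer, the flagged text can be recovered with plain string slicing. A small illustrative snippet, using only the output format shown above:

```python
# Highlight the hallucinated parts using the span offsets returned above.
for span in predictions:
    flagged = answer[span["start"]:span["end"]]
    print(f"Hallucinated ({span['confidence']:.2%}): {flagged.strip()!r}")
```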


## Performance

**Example-level results**

We evaluate our models on the test set of the [RAGTruth](https://aclanthology.org/2024.acl-long.585/) dataset. Our large model, **lettucedetect-large-v1**, achieves an overall F1 score of 79.22%, outperforming prompt-based methods such as GPT-4 (63.4%) and encoder-based models such as [Luna](https://aclanthology.org/2025.coling-industry.34.pdf) (65.4%). It also surpasses the fine-tuned LLAMA-2-13B (78.7%) presented in [RAGTruth](https://aclanthology.org/2024.acl-long.585/) and is competitive with the SOTA fine-tuned LLAMA-3-8B (83.9%) presented in the [RAG-HAT paper](https://aclanthology.org/2024.emnlp-industry.113.pdf). Overall, **lettucedetect-large-v1** and **lettucedetect-base-v1** deliver strong accuracy while remaining lightweight and fast at inference time.

The example-level results are shown in the table below.

<p align="center">
  <img src="https://github.com/KRLabsOrg/LettuceDetect/blob/main/assets/example_level_lettucedetect.png?raw=true" alt="Example-level Results" width="800"/>
</p>

**Span-level results**

At the span level, our model achieves the best scores across all data types, significantly outperforming previous models; the results are shown in the table below. Note that we do not compare against models such as [RAG-HAT](https://aclanthology.org/2024.emnlp-industry.113.pdf) here, since that work reports no span-level evaluation.

<p align="center">
  <img src="https://github.com/KRLabsOrg/LettuceDetect/blob/main/assets/span_level_lettucedetect.png?raw=true" alt="Span-level Results" width="800"/>
</p>

## Citing

If you use the model or the tool, please cite the following paper:

```bibtex
@misc{Kovacs:2025,
      title={LettuceDetect: A Hallucination Detection Framework for RAG Applications}, 
      author={Ádám Kovács and Gábor Recski},
      year={2025},
      eprint={2502.17125},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.17125}, 
}
```