---
tags:
- ColBERT
- PyLate
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:909188
- loss:Contrastive
base_model: EuroBERT/EuroBERT-210m
datasets:
- baconnier/rag-comprehensive-triplets
pipeline_tag: sentence-similarity
library_name: PyLate
metrics:
- accuracy
model-index:
- name: PyLate model based on EuroBERT/EuroBERT-210m
  results:
  - task:
      type: col-berttriplet
      name: Col BERTTriplet
    dataset:
      name: Unknown
      type: unknown
    metrics:
    - type: accuracy
      value: 0.9848384857177734
      name: Accuracy
license: apache-2.0
language:
- es
- en
---
[<img src="https://cdn-avatars.huggingface.co/v1/production/uploads/67b2f4e49edebc815a3a4739/R1g957j1aBbx8lhZbWmxw.jpeg" width="200"/>](https://huggingface.co/fjmgAI)

## Fine-Tuned Model

**`fjmgAI/col1-210M-EuroBERT`**

## Base Model
**`EuroBERT/EuroBERT-210m`**

## Fine-Tuning Method
Fine-tuning was performed with **[PyLate](https://github.com/lightonai/pylate)** using contrastive training on the [rag-comprehensive-triplets](https://huggingface.co/datasets/baconnier/rag-comprehensive-triplets) dataset. The resulting model maps sentences and paragraphs to sequences of 128-dimensional dense token vectors and scores semantic textual similarity with the MaxSim operator.
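
As a quick illustration of what the model produces, the sketch below encodes a query and a document into token-level embeddings with PyLate's `encode` method; the example texts are placeholders.

```python
from pylate import models

# Load the fine-tuned ColBERT model
model = models.ColBERT("fjmgAI/col1-210M-EuroBERT", trust_remote_code=True)

# Queries and documents are encoded differently (query-side expansion),
# so is_query must be set accordingly.
query_embeddings = model.encode(["¿Quién escribió Don Quijote?"], is_query=True)
doc_embeddings = model.encode(["Don Quijote fue escrito por Miguel de Cervantes."], is_query=False)

# Each text becomes a sequence of token-level vectors of dimension 128.
print(query_embeddings[0].shape)  # (num_query_tokens, 128)
print(doc_embeddings[0].shape)    # (num_doc_tokens, 128)
```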

## Dataset
**[`baconnier/rag-comprehensive-triplets`](https://huggingface.co/datasets/baconnier/rag-comprehensive-triplets)**

### Description
This dataset was filtered to its Spanish-language subset, yielding **303,000 examples** of (query, positive document, negative document) triplets for RAG-oriented contrastive training.
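
A minimal sketch of reproducing the Spanish subset with the `datasets` library; the `language` column name is an assumption, so check the dataset card for the actual schema.

```python
from datasets import load_dataset

# Load the full triplet dataset (909,188 examples across languages)
dataset = load_dataset("baconnier/rag-comprehensive-triplets", split="train")

# Keep only Spanish rows; "language" is a hypothetical column name and
# may differ in the actual dataset schema.
spanish = dataset.filter(lambda row: row["language"] == "es")
print(len(spanish))  # roughly 303,000 examples expected
```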

## Fine-Tuning Details
- The model was trained with **contrastive training** on (query, positive, negative) triplets.
- Evaluated with <code>pylate.evaluation.colbert_triplet.ColBERTTripletEvaluator</code> (a usage sketch follows the table below).

| Metric       | Value      |
|:-------------|:-----------|
| **accuracy** | **0.9848** |
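
The following is a minimal sketch of running the triplet evaluator yourself; it assumes the evaluator follows PyLate's standard `anchors`/`positives`/`negatives` interface, and the toy triplet is only illustrative.

```python
from pylate import evaluation, models

model = models.ColBERT("fjmgAI/col1-210M-EuroBERT", trust_remote_code=True)

# Toy triplet; in practice these lists come from a held-out evaluation split.
evaluator = evaluation.ColBERTTripletEvaluator(
    anchors=["¿Cuál es la capital de España?"],
    positives=["La capital de España es Madrid."],
    negatives=["Florida es un estado en los Estados Unidos."],
)

# Returns a dict of metrics, including triplet accuracy.
print(evaluator(model))
```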

## Usage
First install the PyLate library:

```bash
pip install -U pylate
```
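
### Index and Retrieve

For first-stage retrieval over a document collection, PyLate provides index and retriever classes. The sketch below follows the generic PyLate workflow; the index folder, document ids, and example texts are placeholders.

```python
from pylate import indexes, models, retrieve

model = models.ColBERT("fjmgAI/col1-210M-EuroBERT", trust_remote_code=True)

# Build a Voyager index on disk (folder and index names are placeholders)
index = indexes.Voyager(index_folder="pylate-index", index_name="index", override=True)

documents_ids = ["1", "2"]
documents = [
    "La capital de España es Madrid.",
    "Florida es un estado en los Estados Unidos.",
]

# Encode the documents and add them to the index
documents_embeddings = model.encode(documents, is_query=False, show_progress_bar=True)
index.add_documents(documents_ids=documents_ids, documents_embeddings=documents_embeddings)

# Encode the queries and retrieve the top-k documents
retriever = retrieve.ColBERT(index=index)
queries_embeddings = model.encode(["¿Cuál es la capital de España?"], is_query=True)
scores = retriever.retrieve(queries_embeddings=queries_embeddings, k=2)
print(scores)  # one list of {"id", "score"} entries per query
```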

### Calculate Similarity

```python
import torch
from pylate import models

# Load the ColBERT model 
model = models.ColBERT("fjmgAI/col1-210M-EuroBERT", trust_remote_code=True)

# Move the model to GPU if available, otherwise use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Example data for similarity comparison
query = "¿Cuál es la capital de España?"  # Query sentence
positive_doc = "La capital de España es Madrid."  # Relevant document
negative_doc = "Florida es un estado en los Estados Unidos."  # Irrelevant document
sentences = [query, positive_doc, negative_doc]  # Combine all texts

# Tokenize the input sentences using ColBERT's tokenizer
inputs = model.tokenize(sentences)

# Move all input tensors to the same device as the model (GPU/CPU)
inputs = {key: value.to(device) for key, value in inputs.items()}

# Generate token embeddings (no gradients needed for inference)
with torch.no_grad():
    embeddings_dict = model(inputs)  
    embeddings = embeddings_dict['token_embeddings']

# Define ColBERT's MaxSim similarity function
def colbert_similarity(query_emb, doc_emb):
    """
    Computes ColBERT-style similarity between query and document embeddings.
    Uses maximum similarity (MaxSim) between individual tokens.
    
    Args:
        query_emb: [query_tokens, embedding_dim]
        doc_emb: [doc_tokens, embedding_dim]
    
    Returns:
        Normalized similarity score
    """
    # Compute dot product between all token pairs
    similarity_matrix = torch.matmul(query_emb, doc_emb.T)  
    
    # Get maximum similarity for each query token (MaxSim)
    max_similarities = similarity_matrix.max(dim=1)[0]
    
    # Return average of maximum similarities (normalized by query length)
    return max_similarities.sum() / query_emb.shape[0]

# Extract embeddings for each text
query_emb = embeddings[0]  
positive_emb = embeddings[1]  
negative_emb = embeddings[2]

# Compute similarity scores
positive_score = colbert_similarity(query_emb, positive_emb)
negative_score = colbert_similarity(query_emb, negative_emb)

print(f"Similarity with positive document: {positive_score.item():.4f}")
print(f"Similarity with negative document: {negative_score.item():.4f}")
```
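
### Rerank Candidate Documents

If candidate documents already come from a first-stage retriever (e.g. BM25), the model can rerank them by MaxSim score. This is a minimal sketch based on PyLate's `rank.rerank` helper; the ids and texts are placeholders.

```python
from pylate import models, rank

model = models.ColBERT("fjmgAI/col1-210M-EuroBERT", trust_remote_code=True)

queries = ["¿Cuál es la capital de España?"]
documents = [["La capital de España es Madrid.", "Florida es un estado en los Estados Unidos."]]
documents_ids = [["doc_1", "doc_2"]]

queries_embeddings = model.encode(queries, is_query=True)
documents_embeddings = model.encode(documents, is_query=False)

# Each query's candidates are returned sorted by MaxSim score
reranked = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
print(reranked)
```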

## Framework Versions
- Python: 3.10.12
- Sentence Transformers: 3.4.1
- PyLate: 1.1.7
- Transformers: 4.48.2
- PyTorch: 2.5.1+cu121
- Accelerate: 1.2.1
- Datasets: 3.3.1
- Tokenizers: 0.21.0

## Purpose
This fine-tuned model is designed for **Spanish-language applications** that require **efficient semantic search**: it compares embeddings at the token level with the MaxSim operation, making it well suited for **question answering and document retrieval**.


- **Developed by:** fjmgAI
- **License:** apache-2.0

[<img src="https://github.com/lightonai/pylate/blob/main/docs/img/logo.png?raw=true" width="200"/>](https://github.com/lightonai/pylate)