---
tags:
- ColBERT
- PyLate
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:909188
- loss:Contrastive
base_model: EuroBERT/EuroBERT-210m
datasets:
- baconnier/rag-comprehensive-triplets
pipeline_tag: sentence-similarity
library_name: PyLate
metrics:
- accuracy
model-index:
- name: PyLate model based on EuroBERT/EuroBERT-210m
results:
- task:
type: col-berttriplet
name: Col BERTTriplet
dataset:
name: Unknown
type: unknown
metrics:
- type: accuracy
value: 0.9848384857177734
name: Accuracy
license: apache-2.0
language:
- es
- en
---
[<img src="https://cdn-avatars.huggingface.co/v1/production/uploads/67b2f4e49edebc815a3a4739/R1g957j1aBbx8lhZbWmxw.jpeg" width="200"/>](https://huggingface.co/fjmgAI)
## Fine-Tuned Model
**`fjmgAI/col1-210M-EuroBERT`**
## Base Model
**`EuroBERT/EuroBERT-210m`**
## Fine-Tuning Method
Fine-tuning was performed using **[PyLate](https://github.com/lightonai/pylate)**, with contrastive training on the [rag-comprehensive-triplets](https://huggingface.co/datasets/baconnier/rag-comprehensive-triplets) dataset. The model maps sentences and paragraphs to sequences of 128-dimensional dense token vectors and can be used for semantic textual similarity via the MaxSim operator.
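The MaxSim operator mentioned above scores a query against a document by taking, for each query token vector, its best dot product against any document token vector, then averaging over query tokens. A minimal, dependency-free sketch of that math (the token vectors here are made-up 2-dimensional toys, not real model outputs):

```python
# Dependency-free illustration of ColBERT's MaxSim operator.
# Each text is a list of token vectors; the score is the mean over
# query tokens of the best dot product against any document token.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim(query_tokens, doc_tokens):
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens) / len(query_tokens)

query = [[1.0, 0.0], [0.0, 1.0]]   # two query token vectors (toy values)
doc   = [[1.0, 0.0], [0.5, 0.5]]   # two document token vectors (toy values)

print(maxsim(query, doc))  # (1.0 + 0.5) / 2 = 0.75
```

In the real model the vectors are the 128-dimensional token embeddings produced by the ColBERT head, but the scoring rule is the same.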
## Dataset
**[`baconnier/rag-comprehensive-triplets`](https://huggingface.co/datasets/baconnier/rag-comprehensive-triplets)**
### Description
The dataset was filtered to its Spanish-language subset, leaving **303,000 examples** of comprehensive RAG triplets (query, relevant document, irrelevant document) for contrastive training.
## Fine-Tuning Details
- The model was trained using **contrastive training**.
- Evaluated with <code>pylate.evaluation.colbert_triplet.ColBERTTripletEvaluator</code>
| Metric | Value |
|:-------------|:-----------|
| **accuracy** | **0.9848** |
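Triplet accuracy of this kind is typically computed as the fraction of (query, positive, negative) triplets where the positive document scores higher than the negative one. A small illustration of that metric (the score pairs below are made up for the example):

```python
# Triplet accuracy: fraction of triplets where the positive document
# outscores the negative one for the same query.

def triplet_accuracy(score_pairs):
    # score_pairs: list of (positive_score, negative_score) tuples
    return sum(pos > neg for pos, neg in score_pairs) / len(score_pairs)

scores = [(0.92, 0.31), (0.88, 0.45), (0.40, 0.60), (0.75, 0.20)]
print(triplet_accuracy(scores))  # 3 of 4 triplets ranked correctly -> 0.75
```

An accuracy of 0.9848 therefore means the model ranked the relevant document above the irrelevant one in roughly 98.5% of the evaluation triplets.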
## Usage
First install the PyLate library:
```bash
pip install -U pylate
```
### Calculate Similarity
```python
import torch
from pylate import models
# Load the ColBERT model
model = models.ColBERT("fjmgAI/col1-210M-EuroBERT", trust_remote_code=True)
# Move the model to GPU if available, otherwise use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
# Example data for similarity comparison
query = "¿Cuál es la capital de España?" # Query sentence
positive_doc = "La capital de España es Madrid." # Relevant document
negative_doc = "Florida es un estado en los Estados Unidos." # Irrelevant document
sentences = [query, positive_doc, negative_doc] # Combine all texts
# Tokenize the input sentences using ColBERT's tokenizer
inputs = model.tokenize(sentences)
# Move all input tensors to the same device as the model (GPU/CPU)
inputs = {key: value.to(device) for key, value in inputs.items()}
# Generate token embeddings (no gradients needed for inference)
with torch.no_grad():
embeddings_dict = model(inputs)
embeddings = embeddings_dict['token_embeddings']
# Define ColBERT's MaxSim similarity function
def colbert_similarity(query_emb, doc_emb):
"""
Computes ColBERT-style similarity between query and document embeddings.
Uses maximum similarity (MaxSim) between individual tokens.
Args:
query_emb: [query_tokens, embedding_dim]
doc_emb: [doc_tokens, embedding_dim]
Returns:
Normalized similarity score
"""
# Compute dot product between all token pairs
similarity_matrix = torch.matmul(query_emb, doc_emb.T)
# Get maximum similarity for each query token (MaxSim)
max_similarities = similarity_matrix.max(dim=1)[0]
# Return average of maximum similarities (normalized by query length)
return max_similarities.sum() / query_emb.shape[0]
# Extract embeddings for each text
query_emb = embeddings[0]
positive_emb = embeddings[1]
negative_emb = embeddings[2]
# Compute similarity scores
positive_score = colbert_similarity(query_emb, positive_emb)
negative_score = colbert_similarity(query_emb, negative_emb)
print(f"Similarity with positive document: {positive_score.item():.4f}")
print(f"Similarity with negative document: {negative_score.item():.4f}")
```
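The pairwise scoring above extends naturally to ranking a candidate set: score every document against the query and sort by the result. A dependency-free sketch of that pattern, using plain Python lists as stand-in token embeddings (the vectors and document names are hypothetical):

```python
# Sketch: rerank candidate documents by MaxSim score.
# Toy 2-D token vectors stand in for the model's 128-D embeddings.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim(query_tokens, doc_tokens):
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens) / len(query_tokens)

query = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    "madrid":  [[0.9, 0.1], [0.1, 0.9]],   # on-topic document
    "florida": [[0.2, 0.1], [0.1, 0.2]],   # off-topic document
}

# Sort document names by descending MaxSim score against the query
ranked = sorted(docs, key=lambda name: maxsim(query, docs[name]), reverse=True)
print(ranked)  # ['madrid', 'florida']
```

With the real model, you would replace the toy vectors with the `token_embeddings` produced in the snippet above; PyLate also ships higher-level indexing and retrieval helpers for larger corpora.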
## Framework Versions
- Python: 3.10.12
- Sentence Transformers: 3.4.1
- PyLate: 1.1.7
- Transformers: 4.48.2
- PyTorch: 2.5.1+cu121
- Accelerate: 1.2.1
- Datasets: 3.3.1
- Tokenizers: 0.21.0
## Purpose
This fine-tuned model is designed for **Spanish-language applications** that require **efficient semantic search**. It compares embeddings at the token level with the MaxSim operator, making it well suited for **question answering and document retrieval**.
- **Developed by:** fjmgAI
- **License:** apache-2.0
[<img src="https://github.com/lightonai/pylate/blob/main/docs/img/logo.png?raw=true" width="200"/>](https://github.com/lightonai/pylate)