---
tags:
- ColBERT
- PyLate
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:909188
- loss:Contrastive
base_model: EuroBERT/EuroBERT-610m
datasets:
- baconnier/rag-comprehensive-triplets
pipeline_tag: sentence-similarity
library_name: PyLate
metrics:
- accuracy
model-index:
- name: PyLate model based on EuroBERT/EuroBERT-610m
  results:
  - task:
      type: col-berttriplet
      name: Col BERTTriplet
    dataset:
      name: Unknown
      type: unknown
    metrics:
    - type: accuracy
      value: 0.9841766953468323
      name: Accuracy
license: apache-2.0
language:
- es
- en
---

## Fine-Tuned Model

**`raialvaro/colbert-610M-EuroBERT`**

## Base Model

**`EuroBERT/EuroBERT-610m`**

## Fine-Tuning Method

Fine-tuning was performed with **[PyLate](https://github.com/lightonai/pylate)**, using contrastive training on the [rag-comprehensive-triplets](https://huggingface.co/datasets/baconnier/rag-comprehensive-triplets) dataset. The model maps sentences and paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity via the MaxSim operator.

## Dataset

**[`baconnier/rag-comprehensive-triplets`](https://huggingface.co/datasets/baconnier/rag-comprehensive-triplets)**

### Description

The dataset was filtered to its Spanish-language subset, yielding **303,000 examples** of (query, positive, negative) triplets for RAG-style training.

## Fine-Tuning Details

- The model was trained with **contrastive training** (Contrastive loss).
- Evaluated with `pylate.evaluation.colbert_triplet.ColBERTTripletEvaluator`.

| Metric       | Value       |
|:-------------|:------------|
| **accuracy** | **0.98417** |

## Usage

First install the PyLate library:

```bash
pip install -U pylate
```

### Calculate Similarity

```python
import torch

from pylate import models

# Load the ColBERT model
model = models.ColBERT("raialvaro/colbert-610M-EuroBERT", trust_remote_code=True)

# Move the model to GPU if available, otherwise use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Example data for similarity comparison
query = "¿Cuál es la capital de España?"  # Query sentence
positive_doc = "La capital de España es Madrid."  # Relevant document
negative_doc = "Florida es un estado en los Estados Unidos."  # Irrelevant document
sentences = [query, positive_doc, negative_doc]  # Combine all texts

# Tokenize the input sentences using ColBERT's tokenizer
inputs = model.tokenize(sentences)

# Move all input tensors to the same device as the model (GPU/CPU)
inputs = {key: value.to(device) for key, value in inputs.items()}

# Generate token embeddings (no gradients needed for inference)
with torch.no_grad():
    embeddings_dict = model(inputs)
    embeddings = embeddings_dict["token_embeddings"]


# Define ColBERT's MaxSim similarity function
def colbert_similarity(query_emb, doc_emb):
    """
    Computes ColBERT-style similarity between query and document embeddings.
    Uses maximum similarity (MaxSim) between individual tokens.

    Args:
        query_emb: [query_tokens, embedding_dim]
        doc_emb: [doc_tokens, embedding_dim]

    Returns:
        Normalized similarity score
    """
    # Compute dot product between all token pairs
    similarity_matrix = torch.matmul(query_emb, doc_emb.T)

    # Get maximum similarity for each query token (MaxSim)
    max_similarities = similarity_matrix.max(dim=1)[0]

    # Return average of maximum similarities (normalized by query length)
    return max_similarities.sum() / query_emb.shape[0]


# Extract embeddings for each text
query_emb = embeddings[0]
positive_emb = embeddings[1]
negative_emb = embeddings[2]

# Compute similarity scores
positive_score = colbert_similarity(query_emb, positive_emb)
negative_score = colbert_similarity(query_emb, negative_emb)

print(f"Similarity with positive document: {positive_score.item():.4f}")
print(f"Similarity with negative document: {negative_score.item():.4f}")
```
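### Indexing and Retrieval

Beyond pairwise scoring, a ColBERT model is typically paired with an index for first-stage retrieval. The following is a minimal sketch based on PyLate's documented `indexes`/`retrieve` API; the index folder, document IDs, and corpus contents are illustrative assumptions, not part of this model's training setup:

```python
from pylate import indexes, models, retrieve

# Load the same fine-tuned checkpoint as above
model = models.ColBERT("raialvaro/colbert-610M-EuroBERT", trust_remote_code=True)

# Create a Voyager index (PyLate's HNSW-based index); folder and name are illustrative
index = indexes.Voyager(index_folder="pylate-index", index_name="index", override=True)
retriever = retrieve.ColBERT(index=index)

# Illustrative corpus
documents_ids = ["doc1", "doc2"]
documents = [
    "La capital de España es Madrid.",
    "Florida es un estado en los Estados Unidos.",
]

# Encode documents (is_query=False selects document-side encoding) and add them to the index
documents_embeddings = model.encode(documents, is_query=False)
index.add_documents(documents_ids=documents_ids, documents_embeddings=documents_embeddings)

# Encode the query and retrieve the top-k documents by MaxSim score
queries_embeddings = model.encode(["¿Cuál es la capital de España?"], is_query=True)
scores = retriever.retrieve(queries_embeddings=queries_embeddings, k=2)
print(scores)
```

The index only needs to be built once; subsequent queries reuse the stored document embeddings.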
## Framework Versions

- Python: 3.10.12
- Sentence Transformers: 3.4.1
- PyLate: 1.1.7
- Transformers: 4.48.2
- PyTorch: 2.5.1+cu121
- Accelerate: 1.2.1
- Datasets: 3.3.1
- Tokenizers: 0.21.0

## Purpose

This fine-tuned model is designed for **Spanish-language applications** that require **efficient semantic search**. It compares embeddings at the token level with the MaxSim operation, making it well suited to **question answering and document retrieval** (a reranking sketch follows at the end of this card).

- **Developed by:** raialvaro
- **License:** apache-2.0
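### Reranking

As an additional usage example, PyLate also provides a reranking helper for scoring a candidate set retrieved by another system (e.g., BM25). This is a minimal sketch based on PyLate's documented `rank.rerank` function; the queries, documents, and IDs below are illustrative:

```python
from pylate import models, rank

model = models.ColBERT("raialvaro/colbert-610M-EuroBERT", trust_remote_code=True)

queries = ["¿Cuál es la capital de España?"]
# One candidate list per query, e.g., from a first-stage retriever; contents are illustrative
documents = [[
    "La capital de España es Madrid.",
    "Florida es un estado en los Estados Unidos.",
]]
documents_ids = [["doc1", "doc2"]]

# Encode queries and candidate documents separately
queries_embeddings = model.encode(queries, is_query=True)
documents_embeddings = model.encode(documents, is_query=False)

# Rerank each query's candidates by ColBERT MaxSim score
reranked = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
print(reranked)
```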