|
---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
base_model: Shuu12121/CodeModernBERT-Owl
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- code_eval
model-index:
- name: SentenceTransformer based on Shuu12121/CodeModernBERT-Owl
  results:
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: code docstring dev
      type: code-docstring-dev
    metrics:
    - type: pearson_cosine
      value: null
      name: Pearson Cosine
    - type: spearman_cosine
      value: null
      name: Spearman Cosine
license: apache-2.0
datasets:
- code-search-net/code_search_net
- Shuu12121/java-codesearch-dataset-open
- Shuu12121/rust-codesearch-dataset-open
- google/code_x_glue_ct_code_to_text
language:
- en
---
|
|
|
|
|
|
|
|
|
|
|
# SentenceTransformer based on Shuu12121/CodeModernBERT-Owl🦉 |
|
|
|
|
|
|
|
This model is a **sentence-transformers** model fine-tuned from **[Shuu12121/CodeModernBERT-Owl](https://huggingface.co/Shuu12121/CodeModernBERT-Owl)**, which is a **ModernBERT model specifically designed for code, pre-trained from scratch by me**. |
|
**It is specifically designed for code search and efficiently calculates semantic similarity between code snippets and documentation.** |
|
One of the key features of this model is its **maximum sequence length of 2048 tokens**, which allows it to handle moderately long code snippets and documentation. |
|
Despite being a relatively small model of about **150 million parameters**, it performs strongly on code search tasks (see the evaluation results below).
|
|
|
|
|
|
|
|
--- |
|
|
|
### Model Evaluation
|
|
|
#### CoIR Evaluation Results
|
|
|
Despite its relatively small size of around **150M parameters**, this model scores **76.89** on the **CodeSearchNet** benchmark, demonstrating strong performance on code search tasks.

Because the model is specialized for code search, it does not support other tasks, and no evaluation scores for other tasks are reported.

As the comparison table below shows, it outperforms several well-known general-purpose embedding models on CodeSearchNet, although some larger code-specialized models (e.g., CodeSage-large-v2) still score higher.
|
|
|
| Model Name                                    | CodeSearchNet Score |
|-----------------------------------------------|---------------------|
| **Shuu12121/CodeModernBERT-Owl**              | **76.89**           |
| Salesforce/SFR-Embedding-Code-2B_R            | 73.5                |
| CodeSage-large-v2                             | 94.26               |
| Salesforce/SFR-Embedding-Code-400M_R          | 72.53               |
| CodeSage-large                                | 90.58               |
| Voyage-Code-002                               | 81.79               |
| E5-Mistral                                    | 54.25               |
| E5-Base-v2                                    | 67.99               |
| OpenAI-Ada-002                                | 74.21               |
| BGE-Base-en-v1.5                              | 69.6                |
| BGE-M3                                        | 43.23               |
| UniXcoder                                     | 60.2                |
| GTE-Base-en-v1.5                              | 43.35               |
| Contriever                                    | 34.72               |
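
For reference, the score above is a retrieval metric (CoIR reports NDCG@10). The snippet below is a minimal sketch of how a docstring-to-code retrieval evaluation can be wired up with sentence-transformers' built-in `InformationRetrievalEvaluator`; the queries, corpus, and relevance labels are toy placeholders, not the official CoIR protocol.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("Shuu12121/CodeSearch-ModernBERT-Owl")

# Toy placeholder data: docstring queries, a code corpus, and relevance labels
queries = {"q1": "Encrypts the zip file"}
corpus = {
    "c1": "def freeze_encrypt(dest_dir, zip_filename, config, opt): ...",
    "c2": "def transform(self, sents): ...",
}
relevant_docs = {"q1": {"c1"}}  # q1's relevant code snippet is c1

evaluator = InformationRetrievalEvaluator(
    queries, corpus, relevant_docs, name="toy-code-retrieval"
)
results = evaluator(model)  # dict of retrieval metrics (NDCG@k, MRR@k, ...)
print(results)
```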
|
|
|
--- |
|
|
|
### Model Details

- **Model Type:** Sentence Transformer
- **Base Model:** [Shuu12121/CodeModernBERT-Owl](https://huggingface.co/Shuu12121/CodeModernBERT-Owl)
- **Maximum Sequence Length:** 2048 tokens
- **Output Dimensions:** 768
- **Similarity Function:** Cosine Similarity
- **License:** Apache-2.0
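
These properties can be confirmed programmatically once the model is loaded; `max_seq_length` and `get_sentence_embedding_dimension()` are standard sentence-transformers APIs:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Shuu12121/CodeSearch-ModernBERT-Owl")

print(model.max_seq_length)                      # 2048
print(model.get_sentence_embedding_dimension())  # 768
```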
|
|
|
--- |
|
|
|
### Usage

#### Installation

To install Sentence Transformers, run the following command:
|
|
|
```bash |
|
pip install -U sentence-transformers |
|
``` |
|
|
|
#### Model Loading and Inference
|
|
|
```python |
|
from sentence_transformers import SentenceTransformer |
|
|
|
# Download and load the model
|
model = SentenceTransformer("Shuu12121/CodeSearch-ModernBERT-Owl") |
|
|
|
# Example inputs: a docstring and two code snippets
|
sentences = [ |
|
'Encrypts the zip file', |
|
'def freeze_encrypt(dest_dir, zip_filename, config, opt):\n \n pgp_keys = grok_keys(config)\n icefile_prefix = "aomi-%s" % \\\n os.path.basename(os.path.dirname(opt.secretfile))\n if opt.icefile_prefix:\n icefile_prefix = opt.icefile_prefix\n\n timestamp = time.strftime("%H%M%S-%m-%d-%Y",\n datetime.datetime.now().timetuple())\n ice_file = "%s/%s-%s.ice" % (dest_dir, icefile_prefix, timestamp)\n if not encrypt(zip_filename, ice_file, pgp_keys):\n raise aomi.exceptions.GPG("Unable to encrypt zipfile")\n\n return ice_file', |
|
'def transform(self, sents):\n \n\n def convert(tokens):\n return torch.tensor([self.vocab.stoi[t] for t in tokens], dtype=torch.long)\n\n if self.vocab is None:\n raise Exception(\n "Must run .fit() for .fit_transform() before " "calling .transform()."\n )\n\n seqs = sorted([convert(s) for s in sents], key=lambda x: -len(x))\n X = torch.LongTensor(pad_sequence(seqs, batch_first=True))\n return X', |
|
] |
|
|
|
# Generate embeddings
|
embeddings = model.encode(sentences) |
|
print(embeddings.shape)  # (3, 768)
|
|
|
# Compute pairwise similarity scores
|
similarities = model.similarity(embeddings, embeddings) |
|
print(similarities.shape)  # torch.Size([3, 3])
|
``` |
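
In a typical code search setting, a natural-language query is embedded together with a pool of candidate code snippets, and the candidates are ranked by cosine similarity. The sketch below illustrates this pattern; the query and snippets are made up for the example:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Shuu12121/CodeSearch-ModernBERT-Owl")

# Made-up query and candidate pool for illustration
query = "Read a file and return its contents as a string"
candidates = [
    "def read_file(path):\n    with open(path) as f:\n        return f.read()",
    "def write_file(path, text):\n    with open(path, 'w') as f:\n        f.write(text)",
    "def add(a, b):\n    return a + b",
]

# Embed the query and candidates, then rank by cosine similarity
query_emb = model.encode([query])
cand_embs = model.encode(candidates)
scores = model.similarity(query_emb, cand_embs)[0]  # shape: [len(candidates)]

for idx in scores.argsort(descending=True):
    i = int(idx)
    print(f"{scores[i].item():.4f}  {candidates[i].splitlines()[0]}")
```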
|
|
|
--- |
|
|
|
### Library Versions
|
|
|
- Python: 3.11.11 |
|
- Sentence Transformers: 3.4.1 |
|
- Transformers: 4.50.0 |
|
- PyTorch: 2.6.0+cu124 |
|
- Accelerate: 1.5.2 |
|
- Datasets: 3.4.1 |
|
- Tokenizers: 0.21.1 |
|
|
|
--- |
|
|
|
### Citation
|
|
|
#### Sentence Transformers |
|
```bibtex |
|
@inproceedings{reimers-2019-sentence-bert, |
|
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks", |
|
author = "Reimers, Nils and Gurevych, Iryna", |
|
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing", |
|
month = "11", |
|
year = "2019", |
|
publisher = "Association for Computational Linguistics", |
|
url = "https://arxiv.org/abs/1908.10084", |
|
} |
|
``` |
|
|
|
#### MultipleNegativesRankingLoss |
|
```bibtex |
|
@misc{henderson2017efficient, |
|
title={Efficient Natural Language Response Suggestion for Smart Reply}, |
|
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil}, |
|
year={2017}, |
|
eprint={1705.00652}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL} |
|
} |
|
``` |