|
---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
base_model: Shuu12121/CodeModernBERT-Owl
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- code_eval
model-index:
- name: SentenceTransformer based on Shuu12121/CodeModernBERT-Owl
  results:
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: code docstring dev
      type: code-docstring-dev
    metrics:
    - type: pearson_cosine
      value: null
      name: Pearson Cosine
    - type: spearman_cosine
      value: null
      name: Spearman Cosine
license: apache-2.0
datasets:
- code-search-net/code_search_net
- Shuu12121/java-codesearch-dataset-open
- Shuu12121/rust-codesearch-dataset-open
- google/code_x_glue_ct_code_to_text
language:
- en
---
|
|
|
|
|
|
|
|
|
|
|
# SentenceTransformer based on Shuu12121/CodeModernBERT-Owl🦉 |
|
|
|
|
|
|
|
This model is a **sentence-transformers** model fine-tuned from **[Shuu12121/CodeModernBERT-Owl](https://huggingface.co/Shuu12121/CodeModernBERT-Owl)**, which is a **ModernBERT model specifically designed for code, pre-trained from scratch by me**. |
|
**It is specifically designed for code search and efficiently calculates semantic similarity between code snippets and documentation.** |
|
One of the key features of this model is its **maximum sequence length of 2048 tokens**, which allows it to handle moderately long code snippets and documentation. |
|
Despite being a relatively small model of about **150 million parameters**, it performs strongly on code search tasks (see the evaluation results below).
|
|
|
|
|
|
|
|
--- |
|
|
|
### Model Evaluation
|
|
|
#### CoIR Evaluation Results
|
|
|
Despite its relatively small size of around **150M parameters**, this model scores **76.89** on the **CodeSearchNet** benchmark, demonstrating strong performance on code search tasks.

Because the model is specialized for code search, it does not support other tasks, and no evaluation scores for other tasks are reported.

As the comparison table below shows, it outperforms several well-known general-purpose embedding models on CodeSearchNet, although some larger code-specialized models (e.g., CodeSage-large-v2) still score higher.
|
|
|
| Model Name                                    | CodeSearchNet Score |
|-----------------------------------------------|---------------------|
| **Shuu12121/CodeModernBERT-Owl**              | **76.89**           |
| Salesforce/SFR-Embedding-Code-2B_R            | 73.5                |
| CodeSage-large-v2                             | 94.26               |
| Salesforce/SFR-Embedding-Code-400M_R          | 72.53               |
| CodeSage-large                                | 90.58               |
| Voyage-Code-002                               | 81.79               |
| E5-Mistral                                    | 54.25               |
| E5-Base-v2                                    | 67.99               |
| OpenAI-Ada-002                                | 74.21               |
| BGE-Base-en-v1.5                              | 69.6                |
| BGE-M3                                        | 43.23               |
| UniXcoder                                     | 60.2                |
| GTE-Base-en-v1.5                              | 43.35               |
| Contriever                                    | 34.72               |
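
For reference, the score above is a retrieval metric (CoIR reports NDCG@10). The snippet below is a minimal sketch of how a docstring-to-code retrieval evaluation can be wired up with sentence-transformers' built-in `InformationRetrievalEvaluator`; the queries, corpus, and relevance labels are toy placeholders, not the official CoIR protocol.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("Shuu12121/CodeSearch-ModernBERT-Owl")

# Toy placeholder data: docstring queries, a code corpus, and relevance labels
queries = {"q1": "Encrypts the zip file"}
corpus = {
    "c1": "def freeze_encrypt(dest_dir, zip_filename, config, opt): ...",
    "c2": "def transform(self, sents): ...",
}
relevant_docs = {"q1": {"c1"}}  # q1's relevant code snippet is c1

evaluator = InformationRetrievalEvaluator(
    queries, corpus, relevant_docs, name="toy-code-retrieval"
)
results = evaluator(model)  # dict of retrieval metrics (NDCG@k, MRR@k, ...)
print(results)
```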
|
|
|
--- |
|
|
|
### Model Details

- **Model Type:** Sentence Transformer
- **Base Model:** [Shuu12121/CodeModernBERT-Owl](https://huggingface.co/Shuu12121/CodeModernBERT-Owl)
- **Maximum Sequence Length:** 2048 tokens
- **Output Dimensions:** 768
- **Similarity Function:** Cosine Similarity
- **License:** Apache-2.0
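
These properties can be confirmed programmatically once the model is loaded; `max_seq_length` and `get_sentence_embedding_dimension()` are standard sentence-transformers APIs:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Shuu12121/CodeSearch-ModernBERT-Owl")

print(model.max_seq_length)                      # 2048
print(model.get_sentence_embedding_dimension())  # 768
```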
|
|
|
--- |
|
|
|
### Usage

#### Installation

To install Sentence Transformers, run the following command:
|
|
|
```bash |
|
pip install -U sentence-transformers |
|
``` |
|
|
|
#### Model Loading and Inference
|
|
|
```python |
|
from sentence_transformers import SentenceTransformer |
|
|
|
# Download and load the model
|
model = SentenceTransformer("Shuu12121/CodeSearch-ModernBERT-Owl") |
|
|
|
# Example inputs: a docstring and two code snippets
|
sentences = [ |
|
'Encrypts the zip file', |
|
'def freeze_encrypt(dest_dir, zip_filename, config, opt):\n \n pgp_keys = grok_keys(config)\n icefile_prefix = "aomi-%s" % \\\n os.path.basename(os.path.dirname(opt.secretfile))\n if opt.icefile_prefix:\n icefile_prefix = opt.icefile_prefix\n\n timestamp = time.strftime("%H%M%S-%m-%d-%Y",\n datetime.datetime.now().timetuple())\n ice_file = "%s/%s-%s.ice" % (dest_dir, icefile_prefix, timestamp)\n if not encrypt(zip_filename, ice_file, pgp_keys):\n raise aomi.exceptions.GPG("Unable to encrypt zipfile")\n\n return ice_file', |
|
'def transform(self, sents):\n \n\n def convert(tokens):\n return torch.tensor([self.vocab.stoi[t] for t in tokens], dtype=torch.long)\n\n if self.vocab is None:\n raise Exception(\n "Must run .fit() for .fit_transform() before " "calling .transform()."\n )\n\n seqs = sorted([convert(s) for s in sents], key=lambda x: -len(x))\n X = torch.LongTensor(pad_sequence(seqs, batch_first=True))\n return X', |
|
] |
|
|
|
# Generate embeddings
|
embeddings = model.encode(sentences) |
|
print(embeddings.shape)  # (3, 768)
|
|
|
# Compute pairwise similarity scores
|
similarities = model.similarity(embeddings, embeddings) |
|
print(similarities.shape)  # torch.Size([3, 3])
|
``` |
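
In a typical code search setting, a natural-language query is embedded together with a pool of candidate code snippets, and the candidates are ranked by cosine similarity. The sketch below illustrates this pattern; the query and snippets are made up for the example:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Shuu12121/CodeSearch-ModernBERT-Owl")

# Made-up query and candidate pool for illustration
query = "Read a file and return its contents as a string"
candidates = [
    "def read_file(path):\n    with open(path) as f:\n        return f.read()",
    "def write_file(path, text):\n    with open(path, 'w') as f:\n        f.write(text)",
    "def add(a, b):\n    return a + b",
]

# Embed the query and candidates, then rank by cosine similarity
query_emb = model.encode([query])
cand_embs = model.encode(candidates)
scores = model.similarity(query_emb, cand_embs)[0]  # shape: [len(candidates)]

for idx in scores.argsort(descending=True):
    i = int(idx)
    print(f"{scores[i].item():.4f}  {candidates[i].splitlines()[0]}")
```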
|
|
|
--- |
|
|
|
### Library Versions
|
|
|
- Python: 3.11.11 |
|
- Sentence Transformers: 3.4.1 |
|
- Transformers: 4.50.0 |
|
- PyTorch: 2.6.0+cu124 |
|
- Accelerate: 1.5.2 |
|
- Datasets: 3.4.1 |
|
- Tokenizers: 0.21.1 |
|
|
|
--- |
|
|
|
### Citation
|
|
|
#### Sentence Transformers |
|
```bibtex |
|
@inproceedings{reimers-2019-sentence-bert, |
|
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks", |
|
author = "Reimers, Nils and Gurevych, Iryna", |
|
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing", |
|
month = "11", |
|
year = "2019", |
|
publisher = "Association for Computational Linguistics", |
|
url = "https://arxiv.org/abs/1908.10084", |
|
} |
|
``` |
|
|
|
#### MultipleNegativesRankingLoss |
|
```bibtex |
|
@misc{henderson2017efficient, |
|
title={Efficient Natural Language Response Suggestion for Smart Reply}, |
|
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil}, |
|
year={2017}, |
|
eprint={1705.00652}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL} |
|
} |
|
``` |