VISAI-AI
/

nitibench-ccl-human-finetuned-bge-m3

@@ -1,144 +1,175 @@
 ---
-datasets: []
-language: []
 library_name: sentence-transformers
 pipeline_tag: sentence-similarity
 tags:
 - sentence-transformers
 - sentence-similarity
 - feature-extraction
-widget: []
 ---
-# SentenceTransformer
-This is a [sentence-transformers](https://www.SBERT.net) model trained. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
-## Model Details
-### Model Description
-- **Model Type:** Sentence Transformer
-<!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
-- **Maximum Sequence Length:** 8192 tokens
-- **Output Dimensionality:** 1024 tokens
-- **Similarity Function:** Cosine Similarity
-<!-- - **Training Dataset:** Unknown -->
-<!-- - **Language:** Unknown -->
-<!-- - **License:** Unknown -->
-### Model Sources
-- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
-- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
-- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
-### Full Model Architecture
 ```
-SentenceTransformer(
-  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
-  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
-  (2): Normalize()
-)
 ```
-## Usage
-### Direct Usage (Sentence Transformers)
-First install the Sentence Transformers library:
-```bash
-pip install -U sentence-transformers
 ```
-Then you can load this model and run inference.
 ```python
-from sentence_transformers import SentenceTransformer
-# Download from the 🤗 Hub
-model = SentenceTransformer("sentence_transformers_model_id")
-# Run inference
-sentences = [
-    'The weather is lovely today.',
-    "It's so sunny outside!",
-    'He drove to the stadium.',
-]
-embeddings = model.encode(sentences)
-print(embeddings.shape)
-# [3, 1024]
-# Get the similarity scores for the embeddings
-similarities = model.similarity(embeddings, embeddings)
-print(similarities.shape)
-# [3, 3]
-```
-<!--
-### Direct Usage (Transformers)
-<details><summary>Click to see the direct usage in Transformers</summary>
-</details>
--->
-<!--
-### Downstream Usage (Sentence Transformers)
-You can finetune this model on your own dataset.
-<details><summary>Click to expand</summary>
-</details>
--->
-<!--
-### Out-of-Scope Use
-*List how the model may foreseeably be misused and address what users ought not to do with the model.*
--->
-<!--
-## Bias, Risks and Limitations
-*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
--->
-<!--
-### Recommendations
-*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
--->
-## Training Details
-### Framework Versions
-- Python: 3.10.14
-- Sentence Transformers: 3.0.1
-- Transformers: 4.34.0
-- PyTorch: 2.1.0+cu121
-- Accelerate: 0.21.0
-- Datasets: 2.21.0
-- Tokenizers: 0.14.1
-## Citation
-### BibTeX
-<!--
-## Glossary
-*Clearly define terms in order to be accessible across audiences.*
--->
-<!--
-## Model Card Authors
-*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
--->
-<!--
-## Model Card Contact
-*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
--->

 ---
 library_name: sentence-transformers
 pipeline_tag: sentence-similarity
 tags:
 - sentence-transformers
 - sentence-similarity
 - feature-extraction
+license: mit
+datasets:
+- airesearch/WangchanX-Legal-ThaiCCL-RAG
+- VISAI-AI/nitibench
+language:
+- th
+base_model:
+- BAAI/bge-m3
 ---
+# Auto-Finetuned BGE-M3 CCL
+This is a [`BAAI/bge-m3`](https://huggingface.co/BAAI/bge-m3) model finetuned on [`airesearch/WangchanX-Legal-ThaiCCL-RAG`](https://huggingface.co/datasets/airesearch/WangchanX-Legal-ThaiCCL-RAG) queries.
+## Finetuning Details
+Unlike the original [`airesearch/WangchanX-Legal-ThaiCCL-RAG`](https://huggingface.co/datasets/airesearch/WangchanX-Legal-ThaiCCL-RAG) dataset, which requires humans to rerank and remove irrelevant documents, this model was finetuned in a completely automated setting.
+Specifically, given a query from the WangchanX-Legal-ThaiCCL-RAG dataset and a set of law sections to retrieve from, we use the following procedure:
+1. Use [`BAAI/bge-m3`](https://huggingface.co/BAAI/bge-m3) to retrieve N candidate positive law sections by applying a score threshold of 0.8.
+2. Rerank those N documents with [`BAAI/bge-reranker-v2-m3`](https://huggingface.co/BAAI/bge-reranker-v2-m3) and filter out any document whose reranker score is below 0.8; the remaining documents are the final positive law sections.
+3. Finetune the BGE-M3 model using the positives from step 2.
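The two filtering steps above can be sketched as follows. This is a minimal illustration rather than the actual training pipeline: the function name `select_positives` and the toy scores are hypothetical, and it assumes both the retriever and reranker scores are already available as numbers in the range 0 to 1, with a score exactly at the threshold being kept.

```python
import numpy as np

def select_positives(dense_scores, rerank_scores,
                     dense_threshold=0.8, rerank_threshold=0.8):
    """Return indices of law sections that pass both filters (hypothetical helper)."""
    dense_scores = np.asarray(dense_scores, dtype=float)
    rerank_scores = np.asarray(rerank_scores, dtype=float)
    # Step 1: keep candidates whose retriever score reaches the threshold.
    candidates = np.flatnonzero(dense_scores >= dense_threshold)
    # Step 2: among those candidates, drop any the reranker scores below its threshold.
    kept = candidates[rerank_scores[candidates] >= rerank_threshold]
    return kept.tolist()

# Made-up scores for one query against five law sections.
dense = [0.91, 0.85, 0.79, 0.88, 0.40]
rerank = [0.95, 0.60, 0.99, 0.82, 0.10]
positives = select_positives(dense, rerank)
print(positives)  # [0, 3]
```

Section 2 has a high reranker score but never reaches the reranker because it fails the first threshold, which is the main consequence of running the two filters in sequence.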
+## Model Performance
+| **Dataset**      | **Top-K** | **HR@k** | **Multi HR@k** | **Recall@k** | **MRR@k** | **Multi MRR@k** |
+|:----------------:|:---------:|:-------:|:-------------:|:-----------:|:--------:|:---------------:|
+| **NitiBench-CCL**    | 1         | 0.735   | –             | 0.735       | 0.735    | –               |
+| **NitiBench-CCL**    | 5         | 0.906   | –             | 0.906       | 0.805    | –               |
+| **NitiBench-CCL**    | 10        | 0.938   | –             | 0.938       | 0.809    | –               |
+| **NitiBench-Tax**| 1         | 0.480   | 0.140         | 0.255       | 0.480    | 0.255           |
+| **NitiBench-Tax**| 5         | 0.740   | 0.220         | 0.411       | 0.565    | 0.320           |
+| **NitiBench-Tax**| 10        | 0.800   | 0.280         | 0.499       | 0.574    | 0.333           |
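For readers unfamiliar with the column names, the sketch below shows the standard per-query definitions of HR@k, Recall@k, and MRR@k; the reported numbers would be averages over all queries. The function names are hypothetical and this is not the NitiBench evaluation code. The `multi_hit_rate_at_k` definition (a hit only when every gold section is retrieved) is an assumption about what Multi HR@k means, and Multi MRR@k is not sketched here.

```python
def hit_rate_at_k(ranked, gold, k):
    """1.0 if at least one gold section appears in the top-k, else 0.0."""
    return float(any(doc in gold for doc in ranked[:k]))

def recall_at_k(ranked, gold, k):
    """Fraction of gold sections that appear in the top-k."""
    return len(set(ranked[:k]) & set(gold)) / len(gold)

def mrr_at_k(ranked, gold, k):
    """Reciprocal rank of the first gold section within the top-k, else 0.0."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in gold:
            return 1.0 / rank
    return 0.0

def multi_hit_rate_at_k(ranked, gold, k):
    """ASSUMED definition: 1.0 only if all gold sections appear in the top-k."""
    return float(set(gold) <= set(ranked[:k]))

# Made-up ranking for one query with two gold sections.
ranked = ["s3", "s1", "s7", "s2"]
gold = {"s1", "s2"}
print(hit_rate_at_k(ranked, gold, 3))        # 1.0
print(recall_at_k(ranked, gold, 3))          # 0.5
print(mrr_at_k(ranked, gold, 3))             # 0.5
print(multi_hit_rate_at_k(ranked, gold, 3))  # 0.0
```

When every query has exactly one gold section, HR@k and Recall@k coincide, which is consistent with the identical HR@k and Recall@k values in the NitiBench-CCL rows.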
+## Usage
+Install:
+```
+git clone https://github.com/FlagOpen/FlagEmbedding.git
+cd FlagEmbedding
+pip install -e .
 ```
+or:
+```
+pip install -U FlagEmbedding
 ```
+### Generate Embedding for text
+- Dense Embedding
+```python
+from FlagEmbedding import BGEM3FlagModel
+model = BGEM3FlagModel('VISAI-AI/nitibench-ccl-human-finetuned-bge-m3',
+                       use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation
+sentences_1 = ["What is BGE M3?", "Defination of BM25"]
+sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
+               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]
+embeddings_1 = model.encode(sentences_1,
+                            batch_size=12,
+                            max_length=8192, # If you don't need such a long length, you can set a smaller value to speed up the encoding process.
+                            )['dense_vecs']
+embeddings_2 = model.encode(sentences_2)['dense_vecs']
+similarity = embeddings_1 @ embeddings_2.T
+print(similarity)
+# [[0.6265, 0.3477], [0.3499, 0.678 ]]
 ```
+You can also use sentence-transformers and Hugging Face transformers to generate dense embeddings.
+Refer to [baai_general_embedding](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/baai_general_embedding#usage) for details.
+- Sparse Embedding (Lexical Weight)
 ```python
+from FlagEmbedding import BGEM3FlagModel
+model = BGEM3FlagModel('VISAI-AI/nitibench-ccl-human-finetuned-bge-m3',  use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation
+sentences_1 = ["สถาบันทางการเงินสามารถลงทุนในหลักทรัพย์ เป็นอัตราส่วนร้อยละสิบของเงินกองทุนทั้งหมดของสถาบันการเงินนั้น สำหรับการถือหรือมีหุ้นในทุกบริษัทรวมกันได้หรือไม่?",
+               "ในกรณีที่ธนาคารแห่งประเทศไทยมีคำสั่งปิดกิจการของสถาบันการเงิน เนื่องสถาบันการเงินดำรงเงินกองทุนต่ำกว่าร้อยละสามสิบห้าของอัตราตามที่กำหนด จะต้องนำเสนอต่อบุคคลใดหรือหน่วยงานใดเพื่อเพิกถอนใบอนุญาตของสถาบันการเงินนั้น"]
+sentences_2 = ["พระราชบัญญัติธุรกิจสถาบันการเงิน พ.ศ. 2551 มาตรา 33 ภายใต้บังคับมาตรา 34 และมาตรา 35 ให้สถาบันการเงินลงทุนในหลักทรัพย์เพื่อเป็นกรรมสิทธิ์ของตนได้ ตามหลักเกณฑ์ที่ธนาคารแห่งประเทศไทยประกาศกำหนด",
+               "พระราชบัญญัติธุรกิจสถาบันการเงิน พ.ศ. 2551 มาตรา 97 ในกรณีที่สถาบันการเงินดำรงเงินกองทุนต่ำกว่าร้อยละสามสิบห้าของอัตราตามที่กำหนดในมาตรา 30 ให้ธนาคารแห่งประเทศไทยมีคำสั่งปิดกิจการของสถาบันการเงินนั้น เว้นแต่ในกรณีที่ธนาคารแห่งประเทศไทยเห็นว่าการมีคำสั่งปิดกิจการจะก่อให้เกิดผลกระทบ หรือความเสียหายต่อระบบเศรษฐกิจโดยรวมอย่างรุนแรง ธนาคารแห่งประเทศไทยอาจยังไม่สั่งปิดกิจการของสถาบันการเงินก็ได้\nเมื่อธนาคารแห่งประเทศไทยมีคำสั่งปิดกิจการตามวรรคหนึ่งแล้ว ให้เสนอรัฐมนตรีเพิกถอนใบอนุญาตของสถาบันการเงินนั้น"]
+output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=False)
+output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, return_colbert_vecs=False)
+# you can see the weight for each token:
+print(model.convert_id_to_token(output_1['lexical_weights']))
+# [{'สถาบัน': 0.126, 'การเงิน': 0.10956, 'สามารถ': 0.07, 'ลงทุน': 0.1417, 'ใน': 0.01715, 'หลัก': 0.0758, 'ทรัพย์': 0.1702, 'อัตรา': 0.04926, 'ส่วน': 0.06107, 'ร้อยละ': 0.09, 'สิบ': 0.14, 'เงิน': 0.05026, 'กองทุน': 0.1205, 'ทั้งหมด': 0.03644, 'ถือ': 0.0987, 'หุ้น': 0.0928, 'ในทุก': 0.04883, 'บริษัท': 0.0999, 'รวม': 0.0835, 'กันได้': 0.09814, 'หรือไม่': 0.0398},
+#  {'กรณี': 0.0323, 'ธนาคาร': 0.08136, 'แห่งประเทศไทย': 0.151, 'คําสั่ง': 0.161, 'ปิด': 0.1583, 'กิจการ': 0.1199, 'สถาบัน': 0.08545, 'การเงิน': 0.1334, 'เนื่อง': 0.006992, 'ดํารง': 0.1523, 'เงิน': 0.12146, 'กองทุน': 0.1776, 'ต่ํากว่า': 0.1335, 'ร้อยละ': 0.10126, 'สาม': 0.02498, 'ห้า': 0.1158, 'อัตรา': 0.12256, 'กําหนด': 0.0572, 'จะต้อง': 0.07074, 'นําเสนอ': 0.1752, 'ต่อ': 0.0696, 'บุคคล': 0.0817, 'ใด': 0.0577, 'หรือ': 0.0248, 'หน่วยงาน': 0.076, 'เพ': 0.02034, 'ิก': 0.0921, 'ถอน': 0.1582, 'ใบ': 0.04617, 'อนุญาต': 0.179}]
+# compute the scores via lexical matching
+lexical_scores = model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_2['lexical_weights'][0])
+print(lexical_scores)
+# 0.10838508605957031
+print(model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_1['lexical_weights'][1]))
+# 0.06803131103515625
+```
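To make the lexical score concrete, the sketch below assumes it is the sum, over tokens shared by both texts, of the product of their weights. The function `lexical_matching_score` is a hypothetical re-implementation for illustration, not the library's code. The dictionaries reuse the weights printed above for the six tokens the two queries share, plus one unshared token each to show that unshared tokens contribute nothing.

```python
def lexical_matching_score(weights_1, weights_2):
    """Sum of weight products over tokens present in both dictionaries."""
    return sum(w * weights_2[tok] for tok, w in weights_1.items() if tok in weights_2)

# Subset of the printed weights for the first query ('ลงทุน' is not shared).
query_1 = {'สถาบัน': 0.126, 'การเงิน': 0.10956, 'อัตรา': 0.04926,
           'ร้อยละ': 0.09, 'เงิน': 0.05026, 'กองทุน': 0.1205, 'ลงทุน': 0.1417}
# Subset of the printed weights for the second query ('ธนาคาร' is not shared).
query_2 = {'สถาบัน': 0.08545, 'การเงิน': 0.1334, 'อัตรา': 0.12256,
           'ร้อยละ': 0.10126, 'เงิน': 0.12146, 'กองทุน': 0.1776, 'ธนาคาร': 0.08136}

score = lexical_matching_score(query_1, query_2)
print(round(score, 4))  # 0.068
```

Under this assumption the result agrees with the printed query-to-query score of about 0.0680, up to the rounding of the displayed weights.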
+- Multi-Vector (ColBERT)
+```python
+from FlagEmbedding import BGEM3FlagModel
+model = BGEM3FlagModel('VISAI-AI/nitibench-ccl-human-finetuned-bge-m3',  use_fp16=True)
+sentences_1 = ["สถาบันทางการเงินสามารถลงทุนในหลักทรัพย์ เป็นอัตราส่วนร้อยละสิบของเงินกองทุนทั้งหมดของสถาบันการเงินนั้น สำหรับการถือหรือมีหุ้นในทุกบริษัทรวมกันได้หรือไม่?",
+               "ในกรณีที่ธนาคารแห่งประเทศไทยมีคำสั่งปิดกิจการของสถาบันการเงิน เนื่องสถาบันการเงินดำรงเงินกองทุนต่ำกว่าร้อยละสามสิบห้าของอัตราตามที่กำหนด จะต้องนำเสนอต่อบุคคลใดหรือหน่วยงานใดเพื่อเพิกถอนใบอนุญาตของสถาบันการเงินนั้น"]
+sentences_2 = ["พระราชบัญญัติธุรกิจสถาบันการเงิน พ.ศ. 2551 มาตรา 33 ภายใต้บังคับมาตรา 34 และมาตรา 35 ให้สถาบันการเงินลงทุนในหลักทรัพย์เพื่อเป็นกรรมสิทธิ์ของตนได้ ตามหลักเกณฑ์ที่ธนาคารแห่งประเทศไทยประกาศกำหนด",
+               "พระราชบัญญัติธุรกิจสถาบันการเงิน พ.ศ. 2551 มาตรา 97 ในกรณีที่สถาบันการเงินดำรงเงินกองทุนต่ำกว่าร้อยละสามสิบห้าของอัตราตามที่กำหนดในมาตรา 30 ให้ธนาคารแห่งประเทศไทยมีคำสั่งปิดกิจการของสถาบันการเงินนั้น เว้นแต่ในกรณีที่ธนาคารแห่งประเทศไทยเห็นว่าการมีคำสั่งปิดกิจการจะก่อให้เกิดผลกระทบ หรือความเสียหายต่อระบบเศรษฐกิจโดยรวมอย่างรุนแรง ธนาคารแห่งประเทศไทยอาจยังไม่สั่งปิดกิจการของสถาบันการเงินก็ได้\nเมื่อธนาคารแห่งประเทศไทยมีคำสั่งปิดกิจการตามวรรคหนึ่งแล้ว ให้เสนอรัฐมนตรีเพิกถอนใบอนุญาตของสถาบันการเงินนั้น"]
+output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=True)
+output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, return_colbert_vecs=True)
+print(model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][0]))
+print(model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][1]))
+# tensor(0.5813)
+# tensor(0.5718)
+```
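The multi-vector score can be understood as ColBERT-style late interaction: each query token vector is matched to its most similar passage token vector, and those maxima are averaged over the query tokens. The NumPy sketch below illustrates that idea with toy 2-dimensional vectors; `colbert_maxsim` is a hypothetical helper, and the assumption that the library averages (rather than sums) the per-token maxima is mine.

```python
import numpy as np

def colbert_maxsim(query_vecs, passage_vecs):
    """Mean over query tokens of the max dot product with any passage token."""
    query_vecs = np.asarray(query_vecs, dtype=float)
    passage_vecs = np.asarray(passage_vecs, dtype=float)
    # Pairwise dot products, shape (n_query_tokens, n_passage_tokens).
    token_scores = query_vecs @ passage_vecs.T
    # Best-matching passage token for each query token, then average.
    return float(token_scores.max(axis=1).mean())

# Toy unit vectors: 2 query tokens, 3 passage tokens.
q = np.array([[1.0, 0.0], [0.0, 1.0]])
p = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, -1.0]])
print(colbert_maxsim(q, p))  # 0.9
```

The first query token matches a passage token exactly (1.0) and the second matches best at 0.8, giving a mean of 0.9.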
+### Compute score for text pairs
+Given a list of text pairs, you can get the scores computed by different methods.
+```python
+from FlagEmbedding import BGEM3FlagModel
+model = BGEM3FlagModel('VISAI-AI/nitibench-ccl-human-finetuned-bge-m3',  use_fp16=True)
+sentences_1 = ["สถาบันทางการเงินสามารถลงทุนในหลักทรัพย์ เป็นอัตราส่วนร้อยละสิบของเงินกองทุนทั้งหมดของสถาบันการเงินนั้น สำหรับการถือหรือมีหุ้นในทุกบริษัทรวมกันได้หรือไม่?",
+               "ในกรณีที่ธนาคารแห่งประเทศไทยมีคำสั่งปิดกิจการของสถาบันการเงิน เนื่องสถาบันการเงินดำรงเงินกองทุนต่ำกว่าร้อยละสามสิบห้าของอัตราตามที่กำหนด จะต้องนำเสนอต่อบุคคลใดหรือหน่วยงานใดเพื่อเพิกถอนใบอนุญาตของสถาบันการเงินนั้น"]
+sentences_2 = ["พระราชบัญญัติธุรกิจสถาบันการเงิน พ.ศ. 2551 มาตรา 33 ภายใต้บังคับมาตรา 34 และมาตรา 35 ให้สถาบันการเงินลงทุนในหลักทรัพย์เพื่อเป็นกรรมสิทธิ์ของตนได้ ตามหลักเกณฑ์ที่ธนาคารแห่งประเทศไทยประกาศกำหนด",
+               "พระราชบัญญัติธุรกิจสถาบันการเงิน พ.ศ. 2551 มาตรา 97 ในกรณีที่สถาบันการเงินดำรงเงินกองทุนต่ำกว่าร้อยละสามสิบห้าของอัตราตามที่กำหนดในมาตรา 30 ให้ธนาคารแห่งประเทศไทยมีคำสั่งปิดกิจการของสถาบันการเงินนั้น เว้นแต่ในกรณีที่ธนาคารแห่งประเทศไทยเห็นว่าการมีคำสั่งปิดกิจการจะก่อให้เกิดผลกระทบ หรือความเสียหายต่อระบบเศรษฐกิจโดยรวมอย่างรุนแรง ธนาคารแห่งประเทศไทยอาจยังไม่สั่งปิดกิจการของสถาบันการเงินก็ได้\nเมื่อธนาคารแห่งประเทศไทยมีคำสั่งปิดกิจการตามวรรคหนึ่งแล้ว ให้เสนอรัฐมนตรีเพิกถอนใบอนุญาตของสถาบันการเงินนั้น"]
+sentence_pairs = [[i,j] for i in sentences_1 for j in sentences_2]
+print(model.compute_score(sentence_pairs,
+                          max_passage_length=128, # a smaller max length leads to a lower latency
+                          weights_for_different_modes=[0.4, 0.2, 0.4])) # weights_for_different_modes(w) is used to do weighted sum: w[0]*dense_score + w[1]*sparse_score + w[2]*colbert_score
+# {
+#   'colbert': [0.5812647342681885, 0.5717734098434448, 0.6460118889808655, 0.8784525990486145],
+#   'sparse': [0.1083984375, 0.07684326171875, 0.07061767578125, 0.314208984375],
+#   'dense': [0.61865234375, 0.58935546875, 0.666015625, 0.8916015625],
+#   'sparse+dense': [0.4485676884651184, 0.41851806640625, 0.4675496518611908, 0.6991373896598816],
+#   'colbert+sparse+dense': [0.5016465187072754, 0.47982022166252136, 0.538934588432312, 0.7708634734153748]
+# }
+```
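The combined scores in the output above can be checked by hand. The sketch below assumes each combined score is a weighted sum of the per-mode scores divided by the sum of the weights used; `combine_scores` is a hypothetical helper, not a library function. It uses the scores printed for the first pair.

```python
def combine_scores(scores, weights):
    """Weighted average of per-mode scores, normalized by the weights used."""
    total = sum(weights.values())
    return sum(weights[mode] * scores[mode] for mode in weights) / total

# Per-mode scores of the first text pair, copied from the printed output.
first_pair = {'dense': 0.61865234375, 'sparse': 0.1083984375,
              'colbert': 0.5812647342681885}

all_modes = combine_scores(first_pair, {'dense': 0.4, 'sparse': 0.2, 'colbert': 0.4})
sparse_dense = combine_scores(first_pair, {'dense': 0.4, 'sparse': 0.2})
print(round(all_modes, 4), round(sparse_dense, 4))  # 0.5016 0.4486
```

Both values match the first entries of `'colbert+sparse+dense'` and `'sparse+dense'` in the printed output, which supports the normalization assumption for this example.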
+## Acknowledgement
+Thanks to Pirat Pothavorn for evaluating the model's performance on NitiBench and to Supavish Punchun for finetuning the model. Additionally, we thank all authors of this open-source project.
+## Citation
+### BibTeX
+```
+@misc{akarajaradwong2025nitibenchcomprehensivestudiesllm,
+      title={NitiBench: A Comprehensive Studies of LLM Frameworks Capabilities for Thai Legal Question Answering},
+      author={Pawitsapak Akarajaradwong and Pirat Pothavorn and Chompakorn Chaksangchaichot and Panuthep Tasawong and Thitiwat Nopparatbundit and Sarana Nutanong},
+      year={2025},
+      eprint={2502.10868},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2502.10868},
+}
+```