andreaschari committed (verified)
Commit c627d6b · Parent(s): cf3770c

Update README.md

Files changed (1):
  1. README.md +15 -112

README.md CHANGED
@@ -1,15 +1,24 @@
  ---
  tags:
  - sentence-transformers
- - sentence-similarity
  - feature-extraction
- pipeline_tag: sentence-similarity
- library_name: sentence-transformers
  ---

- # SentenceTransformer

- This is a [sentence-transformers](https://www.SBERT.net) model trained. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

  ## Model Details

@@ -23,90 +32,6 @@ This is a [sentence-transformers](https://www.SBERT.net) model trained. It maps
  <!-- - **Language:** Unknown -->
  <!-- - **License:** Unknown -->

- ### Model Sources
-
- - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
-
- ### Full Model Architecture
-
- ```
- SentenceTransformer(
-   (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
-   (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
-   (2): Normalize()
- )
- ```
-
- ## Usage
-
- ### Direct Usage (Sentence Transformers)
-
- First install the Sentence Transformers library:
-
- ```bash
- pip install -U sentence-transformers
- ```
-
- Then you can load this model and run inference.
- ```python
- from sentence_transformers import SentenceTransformer
-
- # Download from the 🤗 Hub
- model = SentenceTransformer("sentence_transformers_model_id")
- # Run inference
- sentences = [
-     'The weather is lovely today.',
-     "It's so sunny outside!",
-     'He drove to the stadium.',
- ]
- embeddings = model.encode(sentences)
- print(embeddings.shape)
- # [3, 1024]
-
- # Get the similarity scores for the embeddings
- similarities = model.similarity(embeddings, embeddings)
- print(similarities.shape)
- # [3, 3]
- ```
-
- <!--
- ### Direct Usage (Transformers)
-
- <details><summary>Click to see the direct usage in Transformers</summary>
-
- </details>
- -->
-
- <!--
- ### Downstream Usage (Sentence Transformers)
-
- You can finetune this model on your own dataset.
-
- <details><summary>Click to expand</summary>
-
- </details>
- -->
-
- <!--
- ### Out-of-Scope Use
-
- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
- -->
-
- <!--
- ## Bias, Risks and Limitations
-
- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
- -->
-
- <!--
- ### Recommendations
-
- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
- -->
-
  ## Training Details

  ### Framework Versions
@@ -116,26 +41,4 @@ You can finetune this model on your own dataset.
  - PyTorch: 2.4.1
  - Accelerate: 0.34.2
  - Datasets: 3.0.1
- - Tokenizers: 0.20.3
-
- ## Citation
-
- ### BibTeX
-
- <!--
- ## Glossary
-
- *Clearly define terms in order to be accessible across audiences.*
- -->
-
- <!--
- ## Model Card Authors
-
- *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
- -->
-
- <!--
- ## Model Card Contact
-
- *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
- -->
  ---
+ datasets:
+ - unicamp-dl/mmarco
+ library_name: sentence-transformers
+ pipeline_tag: sentence-similarity
  tags:
  - sentence-transformers
  - feature-extraction
+ - sentence-similarity
+ license: mit
+ widget: []
+ base_model:
+ - BAAI/bge-m3
  ---

+ # BGE-m3 RU mMARCO/v2 Transliterated Queries
+
+ This is a [BGE-M3](https://huggingface.co/BAAI/bge-m3) model post-trained on the Russian dataset from mMARCO/v2.
+ The queries were transliterated from Russian to English using [uroman](https://github.com/isi-nlp/uroman).
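The romanization step can be pictured with a toy sketch. The mapping below is a hypothetical fragment for illustration only; uroman's actual rule set covers many scripts and applies context-dependent rules, so real transliterated queries should be produced with uroman itself.

```python
# Naive illustration of Cyrillic-to-Latin romanization.
# This toy mapping is hypothetical -- uroman's real rules are far richer.
CYR2LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "и": "i", "к": "k", "л": "l", "м": "m", "н": "n", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "у": "u", "ш": "sh",
}

def romanize(text: str) -> str:
    """Map each Cyrillic character to a Latin sequence, passing others through."""
    return "".join(CYR2LAT.get(ch, ch) for ch in text.lower())

print(romanize("стол"))   # -> stol
print(romanize("маска"))  # -> maska
```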

+ The model was used for the SIGIR 2025 short paper "Lost in Transliteration: Bridging the Script Gap in Neural IR".
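Usage follows the standard Sentence Transformers API shown in the previous revision above. Because the encoder ends in a Normalize module, the 1024-dimensional embeddings come out unit-length, so cosine similarity reduces to a plain dot product. A stdlib-only sketch of that post-processing, with 3-d toy vectors standing in for real embeddings:

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length, as the model's Normalize module does."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy 3-d stand-ins for the model's 1024-d embeddings.
q = l2_normalize([1.0, 2.0, 2.0])
d = l2_normalize([2.0, 4.0, 4.0])   # same direction as q
e = l2_normalize([-2.0, 1.0, 0.0])  # orthogonal to q

# For unit vectors, cosine similarity equals the dot product.
print(round(dot(q, d), 6))       # -> 1.0
print(abs(round(dot(q, e), 6)))  # -> 0.0
```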

  ## Model Details

  <!-- - **Language:** Unknown -->
  <!-- - **License:** Unknown -->

  ## Training Details

  ### Framework Versions

  - PyTorch: 2.4.1
  - Accelerate: 0.34.2
  - Datasets: 3.0.1
+ - Tokenizers: 0.20.3