SentenceTransformer based on seongil-dn/unsupervised_20m_3800

This is a sentence-transformers model finetuned from seongil-dn/unsupervised_20m_3800. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: seongil-dn/unsupervised_20m_3800
  • Maximum Sequence Length: 1024 tokens
  • Output Dimensionality: 1024 dimensions
  • Number of Parameters: 568M (F32 safetensors)
  • Similarity Function: Cosine Similarity

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 1024, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
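The architecture pools with the CLS token and ends in a Normalize() module, so every embedding is unit-length and cosine similarity reduces to a plain dot product. A minimal sketch of that property with random NumPy stand-ins (not real model outputs):

```python
import numpy as np

# Toy stand-ins for model outputs; real embeddings are 1024-dimensional too.
rng = np.random.default_rng(0)
raw = rng.normal(size=(3, 1024))

# The final Normalize() module L2-normalizes each embedding.
emb = raw / np.linalg.norm(raw, axis=1, keepdims=True)

# For unit vectors, cosine similarity is simply the dot product.
cos = emb @ emb.T
assert np.allclose(np.diag(cos), 1.0)  # each embedding matches itself exactly
```

In practice this means dot-product and cosine retrieval are interchangeable for this model, which can matter when picking the metric for a vector index.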

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("seongil-dn/bge-m3-3800_steps_v2_234")
# Run inference
sentences = [
    'The Battle of Northampton was fought in which century?',
    'Battle of Megiddo (15th century BC) Battle of Megiddo (15th century BC) The Battle of Megiddo (15th century BC) was fought between Egyptian forces under the command of Pharaoh Thutmose III and a large rebellious coalition of Canaanite vassal states led by the king of Kadesh. It is the first battle to have been recorded in what is accepted as relatively reliable detail. Megiddo is also the first recorded use of the composite bow and the first body count. All details of the battle come from Egyptian sources—primarily the hieroglyphic writings on the Hall of Annals in the Temple of Amun-Re at Karnak, Thebes (now Luxor),',
    'Northampton Sand Northampton Sand The Northampton Sand, sometimes called the Northamptonshire Sand is a geological formation of Jurassic age found in the East Midlands of England. Particularly in the twentieth century, it has been of economic importance as a source of iron ore, but is now worked much less. The Northampton Sand Formation constitutes the lowest part of the Inferior Oolite Series and lies on the upper Lias clay. It attains a maximum thickness of up to to the north and west of Northampton where it lies in a subterranean basin. In the south, it fades out around Towcester. Northward from the',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
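For semantic search, the same embeddings rank passages against a query by similarity. A minimal sketch using hypothetical, already-normalized 3-dimensional vectors in place of `model.encode()` output (row 0 plays the query, rows 1-2 the candidate passages):

```python
import numpy as np

# Hypothetical unit-length embeddings standing in for model.encode() output.
embeddings = np.array([
    [0.8, 0.6, 0.0],   # query
    [0.6, 0.8, 0.0],   # passage A (related to the query)
    [0.0, 0.0, 1.0],   # passage B (unrelated)
])

query, docs = embeddings[0], embeddings[1:]
scores = docs @ query          # cosine similarity (vectors are unit-length)
ranking = np.argsort(-scores)  # best match first
print(ranking)                 # [0 1]: passage A is the closest
```

With real output from the snippet above, the question embedding would rank the two Wikipedia passages the same way.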

Training Details

Training Dataset

Unnamed Dataset

  • Size: 1,921,571 training samples
  • Columns: anchor, positive, negative, negative_2, negative_3, negative_4, and negative_5
  • Approximate statistics based on the first 1000 samples:
    column      type    min         mean           max
    anchor      string  9 tokens    22.32 tokens   119 tokens
    positive    string  127 tokens  157.21 tokens  276 tokens
    negative    string  122 tokens  154.65 tokens  212 tokens
    negative_2  string  122 tokens  155.52 tokens  218 tokens
    negative_3  string  122 tokens  156.04 tokens  284 tokens
    negative_4  string  124 tokens  156.3 tokens   268 tokens
    negative_5  string  121 tokens  156.15 tokens  249 tokens
  • Samples (the first three rows, one field per line):

    Sample 1
    • anchor: What African country is projected to pass the United States in population by the year 2055?
    • positive: African immigration to the United States officially 40,000 African immigrants, although it has been estimated that the population is actually four times this number when considering undocumented immigrants. The majority of these immigrants were born in Ethiopia, Egypt, Nigeria, and South Africa. African immigrants like many other immigrant groups are likely to establish and find success in small businesses. Many Africans that have seen the social and economic stability that comes from ethnic enclaves such as Chinatowns have recently been establishing ethnic enclaves of their own at much higher rates to reap the benefits of such communities. Such examples include Little Ethiopia in Los Angeles and
    • negative: What Will Happen to the Gang Next Year? watching television at the time of the broadcast. This made it the lowest-rated episode in "30 Rock"'s history. and a decrease from the previous episode "The Return of Avery Jessup" (2.92 million) What Will Happen to the Gang Next Year? "What Will Happen to the Gang Next Year?" is the twenty-second and final episode of the sixth season of the American television comedy series "30 Rock", and the 125th overall episode of the series. It was directed by Michael Engler, and written by Matt Hubbard. The episode originally aired on the National Broadcasting Company (NBC) network in the United States
    • negative_2: Christianity in the United States Christ is the fifth-largest denomination, the largest Pentecostal church, and the largest traditionally African-American denomination in the nation. Among Eastern Christian denominations, there are several Eastern Orthodox and Oriental Orthodox churches, with just below 1 million adherents in the US, or 0.4% of the total population. Christianity was introduced to the Americas as it was first colonized by Europeans beginning in the 16th and 17th centuries. Going forward from its foundation, the United States has been called a Protestant nation by a variety of sources. Immigration further increased Christian numbers. Today most Christian churches in the United States are either
    • negative_3: What Will Happen to the Gang Next Year? What Will Happen to the Gang Next Year? "What Will Happen to the Gang Next Year?" is the twenty-second and final episode of the sixth season of the American television comedy series "30 Rock", and the 125th overall episode of the series. It was directed by Michael Engler, and written by Matt Hubbard. The episode originally aired on the National Broadcasting Company (NBC) network in the United States on May 17, 2012. In the episode, Jack (Alec Baldwin) and Avery (Elizabeth Banks) seek to renew their vows; Criss (James Marsden) sets out to show Liz (Tina Fey) he can pay
    • negative_4: History of the Jews in the United States Representatives by Rep. Samuel Dickstein (D; New York). This also failed to pass. During the Holocaust, fewer than 30,000 Jews a year reached the United States, and some were turned away due to immigration policies. The U.S. did not change its immigration policies until 1948. Currently, laws requiring teaching of the Holocaust are on the books in five states. The Holocaust had a profound impact on the community in the United States, especially after 1960, as Jews tried to comprehend what had happened, and especially to commemorate and grapple with it when looking to the future. Abraham Joshua Heschel summarized
    • negative_5: Public holidays in the United States will have very few customers that day. The labor force in the United States comprises about 62% (as of 2014) of the general population. In the United States, 97% of the private sector businesses determine what days this sector of the population gets paid time off, according to a study by the Society for Human Resource Management. The following holidays are observed by the majority of US businesses with paid time off: This list of holidays is based off the official list of federal holidays by year from the US Government. The holidays however are at the discretion of employers

    Sample 2
    • anchor: Which is the largest species of the turtle family?
    • positive: Turtle tortoise, "testudo". "Terrapin" comes from an Algonquian word for turtle. Some languages do not have this distinction, as all of these are referred to by the same name. For example, in Spanish, the word "tortuga" is used for turtles, tortoises, and terrapins. A sea-dwelling turtle is "tortuga marina", a freshwater species "tortuga de río", and a tortoise "tortuga terrestre". The largest living chelonian is the leatherback sea turtle ("Dermochelys coriacea"), which reaches a shell length of and can reach a weight of over . Freshwater turtles are generally smaller, but with the largest species, the Asian softshell turtle "Pelochelys cantorii",
    • negative: Convention on the Conservation of Migratory Species of Wild Animals take joint action. At May 2018, there were 126 Parties to the Convention. The CMS Family covers a great diversity of migratory species. The Appendices of CMS include many mammals, including land mammals, marine mammals and bats; birds; fish; reptiles and one insect. Among the instruments, AEWA covers 254 species of birds that are ecologically dependent on wetlands for at least part of their annual cycle. EUROBATS covers 52 species of bat, the Memorandum of Understanding on the Conservation of Migratory Sharks seven species of shark, the IOSEA Marine Turtle MOU six species of marine turtle and the Raptors MoU
    • negative_2: Razor-backed musk turtle Razor-backed musk turtle The razor-backed musk turtle ("Sternotherus carinatus") is a species of turtle in the family Kinosternidae. The species is native to the southern United States. There are no subspecies that are recognized as being valid. "S. carinatus" is found in the states of Alabama, Arkansas, Louisiana, Mississippi, Oklahoma, and Texas. The razor-backed musk turtle grows to a straight carapace length of about . It has a brown-colored carapace, with black markings at the edges of each scute. The carapace has a distinct, sharp keel down the center of its length, giving the species its common name. The body
    • negative_3: African helmeted turtle African helmeted turtle The African helmeted turtle ("Pelomedusa subrufa"), also known commonly as the marsh terrapin, the crocodile turtle, or in the pet trade as the African side-necked turtle, is a species of omnivorous side-necked terrapin in the family Pelomedusidae. The species naturally occurs in fresh and stagnant water bodies throughout much of Sub-Saharan Africa, and in southern Yemen. The marsh terrapin is typically a rather small turtle, with most individuals being less than in straight carapace length, but one has been recorded with a length of . It has a black or brown carapace. The top of the tail
    • negative_4: Box turtle Box turtle Box turtles are North American turtles of the genus Terrapene. Although box turtles are superficially similar to tortoises in terrestrial habits and overall appearance, they are actually members of the American pond turtle family (Emydidae). The twelve taxa which are distinguished in the genus are distributed over four species. They are largely characterized by having a domed shell, which is hinged at the bottom, allowing the animal to close its shell tightly to escape predators. The genus name "Terrapene" was coined by Merrem in 1820 as a genus separate from "Emys" for those species which had a sternum
    • negative_5: Vallarta mud turtle Vallarta mud turtle The Vallarta mud turtle ("Kinosternon vogti") is a recently identified species of mud turtle in the family Kinosternidae. While formerly considered conspecific with the Jalisco mud turtle, further studies indicated that it was a separate species. It can be identified by a combination of the number of plastron and carapace scutes, body size, and the distinctive yellow rostral shield in males. It is endemic to Mexican state of Jalisco. It is only known from a few human-created or human-affected habitats (such as small streams and ponds) found around Puerto Vallarta. It is one of only 3 species

    Sample 3
    • anchor: How many gallons of beer are in an English barrel?
    • positive: Low-alcohol beer Prohibition in the United States. Near beer could not legally be labeled as "beer" and was officially classified as a "cereal beverage". The public, however, almost universally called it "near beer". The most popular "near beer" was Bevo, brewed by the Anheuser-Busch company. The Pabst company brewed "Pablo", Miller brewed "Vivo", and Schlitz brewed "Famo". Many local and regional breweries stayed in business by marketing their own near-beers. By 1921 production of near beer had reached over 300 million US gallons (1 billion L) a year (36 L/s). A popular illegal practice was to add alcohol to near beer. The
    • negative: Keg terms "half-barrel" and "quarter-barrel" are derived from the U.S. beer barrel, legally defined as being equal to 31 U.S. gallons (this is not the same volume as some other units also known as "barrels"). A 15.5 U.S. gallon keg is also equal to: However, beer kegs can come in many sizes: In European countries the most common keg size is 50 liters. This includes the UK, which uses a non-metric standard keg of 11 imperial gallons, which is coincidentally equal to . The German DIN 6647-1 and DIN 6647-2 have also defined kegs in the sizes of 30 and 20
    • negative_2: Beer in Chile craft beers. They are generally low or very low volume producers. In Chile there are more than 150 craft beer producers distributed along the 15 Chilean Regions. The list below includes: Beer in Chile The primary beer brewed and consumed in Chile is pale lager, though the country also has a tradition of brewing corn beer, known as chicha. Chile's beer history has a strong German influence – some of the bigger beer producers are from the country's southern lake district, a region populated by a great number of German immigrants during the 19th century. Chile also produces English ale-style
    • negative_3: Barrel variation. In modern times, produce barrels for all dry goods, excepting cranberries, contain 7,056 cubic inches, about 115.627 L. Barrel A barrel, cask, or tun is a hollow cylindrical container, traditionally made of wooden staves bound by wooden or metal hoops. Traditionally, the barrel was a standard size of measure referring to a set capacity or weight of a given commodity. For example, in the UK a barrel of beer refers to a quantity of . Wine was shipped in barrels of . Modern wooden barrels for wine-making are either made of French common oak ("Quercus robur") and white oak
    • negative_4: The Rare Barrel The Rare Barrel The Rare Barrel is a brewery and brewpub in Berkeley, California, United States, that exclusively produces sour beers. Founders Jay Goodwin and Alex Wallash met while attending UCSB. They started home-brewing in their apartment and decided that they would one day start a brewery together. Goodwin started working at The Bruery, where he worked his way from a production assistant to brewer, eventually becoming the head of their barrel aging program. The Rare Barrel brewed its first batch of beer in February 2013, and opened its tasting room on December 27, 2013. The Rare Barrel was named
    • negative_5: Barrel (unit) Barrel (unit) A barrel is one of several units of volume applied in various contexts; there are dry barrels, fluid barrels (such as the UK beer barrel and US beer barrel), oil barrels and so on. For historical reasons the volumes of some barrel units are roughly double the volumes of others; volumes in common usage range from about . In many connections the term "drum" is used almost interchangeably with "barrel". Since medieval times the term barrel as a unit of measure has had various meanings throughout Europe, ranging from about 100 litres to 1000 litres. The name was
  • Loss: CachedGISTEmbedLoss with these parameters:
    {'guide': SentenceTransformer(
      (0): Transformer({'max_seq_length': 1024, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
      (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
      (2): Normalize()
    ), 'temperature': 0.01}
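CachedGISTEmbedLoss follows the GISTEmbed idea: an InfoNCE-style contrastive loss over in-batch candidates, where the frozen guide model listed above filters out negatives that it scores at least as high as the true positive (likely false negatives), and the low temperature (0.01) sharpens the softmax. The following is a toy NumPy sketch of that idea only, with invented 2x2 similarity matrices; the actual sentence-transformers implementation differs (it also caches embedding gradients, which is what makes the 2048 batch size feasible):

```python
import numpy as np

def gist_style_loss(sim, guide_sim, temperature=0.01):
    """Toy InfoNCE over in-batch candidates, with guide-based masking.

    sim[i, j]:       similarity of anchor i to candidate j under the trained model
    guide_sim[i, j]: the same pairs scored by the frozen guide model
    Candidate j is dropped as a negative for anchor i when the guide scores it
    at least as high as the true positive (a likely false negative).
    """
    n = sim.shape[0]
    positives = np.diag(guide_sim)
    # Mask candidates the guide ranks at or above the positive (off-diagonal only).
    false_neg = (guide_sim >= positives[:, None]) & ~np.eye(n, dtype=bool)
    logits = sim / temperature
    logits[false_neg] = -np.inf  # removed from the softmax denominator
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Invented scores: candidate 1 looks suspiciously relevant to anchor 0.
sim = np.array([[0.9, 0.95], [0.1, 0.8]])
guide = np.array([[0.95, 0.97], [0.2, 0.85]])
loss = gist_style_loss(sim, guide)
```

Here the guide ranks candidate 1 above anchor 0's positive (0.97 >= 0.95), so that pair is excluded from anchor 0's denominator instead of being punished as a negative.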
    

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 2048
  • learning_rate: 3e-05
  • weight_decay: 0.01
  • num_train_epochs: 1
  • warmup_ratio: 0.05
  • bf16: True
  • batch_sampler: no_duplicates

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: no
  • prediction_loss_only: True
  • per_device_train_batch_size: 2048
  • per_device_eval_batch_size: 8
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 3e-05
  • weight_decay: 0.01
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 1
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.05
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: True
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss
0.4145 97 0.3308
0.4188 98 0.3059
0.4231 99 0.3524
0.4274 100 0.3363
0.4316 101 0.3255
0.4359 102 0.3182
0.4402 103 0.3246
0.4444 104 0.3361
0.4487 105 0.3289
0.4530 106 0.3138
0.4573 107 0.3189
0.4615 108 0.316
0.4658 109 0.3267
0.4701 110 0.3395
0.4744 111 0.3028
0.4786 112 0.319
0.4829 113 0.3126
0.4872 114 0.3101
0.4915 115 0.3064
0.4957 116 0.3016
0.5 117 0.3324
0.5043 118 0.3182
0.5085 119 0.294
0.5128 120 0.3008
0.5171 121 0.312
0.5214 122 0.2975
0.5256 123 0.3023
0.5299 124 0.3251
0.5342 125 0.3043
0.5385 126 0.3201
0.5427 127 0.3097
0.5470 128 0.3132
0.5513 129 0.2934
0.5556 130 0.3266
0.5598 131 0.2935
0.5641 132 0.3052
0.5684 133 0.2859
0.5726 134 0.3079
0.5769 135 0.295
0.5812 136 0.2996
0.5855 137 0.3045
0.5897 138 0.2977
0.5940 139 0.3009
0.5983 140 0.2953
0.6026 141 0.3007
0.6068 142 0.3187
0.6111 143 0.3015
0.6154 144 0.3064
0.6197 145 0.2843
0.6239 146 0.3063
0.6282 147 0.304
0.6325 148 0.2998
0.6368 149 0.3077
0.6410 150 0.2975
0.6453 151 0.3165
0.6496 152 0.2961
0.6538 153 0.2939
0.6581 154 0.2963
0.6624 155 0.3109
0.6667 156 0.2873
0.6709 157 0.3028
0.6752 158 0.2937
0.6795 159 0.2839
0.6838 160 0.294
0.6880 161 0.3066
0.6923 162 0.2859
0.6966 163 0.3017
0.7009 164 0.2947
0.7051 165 0.2884
0.7094 166 0.3055
0.7137 167 0.2744
0.7179 168 0.2789
0.7222 169 0.2838
0.7265 170 0.2759
0.7308 171 0.2908
0.7350 172 0.2984
0.7393 173 0.2932
0.7436 174 0.3061
0.7479 175 0.2862
0.7521 176 0.2795
0.7564 177 0.2826
0.7607 178 0.2962
0.7650 179 0.281
0.7692 180 0.2853
0.7735 181 0.2794
0.7778 182 0.2822
0.7821 183 0.2969
0.7863 184 0.2773
0.7906 185 0.2834
0.7949 186 0.2826
0.7991 187 0.2832
0.8034 188 0.3031
0.8077 189 0.2873
0.8120 190 0.2987
0.8162 191 0.2791
0.8205 192 0.2773
0.8248 193 0.2699
0.8291 194 0.2991
0.8333 195 0.275
0.8376 196 0.2985
0.8419 197 0.2927
0.8462 198 0.2685
0.8504 199 0.2941
0.8547 200 0.3009
0.8590 201 0.3009
0.8632 202 0.2899
0.8675 203 0.3024
0.8718 204 0.2975
0.8761 205 0.2754
0.8803 206 0.2834
0.8846 207 0.2839
0.8889 208 0.2804
0.8932 209 0.2882
0.8974 210 0.2848
0.9017 211 0.2723
0.9060 212 0.2877
0.9103 213 0.2998
0.9145 214 0.3007
0.9188 215 0.2825
0.9231 216 0.2794
0.9274 217 0.2786
0.9316 218 0.2671
0.9359 219 0.2743
0.9402 220 0.2859
0.9444 221 0.2804
0.9487 222 0.2797
0.9530 223 0.2818
0.9573 224 0.2758
0.9615 225 0.2798
0.9658 226 0.2805
0.9701 227 0.2649
0.9744 228 0.2854
0.9786 229 0.2791
0.9829 230 0.2729
0.9872 231 0.2817
0.9915 232 0.2796
0.9957 233 0.305
1.0 234 0.2922

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.4.1
  • Transformers: 4.49.0
  • PyTorch: 2.5.1+cu124
  • Accelerate: 1.4.0
  • Datasets: 3.3.2
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}