FronyAI Embedding (large)

This is a lightweight and efficient embedding model designed specifically for the Korean language. It has been trained on a diverse set of data sources, including AI 허브, to ensure robust performance in a wide range of retrieval tasks. The model demonstrates strong retrieval capabilities across:

  • Korean–Korean
  • Korean–English
  • English–Korean

To support resource-constrained environments, the model is also compatible with Matryoshka embeddings, enabling retrieval at reduced dimensions (e.g., half of the original size) without significant performance loss. All training and data preprocessing were performed on a single GPU (46GB VRAM), showcasing not only the model’s effectiveness but also its efficiency.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base Model: meta-llama/Llama-3.2-1B
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 2048 dimensions (1024 with Matryoshka truncation)
  • Similarity Function: Cosine Similarity
  • Languages: ko, en
  • License: apache-2.0

Datasets

This model was trained on data from many sources, including AI 허브.
In total, 1,000,000 query–document pairs were used for training.

Training Details

The overall training process was conducted with reference to snowflake-arctic-embed-2.0.
Training was divided into two stages: Pre-training and Post-training.

  • In the pre-training stage, the model was trained using in-batch negatives.
  • In the post-training stage, we used the multilingual-e5-large model to mine hard negatives: for each query, the top 4 candidate passages whose similarity score falls below a 99% threshold (a sketch of this step is given below).
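
A rough sketch of this hard-negative mining step, assuming multilingual-e5-large as the scoring model and interpreting the 99% threshold as “below 99% of the positive passage’s score”; the helper name and the exact filtering rule are illustrative, not the authors’ pipeline.

from sentence_transformers import SentenceTransformer
import numpy as np

# Hypothetical hard-negative mining: score all candidate passages with a
# separate retriever and keep the top-ranked negatives that are not
# near-duplicates of the positive passage.
scorer = SentenceTransformer("intfloat/multilingual-e5-large")

def mine_hard_negatives(query, positive, candidates, k=4, threshold=0.99):
    # multilingual-e5 models expect "query: " / "passage: " prefixes.
    q_emb = scorer.encode([f"query: {query}"])
    p_embs = scorer.encode([f"passage: {p}" for p in [positive] + candidates])
    sims = scorer.similarity(q_emb, p_embs).numpy()[0]
    pos_score, cand_scores = sims[0], sims[1:]
    # Rank candidates by similarity and drop those too close to the positive.
    order = np.argsort(-cand_scores)
    hard = [candidates[i] for i in order if cand_scores[i] < threshold * pos_score]
    return hard[:k]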

Given the increasing prevalence of LLM-generated content, we also converted existing data into Markdown-style passages to improve retrieval performance on such formats.
The types of data augmentation applied are as follows:

Augmentation*        Description
Pair concatenation   Multi-query & multi-passage
Language transfer    Korean to English on query & passage
Style transfer       Plain sentences to Markdown description

*Augmentation was carried out using Gemma-3-12B.
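
A toy sketch of the pair concatenation and style transfer augmentations; the helper functions and prompt wording are hypothetical, and the actual LLM call (Gemma-3-12B in this card) is left out.

import random

def concat_pairs(pairs, group_size=2):
    """Pair concatenation: merge several (query, passage) pairs into one
    multi-query / multi-passage training example."""
    pairs = list(pairs)
    random.shuffle(pairs)
    merged = []
    for i in range(0, len(pairs) - group_size + 1, group_size):
        group = pairs[i:i + group_size]
        merged.append((
            " ".join(q for q, _ in group),    # multi-query
            "\n".join(p for _, p in group),   # multi-passage
        ))
    return merged

def style_transfer_prompt(passage):
    """Style transfer: prompt an LLM to rewrite a plain-text passage as a
    Markdown-style description (prompt wording is illustrative)."""
    return (
        "Rewrite the following passage as a well-structured Markdown document "
        "with headings and bullet points. Keep the content unchanged.\n\n"
        f"Passage:\n{passage}"
    )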

Evaluation

The evaluation consists of five dataset groups. Three groups are subsets extracted from AI 허브 datasets. One group is based on a specific sports regulation PDF, for which synthetic query and Markdown-style passage pairs were generated using GPT-4o-mini. The final group is a concatenation of the four aforementioned groups, providing a comprehensive mixed set.
The following table reports the average retrieval performance across these five groups.

Models Open/Closed Size* Accuracy@1 Accuracy@3 Accuracy@5 Accuracy@10
frony-embed-large Open 1.24B 0.6764 0.8008 0.8359 0.8653
frony-embed-large (half dim) Open 1.24B 0.6644 0.7890 0.8238 0.8577
frony-embed-medium Open 337M 0.6649 0.8040 0.8458 0.8876
frony-embed-medium (half dim) Open 337M 0.6520 0.7923 0.8361 0.8796
bge-m3 Open 560M 0.5852 0.7763 0.8418 0.8987
multilingual-e5-large Open 560M 0.5764 0.7630 0.8267 0.8891
snowflake-arctic-embed-l-v2.0 Open 568M 0.5726 0.7591 0.8232 0.8917
jina-embeddings-v3 Open 572M 0.5270 0.7246 0.7953 0.8649
upstage-large Closed - 0.6334 0.8527 0.9065 0.9478
openai-text-embedding-3-large Closed - 0.4907 0.6617 0.7311 0.8148
*Sizes count transformer blocks only.
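
For reference, Accuracy@k is the standard retrieval hit rate: the fraction of queries whose gold passage appears among the top-k results. A minimal sketch of how such a score can be computed (variable names and the normalization assumption are illustrative):

import numpy as np

def accuracy_at_k(query_embs, passage_embs, gold_indices, k):
    """Fraction of queries whose gold passage is ranked in the top-k by
    cosine similarity (embeddings are assumed to be L2-normalized)."""
    sims = query_embs @ passage_embs.T           # (num_queries, num_passages)
    topk = np.argsort(-sims, axis=1)[:, :k]      # top-k passage indices per query
    hits = [gold in row for gold, row in zip(gold_indices, topk)]
    return float(np.mean(hits))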

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("FronyAI/frony-embed-large-ko-v1")

# Run inference
# '<Q>' is the special prefix token for queries.
queries = [
    '<Q>안녕하세요',
]
query_embeddings = model.encode(queries)

# '<P>' is the special prefix token for passages.
passages = [
    '<P>반갑습니다',
]
passage_embeddings = model.encode(passages)

# Cosine similarity between query and passage embeddings.
print(model.similarity(query_embeddings, passage_embeddings))
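
As noted above, the model is compatible with Matryoshka embeddings. A minimal sketch of retrieval at half the original dimensionality, assuming the standard truncate_dim argument of Sentence Transformers (which keeps the first 1024 dimensions):

from sentence_transformers import SentenceTransformer

# Load the model with embeddings truncated to half of the original
# 2048 dimensions (Matryoshka-style truncation).
model = SentenceTransformer("FronyAI/frony-embed-large-ko-v1", truncate_dim=1024)

query_embeddings = model.encode(['<Q>안녕하세요'])
passage_embeddings = model.encode(['<P>반갑습니다'])

# Cosine similarity between the truncated embeddings.
print(model.similarity(query_embeddings, passage_embeddings))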