Overview

This model is designed for the abstractive proposition segmentation task in Korean, as described in the paper Scalable and Domain-General Abstractive Proposition Segmentation: it segments text into atomic, self-contained units (atomic facts).

Training Details

  • Base Model: yanolja/EEVE-Korean-Instruct-10.8B-v1.0
  • Fine-tuning Method: LoRA (an illustrative configuration sketch follows this list)
  • Dataset: RoSE
    • Translation: The dataset was translated into Korean using GPT-4o.
      • When translating the propositions, GPT-4o was prompted to reuse the vocabulary of the translated passage.
    • Data Split: The dataset was randomly split into training, validation, and test sets (1,900 / 100 / 500 examples) for fine-tuning.
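
The exact fine-tuning hyperparameters are not published on this card. Purely as an illustration, a LoRA setup with peft on top of the base model might look like the sketch below; the rank, alpha, dropout, target modules, and training loop are assumptions, not the settings used for this checkpoint.

# Hypothetical LoRA fine-tuning sketch -- hyperparameters are assumptions,
# not the values actually used to train this checkpoint.
import peft
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "yanolja/EEVE-Korean-Instruct-10.8B-v1.0"
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base_id)

lora_config = peft.LoraConfig(
    task_type="CAUSAL_LM",
    r=16,                        # assumed rank
    lora_alpha=32,               # assumed scaling
    lora_dropout=0.05,           # assumed dropout
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed target layers
)
model = peft.get_peft_model(base, lora_config)
model.print_trainable_parameters()
# ... train with your preferred trainer (e.g. transformers.Trainer or trl's SFTTrainer)
# on prompts built as in the Data Preprocessing section below.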

Usage

Data Preprocessing

from konlpy.tag import Kkma

sent_start_token = "<sent>"
sent_end_token = "</sent>"
instruction = "I will provide a passage split into sentences by <s> and </s> markers. For each sentence, generate its list of propositions. Each proposition contains a single fact mentioned in the corresponding sentence written as briefly and clearly as possible.\n\n"

# Kkma (from KoNLPy) is used to split the Korean passage into sentences.
kkma = Kkma()

def get_input(text, tokenizer):
  # Wrap each sentence in <sent>...</sent> markers and build the chat-formatted prompt.
  sentences = kkma.sentences(text)
  prompt = instruction + "Passage: " + sent_start_token + f"{sent_end_token}{sent_start_token}".join(sentences) + sent_end_token + "\nPropositions:\n"
  messages = [{"role": "system", "content": "You are a helpful assistant."},
              {"role": "user", "content": prompt}]
  input_text = tokenizer.apply_chat_template(
                      messages,
                      tokenize=False,
                      add_generation_prompt=True)
  return input_text
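
# For reference, the user message produced for a two-sentence passage has the form
# (문장 1 / 문장 2 stand in for the split sentences):
#   <instruction>Passage: <sent>문장 1</sent><sent>문장 2</sent>
#   Propositions:
# tokenizer.apply_chat_template then wraps this message in the model's chat format.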

def get_output(text):
  # Parse the model response: propositions are "-"-prefixed lines, and each
  # </sent> marker closes the group of propositions for one sentence.
  results = []
  group = []

  # Drop an optional "Propositions:" prefix echoed by the model.
  if text.startswith("Propositions:"):
      lines = text[len("Propositions:"):].strip().split("\n")
  else:
      lines = text.strip().split("\n")

  for line in lines:
    if line.strip() == sent_start_token:
      continue
    elif line.strip() == sent_end_token:
      results.append(group)
      group = []
    else:
      # Stop at the first line that is neither a marker nor a proposition.
      if not line.strip().startswith("-"):
        break
      line = line.strip()[1:].strip()
      group.append(line)

  return results
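
As a quick sanity check, the parser can be exercised on a hand-written response in the shape it accepts (one "- " proposition per line, with </sent> closing the group for each sentence); the propositions below are taken from the example output further down.

# Illustrative response string in the format get_output() expects.
demo_response = (
    "<sent>\n"
    "- 옥스포드는 21세 이하 팀으로 득점했다.\n"
    "- 옥스포드는 화요일 경기를 했다.\n"
    "</sent>\n"
    "<sent>\n"
    "- 그 골은 16세 선수의 주장을 강화할 것이다.\n"
    "</sent>"
)
print(get_output(demo_response))
# [['옥스포드는 21세 이하 팀으로 득점했다.', '옥스포드는 화요일 경기를 했다.'],
#  ['그 골은 16세 선수의 주장을 강화할 것이다.']]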

Loading Model and Tokenizer

import peft, torch
from transformers import AutoModelForCausalLM, AutoTokenizer

LORA_PATH = "seonjeongh/Korean-Propositionalizer"

# Load the base model referenced by the adapter config, attach the LoRA adapter,
# and merge the adapter weights into the base model for plain inference.
lora_config = peft.PeftConfig.from_pretrained(LORA_PATH)
base_model = AutoModelForCausalLM.from_pretrained(lora_config.base_model_name_or_path,
                                                  torch_dtype=torch.float16,
                                                  device_map="auto")
model = peft.PeftModel.from_pretrained(base_model, LORA_PATH)
model = model.merge_and_unload(progressbar=True)
tokenizer = AutoTokenizer.from_pretrained(lora_config.base_model_name_or_path)
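
Optionally, the merged model can be saved once so later runs skip the download-and-merge step; the local directory name below is only a placeholder.

# Optional: persist the merged weights locally (path is a placeholder).
model.save_pretrained("korean-propositionalizer-merged")
tokenizer.save_pretrained("korean-propositionalizer-merged")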

Inference Example

device = "cuda"

text = "์˜ฅ์Šคํฌ๋“œ๋Š” ํ™”์š”์ผ ๋งจ์ฒด์Šคํ„ฐ ์œ ๋‚˜์ดํ‹ฐ๋“œ์™€์˜ ๊ฒฝ๊ธฐ์—์„œ 3-2๋กœ ํŒจํ•œ ๊ฒฝ๊ธฐ์—์„œ 21์„ธ ์ดํ•˜ ํŒ€์œผ๋กœ ๋“์ ํ–ˆ๋‹ค. ๊ทธ ๊ณจ์€ 16์„ธ ์„ ์ˆ˜์˜ 1๊ตฐ ๋ฐ๋ท” ์ฃผ์žฅ์„ ๊ฐ•ํ™”ํ•  ๊ฒƒ์ด๋‹ค. ์„ผํ„ฐ๋ฐฑ์€ ์ด๋ฒˆ ์‹œ์ฆŒ ์›จ์ŠคํŠธํ–„ 1๊ตฐ๊ณผ ํ•จ๊ป˜ ํ›ˆ๋ จํ–ˆ๋‹ค. ์›จ์ŠคํŠธํ–„ ์œ ๋‚˜์ดํ‹ฐ๋“œ์˜ ์ตœ์‹  ๋‰ด์Šค๋Š” ์—ฌ๊ธฐ๋ฅผ ํด๋ฆญํ•˜์„ธ์š”."
inputs = tokenizer([get_input(text, tokenizer)], return_tensors='pt').to(device)
output = model.generate(**inputs, max_new_tokens=512, pad_token_id=tokenizer.pad_token_id, eos_token_id=tokenizer.eos_token_id, use_cache=True)
# Decode only the newly generated tokens, then parse them into proposition groups.
response = tokenizer.batch_decode(output[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0]
results = get_output(response)
print(results)
Example output
[
   [
    "์˜ฅ์Šคํฌ๋“œ๋Š” 21์„ธ ์ดํ•˜ ํŒ€์œผ๋กœ ๋“์ ํ–ˆ๋‹ค.",
    "์˜ฅ์Šคํฌ๋“œ๋Š” ๋งจ์ฒด์Šคํ„ฐ ์œ ๋‚˜์ดํ‹ฐ๋“œ์™€์˜ ๊ฒฝ๊ธฐ์—์„œ 3-2๋กœ ํŒจํ–ˆ๋‹ค.",
    "์˜ฅ์Šคํฌ๋“œ๋Š” ํ™”์š”์ผ ๊ฒฝ๊ธฐ๋ฅผ ํ–ˆ๋‹ค.",
   ],
   [
    "๊ทธ ๊ณจ์€ 16์„ธ ์„ ์ˆ˜์˜ ์ฃผ์žฅ์„ ๊ฐ•ํ™”ํ•  ๊ฒƒ์ด๋‹ค.",
    "๊ทธ ๊ณจ์€ 16์„ธ ์„ ์ˆ˜์˜ 1 ๊ตฐ ๋ฐ๋ท” ์ฃผ์žฅ์„ ๊ฐ•ํ™”ํ•  ๊ฒƒ์ด๋‹ค.",
   ],
   [
    "์„ผํ„ฐ ๋ฐฑ์€ ์›จ์ŠคํŠธ ํ–„ 1 ๊ตฐ๊ณผ ํ•จ๊ป˜ ํ›ˆ๋ จํ–ˆ๋‹ค.",
    "์„ผํ„ฐ ๋ฐฑ์€ ์ด๋ฒˆ ์‹œ์ฆŒ ์›จ์ŠคํŠธ ํ–„ 1 ๊ตฐ๊ณผ ํ•จ๊ป˜ ํ›ˆ๋ จํ–ˆ๋‹ค.",
   ],
   [
    "์›จ์ŠคํŠธํ–„ ์œ ๋‚˜์ดํ‹ฐ๋“œ์˜ ์ตœ์‹  ๋‰ด์Šค๋Š” ์—ฌ๊ธฐ๋ฅผ ํด๋ฆญํ•˜์„ธ์š”."
   ]
]

Inputs and Outputs

  • Input: A Korean text passage.
  • Output: A list of propositions for all the sentences in the passage, with the propositions of each sentence grouped separately (an end-to-end helper combining the steps above is sketched below).
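
For convenience, the preprocessing, generation, and parsing steps above can be wrapped in a single helper; the function name and type hints below are illustrative, not part of the released code.

def propositionalize(text: str, max_new_tokens: int = 512) -> list[list[str]]:
    # End-to-end helper: build the prompt, generate, and parse proposition groups.
    inputs = tokenizer([get_input(text, tokenizer)], return_tensors="pt").to(device)
    output = model.generate(**inputs,
                            max_new_tokens=max_new_tokens,
                            pad_token_id=tokenizer.pad_token_id,
                            eos_token_id=tokenizer.eos_token_id,
                            use_cache=True)
    response = tokenizer.batch_decode(output[:, inputs["input_ids"].shape[1]:],
                                      skip_special_tokens=True)[0]
    return get_output(response)

# e.g. propositions = propositionalize("분석할 한국어 문단 ...")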

Evaluation Results

  • Metrics: The reference-less and reference-based metrics proposed in Scalable and Domain-General Abstractive Proposition Segmentation.
  • Models:
    • Dynamic 10-shot models: For each test example, the 10 most similar training examples were retrieved with BM25 and used as in-context examples (a retrieval sketch follows this list).
    • Translate-test models: The google/gemma-7b-aps-it model combined with KO->EN / EN->KO translation using GPT-4o or GPT-4o-mini.
    • Translate-train models: sLLMs fine-tuned with LoRA on the Korean (translated) RoSE dataset.
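
The BM25 selection step is not released with this model; a minimal sketch, assuming the rank_bm25 package and simple whitespace tokenization, could look like this.

# Hypothetical BM25 retrieval for the dynamic 10-shot prompts; whitespace
# tokenization is an assumption -- the actual baseline setup may differ.
from rank_bm25 import BM25Okapi

train_passages = ["...", "..."]  # passages from the translated RoSE training set
bm25 = BM25Okapi([p.split() for p in train_passages])

def select_examples(test_passage, k=10):
    # Return the k training passages with the highest BM25 score for this test passage.
    return bm25.get_top_n(test_passage.split(), train_passages, n=k)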

Reference-less metric

| Model | Precision | Recall | F1 |
|---|---|---|---|
| Gold | 97.46 | 96.28 | 95.88 |
| Dynamic 10-shot (Qwen/Qwen2.5-72B-Instruct) | 98.86 | 93.99 | 95.58 |
| Dynamic 10-shot (GPT-4o) | 97.61 | 97.00 | 96.87 |
| Dynamic 10-shot (GPT-4o-mini) | 98.51 | 97.12 | 97.17 |
| Translate-Test (google/gemma-7b-aps-it & GPT-4o translation) | 97.38 | 96.93 | 96.52 |
| Translate-Test (google/gemma-7b-aps-it & GPT-4o-mini translation) | 97.24 | 96.26 | 95.73 |
| Translate-Train (Qwen/Qwen2.5-7B-Instruct) | 94.66 | 92.81 | 92.08 |
| Translate-Train (LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct) | 93.80 | 93.29 | 92.80 |
| Translate-Train (yanolja/EEVE-Korean-Instruct-10.8B-v1.0) | 97.41 | 96.02 | 95.93 |

Reference-based metric

| Model | Precision | Recall | F1 |
|---|---|---|---|
| Gold | 100 | 100 | 100 |
| Dynamic 10-shot (Qwen/Qwen2.5-72B-Instruct) | 48.49 | 40.27 | 42.99 |
| Dynamic 10-shot (GPT-4o) | 49.16 | 44.72 | 46.05 |
| Dynamic 10-shot (GPT-4o-mini) | 49.30 | 39.25 | 42.88 |
| Translate-Test (google/gemma-7b-aps-it & GPT-4o translation) | 57.02 | 47.52 | 51.10 |
| Translate-Test (google/gemma-7b-aps-it & GPT-4o-mini translation) | 57.19 | 47.68 | 51.26 |
| Translate-Train (Qwen/Qwen2.5-7B-Instruct) | 42.62 | 38.37 | 39.64 |
| Translate-Train (LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct) | 46.82 | 43.08 | 44.02 |
| Translate-Train (yanolja/EEVE-Korean-Instruct-10.8B-v1.0) | 50.82 | 45.89 | 47.44 |