Model Card for OmniGEC-Minimal-8B

OmniGEC-Minimal-8B extends the open-weight AYA-Expanse-8B with instruction tuning and supervised fine-tuning on OmniGEC, a silver-standard GEC corpus that combines MultiGEC-25, Wikipedia edits, and Reddit edits for eleven low- and mid-resource European languages. The result is a single model capable of paragraph-level correction across all covered languages, achieving state-of-the-art (SOTA) results for paragraph-based editing on the minimal and fluency tracks.

Per-language GLEU scores on MultiGEC-25 test set (Minimal edits)

| Language  | OmniGEC-Minimal-8B (AYA-Expanse-8B) | OmniGEC-Minimal-12B (Gemma-3-12B) |
|-----------|-------------------------------------|-----------------------------------|
| Czech     | 65.13 | 66.39 |
| English   | 78.08 | 77.30 |
| Estonian  | 41.52 | 55.12 |
| German    | 78.22 | 75.47 |
| Greek     | 56.03 | 53.01 |
| Italian   | 77.83 | 74.70 |
| Latvian   | 71.71 | 81.54 |
| Slovenian | 54.22 | 58.31 |
| Swedish   | 55.99 | 63.91 |
| Ukrainian | 76.41 | 75.17 |
| Average   | 65.51 | 68.09 |

Per-language GLEU scores on MultiGEC-25 test set (Fluency)

| Language  | OmniGEC-Fluency-8B (AYA-Expanse-8B) | OmniGEC-Fluency-12B (Gemma-3-12B) |
|-----------|-------------------------------------|-----------------------------------|
| Estonian  | 49.55 | 52.42 |
| Icelandic | 35.04 | 42.50 |
| Ukrainian | 75.82 | 71.88 |
| Average   | 53.47 | 55.60 |

Training Data

| Sub-corpus | Tokens | Source | Notes |
|------------|--------|--------|-------|
| WikiEdits-MultiGEC | ≈ 1.2 M | Human Wikipedia “copy-edit” revisions (6 m window) | Capped EN size to reduce bias |
| Reddit-MultiGEC | ≈ 13 M | Posts from ≥ 400 language-specific subreddits | Content-moderated, GPT-4o-mini corrections |
| UberText-GEC (TBD) | ≈ 110 M | Ukrainian Telegram corpus | GPT-4o-mini corrections, UA-only |
| MultiGEC-25 | ≈ 0.5 M | Golden shared-task data | train/dev/test = 80 / 10 / 10 |
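The MultiGEC-25 row keeps the shared task's 80/10/10 train/dev/test split. As a minimal sketch of reproducing that ratio with 🤗 Datasets (the dataset id, split name, and seed below are hypothetical placeholders, not the published configuration):

from datasets import load_dataset

# Hypothetical dataset id -- substitute the actual OmniGEC release on the Hub.
ds = load_dataset("lang-uk/OmniGEC", split="train")

# 80 / 10 / 10 split, mirroring the MultiGEC-25 convention in the table above.
train_rest = ds.train_test_split(test_size=0.2, seed=42)
dev_test = train_rest["test"].train_test_split(test_size=0.5, seed=42)

splits = {
    "train": train_rest["train"],
    "dev": dev_test["train"],
    "test": dev_test["test"],
}
print({name: len(part) for name, part in splits.items()})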

Silver corrections were created with a three-step pipeline (prompt → generate three candidates → aggregate) using o1-preview and GPT-4o-mini.
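The exact prompts, the assignment of the two models to the steps, and the aggregation criterion are not given above, so the sketch below is only illustrative: it keeps the three-step structure and the two model names, while every prompt string, parameter, and helper name is an assumption.

from openai import OpenAI

client = OpenAI()

def silver_correct(text: str, language: str) -> str:
    # Step 1: prompt -- ask for a correction of the paragraph (wording is illustrative).
    prompt = (
        f"Correct the grammatical errors in the following {language} paragraph. "
        f"Return only the corrected paragraph.\n\n{text}"
    )

    # Step 2: generate three candidate corrections (here with GPT-4o-mini).
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        n=3,
        temperature=1.0,
    )
    candidates = [choice.message.content for choice in response.choices]

    # Step 3: aggregate -- merge the candidates into a single final correction
    # (here with o1-preview; the real pipeline may assign the models differently).
    aggregation_prompt = (
        f"Original paragraph:\n{text}\n\nCandidate corrections:\n"
        + "\n---\n".join(candidates)
        + "\n\nMerge these into one final corrected paragraph and return only that paragraph."
    )
    final = client.chat.completions.create(
        model="o1-preview",
        messages=[{"role": "user", "content": aggregation_prompt}],
    )
    return final.choices[0].message.content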

Evaluation

  • Metric: GLEU via the official MultiGEC-25 CodaLab evaluator (minimal & fluency tracks); a rough local approximation is sketched after this list.
  • Both OmniGEC-tuned models surpass the paragraph-based LLaMA-3-8B baseline by +9–10 GLEU on the minimal track and deliver the current best open scores for Estonian and Latvian.
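The official numbers come from the CodaLab evaluator. For a quick local sanity check only (not the official implementation), NLTK's corpus-level GLEU gives a rough approximation; the tokenized toy example below is purely illustrative.

from nltk.translate.gleu_score import corpus_gleu

# Toy example: one system output scored against one reference correction.
references = [[["She", "goes", "to", "school", "every", "day", "."]]]
hypotheses = [["She", "go", "to", "school", "every", "day", "."]]

# NLTK's corpus_gleu is not the MultiGEC-25 evaluator; treat the score only
# as a rough local approximation before submitting to CodaLab.
print(f"approx. GLEU: {corpus_gleu(references, hypotheses):.3f}")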

OmniGEC-Tuned Checkpoints

AYA-Expanse-8B · Gemma-3-12B-IT

🔧 Quick start

pip install transformers
git clone https://github.com/r-kovalch/omnigec-models.git
cd omnigec-models
from transformers import AutoTokenizer, AutoModelForCausalLM
from src.instruction_templates import multigec_prompts
from src.utils.multigec import LANG_TO_CODE, LANG_CODE_TO_TOKEN

# For AYA-based models (OmniGEC-Minimal-8B, OmniGEC-Fluency-8B)
def aya_formatting_prompts_func(example):
    # Map the language name to its code and the corresponding special token
    language_code = LANG_TO_CODE[example["language"]]
    language_token = LANG_CODE_TO_TOKEN[language_code]

    # Fill the per-language instruction template with the text to correct
    user_input = example["feature"]
    prompt_template = multigec_prompts[example["language"]].prompt_template
    instruction = prompt_template.format(original_text=user_input)

    # Wrap the instruction in AYA's chat-turn special tokens
    text = f"<|START_OF_TURN_TOKEN|><|USER_TOKEN|>{language_token}{instruction}<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>"

    return text

# For Gemma-based models (OmniGEC-Minimal-12B, OmniGEC-Fluency-12B)
def gemma_formatting_prompts_func(example):
    language_code = LANG_TO_CODE[example["language"]]
    # Gemma special tokens do not use "|", so strip it from the language token
    language_token = LANG_CODE_TO_TOKEN[language_code].replace("|", "")

    user_input = example["feature"]
    prompt_template = multigec_prompts[example["language"]].prompt_template
    instruction = prompt_template.format(original_text=user_input)

    # Wrap the instruction in Gemma's chat-turn markers
    text = f"<start_of_turn>user\n{language_token}{instruction}<end_of_turn>\n<start_of_turn>model\n"

    return text

repo = "lang-uk/OmniGEC-Minimal-8B"   # or -Fluency-8B / -12B
tok  = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype="auto")

lang = "en"
text = "She go to school every day ."
# Choose formatting func accordingly to base model Gemma/Aya
prompt = formatting_prompts_func(text)

out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=1600)
print(tok.decode(out[0], skip_special_tokens=True))

Limitations

  • Reddit and UberText corrections are machine-generated; some noise remains, especially in slang-heavy text.
  • Outputs longer than 1,600 tokens are truncated unless you raise max_new_tokens.

Details

  • For details on use, please refer to our GitHub repository.
  • We strongly recommend following the inference code in our notebooks for both Gemma and AYA, as there are additional, model-specific generation parameters (temperature, top_k, max_new_tokens, and others); a placeholder example is sketched below.
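For illustration only, this is how such model-specific settings would be passed to generate, reusing tok, model, and prompt from the quick-start snippet; the numeric values are placeholders, not the tuned settings from the notebooks.

# Placeholder values -- the tuned per-model settings live in the repository notebooks.
generation_kwargs = {
    "max_new_tokens": 1600,
    "do_sample": True,
    "temperature": 0.3,
    "top_k": 50,
}

inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, **generation_kwargs)
# Decode only the newly generated tokens, dropping the prompt.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))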

Authors

Petro Ivaniuk, Mariana Romanyshyn, Roman Kovalchuk
