Summary
Multilingual JQL regression head that scores texts based on their educational value as described in our paper.
Model Details
Model Description
This model is a regression head built on top of the Snowflake/snowflake-arctic-embed-m-v2.0
embedding model. It assigns a score to text documents, representing their educational value on a scale from 0 (lowest) to 5 (highest). Since the underlying embedding model provides language-aligned embeddings, the regression head can be used for multiple languages.
We provide checkpoints for three different training sets. These training sets were generated by letting a large language model (LLM) annotate 500k text documents. We provide trained heads based on annotations from: Llama3.3-70B-it, Gemma-3-27B-it, and Mistral Small 3.1-24B-it. For each LLM, we also created training sets with balanced and unbalanced distributions of educational value scores. Checkpoints trained on balanced datasets are denoted with "balanced"; otherwise, they are denoted as "unbalanced".
- Developed by: A collaboration between HessianAI, DFKI, Fraunhofer IAIS, Lamarr Institute, and TU Darmstadt.
- Model type: Regression Head
- Language(s) (NLP): Bulgarian, Czech, Croatian, Macedonian, Polish, Slovak, Slovenian, Serbian, Ukrainian, Danish, German, Icelandic, Dutch, Norwegian, Swedish, Catalan, Spanish, French, Galician, Italian, Portuguese, Romanian, Estonian, Finnish, Hungarian, Lithuanian, Latvian, Greek, Irish, Basque, Maltese, Turkish, Albanian, and Armenian.
- License: Apache-2.0
As evaluated in the paper, the trained regression heads generalize to languages supported by the backbone embedding model beyond those used in our training.
Model Sources
- Repository: github.com/JQL-AI/JQL-Annotation-Pipeline
- Paper: arXiv:2505.22232
- Project Page: https://huggingface.co/spaces/Jackal-AI/JQL
Direct Use
The model is designed to quickly and efficiently assess the educational value of texts—significantly faster than querying a large language model (LLM) directly. This makes it particularly useful for building large, high-quality text datasets.
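For illustration, a minimal filtering sketch is shown below. The `score_texts` callable stands in for any scoring function built on this head (such as the one in the usage example further down), and the 3.0 cut-off is a placeholder rather than a value recommended in the paper.

```python
from typing import Callable, List

def filter_by_edu_score(
    documents: List[str],
    score_texts: Callable[[List[str]], List[float]],  # hypothetical scorer built on this head
    threshold: float = 3.0,  # illustrative cut-off, not a recommendation from the paper
) -> List[str]:
    """Keep only documents whose predicted educational value meets the threshold."""
    scores = score_texts(documents)
    return [doc for doc, score in zip(documents, scores) if score >= threshold]
```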
Downstream Use
In a downstream experiment, we demonstrate that training with high-quality texts selected by this model is faster and more effective than training with texts filtered only by heuristic methods. Details of our experiments can be found in the accompanying paper.
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.
How to Get Started with the Model
Usage is described in the accompanying GitHub repository (github.com/JQL-AI/JQL-Annotation-Pipeline).
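The sketch below illustrates the intended flow under stated assumptions: documents are embedded with the backbone model, and the embeddings are passed through a trained head. The checkpoint file name, the `trust_remote_code` flag, and the exact format of the stored head weights are assumptions made for illustration; the repository contains the authoritative loading code.

```python
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

# Hypothetical checkpoint file name; the released checkpoints and their
# loading utilities are documented in the JQL-Annotation-Pipeline repository.
HEAD_CHECKPOINT = "jql_edu_head_gemma_balanced.pt"

# 1. Embed documents with the multilingual backbone model (768-dim vectors).
embedder = SentenceTransformer(
    "Snowflake/snowflake-arctic-embed-m-v2.0", trust_remote_code=True
)
texts = [
    "Photosynthesis is the process by which plants convert sunlight into chemical energy.",
    "lol ok gr8 see u l8r",
]
embeddings = torch.tensor(embedder.encode(texts), dtype=torch.bfloat16)

# 2. Recreate the regression head (two linear layers with a ReLU in between,
#    see Technical Specifications) and load the trained weights.
head = nn.Sequential(nn.Linear(768, 1000), nn.ReLU(), nn.Linear(1000, 1)).to(torch.bfloat16)
head.load_state_dict(torch.load(HEAD_CHECKPOINT, map_location="cpu"))

# 3. One educational-value score (roughly on the 0-5 scale) per document.
with torch.no_grad():
    scores = head(embeddings).squeeze(-1)
print(scores)
```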
Technical Specifications
Model Architecture and Objective
The regression head consists of two linear layers with a ReLU activation function in between. The input dimension is 768 (matching the backbone embedding size), the hidden dimension is 1000, the output is a single score, and the head uses bfloat16 precision.
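A minimal PyTorch sketch of this architecture follows; the class and attribute names are illustrative and not taken from the released code.

```python
import torch
import torch.nn as nn

class JQLEduHead(nn.Module):
    """Sketch of the described head: Linear(768 -> 1000) -> ReLU -> Linear(1000 -> 1)."""

    def __init__(self, embedding_dim: int = 768, hidden_dim: int = 1000):
        super().__init__()
        self.fc1 = nn.Linear(embedding_dim, hidden_dim)
        self.act = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, 1)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, 768) document embeddings from the backbone model
        return self.fc2(self.act(self.fc1(embeddings))).squeeze(-1)

# bfloat16 precision, as stated above
head = JQLEduHead().to(torch.bfloat16)
```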
📖 Citation
@article{ali2025judging,
  title   = {Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models},
  author  = {Mehdi Ali and Manuel Brack and Max Lübbering and Elias Wendt and Abbas Goher Khan and Richard Rutmann and Alex Jude and Maurice Kraus and Alexander Arno Weber and Felix Stollenwerk and David Kaczér and Florian Mai and Lucie Flek and Rafet Sifa and Nicolas Flores-Herr and Joachim Köhler and Patrick Schramowski and Michael Fromm and Kristian Kersting},
  year    = {2025},
  journal = {arXiv preprint arXiv:2505.22232}
}