Dactory models
Model description
This is a set of fastText-based models that evaluate the quality and domain of text in the 24 official languages of the European Union. Their main use is to preprocess data from the Common Crawl project in order to build training sets for large language models. They can be used as part of the dactory pipeline, released by Kyutai to process Common Crawl.
There is one model per language, and each model is a multilabel classifier with the following eight labels: random webpages (`rand`), Wikipedia articles (`wiki`), textbooks (`books`), scientific articles from peS2o (`science`), and Stack Exchange websites related to STEM (`stem`), humanities (`hum`), pop culture (`pop`) and life advice (`life`).
The models were trained to distinguish lines sampled uniformly from these different sources.
To obtain training data for languages other than English, we translated the English training set with MADLAD, except for the `rand` and `wiki` labels, for which data is readily available in all languages.
- Model name: Dactory models
- Languages: Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Irish, Croatian, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovenian, Swedish
- Developed by: Kyutai
- Model type: Classification
- License: CC-BY-SA 4.0
- Version: 1.0
- Released: April 2025
Use cases
These models can be used to evaluate the quality of text by estimating how similar it is to text from high-quality sources.
In particular, one can take the score corresponding to the `rand` label as an estimate of text quality: the lower the `rand` score, the closer the text is to the curated sources.
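As a rough sketch of this idea, the helper below turns a fastText prediction into a single quality score. It assumes that labels are returned with fastText's default `__label__` prefix (for example `__label__rand`) and that quality is taken as one minus the `rand` probability. Both are assumptions made for illustration, and the function name `quality_from_prediction` is hypothetical.

```python
from typing import Sequence, Tuple

LABEL_PREFIX = "__label__"  # fastText's default label prefix (assumed for these models)


def quality_from_prediction(prediction: Tuple[Sequence[str], Sequence[float]]) -> float:
    """Map a fastText (labels, probabilities) pair to a score in [0, 1].

    Assumption: quality = 1 - P(rand). The prediction must include the
    `rand` label, so call predict with k=-1 to get all labels.
    """
    labels, probs = prediction
    scores = {}
    for label, prob in zip(labels, probs):
        # Strip the prefix if present, so "__label__rand" becomes "rand".
        name = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
        scores[name] = float(prob)
    if "rand" not in scores:
        raise ValueError("prediction does not contain the 'rand' label")
    # fastText probabilities can exceed 1.0 by a tiny margin, so clamp first.
    rand_score = min(max(scores["rand"], 0.0), 1.0)
    return 1.0 - rand_score
```

With a loaded model, this could be called as `quality_from_prediction(model.predict(text, k=-1))`, where `k=-1` asks fastText to return every label.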
They can also be used to organize a collection of documents, by similarity to the different data sources used to train the model.
For example, a large language model trained mostly on documents labeled as `books` tends to perform well on multiple-choice Q&A benchmarks such as MMLU, while an LLM trained mostly on documents labeled as `wiki` tends to perform well on general knowledge Q&A benchmarks such as TriviaQA.
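One way to organize a collection along these lines is to bucket each document under its highest-scoring label. The sketch below is illustrative only: `group_by_top_label` is a hypothetical helper, the `predict` argument stands in for any callable returning a fastText-style `(labels, probabilities)` pair, and the `__label__` prefix is fastText's default, assumed here to apply to these models.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

Prediction = Tuple[Sequence[str], Sequence[float]]


def group_by_top_label(
    documents: Iterable[str],
    predict: Callable[[str], Prediction],
    prefix: str = "__label__",
) -> Dict[str, List[str]]:
    """Bucket each document under the label with the highest score."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for doc in documents:
        labels, probs = predict(doc)
        if len(labels) == 0:
            continue  # nothing was predicted for this document
        # Index of the highest-probability label.
        best = max(range(len(labels)), key=lambda i: probs[i])
        name = labels[best]
        if name.startswith(prefix):
            name = name[len(prefix):]
        groups[name].append(doc)
    return dict(groups)
```

With a loaded model, `predict` could be `lambda text: model.predict(text.replace("\n", " "), k=-1)`. The newline replacement matters because fastText's `predict` raises an error on input that contains a newline.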
How to use
You can download the files locally by using the huggingface-hub Python package.
For example:
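import fasttext
from huggingface_hub import hf_hub_download
# Download the English model file from the Hugging Face Hub (cached locally after the first call).
local_path = hf_hub_download(repo_id="kyutai/dactory-models", filename="filter_en.bin")
# Load the fastText classifier from the downloaded file.
model = fasttext.load_model(local_path)
# predict returns a (labels, probabilities) pair; by default only the top label is returned (k=1).
print(model.predict("A computer scientist is a scientist who specializes in the academic study of computer science."))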
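Since the models were trained on individual lines, and fastText's `predict` accepts only one line at a time, a multi-line document has to be scored line by line. The following is a minimal sketch of one way to do that, using an unweighted average of the per-line scores. The averaging scheme, the helper name `average_line_scores`, and the `__label__` prefix (fastText's default) are assumptions for illustration, not a description of how the dactory pipeline aggregates scores.

```python
from collections import defaultdict
from typing import Callable, Dict, Sequence, Tuple

Prediction = Tuple[Sequence[str], Sequence[float]]


def average_line_scores(
    text: str,
    predict_line: Callable[[str], Prediction],
    prefix: str = "__label__",
) -> Dict[str, float]:
    """Average per-label scores over the non-empty lines of a document."""
    totals: Dict[str, float] = defaultdict(float)
    n_lines = 0
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        labels, probs = predict_line(line)
        for label, prob in zip(labels, probs):
            name = label[len(prefix):] if label.startswith(prefix) else label
            totals[name] += float(prob)
        n_lines += 1
    if n_lines == 0:
        return {}
    # A label missing from a line's prediction contributes 0 for that line.
    return {name: total / n_lines for name, total in totals.items()}
```

With the model loaded as above, this could be called as `average_line_scores(document, lambda line: model.predict(line, k=-1))`, where `k=-1` requests the scores of all labels rather than only the top one.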