Dactory models

Model description

This is a set of fastText-based models to evaluate the quality and domain of text in the 24 official languages of the European Union. Their main use is to preprocess data from the Common Crawl project in order to build training sets for large language models. The models can be used as part of the dactory pipeline, released by Kyutai to process Common Crawl.

There is one model per language, and each model is a multilabel classifier with the following eight labels: random webpages (rand), Wikipedia articles (wiki), textbooks (books), scientific articles from pes2o (science), and Stack Exchange websites related to STEM (stem), humanities (hum), pop culture (pop) and life advice (life). The models were trained to distinguish lines sampled uniformly from these different sources. To get training data for languages other than English, we translated the English training set with MADLAD, except for the rand and wiki labels, for which data is readily available in all languages.
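As an illustration of the label set, the sketch below maps the eight short label names to their sources and converts a fastText prediction into a plain dictionary. It assumes the models use fastText's default `__label__` prefix followed by the short names listed above; `SOURCES` and `scores_by_label` are hypothetical helper names, not part of dactory.

```python
# Assumption: labels are stored with fastText's default "__label__" prefix,
# e.g. "__label__wiki". Adjust LABEL_PREFIX if the models use another one.
LABEL_PREFIX = "__label__"

# Short label name -> data source, as listed in the model description.
SOURCES = {
    "rand": "random webpages",
    "wiki": "Wikipedia articles",
    "books": "textbooks",
    "science": "scientific articles from pes2o",
    "stem": "Stack Exchange websites related to STEM",
    "hum": "Stack Exchange websites related to humanities",
    "pop": "Stack Exchange websites related to pop culture",
    "life": "Stack Exchange websites related to life advice",
}


def scores_by_label(labels, probs):
    """Convert a fastText (labels, probabilities) pair into {short_name: score}."""
    result = {}
    for label, prob in zip(labels, probs):
        # Strip the prefix only when it is actually present.
        name = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
        result[name] = float(prob)
    return result
```

With a loaded model, this could be called as `scores_by_label(*model.predict(line, k=-1))`, where `k=-1` asks fastText to return every label rather than only the top one.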

Model name: Dactory models
Languages: Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Irish, Croatian, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovenian, Swedish
Developed by: Kyutai
Model type: Classification
License: CC-BY-SA 4.0
Version: 1.0
Released: April 2025

Use cases

These models can be used to evaluate the quality of text by estimating how similar it is to text from high-quality sources. In particular, the score of the rand label can serve as an estimate of text quality: the lower it is, the more the text resembles the curated sources rather than random webpages. The models can also be used to organize a collection of documents by similarity to the different data sources used for training. For example, a large language model trained mostly on documents labeled as books will perform well on multiple-choice Q&A benchmarks such as MMLU, while an LLM trained mostly on documents labeled as wiki will perform well on general knowledge Q&A benchmarks such as TriviaQA.
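The two uses described above could be sketched as follows. This is a minimal illustration, not the dactory implementation: it assumes per-document scores are already available as dictionaries keyed by short label name, it treats one minus the rand score as the quality estimate (an assumption based on rand denoting random webpages), and the names `quality_estimate` and `group_by_top_label` are hypothetical.

```python
from collections import defaultdict


def quality_estimate(scores):
    """Assumed quality proxy: 1 - P(rand). Higher means closer to curated sources."""
    return 1.0 - scores.get("rand", 0.0)


def group_by_top_label(scored_docs):
    """Group document ids by their highest-scoring label.

    `scored_docs` maps a document id to a {label: score} dictionary.
    """
    groups = defaultdict(list)
    for doc_id, scores in scored_docs.items():
        top_label = max(scores, key=scores.get)
        groups[top_label].append(doc_id)
    return dict(groups)


# Example with made-up scores (not real model output).
docs = {
    "doc_a": {"rand": 0.25, "wiki": 0.5, "books": 0.25},
    "doc_b": {"rand": 0.75, "wiki": 0.125, "books": 0.125},
}
print({doc_id: quality_estimate(s) for doc_id, s in docs.items()})
print(group_by_top_label(docs))
```

A threshold on `quality_estimate` would then select documents to keep; the right threshold depends on the language and the target dataset and is not specified here.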

How to use

You can download the files locally by using the huggingface-hub Python package.

For example:

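import fasttext
from huggingface_hub import hf_hub_download

# Download the English model from the Hugging Face Hub (cached locally after the first call).
local_path = hf_hub_download(repo_id="kyutai/dactory-models", filename="filter_en.bin")
model = fasttext.load_model(local_path)
# predict returns a (labels, probabilities) pair; by default only the top label is returned.
print(model.predict("A computer scientist is a scientist who specializes in the academic study of computer science."))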
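Since the models were trained on individual lines, and fastText's `predict` raises an error when the input string contains a newline, a multi-line document has to be scored line by line. The sketch below averages the per-label probabilities over the non-empty lines; the averaging rule and the name `score_document` are assumptions made for illustration and may differ from what the dactory pipeline does.

```python
from collections import defaultdict


def score_document(predict, text):
    """Average per-label probabilities over the non-empty lines of `text`.

    `predict` is any callable that takes a single line and returns a
    (labels, probabilities) pair, like fastText's model.predict.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return {}
    totals = defaultdict(float)
    for line in lines:
        labels, probs = predict(line)
        for label, prob in zip(labels, probs):
            totals[label] += float(prob)
    # Labels missing from a line's prediction contribute 0 for that line.
    return {label: total / len(lines) for label, total in totals.items()}
```

With the model loaded as above, a call could look like `score_document(lambda line: model.predict(line, k=-1), document_text)`, where `k=-1` requests the scores of all labels for each line.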