If possible, can you also share the "vocab.yml" file you used in the training?

#1
by NovaYear - opened

Based on my first evaluations, the model seems to perform very well. If possible, can you also share the "vocab.yml" file you used in training? I want to convert this model to .safetensors format so that it can be used outside of marian-decoder. The "convert_marian_to_pytorch.py" script in the Hugging Face library, which is used for this conversion, also requires the "vocab.yml" file. The "model.tr-en.vocab" file you shared does not work for this.

I found a solution and am sharing it here for anyone who needs it. The script below builds "vocab.yml" from the SentencePiece model file:

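import sentencepiece as spm
import yaml

# Load the SentencePiece model shipped with the Marian model
sp = spm.SentencePieceProcessor()
sp.load('model.tr-en.spm')

# Map every piece (token string) to its integer id
vocab = {}
for i in range(sp.get_piece_size()):
    token = sp.id_to_piece(i)
    vocab[token] = i

# Write the mapping as YAML, keeping non-ASCII characters readable
with open('vocab.yml', 'w', encoding='utf-8') as f:
    yaml.dump(vocab, f, allow_unicode=True)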
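
Before handing the file to the conversion script, it may be worth checking that the token-to-id mapping is complete. The sketch below is only an illustration: `check_vocab` is a hypothetical helper (not part of sentencepiece, PyYAML, or the conversion script), and the toy dictionary stands in for the real contents of "vocab.yml". It checks that the ids form the contiguous range 0..N-1, which would fail if, for example, two ids shared the same piece string and one entry overwrote the other in the dict.

```python
def check_vocab(vocab):
    """Return a list of problems found in a token -> id mapping (empty if none)."""
    problems = []
    ids = sorted(vocab.values())
    # Every id from 0 to len(vocab) - 1 should appear exactly once.
    if ids != list(range(len(vocab))):
        problems.append("ids are not the contiguous range 0..len(vocab)-1")
    # Every key should be a string token.
    for token in vocab:
        if not isinstance(token, str):
            problems.append(f"non-string key: {token!r}")
    return problems


# Hypothetical toy mapping standing in for the real vocab.yml contents.
example = {"<unk>": 0, "<s>": 1, "</s>": 2, "merhaba": 3}
print(check_vocab(example))
```

On the real file, the same function could be called on the dictionary returned by `yaml.safe_load` after opening "vocab.yml" with UTF-8 encoding. An empty list means neither check found a problem.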

NovaYear changed discussion status to closed

Hi @NovaYear, thanks for your interest in our work. We are glad that you found a solution!
