If possible, can you also share the "vocab.yml" file you used in the training?

by NovaYear - opened 9 days ago

9 days ago

Based on my first evaluations, the model seems very successful. If possible, can you also share the "vocab.yml" file you used in the training? I want to convert this model to .safetensors format so that it can be used outside of marian-decoder. The "convert_marian_to_pytorch.py" script in the Hugging Face library, which is used for this conversion, also requires the "vocab.yml" file. The "model.tr-en.vocab" file you shared does not work for this purpose.

NovaYear

9 days ago

I found a solution. I am sharing it so that those who need it can use it.

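import sentencepiece as spm
import yaml

# Load the SentencePiece model shipped with the translation model.
sp = spm.SentencePieceProcessor()
sp.load('model.tr-en.spm')

# Map every piece (token) to its integer ID.
vocab = {}
for i in range(sp.get_piece_size()):
    token = sp.id_to_piece(i)
    vocab[token] = i

# Write the mapping as YAML; allow_unicode keeps non-ASCII pieces readable.
with open('vocab.yml', 'w', encoding='utf-8') as f:
    yaml.dump(vocab, f, allow_unicode=True)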
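
For anyone reusing the snippet above, it may be worth sanity-checking the generated file before passing it to the conversion script. The following is only a rough sketch: `check_vocab` and `load_vocab` are hypothetical helper names I made up for illustration (they are not part of sentencepiece or the Hugging Face library), and the checks assume the file should be a plain token-to-ID mapping with IDs running from 0 to N-1, as the loop above produces.

```python
def check_vocab(vocab):
    """Return a list of problems found in a token -> id mapping (hypothetical helper)."""
    problems = []
    # Every key should still be a string after the YAML round trip.
    if not all(isinstance(token, str) for token in vocab):
        problems.append("some tokens are not strings")
    # Every value should be an integer ID.
    if not all(isinstance(i, int) for i in vocab.values()):
        problems.append("some ids are not integers")
    # IDs should form the contiguous range 0..N-1, one per token.
    elif sorted(vocab.values()) != list(range(len(vocab))):
        problems.append("ids are not a contiguous 0..N-1 range")
    return problems


def load_vocab(path="vocab.yml"):
    """Read the YAML file written above (hypothetical helper; needs PyYAML)."""
    import yaml  # imported lazily so check_vocab works without PyYAML

    with open(path, encoding="utf-8") as f:
        return yaml.safe_load(f)
```

Usage would be something like `print(check_vocab(load_vocab("vocab.yml")))`; an empty list means none of these checks failed, though it does not guarantee the conversion itself will succeed.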

NovaYear changed discussion status to closed 9 days ago

pinzhenchen

HPLT org 9 days ago

Hi @NovaYear, thanks for your interest in our work. We are glad that you found a solution!
