How to turn off byte-fallback for Phi-3's tokenizer?

#10
by khoinguyenthe - opened

I have been trying out Phi-3 models and it's been a wonderful experience.

However, sometimes the tokenizer throws an exception. The line of code

```python
text = self.tokenizer.decode(output_tokens)
```

throws:

`Exception: 'utf-8' codec can't decode byte 0xf0 in position 10283: invalid continuation byte`

Most of the time this happened when the model's output was quite long (~800 words; counting the brackets, dots, etc., it's ~1.4k elements, which is still far from the max_length of 4096, imo).
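For context, byte 0xf0 opens a 4-byte UTF-8 sequence (the emoji range), so this looks like the decoded byte stream being cut off or interrupted mid-character. A minimal plain-Python sketch of that failure mode, with no tokenizer involved:

```python
# 0xf0 opens a 4-byte UTF-8 sequence (emoji range). If a stream contains the
# first byte of such a sequence but not its continuation bytes, strict
# decoding raises the same error as above.
emoji = "😀".encode("utf-8")        # b'\xf0\x9f\x98\x80'
broken = emoji[:1] + b" more text"  # sequence cut off, ordinary bytes follow

try:
    broken.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0xf0 in position 0: invalid continuation byte

# A lenient decode keeps the readable text and substitutes the broken byte:
print(broken.decode("utf-8", errors="replace"))  # '� more text'
```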

I have researched around and found that this can supposedly be fixed by turning off the byte-fallback of the BPE tokenizer, so that the tokenizer will ignore the non-UTF-8 tokens.
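To see what byte-fallback actually produces, here is a sketch using the Hugging Face `transformers` tokenizer for the base model (the hub id is my assumption about the matching model; the ONNX tokenizer may behave differently):

```python
# Sketch: with byte-fallback, characters missing from the vocabulary are
# encoded as individual <0xNN> byte tokens and reassembled into bytes at
# decode time. The hub id is an assumption about the matching base model.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
ids = tok.encode("😀", add_special_tokens=False)
print(tok.convert_ids_to_tokens(ids))
# expected output, something like: ['▁', '<0xF0>', '<0x9F>', '<0x98>', '<0x80>']
# If generation stops right after '<0xF0>', the decoder is left holding an
# invalid UTF-8 prefix, which is exactly the error above.
```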

I have tried tweaking the tokenizer.json file (see the sketch after this list):

  • set model/byte_fallback to false,
  • removed the item {"type": "ByteFallback"} from the decoder/decoders section,

but the error still happens.
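For reference, a minimal sketch of that edit done programmatically. It assumes your runtime actually reads this tokenizer.json; if the ONNX package ships its own tokenizer artifacts, edits here would have no effect:

```python
# Sketch of the tokenizer.json edit described above. Assumes the runtime
# loads this exact file; if it uses its own tokenizer artifacts, this edit
# silently changes nothing.
import json

with open("tokenizer.json", encoding="utf-8") as f:
    cfg = json.load(f)

cfg["model"]["byte_fallback"] = False
cfg["decoder"]["decoders"] = [
    d for d in cfg["decoder"]["decoders"] if d.get("type") != "ByteFallback"
]

with open("tokenizer.json", "w", encoding="utf-8") as f:
    json.dump(cfg, f, ensure_ascii=False, indent=2)
```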

I am using the mini-4k-instruct onnx-cuda-int4 version, btw.

I wonder: why did my changes not work, and is there any way to fix this?
Thanks for any help and suggestions!

(Note: this is also posted as an issue on the Phi-3CookBook GitHub repo: issue #14)

Could you please share instructions on how we can reproduce this issue? What script are you using to run the model?
