Missing Checkpoint Files

#5 opened by hcoxec

Following on from another discussion: quite a lot of the intermediate checkpoints are incomplete and therefore unusable. There are three types of errors that appear over and over again. For reference, the same code that threw the errors below loads the other 32B checkpoints without issue, yet approximately 25% of the intermediate checkpoints are unusable.
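For context, the loading code is essentially the standard pattern; a minimal sketch (the repo id here is a placeholder, and the revision is one of the published checkpoint names):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

REPO_ID = "<org>/<model-32B>"  # placeholder: the actual model repo id

# Each intermediate checkpoint is published as its own revision.
REVISION = "stage1-step170000-tokens1427B"

# Standard loading pattern: works for most 32B checkpoints,
# but fails for a large fraction of the intermediate ones.
tokenizer = AutoTokenizer.from_pretrained(REPO_ID, revision=REVISION)
model = AutoModelForCausalLM.from_pretrained(REPO_ID, revision=REVISION)
```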

Most common: part of the weights are missing. To give just four examples (I'm sure there are more, given my difficulty using other checkpoints):

In some cases, the tokenizer does not appear to be there at all; for example, the checkpoint "stage1-step170000-tokens1427B" throws this error:

```
File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 2276, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/gpt2/tokenization_gpt2.py", line 159, in __init__
    with open(merges_file, encoding="utf-8") as merges_handle:
TypeError: expected str, bytes or os.PathLike object, not NoneType
```
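One way to confirm which revisions are affected is to list each revision's files from the Hub (a sketch, with the repo id again a placeholder):

```python
from huggingface_hub import list_repo_files

REPO_ID = "<org>/<model-32B>"  # placeholder: the actual model repo id

files = set(list_repo_files(REPO_ID, revision="stage1-step170000-tokens1427B"))

# The GPT2-style slow tokenizer needs vocab.json and merges.txt; the fast
# tokenizer needs tokenizer.json. When merges.txt is absent, from_pretrained
# ends up calling open(None, ...), which is exactly the TypeError above.
expected = {"tokenizer.json", "vocab.json", "merges.txt"}
print("missing:", expected - files)
```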

In other cases, the tokenizer appears to be in a corrupted slow-tokenizer format that fails to convert:

```
ValueError: Converting from Tiktoken failed, if a converter for SentencePiece is available, provide a model path with a SentencePiece tokenizer.model file.Currently available slow->fast convertors: ['AlbertTokenizer', 'BartTokenizer', 'BarthezTokenizer', 'BertTokenizer', 'BigBirdTokenizer', 'BlenderbotTokenizer', 'CamembertTokenizer', 'CLIPTokenizer', 'CodeGenTokenizer', 'ConvBertTokenizer', 'DebertaTokenizer', 'DebertaV2Tokenizer', 'DistilBertTokenizer', 'DPRReaderTokenizer', 'DPRQuestionEncoderTokenizer', 'DPRContextEncoderTokenizer', 'ElectraTokenizer', 'FNetTokenizer', 'FunnelTokenizer', 'GPT2Tokenizer', 'HerbertTokenizer', 'LayoutLMTokenizer', 'LayoutLMv2Tokenizer', 'LayoutLMv3Tokenizer', 'LayoutXLMTokenizer', 'LongformerTokenizer', 'LEDTokenizer', 'LxmertTokenizer', 'MarkupLMTokenizer', 'MBartTokenizer', 'MBart50Tokenizer', 'MPNetTokenizer', 'MobileBertTokenizer', 'MvpTokenizer', 'NllbTokenizer', 'OpenAIGPTTokenizer', 'PegasusTokenizer', 'Qwen2Tokenizer', 'RealmTokenizer', 'ReformerTokenizer', 'RemBertTokenizer', 'RetriBertTokenizer', 'RobertaTokenizer', 'RoFormerTokenizer', 'SeamlessM4TTokenizer', 'SqueezeBertTokenizer', 'T5Tokenizer', 'UdopTokenizer', 'WhisperTokenizer', 'XLMRobertaTokenizer', 'XLNetTokenizer', 'SplinterTokenizer', 'XGLMTokenizer', 'LlamaTokenizer', 'CodeLlamaTokenizer', 'GemmaTokenizer', 'Phi3Tokenizer']
```
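In the meantime, a possible stopgap for the tokenizer-only failures, assuming the tokenizer is unchanged across pretraining checkpoints, is to pair the broken revision's weights with the tokenizer from a revision that loads cleanly (repo id and revisions below are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

REPO_ID = "<org>/<model-32B>"  # placeholder: the actual model repo id

# Assumption: the tokenizer is identical across all pretraining checkpoints,
# so a copy from any revision that loads cleanly can stand in.
tokenizer = AutoTokenizer.from_pretrained(REPO_ID, revision="main")

# The weights still come from the intermediate checkpoint; this only helps
# when the weights are intact and just the tokenizer files are broken.
model = AutoModelForCausalLM.from_pretrained(
    REPO_ID, revision="stage1-step170000-tokens1427B"
)
```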

That said, I would still appreciate it immensely if the missing checkpoints could be made available in a usable format. Thanks so much!

On further review, I think I underestimated: it's looking like almost 50% of the checkpoints are unusable, with the missing-weights error being the most common.

Hey @hcoxec, thank you for reaching out. We have noticed this, and I have started re-uploading the checkpoints. The estimated timeline for this process to finish is Tuesday. I will update you once it is done.
