Applying SFT
Thank you for this great model; it is leagues ahead on a task I am researching. I am doing LoRA finetuning and plan to reuse largely the same code as before, but I am a little confused about the inputs to this model. There is num_patch_list, which appears to be a single number per clip (or segment, as it is called in the demo inference code); however, batch_chat seems to expect a flat list of numbers rather than a list of lists (which is what num_patch_list would be in a training batch). To avoid errors, is there any code that uses AutoProcessor to prepare the inputs for the forward pass? I am trying to piece things together from chat, generate, and forward in the main modeling code. Thanks in advance for the help!
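In case it clarifies what I am asking, here is roughly how I am assembling a batch right now, mirroring what chat/batch_chat appear to do. This is just a sketch: the token strings (`<img>`, `<IMG_CONTEXT>`), num_image_token, and the left padding are my guesses from reading the modeling code, so please correct anything that is off.

```python
import torch

IMG_START, IMG_END, IMG_CONTEXT = '<img>', '</img>', '<IMG_CONTEXT>'

def build_batch(tokenizer, questions, pixel_values_list, num_patch_list, num_image_token):
    """Build a padded text batch plus concatenated pixel_values (my sketch, not repo code)."""
    queries = []
    for question, num_patches in zip(questions, num_patch_list):
        # Expand the <image> placeholder into per-patch context tokens.
        image_tokens = IMG_START + IMG_CONTEXT * num_image_token * num_patches + IMG_END
        queries.append(question.replace('<image>', image_tokens, 1))

    # Padding side copied from batch_chat; for training with labels this may need adjusting.
    tokenizer.padding_side = 'left'
    model_inputs = tokenizer(queries, return_tensors='pt', padding=True)

    # All patch frames concatenated along dim 0: (sum(num_patch_list), 3, H, W).
    pixel_values = torch.cat(pixel_values_list, dim=0)
    return model_inputs, pixel_values
```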
I think I have it mostly figured out. One thing I still can't figure out is the image_flags argument in the forward pass at modeling_internvl_chat_hico2.py:L207
I guess my forward pass code runs now with:

```python
image_flags = torch.ones((pixel_values.shape[0], 16, 4096))  # hardcoded for 16
```
But I wanted to confirm correctness. I also wanted to confirm that the pixel_values shape means (batch_size * frames_per_video, channels, height, width). Is this reshaped later so that the correct frames are assigned to the right video+question pair? Perhaps that is what num_patch_list is for.
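For concreteness, this is what I am now passing instead of the hardcoded ones tensor above. The shapes reflect my reading of forward() (it seems to keep only the vit_embeds where image_flags == 1), so please correct me if they are wrong; the image size 448 and the patch counts are just placeholders.

```python
import torch

# Dummy values purely to illustrate the shapes I think forward() expects.
num_patch_list = [16, 16, 8]                                   # patches per sample in the batch
pixel_values = torch.randn(sum(num_patch_list), 3, 448, 448)   # all patch frames concatenated
# One 0/1 flag per row of pixel_values, marking real (vs. padded) patches.
image_flags = torch.ones(pixel_values.shape[0], 1, dtype=torch.long)
```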
A big problem I have is with extremely long dynamic token lengths that appear randomly during training and cause an OOM error: https://gist.github.com/arushirai1/4d5bb066f38a491265ea5eb27a7e9edd
I am not sure why the dynamic token length is gradually increasing. Additional information: only the LLM LoRA adapters on the last 10 layers are tuned. UPDATE: I realized the dynamic token length is not related to the vision transformer but to the narrations. It was a bug I had introduced that caused some leakage across collate function calls.
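In case anyone hits something similar, the bug was along these lines (a simplified illustration, not my exact code; build_inputs is a stand-in for the real collate logic):

```python
def build_inputs(batch, narrations):
    # Stand-in for the real collate logic (tokenization etc.); here it just
    # reports how much text would end up being tokenized.
    return {'num_samples': len(batch), 'total_narration_chars': sum(len(n) for n in narrations)}

def collate_fn_buggy(batch, narrations=[]):
    # Bug: the mutable default list is created once and shared across calls,
    # so narrations from earlier batches keep accumulating and the tokenized
    # sequence length grows every step until it OOMs.
    for sample in batch:
        narrations.append(sample['narration'])
    return build_inputs(batch, narrations)

def collate_fn_fixed(batch):
    # Fix: build a fresh list per call so only the current batch is tokenized.
    narrations = [sample['narration'] for sample in batch]
    return build_inputs(batch, narrations)
```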