
Question in the Text encoder setting

#81
by JungaoCanada - opened

Hi,

I think there may be a problem in how the text encoder is set up, though I'm not sure why this occurs...

In particular, the text encoder's number of hidden layers is set to 23 (https://huggingface.co/stabilityai/stable-diffusion-2-1/blob/main/text_encoder/config.json#L19). However, the official OpenCLIP ViT-H/14 config specifies 24 hidden layers (https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/model_configs/ViT-H-14.json#L15), which is also confirmed by the layer count in the LAION CLIP ViT-H/14 repo (https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K/blob/main/config.json#L54).
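One way to see what the `num_hidden_layers` setting controls is to count the hidden states the encoder returns. This is a minimal sketch with a tiny, randomly initialized `CLIPTextModel`; the hyperparameters below are made up for speed and are not SD 2.1's real ones (the real tower uses `hidden_size=1024` and 24 layers, while the SD 2.1 repo configures 23):

```python
import torch
from transformers import CLIPTextConfig, CLIPTextModel

# Tiny illustrative config (made-up sizes, NOT the SD 2.1 values)
cfg = CLIPTextConfig(
    vocab_size=100,
    hidden_size=32,
    intermediate_size=64,
    num_hidden_layers=4,   # stand-in for the real 24 (or 23 in the SD 2.1 repo)
    num_attention_heads=4,
    max_position_embeddings=16,
)
model = CLIPTextModel(cfg).eval()

input_ids = torch.tensor([[1, 5, 7, 2]])
with torch.no_grad():
    out = model(input_ids, output_hidden_states=True)

# hidden_states holds the embedding output plus one entry per transformer layer,
# so its length is num_hidden_layers + 1.
print(len(out.hidden_states))        # 5 here
print(out.hidden_states[-1].shape)   # (1, 4, 32): output of the last configured layer
```

With this framing, configuring one layer fewer means the encoder's final hidden state corresponds to what would have been the penultimate layer of the full 24-layer tower.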

Does anyone know why the Hugging Face repo sets the number of hidden layers to 23? Is this a bug, or a small trick to improve sampling performance?

Thanks

Could this be because the last projection layer is removed from (or not used in) SD, since it takes the 77x1024 per-token text embeddings as input rather than the final CLIP projection of dimension 1024?
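For what it's worth, the projection-layer point can be seen in how the two text-model classes in `transformers` differ: `CLIPTextModel` (the class diffusers pipelines load as the text encoder) returns per-token hidden states, while `CLIPTextModelWithProjection` additionally applies the final projection to a pooled vector. A sketch with a tiny made-up config (sizes are illustrative, not SD 2.1's real 77x1024):

```python
import torch
from transformers import CLIPTextConfig, CLIPTextModel, CLIPTextModelWithProjection

# Tiny illustrative config (made-up sizes, NOT the SD 2.1 values)
cfg = CLIPTextConfig(
    vocab_size=100, hidden_size=32, intermediate_size=64,
    num_hidden_layers=2, num_attention_heads=4,
    max_position_embeddings=16, projection_dim=8,
)
ids = torch.tensor([[1, 5, 7, 2]])

enc = CLIPTextModel(cfg).eval()                  # returns per-token hidden states
proj = CLIPTextModelWithProjection(cfg).eval()   # also applies the CLIP projection

with torch.no_grad():
    seq_emb = enc(ids).last_hidden_state   # shape (1, seq_len, hidden_size)
    pooled = proj(ids).text_embeds         # shape (1, projection_dim)

print(seq_emb.shape)   # (1, 4, 32): analogous to SD's 77x1024 conditioning
print(pooled.shape)    # (1, 8): the projected CLIP embedding, unused by SD
```

The projection only affects the pooled output, so dropping it would not by itself change the per-token embedding sequence; whether the layer count is about the projection or about using the penultimate layer are separate questions.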
