---
tags:
- pytorch
- vae
- diffusion
- image-generation
- cc3m
license: mit
datasets:
- pixparse/cc3m-wds
library_name: diffusers
---

# UNet-Style VAE for 256x256 Image Reconstruction

This model is a UNet-style Variational Autoencoder (VAE) trained on the [CC3M](https://huggingface.co/datasets/pixparse/cc3m-wds) dataset for high-quality image reconstruction and generation. It integrates adversarial, perceptual, and identity-preserving loss terms to improve semantic and visual fidelity.

## Architecture

- **Encoder/Decoder**: Multi-scale UNet architecture
- **Latent Space**: 8-channel latent bottleneck with reparameterization (mu, logvar)
- **Losses**:
  - L1 reconstruction loss
  - KL divergence with annealing
  - LPIPS perceptual loss (VGG backbone)
  - Identity loss via MoCo-v2 embeddings
  - Adversarial loss via Patch Discriminator w/ Spectral Norm

$$
\mathcal{L}_{total} = \mathcal{L}_{recon} + \mathcal{L}_{LPIPS} + 0.5 \cdot \mathcal{L}_{GAN} + 0.1 \cdot \mathcal{L}_{ID} + 10^{-6} \cdot \mathcal{L}_{KL}
$$

## Reconstructions

| Input | Output |
|-------|--------|
| ![input](./input_grid.png) | ![output](./recon_grid.png) |

## Training Config

| Hyperparameter   | Value                              |
|------------------|------------------------------------|
| Dataset          | CC3M (850k images)                 |
| Image Resolution | 256 x 256                          |
| Batch Size       | 16                                 |
| Optimizer        | AdamW                              |
| Learning Rate    | 5e-5                               |
| Precision        | bf16 (mixed precision)             |
| Total Steps      | 210,000                            |
| GAN Start Step   | 50,000                             |
| KL Annealing     | Yes (10% of training)              |
| Augmentations    | Crop, flip, jitter, blur, rotation |

Trained with a cosine learning rate schedule, gradient clipping, and automatic mixed precision (`torch.cuda.amp`).
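The latent bottleneck uses the standard reparameterization trick, and the KL term is annealed over the first 10% of training. A minimal sketch of both pieces is below; the function names, the linear annealing shape, and the `max_weight` target of `1e-6` (taken from the loss formula above) are illustrative assumptions, not the actual training code.

```python
import torch


def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    # z = mu + sigma * eps keeps sampling differentiable w.r.t. mu and logvar.
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps


def kl_weight(step: int, total_steps: int = 210_000,
              warmup_frac: float = 0.1, max_weight: float = 1e-6) -> float:
    # Assumed linear ramp: weight grows from 0 to max_weight over the
    # first 10% of training, then stays flat.
    warmup_steps = int(total_steps * warmup_frac)
    return max_weight * min(1.0, step / warmup_steps)
```

At inference time, generation typically samples `z` directly from a standard normal of the same shape as `mu` rather than calling `reparameterize`.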
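The weighted sum in the loss formula can be sketched as follows. This is an assumed implementation: the non-saturating generator loss, the cosine-distance form of the identity loss, and all argument names (`lpips_fn`, `d_fake_logits`, `id_real`, `id_fake`) are hypothetical stand-ins for the real LPIPS module, discriminator output, and MoCo-v2 embeddings; only the weights (0.5, 0.1, 1e-6) and the GAN start step (50,000) come from this card.

```python
import torch
import torch.nn.functional as F


def total_loss(recon, target, mu, logvar, d_fake_logits, id_real, id_fake,
               lpips_fn, kl_w=1e-6, step=0, gan_start=50_000):
    """Combine the loss terms with the weights given in the model card."""
    l_recon = F.l1_loss(recon, target)
    l_lpips = lpips_fn(recon, target).mean()
    # Assumed non-saturating generator loss, active only after step 50k.
    if step >= gan_start:
        l_gan = F.softplus(-d_fake_logits).mean()
    else:
        l_gan = recon.new_zeros(())
    # Assumed identity loss: cosine distance between embeddings.
    l_id = 1.0 - F.cosine_similarity(id_real, id_fake, dim=-1).mean()
    # Analytic KL between N(mu, sigma^2) and N(0, 1).
    l_kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return l_recon + l_lpips + 0.5 * l_gan + 0.1 * l_id + kl_w * l_kl
```

In the actual schedule, `kl_w` would ramp up during the annealing phase rather than stay fixed at `1e-6`.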