Upload folder using huggingface_hub
- README.md +2 -82
- config.json +43 -0
- diffusion_pytorch_model.safetensors +3 -0
README.md
CHANGED
@@ -1,83 +1,3 @@
----
-tags:
-- pytorch
-- vae
-- diffusion
-- image-generation
-- cc3m
-license: mit
-datasets:
-- pixparse/cc3m-wds
-library_name: diffusers
-model-index:
-- name: vae-256px-8z
-  results:
-  - task:
-      type: image-generation
-    dataset:
-      type: conceptual-captions
-      name: Conceptual Captions
-    metrics:
-    - type: Frechet Inception Distance (FID)
-      value: 9.43
-    - type: Learned Perceptual Image Patch Similarity (LPIPS)
-      value: 0.163
-    - type: ID-similarity
-      value: 0.0010186772755879851
-    source:
-      name: Conceptual Captions GitHub
-      url: https://github.com/google-research-datasets/conceptual-captions
----
-
-This model is a UNet-style Variational Autoencoder (VAE) trained on the [CC3M](https://huggingface.co/datasets/pixparse/cc3m-wds) dataset for high-quality image reconstruction and generation. It integrates adversarial, perceptual, and identity-preserving loss terms to improve semantic and visual fidelity.
-
-## Architecture
-
-- **Encoder/Decoder**: Multi-scale UNet architecture
-- **Latent Space**: 8-channel latent bottleneck with reparameterization (mu, logvar)
-- **Losses**:
-  - L1 reconstruction loss
-  - KL divergence with annealing
-  - LPIPS perceptual loss (VGG backbone)
-  - Identity loss via MoCo-v2 embeddings
-  - Adversarial loss via Patch Discriminator with Spectral Norm
-
-$$
-\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{recon}} + \mathcal{L}_{\mathrm{LPIPS}} + 0.5 \cdot \mathcal{L}_{\mathrm{GAN}} + 0.1 \cdot \mathcal{L}_{\mathrm{ID}} + 10^{-6} \cdot \mathcal{L}_{\mathrm{KL}}
-$$
-
-## Reconstructions
-
-| Input | Output |
-|-------|--------|
-|  |  |
-
-## Training Config
-
-| Hyperparameter   | Value                              |
-|------------------|------------------------------------|
-| Dataset          | CC3M (850k images)                 |
-| Image Resolution | 256 x 256                          |
-| Batch Size       | 16                                 |
-| Optimizer        | AdamW                              |
-| Learning Rate    | 5e-5                               |
-| Precision        | bf16 (mixed precision)             |
-| Total Steps      | 210,000                            |
-| GAN Start Step   | 50,000                             |
-| KL Annealing     | Yes (10% of training)              |
-| Augmentations    | Crop, flip, jitter, blur, rotation |
-
-Trained with a cosine learning rate schedule, gradient clipping, and automatic mixed precision (`torch.cuda.amp`).
-
-## Usage Example
-
-```python
-import torch
-from diffusers import AutoencoderKL
-vae = AutoencoderKL.from_pretrained("gabehubner/vae-256px-8z")
-vae.eval()
-
-input_tensor = torch.randn(1, 3, 256, 256)  # Replace with your actual input
-with torch.no_grad():
+# VAE

+A UNet-style VAE trained on CC3M with adversarial and perceptual losses.
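The removed model card states the combined objective only as a formula with fixed weights (0.5 for GAN, 0.1 for ID, 1e-6 for KL). Purely as an illustration, the sketch below shows one way such a weighted sum could be assembled in PyTorch; the LPIPS module, discriminator logits, and MoCo-v2 identity encoder are assumed inputs supplied by the caller, and the non-saturating GAN term is an assumed formulation, not the repository's actual code.

```python
import torch
import torch.nn.functional as F

def vae_total_loss(recon, target, mu, logvar,
                   lpips_fn=None, disc_fake_logits=None, id_encoder=None,
                   kl_weight=1e-6):
    """Hedged sketch of the removed README's objective (0.5 GAN, 0.1 ID, 1e-6 KL).

    lpips_fn, disc_fake_logits, and id_encoder are hypothetical stand-ins for the
    LPIPS network, patch-discriminator output, and MoCo-v2 encoder.
    """
    loss = F.l1_loss(recon, target)                                  # L1 reconstruction
    if lpips_fn is not None:
        loss = loss + lpips_fn(recon, target).mean()                 # LPIPS perceptual term
    if disc_fake_logits is not None:
        # non-saturating generator loss (assumed formulation)
        loss = loss + 0.5 * F.softplus(-disc_fake_logits).mean()
    if id_encoder is not None:
        # identity term as 1 - cosine similarity of embeddings (assumed formulation)
        sim = F.cosine_similarity(id_encoder(recon), id_encoder(target), dim=-1)
        loss = loss + 0.1 * (1.0 - sim.mean())
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())    # KL to N(0, I)
    return loss + kl_weight * kl                                     # kl_weight is annealed during training
```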
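The removed training table and the cosine-schedule note describe the optimization recipe only at a high level. A minimal, hypothetical sketch of wiring those pieces together with stock PyTorch follows; the model and loss are placeholders, and the clipping threshold and scheduler details are assumptions rather than the repository's training script.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Conv2d(3, 3, 1).to(device)     # placeholder for the VAE
loss_fn = torch.nn.L1Loss()                     # placeholder for the full objective

optimizer = AdamW(model.parameters(), lr=5e-5)
scheduler = CosineAnnealingLR(optimizer, T_max=210_000)  # cosine decay over the 210k steps

for step in range(3):                           # toy loop; the real run was 210,000 steps at batch size 16
    batch = torch.randn(4, 3, 256, 256, device=device)
    with torch.autocast(device_type=device, dtype=torch.bfloat16):  # bf16 mixed precision
        recon = model(batch)
        loss = loss_fn(recon, batch)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clipping threshold assumed
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```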
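The removed card's usage example breaks off at the `with torch.no_grad():` line. A minimal sketch of how the block might continue, assuming the checkpoint loads through the standard `diffusers` `AutoencoderKL` interface as the card implied; the config's custom `VAEWrapper` class name suggests the stock class may not load it directly, so treat this as an approximation.

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("gabehubner/vae-256px-8z")
vae.eval()

input_tensor = torch.randn(1, 3, 256, 256)  # replace with a real image batch scaled to [-1, 1]
with torch.no_grad():
    # encode() returns a posterior distribution; sample (or take its mode) for latents
    posterior = vae.encode(input_tensor).latent_dist
    latents = posterior.sample() * vae.config.scaling_factor
    # decode back to image space
    reconstruction = vae.decode(latents / vae.config.scaling_factor).sample

print(latents.shape, reconstruction.shape)
```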
config.json
ADDED
@@ -0,0 +1,43 @@
+{
+  "_class_name": "VAEWrapper",
+  "_diffusers_version": "0.33.1",
+  "act_fn": "silu",
+  "attention_resolutions": [
+    32
+  ],
+  "block_out_channels": [
+    64
+  ],
+  "channel_multipliers": [
+    1,
+    2,
+    2,
+    4,
+    4
+  ],
+  "double_z": true,
+  "down_block_types": [
+    "DownEncoderBlock2D"
+  ],
+  "force_upcast": true,
+  "hidden_channels": 128,
+  "image_size": 256,
+  "in_channels": 3,
+  "latent_channels": 4,
+  "latents_mean": null,
+  "latents_std": null,
+  "layers_per_block": 1,
+  "mid_block_add_attention": true,
+  "norm_num_groups": 32,
+  "num_res_blocks": 3,
+  "out_channels": 3,
+  "sample_size": 32,
+  "scaling_factor": 0.18215,
+  "shift_factor": null,
+  "up_block_types": [
+    "UpDecoderBlock2D"
+  ],
+  "use_post_quant_conv": true,
+  "use_quant_conv": true,
+  "z_channels": 8
+}
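The config mixes standard `AutoencoderKL` fields with custom ones (`channel_multipliers`, `z_channels`, `hidden_channels`), and the removed card's 8-channel latent matches `z_channels`. As a hypothetical illustration, the snippet below reads the file and derives the latent grid size under the usual convention of one 2x downsample per multiplier stage after the first; that convention is an assumption, not something stated in the config.

```python
import json

with open("config.json") as f:
    cfg = json.load(f)

# Assumed convention: stages = len(channel_multipliers), downsample factor = 2 ** (stages - 1)
downsample = 2 ** (len(cfg["channel_multipliers"]) - 1)   # [1, 2, 2, 4, 4] -> 16
latent_hw = cfg["image_size"] // downsample                # 256 // 16 = 16
print(f"latent tensor: {cfg['z_channels']} x {latent_hw} x {latent_hw}")  # 8 x 16 x 16
```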
diffusion_pytorch_model.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b84b814c16e6e5bcb8e0300200a960c1199e27ccae73ad1215d48695eac74f82
+size 322169676
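The weights are committed as a Git LFS pointer (sha256 and byte size above). A small, hypothetical check that a downloaded copy matches the pointer, plus a peek at its tensors via the `safetensors` library:

```python
import hashlib
from safetensors import safe_open

path = "diffusion_pytorch_model.safetensors"

# sha256 of the full file should equal the oid in the LFS pointer
with open(path, "rb") as f:
    print(hashlib.sha256(f.read()).hexdigest())

# list a few stored tensors and their shapes
with safe_open(path, framework="pt") as f:
    for name in list(f.keys())[:5]:
        print(name, tuple(f.get_tensor(name).shape))
```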