July 8, 2024
Stable Diffusion belongs to a class of deep learning models called diffusion models. They are generative models, meaning they are designed to generate new data similar to what they have seen in training. In the case of Stable Diffusion, the data are images.
Why do you need to know? Apart from being a fascinating subject in its own right, some understanding of the inner mechanics will make you a better artist. You can use the tool correctly to achieve results with higher precision.
How does text-to-image differ from image-to-image? What’s the CFG scale? What’s denoising strength? You will find the answer in this article.
In its simplest form, Stable Diffusion is a text-to-image model. Give it a text prompt, and it will return an AI image matching the text.
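If you want to try this yourself, here is a minimal sketch using the Hugging Face diffusers library. The model id and hardware setup are illustrative assumptions, not something this article prescribes:

```python
# Minimal text-to-image sketch with the diffusers library.
# Assumes: pip install diffusers transformers torch, and a CUDA GPU.
import torch
from diffusers import StableDiffusionPipeline

# Load a pre-trained Stable Diffusion checkpoint (model id is an example).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# Give it a text prompt; it returns an AI image matching the text.
image = pipe("a photograph of a cat wearing a top hat").images[0]
image.save("cat.png")
```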
The Stable Diffusion model has achieved state-of-the-art results for image generation. Stable Diffusion is based on a particular type of diffusion model called the latent diffusion model, proposed in High-Resolution Image Synthesis with Latent Diffusion Models and created by researchers and engineers from CompVis, LMU and RunwayML. The model was initially trained on 512x512 images from a subset of the LAION-5B database.
In particular, this is achieved by encoding text inputs into latent vectors using pre-trained language models like CLIP. Diffusion models can achieve state-of-the-art results when generating images from text, but the denoising process is slow and consumes a lot of memory when generating high-resolution images. This makes such models challenging to train and to use for inference.
Latent diffusion reduces memory use and computation time by applying the diffusion process in a lower-dimensional latent space instead of the actual pixel space. In latent diffusion, the model is trained to generate latent (compressed) representations of the images.
Forward diffusion turns a photo into noise. (Figure modified from this article)
A forward diffusion process adds noise to a training image, gradually turning it into an unrecognizable noise image. The forward process will turn any cat or dog image into a noise image. Eventually, you won't be able to tell whether it was initially a dog or a cat. (This is important.)
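To make the forward process concrete, here is a small PyTorch sketch that noises an image in closed form; the linear noise schedule is an illustrative assumption:

```python
# Sketch of forward diffusion: add t steps of noise to an image in one shot,
# using x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # assumed linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # alpha_bar_t

def add_noise(x0: torch.Tensor, t: int) -> torch.Tensor:
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

x0 = torch.rand(3, 512, 512)      # stand-in for a cat or dog photo in [0, 1]
x_early = add_noise(x0, t=50)     # still mostly recognizable
x_late = add_noise(x0, t=999)     # essentially pure noise; cat or dog, you can't tell
```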
Now comes the exciting part. What if we could reverse the diffusion? Like playing a video backward and going back in time, we would see where the ink drop was initially added.
The reverse diffusion process recovers an image. Starting from a noisy, meaningless image, reverse diffusion recovers a cat OR a dog image. This is the main idea. Technically, every diffusion process has two parts: (1) drift and (2) random motion. The reverse diffusion drifts towards either cat OR dog images but nothing in between. That’s why the result can either be a cat or a dog.
Stable Diffusion is a large text-to-image diffusion model trained on billions of images. Image diffusion models learn to denoise images in order to generate output images. Stable Diffusion uses latent images encoded from the training data as input. Given an image z0, the diffusion algorithm progressively adds noise to produce a noisy image zt, where t is the number of times noise has been added. When t is large enough, the image approximates pure noise. Given a set of inputs such as the time step t, a text prompt, and the noisy image zt, image diffusion algorithms learn a network that predicts the noise added to zt.
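In code, the training objective looks roughly like the sketch below. The stand-in network and tensor shapes are assumptions; in the real model the network is the conditional U-Net described later:

```python
# Sketch of the noise-prediction (epsilon-prediction) training objective on latents.
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def training_loss(noise_model, z0, text_emb):
    # Pick a random time step t and noise the latent z0 to zt in closed form.
    t = torch.randint(0, T, (z0.shape[0],))
    eps = torch.randn_like(z0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    zt = a_bar.sqrt() * z0 + (1 - a_bar).sqrt() * eps

    # The network predicts the added noise, conditioned on t and the text prompt.
    eps_pred = noise_model(zt, t, text_emb)
    return F.mse_loss(eps_pred, eps)

# Toy stand-in "network" and inputs, just to make the sketch executable.
noise_model = lambda zt, t, emb: torch.zeros_like(zt)
z0 = torch.randn(2, 4, 64, 64)       # batch of 64x64x4 latents
text_emb = torch.randn(2, 77, 768)   # assumed CLIP text embedding shape
print(training_loss(noise_model, z0, text_emb))
```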
There are three main components in latent diffusion: a variational autoencoder (VAE), a U-Net, and a text encoder.
The VAE (variational autoencoder) is a deep learning model widely used in image processing and generation. It consists of two primary components: an encoder, which compresses an image into a latent representation, and a decoder, which reconstructs an image from that latent.
During training, especially in models like latent diffusion models, the encoded latents (the compressed representations of the images) are progressively corrupted with noise. This is done in a controlled manner, step by step, to train the model to generate or reconstruct images from these noisy representations.
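In the diffusers library, this controlled corruption is typically handled by a scheduler; here is a hedged sketch, assuming a DDPM-style noise schedule:

```python
# Sketch: corrupting encoded latents with scheduler-controlled noise.
import torch
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)

latents = torch.randn(1, 4, 64, 64)    # stand-in for VAE-encoded latents
noise = torch.randn_like(latents)
timesteps = torch.tensor([500])        # how many noise steps to apply

noisy_latents = scheduler.add_noise(latents, noise, timesteps)
```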
One of the key advantages of this design is the significant reduction in memory and compute requirements. By working in a lower-dimensional latent space (e.g., 64x64x4 instead of 512x512x3), the model requires less memory, which is particularly beneficial when working with limited resources such as a 16GB GPU on platforms like Google Colab. This efficiency allows high-quality images, such as 512x512 resolution images, to be generated quickly, which is a substantial achievement in deep-learning-based image generation.
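A hedged sketch of the round trip through the VAE is shown below; the model id, the 0.18215 scaling factor, and the preprocessing are assumptions based on common Stable Diffusion v1 setups:

```python
# Sketch: encode a 512x512x3 image into a 64x64x4 latent and decode it back.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

image = torch.rand(1, 3, 512, 512) * 2 - 1   # a 512x512 RGB image scaled to [-1, 1]

with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample() * 0.18215   # -> (1, 4, 64, 64)
    recon = vae.decode(latents / 0.18215).sample                 # -> (1, 3, 512, 512)

print(latents.shape, recon.shape)
```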
The U-Net model in image generation contexts, especially for denoising, works by taking noisy latent representations of images as input and predicting the noise present in them. It consists of an encoder, a middle block, and a decoder connected by skip connections.
The model includes down/up-sampling layers, ResNet layers, and Vision Transformers (ViTs). It's designed to be conditional, considering both the timestep and text embeddings for guidance, which helps in accurately predicting and removing noise from the latents, resulting in a cleaner, high-quality image reconstruction.
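A single denoising call can be sketched as follows; the model id and tensor shapes are assumptions for illustration:

```python
# Sketch: one U-Net call predicting the noise in a latent, conditioned on a
# time step and text embeddings.
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

noisy_latent = torch.randn(1, 4, 64, 64)   # a 64x64x4 latent with noise in it
timestep = torch.tensor([500])             # how far along the diffusion we are
text_emb = torch.randn(1, 77, 768)         # stand-in for CLIP text embeddings

with torch.no_grad():
    noise_pred = unet(noisy_latent, timestep, encoder_hidden_states=text_emb).sample

print(noise_pred.shape)    # same shape as the latent: (1, 4, 64, 64)
```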
The text encoder in models like Stable Diffusion transforms input text prompts into embeddings, which guide the image generation process. It typically uses a transformer-based architecture to convert text into latent text embeddings. Stable Diffusion employs a pre-trained text encoder, such as CLIP, to leverage its proficiency in correlating text with visual content. These embeddings are then used as conditional inputs to the U-Net for denoising and image generation, aligning the visuals with the text prompt's intent.
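A minimal sketch of this step, assuming the CLIP checkpoint commonly used with Stable Diffusion v1:

```python
# Sketch: turn a prompt into per-token CLIP text embeddings for conditioning.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer(
    "a photograph of a cat wearing a top hat",
    padding="max_length", max_length=77, truncation=True, return_tensors="pt",
)
with torch.no_grad():
    text_emb = text_encoder(tokens.input_ids).last_hidden_state

print(text_emb.shape)   # (1, 77, 768): one embedding per token, fed to the U-Net
```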
Putting it all together, the model works as follows during inference:
1. The text encoder converts the prompt into text embeddings.
2. A random noise latent (e.g., 64x64x4) is sampled as the starting point.
3. At each step, the U-Net predicts the noise in the current latent, conditioned on the time step and the text embeddings.
4. The scheduler uses that prediction to compute a slightly less noisy latent, and the loop repeats for a fixed number of steps.
5. Finally, the VAE decoder converts the denoised latent back into a 512x512 image.
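A condensed sketch of that loop, reusing the unet, vae, and text_emb from the sketches above; classifier-free guidance, safety checking, and image post-processing are omitted, and the scheduler choice is an assumption:

```python
# Sketch of the inference loop: random latent -> repeated denoising -> VAE decode.
import torch
from diffusers import DDIMScheduler

def generate(unet, vae, text_emb, num_steps=50):
    scheduler = DDIMScheduler(num_train_timesteps=1000)
    scheduler.set_timesteps(num_steps)

    latents = torch.randn(1, 4, 64, 64)    # start from pure latent noise
    for t in scheduler.timesteps:
        with torch.no_grad():
            noise_pred = unet(latents, t, encoder_hidden_states=text_emb).sample
        # The scheduler turns the noise prediction into a slightly less noisy latent.
        latents = scheduler.step(noise_pred, t, latents).prev_sample

    with torch.no_grad():
        image = vae.decode(latents / 0.18215).sample   # back to 512x512 pixel space
    return image
```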
The Scheduler in latent diffusion models like Stable Diffusion plays a key role in controlling the noise application and removal process during training and image generation. It determines how noise is added to and predicted from the images at various steps of the model's operation.
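For instance, here is a small sketch of how a scheduler decides which time steps are visited at inference, and how one scheduler (sampler) can be swapped for another; the specific scheduler classes are illustrative choices:

```python
# Sketch: the scheduler picks which of the 1000 training time steps to visit.
from diffusers import DDIMScheduler, DPMSolverMultistepScheduler

scheduler = DDIMScheduler(num_train_timesteps=1000)
scheduler.set_timesteps(50)    # run only 50 denoising steps at inference
print(scheduler.timesteps)     # 50 time steps spread across the original 1000

# The same configuration can drive a different sampling algorithm.
faster = DPMSolverMultistepScheduler.from_config(scheduler.config)
```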
The applications of a latent diffusion model like Stable Diffusion, facilitated by components like the scheduler, are diverse and impactful: text-to-image generation, image-to-image translation, inpainting, and super-resolution (upscaling), among others.
Furthermore, these models significantly reduce the costs of training and inference. This efficiency has the potential to democratize high-resolution image synthesis, making advanced image generation and modification techniques accessible to a wider audience.