The Magic of Stable Diffusion: Transforming Text into Visual Wonders

Stable Diffusion belongs to a class of deep learning models called diffusion models. They are generative models, meaning they are designed to generate new data similar to what they have seen in training. In the case of Stable Diffusion, the data are images.

Why do you need to know this? Apart from being a fascinating subject in its own right, some understanding of the inner mechanics will make you a better artist: you will be able to use the tool with more precision and get the results you intend.

How does text-to-image differ from image-to-image? What’s the CFG scale? What’s denoising strength? You will find the answers in this article.

What can Stable Diffusion do?

In its simplest form, Stable Diffusion is a text-to-image model. Give it a text prompt, and it will return an AI-generated image matching the text.

Stable Diffusion turns text prompts into images.
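
If you want to try this yourself, here is a minimal sketch using the diffusers library; the checkpoint name is just an example, and a CUDA-capable GPU is assumed.

```python
import torch
from diffusers import StableDiffusionPipeline

# Example Stable Diffusion checkpoint; any compatible weights can be substituted
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Generate an image from a text prompt and save it to disk
image = pipe("a photograph of an astronaut riding a horse").images[0]
image.save("astronaut.png")
```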

The Stable Diffusion model has achieved state-of-the-art results in image generation. It is based on a particular type of diffusion model called the Latent Diffusion Model, proposed in High-Resolution Image Synthesis with Latent Diffusion Models and created by researchers and engineers from CompVis, LMU Munich, and RunwayML. The model was initially trained on 512x512 images from a subset of the LAION-5B database.

Text conditioning is achieved by encoding text prompts into latent vectors with a pre-trained language model such as CLIP. Diffusion models can achieve state-of-the-art results for generating images from text, but the denoising process is slow and consumes a lot of memory when generating high-resolution images, which makes these models challenging to train and to run at inference time.

Latent diffusion reduces memory use and computation time by applying the diffusion process in a lower-dimensional latent space instead of the actual pixel space. In latent diffusion, the model is trained to generate latent (compressed) representations of the images.

Forward diffusion

Forward diffusion turns a photo into noise.

A forward diffusion process adds noise to a training image, gradually turning it into meaningless noise. The forward process will turn any cat or dog image into a noise image; eventually, you won’t be able to tell whether it was originally a dog or a cat. (This is important.)
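
To make this concrete, here is a minimal sketch of forward diffusion in PyTorch, using the standard closed-form DDPM noising formula; the linear noise schedule below is an illustrative assumption, not Stable Diffusion's exact schedule.

```python
import torch

# Illustrative linear noise schedule (assumed values)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative products of alphas

def forward_diffuse(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Return a noisy version x_t of the clean image x0 after t noising steps."""
    noise = torch.randn_like(x0)
    return alpha_bars[t].sqrt() * x0 + (1 - alpha_bars[t]).sqrt() * noise

x0 = torch.randn(1, 3, 512, 512)     # stand-in for a training image
x_late = forward_diffuse(x0, T - 1)  # at large t, the result is almost pure noise
```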

Reverse diffusion

Now comes the exciting part. What if we could reverse the diffusion? It would be like playing a video of an ink drop diffusing in water backward: going back in time, we would see where the ink drop was initially added.

The reverse diffusion process recovers an image. Starting from a noisy, meaningless image, reverse diffusion recovers either a cat or a dog image. This is the main idea. Technically, every diffusion process has two parts: (1) drift and (2) random motion. The reverse diffusion drifts toward either cat or dog images but nothing in between, which is why the result is either a cat or a dog.
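
As a sketch of what a single reverse step looks like, here is the standard DDPM update in PyTorch, where the deterministic term plays the role of the drift and the added Gaussian term the random motion; the schedule values are the same illustrative assumptions as above, and predicted_noise stands in for the output of a trained network.

```python
import torch

# Same illustrative schedule as in the forward diffusion sketch
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def reverse_step(x_t: torch.Tensor, t: int, predicted_noise: torch.Tensor) -> torch.Tensor:
    """One DDPM reverse step: estimate x_{t-1} from x_t and the model's noise prediction."""
    # Drift: subtract the (rescaled) predicted noise from x_t
    mean = (x_t - betas[t] / (1 - alpha_bars[t]).sqrt() * predicted_noise) / alphas[t].sqrt()
    if t == 0:
        return mean
    # Random motion: re-inject a small amount of fresh Gaussian noise
    return mean + betas[t].sqrt() * torch.randn_like(x_t)
```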

Training of Diffusion Model

Stable Diffusion is a large text-to-image diffusion model trained on billions of images. Image diffusion models learn to denoise images in order to generate output images. Stable Diffusion uses latent images encoded from the training data as input. Given a latent image z0, the diffusion algorithm progressively adds noise and produces a noisy image zt, where t is the number of times noise has been added. When t is large enough, the image approximates pure noise. Given a set of inputs such as the time step t and the text prompt, image diffusion algorithms learn a network that predicts the noise added to the noisy image zt.
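
Here is a minimal training-step sketch in PyTorch. It assumes a unet noise-prediction network and a scheduler with an add_noise method (as in the diffusers library), along with batches of pre-encoded latents and text embeddings; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def training_step(unet, scheduler, latents, text_embeddings):
    """One denoising training step: add noise to the latents, predict it, regress on the truth."""
    noise = torch.randn_like(latents)
    # Sample a random timestep t for each example in the batch
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy_latents = scheduler.add_noise(latents, noise, t)           # forward diffusion
    noise_pred = unet(noisy_latents, t,
                      encoder_hidden_states=text_embeddings).sample  # predict the added noise
    return F.mse_loss(noise_pred, noise)                             # simple MSE objective
```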

There are three main components in latent diffusion:

  1. An autoencoder (VAE).
  2. A U-Net.
  3. A text-encoder, e.g. CLIP’s Text Encoder.

1. The autoencoder (VAE)

The VAE (Variational Autoencoder) is a deep learning model widely used in image processing and generation. It consists of two primary components: an encoder and a decoder.

  1. Encoder: The encoder part of a VAE takes high-dimensional input data, such as images, and compresses it into a lower-dimensional representation. For example, the encoder can take an image with dimensions of 512x512x3 (where 512x512 is the resolution and 3 is the number of RGB color channels) and compress it into a much smaller latent representation that captures the essential features of the original image in a more compact form. In Stable Diffusion, the image is encoded into a latent space of size 64x64x4.
  2. Decoder: The decoder part of the VAE works in the opposite manner. It takes the low-dimensional latent representation and reconstructs it back into the high-dimensional space. The quality of the reconstruction depends on how well the VAE has learned to capture the important features of the data in the latent space.

During training, especially in models like latent diffusion models, the encoded latents (the compressed representations of the images) are progressively corrupted with noise. This is done in a controlled manner, step by step, to train the model to generate or reconstruct images from these noisy representations.

One of the key advantages of such a model is the significant reduction in memory and compute requirements. By working in a lower-dimensional latent space (64x64x4 instead of 512x512x3, roughly 48 times fewer values), the model needs far less memory, which is particularly beneficial when working with limited resources such as a 16GB GPU on platforms like Google Colab. This efficiency allows quick generation of high-quality 512x512 images, which is a substantial achievement in the field of image generation with deep learning.
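
A minimal encode/decode sketch using diffusers' AutoencoderKL is shown below; the model id and the 0.18215 scaling factor are the values commonly used with Stable Diffusion v1 and are stated here as assumptions.

```python
import torch
from diffusers import AutoencoderKL

# Load the VAE from an example Stable Diffusion checkpoint
vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

image = torch.randn(1, 3, 512, 512)  # stand-in for a preprocessed image in [-1, 1]

with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample() * 0.18215  # 1x4x64x64 latent
    decoded = vae.decode(latents / 0.18215).sample              # back to 1x3x512x512

print(latents.shape, decoded.shape)
```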

2. The U-Net

The U-Net model in image generation contexts, especially for denoising, works by taking noisy latent representations of images as input and predicting the noise present in them. It consists of an encoder, a middle block, and a decoder with skip connections:

  1. Encoder: Compresses the image representation to a lower resolution.
  2. Middle Block: Processes the data at its most compressed form.
  3. Decoder: Reconstructs the original resolution image from the compressed representation, using skip connections from the encoder for better detail preservation.

The model includes down- and up-sampling layers, ResNet blocks, and transformer (attention) blocks. It is designed to be conditional, considering both the timestep and the text embeddings for guidance, which helps in accurately predicting and removing noise from the latents and results in a cleaner, high-quality image reconstruction.
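
A minimal sketch of a single noise-prediction call with diffusers' UNet2DConditionModel follows; the model id is an example, and the tensors are random stand-ins for real latents and CLIP text embeddings.

```python
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="unet")

noisy_latents = torch.randn(1, 4, 64, 64)  # noisy latent image
timestep = torch.tensor([981])             # current diffusion step
text_embeddings = torch.randn(1, 77, 768)  # stand-in for CLIP embeddings (77 tokens, 768 dims)

with torch.no_grad():
    # The U-Net predicts the noise in the latent, conditioned on the timestep and the prompt
    noise_pred = unet(noisy_latents, timestep,
                      encoder_hidden_states=text_embeddings).sample

print(noise_pred.shape)  # torch.Size([1, 4, 64, 64])
```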

3. The Text-encoder

The text encoder in models like Stable Diffusion transforms input text prompts into embeddings, which guide the image generation process. It typically uses a transformer-based architecture to convert text into latent text embeddings. Stable Diffusion employs a pre-trained text encoder, such as CLIP, to leverage its proficiency in correlating text with visual content. These embeddings are then used as conditional inputs to the U-Net for denoising and image generation, aligning the visuals with the text prompt's intent.
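
Here is a minimal sketch of turning a prompt into text embeddings with the CLIP tokenizer and text encoder from the transformers library; the model id is an example checkpoint.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

model_id = "runwayml/stable-diffusion-v1-5"  # example checkpoint
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")

prompt = ["a photograph of an astronaut riding a horse"]
tokens = tokenizer(prompt, padding="max_length",
                   max_length=tokenizer.model_max_length, return_tensors="pt")

with torch.no_grad():
    text_embeddings = text_encoder(tokens.input_ids)[0]  # shape: [1, 77, 768]
```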

Putting it all together, the model works as follows during the inference process:
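
The sketch below wires the components together into a bare-bones inference loop: encode the prompt, start from random latent noise, repeatedly predict and remove noise with the U-Net and the scheduler, then decode with the VAE. The model id, guidance scale, and step count are illustrative, and classifier-free guidance is included because it is where the CFG scale comes in.

```python
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel, PNDMScheduler
from transformers import CLIPTokenizer, CLIPTextModel

model_id = "runwayml/stable-diffusion-v1-5"  # example checkpoint
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
scheduler = PNDMScheduler.from_pretrained(model_id, subfolder="scheduler")

prompt, guidance_scale, steps = "a cat wearing a space suit", 7.5, 25

def embed(text: str) -> torch.Tensor:
    tokens = tokenizer([text], padding="max_length",
                       max_length=tokenizer.model_max_length, return_tensors="pt")
    return text_encoder(tokens.input_ids)[0]

with torch.no_grad():
    # 1. Encode the prompt (plus an empty prompt for classifier-free guidance)
    text_emb = torch.cat([embed(""), embed(prompt)])

    # 2. Start from random latent noise
    scheduler.set_timesteps(steps)
    latents = torch.randn(1, 4, 64, 64) * scheduler.init_noise_sigma

    # 3. Denoising loop: predict noise, apply guidance, step the scheduler
    for t in scheduler.timesteps:
        latent_in = scheduler.scale_model_input(torch.cat([latents] * 2), t)
        noise_pred = unet(latent_in, t, encoder_hidden_states=text_emb).sample
        uncond, cond = noise_pred.chunk(2)
        noise_pred = uncond + guidance_scale * (cond - uncond)  # CFG scale at work
        latents = scheduler.step(noise_pred, t, latents).prev_sample

    # 4. Decode the final latent back to pixel space with the VAE
    image = vae.decode(latents / 0.18215).sample
```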

Scheduler

The Scheduler in latent diffusion models like Stable Diffusion plays a key role in controlling the noise application and removal process during training and image generation. It determines how noise is added to and predicted from the images at various steps of the model's operation.

  1. Noise Application: The Scheduler dictates how to systematically add noise to an image. This process is crucial during the training phase, where the model learns to reconstruct the original image from its noisy version.
  2. Noise Prediction: The Scheduler also guides the model in predicting and removing noise from the images. This is essential for generating clean, high-quality images.
  3. Customization for Fewer Steps: For different applications, the Scheduler can be adjusted to operate with a smaller number of steps. This flexibility allows for optimizing the model's performance based on specific needs or resource constraints.
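
As an illustration of the three roles above, here is a short sketch using diffusers' DDPMScheduler: add_noise for noise application, step for noise removal, and set_timesteps for running with fewer steps. The scheduler class and numbers are illustrative choices, not necessarily the scheduler Stable Diffusion ships with.

```python
import torch
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)  # illustrative configuration

clean = torch.randn(1, 4, 64, 64)  # stand-in for a clean latent
noise = torch.randn_like(clean)

# 1. Noise application: produce the noisy latent seen during training
noisy = scheduler.add_noise(clean, noise, torch.tensor([500]))

# 3. Customization: run inference with far fewer steps than were used in training
scheduler.set_timesteps(num_inference_steps=50)

# 2. Noise prediction/removal: given a model's noise estimate, step back one timestep
noise_pred = torch.randn_like(noisy)  # stand-in for the U-Net's prediction
t = scheduler.timesteps[0]
previous = scheduler.step(noise_pred, t, noisy).prev_sample
```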

The applications of a Latent Diffusion Model like Stable Diffusion, facilitated by components like the Scheduler, are diverse and impactful:

  • Text-to-Image Generation: Creating images from textual descriptions.
  • Image-to-Image Generation: Generating new images or modifying existing ones based on a starting image.
  • Image Upscaling: Enlarging smaller images to larger resolutions.
  • Inpainting: Modifying specific areas of an image based on provided prompts, useful for tasks like object removal or area enhancement.

Furthermore, these models significantly reduce the costs of training and inference. This efficiency has the potential to democratize high-resolution image synthesis, making advanced image generation and modification techniques accessible to a wider audience.