GAN vs Diffusion models

I have been trying to study the different AI models available and how they function, mostly to get under the hood and unpack which model I want to apply more time to. I have tried almost all of them, and of course DALL·E 2 takes the cake in speed and creativity. Another very popular model is Midjourney, but its results were a bit too similar for me. Somehow, in the sea of AI art, Midjourney creations are always identifiable, unlike those from DALL·E 2 or Disco Diffusion.

But no criticism here, only synthesis.


Now what is a CLIP-Diffusion model?

A CLIP-guided diffusion model also makes images from text, but it relies on a noise-removal process rather than the adversarial setup GANs use. According to Chris Allen, “Diffusion is a mathematical process for removing noise from an image. CLIP is a tool for labeling images. When combined, CLIP uses its image identification skills to iteratively guide the diffusion denoising process toward an image that closely matches a text prompt. Diffusion is an iterative process. Each iteration, or step, CLIP will evaluate the existing image against the prompt, and provide a ‘direction’ to the diffusion process. Diffusion will ‘denoise’ the existing image, and DD will display its ‘current estimate’ of what the final image would look like. Initially, the image is just a blurry mess, but as DD advances through the iteration timesteps, coarse and then fine details of the image will emerge.”
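To make that loop concrete, here is a highly simplified sketch of CLIP-guided sampling. The `clip_score` and `denoise_step` functions are hypothetical stand-ins for the real CLIP and diffusion models, so treat this as the shape of the idea rather than working Disco Diffusion code:

```python
import torch

# clip_score and denoise_step are hypothetical placeholders for the real
# CLIP and diffusion models; only the guidance control flow is the point.
def clip_guided_sample(prompt_embedding, clip_score, denoise_step,
                       steps=250, guidance_scale=5.0):
    x = torch.randn(3, 512, 512, requires_grad=True)  # start from pure noise
    for t in reversed(range(steps)):
        # Each step, CLIP evaluates the current estimate against the prompt...
        score = clip_score(x, prompt_embedding)
        # ...and its gradient provides the 'direction' the quote mentions.
        direction = torch.autograd.grad(score, x)[0]
        # Diffusion denoises, nudged toward the prompt by CLIP's direction.
        x = denoise_step(x, t) + guidance_scale * direction
        x = x.detach().requires_grad_(True)
    return x
```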

Google says, “They work by corrupting the training data by progressively adding Gaussian noise. This removes details in the data till it becomes pure noise. Then, it trains a neural network to reverse the corruption process. Running this reversed corruption process synthesises data from pure noise by gradually denoising it until a clean sample is produced.”
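The forward “corruption” half of that description is surprisingly small in code. Below is a minimal sketch, assuming the standard DDPM closed form for jumping straight to step t; the names `betas`, `alpha_bars`, and `add_noise` are my own, not from any particular library:

```python
import torch

T = 1000                                    # total diffusion steps
betas = torch.linspace(1e-4, 0.02, T)       # linear noise schedule
alpha_bars = torch.cumprod(1.0 - betas, 0)  # cumulative signal fraction

def add_noise(x0, t):
    """Corrupt a clean image x0 straight to step t with Gaussian noise."""
    noise = torch.randn_like(x0)
    return alpha_bars[t].sqrt() * x0 + (1 - alpha_bars[t]).sqrt() * noise

x0 = torch.rand(3, 64, 64)            # stand-in "clean image"
slightly_noisy = add_noise(x0, 10)    # details largely intact
pure_noise = add_noise(x0, T - 1)     # essentially pure Gaussian noise
```

Training then amounts to teaching a network to predict, and remove, that noise, which is the reversal Google describes.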


“Then what is a GAN?” I asked.

A GAN (generative adversarial network) is an algorithmic architecture that pits two neural networks against each other to generate newly synthesised instances of data that can pass for real data. The architecture has two parts: a ‘Generator’ that learns to generate plausible data, and a ‘Discriminator’ that decides whether or not each instance of data it reviews belongs to the actual training dataset, penalising the generator for producing implausible results. Both the generator and the discriminator are neural networks. The generator’s output is fed directly into the discriminator’s input. Through backpropagation, the discriminator’s classification provides a signal that the generator uses to update its weights.
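To picture the two halves, here is a minimal PyTorch sketch of one GAN training step. The tiny MLP architectures and hyperparameters are my own illustrative choices, not from any particular paper:

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 64, 784   # e.g. flattened 28x28 images

generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, data_dim), nn.Tanh(),
)
discriminator = nn.Sequential(
    nn.Linear(data_dim, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),   # probability the input is real
)

loss_fn = nn.BCELoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def training_step(real_batch):
    n = real_batch.size(0)
    real_lbl, fake_lbl = torch.ones(n, 1), torch.zeros(n, 1)
    fake = generator(torch.randn(n, latent_dim))

    # Discriminator: learn to separate real data from the generator's fakes.
    d_loss = (loss_fn(discriminator(real_batch), real_lbl)
              + loss_fn(discriminator(fake.detach()), fake_lbl))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator: the discriminator's classification is the backpropagated
    # signal the generator uses to update its own weights.
    g_loss = loss_fn(discriminator(fake), real_lbl)
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```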

When generator training goes well, the discriminator gets worse at differentiating between real and fake data, and its accuracy drops toward 50%, no better than a coin flip.

A paper titled ‘Diffusion Models Beat GANs on Image Synthesis’ by OpenAI researchers shows that diffusion models can produce image samples superior to those from GANs, the previous state of the art, which have limitations of their own.

The paper attributes the gap between diffusion models and GANs to two factors: “The model architectures used by recent GAN literature have been heavily explored. GANs are able to trade off diversity for fidelity, producing high-quality samples but not covering the whole distribution.”


Okay then which one should I use?

Honestly, I am still very new to this, and exploring my options is where it’s at for me. But generally speaking, even though GANs are orders of magnitude faster than diffusion models at generating images, I have naturally gravitated toward the openness and customisability of Disco Diffusion. It’s like WordPress(.org) but for AI image making. Haha

In any case, Disco offers a very robust set of controls, which works best for me because I can tweak things to my needs, fit them into a production pipeline, and speed up my workflow.

But all this is a very amateur and oversimplified explanation 🙂

I tweaked some of the parameters in Disco, and it spat out some really cool and some funky images.
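For a flavour of the kind of controls I mean, here is roughly what a settings cell looks like. Parameter names are as I recall them from the Disco Diffusion notebook, and the values are purely illustrative, not recommendations:

```python
# Illustrative only: Disco Diffusion-style settings, values are examples.
text_prompts = {0: ["a crystal cathedral under a violet sky"]}
steps = 250                  # diffusion iterations; more steps, finer detail
width_height = [1280, 768]   # output resolution
clip_guidance_scale = 5000   # how hard CLIP steers toward the prompt
tv_scale = 0                 # smoothness penalty on the output
range_scale = 150            # keeps pixel values in a sane range
cutn_batches = 4             # batches of image "cuts" CLIP scores per step
skip_steps = 10              # skip early steps, e.g. when using an init image
```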

Until next time — Enjoy 🙂
