How does Stable Diffusion work?
Stable Diffusion has always looked like a miracle to me. I could (to some extent) comprehend how an LLM like ChatGPT works and generates text. I did not know most of the details and complexities, but I had a relatively good idea. SD, however, has always been a mystery. The fact that you can write "red cat" and get a picture of a cat that is red seemed impossible.
I have been reading various articles and GitHub projects to understand how Stable Diffusion works, and I will soon write down everything I learned. The biggest finding for me was this: Stable Diffusion does not "draw" the picture it has been asked for. It "deletes" everything that is not that picture, until what remains matches the prompt.
Allow me to explain. Contrary to what I initially imagined, SD is not given a text prompt plus a white canvas to draw on. It is given the prompt plus a canvas full of random noise. It then "denoises" that canvas step by step, removing a bit of the "extra" noise each time. This denoising is guided by vectors generated from the prompt.
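To make the idea concrete, here is a toy sketch in Python. It is not how Stable Diffusion is actually implemented, only an illustration of the "start from noise and remove what does not belong" loop. Everything in it is an assumption made up for the example: the names TOY_PROMPT_TARGETS, predict_noise and generate are hypothetical, the "canvas" is just four numbers, and the noise predictor cheats by looking up the answer, whereas a real model uses a trained neural network to estimate the noise.

```python
import random

# Hypothetical stand-in for prompt encoding: a real system uses a trained text
# encoder; here each known prompt maps to a tiny hand-written "picture".
TOY_PROMPT_TARGETS = {
    "red cat": [1.0, 0.0, 0.0, 1.0],
    "blue dog": [0.0, 0.0, 1.0, 0.5],
}


def predict_noise(canvas, prompt):
    """Toy noise predictor: whatever on the canvas is not the prompt's picture.

    A real model is a trained network that has to estimate this; the toy
    version simply looks up the target and subtracts it.
    """
    target = TOY_PROMPT_TARGETS[prompt]
    return [c - t for c, t in zip(canvas, target)]


def generate(prompt, steps=50, strength=0.2, seed=0):
    """Start from random noise and repeatedly remove part of the predicted noise."""
    rng = random.Random(seed)
    size = len(TOY_PROMPT_TARGETS[prompt])
    # The starting canvas is pure noise, not a blank page.
    canvas = [rng.gauss(0.0, 1.0) for _ in range(size)]
    for _ in range(steps):
        noise = predict_noise(canvas, prompt)
        # Remove only a fraction of the predicted noise at each step.
        canvas = [c - strength * n for c, n in zip(canvas, noise)]
    return canvas


if __name__ == "__main__":
    print([round(v, 3) for v in generate("red cat")])
```

Each pass shrinks the remaining noise by a constant factor, so after enough steps the canvas ends up very close to the toy "picture" tied to the prompt. Nothing is ever drawn; noise is only taken away.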
I'll explain this more technically in my next post.