Project 5

Part A

Part 0

In this first part, we played around with DeepFloyd and generated some images with the already-trained model. Below we can see that fewer inference steps in the first stage result in more random images (less coherent subject matter), while fewer inference steps in the second stage result in noisier upscaling (the subject matter is preserved, but everything looks much grainier). As for the overall trend, more inference steps improves image quality slightly, especially for the "man wearing a hat" prompt, which appears much more realistic and defined.

Below are the generated images with their corresponding prompts above each image. The smaller images are from stage 1, and the larger images are the upscaled versions from stage 2. I used varying num_inference_steps for the stages (a sketch of the two-stage pipeline follows the list below), but for all of them I used the seed YOUR_SEED = 180.

num_inference_steps = 20 for both stages
num_inference_steps = 30 for generation, num_inference_steps = 5 for upscaling
num_inference_steps = 75 for both stages
num_inference_steps = 1 for generation, num_inference_steps = 30 for upscaling
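
For reference, here is a minimal sketch of how the two DeepFloyd IF stages can be invoked, assuming the Hugging Face diffusers checkpoints (the model variants, prompt, and step counts here are illustrative, not necessarily the exact configuration used):

```python
import torch
from diffusers import DiffusionPipeline

# load the two DeepFloyd IF stages (fp16 variants assumed)
stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
stage_2 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None,
    variant="fp16", torch_dtype=torch.float16)

prompt_embeds, negative_embeds = stage_1.encode_prompt("a man wearing a hat")
generator = torch.manual_seed(180)  # YOUR_SEED = 180

# stage 1 generates at 64x64; stage 2 upscales to 256x256
image = stage_1(prompt_embeds=prompt_embeds,
                negative_prompt_embeds=negative_embeds,
                generator=generator, num_inference_steps=20,
                output_type="pt").images
image = stage_2(image=image, prompt_embeds=prompt_embeds,
                negative_prompt_embeds=negative_embeds,
                generator=generator, num_inference_steps=20,
                output_type="pil").images[0]
```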

Part 1: Sampling Loops

In this part, I wrote some sampling loops using the pretrained DeepFloyd denoisers. This essentially means starting with a clean image $x_0$, iteratively adding noise in steps, and then using the denoisers to predict how much noise to remove at each step, iteratively denoising the image back to a clean image.

1.1 Implementing the Forward Process

In this section I implement a forward(im, t) function which computes a noisy image $x_t$ given $t$ and $x_0$, defined as:

$$q(x_t \mid x_0) = N(x_t ;\, \sqrt{\bar\alpha_t}\, x_0,\, (1 - \bar\alpha_t)\mathbf{I}) \tag{A.1}$$

which is equivalent to:

$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon \quad \text{where } \epsilon \sim N(0, 1) \tag{A.2}$$
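
A minimal sketch of this forward function, assuming alphas_cumprod is the scheduler's cumulative-product table (e.g. stage_1.scheduler.alphas_cumprod) indexed by integer timestep:

```python
import torch

def forward(im, t, alphas_cumprod):
    """Compute a noisy image x_t from a clean image x_0 = im, per Eq. (A.2)."""
    alpha_bar = alphas_cumprod[t]
    eps = torch.randn_like(im)  # epsilon ~ N(0, 1)
    return torch.sqrt(alpha_bar) * im + torch.sqrt(1 - alpha_bar) * eps
```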

Below is the test image at noise levels t = [0, 250, 500, 750].

t=0
t=250
t=500
t=750

1.2 Classical Denoising

Classical methods of denoising typically involve applying a Gaussian blur filter to the noisy image. Below we can see this method applied to the noisy images from 1.1 (a sketch of the blur follows the images); however, the results aren't optimal, since the blur removes image detail along with the noise.

t=250
t=500
t=750
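
For reference, a minimal sketch of this classical approach using torchvision (the kernel size and sigma are illustrative choices):

```python
import torchvision.transforms.functional as TF

# blur the noisy image from 1.1; a larger sigma removes more noise
# but also destroys more detail
denoised = TF.gaussian_blur(noisy_image, kernel_size=5, sigma=2.0)
```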

1.3 One-Step Denoising

Using a pretrained diffusion model's denoiser, we can predict the amount of noise that needs to be removed given the timestep $t$ as well as prompt embeddings, for which we use "a high quality photo". Below are the results from passing the images at t = [250, 500, 750] into stage_1.unet and subtracting the predicted noise from the image (i.e., inverting Eq. (A.2) to estimate $x_0$). As we can see, the denoising U-Net does a decent job of removing the noise; however, at later t values the result resembles the original Campanile much less.
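
A minimal sketch of this one-step estimate, assuming the diffusers UNet interface and that the stage-1 UNet stacks its noise and variance predictions along the channel axis (an assumption about the checkpoint used here):

```python
import torch

@torch.no_grad()
def one_step_denoise(xt, t, unet, alphas_cumprod, prompt_embeds):
    # the UNet predicts noise and variance stacked channel-wise;
    # keep only the noise half (first 3 channels for RGB)
    out = unet(xt, t, encoder_hidden_states=prompt_embeds).sample
    eps = out[:, :3]
    # invert Eq. (A.2) to estimate the clean image x_0
    alpha_bar = alphas_cumprod[t]
    return (xt - torch.sqrt(1 - alpha_bar) * eps) / torch.sqrt(alpha_bar)
```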

t=250
t=250
t=500
t=500
t=750
t=750
original

1.4 Iterative Denoising

In this section, rather than using the denoiser to jump directly from t=T to t=0, I implemented an iterative_denoise function that denoises the image in steps, using strided timesteps starting at T=990 and decreasing with a stride of 30. For the results below, I used i_start = 10. To compute the image at the next timestep $t'$, I used the following formula:

$$x_{t'} = \frac{\sqrt{\bar\alpha_{t'}}\,\beta_t}{1 - \bar\alpha_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar\alpha_{t'})}{1 - \bar\alpha_t}\, x_t + v_\sigma \tag{A.3}$$
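
A minimal sketch of one update of this loop, where x0_hat is the one-step estimate from 1.3, $\alpha_t$ is taken as the ratio of cumulative products between consecutive strided timesteps, and the variance term $v_\sigma$ is omitted for brevity:

```python
import torch

@torch.no_grad()
def iterative_denoise_step(xt, x0_hat, t, t_prime, alphas_cumprod):
    # coefficients from Eq. (A.3)
    alpha_bar_t = alphas_cumprod[t]
    alpha_bar_tp = alphas_cumprod[t_prime]
    alpha_t = alpha_bar_t / alpha_bar_tp
    beta_t = 1 - alpha_t
    x0_coef = torch.sqrt(alpha_bar_tp) * beta_t / (1 - alpha_bar_t)
    xt_coef = torch.sqrt(alpha_t) * (1 - alpha_bar_tp) / (1 - alpha_bar_t)
    # the v_sigma variance term from Eq. (A.3) is omitted here
    return x0_coef * x0_hat + xt_coef * xt
```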

Below is every 5th image in the denoising sequence (the strided timesteps run 990, 960, 930, ..., 30, 0; with i_start = 10, denoising begins at t = 690).

t=90
t=240
t=390
t=540
t=690

Below are the results as well as comparisons to the original image, single step denoising, and gaussian blurring.

original
iterative
one step
Gaussian blurring

1.5 Diffusion Model Sampling

In this part we use the iterative_denoise function to generate images from scratch by setting i_start = 0 and passing in random noise.
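
A minimal sketch of this sampling call, assuming the iterative_denoise function from 1.4 and the 64x64 stage-1 resolution (the exact signature is an assumption):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
# pure-noise starting point at the stage-1 resolution
noise = torch.randn(1, 3, 64, 64, device=device)
sample = iterative_denoise(noise, i_start=0)  # signature assumed from 1.4
```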

Below are 5 results using the prompt embedding for "a high quality photo":

1.6 Classifier-Free Guidance (CFG)

The results from the previous part were reasonable but still not spectacular. In this part we improve our results using CFG, which uses conditional and unconditional noise estimates:

$$\epsilon = \epsilon_u + \gamma\, (\epsilon_c - \epsilon_u) \tag{A.4}$$

To get the unconditional noise estimates, we simply pass the embedding of the null prompt "" to the UNet.
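
A minimal sketch of the CFG noise estimate, under the same channel-split UNet assumption as in 1.3:

```python
import torch

@torch.no_grad()
def cfg_noise(xt, t, unet, cond_embeds, uncond_embeds, gamma=7.0):
    # conditional and unconditional noise estimates for Eq. (A.4);
    # uncond_embeds is the embedding of the empty prompt ""
    eps_c = unet(xt, t, encoder_hidden_states=cond_embeds).sample[:, :3]
    eps_u = unet(xt, t, encoder_hidden_states=uncond_embeds).sample[:, :3]
    return eps_u + gamma * (eps_c - eps_u)
```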

Below are 5 images generated with a CFG scale of $\gamma = 7$ using the prompt embedding "a high quality photo".

1.7 Image-to-image Translation

Below are some results using SDEdit with CFG to make edits to existing images. SDEdit works by adding noise to an image up to some midway point of the CFG denoising loop, then denoising from that point onward. Shown are the original images, along with the results from starting at i_start = [1, 3, 5, 7, 10, 20] in the loop with the text prompt "a high quality photo".
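
A minimal sketch of the edit, reusing the forward function from 1.1; iterative_denoise_cfg is a hypothetical name for the CFG denoising loop from 1.6:

```python
# re-noise the original image to the timestep where denoising will begin,
# then run the CFG loop from that midway point
t = strided_timesteps[i_start]
noisy = forward(original_image, t)
edited = iterative_denoise_cfg(noisy, i_start=i_start)
```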

original Campanile

image of a creeper

image of the chroma console

1.7.1 Editing Hand-Drawn and Web Images

Here, I used a procedure very similar to the one above, applying it to web images as well as hand-drawn images.

web image of a pedal board

hand-drawn image of a robot face
hand-drawn image of a burger

1.7.2 Inpainting

In this section I implemented inpainting, where I used the diffusion model to regenerate the section of an image defined by a mask $\mathbf{m}$. This involved modifying the CFG loop so that at each iteration step we keep the diffusion model's output for the current image $x_t$ inside the mask, but replace everything outside the mask with the (appropriately re-noised) original image, as follows:

$$x_t \leftarrow \mathbf{m}\, x_t + (1 - \mathbf{m})\, \text{forward}(x_{orig}, t) \tag{A.5}$$
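
A minimal sketch of the modified loop; denoise_step_cfg is a hypothetical helper for one CFG denoising step from 1.6:

```python
for i in range(i_start, len(strided_timesteps) - 1):
    t = strided_timesteps[i]
    xt = denoise_step_cfg(xt, t)  # one CFG update as in 1.6
    # Eq. (A.5): outside the mask, force pixels back to the re-noised
    # original image so only the masked region is regenerated
    xt = mask * xt + (1 - mask) * forward(x_orig, t)
```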

Below are some results of editing some images with the corresponding masks.

Original:

Mask:

Result:

1.7.3 Text-Conditional Image-to-image Translation

In this section, I used a procedure similar to SDEdit, but with a prompt embedding other than "a high quality photo". Instead, I used "a rocket ship", which guided the model to generate images resembling a rocket ship. Below are some results using 3 different images.

1.8 Visual Anagrams

A visual anagram is an image that looks like one subject when viewed normally, but like another subject when rearranged (in this case, flipped vertically). The images are constructed by calculating the noise estimate as follows, then using that estimate in the original CFG method:

$$
\begin{aligned}
\epsilon_1 &= \text{UNet}(x_t, t, p_1) \\
\epsilon_2 &= \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2)) \\
\epsilon &= (\epsilon_1 + \epsilon_2) / 2
\end{aligned}
$$
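
A minimal sketch of this combined estimate, again under the channel-split UNet assumption from 1.3:

```python
import torch

@torch.no_grad()
def anagram_noise(xt, t, unet, p1, p2):
    # noise estimate for the upright image under prompt p1
    eps1 = unet(xt, t, encoder_hidden_states=p1).sample[:, :3]
    # flip the image, estimate noise under prompt p2, flip the estimate back
    flipped = torch.flip(xt, dims=[-2])  # flip along the height axis
    eps2 = torch.flip(
        unet(flipped, t, encoder_hidden_states=p2).sample[:, :3], dims=[-2]
    )
    return (eps1 + eps2) / 2
```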

Below are some results using the corresponding pairs of prompt embeddings.

Anagram for ("an oil painting of an old man" , "an oil painting of people around a campfire")

Anagram for ("a photo of a hipster barista","a lithograph of a skull")

Anagram for ("a photo of the amalfi cost","a photo of a dog")

1.9 Hybrid Images

A hybrid image is an image that appears as one subject up close and another from far away. These images are constructed in a similar manner to the visual anagrams, but instead we take the noise estimates from two prompt embeddings and apply a low-pass and a high-pass filter to them using a Gaussian blur. We then calculate the noise as follows:

$$
\begin{aligned}
\epsilon_1 &= \text{UNet}(x_t, t, p_1) \\
\epsilon_2 &= \text{UNet}(x_t, t, p_2) \\
\epsilon &= f_\text{lowpass}(\epsilon_1) + f_\text{highpass}(\epsilon_2)
\end{aligned}
$$
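
A minimal sketch, implementing the high-pass as the residual of a Gaussian blur (the kernel size and sigma are illustrative choices, not necessarily the ones used here):

```python
import torch
import torchvision.transforms.functional as TF

@torch.no_grad()
def hybrid_noise(xt, t, unet, p1, p2, kernel_size=33, sigma=2.0):
    eps1 = unet(xt, t, encoder_hidden_states=p1).sample[:, :3]
    eps2 = unet(xt, t, encoder_hidden_states=p2).sample[:, :3]
    # low-pass via Gaussian blur; high-pass as the residual of the blur
    low = TF.gaussian_blur(eps1, kernel_size=kernel_size, sigma=sigma)
    high = eps2 - TF.gaussian_blur(eps2, kernel_size=kernel_size, sigma=sigma)
    return low + high
```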

Below are some results using the corresponding pairs of prompt embeddings.

Hybrid image for ("a lithograph of a skull", "a lithograph of waterfalls")

Hybrid image for ("a rocket ship", "an oil painting of an old man")

Hybrid image for ("a pencil" , "a rocket ship")

Other results: