Project 5
Part A
Part 0
In this first part, we played around with DeepFloyd and generated some images with the already-trained model. Below we can see that fewer inference steps in the first stage result in more random images (less coherent subject matter), while fewer inference steps in the second stage result in noisier upscaling (the subject matter is preserved, but everything looks much more grainy). As for the overall trend, more inference steps improve the quality of the images slightly, especially for the "man wearing a hat" prompt, which appears much more realistic and defined.
Below are the generated images with their corresponding prompts above each image. The smaller images are from stage 1, and the larger images are the upscaled versions from stage 2. I used varying num_inference_steps for the stages, but for all of them I used the seed YOUR_SEED = 180.
Part 1: Sampling Loops
In this part, I wrote some sampling loops using the pretrained DeepFloyd denoisers. This essentially means starting with a clean image, iteratively adding noise in steps, and then using the denoisers to predict how much noise to remove at each step, iteratively denoising the image back to a clean image.
1.1 Implementing the Forward Process
In this section I implement a forward(im, t) function which computes a noisy image x_t given a clean image x_0 and a timestep t, defined as:

q(x_t | x_0) = N(√(ᾱ_t) x_0, (1 − ᾱ_t) I)

which is equivalent to:

x_t = √(ᾱ_t) x_0 + √(1 − ᾱ_t) ε, where ε ~ N(0, 1)

Below is the test image at noise levels t = [0, 250, 500, 750].
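The forward process above can be sketched in a few lines of NumPy. Note the linear beta schedule here is a hypothetical stand-in; the real ᾱ values come from DeepFloyd's scheduler (stage_1.scheduler.alphas_cumprod):

```python
import numpy as np

# Hypothetical linear beta schedule; the real alpha_bar values come from
# stage_1.scheduler.alphas_cumprod in DeepFloyd.
T = 1000
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))

def forward(im, t, rng=np.random.default_rng(180)):
    """x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps,  eps ~ N(0, I)."""
    eps = rng.standard_normal(im.shape)
    return np.sqrt(alpha_bar[t]) * im + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = np.zeros((64, 64, 3))   # stand-in for the Campanile test image
noisy = forward(x0, 750)
```

At large t the √(ᾱ_t) coefficient is tiny, so the output is almost pure noise, matching the progression in the images below.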
1.2 Classical Denoising
Classical methods of denoising typically involve applying a Gaussian blur filter to the noisy image. Below we can see this method applied to the noisy images from 1.1; however, the results are far from optimal, since blurring removes detail along with the noise.
1.3 One-Step Denoising
Using a pretrained diffusion model's denoiser, we can predict the amount of noise that needs to be removed given the noisy image x_t and the timestep t, as well as a prompt embedding, for which we use "a high quality photo". Below are the results from passing the images at t = [250, 500, 750] into stage_1.unet and subtracting the predicted noise from the image. As we can see, the denoising U-Net does a decent job of removing the noise; however, larger t values result in a tower that resembles the original Campanile much less.
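As a sanity check on the algebra (not the model), here is a NumPy sketch with a hypothetical ᾱ schedule. With a perfect noise estimate, a single step recovers the clean image exactly; the real estimate from stage_1.unet is only approximate, which is why quality degrades at large t:

```python
import numpy as np

# Hypothetical schedule; real values: stage_1.scheduler.alphas_cumprod.
T = 1000
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))

def one_step_denoise(x_t, eps_hat, t):
    """Invert the forward process in one jump:
    x0_hat = (x_t - sqrt(1 - abar_t) * eps_hat) / sqrt(abar_t)."""
    return (x_t - np.sqrt(1 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])

rng = np.random.default_rng(180)
x0 = rng.uniform(size=(8, 8))          # stand-in clean image
t = 500
eps = rng.standard_normal(x0.shape)
x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps

x0_hat = one_step_denoise(x_t, eps, t)  # perfect eps -> exact recovery
```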
1.4 Iterative Denoising
Deliverables, using i_start = 10:
- Complete the iterative_denoise function
- Create strided_timesteps: a list of monotonically decreasing timesteps, starting at 990, with a stride of 30, eventually reaching 0. Also initialize the timesteps using stage_1.scheduler.set_timesteps(timesteps=strided_timesteps)
- Show the noisy image at every 5th loop of denoising (it should gradually become less noisy)
- Show the final predicted clean image, using iterative denoising
- Show the predicted clean image using only a single denoising step, as in the previous part (this should look much worse)
- Show the predicted clean image using Gaussian blurring, as in part 1.2
In this section, rather than using the denoiser to jump directly from t = T to t = 0, I implemented an iterative_denoise function to denoise the image in steps, in this case starting with T = 990 and decreasing with a stride of 30. To calculate the image at the next timestep t' < t, I used the following formula:

x_{t'} = (√(ᾱ_{t'}) β_t / (1 − ᾱ_t)) x̂_0 + (√(α_t) (1 − ᾱ_{t'}) / (1 − ᾱ_t)) x_t + v_σ

where α_t = ᾱ_t / ᾱ_{t'}, β_t = 1 − α_t, x̂_0 is the current one-step estimate of the clean image, and v_σ is random noise added at each step.

Below is every 5th image in the sequence going from t = 990, 960, 930, ..., 30, 0.
Below are the results as well as comparisons to the original image, single step denoising, and gaussian blurring.
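The loop structure can be sketched as follows. The beta schedule and the zero-returning noise predictor are stand-ins so the sketch is self-contained; in the project the estimate comes from stage_1.unet, and the v_σ noise term is omitted here for brevity:

```python
import numpy as np

# Hypothetical linear beta schedule; real values come from
# stage_1.scheduler.alphas_cumprod.
T = 1000
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))
strided_timesteps = list(range(990, -1, -30))   # 990, 960, ..., 30, 0

def predict_noise(x_t, t):
    # Stand-in for the stage_1.unet call; the real model predicts
    # the noise eps contained in x_t.
    return np.zeros_like(x_t)

def iterative_denoise(x, i_start=0):
    for i in range(i_start, len(strided_timesteps) - 1):
        t, t_next = strided_timesteps[i], strided_timesteps[i + 1]
        ab_t, ab_next = alpha_bar[t], alpha_bar[t_next]
        alpha = ab_t / ab_next          # alpha_t = abar_t / abar_t'
        beta = 1.0 - alpha
        # One-step estimate of the clean image from the current sample:
        x0_hat = (x - np.sqrt(1 - ab_t) * predict_noise(x, t)) / np.sqrt(ab_t)
        # Interpolate between the clean estimate and the current sample
        # (the extra v_sigma noise term is omitted in this sketch):
        x = (np.sqrt(ab_next) * beta / (1 - ab_t)) * x0_hat \
            + (np.sqrt(alpha) * (1 - ab_next) / (1 - ab_t)) * x
    return x

x = np.random.default_rng(180).standard_normal((64, 64))  # pure noise
clean = iterative_denoise(x)
```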
1.5 Diffusion Model Sampling
In this part we use the iterative_denoise function to generate images from scratch by setting i_start = 0 and passing in random noise. Below are 5 results using the prompt embedding for "a high quality photo":
1.6 Classifier-Free Guidance (CFG)
Deliverables:
- Complete the iterative_denoise_cfg function
- Show 5 images of "a high quality photo" with a CFG scale of γ = 7
The results from the previous part were reasonable but still not spectacular. In this part we improve our results using CFG, which combines a conditional and an unconditional noise estimate:

ε = ε_u + γ (ε_c − ε_u)

To get the unconditional noise estimate, we simply pass in the null prompt embedding for "" to the U-Net. Below are 5 images generated with a CFG scale of γ = 7 and the prompt embedding "a high quality photo".
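The guidance combination itself is one line; a minimal sketch (with random tensors standing in for the two U-Net outputs):

```python
import numpy as np

def cfg_noise(eps_uncond, eps_cond, gamma=7.0):
    """Classifier-free guidance: extrapolate from the unconditional
    estimate toward (and past) the conditional one:
    eps = eps_u + gamma * (eps_c - eps_u)."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_u, eps_c = rng.standard_normal((2, 4, 4))  # stand-ins for U-Net outputs
eps = cfg_noise(eps_u, eps_c, gamma=7.0)
```

Setting γ = 1 recovers the plain conditional estimate and γ = 0 the unconditional one; γ > 1 pushes the sample harder toward the prompt, which is what sharpens the results below.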
1.7 Image-to-image Translation
Below are some results using SDEdit with CFG to make edits to existing images. SDEdit works by adding noise to an image up to some midway point in the CFG iteration loop, then denoising starting from that midway point. Below are the original images, as well as the results from starting at i_start = [1, 3, 5, 7, 10, 20] in the iteration loop with the text prompt "a high quality photo".
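The "midway point" setup can be sketched as below; the schedule is a hypothetical stand-in, and the output would then be handed to the CFG denoising loop (iterative_denoise_cfg in this project):

```python
import numpy as np

# Hypothetical schedule; real values come from stage_1.scheduler.
T = 1000
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))
strided_timesteps = list(range(990, -1, -30))

def sdedit_start(im, i_start, rng):
    """Noise the image to strided_timesteps[i_start]; denoising then
    resumes from that index instead of from pure noise."""
    t = strided_timesteps[i_start]
    eps = rng.standard_normal(im.shape)
    return np.sqrt(alpha_bar[t]) * im + np.sqrt(1 - alpha_bar[t]) * eps

rng = np.random.default_rng(180)
im = np.zeros((64, 64))                  # stand-in for the input photo
slightly_noised = sdedit_start(im, i_start=20, rng=rng)
very_noised = sdedit_start(im, i_start=1, rng=rng)
# Larger i_start -> smaller t -> less added noise, so the edit stays
# closer to the original image, matching the progression below.
```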
1.7.1 Editing Hand-Drawn and Web Images
Here, I used a procedure very similar to the one above, applied to web images as well as hand-drawn images.
1.7.2 Inpainting
In this section I implemented inpainting, where I used the diffusion model to regenerate the section of the image defined by a mask m. This involved modifying the CFG loop so that at each iteration step, we replace the region of the current image x_t outside the mask with a noised copy of the original image, keeping the diffusion model's output inside the mask, as such:

x_t ← m x_t + (1 − m) forward(x_orig, t)
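The per-step mask compositing can be sketched like this (hypothetical schedule, toy mask):

```python
import numpy as np

# Hypothetical schedule; real values: stage_1.scheduler.alphas_cumprod.
T = 1000
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))

def forward(im, t, rng):
    eps = rng.standard_normal(im.shape)
    return np.sqrt(alpha_bar[t]) * im + np.sqrt(1 - alpha_bar[t]) * eps

def inpaint_step(x_t, x_orig, mask, t, rng):
    """Keep the diffusion sample inside the mask; outside it, force the
    pixels back to a noised copy of the original:
    x_t <- m * x_t + (1 - m) * forward(x_orig, t)."""
    return mask * x_t + (1 - mask) * forward(x_orig, t, rng)

rng = np.random.default_rng(180)
x_orig = np.ones((8, 8))                       # stand-in original image
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0                           # edit only the center square
x_before = rng.standard_normal((8, 8))         # current diffusion sample
x_after = inpaint_step(x_before, x_orig, mask, t=500, rng=rng)
```

Inside the mask the diffusion sample passes through untouched; outside it the image is pinned to the original at the appropriate noise level, so only the masked region ever changes.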
Below are some results of editing some images with the corresponding masks.
Original:
Mask:
Result:
1.7.3 Text-Conditional Image-to-image Translation
In this section, I used a procedure similar to SDEdit, but with a different prompt embedding than "a high quality photo". Instead, I used "a rocket ship", which guided the model to generate images that resembled a rocket ship. Below are some results using 3 different images.
1.8 Visual Anagrams
A visual anagram is an image that looks like one subject normally, but looks like another subject when rearranged (in this case flipped upside down). The images are constructed by calculating the noise estimate as

ε₁ = UNet(x_t, t, p₁), ε₂ = flip(UNet(flip(x_t), t, p₂)), ε = (ε₁ + ε₂) / 2

and then using that noise in the original CFG method:
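The flip-average combination can be sketched as below; the identity function stands in for the U-Net calls purely to exercise the flipping logic:

```python
import numpy as np

def anagram_noise(x_t, predict_noise, emb1, emb2):
    """eps1 = UNet(x_t, emb1); eps2 = UNet(flip(x_t), emb2);
    combined estimate: eps = (eps1 + flip(eps2)) / 2."""
    eps1 = predict_noise(x_t, emb1)
    eps2 = predict_noise(np.flipud(x_t), emb2)
    return (eps1 + np.flipud(eps2)) / 2

# Identity stand-in for the real CFG U-Net call, just to check shapes
# and the flip bookkeeping.
identity = lambda x, emb: x
x_t = np.random.default_rng(0).standard_normal((8, 8))
eps = anagram_noise(x_t, identity, "old man", "campfire")
```

The second estimate is computed on the flipped image and flipped back before averaging, so each prompt steers the image in its own orientation.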
Below are some results using the corresponding pairs of prompt embeddings.
Anagram for ("an oil painting of an old man", "an oil painting of people around a campfire")
Anagram for ("a photo of a hipster barista","a lithograph of a skull")
Anagram for ("a photo of the amalfi cost","a photo of a dog")
1.9 Hybrid Images
A hybrid image is an image that appears as one subject up close, and another far away. These images are constructed in a similar manner as the visual anagrams, but instead, we take the noise estimates from two prompt embeddings and apply a low-pass and a high-pass filter to them using a Gaussian blur. We then calculate the noise as such:

ε = lowpass(ε₁) + highpass(ε₂)
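A sketch of the filter combination, with random tensors standing in for the two U-Net noise estimates and a hand-rolled Gaussian low-pass (the sigma value is illustrative):

```python
import numpy as np

def gaussian_kernel(sigma):
    r = int(3 * sigma)
    x = np.arange(-r, r + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def lowpass(im, sigma=2.0):
    """Separable Gaussian blur (columns, then rows)."""
    k = gaussian_kernel(sigma)
    out = np.apply_along_axis(lambda v: np.convolve(v, k, mode='same'), 0, im)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode='same'), 1, out)

def hybrid_noise(eps1, eps2, sigma=2.0):
    """eps = lowpass(eps_1) + highpass(eps_2), where highpass is the
    residual: highpass(x) = x - lowpass(x)."""
    return lowpass(eps1, sigma) + (eps2 - lowpass(eps2, sigma))

rng = np.random.default_rng(0)
eps1, eps2 = rng.standard_normal((2, 16, 16))  # stand-in U-Net estimates
eps = hybrid_noise(eps1, eps2)
```

The low frequencies (visible from far away) follow the first prompt, while the high frequencies (visible up close) follow the second.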
Below are some results using the corresponding pairs of prompt embeddings.
Hybrid image for ("a lithograph of a skull", "a lithograph of waterfalls")
Hybrid image for ("a rocket ship", "an oil painting of an old man")
Hybrid image for ("a pencil", "a rocket ship")
Other results: