Casual GAN Papers

🔥 Popular deep learning & GAN papers explained casually!
📚 Main ideas & insights from papers to stay up to date with research trends
⭐️ New posts every Tue and Fri
⏰ Reading time <10 minutes
patreon.com/casual_gan
Admin/Ads: @KirillDemochkin


#87: "Hierarchical Text-Conditional Image Generation with CLIP Latents" by Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu et al. explained in 5 minutes As always, here is the visual summary! (All figures are taken from the original paper) *** Tip Casual GAN Papers on KoFi to help this community grow!
[Attachment: dalle2.jpg, 7.12 MB]
#87: "Hierarchical Text-Conditional Image Generation with CLIP Latents" by Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu et al. explained in 5 minutes Does this paper even need an introduction? This is DALL-E 2 and if you have somehow not heard of it yet, sit down. Your mind is about to be blown. Enough said. Let’s dive in, shall we? https://www.casualganpapers.com/sota-text-to-image-openai-clip-dalle/DALL-E-2-explained.html
Hey everyone,

It has been a while! I was busy with a NeurIPS submission 👀 but I think I am back now. Lots of papers to cover, lots of posts to write, lots of people to meet! Let's start this off with a bang: DALL-E 2 💥 Surprisingly well written, less wordy than StyleGAN 3, and an elegant idea, as long as you pretend that the diffusion math doesn't exist 🤡

Best,
-Kirill
#86.4: "High-Resolution Image Synthesis with Latent Diffusion Models” by Robin Rombach, Andreas Blattmann et al. As always, here is the visual summary! (All figures are taken from the original paper) *** Tip Casual GAN Papers on KoFi to help this community grow!
[Attachment: CGP - LDMs.jpg, 7.14 MB]
#86.0: "High-Resolution Image Synthesis with Latent Diffusion Models” by Robin Rombach, Andreas Blattmann et al. Hi everyone! We are almost all caught up with diffusion and text-to-image goodness! Best, -Kirill
[Attachment: results-2.gif.mp4, 2.70 KB]
#86.3: "High-Resolution Image Synthesis with Latent Diffusion Models” by Robin Rombach, Andreas Blattmann et al. Continued from the message above☝️ 📈Experiment insights / Key takeaways: - Baselines: LSGM, ADM, StyleGAN, ProjectedGAN, DC-VAE - Datasets: ImageNet, CelebA-HQ, LSUN-Churches, LSUN-Bedrooms, MS-COCO - Metrics: FID, Perception-Recall - Qualitative: x4-x8 compression is the sweet point for ImageNet - Quantitative: LDMS > LSGM, new SOTA FID on CelebA-HQ - 5.11, all scores (with models with 1/2 model size and 1/4 compute) are better (vs other diffusion models) except for LSUN-Bedrooms, where ADM is better - Additional: the model can get up to 1024x1024, can be used for inpainting, super-resolution, and semantic synthesis. There are a lot of details about the experiments, but that is the 5-minute gist. 🛠 Possible Improvements: - LDMs still much slower than GANs - Pixel-perfect accuracy is a bottleneck for LDMs in certain tasks (Which ones?). ✏️My Notes: - (Naming: 3.5/5) The name “LDM” is as straightforward as the problem that the paper that is discussed in the paper. It is an easy-to-pronounce acronym, but not a word and definitely not a meme. - (Reader Experience - RX: 3/5) Right away, kudos for explicitly listing all of the core contributions of the paper right where they belong - at the end of the introduction. I am going to duck a point for visually-inconsistent figures. They are all over the place. Moreover, the small font size in the tables is very hard to read, especially with how packed the tables appear. Finally, why are the images so tiny? Can you even make out what is on Figure 8? What is the purpose of putting in figures that you can’t read? It would probably be better to cut one or two out to make the rest more readable. Finally, the results table is very hard to read, because different baselines in different order are used for different datasets. - I can’t help but draw parallels between Latent Diffusion and StyleNeRF papers - sandwiching an expensive operation (Diffusion & Ray Marching) between a convolutional encoder-decoder to reduce computational costs and memory requirements by performing the operation in spatially-condensed latent space. - Let’s think for a second: what other ideas from DNR & styleNeRF could further improve diffusion models ? One idea I can see being useful is the “NeRF path regularization”, which means, in terms of DMs, training a low-resolution DM alongside a high-resolution LDM, and adding a loss that matches subsampled pixels of the LDM to the pixels in the DM - It should be possible to interpolate between codes in the learned latent space. Not sure how exactly this could be used, but it is probably worth looking into 🔗Links: Paper / Code 👋 Thanks for reading! If you found this paper digest useful, subscribe and share this post to support Casual GAN Papers! - Tip Casual GAN Papers on KoFi to help this community grow! - Join telegram chat / discord - Visit the CGP web blog ! - Follow on Twitter - Visit the library By: @casual_gan P.S. DM me papers to cover! @KirillDemochkin
#86.1: "High-Resolution Image Synthesis with Latent Diffusion Models” by Robin Rombach, Andreas Blattmann et al. 🔑 Keywords: #faster_diffusion_models #no_CLIP_reranking #classifier_free_guidance 🎯 At a glance: One of the cleanest pitches for a paper I have seen: diffusion models are way too expensive to train in terms of memory, time and compute, therefore let’s make them lighter, faster, and cheaper. As for the details, let’s dive in, shall we? ⭐️ Paper difficulty: 🌕🌕🌕🌑🌑 ⌛️ Prerequisites: (Highly recommended reading to understand the core contributions of this paper): 1) Diffusion Models (ADM) 2) VQGAN 🚀 Motivation: Diffusion models (DMs) have a more stable training phase than GANs and less parameters than autoregressive models, yet they are just really resource intensive. The most powerful DMs require up to a 1000 V100 days to train (that’s a lot of $$$ for compute) and about a day per 1000 inference samples. The authors of Latent Diffusion Models (LDMs) pinpoint this problem to the high dimensionality of the pixel space, in which the diffusion process occurs and propose to perform it in a more compact latent space instead. In short, they achieve this feat by pertaining an autoencoder model that learns an efficient compact latent space that is perceptually equivalent to the pixel space. A DM sandwiched between the convolutional encoder-decoder is then trained inside the latent space in a more computationally-efficient way. In other words, this is a VQGAN with a DM instead of a transformer (and without a discriminator). Post continues in the next message 👇 By: @casual_gan P.S. Want to promote your paper? Contact me! @KirillDemochkin
#86.2: "High-Resolution Image Synthesis with Latent Diffusion Models” by Robin Rombach, Andreas Blattmann et al. Continued from the message above ☝️ 🔍 Main Ideas: 1. Perceptual Image Compression: Authors train an autoencoder that outputs a tensor of latent codes. This latent embedding is regularized with vector quantization within the decoder. This is a slight but important change from VGQAN that means the underlying diffusion model works with continuous latent codes and the quantization happens afterwards. 2. Latent Diffusion Models: As the second part of the two-stage training approach a diffusion model is trained inside the learned latent space of the autoencoder. I won’t go into details about how the diffusion itself works as I have covered it before in a previous post. What you need to know here is that the denoising model is a UNet that predicts the noise that was added to the latent codes in the previous step of the diffusion process. 3. Conditioning Mechanisms: Authors utilize domain-specific encoders and cross-attention layers to control the generative model with additional information. The conditions of various modalities such as text, are passed through their own encoders. The results get incorporated in the generative process via cross attention with flattened features from the intermediate layers of the UNet. Post continues in the next message 👇 By: @casual_gan P.S. DM me papers to cover! @KirillDemochkin
#85.4: "GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models” by Alex Nichol et al. As always, here is the visual summary! (All figures are taken from the original paper) *** Tip Casual GAN Papers on KoFi to help this community grow!
[Attachment: CGP - GLIDE.jpg, 5.91 MB]
#85.1: "GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models” by Alex Nichol et al. 🔑 Keywords: #faster_diffusion_models #no_CLIP_reranking #classifier_free_guidance 🎯 At a glance: “Diffusion models beat GANs”. While true, the statement comes with several ifs and buts, not to say that the math behind diffusion models is not for the faint of heart. Alas, GLIDE, an OpenAI paper from last December took a big step towards making it true in every sense. Specifically, it introduced a new guidance method for diffusion models that produces higher quality images than even DALL-E, which uses expensive CLIP reranking. And if that wasn’t impressive enough, GLIDE models can be fine-tuned for various downstream tasks such a inpainting and and text-based editing. As for the details, let’s dive in, shall we? ⭐️ Paper difficulty: 🌕🌕🌕🌕🌑 ⌛️ Prerequisites: (Highly recommended reading to understand the core contributions of this paper): 1) Diffusion Models (ADM) 2) CLIP 🚀 Motivation: It used to be with diffusion models that you could boost quality at the cost of some diversity with the classifier guidance technique. However, vanilla classifier guidance requires a pertained classifier that outputs class labels, which is not very suitable for text. Recently though, a new classifier-free guidance approach was introduced. It came with two advantages: the model uses its own knowledge for guidance instead of relying on an external classifier, and it greatly simplifies guidance, when it isn’t possible to directly predict a label, which is should sound familiar for fans of text-to-image models. Post continues in the next message 👇 By: @casual_gan P.S. Want a post about your paper? Contact me! @KirillDemochkin