

Casual GAN Papers

🔥 Popular deep learning & GAN papers explained casually!
📚 Main ideas & insights from papers to stay up to date with research trends
⭐️ New posts every Tue and Fri
⏰ Reading time <10 minutes
patreon.com/casual_gan
Admin/Ads: @KirillDemochkin

#87: "Hierarchical Text-Conditional Image Generation with CLIP Latents" by Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu et al. explained in 5 minutes As always, here is the visual summary! (All figures are taken from the original paper) *** Tip Casual GAN Papers on KoFi to help this community grow!
Attachment: dalle2.jpg (7.12 MB)
#87: "Hierarchical Text-Conditional Image Generation with CLIP Latents" by Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu et al. explained in 5 minutes Does this paper even need an introduction? This is DALL-E 2 and if you have somehow not heard of it yet, sit down. Your mind is about to be blown. Enough said. Let’s dive in, shall we? https://www.casualganpapers.com/sota-text-to-image-openai-clip-dalle/DALL-E-2-explained.html
Hey everyone,

It has been a while! I was busy with a NeurIPS submission 👀 but I think I am back now. Lots of papers to cover, lots of posts to write, lots of people to meet!

Let's start this off with the bang that is DALL-E 2 💥 Surprisingly well written, less wordy than StyleGAN 3, and an elegant idea as long as you pretend that the diffusion math doesn't exist 🤡

Best,
-Kirill
#86.4: "High-Resolution Image Synthesis with Latent Diffusion Models” by Robin Rombach, Andreas Blattmann et al. As always, here is the visual summary! (All figures are taken from the original paper) *** Tip Casual GAN Papers on KoFi to help this community grow!
Attachment: CGP - LDMs.jpg (7.14 MB)
#86.0: "High-Resolution Image Synthesis with Latent Diffusion Models” by Robin Rombach, Andreas Blattmann et al. Hi everyone! We are almost all caught up with diffusion and text-to-image goodness! Best, -Kirill
Attachment: results-2.gif.mp4 (2.70 KB)
#86.1: "High-Resolution Image Synthesis with Latent Diffusion Models” by Robin Rombach, Andreas Blattmann et al. 🔑 Keywords: #faster_diffusion_models #no_CLIP_reranking #classifier_free_guidance 🎯 At a glance: One of the cleanest pitches for a paper I have seen: diffusion models are way too expensive to train in terms of memory, time and compute, therefore let’s make them lighter, faster, and cheaper. As for the details, let’s dive in, shall we? ⭐️ Paper difficulty: 🌕🌕🌕🌑🌑 ⌛️ Prerequisites: (Highly recommended reading to understand the core contributions of this paper): 1) Diffusion Models (ADM) 2) VQGAN 🚀 Motivation: Diffusion models (DMs) have a more stable training phase than GANs and less parameters than autoregressive models, yet they are just really resource intensive. The most powerful DMs require up to a 1000 V100 days to train (that’s a lot of $$$ for compute) and about a day per 1000 inference samples. The authors of Latent Diffusion Models (LDMs) pinpoint this problem to the high dimensionality of the pixel space, in which the diffusion process occurs and propose to perform it in a more compact latent space instead. In short, they achieve this feat by pertaining an autoencoder model that learns an efficient compact latent space that is perceptually equivalent to the pixel space. A DM sandwiched between the convolutional encoder-decoder is then trained inside the latent space in a more computationally-efficient way. In other words, this is a VQGAN with a DM instead of a transformer (and without a discriminator). Post continues in the next message 👇 By: @casual_gan P.S. Want to promote your paper? Contact me! @KirillDemochkin
#86.3: "High-Resolution Image Synthesis with Latent Diffusion Models” by Robin Rombach, Andreas Blattmann et al. Continued from the message above☝️ 📈Experiment insights / Key takeaways: - Baselines: LSGM, ADM, StyleGAN, ProjectedGAN, DC-VAE - Datasets: ImageNet, CelebA-HQ, LSUN-Churches, LSUN-Bedrooms, MS-COCO - Metrics: FID, Perception-Recall - Qualitative: x4-x8 compression is the sweet point for ImageNet - Quantitative: LDMS > LSGM, new SOTA FID on CelebA-HQ - 5.11, all scores (with models with 1/2 model size and 1/4 compute) are better (vs other diffusion models) except for LSUN-Bedrooms, where ADM is better - Additional: the model can get up to 1024x1024, can be used for inpainting, super-resolution, and semantic synthesis. There are a lot of details about the experiments, but that is the 5-minute gist. 🛠 Possible Improvements: - LDMs still much slower than GANs - Pixel-perfect accuracy is a bottleneck for LDMs in certain tasks (Which ones?). ✏️My Notes: - (Naming: 3.5/5) The name “LDM” is as straightforward as the problem that the paper that is discussed in the paper. It is an easy-to-pronounce acronym, but not a word and definitely not a meme. - (Reader Experience - RX: 3/5) Right away, kudos for explicitly listing all of the core contributions of the paper right where they belong - at the end of the introduction. I am going to duck a point for visually-inconsistent figures. They are all over the place. Moreover, the small font size in the tables is very hard to read, especially with how packed the tables appear. Finally, why are the images so tiny? Can you even make out what is on Figure 8? What is the purpose of putting in figures that you can’t read? It would probably be better to cut one or two out to make the rest more readable. Finally, the results table is very hard to read, because different baselines in different order are used for different datasets. - I can’t help but draw parallels between Latent Diffusion and StyleNeRF papers - sandwiching an expensive operation (Diffusion & Ray Marching) between a convolutional encoder-decoder to reduce computational costs and memory requirements by performing the operation in spatially-condensed latent space. - Let’s think for a second: what other ideas from DNR & styleNeRF could further improve diffusion models ? One idea I can see being useful is the “NeRF path regularization”, which means, in terms of DMs, training a low-resolution DM alongside a high-resolution LDM, and adding a loss that matches subsampled pixels of the LDM to the pixels in the DM - It should be possible to interpolate between codes in the learned latent space. Not sure how exactly this could be used, but it is probably worth looking into 🔗Links: Paper / Code 👋 Thanks for reading! If you found this paper digest useful, subscribe and share this post to support Casual GAN Papers! - Tip Casual GAN Papers on KoFi to help this community grow! - Join telegram chat / discord - Visit the CGP web blog ! - Follow on Twitter - Visit the library By: @casual_gan P.S. DM me papers to cover! @KirillDemochkin
#86.2: "High-Resolution Image Synthesis with Latent Diffusion Models” by Robin Rombach, Andreas Blattmann et al. Continued from the message above ☝️ 🔍 Main Ideas: 1. Perceptual Image Compression: Authors train an autoencoder that outputs a tensor of latent codes. This latent embedding is regularized with vector quantization within the decoder. This is a slight but important change from VGQAN that means the underlying diffusion model works with continuous latent codes and the quantization happens afterwards. 2. Latent Diffusion Models: As the second part of the two-stage training approach a diffusion model is trained inside the learned latent space of the autoencoder. I won’t go into details about how the diffusion itself works as I have covered it before in a previous post. What you need to know here is that the denoising model is a UNet that predicts the noise that was added to the latent codes in the previous step of the diffusion process. 3. Conditioning Mechanisms: Authors utilize domain-specific encoders and cross-attention layers to control the generative model with additional information. The conditions of various modalities such as text, are passed through their own encoders. The results get incorporated in the generative process via cross attention with flattened features from the intermediate layers of the UNet. Post continues in the next message 👇 By: @casual_gan P.S. DM me papers to cover! @KirillDemochkin
#85.4: "GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models” by Alex Nichol et al. As always, here is the visual summary! (All figures are taken from the original paper) *** Tip Casual GAN Papers on KoFi to help this community grow!
Attachment: CGP - GLIDE.jpg (5.91 MB)
#85.3: "GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models” by Alex Nichol et al. Continued from the message above☝️ 📈Experiment insights / Key takeaways: - Baselines: DALL-E, LAFITE, XMC-GAN (second best), DF-GAN, DM-GAN, AttnGAN - Datasets: MS-COCO - Metrics: Human perception, CLIP score, FID, Precision-Recall - Qualitative: Classifier-free guided samples look visually more appealing than CLIP-guided images. GLIDE has compositional and object-centric properties. - Quantitative: Classifier-free guidance is nearly Paretto optimal in terms of FID vs IS, Precision vs Recall, and CLIP score vs FID. The takeaway is that CLIP-guidance finds adversarial samples for CLIP instead of the most realistic ones. 🛠 Possible Improvements: - From the model card: “Despite the dataset filtering applied before training, GLIDE (filtered) continues to exhibit biases that extend beyond those found in images of people.” - Unrealistic and out-of-distribution prompts are not handled well, meaning that GLIDE samples are limited by the concepts present in the training data. ✏️My Notes: - (Naming: 4/5) Memorable but not a meme. - (Reader Experience - RX: 3/5) While the samples are presented in a very clean and consistent manner (except for figure 5 that does not fit on the screen, which is an issue because the models are arranged row-wise and you will need to scroll back and forth to compare samples across models), the strange order and naming of the paper section and lack of an architecture overview figure threw me for a loop. Moreover, the structure of the paper is quite unorthodox as most of the information about the proposed method is actually hidden in the background section, not in the typical “The Proposed Method” section, which is simply called “Training” here, and contains configuration details I would expect to see in the beginning of the “Experiments” section. - Classifier-free guidance reminds me of the good ol’ truncation trick from StyleGAN - Props to the authors for citing Katherine Crowson - TBH I wonder, how the heck does 64x64 CLIP even work? I don’t think I could compare images to captions at that resolution with my eyes not to even mention a model - Not sure how I feel about the whole “this model is not safe, hence we won’t release it” narrative that OpenAI is trying to spin since they clearly intend to monetize these huge AI models. 🔗Links: Paper / Code 👋 Thanks for reading! If you found this paper digest useful, subscribe and share this post to support Casual GAN Papers! - Tip Casual GAN Papers on KoFi to help this community grow! - Join telegram chat / discord - Visit the CGP web blog ! - Follow on Twitter - Visit the library By: @casual_gan P.S. DM me papers to cover! @KirillDemochkin