Eldor’s AI Lab


🚀 Eldor’s AI Lab: learn artificial intelligence in depth and hands-on! 🔹 AI and ML theory 🔹 Code and practical exercises 🔹 Programming tips 🔹 Research papers and the latest news 💡 Want to learn AI? Let's go!

376 subscribers
24 hours: no data · 7 days: +1 · 30 days: +3
Posts Archive
📌 Lesson 8.2: Activation Functions, the neural network's "decision"
🎯 Deep Learning Mathematics @EldorML

Question: if I add 100 layers, does the model get stronger?
Answer: without activations, NO. Let's see why.

🔹 Core logic
If we build 2 layers using only W·x + b:
• z1 = W1·x + b1
• y = W2·z1 + b2 = (W2·W1)·x + (W2·b1 + b2) = W_new·x + b_new
💥 100 layers collapse into a single linear map! Depth adds no power.
The fix is to put a nonlinear function between the layers: h = f(z1) ← activation!
Now the layers no longer collapse. The model can learn curves, XOR, images, text.

🔹 1. Sigmoid (1990s): the first popular one
σ(x) = 1 / (1 + e^(-x)) → output in (0, 1)
✅ Can be read as a probability
❌ Vanishing gradient: max derivative = 0.25
10 layers: 0.25^10 ≈ 0.00000095 💀 Almost no gradient reaches the first layer!
This is one reason deep networks were so hard to train in the 1990s.

🔹 2. Tanh: an improved sigmoid
tanh(x) → output in (-1, 1), zero-centered
Better than sigmoid, but the vanishing gradient problem remains.
💡 Still used inside RNN/LSTM today.

🔹 3. Softmax: for multiple classes
Sigmoid is for 2 classes. For 10 classes (0-9), use Softmax:
Softmax(xᵢ) = e^(xᵢ) / Σⱼ e^(xⱼ)
Logits → probabilities, sum = 1.00
💡 As an output activation it is used only in the last layer.

🔹 4. ReLU (2012): REVOLUTION
ReLU(x) = max(0, x)
In 2012 AlexNet won ImageNet. ReLU was one of its key ingredients.
✅ Derivative = 1 (on the positive side) → vanishing gradient is much less severe
✅ Very fast (just a check for x > 0)
✅ Sparsity: roughly half of the neurons are "asleep"
❌ Dying ReLU: large negative bias → neuron always outputs 0 → gradient 0 → dead ☠️
The fix is Leaky ReLU: x ≤ 0 → 0.01·x (a small gradient, so the neuron does not die)

🔹 5. GELU (2016, popularized by BERT in 2018): the Transformer era
ReLU makes a hard decision: x ≤ 0 → 0. GELU is soft and probability-based:
GELU(x) = x · Φ(x), where Φ is the standard normal CDF
x = -2: ReLU → 0, GELU → -0.046
x = 2: ReLU → 2, GELU → 1.95
🔥 BERT, GPT-2, GPT-3, ViT all use GELU.

🔹 6. Swish/SiLU (2017) and Mish (2019)
SiLU(x) = x · σ(x)
Mish(x) = x · tanh(ln(1 + e^x))
Very similar to GELU: GELU(x) ≈ x · σ(1.702·x), so the difference is a small coefficient.
SiLU → EfficientNet, MobileNetV3, YOLOv5/v8, Stable Diffusion
Mish → YOLOv4
💡 GELU vs Swish vs Mish: the difference is very small and depends on context.

🎯 Which one for which task?
CNN (images) → ReLU or SiLU
Transformer (BERT, GPT, ViT) → GELU
Mobile / Diffusion → SiLU
YOLO → SiLU
RNN/LSTM → Tanh
Binary (last layer) → Sigmoid
Multi-class (last layer) → Softmax
💡 Rule of thumb: start with ReLU, then try GELU/SiLU.

⚠️ Important: no activation is "problem-free". Each one softens some weaknesses, but at a cost of its own (slower computation, more memory).

🤝 YouTube: 🎥 Link 🖥️ Colab: 📂 Link 📘 All lessons: Link
🚨 The videos were recorded live. The math explanations may contain mistakes. Apologies in advance 🙏 @EldorML
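For lesson 8.2, the formulas can be checked numerically. The sketch below is a minimal standard-library illustration, not the lesson's Colab code: the scalar weights `w1, b1, w2, b2` and the test input are made-up values, and GELU is written in its exact form with `math.erf`.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # Keeps a small gradient for x <= 0 so the neuron cannot "die".
    return x if x > 0 else slope * x

def gelu(x):
    # Exact form x * Phi(x), with Phi the standard normal CDF.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):
    return x * sigmoid(x)

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Two stacked scalar "layers" without an activation equal one linear layer.
# w1, b1, w2, b2 and x are arbitrary illustrative values.
w1, b1, w2, b2 = 2.0, 1.0, -3.0, 0.5
x = 1.7
two_layers = w2 * (w1 * x + b1) + b2
one_layer = (w2 * w1) * x + (w2 * b1 + b2)

for v in (-2.0, 2.0):
    print(v, relu(v), round(gelu(v), 3), round(silu(v), 3))
print("sigmoid gradient bound over 10 layers:", 0.25 ** 10)
print("collapsed layers agree:", abs(two_layers - one_layer) < 1e-12)
```

The GELU values printed for x = -2 and x = 2 should match the -0.046 and 1.95 quoted in the post after rounding.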

Video message (01:00)

📌 Lesson 8.1: Forward and Backward Pass, how a neural network "thinks" and "learns"
🎯 Deep Learning Mathematics @EldorML

Question: CNN, ViT, Diffusion, GNN, Transformer: what connects them?
Answer: the forward + backward pass. It is the heart of all of them.

🔹 Core logic
A child learns to tell an apple from an orange:
- Forward: sees the fruit → says "apple"
- Backward: the mother says "no, orange" → the child understands the mistake
A neural network does exactly this. In place of the "child" there are weights; in place of the "mother's answer" there is the loss.

🔹 1. Forward Pass: prediction
2-layer network, x = [1, 2], target = 5:
z1 = W1·x + b1 → [0.2, 1.9, 1.3]
h = ReLU(z1) → [0.2, 1.9, 1.3]
y = W2·h + b2 → 0.5
L = (y - 5)² → 20.25
The model said 0.5, the answer was 5. Loss = 20.25 💥

🔹 2. Computation Graph
Every operation is recorded in a graph:
x → [W1·x+b1] → [ReLU] → [W2·h+b2] → y → L
In the backward pass we walk this graph in reverse.
💡 PyTorch and TensorFlow both work on this principle. You write the forward pass, and the framework computes the backward pass automatically (autograd).

🔹 3. Backward Pass: Chain Rule
Question: "If I change W1 a little, how much does the loss change?"
dL/dW1 = dL/dy · dy/dh · dh/dz1 · dz1/dW1
Layer by layer, going backwards:
dL/dy = 2(y - 5) = -9
dL/dh = -9 · W2 = [-3.6, -2.7, 4.5]
dL/dz1 = dL/dh · 1 = [-3.6, -2.7, 4.5] (all z1 are positive, so the ReLU derivative is 1)
dL/dW1 = dL/dz1 · xᵀ → 3×2 matrix = [[-3.6, -7.2], [-2.7, -5.4], [4.5, 9.0]]

🔹 4. Gradient Descent: the update
W_new = W_old - η · dL/dW
With η = 0.01:
W1 before = [[0.5, -0.2], [0.3, 0.8], [-0.1, 0.6]]
W1 after = [[0.536, -0.128], [0.327, 0.854], [-0.145, 0.510]]
The parameters moved in the direction that reduces the loss 📉

🔹 5. The full loop
Forward → Loss → Backward → Update
↓
repeat 1000 times
↓
Model ready ✅

🎯 Summary
- Forward: prediction (input → output)
- Loss: measuring the error
- Backward: gradients via the chain rule
- Gradient Descent: updating the parameters
- Autograd: PyTorch does this automatically
💡 CNN, ViT, Diffusion, GNN, Transformer all learn through this mechanism. Only the operations inside differ. GPT-4 and your 2-layer network follow the same principle!

🤝 YouTube: 🎥 Link 🖥️ Colab: 📂 Link 📘 All lessons: Link
🚨 The videos were recorded live. The math explanations may contain mistakes. Apologies in advance 🙏 @EldorML
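The worked numbers in lesson 8.1 can be reproduced by hand-coding the forward and backward pass. This is a sketch under assumptions: the post gives `x`, the target, `W1` and the learning rate, but not `b1`, `W2` or `b2`. The values used below are inferred so that the posted intermediate results (z1, y, dL/dh) come out, and may differ from the lesson's notebook.

```python
# Given in the post.
x = [1.0, 2.0]
target = 5.0
W1 = [[0.5, -0.2], [0.3, 0.8], [-0.1, 0.6]]
lr = 0.01

# Assumptions: inferred from the posted z1, y and dL/dh, not stated in the post.
b1 = [0.1, 0.0, 0.2]
W2 = [0.4, 0.3, -0.5]
b2 = 0.5

# Forward pass.
z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
h = [max(0.0, z) for z in z1]                       # ReLU
y = sum(w * hi for w, hi in zip(W2, h)) + b2
loss = (y - target) ** 2

# Backward pass (chain rule, layer by layer).
dL_dy = 2.0 * (y - target)
dL_dh = [dL_dy * w for w in W2]
dL_dz1 = [g if z > 0 else 0.0 for g, z in zip(dL_dh, z1)]   # ReLU derivative
dL_dW1 = [[g * xi for xi in x] for g in dL_dz1]             # outer product with x

# Gradient descent update for W1.
W1_new = [[w - lr * g for w, g in zip(w_row, g_row)]
          for w_row, g_row in zip(W1, dL_dW1)]

print("loss:", round(loss, 4))
print("W1_new:", [[round(v, 3) for v in row] for row in W1_new])
```

In a framework such as PyTorch the backward half of this script is what autograd derives from the recorded computation graph.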

📌 Lesson 7.5: Efficient Attention, the Transformer's O(n²) problem
🎯 Deep Learning Mathematics @EldorML

Question: how do ChatGPT, Claude and Llama support contexts of up to 1M tokens?
Answer: efficient attention variants.

🔹 The problem: O(n²)
n = 512 → 262 thousand
n = 8192 → 67 million
n = 100K → 10 billion 💥
QK^T is an n×n matrix. It blows up as n grows.

🔹 1. Sparse Attention: fewer pairs
A token does not have to look at every other token.
- Sliding Window: attend to the nearest w tokens
- Longformer: local + global tokens (4,096 tokens in the released models)
- BigBird: window + global + random (4,096 tokens in the released models)
Complexity: O(n·w), linear in n

🔹 2. Linear Attention: a mathematical trick
The trick: (QK^T)V = Q(K^T V)
K^T V → a d × d matrix (small!)
Complexity: O(n · d²)
The softmax gets in the way → kernel method (Performer): softmax(q·k) ≈ phi(q)·phi(k)
At n = 100K: standard 10 billion → Performer 26 million (this corresponds to d = 16, since n·d² = 100,000 × 256)
Speedup: about 380x 🚀

🔹 3. FlashAttention: at the GPU level
It does not change O(n²), but it is 2-4x faster!
The secret: a GPU has 2 kinds of memory
HBM (40 GB, slow)
SRAM (about 20 MB, roughly 10x faster)
Standard: everything goes through HBM (slow)
Flash: works block by block in SRAM → only the result is written to HBM
Result: 10-20x less memory, 2-4x faster

🔹 4. Additional techniques
- Gradient Checkpointing: about 4x less memory (+30% time)
- Mixed Precision (BF16): about 2x less memory, about 2x faster
- GQA: used in Llama 2 70B and Llama 3 (the internals of closed models such as GPT-4 are not public)

🎯 Summary
- O(n²) is a hard barrier for long text
- Sparse → Longformer/BigBird (fewer pairs)
- Linear → Performer (a mathematical rewrite)
- FlashAttention → a 2-4x speedup with exact results
- GQA + BF16 + Checkpointing → common across LLMs
💡 Modern LLMs such as Llama 3 combine several techniques: GQA + FlashAttention + BF16 + KV-cache; closed models like GPT-4 and Claude presumably do something similar. Now you understand how 128K and 1M token contexts work!

🤝 YouTube: 🎥 Link 🖥️ Colab: 📂 Link 📘 All lessons: Link
🚨 The videos were recorded live. The math explanations may contain mistakes. Apologies in advance 🙏 @EldorML
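The reordering trick from lesson 7.5, (QK^T)V = Q(K^T V), is plain matrix associativity and can be verified on toy matrices. The sketch below deliberately leaves out the softmax (which is exactly what breaks this identity and what Performer's kernel approximation works around); the matrix contents, `n = 6` and `d = 2` are arbitrary illustrative values.

```python
def matmul(A, B):
    # Naive matrix product for nested lists.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

n, d = 6, 2   # toy sizes: sequence length and head dimension
Q = [[0.1 * (i + j + 1) for j in range(d)] for i in range(n)]
K = [[0.2 * (i - j) for j in range(d)] for i in range(n)]
V = [[float(i * d + j) for j in range(d)] for i in range(n)]

left = matmul(matmul(Q, transpose(K)), V)    # materializes an n x n matrix
right = matmul(Q, matmul(transpose(K), V))   # materializes only a d x d matrix

def pair_count(n):
    # Size of the n x n score matrix.
    return n * n

def linear_cost(n, d):
    # Rough cost of the reordered product, as counted in the post.
    return n * d * d

print(pair_count(512), pair_count(8192), pair_count(100_000))
print(linear_cost(100_000, 16))
```

Both orderings give the same output; only the size of the intermediate matrix changes.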

📌 Lesson 7.4: Graph Neural Networks (GNN), data in graph form
🎯 Deep Learning Mathematics @EldorML

In the previous lesson we talked about Diffusion Models and generating images from noise. Now the questions:
❓ What if the data is neither an image nor text, but a graph?
❓ How do Facebook's "People you may know", Google Maps traffic and AlphaFold work?
Answer: all of them rely on Graph Neural Networks or closely related graph-based models.

🔹 1. The main question
CNN: for images (regular grid)
Transformer: for text (sequence)
GNN: for graphs (irregular structure)
Examples:
• Social network: people (nodes) + friendships (edges)
• Molecule: atoms + bonds
• Road map: cities + roads
• Recommendation: user-product
"A GNN is a generalized CNN: it works with 'neighboring nodes' instead of 'neighboring pixels'."

🔹 2. Adjacency Matrix: the graph in numbers
We express who is connected to whom with a matrix:
         Ali  Vali  Soli  Rustam
Ali    [  0    1     1     0   ]
Vali   [  1    0     0     1   ]
Soli   [  1    0     0     1   ]
Rustam [  0    1     1     0   ]
🟢 Zero diagonal: a node is not connected to itself
🟢 Symmetric: for an undirected graph
We add self-loops: A_tilde = A + I
Reason: during aggregation a node must keep its own features too.

🔹 3. Message Passing: the heart of a GNN
Three steps:
1) MESSAGE: every node sends a "message" to its neighbors
2) AGGREGATE: every node combines the messages it received (sum/mean/max)
3) UPDATE: a neural network computes the new features
A real-life analogy, spreading gossip:
At the start: only Ali knows
Step 1: Ali → Vali and Soli know too
Step 2: Vali, Soli → Rustam knows too
💡 The key takeaway: K rounds of message passing means every node receives information from neighbors up to K hops away.

🔹 4. The GCN formula
H^(k+1) = sigma( A_hat · H^(k) · W^(k) )
where A_hat = D^(-1/2) · A_tilde · D^(-1/2) and D is the diagonal degree matrix of A_tilde
Step by step:
• A_tilde · H: sum over neighbors (automatic aggregation)
• H · W: linear transform (the analogue of a CNN filter)
• multiplying by D^(-1/2): normalization
• sigma: a nonlinearity such as ReLU or SiLU

🔹 5. Why normalize?
Problem: some nodes have 1000+ neighbors (a celebrity), others have 5.
With a plain sum:
Celebrity → large values
Ordinary person → small values
This is unfair: popular nodes dominate.
Fix: divide by the degrees:
h_i_new = sum over neighbors j of ( h_j / sqrt(d_i · d_j) )
Now everyone's information is on the same scale.

🔹 6. K layers = K hops
1 layer → direct neighbors
2 layers → neighbors of neighbors
K layers → K hops
⚠️ But with 5+ layers the over-smoothing problem appears: all nodes become the same.
At the start → after 10 layers:
Ali = [1, 0] → [0.4, 0.4]
Vali = [0, 1] → [0.4, 0.4]
Soli = [1, 1] → [0.4, 0.4]
→ ALL THE SAME!
Typical sweet spot: 2-3 layers.

🔹 7. Types of GNN tasks
Node-level: a prediction for every node. Example: is this a spam account? which group?
Edge-level: will an edge exist? Example: friend recommendation (link prediction)
Graph-level: for the whole graph. Example: is the molecule toxic?

🎯 Final summary
• Graph = nodes + edges (expressed with an adjacency matrix)
• Message passing: message → aggregate → update
• GCN formula: H' = sigma(A_hat · H · W), i.e. neighbor sum + linear map + nonlinearity
• Normalization: divide by degree (so popular nodes do not dominate)
• 2-3 layers usually work best; 5+ layers tend to cause over-smoothing
• A GNN works on graphs of any size and does not depend on the order in which nodes are listed
💡 AlphaFold (proteins), Google Maps (traffic), Pinterest (recommendations), Facebook ("People you may know") all build on graph-based models. We use GNNs every day without seeing them.

🤝 YouTube lesson: 🎥 Link 🖥️ Colab notebook: 📂 Link 📘 All lessons: Link
🚨 The videos were recorded live. The math explanations may contain mistakes. Apologies in advance 🙏 @EldorML
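For lesson 7.4, the normalized adjacency A_hat and the over-smoothing effect can be computed for the four-person graph from the post. This is a simplified sketch: it applies only the propagation A_hat · H, with no weight matrix W and no nonlinearity, and the feature vector for Rustam is an assumption (the post lists features only for Ali, Vali and Soli).

```python
import math

names = ["Ali", "Vali", "Soli", "Rustam"]
A = [
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
]
n = len(A)

# Self-loops: A_tilde = A + I.
A_tilde = [[A[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
deg = [sum(row) for row in A_tilde]

# Symmetric normalization: A_hat[i][j] = A_tilde[i][j] / sqrt(d_i * d_j).
A_hat = [[A_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
         for i in range(n)]

def propagate(A_hat, H):
    # One round of neighbor aggregation: A_hat @ H (no weights, no activation).
    return [[sum(A_hat[i][k] * H[k][f] for k in range(len(H)))
             for f in range(len(H[0]))] for i in range(len(H))]

def spread(H):
    # Largest gap between any two nodes in any feature column.
    return max(max(col) - min(col) for col in zip(*H))

# Features for Ali, Vali, Soli follow the post; Rustam's row is hypothetical.
H0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

H_k = H0
for _ in range(10):
    H_k = propagate(A_hat, H_k)

print("spread before:", spread(H0))
print("spread after 10 rounds:", spread(H_k))
```

After ten rounds every node carries almost the same vector, which is the over-smoothing the post warns about; the exact limit value depends on the starting features.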

If you have any suggestions or requests, leave them in the comments. I will try to adapt the lessons accordingly!
Anonymous voting

Assalomu alaykum, friends. Are the video lessons clear and useful for you?
Anonymous voting

📌 Lesson 7.3: Diffusion Models, generating images from noise
🎯 Deep Learning Mathematics @EldorML

In the previous lesson we talked about ViT and patch embedding. Now the questions:
❓ Is it possible to create a realistic image from pure noise?
❓ How do Stable Diffusion and DALL-E work?
Answer: yes, and the secret is the "diffusion" process.

🔹 1. The main idea
GAN: "invents" the image
VAE: compresses the image and reconstructs it
Diffusion: "builds" the image by removing noise
"If we learn how to destroy an image, we can also learn how to restore it."

🔹 2. Forward Process: adding noise
We gradually add Gaussian noise to the image over T = 1000 steps:
x_0 → x_1 → x_2 → ... → x_T
(image → a little noise → more noise → pure noise)
Reparameterization formula:
x_t = √ᾱ_t · x_0 + √(1-ᾱ_t) · ε
where:
- ᾱ_t is the product of the α values up to step t
- ε ~ N(0, I) is pure Gaussian noise
🟢 The forward process is NOT TRAINED: it is a fixed mathematical formula.

🔹 3. Reverse Process: restoring the image
Starting from pure noise, we remove a little noise at every step:
x_T → x_{T-1} → ... → x_1 → x_0
(noise → clean image)
Problem: there is no closed-form formula (the posterior is intractable)
Solution: a neural network (U-Net) predicts the noise

🔹 4. Score Matching: the deeper idea
Score function = gradient of log p(x)
It is "a compass pointing towards real images".
For DDPM it can be shown that: score = -ε / √(1-ᾱ_t)
So predicting the noise == computing the score.
The two are MATHEMATICALLY EQUIVALENT!

🔹 5. DDPM Loss: a simple MSE
The complicated variational lower bound (VLB) simplifies to:
L = || ε - ε_θ(x_t, t) ||²
This is plain MSE. That's all!
Training algorithm:
1. Take an image from the dataset: x_0
2. Random step: t ~ Uniform(1, T)
3. Random noise: ε ~ N(0, I)
4. Compute x_t (formula above)
5. Loss = ||ε - ε_θ(x_t, t)||²
6. Gradient descent

🔹 6. U-Net: the noise-predicting network
Input: noisy image + step number (t)
Output: predicted noise
Encoder (downsampling): x_t → [64] → [128] → [256] → [512]
↓
Bottleneck
↓
Decoder (upsampling): [512] → [256] → [128] → [64] → ε_pred
Skip connections at every level, so fine details are not lost.
Sinusoidal time embedding, so the model knows which step it is at.

🔹 7. Sampling: slow but high quality
Training: 1 forward pass per step
Sampling: 1000 forward passes
Diffusion is roughly 1000 times slower than a GAN at sampling, but the quality is considerably higher.
Newer methods (DDIM) bring this number down to 20-50.

🎯 Final summary
- Forward process → a mathematical formula, not trained
- Reverse process → learned by the U-Net
- DDPM loss → plain MSE
- Score matching = noise prediction (mathematically equivalent)
- U-Net + skip connections → fine details are preserved
- Time embedding → one model handles all 1000 steps
💡 Stable Diffusion, DALL-E 2, Midjourney, Imagen all build on the diffusion idea behind DDPM!

🤝 YouTube lesson: 🎥 Link 🖥️ Colab notebook: 📂 Link 📘 All lessons: Link
🚨 The videos were recorded live. The math explanations may contain mistakes. Apologies in advance 🙏 @EldorML
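The training recipe in lesson 7.3 (pick x_0, t and ε, form x_t, compute the MSE) fits in a few lines. The sketch below rests on assumptions: a linear β schedule from 1e-4 to 0.02 (the schedule is not specified in the post), a toy 4-value "image", and a placeholder `eps_model` that stands in for the U-Net and simply predicts zero noise.

```python
import math
import random

T = 1000
# Assumed linear beta schedule; the post only says T = 1000.
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]

# alpha_bar[t] = product of (1 - beta) up to step t.
alpha_bar = []
prod = 1.0
for beta in betas:
    prod *= 1.0 - beta
    alpha_bar.append(prod)

def q_sample(x0, t, eps):
    # Forward process in closed form: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps.
    a = alpha_bar[t]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]

def mse(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)

def eps_model(x_t, t):
    # Hypothetical stand-in for the U-Net: always predicts zero noise.
    return [0.0 for _ in x_t]

rng = random.Random(0)
x0 = [0.5, -0.25, 1.0, 0.0]              # toy "image"
t = rng.randrange(T)                     # index 0..T-1 stands for step 1..T
eps = [rng.gauss(0.0, 1.0) for _ in x0]  # eps ~ N(0, I)
x_t = q_sample(x0, t, eps)
loss = mse(eps, eps_model(x_t, t))       # DDPM loss for this sample

# Knowing x0, the noise can be recovered exactly from x_t.
a = alpha_bar[t]
eps_rec = [(xt - math.sqrt(a) * x) / math.sqrt(1.0 - a) for xt, x in zip(x_t, x0)]

print("alpha_bar at first/last step:", alpha_bar[0], alpha_bar[-1])
print("loss with the zero predictor:", loss)
```

A real training loop would replace `eps_model` with a network and follow the loss with a gradient step.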

📌 Lesson 7.2: Vision Transformers (ViT), turning images into tokens
🎯 Deep Learning Mathematics @EldorML

In the previous lesson we talked about ResNet and skip connections. Now the questions:
❓ Is the Transformer only for text?
❓ Can we feed an image to a Transformer too?
Answer: yes, but first the image has to be turned into "words".

🔹 1. The problem: tokenizing an image
If every pixel is a token: 224×224 = 50176 tokens
Attention costs O(n²) → 50176² ≈ 2.5 billion operations. Not practical.

🔹 2. The solution: Patch Embedding
We split the image into P×P patches:
Patch size: 16×16
Number of patches: (224×224) / (16×16) = 196
50176 pixels → only 196 tokens! ✅
Each patch:
1. Is flattened: 16×16×3 = 768 elements
2. Linear projection: 768 → D-dimensional embedding
3. A position embedding is added

🔹 3. CLS Token
A [CLS] token is prepended to the Transformer input.
• It does not belong to any patch
• It interacts with all patches through attention
• At the end it holds a "summary" representation of the whole image
• Only [CLS] is used for classification

🔹 4. Why is a position embedding needed?
The Transformer is order-agnostic: without position information, self-attention treats the patches as an unordered set:
[p1][p2][p3] and [p3][p1][p2] look the same!
The position embedding adds "I am at position i" to every token.
ViT uses learnable position embeddings.

🔹 5. Inductive Bias: CNN vs ViT
Inductive bias: the architecture's built-in assumptions about the data.
CNN assumptions:
• Locality → works only with neighboring pixels
• Translation equivariance → the same filter is applied everywhere
ViT assumptions:
• NO locality → every patch sees all patches
• NO translation equivariance → position embeddings are learned
• Global receptive field → available immediately ✅
Comparison:
• Locality: CNN ✅ (built in) / ViT ❌ (learned)
• Translation eq.: CNN ✅ (built in) / ViT ❌ (learned)
• Global context: CNN ❌ (slow) / ViT ✅ (immediate)
• Little data: CNN ✅ good / ViT ❌ needs a lot of data
• Lots of data: CNN ✅ good / ViT ✅✅ better than CNN
In practice:
• Little data (< 1M) → CNN preferred
• Lots of data (> 10M) → ViT preferred

🔹 6. The full ViT pipeline
Input image (224×224×3)
↓
Split into patches → 196 patches of 16×16×3
↓
Flatten + Linear → 196×768
↓
CLS token → 197×768
↓
Position embedding → 197×768
↓
Transformer Encoder × 12
↓
CLS token → 768
↓
MLP Head → 1000 classes

🎯 Final summary
• Patch embedding → the image becomes a sequence of tokens
• CLS token → a summary representation of the whole image
• Position embedding → tells where each patch sits
• CNN → has inductive bias, good with little data
• ViT → global attention, good with lots of data
💡 DINOv2 and SAM use ViT backbones, and newer diffusion models (e.g. Stable Diffusion 3) use Transformer backbones too!

🤝 YouTube lesson: 🎥 Link 🖥️ Colab notebook: 📂 Link 📘 All lessons: Link
🚨 The videos were recorded live. The math explanations may contain mistakes. Apologies in advance 🙏 @EldorML
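The patch arithmetic from lesson 7.2 can be checked with a tiny patchify routine. This is an illustrative sketch on nested lists, not an efficient implementation (real code would use tensor reshapes); `patchify` and `vit_token_shapes` are names made up for this example.

```python
def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened patches (row-major)."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    flat.extend(image[r][c])   # append all channels of one pixel
            patches.append(flat)
    return patches

def vit_token_shapes(size=224, patch=16, channels=3):
    # Returns (number of patches, flattened patch length, tokens incl. [CLS]).
    num_patches = (size // patch) ** 2
    patch_dim = patch * patch * channels
    return num_patches, patch_dim, num_patches + 1

# A 4x4 single-channel toy image whose pixel value is its row-major index.
tiny = [[[r * 4 + c] for c in range(4)] for r in range(4)]
tiny_patches = patchify(tiny, 2)

print(tiny_patches)
print(vit_token_shapes())
```

For the 224×224×3 case this gives 196 patches of 768 values each, and 197 tokens once [CLS] is prepended, matching the pipeline in the post.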

📌 Lesson 7.1: ResNet and Skip Connections, a solution to the deep network problem
🎯 Deep Learning Mathematics @EldorML

In the previous lesson we talked about Batch Normalization. Now the questions:
❓ Why does a 56-layer network perform worse than a 20-layer one?
❓ Why is a deeper network not always better?
Answer: the degradation problem, which this lesson explains through vanishing gradients.

🔹 1. The problem: Vanishing Gradient
In backpropagation the gradient is computed with the chain rule:
∂L/∂w₁ = ∂L/∂hₙ · ∂hₙ/∂hₙ₋₁ · ... · ∂h₁/∂w₁
Every layer multiplies the incoming gradient by its own factor.
If that factor is < 1 in every layer:
0.9¹⁰ = 0.35
0.9⁵⁰ = 0.005
0.9¹⁰⁰ ≈ 0.00003 ← almost zero!
As a result:
• The first layers barely learn
• The deep network performs worse than the shallow one

🔹 2. Residual Learning: F(x) + x
Plain layer: h(x) = F(x) ← learns the full mapping
ResNet layer: h(x) = F(x) + x ← learns only the "residual"
Why is this easier?
• In a plain network: learning h(x) = x → hard
• In a ResNet: learning F(x) = 0 → easy!
Plain: x → [Conv→BN→ReLU] → F(x)
ResNet: x → [F layers] → (+) → ReLU, with x also fed directly into (+) through the skip path

🔹 3. The mathematics of identity mapping
One block: y = F(x, {Wᵢ}) + x
For many blocks: x_L = x_l + Σ F(xᵢ) (i from l to L-1)
So any deeper layer is a shallower layer plus a direct sum of residuals.
Gradient formula:
∂L/∂x_l = ∂L/∂x_L · (1 + ∂/∂x_l · ΣF(xᵢ))
💡 There is a "1" in the formula!
• In a plain network: the gradient flows only through the layers → it can vanish
• In a ResNet: 1 + ... → the gradient is unlikely to vanish ✅

🔹 4. Skip connection architecture
Basic Block (ResNet-18, 34):
x (also kept for the skip path)
↓
Conv(3×3) → BN → ReLU
↓
Conv(3×3) → BN
↓
(+) ← x
↓
ReLU

Bottleneck Block (ResNet-50, 101, 152):
x (also kept for the skip path)
↓
Conv(1×1) → BN → ReLU ← channels reduced
↓
Conv(3×3) → BN → ReLU ← main computation
↓
Conv(1×1) → BN ← channels restored
↓
(+) ← x
↓
ReLU
The 1×1 convolutions shrink the channel count around the 3×3 convolution → computation is saved.
When the dimensions differ, a projection is used: y = F(x) + Wₛ·x ← here Wₛ is a 1×1 conv

🔹 5. Results
Plain network:
20 layers → ✅ good
56 layers → ❌ gets worse
152 layers → ❌❌ very bad
ResNet:
20 layers → ✅ good
56 layers → ✅ still good
152 layers → ✅ best (ImageNet 2015 🏆)
ResNet-152 achieved the best result on ImageNet in 2015.

🎯 Final summary
• Degradation → the deep network performs worse than the shallow one
• Skip connection → F(x) + x opens a direct path for the gradient
• Identity mapping → learning F(x)=0 is easy → extra layers do no harm
• The ResNet idea → used in nearly all modern architectures

🤝 YouTube lesson: 🎥 Link 🖥️ Colab notebook: 📂 Link 📘 All lessons: Link
🚨 The videos were recorded live. The math explanations may contain mistakes. Apologies in advance 🙏 @EldorML
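The "1 + ..." term from lesson 7.1 can be seen in a deliberately idealized model: a stack of scalar linear layers where every layer (or residual branch) has the same local derivative. This is a toy sketch under that assumption, not a model of a real ResNet; the derivative values 0.9 and 0.01 are illustrative.

```python
def plain_stack(x, w, depth):
    # Plain network: each layer computes h = w * h.
    for _ in range(depth):
        x = w * x
    return x

def residual_stack(x, w, depth):
    # Residual network: each block computes h = h + F(h) with F(h) = w * h.
    for _ in range(depth):
        x = x + w * x
    return x

def plain_gradient(local_derivative, depth):
    # d(output)/d(input) of the plain stack.
    return local_derivative ** depth

def residual_gradient(branch_derivative, depth):
    # d(output)/d(input) of the residual stack: product of (1 + F').
    return (1.0 + branch_derivative) ** depth

def numeric_grad(f, x, h=1e-6):
    # Central finite difference, used only to cross-check the formulas.
    return (f(x + h) - f(x - h)) / (2.0 * h)

for depth in (10, 50, 100):
    print(depth, plain_gradient(0.9, depth))

# A nearly "switched off" branch (F' = 0.01) kills the plain gradient
# but leaves the residual gradient close to 1.
print(plain_gradient(0.01, 10), residual_gradient(0.01, 10))

check = numeric_grad(lambda v: residual_stack(v, 0.01, 10), 1.0)
```

The first loop prints the 0.9ⁿ values quoted in the post. The comparison after it shows why F(x) = 0 is harmless in a residual block and fatal in a plain layer.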

📌 Lesson 6.7: Batch Normalization, speeding up training in deep networks
🎯 Deep Learning Mathematics @EldorML

In the previous lesson we talked about overfitting and generalization. Now the questions:
❓ Why do deep networks become unstable during training?
❓ Why does training break when we make the learning rate large?
Answer: Internal Covariate Shift (the motivation given in the original BatchNorm paper).

🔹 1. What is Internal Covariate Shift?
Take a factory as an example:
• Good case → the raw material is the same every day, the worker keeps a steady pace
• Bad case → the raw material is different every day, the worker keeps readjusting
The same happens in a network:
• Every layer takes its input from the previous layer
• As the previous layer changes during training, the next layer's input changes too
• The next layer keeps adapting to "new conditions" → training slows down
This is Internal Covariate Shift.

🔹 2. Batch Normalization: the solution
Idea: normalize each layer's input, i.e. bring it to mean=0, std=1.
Using Batch = [1.0, 2.0, 3.0, 4.0] as an example:
Step 1, mean:
μ = (1.0 + 2.0 + 3.0 + 4.0) / 4 = 2.5
Step 2, variance:
σ² = ((1−2.5)² + (2−2.5)² + (3−2.5)² + (4−2.5)²) / 4 = (2.25 + 0.25 + 0.25 + 2.25) / 4 = 1.25
Step 3, normalization: x̂ᵢ = (xᵢ − μ) / √(σ² + ε)
x̂₁ = (1.0 − 2.5) / √1.25 = −1.34
x̂₂ = (2.0 − 2.5) / √1.25 = −0.45
x̂₃ = (3.0 − 2.5) / √1.25 = +0.45
x̂₄ = (4.0 − 2.5) / √1.25 = +1.34
Result: mean ≈ 0, std ≈ 1 ✅
Step 4, scale and shift: yᵢ = γ · x̂ᵢ + β
💡 Why are γ and β needed?
If we only normalize, the model is always forced to mean=0, std=1. But some layers may need a different distribution. γ and β are learnable parameters, so the model picks the distribution it needs.

🔹 3. Training vs Inference
Problem: what do we do at inference when there is no batch?
Solution, running statistics:
μ_run ← (1−α)·μ_run + α·μ_batch
σ²_run ← (1−α)·σ²_run + α·σ²_batch
Training: batch statistics are used and the running statistics are updated
Inference: running statistics are used and stay fixed
In PyTorch:
model.train() → batch statistics, running statistics are updated
model.eval() → running statistics, unchanged
⚠️ A common mistake is forgetting model.eval(): BatchNorm then misbehaves at inference → unstable results.

🔹 4. Advantages of Batch Norm
• A larger LR can be used → faster training
• Less dependence on initialization
• A regularization effect: overfitting is reduced slightly
• Less gradient vanishing

🔹 5. Where to put it?
Original: Linear → BN → Activation
Alternative used by some practitioners: Linear → Activation → BN
In PyTorch (original order): nn.Linear(in_features, out_features) → nn.BatchNorm1d(out_features) → nn.ReLU()

🎯 Final summary:
• Internal Covariate Shift → a layer's input keeps shifting during training
• Batch Norm → brings every batch to mean=0, std=1
• γ, β → the model learns the distribution it needs
• model.eval() → uses the running statistics

🤝 YouTube lesson: 🎥 Link 🖥️ Colab notebook: 📂 Link 📘 All lessons: Link
🚨 The videos were recorded live. The math explanations may contain mistakes. Apologies in advance 🙏 @EldorML
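The four BatchNorm steps from lesson 6.7 and the running-statistics update can be replayed on the batch [1, 2, 3, 4]. This is a hand-rolled sketch for a single feature, not PyTorch's implementation; `eps = 1e-5`, `alpha = 0.1` and the initial running values (0 and 1) are assumed illustrative defaults.

```python
import math

def batch_norm_train(batch, gamma=1.0, beta=0.0, eps=1e-5):
    # Steps 1-4 from the post, using the batch's own statistics.
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)   # divide by N, as in the post
    x_hat = [(x - mean) / math.sqrt(var + eps) for x in batch]
    y = [gamma * v + beta for v in x_hat]
    return y, mean, var

def update_running(running, batch_stat, alpha=0.1):
    # Exponential moving average: run <- (1 - alpha) * run + alpha * batch.
    return (1.0 - alpha) * running + alpha * batch_stat

def batch_norm_eval(x, running_mean, running_var, gamma=1.0, beta=0.0, eps=1e-5):
    # Inference: a single input is normalized with the stored running statistics.
    return gamma * (x - running_mean) / math.sqrt(running_var + eps) + beta

y, mu, var = batch_norm_train([1.0, 2.0, 3.0, 4.0])
running_mean = update_running(0.0, mu)    # assumed initial running mean 0
running_var = update_running(1.0, var)    # assumed initial running variance 1

print([round(v, 2) for v in y])
print(running_mean, running_var)
print(batch_norm_eval(2.5, running_mean, running_var))
```

The split between `batch_norm_train` and `batch_norm_eval` mirrors what `model.train()` and `model.eval()` switch between.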

📌 Lesson 6.6: Overfitting and Generalization, what does the model learn?
🎯 Deep Learning Mathematics @EldorML

In the previous lesson we talked about regularization. Now the questions:
❓ Why does a model do well on the training data and badly on new data?
❓ What should the model actually learn?
Answer: the model should learn the underlying pattern, not the distracting noise.

🔹 1. Overfitting / Underfitting
Exam analogy:
• You did not study at all → Underfitting
• You understood the topic → Just right ✅
• You only memorized the answers → Overfitting
In terms of loss:
• Underfitting: train loss high, val loss high
• Just right: train loss low, val loss low (close to each other)
• Overfitting: train loss very low, val loss high
Polynomial example:
• Degree 1 → too simple → underfitting
• Degree 4 → optimal → just right ✅
• Degree 20 → too complex → overfitting

🔹 2. The geometry of generalization
Two kinds of minima on the loss surface:
Sharp minimum:
Loss
 | \      /
 |  \    /
 |   \  /    ← steep walls
 |    \/
Flat minimum:
Loss
 | \         /
 |  \       /
 |   \_____/  ← wide, flat bottom
Why is a flat minimum better?
If the train and test distributions differ slightly, the optimal parameters shift a little.
• Sharp → small shift → loss rises quickly → poor test results
• Flat → small shift → loss hardly changes → good test results
How do we reach a flat minimum?
• Small batch size → flat minimum
• Weight Decay / L2 → penalizes large parameters
• Dropout → improves robustness
💡 Small batches (32-256) are often recommended, even though they are not faster.

🔹 3. Bias-Variance Tradeoff
Error = Bias² + Variance + Irreducible Noise
Bias → the model's systematic error → a sign of underfitting
Variance → how strongly the model depends on the particular training data → a sign of overfitting
How they change with complexity:
Bias: ████████ → ░░░░░░░░ (decreases as complexity grows)
Variance: ░░░░░░░░ → ████████ (increases as complexity grows)
⚠️ The "Double Descent" phenomenon in deep learning: for very large models the total error goes down again, but this is not yet fully explained.

🔹 4. Train / Val / Test Split
| Set | Purpose | Size |
| Train | Training the model | 70-80% |
| Val | Tuning hyperparameters | 10-15% |
| Test | Final evaluation | 10-15% |
⚠️ Golden rule: look at the test set only once. If you change the model based on the test result, it is no longer a real test.
Early Stopping: stop training when the val loss starts rising.
Epoch: 1 2 3 4 5 6 7 ...
Train: 0.9 0.7 0.5 0.4 0.3 0.2 0.15
Val: 0.95 0.8 0.65 0.6 0.58 0.60 0.65
Best model: epoch 5 (val = 0.58)

🔹 5. What to do in which case?
• Train low, Val high → overfitting → regularization, dropout, more data
• Train high, Val high → underfitting → bigger model, more epochs
• Train ≈ Val, both high → still underfitting: add capacity or better features first (more data alone rarely helps here)
• Train ≈ Val, both low → ideal ✅

🎯 Final summary
• Overfitting → memorizing the training data
• Flat minimum → good generalization
• Bias² + Variance → two components of the error
• Train/Val/Test → each has its own job
A good model is not the one with the lowest training loss, but the one with the lowest loss on unseen data. ✅

🤝 YouTube lesson: 🎥 Link 🖥️ Colab notebook: 📂 Link 📘 All lessons: Link
🚨 The videos were recorded live. The math explanations may contain mistakes. Apologies in advance 🙏 @EldorML
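The early-stopping rule from lesson 6.6 can be written as a small loop over the validation losses in the post. The `patience` parameter (how many non-improving epochs to tolerate) is an assumption added for this sketch; the post only says to stop once the validation loss rises.

```python
def early_stopping(val_losses, patience=2):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training stops after `patience` consecutive epochs without a new best
    validation loss; the checkpoint to keep is the one from best_epoch.
    """
    best_loss = float("inf")
    best_epoch = None
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)

# Validation losses from the table in the post.
val = [0.95, 0.8, 0.65, 0.6, 0.58, 0.60, 0.65]
best_epoch, stop_epoch = early_stopping(val)
print("keep the model from epoch", best_epoch, "and stop at epoch", stop_epoch)
```

A patience above 1 avoids stopping on a single noisy epoch, at the price of a few extra epochs of training.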

📌 Lesson 6.5: Regularization, fighting overfitting
🎯 Deep Learning Mathematics @EldorML

In the previous lessons we talked about optimizers and the learning rate. Now the questions:
❓ Why does the model do well on train data and badly on test data?
❓ What happens when the parameters get very large?
Answer: the model does not "understand", it only "memorizes". This is called overfitting.

🔹 1. What is overfitting?
Exam preparation analogy:
• Memorizing → you cannot answer a new question
• Understanding → you can answer any question
The same holds for models:
• Overfitting → the model "memorizes" the training data
• Regularization → forces the model to "understand"

🔹 2. L2 Regularization (Ridge)
Plain loss: Loss = (1/n)·Σ(y_i − ŷ_i)²
L2 loss: Loss = (1/n)·Σ(y_i − ŷ_i)² + λ·Σ w_j²
Now the model minimizes two things: the error + large parameters.
Update: w ← w − η·(gradient + 2λ·w)
Example: η=0.1, λ=0.1, w=3.0, gradient=2.0
• With L2: w = 3.0 − 0.1×(2.0+0.6) = 2.74
• Without L2: w = 3.0 − 0.1×2.0 = 2.80
💡 L2 pulls parameters towards zero, but does not set them exactly to zero.

🔹 3. L1 Regularization (Lasso)
Loss_L1 = (1/n)·Σ(y_i − ŷ_i)² + λ·Σ|w_j|
Difference from L2: it drives some parameters exactly to zero → the model selects the important features itself.
⚠️ In deep learning L1 is rare; L2 is used much more often.

🔹 4. Dropout
Problem: neurons rely on each other too much → overfitting.
Solution: at every training step we randomly switch off part of the neurons.
Plain: [N1][N2][N3][N4][N5]
Dropout: [N1][ 0 ][N3][ 0 ][N5] ← different neurons are switched off at every step
⚠️ dropout=0.2 → 20% of the neurons are switched off, 80% are kept.
Inverted Dropout solves the scale mismatch between training and inference:
h̃_j = (r_j · h_j) / p, where r_j ∈ {0, 1} is the random mask and p is the keep probability (p = 0.8 for dropout=0.2)
💡 PyTorch does this automatically.

🔹 5. AdamW
If L2 is used with Adam, the λ·w term is also divided by √v̂_t → the regularization strength varies from parameter to parameter. That is not what we want!
L2 (Adam): w ← w − η/√v̂_t · (gradient + 2λ·w)
AdamW: w ← w − η/√v̂_t · gradient − η·λ·w
(both written in simplified form, without the momentum term and ε)
Weight decay is applied separately from the gradient update, which is the correct approach!
💡 If you are using Adam, AdamW is the recommended choice.

🔹 6. Which one when?
• Parameters are very large → L2
• Feature selection needed → L1
• Neurons rely on each other → Dropout
• Adam is in use → AdamW
Combinations:
• Small model → L2
• Medium model → L2 + Dropout
• Large model → AdamW + Dropout

🎯 Final summary
• L2 → pulls parameters towards zero
• L1 → switches off unneeded parameters entirely
• Dropout → forces neurons to work independently
• AdamW → implements weight decay correctly for Adam

🤝 YouTube lesson: 🎥 Link 🖥️ Colab notebook: 📂 Link 📘 All lessons: Link
🚨 The videos were recorded live. The math explanations may contain mistakes. Apologies in advance 🙏 @EldorML
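The update rules in lesson 6.5 can be tried out on scalars. The sketch below makes three simplifications: `sgd_step` is plain SGD with an L2 term, `adamw_like_step` mirrors the post's simplified AdamW formula (no momentum, `v_hat` passed in by hand), and `inverted_dropout` takes the drop rate and derives the keep probability from it. All function names and the sample values are made up for this example.

```python
import random

def sgd_step(w, grad, lr, l2=0.0):
    # w <- w - lr * (grad + 2 * lambda * w)
    return w - lr * (grad + 2.0 * l2 * w)

def adamw_like_step(w, grad, v_hat, lr, weight_decay, eps=1e-8):
    # Simplified AdamW: adaptive gradient step plus decoupled weight decay.
    return w - lr * grad / (v_hat ** 0.5 + eps) - lr * weight_decay * w

def inverted_dropout(h, drop_rate, rng):
    # Keep each unit with probability keep = 1 - drop_rate and rescale by 1 / keep,
    # so the expected activation is the same as without dropout.
    keep = 1.0 - drop_rate
    return [(x / keep) if rng.random() < keep else 0.0 for x in h]

# The worked example from the post.
with_l2 = sgd_step(3.0, 2.0, lr=0.1, l2=0.1)
without_l2 = sgd_step(3.0, 2.0, lr=0.1)
print(round(with_l2, 2), round(without_l2, 2))

# Dropout keeps the mean activation roughly unchanged.
rng = random.Random(0)
dropped = inverted_dropout([1.0] * 100_000, drop_rate=0.2, rng=rng)
mean_after = sum(dropped) / len(dropped)
print("mean after dropout:", mean_after)

decoupled = adamw_like_step(3.0, 2.0, v_hat=4.0, lr=0.1, weight_decay=0.1)
```

At inference time dropout is simply skipped; because of the 1/keep rescaling during training, no further correction is needed.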

📌 Lesson 6.4: Learning Rate (LR) Schedules, how is the LR changed?
🎯 Deep Learning Mathematics @EldorML

In the previous lessons we talked about the loss surface and curvature. Now the questions:
❓ Why does the model sometimes bounce around the minimum?
❓ Why is a fixed learning rate not enough?
Answer: the LR has to be dynamic.

🔹 1. What is the problem?
With a fixed LR:
• Large LR → overshoots the minimum near the end
• Small LR → very slow, wasted time
The fix is to change the LR over time. Two approaches:
• LR Schedule → we control the LR
• Adaptive Methods → each parameter adapts its own LR

🔹 2. Step Decay
Formula: η_t = η₀ × γ^⌊t/S⌋
Example: η₀=0.1, γ=0.5, S=20
• Epoch 0 → LR = 0.100
• Epoch 20 → LR = 0.050
• Epoch 40 → LR = 0.025
• Epoch 60 → LR = 0.0125
The LR is halved every 20 epochs.
⚠️ Drawback: the LR jumps abruptly.

🔹 3. Cosine Annealing
Formula: η_t = η_min + ½(η_max − η_min)(1 + cos(π·t/T))
Why cosine?
• Decreases slowly at first: the model is still finding its direction
• Decreases fast in the middle: the model is closing in
• Slow at the end: settles precisely into the minimum
• The LR never drops below η_min
💡 One of the most widely used schedules in modern training.

🔹 4. Warmup + Cosine
Large models (GPT, BERT) bring a new problem: early gradients are unreliable → starting with a large LR sends the model in the wrong direction. This is called training collapse.
Formula (2 phases):
Warmup (t ≤ W): η_t = η_max × t/W
Cosine (t > W): η_t = cosine formula
Example: W=10, T=100 (with η_max = 0.1, η_min = 0.001)
• Epoch 0 → LR = 0.000
• Epoch 5 → LR = 0.050
• Epoch 10 → LR = 0.100 ← peak
• Epoch 100 → LR = 0.001
💡 Virtually all Transformer models use warmup.

🔹 5. SGD + Momentum
The problem with plain SGD is oscillation:
Steep direction: +2 → -1.8 → +1.6 → ... (oscillation)
Flat direction: 2 → 1.8 → 1.6 → ... (slow)
Formula:
v_{t+1} = β·v_t + ∇L(w_t)
w_{t+1} = w_t − η·v_{t+1}
β=0.9 keeps 90% of the previous velocity.
Analogy: a snowball rolling downhill speeds up, because every step "remembers" the previous velocity.

🔹 6. Adam
| Optimizer | Per-parameter LR | Momentum |
| SGD+Momentum | ❌ | ✅ |
| RMSProp | ✅ | ❌ |
| Adam | ✅ | ✅ |
Kingma & Ba (2015) combined the two.
4 steps:
1. m_t = β₁·m_{t-1} + (1−β₁)·∇L ← momentum
2. v_t = β₂·v_{t-1} + (1−β₂)·∇L² ← squared gradient
3. m̂_t = m_t/(1−β₁ᵗ), v̂_t = v_t/(1−β₂ᵗ) ← bias correction
4. w_{t+1} = w_t − η/(√v̂_t + ε) · m̂_t ← update
Default values: β₁=0.9, β₂=0.999, η=0.001
💡 In a new project, start with Adam before considering other optimizers.

🔹 7. Which one when?
• Simple model → Adam + Cosine Annealing
• Transformer (GPT, BERT) → Adam + Warmup + Cosine
• Image classification → SGD + Momentum + Step Decay

🎯 Final summary
• LR Schedule: we control the LR over time
• Adaptive: each parameter adapts its own LR
• Step Decay → has abrupt jumps
• Cosine Annealing → smooth, modern
• Warmup → protection against training collapse
• Adam → today's default choice

🤝 YouTube lesson: 🎥 Link 🖥️ Colab notebook: 📂 Link 📘 All lessons: Link
🚨 The videos were recorded live. The math explanations may contain mistakes. Apologies in advance 🙏 @EldorML
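The schedules and the Adam update from lesson 6.4 translate directly into small functions. Two points in this sketch are assumptions rather than something the post states: the cosine phase after warmup is measured as (t − W)/(T − W), and `lr_min = 0.001` is chosen so that epoch 100 lands on the 0.001 shown in the post. `adam_step` works on a single scalar parameter.

```python
import math

def step_decay(epoch, lr0=0.1, gamma=0.5, step=20):
    # eta_t = eta_0 * gamma ** floor(t / S)
    return lr0 * gamma ** (epoch // step)

def warmup_cosine(epoch, lr_max=0.1, lr_min=0.001, warmup=10, total=100):
    if epoch <= warmup:
        return lr_max * epoch / warmup                   # linear warmup
    progress = (epoch - warmup) / (total - warmup)       # assumed: 0 -> 1 after warmup
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update for a scalar parameter; t is the 1-indexed step count.
    m = beta1 * m + (1.0 - beta1) * grad             # 1. momentum
    v = beta2 * v + (1.0 - beta2) * grad * grad      # 2. squared gradient
    m_hat = m / (1.0 - beta1 ** t)                   # 3. bias correction
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)    # 4. update
    return w, m, v

print([round(step_decay(e), 4) for e in (0, 20, 40, 60)])
print([round(warmup_cosine(e), 3) for e in (0, 5, 10, 55, 100)])

w, m, v = adam_step(1.0, grad=2.0, m=0.0, v=0.0, t=1)
print("after the first Adam step:", w)
```

On the very first step the bias-corrected ratio m̂/√v̂ is ±1, so Adam moves each parameter by almost exactly the learning rate regardless of the gradient's magnitude.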

Assalomu alaykum, friends! You have probably noticed that posts have been less frequent lately. The reason is some health problems, which is why I cannot be as active as before 🙏 Even so, I am not stopping the course and we will definitely see it through to the end. For now I am trying to publish at least 2 quality posts a week whenever possible. We will finish what we started. I apologize for the inconvenience, and thank you for your understanding 🤲

📌 Lesson 6.3. Loss Surface Curvature: how steep is the surface?
🎯 Deep Learning Mathematics @EldorML
In the previous lessons we talked about the gradient. Now the questions are:
❓ Why does the model oscillate in some regions?
❓ Why does a learning rate sometimes work and sometimes wreck training?
Answer: curvature.
🔹 1. Gradient vs Curvature
Gradient descent update rule:
w_{t+1} = w_t − η · ∇L(w_t)
Here:
w: parameter
η: learning rate
∇L: gradient (direction)
The gradient only tells us which way to step. Curvature tells us how quickly the slope changes, that is, how sharply the surface bends.
🔹 2. 1D example
Two functions:
L1(w) = w²
L2(w) = 10w²
Second derivative (curvature):
L1''(w) = 2
L2''(w) = 20
So:
• w² → gentle
• 10w² → steep
The larger the curvature, the more cautious the step has to be.
🔹 3. Connection to the learning rate
L(w) = 10w²
Gradient: dL/dw = 20w
Update: w_{t+1} = w_t − η · 20w_t = w_t(1 − 20η)
If η = 0.1: w_{t+1} = w_t(1 − 2) = −w_t
Result:
• The sign flips at every step
• Oscillation: |w| never shrinks
Any η > 0.1 gives |1 − 20η| > 1, so the iterates diverge.
Rule: large curvature → small learning rate
🔹 4. 2D example
L(w1, w2) = 5w1² + 0.5w2²
What does this mean?
• Steep along w1
• Flat along w2
The loss changes quickly along one direction and slowly along the other.
🔹 5. The Hessian matrix
In several dimensions curvature takes the form of a matrix:
H = [ ∂²L/∂w1²     ∂²L/∂w1∂w2 ]
    [ ∂²L/∂w2∂w1   ∂²L/∂w2²   ]
In our example:
H = [ 10  0 ]
    [  0  1 ]
The diagonal entries are the curvature along each axis.
🔹 6. What do the eigenvalues tell us?
The eigenvalues of the Hessian give the curvature along its principal directions.
In this example: λ1 = 10, λ2 = 1
• Large λ → steep
• Small λ → flat
If all λ > 0 at a point where the gradient is zero → that point is a local minimum.
🔹 7. Sharp vs Flat minimum
Sharp minimum:
• Large eigenvalues
• A small shift in the parameters changes the loss quickly
• Generalization tends to be worse
Flat minimum:
• Small eigenvalues
• Shifting the parameters barely changes the loss
• Generalization tends to be better
That is why flat minima are preferred when training large models.
🔹 8. Newton's method
Gradient descent: w_{t+1} = w_t − η · ∇L
Newton's method: w_{t+1} = w_t − H^{-1} · ∇L
Here H^{-1} shrinks the step where the surface is steep and enlarges it where the surface is flat.
Drawback: for large models, computing and inverting H is very expensive.
🔹 9. The key idea
Gradient → direction
Curvature → controls the step size
If curvature is ignored:
• Oscillation appears
• Divergence occurs
• Training slows down
Deep learning is geometry.
🎯 Final summary
• Curvature: how sharply the surface bends
• Hessian: the curvature matrix
• Eigenvalues: curvature along each principal direction
• Flat minimum: a stable model
🤝 YouTube lesson: 🎥 Link
🖥️ Colab notebook: 📂 Link
📘 All lessons: Link
🚨 The videos were recorded live. There may be mistakes in the mathematical explanations. Apologies in advance 🙏 @EldorML
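The numbers in Lesson 6.3 can be checked with a few lines of plain Python. This is only a sketch, not the code from the lesson's Colab notebook: the function names (`gd_path_1d`, `eigenvalues_sym_2x2`, `gd_step`, `newton_step`) and the learning rates 0.01, 0.11 and 0.15 are assumptions chosen for illustration. It reproduces the oscillation at η = 0.1 for L(w) = 10w², the eigenvalues of the example Hessian, and one Newton step on L(w1, w2) = 5w1² + 0.5w2².

```python
import math


def gd_path_1d(w0, lr, steps, second_derivative=20.0):
    """Gradient descent on L(w) = 0.5 * second_derivative * w**2.

    With second_derivative=20 this is L(w) = 10 * w**2, whose gradient is 20 * w.
    """
    w = w0
    path = [w]
    for _ in range(steps):
        grad = second_derivative * w
        w = w - lr * grad
        path.append(w)
    return path


def eigenvalues_sym_2x2(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]], largest first."""
    mean = 0.5 * (a + d)
    radius = math.hypot(0.5 * (a - d), b)
    return mean + radius, mean - radius


def gd_step(w, hess_diag, lr):
    """One gradient descent step on L(w) = 0.5 * sum(h_i * w_i**2)."""
    return [wi - lr * h * wi for wi, h in zip(w, hess_diag)]


def newton_step(w, hess_diag):
    """One Newton step for the same loss; H is diagonal, so H^-1 is 1 / h_i."""
    return [wi - (h * wi) / h for wi, h in zip(w, hess_diag)]


# Learning rate vs curvature: 0.01 converges, 0.1 flips sign forever, 0.11 diverges.
for lr in (0.01, 0.1, 0.11):
    print(lr, [round(w, 3) for w in gd_path_1d(1.0, lr, steps=4)])

# Hessian of L(w1, w2) = 5*w1**2 + 0.5*w2**2 is [[10, 0], [0, 1]].
print("eigenvalues:", eigenvalues_sym_2x2(10.0, 0.0, 1.0))

# Gradient descent moves fast along w1 and slowly along w2; Newton rescales both.
w = [1.0, 1.0]
for _ in range(10):
    w = gd_step(w, [10.0, 1.0], lr=0.15)
print("gradient descent after 10 steps:", w)
print("Newton after 1 step:", newton_step([1.0, 1.0], [10.0, 1.0]))
```

On a quadratic loss a single Newton step lands exactly on the minimum, which is why the sketch needs no learning rate there. On a real network the loss is not quadratic and H is far too large to form, so this only illustrates the idea.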

📌 Lesson 6.2. Weight initialization: how do you start a model correctly?
🎯 Deep Learning Mathematics @EldorML
In the previous lesson we looked at the gradient problem. Now the questions are:
❓ Why does a model sometimes not learn at all?
❓ Why does the signal vanish or explode in a deep network?
Answer: weight initialization.
🔹 1. The key formula
A single neuron: z = w₁x₁ + w₂x₂ + ... + wₙxₙ
If the inputs have zero mean and variance ≈ 1, and the weights are independent with zero mean:
Var(z) ≈ n × Var(W)
Example: n = 100
• Var(W) = 1 → Var(z) = 100 ❌ the signal explodes
• Var(W) = 0.0001 → Var(z) = 0.01 ❌ the signal vanishes
Ideal case: Var(W) ≈ 1 / n
👉 This is the core idea of initialization.
🔹 2. Xavier initialization (Tanh / Sigmoid)
Formula: Var(W) = 2 / (fan_in + fan_out)
Example: Linear(100, 50)
fan_in = 100
fan_out = 50
Var(W) = 2 / 150 ≈ 0.0133
std ≈ 0.115
Limit for a uniform distribution: a = sqrt(6 / 150) = 0.2
So: W ∈ [-0.2, 0.2]
👉 With Tanh the signal stays stable
🔹 3. He initialization (ReLU)
ReLU sets negative values to 0 → about half of the signal is lost. Therefore:
Var(W) = 2 / fan_in
Example: Linear(100, 50)
Var(W) = 2 / 100 = 0.02
std ≈ 0.141
a = sqrt(6 / 100) ≈ 0.245
So: W ∈ [-0.245, 0.245]
👉 Larger than Xavier
👉 Compensates for the signal that ReLU discards
🔹 4. Which one when?
Sigmoid → Xavier
Tanh → Xavier
ReLU → He
LeakyReLU → He
GELU → He
Rule of thumb:
"Activation that cuts off negative values → He"
"Symmetric activation → Xavier"
🔹 5. The key idea
If the initialization is wrong:
• Gradients are unstable
• Deep layers do not learn
• The model fails to train
Deep learning is a mathematical balance.
🎯 Final summary
• Initialization is the foundation of training
• Use He for ReLU
• Use Xavier for Tanh/Sigmoid
🤝 YouTube lesson: 🎥 Link
🖥️ Colab notebook: 📂 Link
📘 All lessons: Link
🚨 The videos were recorded live. There may be mistakes in the mathematical explanations. Apologies in advance 🙏 @EldorML
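The arithmetic in Lesson 6.2 can be reproduced with the standard library alone. The sketch below is an illustration, not the lesson's notebook: the helper names (`xavier_std`, `xavier_uniform_limit`, `he_std`, `he_uniform_limit`, `empirical_var_z`) are made up here, and the simulation assumes Gaussian inputs and weights with zero mean. The uniform limit follows from the fact that a uniform distribution on [-a, a] has variance a²/3, so a = sqrt(3 · Var(W)).

```python
import math
import random


def xavier_std(fan_in, fan_out):
    """Standard deviation for Var(W) = 2 / (fan_in + fan_out)."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def xavier_uniform_limit(fan_in, fan_out):
    """Limit a of U(-a, a) with the Xavier variance: a = sqrt(3 * Var(W))."""
    return math.sqrt(6.0 / (fan_in + fan_out))


def he_std(fan_in):
    """Standard deviation for Var(W) = 2 / fan_in."""
    return math.sqrt(2.0 / fan_in)


def he_uniform_limit(fan_in):
    """Limit a of U(-a, a) with the He variance."""
    return math.sqrt(6.0 / fan_in)


def empirical_var_z(n, weight_std, trials=1000, seed=0):
    """Estimate Var(z) for z = sum(w_i * x_i).

    Assumes x_i ~ N(0, 1) and w_i ~ N(0, weight_std**2), all independent.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        z = sum(rng.gauss(0.0, weight_std) * rng.gauss(0.0, 1.0) for _ in range(n))
        samples.append(z)
    mean = sum(samples) / trials
    return sum((z - mean) ** 2 for z in samples) / trials


# Linear(100, 50) from the examples in sections 2 and 3.
print("Xavier std, limit:", xavier_std(100, 50), xavier_uniform_limit(100, 50))
print("He std, limit:", he_std(100), he_uniform_limit(100))

# Var(z) is close to n * Var(W): about 100 for Var(W) = 1, about 1 for Var(W) = 1/n.
print("Var(W) = 1:    Var(z) is about", empirical_var_z(100, 1.0))
print("Var(W) = 0.01: Var(z) is about", empirical_var_z(100, 0.1))
```

The simulated variances are estimates from 1000 samples, so they land near 100 and 1 rather than exactly on them.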

📌 Lesson 6.1. Gradient stability: why doesn't the model learn?
🎯 Deep Learning Mathematics @EldorML
In earlier topics we saw how a model learns through backpropagation. In practice, however, the following often happens:
❓ The loss does not decrease
❓ The model does not learn at all
❓ Sometimes NaN suddenly appears
The main cause is the gradient stability problem.
🔹 1. How does the gradient propagate?
Backpropagation is based on the chain rule:
- Gradient = final error × product of the layer derivatives
So in a deep network the gradient is a product of very many numbers.
🔹 2. Vanishing gradient
If the derivative at every layer is < 1, say 0.5, then after L layers:
Gradient_L = (0.5)^L
Example:
L = 5 → 0.031
L = 10 → 0.00098
L = 20 → 0.00000095 (≈ 0)
👉 The early layers are barely updated
👉 The model effectively learns only in the last layers
This is especially pronounced with the sigmoid function: its maximum derivative is 0.25 → 0.25^10 ≈ 0.00000095, practically 0
🔹 3. Exploding gradient
If the derivative is > 1, say 1.5:
Gradient_L = (1.5)^L
Example:
L = 5 → 7.6
L = 10 → 57.7
L = 20 → 3325
👉 The weights jump wildly
👉 The loss may become NaN
🔹 4. Gradient clipping: the fix
If the gradient is too large, we scale it down:
if ||g|| > c: g_new = g × (c / ||g||)
Example:
g = [6, 8]
||g|| = 10
c = 5
New gradient = [3, 4]
👉 The direction is preserved
👉 Training becomes stable
🔹 5. The key idea
The gradient changes exponentially with depth, depending on the per-layer factor:
< 1 → vanishes
> 1 → explodes
≈ 1 → ideal training
That is why deep learning needs special techniques to work.
🎯 Final summary
• Deep networks do not learn on their own
• The problem is not in the optimization but in the mathematics
• Gradient stability is the heart of deep learning
🤝 YouTube lesson: 🎥 Link
🖥️ Colab notebook: 📂 Link
📘 All lessons: Link
🚨 The videos were recorded without preparation, so there may be mistakes in the mathematical explanations. My apologies 🙏 @EldorML
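The two calculations in Lesson 6.1, the exponential growth or decay of the gradient and clipping by norm, fit in a short sketch. The names `gradient_scale` and `clip_by_norm` are hypothetical helpers written for this illustration. The per-layer factors 0.5 and 1.5 and the example g = [6, 8], c = 5 come from the lesson.

```python
import math


def gradient_scale(per_layer_derivative, num_layers):
    """Size of a gradient that is multiplied by the same factor at every layer."""
    return per_layer_derivative ** num_layers


def clip_by_norm(grad, max_norm):
    """Rescale grad so its Euclidean norm is at most max_norm, keeping its direction."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)  # small enough already, leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grad]


# Vanishing (factor 0.5) and exploding (factor 1.5) gradients at different depths.
for depth in (5, 10, 20):
    print(depth, gradient_scale(0.5, depth), gradient_scale(1.5, depth))

# Clipping example from section 4: ||[6, 8]|| = 10 is above c = 5.
print(clip_by_norm([6.0, 8.0], 5.0))  # [3.0, 4.0]
```

Clipping rescales the whole vector by one factor, so the ratio between components (6 : 8 = 3 : 4) is unchanged. Only the step length is limited.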

📌 Lesson 5.9. Attention visualization: what is the model paying attention to?
🎯 Deep Learning Mathematics @EldorML
In the previous lessons we worked through the math of how the Attention mechanism operates in a Transformer. Now we come to a very important question:
❓ Which tokens is the model looking at?
❓ What does it consider important?
Answer: 👉 This can be seen through attention visualization.
🔹 1. Attention is an internal signal
Many people misunderstand it: ❌ Attention = explanation
In reality:
👉 Attention is the strength of the information flow
👉 It shows which token draws more on which other token
It is not a cause, but it is a strong signal.
🔹 2. What does the attention matrix mean?
Inside a Transformer the attention weights form a matrix:
• Rows: Query tokens
• Columns: Key tokens
• Entries: attention strength (softmax output)
Important:
👉 Each row sums to 1
👉 The diagonal need not always be the largest
This means: 👉 Tokens look not only at themselves but also at the context
🔹 3. Attention patterns
Attention does not always look the same. Different structures emerge inside the model:
• Local (nearby tokens)
• Long-range connections
• Special tokens ([CLS], [SEP])
• Semantic relations
👉 Structures tied to meaning show up especially in the deeper layers
🔹 4. Multi-Head Attention: why not just one?
A single attention head is not enough. Each head:
• learns a different relation
• looks from a different viewpoint
Example:
• one head: grammatical links
• another: semantic links
• yet another: long range
👉 That is why there are many heads
🔹 5. The most important warning ⚠️
Attention ≠ Explanation
That is:
• A large attention weight does not imply causation
• Visualizations must be interpreted with care
Scientific approach: 👉 Attention + other interpretability methods
🎯 Final summary
• Attention visualization is a window into the model
• Through these structures we see what the Transformer is learning
• But we do not confuse it with cause and effect
🤝 YouTube lesson: 🎥 Link
🖥️ Colab notebook (interactive demo): 📂 Link
📘 All lessons in the course: Link
🚨 The videos were recorded without preparation, so there may be mistakes in the speech or in the mathematical explanations. My apologies 🙏 @EldorML
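To make the attention matrix of Lesson 5.9 concrete, here is a small sketch that builds one with scaled dot-product attention and prints it as a text table. Everything in it is assumed for illustration: the tokens, the 2-dimensional query and key vectors, and the helper names (`softmax`, `attention_weights`, `render`) are invented, and a real model would supply learned vectors for every head.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """Rows = query tokens, columns = key tokens, entries = softmax(q . k / sqrt(d))."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    weights = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights.append(softmax(scores))
    return weights


def render(tokens, weights):
    """Return the weight matrix as a text table."""
    header = " " * 6 + "".join(f"{t:>6}" for t in tokens)
    rows = [
        f"{t:>6}" + "".join(f"{w:6.2f}" for w in row)
        for t, row in zip(tokens, weights)
    ]
    return "\n".join([header] + rows)


# Made-up toy vectors; they do not come from any trained model.
tokens = ["the", "cat", "sat"]
queries = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.5]]
keys = [[1.0, 0.0], [0.0, 2.0], [0.2, 0.2]]

weights = attention_weights(queries, keys)
print(render(tokens, weights))
print("row sums:", [round(sum(row), 6) for row in weights])
```

In the last row the largest weight falls on "cat" rather than on the diagonal entry for "sat", which is the off-diagonal pattern described in section 2. Real visualizations usually draw the same matrix as a heatmap, one per head.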