Data eXplore : Data Science, ML, Big Data, LLMs and AI Security


Exploring Data Science, Big Data Analytics & Visualization, ML/DL, Neural Networks, and LLMs with GitHub, Kaggle, HuggingFace, and some white papers by big institutions. Not just data, but the science behind data. Paid project? premodi@zohomail.in ★ @DataML


Country not specified · Technology & Apps · 47 504

Subscribers: 565
24 hours: no data
7 days: +4
30 days: +9

Post views: 119
24 hours: ~56
48 hours: ~68

Engagement rate: 21.10%

Posts per day: no data


Incoming and outgoing mentions

---

Subscriber acquisition

Month	Subscriber growth	Mentioned in channels
Aug '26		0
Jul '26	+23	0
Jun '26	+26	1
May '26	+54	1
Apr '26	+14	2
Mar '26	+53	10
Feb '26	+22	2
Jan '26	+28	2
Dec '25	+46	4
Nov '25	+128	4
Oct '25	+149	3
Sep '25	0	2
Aug '25	+99	0
Jul '25	0	0
Jun '25	0	0
May '25	0	0
Apr '25	0	1
Mar '25	0	0
Feb '25	0	0
Jan '25	+14	2
Dec '24	+11	0

Date	Subscriber growth	Mentions	Channels
01 Aug	+1

Channel posts

Gradient Stability in Long Sequences

How Does Spherical Gradient Protect Against Exploding and Vanishing Gradients in Online Time Series Learning?

In online time series learning, gradients tend to either explode or vanish on long sequences. LSTMs and GRUs handle this instability poorly: with streaming data, the model fails to adapt to new patterns because the gradient grows or collapses exponentially.

The Problem with Standard Gradient Clipping

Classic gradient clipping with a fixed threshold often fails in online mode. With short sequences, it truncates aggressively, losing information about rare events. With long sequences, it doesn't protect against vanishing gradients, because it only bounds the norm from above.

Spherical Gradient: Principle and Implementation

This approach normalizes the gradient at each step, fixing its length while preserving its direction. This is L2 normalization, which addresses both problems:
- The gradient doesn't explode because the norm is limited (e.g., 1.0).
- The gradient doesn't vanish because even when the norm is close to zero, it's restored to a fixed value.

Here's an example in PyTorch for a production scenario:

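import torch

def spherical_gradient_clip(grad, max_norm=1.0, eps=1e-8):
    """Rescale grad to L2 norm max_norm while keeping its direction."""
    norm = grad.norm()
    if norm < eps:
        # Direction is numerically undefined: fall back to tiny random noise
        return torch.randn_like(grad) * eps
    # Same rescaling for large and small norms, so the length is always fixed
    return grad * (max_norm / norm)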

Engineering Trade-offs in Production ML

Combine Spherical Gradient with layer normalization and gradient checkpointing when the sequence length exceeds 500 steps (finance, IoT, logistics). Note: gradient normalization increases latency by about 5-10%, but the more stable convergence pays off on real data. A common mistake is to apply Spherical Gradient to a Transformer without weight normalization, which breaks attention scores at high dimensionality.

Practical Advice for Validation

For online learning with streaming data, compare the variance of the gradients before and after applying Spherical Gradient on synthetic data with a sequence length of 500. In production, for time series Transformers, Spherical Gradient shows a 40-60% reduction in variance and accelerates loss convergence by 1.5 times compared to gradient clipping.

Conclusion: Normalize the gradient in spherical space rather than simply truncating it. This is the only way to maintain the stability of online learning on long sequences without losing sensitivity to rare events.

•••••••••••••••••••••••••••••••••••••• 🤖 Data & ML | @DataXplore
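The validation advice in the post above can be tried on toy data first. The snippet below is a minimal sketch, not the author's setup: it uses plain Python lists instead of tensors, the synthetic gradient generator is hypothetical, and it compares the variance of gradient norms (a simple proxy for gradient variance) under classic clipping and under fixed-norm rescaling.

```python
import math
import random
import statistics

def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))

def clip_by_norm(vec, max_norm=1.0):
    """Classic clipping: shrink only when the norm exceeds max_norm."""
    norm = l2_norm(vec)
    if norm > max_norm:
        return [v * max_norm / norm for v in vec]
    return list(vec)

def normalize_to_norm(vec, target_norm=1.0, eps=1e-12):
    """Spherical variant: rescale to target_norm whenever the norm is usable."""
    norm = l2_norm(vec)
    if norm < eps:
        return list(vec)  # direction undefined; left untouched in this sketch
    return [v * target_norm / norm for v in vec]

def synthetic_gradients(steps=500, dim=8, seed=0):
    """Hypothetical stand-in for per-step gradients with wildly varying scale."""
    rng = random.Random(seed)
    grads = []
    for _ in range(steps):
        scale = 10 ** rng.uniform(-4, 2)  # mimics vanishing and exploding phases
        grads.append([rng.gauss(0, 1) * scale for _ in range(dim)])
    return grads

grads = synthetic_gradients()
raw_norms = [l2_norm(g) for g in grads]
clipped_norms = [l2_norm(clip_by_norm(g)) for g in grads]
spherical_norms = [l2_norm(normalize_to_norm(g)) for g in grads]

print("raw variance      :", statistics.pvariance(raw_norms))
print("clipped variance  :", statistics.pvariance(clipped_norms))
print("spherical variance:", statistics.pvariance(spherical_norms))
```

Clipping leaves small norms untouched, so its norm variance stays above zero, while the rescaled norms all sit at the target value. Whether that helps convergence on a real model still has to be measured on the model itself.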

2	Delayed Feedback in CVR Models

How to avoid breaking conversion learning and evaluation due to delayed labels?

In CVR modeling, a user may convert minutes, days, or weeks after clicking, so labeling "as is" often turns future positive conversions into false negatives. This is critical for advertising, recommendations, marketplace funnels, and A/B tests, where the model is trained on incomplete logs.

Problem: the label may not yet be available.

For a click at time click_time = t, we want to estimate:

P(conversion | click, x)

But at the time the dataset is collected (T), we only know one of two things:
☞ Conversion has already occurred before T.
☞ Conversion is not yet visible.

The second case does not mean converted = 0. It's a censored observation: the user was not observed for long enough.

A typical mistake:

converted = 1, if conversion_time - click_time <= 7d
converted = 0, else

With this rule, fresh clicks have artificially low CVRs, the model learns from false negatives, offline metrics depend on the "maturity" of the data, and production calibration drifts: the model predicts the full-window CVR, while monitoring sees the partial-window CVR.

Baseline: Train only on mature data.

If the target horizon is a 7-day conversion, and data is available up to 2025-01-31, then for training, use clicks no later than 2025-01-24.

➜ Pros:
☞ Honest labels
☞ Simple validation
☞ Easy to debug the pipeline and identify data leakage

➜ Cons:
☞ Loss of fresh data
☞ Poorer adaptation to seasonality and traffic changes
☞ With a long conversion lag, the training data becomes significantly outdated.

Practical advice: Explicitly store the event_time, label_observed_until, horizon, and label_age fields in the feature store or training dataset. Without them, it's impossible to reproduce the labeling and understand why the CVR changed after retraining.

A more robust approach: Model the delay.

The problem can be broken down into the probability of conversion and the distribution of the delay:

P(y = 1, delay <= H | x) = P(y = 1 | x) · P(delay <= H | y = 1, x)

For example:
☞ The CVR model estimates the probability of conversion itself, P(y = 1 | x).
☞ The delay model estimates P(delay <= age | y = 1, x).

This approach is closer to survival analysis: there's an event, the time until the event, and censored observations. It is particularly useful if the delay depends on the product category, channel, geography, price, device, or retargeting strategy.

An alternative is a discrete hazard function:

P(conversion at day k | no conversion before day k, x)

In this case, a click observed only 2 days ago is still useful for training the first two steps, rather than being discarded entirely.

The trade-off is that the model and inference become more complex, but there's less data loss and more accurate handling of the long tail of conversions.

Evaluation: The test set should also be mature.

If the horizon = 7d, the holdout set should only contain clicks for which at least 7 days have passed. Otherwise, you're measuring the immaturity of the labels, not the quality of the model.

A good approach:

train: clicks [D0, D1]
validation: clicks [D2, D3]
label cutoff: >= D3 + horizon

In the production environment, also track metrics by delay bucket:
* 0-1 hour
* 1-24 hours
* 1-3 days
* 3-7 days
* 7 days+

This helps identify where the system is failing: in fast conversions, the long tail, data freshness, attribution, or censored labels.

This is especially important for A/B tests: an early readout can overestimate the effect of a model that performs well with fast conversions but underperforms over the full time window.

Conclusion: Delayed feedback in CVR (Conversion Rate) is not just a matter of labeling; it's an engineering limitation of the ML system. Without mature labels, a clear label cutoff, and proper validation, the model optimizes for logging artifacts instead of actual conversions.

•••••••••••••••••••••••••••••••••••••• 🤖 Data & ML | @DataXplore	127
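To make the censoring logic from the delayed-feedback post concrete, here is a minimal sketch under stated assumptions: the function names, the `None` marker for censored clicks, and the one-day hazard steps are illustrative choices, not something the post prescribes.

```python
from datetime import datetime, timedelta

def label_click(click_time, conversion_time, observed_until, horizon=timedelta(days=7)):
    """Return 1, 0, or None (censored) for one click."""
    if conversion_time is not None and conversion_time - click_time <= horizon:
        return 1      # converted inside the horizon
    if observed_until - click_time >= horizon:
        return 0      # mature: the full horizon was observed without a conversion
    return None       # censored: too fresh to be called a negative

def hazard_rows(click_time, conversion_time, observed_until, horizon_days=7):
    """Expand one click into (day_index, converted_that_day) rows for a discrete hazard model."""
    rows = []
    for day in range(horizon_days):
        day_start = click_time + timedelta(days=day)
        day_end = day_start + timedelta(days=1)
        if conversion_time is not None and day_start <= conversion_time < day_end:
            rows.append((day, 1))
            break
        if day_end > observed_until:
            break     # this day has not been fully observed yet
        rows.append((day, 0))
    return rows

T = datetime(2025, 1, 31)  # dataset collection time, as in the example above
mature_positive = label_click(datetime(2025, 1, 20), datetime(2025, 1, 22), T)
mature_negative = label_click(datetime(2025, 1, 24), None, T)
fresh_click = label_click(datetime(2025, 1, 29), None, T)
fresh_rows = hazard_rows(datetime(2025, 1, 29), None, T)
print(mature_positive, mature_negative, fresh_click, fresh_rows)
```

The click from 2025-01-29 is excluded from the plain 7-day label, but it still contributes two "no conversion yet" rows to the hazard formulation, which is the data-saving effect the post describes.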
4	Adding reactions to posts is disabled from now on. We are also reviewing the save and forward options.	205
5	Updated the encoder - broke the ANN? How to migrate embeddings without pain

In embedding-based systems, the encoder is part of the data contract. It cannot be updated like a regular ML model: the ANN index already contains vectors from the old space, and a common mistake is to assume compatibility because the dimension and metric are the same.

Why does compatibility break?

Even if the dimension is the same, the metric is still cosine, and the offline benchmark is better, the new encoder does not have to be compatible with the old index. After the update, the following change:
- geometry of the space;
- distribution of norms;
- local neighborhoods;
- ranking of nearest neighbors;
- calibration of scores;
- behavior of the ANN structure: HNSW/IVF/PQ were built for the old distribution.

The main anti-pattern: writing new documents with the new encoder into the old index with old embeddings. Such an index becomes mixed: some vectors live in one space, others in another. The ANN works formally, but the nearest neighbors no longer have correct semantics.

Versioning the embedding space as a production contract

You need to version not just the model_name, but the full contract:

embedding_version = encoder + tokenizer + pooling + normalization + dim + metric

If any of these has changed, it's a new version of the space.

Practical advice: keep the embedding_version next to the document, query, index, and retrieval logs. Otherwise, if recall or CTR degrades, you won't know which encoder was actually involved in serving the results.

Raising a new index and enabling dual-write

The old path:
docs_v1 -> embeddings_v1 -> ann_index_v1

The new path:
docs_v2 -> embeddings_v2 -> ann_index_v2

Even if the documents are the same, the embeddings must be recalculated with the new encoder. For ANN, this is a new corpus.

Importantly: the index parameters should also be tuned. For example, for HNSW, the old M, efConstruction, efSearch may not be optimal for the new distribution.
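One way to make the embedding_version contract above checkable in code is to hash all of its components into a single identifier. This is a minimal sketch: the class, the field values, and the 12-character hash length are assumptions for illustration, not part of any library.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class EmbeddingContract:
    """Illustrative container; fields mirror the contract listed in the post."""
    encoder: str
    tokenizer: str
    pooling: str
    normalization: str
    dim: int
    metric: str

    def version(self) -> str:
        # Stable short hash: changing any field yields a new embedding_version.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]

def assert_same_space(query_version: str, index_version: str) -> None:
    """Refuse to search an index that was built in a different embedding space."""
    if query_version != index_version:
        raise ValueError(
            f"embedding_version mismatch: query={query_version} index={index_version}"
        )

# Hypothetical encoder names; same dim and metric, different encoder.
contract_v1 = EmbeddingContract("enc-a", "tok-a", "mean", "l2", 768, "cosine")
contract_v2 = EmbeddingContract("enc-b", "tok-a", "mean", "l2", 768, "cosine")
print(contract_v1.version(), contract_v2.version())
```

Storing this value next to each document, query log, and index makes a mixed index detectable: a query encoded under one version can be rejected before it reaches an index built under another.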
During the migration, write new and updated documents to both versions:

on_document_upsert(doc):
    emb_v1 = encoder_v1(doc)
    emb_v2 = encoder_v2(doc)
    index_v1.upsert(doc.id, emb_v1)
    index_v2.upsert(doc.id, emb_v2)

This is more expensive in terms of compute and ingestion latency, but the old retrieval continues to work and the new index catches up with the current state. If v1 will soon be shut down, dual-write can be kept only until the cutover plus a short rollback window.

Backfill, shadow-read, and readiness criteria

For v2, we need to recalculate the embeddings of the entire corpus and upload them to the new index. Here, it's not about notebook metrics, but about engineering reliability:
- idempotency of tasks;
- control of lag;
- deduplication of upserts;
- checkpoints;
- separate limits on encoder and ANN ingestion;
- document count comparison between indexes;
- percentage of documents without v2 embeddings.

The migration is not ready until the new index covers the production corpus with an acceptable lag.

Before switching, enable shadow-read:

query -> encoder_v1 -> index_v1 -> results_v1
      -> encoder_v2 -> index_v2 -> results_v2

Show only v1 to the user, but compare:
- recall@k on labeled data;
- overlap@k between v1 and v2;
- NDCG/MRR if there are clicks or raters;
- p95/p99 latency;
- tail failures;
- score distribution;
- downstream metrics in ranking, recommendations, or RAG.

Warning: high overlap@k does not guarantee product improvement. The new retrieval may change diversity, freshness, coverage, and load on the next ranker.

It's better to do the cutover via a feature flag, with monitoring of quality, latency, error rate, and a quick rollback to ann_index_v1.

Conclusion: Updating the encoder is a migration of the embedding contract and ANN infrastructure, not a simple model replacement in the inference path.

•••••••••••••••••••••••••••••••••••••• 🤖 Data & ML | @DataXplore	290
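The shadow-read comparison from the migration post can be prototyped before any real index exists. The sketch below assumes that `search_v1` and `search_v2` are callables returning ranked document ids for a query; both are hypothetical stand-ins for the two retrieval paths, backed here by dictionaries.

```python
def overlap_at_k(results_v1, results_v2, k=10):
    """Share of the v1 top-k ids that also appear in the v2 top-k."""
    top_v1, top_v2 = set(results_v1[:k]), set(results_v2[:k])
    if not top_v1:
        return 0.0
    return len(top_v1 & top_v2) / len(top_v1)

def recall_at_k(results, relevant_ids, k=10):
    """Share of labeled relevant ids found in the top-k."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(results[:k]) & relevant) / len(relevant)

def shadow_report(queries, search_v1, search_v2, labels, k=10):
    """Compare both paths per query; only v1 results would be shown to users."""
    overlaps, recall_v1, recall_v2 = [], [], []
    for query in queries:
        r1, r2 = search_v1(query), search_v2(query)
        overlaps.append(overlap_at_k(r1, r2, k))
        if query in labels:  # recall only where labeled data exists
            recall_v1.append(recall_at_k(r1, labels[query], k))
            recall_v2.append(recall_at_k(r2, labels[query], k))
    mean = lambda xs: sum(xs) / len(xs) if xs else None
    return {"overlap@k": mean(overlaps), "recall_v1@k": mean(recall_v1), "recall_v2@k": mean(recall_v2)}

# Toy stand-ins for the two indexes.
index_v1 = {"q1": ["a", "b", "c", "d"], "q2": ["x", "y", "z", "w"]}
index_v2 = {"q1": ["a", "c", "e", "f"], "q2": ["x", "y", "z", "w"]}
report = shadow_report(["q1", "q2"], index_v1.get, index_v2.get, {"q1": ["a", "e"]}, k=4)
print(report)
```

In this toy run the overlap is only partial while recall on the labeled query improves, which illustrates the post's warning that overlap@k alone says little about quality.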
6	How to Find Harmful Training Examples Before Fine-Tuning: Influence Functions, TracIn, and Data Pruning in Production ML

In production ML, "bad" training examples can be costly: a cluster of mislabeled, outdated, or anomalous objects can consistently degrade fine-tuning on a fresh data set. A common mistake is to clean the dataset only with heuristics, without checking which samples actually increase the loss on the production-like validation set.

1️⃣ Influence Functions

Idea: estimate how the loss on the validation example z_val would change if we slightly increase the weight of the training example z_train.

I(z_train, z_val) ≈ - ∇L_val^T H^-1 ∇L_train

where H is the Hessian with respect to the model parameters.

If the influence is large and positive, upweighting the training example increases the validation loss, so the example is likely harming quality.

Pros:
- rigorous theoretical formulation;
- can associate specific training examples with specific model errors.

Cons:
- expensive H^-1;
- poorly scalable to large neural networks;
- sensitive to non-convexity, batchnorm/dropout, checkpoints, and Hessian approximation.

In production, we typically use approximations: LiSSA, conjugate gradients, low-rank approximation, or calculate the influence only for the last layer/head of the model.

2️⃣ TracIn

A more engineering-oriented approach: a training example is useful for the validation set if their gradients point in similar directions during training. It's harmful if they point in opposite directions.

TracIn(z_train, z_val) = Σ_c η_c · ∇L_train(θ_c) · ∇L_val(θ_c)

where θ_c are checkpoints and η_c is the learning rate at checkpoint c.

A strongly negative score means the training example is pulling the model in the opposite direction of what's useful for validation.
A mini sketch for the last layer:

for ckpt in checkpoints:
    model.load_state_dict(load(ckpt))
    g_val = mean_grad(model.head, val_loader)
    for i, batch in enumerate(train_subset):
        g_train = grad(model.head, batch)
        scores[i] += lr[ckpt] * dot(g_train, g_val)
harmful = argsort(scores)[:K]

Practical advice: calculate the score not on the entire validation set, but on important production slices: new users, rare classes, problematic regions, fresh drift, segments with high business value or SLA.

3️⃣ Data pruning before fine-tuning

Workflow:
1. Freeze a production-like validation set without leakage.
2. Train a baseline / fine-tune and save several checkpoints.
3. Calculate the influence or TracIn for train→val.
4. Check the top harmful samples:
- label noise;
- outdated distribution;
- conflicting duplicates;
- corrupted inputs;
- incorrect task/schema version.
5. Remove, downweight, or relabel them.
6. Repeat fine-tuning and check not only the overall metric but also the regression by segment.

Production example: before retraining a recommendation model on fresh logs, you might find old interactions with a changed product taxonomy, conflicting labels after a schema migration, or bot traffic that degrades the ranking loss on a fresh holdout.

4️⃣ Caution

Don't blindly remove all "harmful" examples. Sometimes they degrade the current validation, but are needed for long-tail robustness, fairness, or resilience to rare scenarios.

It's safer to start with top-K, do a human-in-the-loop audit, compare remove / downweight / relabel options, and look at the trade-off between quality, latency of recalculation, cost of labeling, reproducibility, and monitoring reliability.

Conclusion: Influence Functions and TracIn are useful not as a magic data cleaning tool, but as an engineering approach to make fine-tuning less sensitive to noise, outdated data, and conflicting labeling.

•••••••••••••••••••••••••••••••••••••• 🤖 Data & ML | @DataXplore	295
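The TracIn sketch in the post above relies on helpers it does not define, so here is a self-contained toy version. It is a minimal illustration under strong assumptions: a one-parameter logistic model, full-batch gradient descent, a constant learning rate, and a hand-made dataset in which the last training example is deliberately mislabeled.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_logloss(w, x, y):
    """Gradient w.r.t. w of the log loss for the model p = sigmoid(w * x)."""
    return (sigmoid(w * x) - y) * x

def train_with_checkpoints(train, lr=0.5, steps=5):
    """Full-batch gradient descent; stores the weight before each update."""
    w, checkpoints = 0.0, []
    for _ in range(steps):
        checkpoints.append(w)
        mean_grad = sum(grad_logloss(w, x, y) for x, y in train) / len(train)
        w -= lr * mean_grad
    return checkpoints

def tracin_scores(train, val, checkpoints, lr=0.5):
    """Sum over checkpoints of lr * grad_train * mean(grad_val)."""
    scores = [0.0] * len(train)
    for w in checkpoints:
        g_val = sum(grad_logloss(w, x, y) for x, y in val) / len(val)
        for i, (x, y) in enumerate(train):
            scores[i] += lr * grad_logloss(w, x, y) * g_val
    return scores

# Positive x should map to label 1; the last example breaks that rule.
train = [(2.0, 1), (1.5, 1), (-2.0, 0), (-1.5, 0), (1.8, 0)]
val = [(1.0, 1), (-1.0, 0)]

checkpoints = train_with_checkpoints(train)
scores = tracin_scores(train, val, checkpoints)
harmful = sorted(range(len(train)), key=lambda i: scores[i])[:1]  # most negative first
print(harmful, [round(s, 4) for s in scores])
```

The mislabeled example ends up with the only negative score, because its gradient opposes the validation gradient at every checkpoint. On a real network the same loop would typically run over last-layer gradients only, as the post suggests.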
7	Temporal leakage in the feature store: How point-in-time joins, backfills, and feature causality checks save a model from beautiful offline metrics and failure in production?

Temporal leakage in the feature store is one of the most expensive ways to get great offline metrics and a useless model in production. The problem isn't that the feature is bad, but that during training it knows more than the model would have known at the moment of decision-making.

Example: we predict churn on date t, but the features include transactions_last_30d, calculated after a backfill from a table where transactions arrived with a delay or were recalculated with future fixes. Offline, everything looks great. Online, performance slumps.

1️⃣ Point-in-time join - basic protection

For each training row, there is a prediction_time. The features should be in the state they were in at that moment.

It's important to distinguish:
- event_time - when the event actually happened;
- ingestion_time / created_at - when it entered the system;
- available_at - when the feature became available to the model;
- prediction_time - the moment of prediction.

The correct join should take into account not only event_time <= prediction_time, but also available_at <= prediction_time:

WITH ranked_features AS (
    SELECT
        l.entity_id,
        l.prediction_time,
        f.feature_value,
        ROW_NUMBER() OVER (
            PARTITION BY l.entity_id, l.prediction_time
            ORDER BY f.event_time DESC
        ) AS rn
    FROM labels l
    JOIN features f
        ON f.entity_id = l.entity_id
        AND f.event_time <= l.prediction_time
        AND f.available_at <= l.prediction_time
)
SELECT *
FROM ranked_features
WHERE rn = 1;

If there is no available_at, you often can't prove that there is no leakage.

2️⃣ Backfills - a hidden source of leakage

Backfills are dangerous because they create the illusion of historical completeness. For example, today you recalculated a feature for the past year:
- corrected old events;
- added data from a new source;
- changed the business logic;
- caught up with late-arriving events;
- used a reference that wasn't available at the time.

As a result, the training set gets a history that didn't actually exist at the moment of prediction.

A correct backfill should answer the question: what feature would the model have seen then if the pipeline had worked with the same delays, sources, and availability rules?

If the answer is unknown, it's not historical truth, but reconstructed truth. For model training, these are different things.

3️⃣ Checking the causality of features

Before training, every feature should be run through a causality review.

➡️ Minimum checklist:
1. Is the feature available before prediction_time? What matters is not that the event happened, but that the feature value was available.
2. Is there a label proxy in the feature? For example, days_since_last_payment_failed for a default task might be almost a direct consequence of a future target.
3. Is the aggregation window strictly in the past? last_7d should mean [t-7d, t), not a calendar week that includes the future relative to t.
4. Are there future-aware reference tables? Segments, statuses, limits, antifraud flags, and CRM attributes are often backfilled.
5. Is the source latency taken into account? If the data arrives with a 6-hour delay, you can't use an event at 09:55 for a prediction at 10:00.

In production ML, a feature is considered valid not when it's historically correct, but when it's demonstrably available to the model at the moment of decision-making.

•••••••••••••••••••••••••••••••••••• 🤖 Data & ML | @DataXplore	210
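The same point-in-time rules from the temporal leakage post can be unit-tested outside the warehouse. Below is a minimal in-memory sketch: the row layout (dictionaries with `entity_id`, `event_time`, `available_at`, `feature_value`) follows the column names used in the SQL, while the function names and sample values are made up for illustration.

```python
from datetime import datetime, timedelta

def point_in_time_value(feature_rows, entity_id, prediction_time):
    """Latest value whose event AND availability both precede prediction_time."""
    candidates = [
        row for row in feature_rows
        if row["entity_id"] == entity_id
        and row["event_time"] <= prediction_time
        and row["available_at"] <= prediction_time
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda row: row["event_time"])["feature_value"]

def count_in_strict_window(event_times, t, window=timedelta(days=7)):
    """Count events in [t - window, t): the right edge is excluded."""
    return sum(1 for e in event_times if t - window <= e < t)

t = datetime(2025, 1, 10, 10, 0)
latency = timedelta(hours=6)  # source delay, as in checklist item 5
features = [
    {"entity_id": "u1", "event_time": datetime(2025, 1, 9, 10, 0),
     "available_at": datetime(2025, 1, 9, 10, 0) + latency, "feature_value": 2.0},
    {"entity_id": "u1", "event_time": datetime(2025, 1, 10, 9, 55),
     "available_at": datetime(2025, 1, 10, 9, 55) + latency, "feature_value": 3.0},
]

value_at_t = point_in_time_value(features, "u1", t)
events = [t - timedelta(days=8), t - timedelta(days=7), t - timedelta(hours=1), t]
events_last_7d = count_in_strict_window(events, t)
print(value_at_t, events_last_7d)
```

The 09:55 event happened before the prediction but only becomes available six hours later, so the join falls back to the previous day's value. A check on `event_time` alone would have leaked it.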
8	Conformal intervals in production ML with covariate shift: How to maintain coverage without unnecessarily wide predictions?

Split conformal works well under exchangeability: train, calibration, and test data come from the same distribution. In production, this often breaks down due to geo, devices, channels, seasonality, or a change in the acquisition mix, and a common mistake is to simply expand the intervals "with a margin".

What exactly breaks down?

With covariate shift, we have:

p_prod(x) != p_cal(x)

but we assume that p(y|x) approximately holds. If we calculate the usual conformal quantile on the old calibration set, coverage on current traffic might drop.

The naive solution is to globally increase the correction. Coverage will partially recover, but the price prediction interval, ETA interval, or forecast band will become so wide that downstream systems will no longer trust them.

Basic production recipe

1️⃣ Train a quantile model: q_low(x), q_high(x)
2️⃣ Calculate nonconformity scores on the calibration set: s_i = max(q_low(x_i) - y_i, y_i - q_high(x_i), 0)
3️⃣ Estimate importance weights: w_i ~= p_prod(x_i) / p_cal(x_i)
4️⃣ Take tau as the weighted (1 - alpha) quantile of the scores instead of the usual unweighted one.
5️⃣ For a new object, construct: C(x) = [q_low(x) - tau, q_high(x) + tau]

Minimal skeleton:

import numpy as np

def weighted_quantile(values, weights, q):
    order = np.argsort(values)
    v = np.asarray(values)[order]
    w = np.asarray(weights)[order]
    cw = np.cumsum(w)
    return v[np.searchsorted(cw, q * cw[-1])]

alpha = 0.1
# np.maximum compares exactly two arrays, so the calls are nested to clamp at 0
scores = np.maximum(np.maximum(q_low_cal - y_cal, y_cal - q_high_cal), 0)
weights = ratio_model.predict_weight(X_cal)
tau = weighted_quantile(scores, weights, 1 - alpha)
low = q_low_prod - tau
high = q_high_prod + tau

This way, the calibration distribution becomes closer to the production distribution without unnecessarily widening all intervals.

How not to get too wide intervals?

One global tau often overestimates uncertainty if the model error strongly depends on x. What helps in practice:
- CQR instead of point prediction: Conformalized Quantile Regression already models heteroscedastic uncertainty, so the conformal correction is usually smaller.
- Normalized score: for example s_i = |y_i - y_hat_i| / sigma_hat(x_i), and the interval is constructed as y_hat(x) +- tau * sigma_hat(x).
- Local calibration: a separate tau per geo, device, channel, price bucket, or risk bucket. This is close to Mondrian conformal, but requires a sufficient number of calibration examples in each segment.
- Rolling calibration buffer: for recommendations, scoring, and forecasting, an old calibration set quickly stops describing the current traffic mix.

Main risk - bad weights

The density ratio model can be noisy. A few objects with huge weights effectively "replace" the entire calibration set. Control:

ESS = (sum w)^2 / sum(w^2)

If ESS is low, the weighted quantile is unstable and intervals start jumping from release to release.

Practical measures:
- clip weights and monitor the proportion of clipped weights;
- smooth the density ratio;
- merge rare segments;
- do not calibrate a segment where there are few fresh labels;
- run recalibration when ESS drops or the distribution of X drifts.

Production checklist
- a separate calibration set, not mixed with training;
- drift detection on the feature distribution;
- density ratio model between prod traffic and calibration traffic;
- weighted conformal calibration;
- monitor coverage, average width, coverage by slices, ESS, and latency;
- alerts on increasing interval width without increasing error;
- A/B validation if intervals affect routing, fallback, or human review.

It's important not to confuse marginal and conditional coverage. Conformal can maintain 90% coverage on the stream on average, but fail in individual microsegments. This needs to be explicitly checked in production.
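The ESS check and the weight clipping mentioned above are easy to sanity-check on toy numbers. This is a minimal sketch in plain Python: the function names, the clipping threshold of 5.0, and the toy weights are illustrative assumptions.

```python
def effective_sample_size(weights):
    """ESS = (sum w)^2 / sum(w^2); equals len(weights) when all weights are equal."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    if total_sq == 0:
        return 0.0
    return total * total / total_sq

def clip_weights(weights, max_weight):
    """Cap each weight and report the share of weights that were clipped."""
    clipped = [min(w, max_weight) for w in weights]
    clipped_share = sum(1 for w in weights if w > max_weight) / len(weights)
    return clipped, clipped_share

# 99 ordinary calibration points plus one with a huge density ratio.
weights = [1.0] * 99 + [100.0]
ess_raw = effective_sample_size(weights)
clipped, clipped_share = clip_weights(weights, max_weight=5.0)
ess_clipped = effective_sample_size(clipped)
print(round(ess_raw, 2), round(ess_clipped, 2), clipped_share)
```

A single extreme weight collapses the effective calibration set from 100 points to fewer than 4, and clipping restores most of it. The price is some bias in the weights, which is why the post recommends monitoring the clipped share.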
With covariate shift, the goal is not to blindly widen the intervals, but to calibrate them to the current mix of production objects and monitor the reliability of this calibration.

••••••••••••••••••••••••••••••••••••• 🤖 Data & ML | @DataXplore	198
9	Corpus drift in RAG systems: How to notice the degradation of retrieval without labels, annotations, and obvious errors?

In RAG retrieval, things often break silently: same model, same embedding model, same prompt, normal latency, but the answers have gotten worse. A typical mistake is to immediately tweak the prompt or blame the LLM, even though the problem lies deeper: the corpus has changed.

1️⃣ Monitor the corpus drift itself

We don't directly measure quality, but we look at how the space in which the retriever operates has changed:
- distribution of chunk embeddings;
- average chunk length, overlap, number of chunks per document;
- proportion of new, deleted, and modified chunks;
- duplicates and near-duplicates;
- distribution of domains, document types, languages, dates;
- density of the embedding space: whether many chunks have "clumped" together.

If the corpus has noticeably shifted, old retrieval thresholds and expectations of top-k might become useless, especially if the confidence logic is tied to the score or to the gap between top-1 and top-2.

2️⃣ Anchor queries instead of labels

In production, there are almost never labels like "these chunks are relevant for this query". But we can take a stable set of production queries: for example, 500-5,000 frequent or business-critical queries.

This isn't annotation. We don't know the correct chunk. But we know that the retrieval behavior shouldn't change chaotically after each corpus update.

For each anchor query, save the baseline:
- top-k doc/chunk ids;
- retrieval scores;
- rank positions;
- gap between top-1 and top-2;
- diversity of top-k;
- source distribution.

After the corpus update, compare the new retrieval with the baseline. Useful proxy metrics:
- Jaccard@k between the old and new top-k;
- p95_top1_score_drop;
- score_wasserstein between the baseline and current scores.
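Two of the proxy metrics listed above can be computed with a few lines of plain Python. The sketch assumes that each snapshot is a dictionary mapping an anchor query to a ranked list of (doc_id, score) pairs; that layout and the function names are hypothetical. The Wasserstein distance is left out because it would need an extra library.

```python
import math

def jaccard_at_k(old_ids, new_ids, k=10):
    """Jaccard similarity between the old and new top-k id sets."""
    old, new = set(old_ids[:k]), set(new_ids[:k])
    union = old | new
    return len(old & new) / len(union) if union else 1.0

def percentile(values, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def drift_report(baseline, current, k=10):
    """Compare a stored baseline snapshot with retrieval after a corpus update."""
    jaccards, top1_drops = [], []
    for query, old in baseline.items():
        new = current.get(query, [])
        jaccards.append(jaccard_at_k([d for d, _ in old], [d for d, _ in new], k))
        if old and new:
            top1_drops.append(old[0][1] - new[0][1])  # positive means the score fell
    return {
        "mean_jaccard@k": sum(jaccards) / len(jaccards) if jaccards else None,
        "p95_top1_score_drop": percentile(top1_drops, 95) if top1_drops else None,
    }

baseline = {"q1": [("a", 0.9), ("b", 0.8)], "q2": [("c", 0.7), ("d", 0.6)]}
current = {"q1": [("a", 0.85), ("b", 0.8)], "q2": [("c", 0.5), ("e", 0.4)]}
report = drift_report(baseline, current, k=2)
print(report)
```

Running this per segment (source, language, document type) instead of once globally matches the post's advice that a global average can hide a degraded slice.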
3️⃣ How to interpret the signals

- mean_jaccard@10 has dropped sharply: the retriever has started returning different context;
- the top-1 score systematically drops: the queries are matching the corpus less well;
- the score distribution has shifted significantly: old thresholds and confidence logic might have broken.

Practical advice: don't just look globally, but also by segments - sources, languages, document types, product domains. A global average easily hides degradation in a critical segment.

4️⃣ Retrieval confidence without ground truth

Even without annotations, you can look at the "confidence" of the retriever:
- high top-1 score;
- large gap between top-1 and top-2;
- consistency of dense retrieval and BM25;
- stability of top-k under query rewriting;
- low proportion of duplicates in top-k;
- coverage of needed sources.

If dense and lexical retrieval suddenly start diverging, don't just chalk it up to noise. Often, this means that the corpus or queries have changed in a way that one of the strategies no longer works as before.

Production minimum for RAG:
- store a snapshot of retrieval results for anchor queries;
- calculate overlap, score drift, and rank churn after each corpus update;
- monitor duplicates, new chunks, and source distributions separately;
- set alerts not on a single query, but on aggregates by segments.

Corpus drift is annoying because it doesn't look like a crash. The system responds, there are no errors, and the latency is normal. It's just that the context has become slightly less relevant. Then a little more. And the RAG quality slowly declines.

Conclusion: without labels, you can't honestly measure relevance, but you can monitor the stability of retrieval behavior, the retriever's confidence, and corpus changes to catch degradation before users do.

••••••••••••••••••••••••••••••••••••• 🤖 Data & ML | @DataXplore	202
12	ML process template that we use in the ML Core team. Maybe the template will make it easier for you to start putting together a doc: 1️⃣ Create a scheme of the main development stages adopted in your team, for example: • problem statement; • data exploration; • formulating the task in ML terms; • MVP solution; • testing the solution; • rolling it out to production; • monitoring. 2️⃣ Describe each stage: • what needs to be done; • what the result should look like to proceed to the next step. Try to avoid long texts; use diagrams, tables, infographics. 3️⃣ Add to each stage: • templates that will allow you to complete this stage faster; • useful tips; • standards and requirements, if any; • links to resources, articles, documents that can help at this stage; • answers to popular questions; • documentation requirements: what needs to be described, and where, for the stage to count as completed. After putting together the ML process, don't forget to request feedback from colleagues who didn't participate in its development. Also let everyone who might be interested know that a new useful tool has appeared. And remember, the ML process can't be written once and for all. The practices adopted in the company change, new tools appear to replace or supplement the old ones, versions are updated. It's important to regularly keep your ML process up to date and adapt it to new needs. It's not you who adapts to the process, but the process that adapts to you! •••••••••••••••••••••••••••••••••••••• 🤖 Data & ML \| @DataXplore	214
13	In the previous post, I explained what conformal prediction is and why it's needed. But you probably have some questions: 1. How does the model understand how "strange" or "risky" a particular object is? 2. How is the conformal predictor trained? The answer to the first question: through the nonconformity measure (the "measure of discomfort"). The nonconformity measure is a function that shows how poorly a particular pair (x, y) fits the model and the already known data. Nonconformity measures can be simple functions, such as the absolute error or the hinge score: nonconformity_mae = \|y_true - y_pred\| nonconformity_hinge = 1 - P(true_class) or more complex ones, such as the Brier score. An intuitive example Suppose the model classifies images. For a normal picture, the model says: Barbie: 0.02 Ken: 0.97 Oppenheimer: 0.01 And for a blurry picture of an animal in the forest: Barbie: 0.2 Ken: 0.42 Oppenheimer: 0.38 A regular classifier will choose the Ken class in both cases. Conformal prediction in the second case may say: {Ken, Oppenheimer} because the nonconformity scores of these classes are not high enough to reject them outright. Next, let's talk about how to train it. TCP: Transductive Conformal Prediction TCP, strictly speaking, is not "trained" like a regular model. It's better to put it this way: in TCP, for each new object and each possible answer, we temporarily add this answer to the training set, retrain or re-evaluate the model, and check how "uncomfortable" this answer is relative to the rest of the data. Let's consider the TCP algorithm step by step. Suppose there is a sample: D = {(x1, y1), ..., (xn, yn)} and a new object x_new. • Step 1. Take one of the classes, for example, Barbie. Make an assumption: y_new = Barbie • Step 2. Add it to the existing dataset: D_Barbie = D ∪ {(x_new, Barbie)} • Step 3. Train the model on the new set. • Step 4. Calculate the nonconformity score, for example, the hinge score, for all objects, including the new one. • Step 5. 
Compare the new object with the rest. See how the nonconformity score of the new object ranks relative to the scores of the other objects in D_Barbie. Simplified: p_value(Barbie) = the proportion of objects in D_Barbie with a score ≥ score_x_new Steps 1 to 5 are repeated for each class. The final prediction set consists of the classes for which p_value > α, where α is the desired significance level. ICP: Inductive Conformal Prediction Experienced ML engineers, having read the previous part, are probably horrified. To predict for 1000 objects with 10 classes, we would need 10,000 retrains of the model! The ICP method solves this problem at the cost of setting aside a separate calibration set: • Step 1. Split the data into train and calibration: D_train D_calibration • Step 2. Train the model on D_train. After this, the model is no longer retrained for each new object. • Step 3. Calculate the nonconformity scores on D_calibration. For each object from D_calibration, calculate how poorly the model predicted the correct answer. We get a set of calibration scores: scores = [s1, s2, ..., sm] • Step 4. Set the significance level α and select the threshold q. Choose q so that the required proportion (1 - α) of the calibration scores does not exceed it. Simplified, for α = 0.1: q = the 90th percentile of the calibration scores • Step 5. Apply to the new object. For the new object, calculate the score for each possible class: score(Barbie) = 1 - P(Barbie) score(Ken) = 1 - P(Ken) score(Oppenheimer) = 1 - P(Oppenheimer) The prediction set will include those classes for which score ≤ q. The main difference from TCP: ICP trains the model once and calibrates the threshold once. After that, for new objects, it uses the ready-made model and the ready-made calibration, so it works much faster. •••••••••••••••••••••••••••••••••••••• 🤖 Data & ML \| @DataXplore	269
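The ICP steps above can be sketched in plain Python. This is a minimal illustration, not the author's implementation: the function names and the toy calibration probabilities are made up, and the threshold uses the common finite-sample rank ceil((m + 1)(1 - α)) rather than the simplified "90th percentile" wording.

```python
import math


def calibrate_threshold(true_class_probs, alpha=0.1):
    """Steps 3-4: turn calibration probabilities into a score threshold q.

    true_class_probs holds P(true class) for each calibration object,
    so the nonconformity score is 1 - P(true class).
    """
    scores = sorted(1.0 - p for p in true_class_probs)
    m = len(scores)
    rank = math.ceil((m + 1) * (1 - alpha))  # finite-sample correction
    if rank > m:
        # Too few calibration points for this alpha: accept every class.
        return float("inf")
    return scores[rank - 1]


def prediction_set(class_probs, q):
    """Step 5: keep every class whose score 1 - P(class) is at most q."""
    return {cls for cls, p in class_probs.items() if 1.0 - p <= q}


# Hypothetical calibration set: P(true class) for nine held-out objects.
calibration = [0.97, 0.90, 0.85, 0.80, 0.75, 0.70, 0.60, 0.50, 0.35]
q = calibrate_threshold(calibration, alpha=0.1)

clear_picture = {"Barbie": 0.02, "Ken": 0.97, "Oppenheimer": 0.01}
blurry_picture = {"Barbie": 0.20, "Ken": 0.42, "Oppenheimer": 0.38}

clear_set = prediction_set(clear_picture, q)    # {"Ken"}
blurry_set = prediction_set(blurry_picture, q)  # {"Ken", "Oppenheimer"}
```

With these made-up calibration numbers the confident picture gets a single class and the blurry one gets two, which matches the intuitive example from the post. The model itself is never retrained here: only the threshold `q` is reused.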
14	EVOLUTION OF ATTENTION: Transition to Linear Models This is a continuation of the series of posts about the path to linear Attention. Last time, we found out that transformers have quadratic complexity, which makes them scale poorly to long sequences and require a lot of memory. ➡️ How does it work? 📌 Linear Attention first changed the approach to computation. It decomposes the kernel function: instead of exp(QKᵀ) it uses φ(Q) · φ(K)ᵀ. As a result, the calculations can be rearranged: • first calculate φ(K)ᵀ · V; • then apply φ(Q). But the main thing: the complexity became linear in the length of the sequence (instead of quadratic). The model worked faster and more efficiently in terms of memory, but at the same time lost accuracy. The usual Attention stores tokens separately, while linear Attention aggregates the information into a general "summary" of the context. Thanks to this, the model understands the general meaning well, but remembers the details poorly. The task was: to maintain the efficiency of linear Attention, but to bring back local context. 📌 This problem was partially solved by the RWKV architecture. It did not abandon the idea of compact memory, but added mechanisms that make it more sensitive to the current context: • Token shift Instead of considering a token in isolation, the model mixes it with the previous state. Therefore, each new step contains information about the nearest context. • Memory management Memory in RWKV does not just accumulate. At each step, some of the old information is forgotten, and new information is added. If nothing is forgotten, the memory will quickly turn into noise. And if forgetting is too aggressive, the context will be lost. The model learns to find a balance between the extremes on its own. • Gate at retrieval When it's necessary to retrieve information from memory, a gate is used. It works like a filter: it looks at the current token and decides which parts of the memory are important now and which can be ignored. 
RWKV became a kind of hybrid of previous ideas. It did not make the model as accurate as classic Attention, but it did bring back local context. In parallel with it, other architectures were developing: SSM and Mamba. More about them in the next posts. •••••••••••••••••••••••••••••••••••••• 🤖 Data & ML \| @DataXplore	198
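The reordering trick behind linear Attention is easy to check numerically. Below is a small NumPy sketch for the non-causal case. The feature map φ(x) = elu(x) + 1 is one common choice and is an assumption here, since the post does not say which φ is used.

```python
import numpy as np


def phi(x):
    """Positive feature map elu(x) + 1 (an assumed choice of phi)."""
    return np.where(x > 0, x + 1.0, np.exp(x))


rng = np.random.default_rng(0)
n, d = 6, 4  # sequence length and head dimension
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))

# Quadratic order: build the full n x n similarity matrix first.
A = phi(Q) @ phi(K).T                                 # shape (n, n)
out_quadratic = (A @ V) / A.sum(axis=1, keepdims=True)

# Linear order: summarize keys and values once, then apply the queries.
KV = phi(K).T @ V                                     # shape (d, d), the context "summary"
Z = phi(K).sum(axis=0)                                # shape (d,), normalizer
out_linear = (phi(Q) @ KV) / (phi(Q) @ Z)[:, None]

same_result = np.allclose(out_quadratic, out_linear)
```

Both orders give the same output because matrix multiplication is associative, but the second one never materializes an n × n matrix: everything the queries see is the fixed-size `KV` summary, which is also why fine details of individual tokens get blurred.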
15	EVOLUTION OF ATTENTION: From RNN to Transformer Starting a series of posts to explain the path to linear Attention. Before transformers, tasks like translation and classification mainly used recurrent models (RNNs). In 2014, the Attention mechanism appeared. It allowed the model not just to read a text sequentially, but to look at all input tokens and assess which of them were important for generation. ➡️ How did it work? A bidirectional RNN encoded the input sequence → for each decoder step, the model calculated the relevance of the input tokens → obtained weights via softmax → based on them, formed a context for generating the next token. This led to a significant improvement in machine translation quality. However, the main problem with RNNs remained - they performed poorly on long sequences. To "understand" a word, the model had to process all the preceding text step by step before reaching it. Transformers became the next step in evolution To see the entire sequence at once and better model the dependencies between tokens, a number of changes were made: • Recurrence was abandoned - the sequence is processed in parallel. • Self-attention was added - in addition to attention between the encoder and the decoder, attention is now applied within the encoder and within the decoder. • Bahdanau (additive) Attention was replaced with dot-product Attention - instead of scoring with a single-layer perceptron, a scaled dot product of queries and keys is used, with Q, K, V obtained from trainable projection matrices. However, a new problem arose: Attention has quadratic complexity in terms of sequence length. This means that as the context increases, memory and computations grow very quickly. There were various attempts to fix this: reducing the number of key/value heads to shrink the KV cache; computing Attention over only a portion of the sequence; writing kernels for efficient Attention calculation (for example, Flash Attention). These methods accelerated the calculations, but didn't change the Attention formula itself. More about how this limitation was overcome in the next post. •••••••••••••••••••••••••••••••••••••• 🤖 Data & ML \| @DataXplore	218
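To make the quadratic cost concrete, here is a minimal single-head sketch of scaled dot-product self-attention in NumPy. The shapes and the random projection matrices `Wq`, `Wk`, `Wv` are illustrative assumptions, and masking and multiple heads are left out.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention, no masking."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # shape (n, n): the quadratic part
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights


rng = np.random.default_rng(0)
n, d_model, d_head = 8, 16, 4
X = rng.normal(size=(n, d_model))            # n token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))

output, weights = self_attention(X, Wq, Wk, Wv)
```

The `weights` matrix has one row and one column per token, so doubling the sequence length quadruples its size. Kernel-level optimizations avoid storing it all at once, but the formula still touches every query-key pair.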
16	Data leakage is one of the main reasons why ML demos look impressive... and then fail in production. The model didn't become smarter. It just happened to see the correct answers in advance. ➡️ Let's break it down: in 4 minutes you'll understand where data leaks hide. 1. Data Leakage Data leakage occurs when information that won't be available at the time of actual prediction is used during model training. Because of this, validation metrics can look much better than the actual quality of the model on new, previously unseen data. 2. Model Evaluation The test set isn't just "additional data". It's a simulation of the future. Train the model only on information that would have been available to you at the time of prediction. Evaluate it on examples that couldn't have influenced the model during training. 3. Direct Leakage This is the most obvious type of leakage. Examples: - a field with information from the future; - an ID that encodes the target variable; - a variable that appears only after an event has occurred; - duplicate records in both the training and test sets. If a feature doesn't exist at the time of inference (prediction), then it's likely a source of data leakage. 4. Indirect Leakage This is the type of leakage that most often traps teams. You perform normalization, imputation, feature selection, outlier removal, or dimensionality reduction before splitting the data into a training and test set. The model didn't directly see the data from the test set. But your preprocessing pipeline already saw it. 5. Train/Test Split: Wrong: fit the scaler on all data → split the data → evaluate Right: split the data → fit the scaler only on the training set → apply it to both the training and test sets The same idea applies to imputers, encoders, feature selection, PCA, and any preprocessing step that is fit on the data. 6. Cross-Validation: Each fold is a mini-experiment with a training and test set. 
Therefore, preprocessing should be performed within each fold. If you prepared the entire dataset once and then ran cross-validation, each fold would already have had access to its held-out data. 7. Pipelines: A pipeline isn't just a way to make the code cleaner. It's also a defense against data leakage. Combine preprocessing, feature selection, and the model into a single pipeline, and then pass this pipeline to cross-validation or hyperparameter search (grid search). 8. AI Engineering Version: Data leaks also occur in RAG systems and when evaluating LLMs. Leakage occurs when you tune chunks, prompts, re-rankers, thresholds, or examples on the same evaluation dataset that you later present as "held-out". As a result, your benchmark turns into training data. 9. Leakage Checklist: Before trusting the obtained metric, ask yourself: Could this feature exist at the time of prediction? Was any transformation step fit on the test data? Did cross-validation include the entire pipeline? Were we tuning parameters on the final evaluation dataset? If any of the answers points to a leak, then the metric likely doesn't reflect the actual quality of the model. •••••••••••••••••••••••••••••••••••••• 🤖 Data & ML \| @DataXplore	232
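Points 5 to 7 can be illustrated with scikit-learn, assuming it is installed. The synthetic dataset and the specific steps (`StandardScaler` plus `LogisticRegression`) are arbitrary choices for the sketch. What matters is that the scaler lives inside the pipeline, so it is refit on the training part of every fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Preprocessing and model are one estimator: for each fold, the scaler
# is fit only on that fold's training rows and then applied to the
# held-out rows, so no statistics leak from the validation part.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

fold_scores = cross_val_score(pipeline, X, y, cv=5)
mean_accuracy = fold_scores.mean()
```

The leaky variant would call `StandardScaler().fit_transform(X)` on the full dataset first and then cross-validate the bare model. With plain scaling the gap is often small, but with feature selection or target encoding fit on all rows it can be large.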
17	What to do if the models perform worse than expected? Almost everyone in ML runs into this problem. PROBLEM 1 You cleaned the data, trained the model, spent a lot of time, looked at the metrics, and the quality turned out to be much worse than you expected. The first thought is usually "We need a more complex model." In my experience, this is a mistake in 80% of cases. The problem is often not the model at all. The first thing to check is the data. Very often it turns out that the target is noisy, the classes are poorly separated, half of the features are useless, or there's simply not much signal in the data. Some tasks are just hard to predict, and that's normal. Many people seem to expect magic from ML: "If the model is smart, it will find everything by itself." It won't. If there's no consistent pattern in the data, XGBoost won't create one. PROBLEM 2 The second problem is leakage or a bad split, especially in tabular data. Sometimes everything looks beautiful offline (ROC-AUC = 0.95, almost perfect accuracy), and then the model falls apart on new data. The reverse also happens: the metrics are low because the split is strict and realistic. Another common story is the wrong metric: for example, optimizing accuracy under severe class imbalance, looking at ROC-AUC where precision matters, or rejoicing over a good loss that means nothing to the business. The model can be "mathematically good" and useless at the same time. The baseline is almost always underestimated. Sometimes logistic regression, the group average, or a simple hand-written rule gives a result close to a complex model, and this is not a failure. On the contrary, it's a useful sign that the task is either almost linear or that there's not enough data. There's another unpleasant thing: some tasks just aren't worth ML. 
Seriously: it happens that there's not enough data, supporting the model costs more than the benefit it brings, or the business effect is minimal. Yet many keep tuning the learning rate, changing architectures, running AutoML, and going through 40 models because "we're doing AI", even though they haven't looked at the distributions, the model's errors, or the quality of the target. And that's usually where the answer lies. •••••••••••••••••••••••••••••••••••••• 🤖 Data & ML \| @DataXplore	243
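The "wrong metric" trap from this post is easy to reproduce with a toy example. The 5% positive rate and the always-negative "model" below are made-up numbers, chosen only to show how accuracy can look fine while the model is useless.

```python
# 100 objects, only 5 of them positive (severe class imbalance).
y_true = [1] * 5 + [0] * 95

# A trivial baseline that always predicts the majority class.
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
false_negatives = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
# accuracy=0.95, recall=0.00
```

A 95% accuracy here means nothing: the baseline finds none of the positives. Any real model on this data has to be compared against this number, and on a metric that reflects what the business actually needs.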
18	LLM Hallucinations Large language models look like omniscient experts. The text is smooth, confident, logical. Until it turns out that all of this was a hallucination. Let's figure out where hallucinations are "normal" behavior of the model, and where they quietly turn into a serious problem. ➡️ Where Does the Model Help and Where Does It Lie? 1️⃣ Where Are Hallucinations "Normal"? The Model Doesn't Know, It Keeps Going An LLM is not a knowledge base, but a super-powerful autocomplete. Its goal is to generate a plausible continuation, not the truth. Insufficient or Ambiguous Data If the question is rare, fresh, or niche, the model simply fills in the gaps. It doesn't know how to say "I don't know" without additional training. Creative Tasks In storytelling and brainstorming, hallucinations aren't a bug, but a feature. The problems start when the same mode kicks in for facts and code. 2️⃣ Where Do the Problems Begin? Factual Questions The chatbot confidently reports incorrect dates, names, and events. And the user accepts this as truth. Code Generation • Functions that don't exist. • APIs that never existed. • The code looks correct until you run it. Critical Domains Law, medicine, finance. Here, "sounding convincing" = potential disaster. A Confident Tone Without Knowledge The most dangerous thing is that the model doesn't hesitate. It doesn't blush, pause, or qualify itself. 3️⃣ What Really Reduces Hallucinations? RAG (Data Anchoring) The model responds not "out of thin air", but based on specific documents. When there's a source, there's less fantasy. Re-training and Alignment RLHF, domain fine-tuning, teaching the model to say "I'm not sure". The model is taught to be cautious, not talkative. Clear Instructions: - answer only based on the context; - if you don't know, say so; - justify every step. Sometimes this is enough. 
Post-checks and rules: • tests for code; • link verification; • filters for prohibited patterns. Ask the model to: - check itself; - assess its confidence; - review the answer. 4️⃣ What Distinguishes a Reliable System from "Just an LLM"? - The model isn't the only source of truth - There are data, checks, and restrictions - The error is caught before it reaches the user - Confidence ≠ correctness Hallucinations aren't a "bad model". They're a consequence of the LLM always trying to respond. And if you don't surround it with context, checks, and rules, it will shoot itself in the foot just as confidently as it reasoned. ••••••••••••••••••••••••••••••••••• 🤖 Data & ML \| @DataXplore	333
19	This mathematics lies at the heart of every AI model currently being trained. Gradient… Jacobian… Hessian… Three words that initially seem intimidating, but in reality, they're just three ways of measuring change. ➡️ What are they and how do they work? 1. Gradient. For a scalar function f : ℝⁿ → ℝ, it is the vector of first partial derivatives. It answers the question: "In which direction does the function f grow the fastest?" That's why gradients are the foundation of optimization. Gradient descent goes in the opposite direction because the gradient points in the direction of maximum growth. Backpropagation efficiently calculates gradients during training. 2. Jacobian. For a vector-valued function F : ℝⁿ → ℝᵐ, it is the m × n matrix of first partial derivatives. It answers: "How does each output depend on each input?" The Jacobian is the local linear approximation of a vector function. It appears in: → sensitivity analysis → change of variables → automatic differentiation → forward-mode AD → reverse-mode AD / backpropagation In simple terms: forward-mode AD uses Jacobian–vector products; reverse-mode AD uses vector–Jacobian products. 3. Hessian. For a scalar function f : ℝⁿ → ℝ, it is the n × n matrix of second partial derivatives. It answers: "How does the gradient itself change?" That is, the Hessian measures curvature. When the second partial derivatives are continuous, the Hessian is symmetric. At a critical point: → positive-definite Hessian → strict local minimum → negative-definite Hessian → strict local maximum → indefinite Hessian → saddle point A simple mental model Gradient = first derivatives of a single output → shows direction Jacobian = first derivatives of many outputs → shows sensitivity Hessian = second derivatives of a single output → shows curvature And the connection between them is simple: The Hessian is the Jacobian of the gradient. For a scalar output, the Jacobian contains the same partial derivatives as the gradient, up to the convention on rows/columns. 
Same idea: measuring change. Different objects: direction, sensitivity, curvature. When this becomes clear, optimization stops looking like a set of formulas. It starts looking like a map of the task. ••••••••••••••••••••••••••••••••••• 🤖 Data & ML \| @DataXplore	265
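The statement "the Hessian is the Jacobian of the gradient" can be checked numerically. The sketch below uses central finite differences in NumPy. The test function f(x) = x₀²·x₁ + sin(x₁) and the helper `jacobian_fd` are illustrative choices, not part of the post.

```python
import numpy as np


def jacobian_fd(F, x, eps=1e-6):
    """Central-difference Jacobian of F: R^n -> R^m, returned as (m, n)."""
    x = np.asarray(x, dtype=float)
    columns = []
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        diff = np.atleast_1d(F(x + step)) - np.atleast_1d(F(x - step))
        columns.append(diff / (2 * eps))
    return np.stack(columns, axis=1)


def f(x):
    """Scalar function f: R^2 -> R."""
    return x[0] ** 2 * x[1] + np.sin(x[1])


def grad_f(x):
    """Analytic gradient of f, itself a vector function R^2 -> R^2."""
    return np.array([2 * x[0] * x[1], x[0] ** 2 + np.cos(x[1])])


point = np.array([1.5, 0.5])

# For a scalar output the Jacobian is a single row: the gradient.
gradient = jacobian_fd(f, point)[0]

# Differentiating the gradient once more gives the Hessian.
hessian = jacobian_fd(grad_f, point)

# Analytic Hessian for comparison: [[2*x1, 2*x0], [2*x0, -sin(x1)]].
hessian_exact = np.array([[2 * point[1], 2 * point[0]],
                          [2 * point[0], -np.sin(point[1])]])
```

The numerical `hessian` comes out symmetric (up to finite-difference error), as expected for a function with continuous second derivatives. At a critical point, its eigenvalues would tell a minimum, a maximum, and a saddle apart.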
20	Why are open-source models changing the AI market? A couple of years ago, it seemed that AI would be completely controlled by a few large companies. Whoever had more GPUs and money was the boss. Then came Llama, Mistral, DeepSeek, Qwen, and Phi, and it became clear that the market would take a completely different path. ➡️ How are they changing the AI market? It's not just about quality. The most interesting thing is that open-source models are changing the industry for reasons beyond quality, although their quality is already pretty good. The problem is that closed models tie you too tightly to someone else's infrastructure. Today the API works; tomorrow prices change, limits are cut, policies are revised, a region is shut down, or the model gets worse after an update, and you have no control over any of it. Why do open-source models change the rules of the game? With open-source, everything is different. You want to run locally, fine-tune, quantize, change the inference stack, optimize latency, and keep data within the company? Fine. For businesses, this makes a huge difference, especially when it comes to private data, compliance, large volumes of requests, and expensive inference. There's another important effect: open-source is rapidly moving the industry forward because thousands of engineers test models, find weaknesses, work on optimizations, create inference engines, and release fine-tuning tools. Progress doesn't come from the top down but from all sides at once. What's particularly interesting right now? Sometimes a small open-source model on a good inference pipeline feels more useful than a huge closed LLM, especially in production, because in reality it's not just about benchmarks. What matters? Price, control, latency, stability, and the ability to integrate the model into the system. 
The main idea seems to be that the AI market is gradually moving away from the concept of "one gigantic model for everything" towards "many specialized models for specific tasks." •••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••• 🤖 Data & ML \| @DataXplore	257
