Data eXplore : Data Science, ML, Big Data, LLMs and AI Security
Exploring Data Science, Big Data Analytics & Visualization, ML/DL, Neural Networks, LLMs with GitHub, Kaggle, HuggingFace and some white papers by big institutions. Not just data, but the science behind data. Paid project? premodi@zohomail.in ★ @DataML
554 subscribers
-1 in 24 hours
-4 in 7 days
+17 in 30 days
Channel posts
Updated the encoder - broke the ANN? How to migrate embeddings without pain
In embedding-based systems, the encoder is part of the data contract. It cannot be updated like a regular ML model: the ANN index already contains vectors from the old space, and a common mistake is to assume compatibility due to the same dimension and metric.
Why does compatibility break?
Even if the dimension and the metric (cosine) are unchanged and the offline benchmark has improved, the new encoder is not guaranteed to be compatible with the old index.
After the update, the following can change:
- geometry of the space;
- distribution of norms;
- local neighborhoods;
- ranking of nearest neighbors;
- calibration of scores;
- behavior of the ANN structure: HNSW/IVF/PQ were built for the old distribution.
The main anti-pattern: writing new documents with the new encoder into the old index with old embeddings.
Such an index becomes mixed: some vectors live in one space, others in another. The ANN works formally, but the nearest neighbors no longer have correct semantics.
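A toy illustration of the mixed-index problem, as a minimal sketch. Here `encoder_v2_from_v1` is a hypothetical stand-in for a new encoder that only rotates the coordinates of the old one, so the dimension and the cosine metric are identical. The vectors and document ids are invented for illustration, and `nearest` is a brute-force substitute for an ANN lookup.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query_vec, index):
    # Brute-force stand-in for an ANN lookup: id of the most similar vector.
    return max(index, key=lambda doc_id: cosine(query_vec, index[doc_id]))

# Hypothetical v1 embeddings (toy values).
index_v1 = {"cat": [0.9, 0.1, 0.0], "dog": [0.1, 0.9, 0.0], "car": [0.0, 0.1, 0.9]}

def encoder_v2_from_v1(vec):
    # Assumed "new encoder": same dim, same metric, rotated coordinates.
    return [vec[2], vec[0], vec[1]]

index_v2 = {doc_id: encoder_v2_from_v1(v) for doc_id, v in index_v1.items()}
query_v2 = encoder_v2_from_v1(index_v1["cat"])  # a "cat" query encoded with v2

print(nearest(query_v2, index_v2))  # cat: query and index share a space
print(nearest(query_v2, index_v1))  # dog: v2 query against v1 vectors
```

The lookup against `index_v1` still returns a result with a high score, which is why a mixed index is hard to notice from error rates alone.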
Versioning the embedding space as a production contract
You need to version not just the model_name, but the full contract:
embedding_version = encoder + tokenizer + pooling + normalization + dim + metric
If any of these has changed, it's a new version of the space. Practical advice: keep the embedding_version next to the document, query, index, and retrieval logs. Otherwise, if recall or CTR degrades, you won't understand which encoder was actually involved in the delivery.
Raising a new index and enabling dual-write
The old path:
docs_v1 -> embeddings_v1 -> ann_index_v1
The new path:
docs_v2 -> embeddings_v2 -> ann_index_v2
Even if the documents are the same, the embeddings must be recalculated with the new encoder. For ANN, this is a new corpus. Importantly: the index parameters should also be tuned. For example, for HNSW, the old M, efConstruction, efSearch may not be optimal for the new distribution.
During the migration, write new and updated documents to both versions:
def on_document_upsert(doc):
    # Encode with both encoders; each vector goes only to its own index.
    emb_v1 = encoder_v1(doc)
    emb_v2 = encoder_v2(doc)
    index_v1.upsert(doc.id, emb_v1)
    index_v2.upsert(doc.id, emb_v2)
This is more expensive in terms of compute and ingestion latency, but the old retrieval continues to work and the new index catches up with the current state. If v1 is going to be shut down soon, dual-write only needs to run until the cutover plus a short rollback window.
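The embedding contract from the versioning section can be turned into a concrete version tag. This is a minimal sketch, not a prescribed format: the `EmbeddingContract` class, the field values, the 12-character hash, and the `tag_record` helper are all assumptions made for illustration.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class EmbeddingContract:
    encoder: str
    tokenizer: str
    pooling: str
    normalization: str
    dim: int
    metric: str

    def version_id(self) -> str:
        # Stable short hash over every field of the contract.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]

def tag_record(doc_id: str, contract: EmbeddingContract) -> dict:
    # Keep the version next to documents, queries, indexes, and retrieval logs.
    return {"doc_id": doc_id, "embedding_version": contract.version_id()}

# Hypothetical contracts: only the pooling differs, dim and metric are the same.
contract_v1 = EmbeddingContract("enc-a", "tok-a", "mean", "l2", 768, "cosine")
contract_v2 = EmbeddingContract("enc-a", "tok-a", "cls", "l2", 768, "cosine")

print(contract_v1.version_id() != contract_v2.version_id())  # True
```

Hashing all fields means a change in pooling or normalization produces a new version even when the model name, dimension, and metric stay the same.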
Backfill, shadow-read, and readiness criteria
For v2, we need to recalculate the embeddings of the entire corpus and upload them to the new index. Here, it's not about notebook metrics, but about engineering reliability:
- idempotency of tasks;
- control of lag;
- deduplication of upserts;
- checkpoints;
- separate limits on encoder and ANN ingestion;
- document count comparison between indexes;
- percentage of documents without v2 embeddings.
The migration is not ready until the new index covers the production corpus with an acceptable lag.
Before switching, enable shadow-read:
query -> encoder_v1 -> index_v1 -> results_v1
      -> encoder_v2 -> index_v2 -> results_v2
Show only v1 to the user, but compare:
- recall@k on labeled data;
- overlap@k between v1 and v2;
- NDCG/MRR if there are clicks or raters;
- p95/p99 latency;
- tail failures;
- score distribution;
- downstream metrics in ranking, recommendations, or RAG.
Warning: high overlap@k does not guarantee product improvement. The new retrieval may change diversity, freshness, coverage, and load on the next ranker. It's better to do the cutover via a feature flag, with monitoring of quality, latency, error rate, and a quick rollback to ann_index_v1.
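One of the shadow-read comparisons, overlap@k, can be computed directly from logged result lists. A minimal sketch: the `shadow_log` structure (one pair of ranked id lists per query) and the document ids are hypothetical, and each list is assumed to hold at least k unique ids.

```python
def overlap_at_k(results_v1, results_v2, k):
    # Share of the top-k ids that both index versions returned.
    top_v1, top_v2 = set(results_v1[:k]), set(results_v2[:k])
    return len(top_v1 & top_v2) / k

def mean_overlap_at_k(shadow_log, k):
    # shadow_log: list of (results_v1, results_v2) pairs, one per query.
    values = [overlap_at_k(r1, r2, k) for r1, r2 in shadow_log]
    return sum(values) / len(values)

# Hypothetical shadow-read log for two queries.
shadow_log = [
    (["d1", "d2", "d3", "d4"], ["d1", "d3", "d9", "d2"]),
    (["d5", "d6", "d7", "d8"], ["d6", "d5", "d7", "d8"]),
]

print(mean_overlap_at_k(shadow_log, k=3))
```

Overlap ignores order inside the top-k, so it is worth reading alongside rank-aware metrics such as NDCG or MRR.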
Conclusion: Updating the encoder is a migration of the embedding contract and ANN infrastructure, not a simple model replacement in the inference path.
••••••••••••••••••••••••••••••••••••••
🤖 Data & ML | @DataXplore |
| 2 | How to Find Harmful Training Examples Before Fine-Tuning: Influence Functions, TracIn, and Data Pruning in Production ML
In production ML, "bad" training examples can be costly: a cluster of mislabeled, outdated, or anomalous objects can consistently degrade fine-tuning on a fresh data set. A common mistake is to clean the dataset using heuristics alone, without checking which samples actually increase the loss on the production-like validation set.
1️⃣ Influence Functions
Idea: estimate how the loss on the validation set z_val would change if we slightly increase the weight of the training example z_train.
I(z_train, z_val) ≈ - ∇L_val^T H^-1 ∇L_train
where H is the Hessian with respect to the model parameters.
If the influence is large and positive, the training example is likely harming the validation loss and quality.
Pros:
- rigorous theoretical formulation;
- can associate specific training examples with specific model errors.
Cons:
- expensive H^-1;
- poorly scalable to large neural networks;
- sensitive to non-convexity, batchnorm/dropout, checkpoints, and Hessian approximation.
In production, we typically use approximations: LiSSA, conjugate gradients, low-rank approximation, or calculate the influence only for the last layer/head of the model.
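For a model small enough to form the Hessian explicitly, the influence formula can be evaluated directly. The sketch below is a toy under stated assumptions: a one-feature linear regression with squared loss, invented data with one deliberately mislabeled point, and a small `damping` term added to H for invertibility. It is not a recipe for large networks.

```python
import numpy as np

# Toy data: y = x, plus one mislabeled training point at index 3.
X_train = np.array([[1.0], [2.0], [3.0], [2.0]])
y_train = np.array([1.0, 2.0, 3.0, -2.0])
X_val = np.array([[1.0], [2.0]])
y_val = np.array([1.0, 2.0])

# Least-squares fit, i.e. the minimizer of the mean of 0.5 * (x @ w - y)^2.
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

def per_example_grads(X, y, w):
    # Gradient of 0.5 * (x @ w - y)^2 for each row, shape (n, d).
    residual = X @ w - y
    return residual[:, None] * X

damping = 1e-3  # assumed small ridge term that keeps H invertible
H = X_train.T @ X_train / len(X_train) + damping * np.eye(X_train.shape[1])

g_val = per_example_grads(X_val, y_val, w).mean(axis=0)
g_train = per_example_grads(X_train, y_train, w)

# I(z_train, z_val) ≈ -g_val^T H^-1 g_train, one value per training example.
influence = -g_train @ np.linalg.solve(H, g_val)
most_harmful = int(np.argmax(influence))

print(influence, most_harmful)
```

Only the mislabeled point gets a positive influence here, which matches the reading above: upweighting it would raise the validation loss.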
2️⃣ TracIn
A more engineering-oriented approach: a training example is useful for the validation set if its gradient points in a similar direction to the validation gradient across training checkpoints. It's harmful if the gradients point in opposite directions.
TracIn(z_train, z_val) = Σ_c η_c · ∇L_train(θ_c) · ∇L_val(θ_c)
where θ_c are checkpoints and η_c is the learning rate.
A strongly negative score means: the training example is pulling the model in the opposite direction of what's useful for validation.
A mini sketch for the last layer:
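for ckpt in checkpoints:
    model.load_state_dict(load(ckpt))
    # Mean validation gradient with respect to the last layer (head) only
    g_val = mean_grad(model.head, val_loader)
    for i, batch in enumerate(train_subset):
        g_train = grad(model.head, batch)
        scores[i] += lr[ckpt] * dot(g_train, g_val)
# Most negative scores first: the K most harmful training examples
harmful = argsort(scores)[:K]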
Practical advice: calculate the score not on the entire validation set, but on important production slices: new users, rare classes, problematic regions, fresh drift, segments with high business value or SLA.
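The same TracIn accumulation can be run end to end on a toy model. This is a minimal sketch under assumptions that are not from the post: a one-parameter linear model trained with full-batch gradient descent, a fixed learning rate, a checkpoint every 5 steps, and invented data with one mislabeled point.

```python
import numpy as np

X_train = np.array([1.0, 2.0, 3.0, 2.0])
y_train = np.array([1.0, 2.0, 3.0, -2.0])  # index 3 is mislabeled
X_val = np.array([1.0, 2.0])
y_val = np.array([1.0, 2.0])

def grads(w, x, y):
    # Per-example gradient of 0.5 * (w * x - y)^2 with respect to w.
    return (w * x - y) * x

lr, w = 0.05, 0.0
checkpoints = []
for step in range(20):
    if step % 5 == 0:
        checkpoints.append(w)  # save a checkpoint every 5 steps
    w -= lr * grads(w, X_train, y_train).mean()

# TracIn: sum over checkpoints of lr * grad_train * grad_val.
scores = np.zeros(len(X_train))
for w_c in checkpoints:
    g_val = grads(w_c, X_val, y_val).mean()
    scores += lr * grads(w_c, X_train, y_train) * g_val

K = 1
harmful = np.argsort(scores)[:K]  # most negative scores first
print(scores, harmful)
```

In this toy run the clean points get positive scores and the mislabeled one gets a negative score, so it is the first candidate for audit.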
3️⃣ Data pruning before fine-tuning
Workflow:
1. Freeze a production-like validation set without leakage.
2. Train a baseline / fine-tune and save several checkpoints.
3. Calculate the influence or TracIn for train→val.
4. Check the top harmful samples:
- label noise;
- outdated distribution;
- conflicting duplicates;
- corrupted inputs;
- incorrect task/schema version.
5. Remove, downweight, or relabel them.
6. Repeat fine-tuning and check not only the overall metric but also the regression by segment.
Production example: before retraining a recommendation model on fresh logs, you might find old interactions with a changed product taxonomy, conflicting labels after a schema migration, or bot traffic that degrades the ranking loss on a fresh holdout.
4️⃣ Caution
Don't blindly remove all "harmful" examples. Sometimes they degrade the current validation, but are needed for long-tail robustness, fairness, or resilience to rare scenarios.
It's safer to start with top-K, do a human-in-the-loop audit, compare remove / downweight / relabel options, and look at the trade-off between quality, latency of recalculation, cost of labeling, reproducibility, and monitoring reliability.
Conclusion: Influence Functions and TracIn are useful not as a magic data cleaning tool, but as an engineering approach to make fine-tuning less vulnerable to noise, outdated data, and conflicting labels.
••••••••••••••••••••••••••••••••••••••
🤖 Data & ML | @DataXplore | 163 |
| 3 | Temporal leakage in the feature store: How point-in-time joins, backfills, and feature causality checks save a model from beautiful offline metrics and failure in production?
Temporal leakage in the feature store is one of the most expensive ways to get great offline metrics and a useless model in production. The problem isn't that the feature is bad, but that at training time it knows more than the model would have known at the moment of decision-making.
Example: we predict churn on date t, but the features include transactions_last_30d, calculated after a backfill from a table where transactions arrived with a delay or were recalculated with later fixes.
Offline, everything looks great. Online, performance slumps.
1️⃣ Point-in-time join - basic protection
For each training row, there is prediction_time. The features should be in the state they were in at that moment.
It's important to distinguish:
- event_time - when the event actually happened;
- ingestion_time / created_at - when it entered the system;
- available_at - when the feature became available to the model;
- prediction_time - the moment of prediction.
The correct join should take into account not only event_time <= prediction_time, but also available_at <= prediction_time:
WITH ranked_features AS (
    SELECT
        l.entity_id,
        l.prediction_time,
        f.feature_value,
        ROW_NUMBER() OVER (
            PARTITION BY l.entity_id, l.prediction_time
            ORDER BY f.event_time DESC
        ) AS rn
    FROM labels l
    JOIN features f
        ON f.entity_id = l.entity_id
        AND f.event_time <= l.prediction_time
        AND f.available_at <= l.prediction_time
)
SELECT *
FROM ranked_features
WHERE rn = 1;
If there is no available_at, you often can't prove that there is no leakage.
2️⃣ Backfills - a hidden source of leakage
Backfills are dangerous because they create the illusion of historical completeness.
For example, today you recalculated a feature for the past year:
- corrected old events;
- added data from a new source;
- changed the business logic;
- caught up with late-arriving events;
- used a reference that wasn't available at the time.
As a result, train gets a history that didn't actually exist at the moment of prediction.
A correct backfill should answer the question:
What feature would the model have seen then if the pipeline had worked with the same delays, sources, and availability rules?
If the answer is unknown, it's not historical truth, but reconstructed truth. For model training, these are different things.
3️⃣ Checking the causality of features
Before training, every feature should be run through a causality review.
➡️ Minimum checklist:
1. Is the feature available before prediction_time?
What matters is not when the event happened, but when the feature value became available.
2. Is there a label proxy in the feature?
For example, days_since_last_payment_failed for a default task might be almost a direct consequence of a future target.
3. Is the aggregation window strictly in the past?
last_7d should mean [t-7d, t), not a calendar week that includes the future relative to t.
4. Are there future-aware reference tables?
Segments, statuses, limits, antifraud flags, and CRM attributes are often backfilled.
5. Is the source latency taken into account?
If the data arrives with a 6-hour delay, you can't use an event from 09:55 for a prediction at 10:00.
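Checklist items 3 and 5 can be expressed as a small aggregation helper. A minimal sketch: the function name `count_last_7d` is hypothetical, and availability is simplified to `event_time + source_latency`, which assumes a fixed delivery delay rather than a recorded available_at column.

```python
from datetime import datetime, timedelta

def count_last_7d(events, prediction_time, source_latency):
    """Count events in [t-7d, t) that had already arrived by t."""
    window_start = prediction_time - timedelta(days=7)
    count = 0
    for event_time in events:
        in_window = window_start <= event_time < prediction_time
        # Assumed availability rule: the event is usable after a fixed delay.
        available = event_time + source_latency <= prediction_time
        if in_window and available:
            count += 1
    return count

t = datetime(2025, 1, 10, 10, 0)
events = [
    t - timedelta(days=8),     # before the window
    t - timedelta(days=7),     # window start is inclusive
    t - timedelta(days=1),     # inside the window, already delivered
    t - timedelta(minutes=5),  # 09:55 event, not delivered yet with 6 h latency
    t,                         # the prediction moment itself is excluded
]

print(count_last_7d(events, t, timedelta(hours=6)))  # 2
print(count_last_7d(events, t, timedelta(0)))        # 3
```

The gap between the two counts is exactly the leakage a training set would absorb if the source latency were ignored.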
In production ML, a feature is considered valid not when it's historically correct, but when it's demonstrably available to the model at the moment of decision-making.
••••••••••••••••••••••••••••••••••••
🤖 Data & ML | @DataXplore | 154 |
| 4 | Conformal intervals in production ML with covariate shift: How to maintain coverage without unnecessarily wide predictions?
Split conformal works well under exchangeability: train, calibration, and test data come from the same distribution. In production, this often breaks down due to geo, devices, channels, seasonality, or a change in acquisition mix. A common mistake is to simply widen the intervals "with a margin".
What exactly breaks down?
With covariate shift, we have: p_prod(x) != p_cal(x)
but we assume that p(y|x) stays approximately the same. If we calculate the usual conformal quantile on the old calibration set, coverage on current traffic might drop.
The naive solution is to globally increase the correction. Coverage will partially recover, but the price prediction interval, ETA interval, or forecast band will become so wide that the downstream system will no longer trust it.
Basic production recipe
1️⃣ Train a quantile model:
q_low(x), q_high(x)
2️⃣ Calculate nonconformity scores on calibration set:
s_i = max(q_low(x_i)-y_i, y_i-q_high(x_i), 0)
3️⃣ Estimate importance weights:
w_i ~= p_prod(x_i) / p_cal(x_i)
4️⃣ Compute tau as a weighted quantile of the scores instead of the usual unweighted one.
5️⃣ For a new object, construct:
C(x) = [q_low(x)-tau, q_high(x)+tau]
Minimal skeleton:
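import numpy as np

def weighted_quantile(values, weights, q):
    # Smallest value whose cumulative weight reaches a fraction q of the total
    order = np.argsort(values)
    v = np.asarray(values)[order]
    w = np.asarray(weights)[order]
    cw = np.cumsum(w)
    idx = np.searchsorted(cw, q * cw[-1])
    return v[min(idx, len(v) - 1)]  # guard against round-off when q is close to 1

alpha = 0.1
# q_low_cal, q_high_cal, y_cal: quantile predictions and targets on the calibration set
# np.maximum is binary, so nest the calls to take the max of three terms
scores = np.maximum(np.maximum(q_low_cal - y_cal, y_cal - q_high_cal), 0)
weights = ratio_model.predict_weight(X_cal)
tau = weighted_quantile(scores, weights, 1 - alpha)
low = q_low_prod - tau
high = q_high_prod + tau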
This way, the reweighted calibration distribution moves closer to the production distribution without unnecessarily widening all intervals.
How to avoid overly wide intervals?
A single global tau often overestimates uncertainty if the model error depends strongly on x.
What helps in practice:
- CQR instead of point prediction: Conformalized Quantile Regression already models heteroscedastic uncertainty, so conformal correction is usually smaller.
- Normalized score: for example s_i = |y_i - y_hat_i| / sigma_hat(x_i), and the interval is constructed as y_hat(x) +- tau * sigma_hat(x).
- Local calibration: a separate tau per geo, device, channel, price bucket, or risk bucket. This is close to Mondrian conformal, but requires a sufficient number of calibration examples in each segment.
- Rolling calibration buffer: for recommendations, scoring and forecasting, old calibration set quickly stops describing current traffic mix.
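The local calibration option from the list above can be sketched as a per-segment tau with a fallback. This is an illustration under assumptions: the `min_count` cutoff, the fallback to a global tau for small segments, the segment names, and the toy scores are all hypothetical, and the quantile uses the standard finite-sample split-conformal rank.

```python
import math
from collections import defaultdict

def conformal_quantile(scores, alpha):
    # Finite-sample quantile: the ceil((n+1)(1-alpha))-th smallest score.
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return math.inf if k > n else sorted(scores)[k - 1]

def fit_segment_taus(scores, segments, alpha, min_count):
    by_segment = defaultdict(list)
    for score, segment in zip(scores, segments):
        by_segment[segment].append(score)
    global_tau = conformal_quantile(scores, alpha)
    # Segments with too few calibration examples are left to the global tau.
    taus = {seg: conformal_quantile(vals, alpha)
            for seg, vals in by_segment.items() if len(vals) >= min_count}
    return taus, global_tau

def interval(q_low, q_high, segment, taus, global_tau):
    tau = taus.get(segment, global_tau)
    return q_low - tau, q_high + tau

# Hypothetical calibration scores: "web" is easy, "app" is hard, "tv" is rare.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 0.2]
segments = ["web"] * 5 + ["app"] * 5 + ["tv"]
taus, global_tau = fit_segment_taus(scores, segments, alpha=0.2, min_count=5)

print(interval(10.0, 20.0, "web", taus, global_tau))  # (9.5, 20.5)
print(interval(10.0, 20.0, "tv", taus, global_tau))   # (6.0, 24.0)
```

The "web" interval stays narrow because its own scores are small, while the rare "tv" segment inherits the wider global correction instead of an unstable local one.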
Main risk: bad weights
The density ratio model can be noisy. A few objects with huge weights effectively "replace" the entire calibration set.
Control:
ESS = (sum w)^2 / sum(w^2)
If ESS is low, the weighted quantile is unstable and intervals start jumping from release to release.
Practical measures:
- clip weights and monitor proportion of clipped weights;
- smooth the density ratio;
- merge rare segments;
- do not calibrate a segment separately when it has few fresh labels;
- run recalibration when ESS drops or distribution drifts on X.
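The ESS control and weight clipping can be sketched in a few lines. The cap `max_weight=5.0` and the example weights are arbitrary assumptions chosen to show the effect, not recommended values.

```python
def effective_sample_size(weights):
    # ESS = (sum w)^2 / sum(w^2)
    total = sum(weights)
    return total * total / sum(w * w for w in weights)

def clip_weights(weights, max_weight):
    clipped = [min(w, max_weight) for w in weights]
    # Share of weights that hit the cap; worth monitoring over time.
    clipped_share = sum(w > max_weight for w in weights) / len(weights)
    return clipped, clipped_share

# Hypothetical importance weights: one object dominates the calibration set.
weights = [100.0, 1.0, 1.0, 1.0]
clipped, clipped_share = clip_weights(weights, max_weight=5.0)

print(effective_sample_size(weights))   # close to 1: one object decides tau
print(effective_sample_size(clipped))   # higher after clipping
print(clipped_share)                    # 0.25
```

Clipping trades a little bias in the density ratio for a more stable weighted quantile, so the clipped share should be tracked together with ESS.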
Production checklist
- a separate calibration set, not mixed with training;
- drift detection on feature distribution;
- density ratio model between prod traffic and calibration traffic;
- weighted conformal calibration;
- monitor coverage, average width, coverage by slices, ESS, and latency;
- alerts on increasing interval width without increasing error;
- A/B validation if intervals affect routing, fallback or human review.
It's important not to confuse marginal and conditional coverage. Conformal prediction can maintain 90% coverage on the stream on average, but fail in individual microsegments. This needs to be explicitly checked in production.
With covariate shift, the goal is not to blindly widen the intervals, but to calibrate them to the current mix of production objects and monitor the reliability of this calibration.
•••••••••••••••••••••••••••••••••••••
🤖 Data & ML | @DataXplore | 160 |
| 5 | Corpus drift in RAG systems: How to notice the degradation of retrieval without labels, annotations, and obvious errors?
In RAG retrieval, things often break silently: same model, same embedding model, same prompt, normal latency, but the answers have gotten worse.
A typical mistake is to immediately tweak the prompt or blame the LLM, even though the problem lies deeper: the corpus has changed.
1️⃣ Monitor the corpus drift itself
We don't directly measure quality, but we look at how the space in which the retriever operates has changed:
- distribution of chunk embeddings;
- average chunk length, overlap, number of chunks per document;
- proportion of new, deleted, and modified chunks;
- duplicates and near-duplicates;
- distribution of domains, document types, languages, dates;
- density of the embedding space: whether many chunks have "clumped" together.
If the corpus has noticeably shifted, old retrieval thresholds and expectations of top-k might become garbage, especially if the confidence logic is tied to the score or the gap between top-1 and top-2.
2️⃣ Anchor queries instead of labels
In production, there are almost never labels like "these chunks are relevant for this query". But we can take a stable set of production queries: for example, 500-5,000 frequent or business-critical queries.
This isn't annotation. We don't know the correct chunk. But we know that the retrieval behavior shouldn't change chaotically after each corpus update.
For each anchor query, save the baseline:
- top-k doc/chunk ids;
- retrieval scores;
- rank positions;
- gap between top-1 and top-2;
- diversity of top-k;
- source distribution.
After the corpus update, compare the new retrieval with the baseline.
Useful proxy metrics:
- Jaccard@k between the old and new top-k;
- p95_top1_score_drop;
- score_wasserstein between the baseline and current scores.
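The three proxy metrics can be computed with the standard library alone. A minimal sketch with assumptions: the percentile uses the nearest-rank definition, the Wasserstein distance uses the equal-sample-size shortcut (mean absolute difference of sorted scores), and the function signatures are hypothetical.

```python
import math

def jaccard_at_k(old_ids, new_ids, k):
    # Assumes at least one id in the lists, so the union is not empty.
    old_top, new_top = set(old_ids[:k]), set(new_ids[:k])
    return len(old_top & new_top) / len(old_top | new_top)

def p95_top1_score_drop(old_top1, new_top1):
    # 95th percentile (nearest rank) of per-query drops in the top-1 score.
    drops = sorted(o - n for o, n in zip(old_top1, new_top1))
    rank = max(1, math.ceil(0.95 * len(drops)))
    return drops[rank - 1]

def score_wasserstein(old_scores, new_scores):
    # 1-D Wasserstein-1 distance for two samples of equal size.
    if len(old_scores) != len(new_scores):
        raise ValueError("this shortcut needs samples of equal size")
    pairs = zip(sorted(old_scores), sorted(new_scores))
    return sum(abs(o - n) for o, n in pairs) / len(old_scores)

print(jaccard_at_k(["a", "b", "c"], ["b", "c", "d"], k=3))  # 0.5
```

All three compare a stored baseline with the current retrieval for the same anchor queries, so none of them needs relevance labels.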
3️⃣ How to interpret the signals
- mean_jaccard@10 has dropped sharply: the retriever has started returning different context;
- the top-1 score systematically drops: the queries are matching the corpus less well;
- the score distribution has shifted significantly: old thresholds and confidence logic might have broken.
Practical advice: don't just look globally, but also by segments - sources, languages, document types, product domains. A global average easily hides degradation in a critical segment.
4️⃣ Retrieval confidence without ground truth
Even without annotations, you can look at the "confidence" of the retriever:
- high top-1 score;
- large gap between top-1 and top-2;
- consistency of dense retrieval and BM25;
- stability of top-k under query rewriting;
- low proportion of duplicates in top-k;
- coverage of needed sources.
If dense and lexical retrieval suddenly start diverging, don't just chalk it up to noise. Often, this means that the corpus or queries have changed in a way that one of the strategies no longer works as before.
Production minimum for RAG:
- store a snapshot of retrieval results for anchor queries;
- calculate overlap, score drift, and rank churn after each corpus update;
- monitor duplicates, new chunks, and source distributions separately;
- set alerts not on a single query, but on aggregates by segments.
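Alerting on segment aggregates rather than single queries can be sketched as follows. The segment names, the Jaccard values, and the 0.6 threshold are invented for illustration; `segment_alerts` is a hypothetical helper.

```python
from collections import defaultdict

def segment_alerts(records, min_mean_jaccard):
    """records: iterable of (segment, jaccard_at_k) pairs, one per anchor query."""
    by_segment = defaultdict(list)
    for segment, value in records:
        by_segment[segment].append(value)
    means = {seg: sum(vals) / len(vals) for seg, vals in by_segment.items()}
    alerts = sorted(seg for seg, mean in means.items() if mean < min_mean_jaccard)
    return means, alerts

# Hypothetical results after a corpus update: one small segment has degraded.
records = [("docs_en", 0.9)] * 6 + [("faq_de", 0.2), ("faq_de", 0.4)]

global_mean = sum(value for _, value in records) / len(records)
means, alerts = segment_alerts(records, min_mean_jaccard=0.6)

print(global_mean)  # stays above the threshold
print(alerts)       # ['faq_de']
```

The global mean stays above the threshold while the small segment falls well below it, which is the case a single global alert would miss.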
Corpus drift is annoying because it doesn't look like a crash. The system responds, there are no errors, and the latency is normal. It's just that the context has become slightly less relevant. Then a little more. And the RAG quality slowly declines.
Conclusion: without labels, you can't honestly measure relevance, but you can monitor the stability of retrieval behavior, the retriever's confidence, and corpus changes to catch degradation before users do.
•••••••••••••••••••••••••••••••••••••
🤖 Data & ML | @DataXplore | 187 |
| 8 | ML process template that we use in the ML Core team
Maybe the template will make it easier for you to start putting together a doc:
1️⃣ Create a scheme of the main development stages adopted in your team, for example:
• task setting;
• data exploration;
• formulating the task in ML terms;
• MVP solution;
• testing the solution;
• rolling it out to production;
• monitoring.
2️⃣ Describe each stage:
• what needs to be done;
• what the result should look like to proceed to the next step.
Try to avoid long texts; use diagrams, tables, infographics.
3️⃣ Add to each stage:
• templates that will allow you to complete this stage faster;
• useful tips;
• standards and requirements, if any;
• links to resources, articles, documents that can help at this stage;
• answers to popular questions;
• documentation requirements: what and where needs to be described to consider the stage completed.
After putting together the ML process, don't forget to request feedback from colleagues who didn't participate in its development. Also let everyone interested know that a new useful tool is available.
And remember, the ML process can't be written once and for all. The practices adopted in the company change, new tools appear to replace or supplement the old ones, versions are updated. It's important to regularly keep your ML process up to date and adapt it to new needs.
It's not you who adapt to the process, but the process that adapts to you!
••••••••••••••••••••••••••••••••••••••
🤖 Data & ML | @DataXplore | 207 |
| 9 | In the previous post, I explained what conformal prediction is and why it's needed. But you probably have some questions:
1. How does the model understand how "strange" or "risky" a particular object is?
2. How is the conformal predictor trained?
The answer to the first question: through a nonconformity measure.
A nonconformity measure is a function that shows how poorly a particular pair (x, y) fits the model and the already known data.
Nonconformity measures can be simple functions, such as the absolute error or a hinge-style score:
nonconformity_mae = |y_true - y_pred|
nonconformity_hinge = 1 - P(true_class)
or more complex ones, such as the Brier score.
An intuitive example
Suppose the model classifies images. For a normal picture, the model says:
Barbie: 0.02
Ken: 0.97
Oppenheimer: 0.01
And for a blurry picture of an animal in the forest:
Barbie: 0.2
Ken: 0.42
Oppenheimer: 0.38
A regular classifier will choose the Ken class in both cases. Conformal prediction in the second case may say:
{Ken, Oppenheimer}
Because the nonconformity scores of these two classes are not high enough to reject either of them outright.
Next, let's talk about how to train it.
TCP: Transductive Conformal Prediction
TCP, strictly speaking, is not "trained" like a regular model. It's better to formulate it this way:
In TCP, for each new object and each possible answer, we temporarily add this answer to the training set, retrain or re-evaluate the model, and check how nonconforming this answer is relative to the rest of the data.
Let's consider the TCP algorithm step by step.
Suppose there is a sample:
D = {(x1, y1), ..., (xn, yn)}
and a new object
x_new.
• Step 1. Take one of the classes, for example, Barbie. Make an assumption:
y_new = Barbie
• Step 2. Add it to the existing dataset:
D_Barbie = D ∪ {(x_new, Barbie)}
• Step 3. Train the model on the new set.
• Step 4. Calculate the nonconformity score, for example, hinge_loss, for all objects, including the new one.
• Step 5. Compare the new object with the rest. See how the nonconformity score of the new object ranks relative to the other objects in D_Barbie. Simplified:
p_value(Barbie) = the proportion of objects with a score ≥ score_x_new
Steps 1 to 5 are repeated for each class. The final prediction set is formed from the classes for which p_value > α, where α is the desired significance level.
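Steps 1 to 5 can be sketched with a model that is cheap to refit. Everything below is a toy assumption: one numeric feature, two classes, and a nonconformity score defined as the distance from x to the mean of its assumed class (so "retraining" is just recomputing class means).

```python
def centroid_scores(points):
    """Nonconformity of each (x, y): distance from x to the mean of class y."""
    sums, counts = {}, {}
    for x, y in points:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return [abs(x - sums[y] / counts[y]) for x, y in points]

def tcp_prediction_set(train, x_new, labels, alpha):
    p_values = {}
    for label in labels:
        augmented = train + [(x_new, label)]   # steps 1-2: assume y_new = label
        scores = centroid_scores(augmented)    # steps 3-4: refit and score all
        new_score = scores[-1]
        # Step 5: share of objects that are at least as nonconforming.
        p_values[label] = sum(s >= new_score for s in scores) / len(scores)
    return {lab for lab, p in p_values.items() if p > alpha}, p_values

# Hypothetical 1-D training sample.
train = [(0.9, "Barbie"), (1.0, "Barbie"), (1.1, "Barbie"), (1.2, "Barbie"),
         (4.8, "Ken"), (5.0, "Ken"), (5.2, "Ken"), (5.4, "Ken")]
labels = ["Barbie", "Ken"]

print(tcp_prediction_set(train, 1.05, labels, alpha=0.2)[0])  # {'Barbie'}
```

With 8 training objects the smallest possible p-value is 1/9, so alpha also limits how confidently any class can be rejected on a sample this small.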
ICP: Inductive Conformal Prediction
Experienced ML engineers, having read the previous part, are probably horrified. For predicting on 1000 objects in 10 classes, we will need 10,000 retrains of the model!
This problem is solved by the ICP method at the expense of allocating a separate calibration set:
• Step 1. Divide the data into train and calibration:
D_train
D_calibration
• Step 2. Train the model on D_train. After this, the model is no longer retrained for each new object.
• Step 3. Calculate the measures of discomfort on D_calibration. For each object from D_calibration, calculate how poorly the model predicted the correct answer. We get a set of calibration scores:
scores = [s1, s2, ..., sm]
• Step 4. Set the significance level α and select the threshold q. Choose q so that the required proportion (1 - α) of calibration scores does not exceed it. Simplified, for α = 0.1:
q = the 90th percentile of the calibration scores (more precisely, the ⌈(m + 1)(1 - α)⌉-th smallest of the m scores)
• Step 5. Apply to the new object. For the new object, calculate the score for each possible class:
score(Barbie) = 1 - P(Barbie)
score(Ken) = 1 - P(Ken)
score(Oppenheimer) = 1 - P(Oppenheimer)
The prediction set will include those classes for which score ≤ q.
The main difference from TCP:
ICP once trains the model and once calibrates the threshold. After that, for new objects, it uses the ready-made model and the ready-made calibration, so it works much faster.
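A minimal sketch of Steps 3 to 5, assuming the model is already trained and we only have its outputs. The nine calibration scores are invented for illustration; the probabilities for the new object are the "blurry picture" numbers from above. The threshold uses the finite-sample rule: the ⌈(m + 1)(1 - α)⌉-th smallest calibration score.

```python
import math

def conformal_threshold(calibration_scores, alpha):
    """q = the ceil((m + 1) * (1 - alpha))-th smallest calibration score."""
    m = len(calibration_scores)
    k = math.ceil((m + 1) * (1 - alpha))
    if k > m:  # too few calibration points for this alpha
        return math.inf
    return sorted(calibration_scores)[k - 1]

def prediction_set(probabilities, q):
    """Keep every class whose score 1 - P(class) does not exceed q."""
    return {label for label, p in probabilities.items() if 1 - p <= q}

# Hypothetical scores 1 - P(true class) on the calibration set.
calibration_scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.50, 0.65]
q = conformal_threshold(calibration_scores, alpha=0.1)

blurry = {"Barbie": 0.2, "Ken": 0.42, "Oppenheimer": 0.38}
print(q, prediction_set(blurry, q))
```

With these numbers the threshold is 0.65, so Ken (score 0.58) and Oppenheimer (score 0.62) stay in the set, while Barbie (score 0.8) is rejected.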
••••••••••••••••••••••••••••••••••••••
🤖 Data & ML | @DataXplore | 262 |
| 10 | EVOLUTION OF ATTENTION: Transition to Linear Models
This is a continuation of the series of posts about the path of linear Attention. Last time, we found out that transformers have quadratic complexity, which makes them poorly scalable for long sequences and requires a lot of memory.
➡️ How does it work?
📌 Linear Attention first changed the approach to computation. It replaces the softmax kernel exp(QKᵀ) with a decomposable feature map: φ(Q) · φ(K)ᵀ. As a result, we can:
• rearrange the calculations;
• first calculate φ(K)ᵀ · V;
• then apply φ(Q).
But the main thing: the complexity became linear in the length of the sequence (instead of quadratic). The model worked faster and more efficiently in terms of memory, but at the same time lost accuracy.
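The reordering is easy to check numerically. Below is a small sketch (non-causal, single head) under stated assumptions: φ is taken to be elu(x) + 1, a common choice in linear-attention work, and a row normalization is included; the sizes and random inputs are arbitrary.

```python
import numpy as np

def phi(x):
    # elu(x) + 1: a positive feature map (an assumed choice for this sketch).
    return np.where(x > 0, x + 1.0, np.exp(x))

rng = np.random.default_rng(0)
n, d = 6, 4                      # sequence length, head dimension
Q, K, V = rng.normal(size=(3, n, d))

# Quadratic order: build the full n x n matrix first.
A = phi(Q) @ phi(K).T                                # shape (n, n)
out_quadratic = (A @ V) / A.sum(axis=1, keepdims=True)

# Linear order: summarize keys and values first, then apply the queries.
KV = phi(K).T @ V                                    # shape (d, d)
Z = phi(K).sum(axis=0)                               # shape (d,)
out_linear = (phi(Q) @ KV) / (phi(Q) @ Z)[:, None]

print(np.allclose(out_quadratic, out_linear))
```

The second variant never materializes an n × n matrix: the context is compressed into the d × d summary KV, which is also why fine details of individual tokens get blurred.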
The usual Attention stores tokens separately, while linear Attention aggregates the information into a general "summary" of the context. Thanks to this, the model understands the general meaning well, but poorly remembers the details.
The task was: to maintain the efficiency of linear Attention, but to return local context.
📌 This problem was partially solved by the RWKV architecture. It did not abandon the idea of compact memory, but added mechanisms that make it more sensitive to the current context:
• Token shift
Instead of considering a token in isolation, the model mixes it with the previous state. Therefore, each new step contains information about the nearest context.
• Memory management
Memory in RWKV does not just accumulate. At each step, some of the old information is forgotten, and new information is added. If nothing is forgotten, the memory will quickly turn into noise. And if forgotten too aggressively — the context will be lost. The model learns to find a balance between extremes on its own.
• Gate at retrieval
When it's necessary to retrieve information from memory, a gate is used. It works like a filter: it looks at the current token and decides which parts of the memory are important now and which can be ignored.
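To make the "forget, add, gate" idea concrete, here is a toy recurrence. It is not the actual RWKV equations: token shift is omitted, the decay is a fixed scalar instead of a learned per-channel value, and all names (run_memory, gate_logits and so on) are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def run_memory(keys, values, queries, gate_logits, decay):
    """Toy gated memory: forget a share of the old state, add the new token."""
    state = np.zeros((keys.shape[1], values.shape[1]))
    outputs = []
    for k, v, q, g in zip(keys, values, queries, gate_logits):
        # Memory management: keep `decay` of the old state, add new info.
        state = decay * state + np.outer(k, v)
        # Gate at retrieval: the current token decides what to let through.
        outputs.append(sigmoid(g) * (q @ state))
    return np.array(outputs), state

rng = np.random.default_rng(0)
T, d_k, d_v = 5, 3, 2
keys = rng.normal(size=(T, d_k))
queries = rng.normal(size=(T, d_k))
values = rng.normal(size=(T, d_v))
gate_logits = rng.normal(size=(T, d_v))

outputs, state = run_memory(keys, values, queries, gate_logits, decay=0.9)
_, forgetful_state = run_memory(keys, values, queries, gate_logits, decay=0.0)
print(outputs.shape, state.shape)
```

With decay = 0.0 the state holds only the last token (the context is lost); with decay = 1.0 nothing is ever forgotten. A trained model sits somewhere in between.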
RWKV became a kind of hybrid of previous ideas. It did not make the model as accurate as classic Attention, but at the same time returned local context. In parallel with it, other architectures were developing: SSM and Mamba.
We'll cover them in the next posts.
••••••••••••••••••••••••••••••••••••••
🤖 Data & ML | @DataXplore | 192 |
| 11 | EVOLUTION OF ATTENTION: From RNN to Transformer
Starting a series of posts that explains the path to linear Attention.
Before transformers, tasks like translation and classification mainly used recurrent models (RNNs). In 2014, the Attention mechanism appeared. It allowed us not just to read a text sequentially, but to look at all input tokens and assess which of them were important for generation.
➡️ How did it work?
A bidirectional RNN encoded the input sequence → for each decoder step, it calculated the relevance of input tokens → obtained weights via softmax → based on them, it formed a context for generating the next token.
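One decoder step of this additive scheme can be sketched as follows. The shapes, the random weights and the parameter names (W_s, W_h, v) are assumptions for illustration; in a real model they are trained and the encoder states come from the bidirectional RNN.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def additive_attention(s, H, W_s, W_h, v):
    """s: decoder state (d_s,), H: encoder states (n, d_h)."""
    # Relevance of each input token: a single-layer perceptron.
    scores = np.tanh(s @ W_s + H @ W_h) @ v      # shape (n,)
    weights = softmax(scores)                    # weights over input tokens
    context = weights @ H                        # weighted sum of encoder states
    return weights, context

rng = np.random.default_rng(0)
n, d_s, d_h, d_a = 7, 5, 6, 4
s = rng.normal(size=d_s)
H = rng.normal(size=(n, d_h))
W_s = rng.normal(size=(d_s, d_a))
W_h = rng.normal(size=(d_h, d_a))
v = rng.normal(size=d_a)

weights, context = additive_attention(s, H, W_s, W_h, v)
print(weights.round(3), context.shape)
```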
This led to a significant improvement in machine translation quality. However, the main problem with RNNs remained - they performed poorly on long sequences. To "understand" a word, the model had to process all the preceding text step by step before reaching it.
Transformers became the next step in evolution
To see the entire sequence at once and better model the dependencies between tokens, a number of changes were made:
• We abandoned recurrence - the sequence is calculated in parallel.
• Added self-attention - attention is applied not only between the decoder and the encoder, but also within the encoder and within the decoder themselves.
• Replaced Bahdanau (additive) Attention with dot-product Attention - instead of a single-layer perceptron, a scaled dot product of queries and keys, obtained through the trainable Q, K, V projection matrices, is used.
However, a new problem arose: Attention has quadratic complexity in terms of sequence length. This means that as the context increases, memory and computations grow very quickly.
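A minimal single-head sketch of scaled dot-product self-attention shows where the quadratic cost comes from: the weight matrix has one entry per pair of tokens. The sizes and random projection matrices are placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[1])   # shape (n, n): every token vs every token
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_model, d_head = 8, 16, 4
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = rng.normal(size=(3, d_model, d_head))

out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape, weights.shape)
```

Doubling n doubles the output size but quadruples the number of entries in weights.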
There were various attempts to fix this: reducing the number of key/value heads to shrink the KV cache; computing attention over only a portion of the sequence; writing kernels for efficient Attention calculation (for example, Flash Attention).
These methods accelerated the calculations, but didn't change the Attention formula itself.
More about how this limitation was overcome in the next post.
••••••••••••••••••••••••••••••••••••••
🤖 Data & ML | @DataXplore | 218 |
| 12 | Data leakage is one of the main reasons why ML demos look impressive... and then fail in production.
The model didn't become smarter.
It just happened to see the correct answers in advance.
➡️ Let's break it down: in 4 minutes, you'll understand where data leaks hide.
1. Data Leakage
Data leakage occurs when information that won't be available at the time of actual prediction is used during the model training process.
Because of this, metrics on the validation stage can look much better than the actual quality of the model on new, previously unseen data.
2. Model Evaluation
The test set isn't just "additional data".
It's a simulation of the future.
Only train the model on the information that would have been available to you at the time of prediction.
Evaluate it on examples that the model couldn't have influenced during training.
3. Direct Leakage
This is the most obvious type of leakage.
Examples:
- a field with information from the future;
- an ID that encodes the target variable;
- a variable that appears only after an event has occurred;
- duplicate records in both the training and test sets.
If a feature doesn't exist at the time of inference (prediction), then it's likely a source of data leakage.
4. Indirect Leakage
This is the type of leakage that most often traps teams.
You perform normalization, imputation, feature selection, outlier removal, or dimensionality reduction before splitting the data into a training and test set.
The model didn't directly see the data from the test set.
But your preprocessing pipeline already saw it.
5. Train/Test Split:
Wrong:
fit the scaler on all data → split the data → evaluate
Right:
split the data → fit the scaler only on the training set → apply it to both the training and test sets
The same idea applies to imputers, encoders, feature selection, PCA, and any preprocessing step that is trained on the data.
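The "right" order looks like this in a minimal NumPy sketch. The synthetic data, the 80/20 split and the hand-written standardization are assumptions chosen to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))   # synthetic features

# 1. Split first.
idx = rng.permutation(len(X))
train_idx, test_idx = idx[:80], idx[80:]
X_train, X_test = X[train_idx], X[test_idx]

# 2. Fit the scaler (mean and std) on the training set only.
mean, std = X_train.mean(axis=0), X_train.std(axis=0)

# 3. Apply the same statistics to both parts.
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std

print(X_train_scaled.mean(axis=0).round(6), X_test_scaled.mean(axis=0).round(3))
```

The test set is generally not perfectly centered after this, and that is expected: in production the new data will not be centered either.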
6. Cross-Validation:
Each fold is a mini-experiment with a training and test set.
Therefore, preprocessing should be performed within each fold.
If you prepared the entire dataset once and then ran cross-validation, each fold would already have had access to its held-out data.
7. Pipelines:
A pipeline isn't just a way to make the code cleaner.
It's also a defense against data leakage.
Combine preprocessing, feature selection, and the model into a single pipeline, and then pass this pipeline to cross-validation or hyperparameter search (grid search).
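A short sketch of this pattern, assuming scikit-learn is installed. The synthetic dataset and the choice of StandardScaler plus LogisticRegression are placeholders; the point is that the scaler is re-fit inside every fold, on that fold's training part only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Preprocessing and model travel together as one estimator.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# cross_val_score clones and fits the whole pipeline separately in each fold.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

The same pipe object can be passed to GridSearchCV, so hyperparameter search gets the same protection.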
8. AI Engineering Version:
Data leaks also occur in RAG systems and when evaluating LLMs.
Leakage occurs when you tune chunks, prompts, re-rankers, thresholds, or examples on the same evaluation dataset that you later present as "held-out".
As a result, your benchmark turns into training data.
9. Leakage Checklist:
Before trusting the obtained metric, ask yourself:
Is this feature unavailable at the time of prediction?
Was any transformation (transform) step trained (fit) on the test data?
Did any preprocessing step run outside cross-validation, on the entire dataset?
Were we tuning parameters on the final evaluation dataset?
If the answer to any of these is "yes", then the metric likely doesn't reflect the actual quality of the model.
••••••••••••••••••••••••••••••••••••••
🤖 Data & ML | @DataXplore | 232 |
| 13 | What to do if the models perform worse than expected?
Almost everyone in ML experiences this problem.
PROBLEM 1:
You cleaned the data, trained the model, spent a lot of time, looked at the metrics, and the quality turned out to be much worse than you expected.
The first thought is usually "We need a more complex model." In my experience this is a mistake in 80% of cases. The problem is often not the model at all.
The first thing to check is the data. Very often it turns out that the target is noisy, the classes are poorly separated, half of the features are useless, or there's simply not much signal in the data. Some tasks are just hard to predict, and that's normal.
Many people seem to expect magic from ML: "If the model is smart, it will find everything by itself." It won't. If there's no consistent pattern in the data, XGBoost won't create one.
PROBLEM 2
The second problem is leakage or a bad split, especially in tabular data. Sometimes everything looks beautiful offline: ROC-AUC = 0.95, almost perfect accuracy, and then the model falls apart on new data. The opposite also happens: the metrics are low because the split is strict and realistic.
Another common story is the wrong metric. For example: optimizing accuracy under severe class imbalance, looking at ROC-AUC where precision matters, or rejoicing over a good loss that means nothing to the business.
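A tiny arithmetic example of the accuracy trap, with made-up numbers: 5 positive cases out of 100 and a "model" that always predicts the majority class.

```python
# Hypothetical labels: 5 positives (e.g. fraud), 95 negatives.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100          # a "model" that always says "negative"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(accuracy, recall)
```

Accuracy is 0.95, yet the model never finds a single positive case: recall is 0.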
The model can be "mathematically good" and useless at the same time.
The baseline is almost always underestimated. Sometimes logistic regression, a group average, or a simple hand-written rule gives a result close to a complex model, and this is not a failure. On the contrary, it is a useful signal: either the task is almost linear, or there's not enough data for a complex model to show an advantage.
There's another unpleasant thing: some tasks just aren't worth ML. Seriously, it happens that there's not enough data, supporting the model costs more than the benefit it brings, or the business effect is minimal. Yet many keep tuning the learning rate, changing architectures, running AutoML, and going through 40 models because "we're doing AI",
even though they haven't looked at the distributions, the model's errors, or the quality of the target. And that's usually where the answer lies.
••••••••••••••••••••••••••••••••••••••
🤖 Data & ML | @DataXplore | 243 |
| 14 | LLM Hallucinations
Large language models look like omniscient experts. The text is smooth, confident, logical. Until it turns out that all of this was a hallucination. Let's figure out where the hallucinations are a "normal" behavior of the model, and where they quietly turn into a serious problem.
➡️ Where the Model Helps and Where It Lies?
1️⃣ Where Hallucinations Are "Normal"?
The Model Doesn't Know, It Keeps Going
An LLM is not a knowledge base, but a super-powerful autocomplete. Its goal is to generate a plausible continuation, not the truth.
Insufficient or Ambiguous Data
If the question is rare, fresh, or niche, the model simply fills in the gaps. It doesn't know how to say "I don't know" without additional training.
Creative Tasks
In storytelling and brainstorming, hallucinations aren't a bug, but a feature. The problems start when the same mode kicks in for facts and code.
2️⃣ Where the Problems Begin?
Factual Questions
The chatbot confidently reports incorrect dates, names, and events. And the user accepts this as truth.
Code Generation
• Functions that don't exist.
• APIs that never existed.
• The code looks correct — until you run it.
Critical Domains
Law, medicine, finance. Here, "sounding convincing" = potential disaster.
A Confident Tone Without Knowledge
The most dangerous thing is that the model doesn't hesitate. It doesn't blush, pause, or qualify itself.
3️⃣ What Really Reduces Hallucinations?
RAG (Data Anchoring)
The model responds not "out of thin air", but based on specific documents. There's a source — less fantasy.
Re-training and Alignment
RLHF, domain fine-tuning, teaching the model to say "I'm not sure". The model is taught to be cautious, not talkative.
Clear Instructions:
— answer only based on context
— if you don't know — say so
— justify every step
Sometimes this is enough.
Post-checks and Rules:
• Tests for code
• Link verification
• Filters for prohibited patterns
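One cheap post-check for generated code is to verify that the functions it calls actually exist before running anything. Below is a rough sketch using only the standard library; it handles just the simple `import module` plus `module.attr` pattern, and the helper name missing_attributes is hypothetical. It imports the referenced modules, so it should only be pointed at trusted or standard-library modules.

```python
import ast
import importlib

def missing_attributes(source):
    """List `module.attr` references whose attr does not exist in the module."""
    tree = ast.parse(source)

    # Map local names to imported modules: `import numpy as np` -> {"np": "numpy"}.
    imported = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                imported[alias.asname or alias.name] = alias.name

    missing = []
    for node in ast.walk(tree):
        if (isinstance(node, ast.Attribute)
                and isinstance(node.value, ast.Name)
                and node.value.id in imported):
            module = importlib.import_module(imported[node.value.id])
            if not hasattr(module, node.attr):
                missing.append(f"{node.value.id}.{node.attr}")
    return missing

# Hypothetical LLM output: math.fast_sqrt does not exist.
generated = "import math\nprint(math.sqrt(4))\nprint(math.fast_sqrt(4))\n"
print(missing_attributes(generated))
```

This catches invented functions without executing the generated code itself; real tests are still needed for logic errors.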
Ask the Model:
— check itself
— assess confidence
— review the answer
4️⃣ What Distinguishes a Reliable System from "Just an LLM"?
— The model isn't the only source of truth
— There are data, checks, and restrictions
— The error is caught before the user
— Confidence ≠ correctness
Hallucinations aren't a "bad model". They're a consequence of the LLM always trying to respond. And if you don't surround it with context, checks, and rules, it will shoot itself in the foot just as confidently as it reasoned.
•••••••••••••••••••••••••••••••••••
🤖 Data & ML | @DataXplore | 333 |
| 15 | This mathematics lies at the heart of every AI model currently being trained.
Gradient… Jacobian… Hessian…
Three words that initially seem intimidating, but in reality, they're just three ways of measuring change.
➡️ What are these three ways, and how do they work?
1. Gradient, for a scalar function:
f : ℝⁿ → ℝ
Returns the vector of first partial derivatives.
It answers the question:
"In which direction does the function f grow the fastest?"
That's why gradients are the foundation of optimization.
Gradient descent goes in the opposite direction because the gradient points to the direction of maximum growth.
Backpropagation efficiently calculates gradients during training.
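A minimal gradient-descent sketch on a hand-picked quadratic function. The function, the starting point, the step size 0.1 and the 200 iterations are arbitrary choices for illustration; the gradient is written out by hand.

```python
import numpy as np

def f(x):
    return (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 2.0) ** 2

def grad_f(x):
    # Vector of first partial derivatives of f.
    return np.array([2.0 * (x[0] - 1.0), 4.0 * (x[1] + 2.0)])

x = np.array([5.0, 5.0])
for _ in range(200):
    x = x - 0.1 * grad_f(x)   # step against the direction of fastest growth

print(x.round(6), f(x))
```

The iterates converge to the minimum at (1, -2), where the gradient is zero.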
2. Jacobian, for a vector-valued function:
F : ℝⁿ → ℝᵐ
Returns the m × n matrix of first partial derivatives.
It answers:
"How does each output depend on each input?"
The Jacobian is the local linear approximation of a vector function.
It appears in:
→ sensitivity analysis
→ change of variables
→ automatic differentiation
→ forward-mode AD
→ reverse-mode AD / backpropagation
In simple terms:
forward-mode AD uses Jacobian–vector products.
reverse-mode AD uses vector–Jacobian products.
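The two products can be written out explicitly for a small function. Here the Jacobian of a hand-picked F : ℝ² → ℝ³ is formed by hand just to show the shapes; real AD systems compute these products without ever building the full matrix. The function and the vectors are illustrative assumptions.

```python
import numpy as np

def F(x):
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def jacobian(x):
    # m x n = 3 x 2 matrix of first partial derivatives.
    return np.array([[x[1],         x[0]],
                     [np.cos(x[0]), 0.0],
                     [0.0,          2.0 * x[1]]])

x = np.array([0.5, 2.0])
v = np.array([1.0, -1.0])        # direction in input space (n,)
u = np.array([1.0, 2.0, 3.0])    # weights on the outputs (m,)

J = jacobian(x)
jvp = J @ v      # forward mode: how outputs move along v, shape (m,)
vjp = u @ J      # reverse mode: sensitivity of u.F(x) to each input, shape (n,)

# Sanity check of the JVP against a finite difference.
eps = 1e-6
jvp_numeric = (F(x + eps * v) - F(x)) / eps
print(jvp, vjp)
```

When the output is a single loss value (m = 1), one vector-Jacobian product yields the whole gradient, which is why backpropagation uses reverse mode.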
3. Hessian, for a scalar function:
f : ℝⁿ → ℝ
Returns the n × n matrix of second partial derivatives.
It answers:
"How does the gradient itself change?"
That is, the Hessian measures curvature.
When the second partial derivatives are continuous, the Hessian is symmetric.
At a critical point:
→ positive-definite Hessian → strict local minimum
→ negative-definite Hessian → strict local maximum
→ indefinite Hessian → saddle point
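The three cases can be checked through the eigenvalues of the Hessian. A small sketch, assuming the Hessian at the critical point is already known and symmetric; the example matrices are the Hessians of x² + y², -(x² + y²) and x² - y² at the origin.

```python
import numpy as np

def classify_critical_point(hessian):
    eigenvalues = np.linalg.eigvalsh(hessian)   # for symmetric matrices
    if np.all(eigenvalues > 0):
        return "strict local minimum"
    if np.all(eigenvalues < 0):
        return "strict local maximum"
    if eigenvalues.min() < 0 < eigenvalues.max():
        return "saddle point"
    return "inconclusive (some eigenvalue is zero)"

bowl = np.diag([2.0, 2.0])      # f(x, y) = x^2 + y^2
hill = np.diag([-2.0, -2.0])    # f(x, y) = -(x^2 + y^2)
saddle = np.diag([2.0, -2.0])   # f(x, y) = x^2 - y^2

for H in (bowl, hill, saddle):
    print(classify_critical_point(H))
```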
A simple mental model
Gradient = first derivatives of a single output
→ shows direction
Jacobian = first derivatives of many outputs
→ shows sensitivity
Hessian = second derivatives of a single output
→ shows curvature
And the connection between them is simple:
The Hessian is the Jacobian of the gradient.
For a scalar output, the Jacobian contains the same partial derivatives as the gradient, up to the convention on rows/columns.
Same idea: measuring change.
Different objects: direction, sensitivity, curvature.
When this becomes clear, optimization stops looking like a set of formulas. It starts looking like a map of the task.
•••••••••••••••••••••••••••••••••••
🤖 Data & ML | @DataXplore | 265 |
| 16 | Why are open-source models changing the AI market?
A couple of years ago, it seemed that AI would be completely controlled by a few large companies. Whoever had more GPUs and money was the boss.
Then came Llama, Mistral, DeepSeek, Qwen, and Phi, and it became clear that the market would take a completely different path.
➡️ How is it changing the AI market?
It's not just about quality. Open-source models are changing the industry for reasons that go beyond quality, although their quality is already pretty good.
The problem is that closed models tie you too tightly to someone else's infrastructure. Today, the API works; tomorrow prices have changed, limits have been cut, policies have been changed, a region has been shut down, the model has gotten worse after an update, and you have no control over any of it.
Why do open-source models change the rules of the game?
With open-source, everything is different.
You want to run locally, fine-tune, quantize, change the inference stack, optimize latency, and keep data within the company? Fine.
For businesses, this makes a huge difference. Especially regarding private data, compliance, large volumes of requests, and expensive inference. There's another important effect: Open-source is rapidly moving the industry forward because thousands of engineers test models, find weaknesses, work on optimizations, create inference engines, and release fine-tuning tools.
Progress doesn't come from the top down but from all sides at once.
What's particularly interesting right now?
Sometimes a small open-source model on a good inference pipeline feels more useful than a huge closed LLM, especially in production, because in reality, it's not just about benchmarks.
What matters? Price, control, latency, stability, and the ability to integrate the model into the system.
The main idea seems to be that the AI market is gradually moving away from the concept of "One gigantic model for everything" towards "Many specialized models for specific tasks."
••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••
🤖 Data & ML | @DataXplore | 257 |
| 17 | Why does data normalization sometimes worsen the model?
Beginners in ML often hear:
Always normalize the data.
And they start scaling everything, then the model quality... drops.
➡️ Why does this happen?
Because normalization isn't always necessary.
What does normalization actually do?
It brings the features to the same scale.
For example:
age → 18–60
salary → 1000–100000
After scaling:
the values become comparable and the training becomes more stable
When is normalization really needed?
It's especially important for models that are sensitive to scale:
Logistic Regression, Linear Regression, SVM, KNN, Neural Networks
Without scaling, such models may work worse or train unstably.
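A tiny nearest-neighbor example with the age and salary ranges from above shows the effect. The two customers and the query point are made up; min-max scaling with the stated ranges is assumed.

```python
import math

def nearest(query, points):
    """Name of the point with the smallest Euclidean distance to query."""
    return min(points, key=lambda name: math.dist(query, points[name]))

def min_max(point, lows, highs):
    return tuple((v - lo) / (hi - lo) for v, lo, hi in zip(point, lows, highs))

# (age, salary): A is almost the same age as the query, B is 34 years older.
points = {"A": (25, 50_000), "B": (60, 50_500)}
query = (26, 50_400)
lows, highs = (18, 1_000), (60, 100_000)

raw_neighbor = nearest(query, points)
scaled_points = {name: min_max(p, lows, highs) for name, p in points.items()}
scaled_neighbor = nearest(min_max(query, lows, highs), scaled_points)

print(raw_neighbor, scaled_neighbor)
```

On raw values the salary difference of a few hundred units drowns out the age, so B wins; after scaling both features contribute comparably and A becomes the nearest neighbor.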
And now the most important thing: trees usually don't need scaling.
These are Random Forest, XGBoost, LightGBM, CatBoost
Why? Because trees make splits of the form: feature < threshold
It doesn't matter to them whether the value is 0.5 or 5000: only the ordering of the values matters, so the scale hardly matters.
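This can be checked with a hand-written decision stump (a single split). It is a simplified stand-in for a real tree: binary 0/1 labels, misclassification count as the criterion, and invented salary data.

```python
from statistics import mean, pstdev

def best_split(values, labels):
    """Return the left/right partition produced by the best threshold."""
    candidates = sorted(set(values))
    best_errors, best_threshold = None, None
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2
        left = [l for v, l in zip(values, labels) if v < threshold]
        right = [l for v, l in zip(values, labels) if v >= threshold]
        # Errors if each side predicts its majority class (labels are 0/1).
        errors = (min(sum(left), len(left) - sum(left))
                  + min(sum(right), len(right) - sum(right)))
        if best_errors is None or errors < best_errors:
            best_errors, best_threshold = errors, threshold
    return [v < best_threshold for v in values]

salary = [1_000, 2_000, 3_000, 50_000, 60_000]
label = [0, 0, 0, 1, 1]

mu, sigma = mean(salary), pstdev(salary)
salary_scaled = [(v - mu) / sigma for v in salary]

raw_partition = best_split(salary, label)
scaled_partition = best_split(salary_scaled, label)
print(raw_partition == scaled_partition)
```

The threshold value changes after scaling, but the objects fall on the same sides of it, so the predictions are identical.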
How can normalization worsen the model?
1. It adds noise
Sometimes scaling distorts the distributions, lets outliers dominate the resulting range, or worsens separability, especially on bad data.
2. It breaks interpretability
It used to be: income = 5000
Now it's: income = -0.73
It's harder to explain this to the business.
3. Incorrect scaling = leakage
A classic mistake: scaling on the entire dataset, then splitting
The test has already "leaked" into the train.
4. CatBoost can get worse
CatBoost works well with: categorical features, original distributions
Sometimes extra preprocessing just gets in the way.
The most important insight: scaling isn't a "data improvement" tool. It's a tool for a specific model.
What to do in practice?
A simple rule: linear models / distance-based → scaling is needed, trees → usually not needed
Normalization isn't always useful: for some models it's useless, and sometimes it's even harmful.
••••••••••••••••••••••••••••••••••••••
🤖 Data & ML | @DataXplore | 239 |
| 18 | Feature Engineering is more important than model selection
The most unpopular fact in ML:
the model isn't the most important thing.
You can spend hours choosing between:
XGBoost, LightGBM, CatBoost ...and get a +1% increase in quality.
But you can change the features - and get a +20% increase.
➡️ Let's figure out why.
The model only learns from what you give it
Garbage in → garbage out
If the features:
- are noisy
- are irrelevant
- don't reflect the task well
👉 no model will save you
Even the biggest one.
Real-life example
Task: predict customer churn
Features:
- age
- city
- tariff
Model: OK, but a weak result
Added:
- time since last action
- frequency of use
- change in activity
👉 sharp increase in quality
Why?
Because the features started to reflect real behavior
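As an illustration, the three added features can be derived from a raw activity log. This is a sketch with invented dates and hypothetical feature names; the 30-day window is an arbitrary choice.

```python
from datetime import date

def behaviour_features(action_dates, today, window_days=30):
    """Turn a list of action dates into behavioural churn features."""
    ages = [(today - d).days for d in action_dates]
    recent = sum(1 for a in ages if 0 <= a < window_days)
    previous = sum(1 for a in ages if window_days <= a < 2 * window_days)
    return {
        "days_since_last_action": min(ages),   # time since last action
        "actions_last_30d": recent,            # frequency of use
        "activity_change": recent - previous,  # change in activity
    }

# Hypothetical log for one customer.
log = [date(2024, 1, 5), date(2024, 1, 20), date(2024, 1, 28), date(2024, 2, 20)]
features = behaviour_features(log, today=date(2024, 3, 1))
print(features)
```

For this customer activity dropped from three actions in the previous window to one in the latest: a signal that age, city and tariff cannot express.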
Feature Engineering = implementing knowledge about the task
The model doesn't know:
- the business
- the context
- the causal relationships
But you do.
And when you create features -
you "embed" this knowledge into the data.
Model vs Features
What we change → effect
Model → +1–5%
Hyperparameters → +1–3%
Feature Engineering → +10–50%
Where FE is especially crucial
- Tabular data
- Small datasets
- Business tasks
👉 where there aren't millions of examples, features are everything
When the model is more important
- CV (images)
- NLP (texts)
- Speech
👉 where features are learned automatically
Why everyone ignores FE
Because:
- it's hard
- it takes a long time
- there's no "magic button"
- it requires understanding the data
It's much easier to:
"let's try another model"
Main insight
ML isn't a competition of models.
It's a competition of data representations.
In one sentence: the best way to improve a model is to stop tuning the model and start tuning the data.
••••••••••••••••••••••••••••••••••••••
🤖 Data & ML | @DataXplore | 275 |
| 19 | MIT released a new RL method - Pedagogical RL.
The main lesson: even correct reasoning paths can be bad data for learning.
The idea is similar to teaching someone backprop.
Suppose you have a small computational graph:
z = w * x + b
a = ReLU(z)
L = (a - y)^2
If you already understand backprop, you can immediately write the gradient:
dL/dw = 2 * (a - y) * 1[z > 0] * x
The answer is correct, but it skips the reasoning process.
To reach it correctly, you need to break the calculation into local parts:
dL/da = 2 * (a - y)
da/dz = 1[z > 0]
dz/dw = x
Then backprop is just a composition of local derivatives in reverse order:
dL/dw = dL/da * da/dz * dz/dw = 2 * (a - y) * 1[z > 0] * x
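The composition of local derivatives is easy to verify numerically. A small sketch with arbitrary input values; the finite-difference check at the end is only a sanity test.

```python
def forward(w, x, b, y):
    z = w * x + b
    a = max(z, 0.0)          # ReLU
    loss = (a - y) ** 2
    return z, a, loss

w, x, b, y = 2.0, 3.0, -1.0, 4.0
z, a, loss = forward(w, x, b, y)

# Local derivatives, one per node of the graph.
dL_da = 2 * (a - y)
da_dz = 1.0 if z > 0 else 0.0
dz_dw = x

# Backprop: multiply them in reverse order.
dL_dw = dL_da * da_dz * dz_dw

# Central finite difference as an independent check.
eps = 1e-6
numeric = (forward(w + eps, x, b, y)[2] - forward(w - eps, x, b, y)[2]) / (2 * eps)
print(dL_dw, numeric)
```

Here z = 5, a = 5, L = 1, and the gradient is 2 * 1 * 1 * 3 = 6.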
➡️ What problem does it solve, and how?
Showing the student only the final gradient does not teach them to find gradients on new graphs.
Even the phrase "just use the chain rule" can be too big a leap if the student does not know how to break the calculation into intermediate nodes and local derivatives.
Reasoning RL faces the same problem.
A rollout may pass the test but contain a step that the student model would almost never have taken.
The trajectory gives the correct answer, but the learning signal is unstable because the path is too far from the student's current policy.
Pedagogical RL:
Trains a "privileged" teacher who knows the answer.
Rewards it for creating trajectories that the student can learn from.
The trick: use spike-oriented rewards.
It penalizes individual sharp "surprises" in the trajectory, even if the average probability looks normal.
The student learns through surprisal-gated imitation:
The teacher's tokens that are still too surprising receive a reduced weight.
The teacher learns how to teach at the current level of the student.
The effect of Pedagogical RL:
RL becomes more effective by selecting trajectories that the student is ready to learn from.
There is less waiting for "successful" rollouts.
There is more learning signal from examples that correspond to the current level of the student.
Get here
••••••••••••••••••••••••••••••••••••••
🤖 Data & ML | @DataXplore | 302 |
| 20 | Why is it called the "kernel trick"?
Many machine learning algorithms use kernels: the support vector machine, kernel principal component analysis (kernel PCA), and others. Their task is to calculate the dot product in some transformed feature space, usually of high dimensionality, without explicitly transitioning to this space.
The idea is this: instead of explicitly constructing the mapping φ(x) to the new space and then calculating ⟨φ(X), φ(Y)⟩, the kernel function k(X, Y) is used, which immediately returns the result of this dot product.
An example with a polynomial kernel:
k(X, Y) = (1 + XᵀY)²
Let:
X = (x1, x2)
Y = (y1, y2)
If we expand the expression, it turns into the dot product of two vectors in a higher-dimensional space (in this case — 6 dimensions). At the same time, the coordinates themselves in this space are not explicitly calculated.
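The expansion can be verified directly. In this sketch φ is the explicit 6-dimensional mapping obtained by expanding (1 + x1·y1 + x2·y2)²; the two input vectors are arbitrary.

```python
import math

def kernel(x, y):
    # Polynomial kernel: computed entirely in the original 2-D space.
    return (1 + x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    # Explicit mapping into the 6-dimensional feature space.
    x1, x2 = x
    r2 = math.sqrt(2)
    return (1.0, r2 * x1, r2 * x2, x1 ** 2, x2 ** 2, r2 * x1 * x2)

X, Y = (1.0, 2.0), (3.0, -1.0)

via_kernel = kernel(X, Y)
via_mapping = sum(a * b for a, b in zip(phi(X), phi(Y)))
print(via_kernel, via_mapping)
```

Both routes give 4 (up to rounding), but the kernel route never builds the 6-dimensional vectors.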
Hence the meaning of the "trick": the result is calculated in a high-dimensional space without explicitly constructing the vectors themselves in this space.
The Gaussian kernel (RBF) enhances this effect: it corresponds to working in an infinite-dimensional feature space, while the calculations remain finite and compact due to the form of the kernel function.
The mathematics behind the RBF kernel → link
••••••••••••••••••••••••••••••••••••••
🤖 Data & ML | @DataXplore | 305 |