Data eXplore : Data Science, ML, Big Data, LLMs and AI Security
الذهاب إلى القناة على Telegram
Exploring Data Science, Big Data Analytics & Visualization, ML/DL, Neural Networks, LLMs with GitHub, Kaggle, HuggingFace and some white papers by big institutions. Not just data, but science behind data Paid project? premodi@zohomail.in ★ @DataML
إظهار المزيدلم يتم تحديد البلدالتكنولوجيات والتطبيقات50 049
545
المشتركون
لا توجد بيانات24 ساعات
+57 أيام
+5230 أيام
جاري تحميل البيانات...
القنوات المماثلة
لا توجد بيانات
هل تواجه مشاكل؟ يرجى تحديث الصفحة أو الاتصال بمدير الدعم الخاص بنا.
سحابة العلامات
الإشارات الواردة والصادرة
---
---
---
---
---
---
جذب المشتركين
يونيو '26
يونيو '26
+12
في 1 قنوات
مايو '26
+54
في 1 قنوات
Get PRO
أبريل '26
+14
في 2 قنوات
Get PRO
مارس '26
+53
في 10 قنوات
Get PRO
فبراير '26
+22
في 2 قنوات
Get PRO
يناير '26
+28
في 2 قنوات
Get PRO
ديسمبر '25
+46
في 4 قنوات
Get PRO
نوفمبر '25
+128
في 4 قنوات
Get PRO
أكتوبر '25
+149
في 3 قنوات
Get PRO
سبتمبر '250
في 2 قنوات
Get PRO
أغسطس '25
+99
في 0 قنوات
Get PRO
يوليو '250
في 0 قنوات
Get PRO
يونيو '250
في 0 قنوات
Get PRO
مايو '250
في 0 قنوات
Get PRO
أبريل '250
في 1 قنوات
Get PRO
مارس '250
في 0 قنوات
Get PRO
فبراير '250
في 0 قنوات
Get PRO
يناير '25
+14
في 2 قنوات
Get PRO
ديسمبر '24
+11
في 0 قنوات
| التاريخ | نمو المشتركين | الإشارات | القنوات | |
| 11 يونيو | 0 | |||
| 10 يونيو | 0 | |||
| 09 يونيو | +3 | |||
| 08 يونيو | +4 | |||
| 07 يونيو | 0 | |||
| 06 يونيو | 0 | |||
| 05 يونيو | +1 | |||
| 04 يونيو | 0 | |||
| 03 يونيو | +3 | |||
| 02 يونيو | 0 | |||
| 01 يونيو | +1 |
منشورات القناة
ML process template that we use in the ML Core team
Maybe the template will make it easier for you to start putting together a doc:
1️⃣ Create a scheme of the main development stages adopted in your team, for example: • task setting; • data exploration; • formulating the task in ML terms; • MVP solution; • testing the solution; • rolling it out to production; • monitoring. 2️⃣ Describe each stage: • what needs to be done; • what the result should look like to proceed to the next step. Try to avoid long texts; use diagrams, tables, infographics. 3️⃣ Add to each stage: • templates that will allow you to complete this stage faster; • useful tips; • standards and requirements, if any; • links to resources, articles, documents that can help at this stage; • answers to popular questions; • documentation requirements: what and where needs to be described to consider the stage completed. After putting together the ML process, don't forget to request feedback from colleagues who didn't participate in its development. And also inform everyone interested about the appearance of a new useful tool.And remember, the ML process can't be written once. The practices adopted in the company change, new tools appear to replace or supplement the old ones, versions are updated. It's important to regularly keep your ML process up to date and adapt it to new needs. It's not you who adapt to the process, but the process that adapts to you! •••••••••••••••••••••••••••••••••••••• 🤖 Data & ML | @DataXplore
| 2 | In the previous post, i explained what conformal prediction is and why it's needed. But you probably have some questions:
1. How does the model understand how "strange" or "risky" a particular object is?
2. How is the conformal predictor trained?
The answer to the first question: through the measure of discomfort.
The measure of discomfort is a function that shows how poorly a particular pair (x, y) matches the model and the already known data.
The measures of discomfort can be simple functions, such as MAE or hinge_loss:
nonconformity_mae = |y_true - y_pred|
nonconformity_hinge = 1 - P(true_class)
or more complex ones, such as Brier's score.
An intuitive example
Suppose the model classifies images. For a normal picture, the model says:
Barbie: 0.02
Ken: 0.97
Oppenheimer: 0.01
And for a blurry picture of an animal in the forest:
Barbie: 0.2
Ken: 0.42
Oppenheimer: 0.38
A regular classifier in all cases will choose the Ken class. Conformal prediction in the second case may say:
{Ken, Oppenheimer}
Because the measure of discomfort for these classes will not be high enough to reject them outright.
Next, let's talk about how to train it.
TCP: Transductive Conformal Prediction
TCP, strictly speaking, is not "trained" like a regular model. It's better to formulate it this way:
In TCP, for each new object and each possible answer, we temporarily add this answer to the training set, retrain or re-evaluate the model, and check how "uncomfortable" this answer is relative to the rest of the data.
Let's consider the TCP algorithm step by step.
Suppose there is a sample:
D = {(x1, y1), ..., (xn, yn)}
and a new object
x_new.
• Step 1. Take one of the classes, for example, Barbie. Make an assumption:
y_new = Barbie
• Step 2. Add it to the existing dataset:
D_Barbie = D ∪ {(x_new, Barbie)}
• Step 3. Train the model on the new set.
• Step 4. Calculate the measure of discomfort, for example, hinge_loss, for all objects, including the new one.
• Step 5. Compare the new object with the rest. See how the discomfort of the new object is relative to the rest of the objects in D_Barbie. Simplified:
p_value(Barbie) = the proportion of objects with a score ≥ score_x_new
Steps 1 to 5 are repeated for each class. The final prediction set is formed from the classes for which p_value > α, where α is the desired significance level.
ICP: Inductive Conformal Prediction
Experienced ML engineers, having read the previous part, are probably horrified. For predicting on 1000 objects in 10 classes, we will need 10,000 retrains of the model!
This problem is solved by the ICP method at the expense of allocating a separate calibration set:
• Step 1. Divide the data into train and calibration:
D_train
D_calibration
• Step 2. Train the model on D_train. After this, the model is no longer retrained for each new object.
• Step 3. Calculate the measures of discomfort on D_calibration. For each object from D_calibration, calculate how poorly the model predicted the correct answer. We get a set of calibration scores:
scores = [α1, α2, ..., αm]
• Step 4. Set the significance level α and select the threshold q. Now we select such a threshold q of the calibration scores that the required proportion of calibration scores is not greater than it. Simplified:
q = the 90th percentile of the calibration scores
• Step 5. Apply to the new object. For the new object, calculate the score for each possible class:
score(Barbie) = 1 - P(Barbie)
score(Ken) = 1 - P(Ken)
score(Oppenheimer) = 1 - P(Oppenheimer)
The prediction set will include those classes for which score ≤ q.
The main difference from TCP:
ICP once trains the model and once calibrates the threshold. After that, for new objects, it uses the ready-made model and the ready-made calibration, so it works much faster.
••••••••••••••••••••••••••••••••••••••
🤖 Data & ML | @DataXplore | 123 |
| 3 | EVOLUTION OF ATTENTION: Transition to Linear Models
This is a continuation of the series of posts about the path of linear Attention. Last time, we found out that transformers have quadratic complexity, which makes them poorly scalable for long sequences and requires a lot of memory.
➡️ How it works?
📌 Linear Attention first changed the approach to computation. It allowed to decompose the kernel function and instead of exp(QKᵀ) use φ(Q) · φ(K). As a result:
• rearrange the calculations;
• first calculate K · V;
• then apply Q.
But the main thing: the complexity became linear in the length of the sequence (instead of quadratic). The model worked faster and more efficiently in terms of memory, but at the same time lost accuracy.
The usual Attention stores tokens separately, while linear Attention aggregates the information into a general "summary" of the context. Thanks to this, the model understands the general meaning well, but poorly remembers the details.
The task was: to maintain the efficiency of linear Attention, but to return local context.
📌 This problem was partially solved by the RWKV architecture. It did not abandon the idea of compact memory, but added mechanisms that make it more sensitive to the current context:
• Token shift
Instead of considering a token in isolation, the model mixes it with the previous state. Therefore, each new step contains information about the nearest context.
• Memory management
Memory in RWKV does not just accumulate. At each step, some of the old information is forgotten, and new information is added. If nothing is forgotten, the memory will quickly turn into noise. And if forgotten too aggressively — the context will be lost. The model learns to find a balance between extremes on its own.
• Gate at retrieval
When it's necessary to retrieve information from memory, a gate is used. It works like a filter: it looks at the current token and decides which parts of the memory are important now and which can be ignored.
RWKV became a kind of hybrid of previous ideas. It did not make the model as accurate as classic Attention, but at the same time returned local context. In parallel with it, other architectures were developing: SSM and Mamba.
Will tell more about them in the next posts.
••••••••••••••••••••••••••••••••••••••
🤖 Data & ML | @DataXplore | 124 |
| 4 | EVOLUTION OF ATTENTION: From RNN to Transformer
Starting a series of posts to explain path to linear Attention.
Before transformers, tasks like translation and classification mainly used recurrent models (RNNs). In 2014, the Attention mechanism appeared. It allowed us not just to read a text sequentially, but to look at all input tokens and assess which of them were important for generation.
➡️ How it worked?
A bidirectional RNN encoded the input sequence → for each decoder step, it calculated the relevance of input tokens → obtained weights via softmax → based on them, it formed a context for generating the next token.
This led to a significant improvement in machine translation quality. However, the main problem with RNNs remained - they performed poorly on long sequences. To "understand" a word, the models had to process the entire text and reach it.
Transformers became the next step in evolution
To see the entire sequence at once and better model the dependencies between tokens, a number of changes were made:
• We abandoned recurrence - the sequence is calculated in parallel.
• Added self-attention - in addition to the encoder, attention now starts to be applied directly to the decoder.
• Added Bahdanau Attention for expressiveness - instead of a single-layer perceptron, a dot product of the trainable Q,K,V matrices is used.
However, a new problem arose: Attention has quadratic complexity in terms of sequence length. This means that as the context increases, memory and computations grow very quickly.
This was attempted to be fixed in various ways: reducing the number of heads to save cache; calculating on a portion of the sequence; creating kernels for efficient Attention calculation (for example, Flash Attention).
These methods accelerated the calculations, but didn't change the Attention formula itself.
More about how this limitation was overcome in next post.
••••••••••••••••••••••••••••••••••••••
🤖 Data & ML | @DataXplore | 152 |
| 5 | Data leakage is one of the main reasons Why ML demos look impressive... and then fail in production.
The model didn't become smarter.
It just happened to see the correct answers in advance.
➡️ Let's break it down in 4 minutes, you'll understand where data leaks hide.
1. Data Leakage
Data leakage occurs when information that won't be available at the time of actual prediction is used during the model training process.
Because of this, metrics on the validation stage can look much better than the actual quality of the model on new, previously unseen data.
2. Model Evaluation
The test set isn't just "additional data".
It's a simulation of the future.
Only train the model on the information that would have been available to you at the time of prediction.
Evaluate it on examples that the model couldn't have influenced during training.
3. Direct Leakage
This is the most obvious type of leakage.
Examples:
- a field with information from the future;
- an ID that encodes the target variable;
- a variable that appears only after an event has occurred;
- duplicate records in both the training and test sets.
If a feature doesn't exist at the time of inference (prediction), then it's likely a source of data leakage.
4. Indirect Leakage
This is the type of leakage that most often traps teams.
You perform normalization, imputation, feature selection, outlier removal, or dimensionality reduction before splitting the data into a training and test set.
The model didn't directly see the data from the test set.
But your preprocessing pipeline already saw it.
5. Train/Test Split:
Wrong:
fit the scaler on all data → split the data → evaluate
Right:
split the data → fit the scaler only on the training set → apply it to both the training and test sets
The same idea applies to imputers, encoders, feature selection, PCA, and any preprocessing step that is trained on the data.
6. Cross-Validation:
Each fold is a mini-experiment with a training and test set.
Therefore, preprocessing should be performed within each fold.
If you prepared the entire dataset once and then ran cross-validation, each fold would already have had access to its held-out data.
7. Pipelines:
A pipeline isn't just a way to make the code cleaner.
It's also a defense against data leakage.
Combine preprocessing, feature selection, and the model into a single pipeline, and then pass this pipeline to cross-validation or hyperparameter search (grid search).
8. AI Engineering Version:
Data leaks also occur in RAG systems and when evaluating LLMs.
Leakage occurs when you tune chunks, prompts, re-rankers, thresholds, or examples on the same evaluation dataset that you later present as "held-out".
As a result, your benchmark turns into training data.
9. Leakage Checklist:
Before trusting the obtained metric, ask yourself:
Could this feature exist at the time of prediction?
Was any transformation (transform) step trained (fit) on the test data?
Did cross-validation include the entire pipeline?
Were we tuning parameters on the final evaluation dataset?
If the answer is "yes", then the metric likely doesn't reflect the actual quality of the model.
••••••••••••••••••••••••••••••••••••••
🤖 Data & ML | @DataXplore | 165 |
| 6 | What to do if the models perform worse than expected?
Almost everyone in ML experiences this problems.
PROBLEM 1:
You cleaned the data, trained the model, spent a lot of time, looked at the metrics and the quality turned out to be much worse than you expected.
The first thought is usually "We need a more complex model." In my experience this is a mistake in 80% of cases. Problem is often not the model at all.
First thing to check is the data. Very often it turns out that the target is noisy, the classes are poorly separated, half of the features are useless, there's not much signal in the data. Some tasks are just hard to predict and that's normal.
There's a feeling that many people expect magic from ML "If the model is smart, it will find everything by itself." It won't, if there's no consistent pattern in the data, XGBoost won't create one.
PROBLEM 2
The second problem is leakage or a bad split especially in tabular data. Sometimes offline everything is beautiful ROC-AUC = 0.95, almost perfect accuracy and then the model falls apart on new data and vice versa.
The metrics are low because the split is too strict and realistic. Another common story is the wrong metric. For example optimizing accuracy with severe imbalance, looking at ROC-AUC where precision matters, rejoicing over a good loss that means nothing to the business
The model can be “mathematically good” and useless at the same time. Baseline is almost always underestimated.
Sometimes logistic regression, the group average, a simple rule by hand give a result close to a complex model and this is not a failure. On the contrary, this is a good sign that the task is either almost linear, or there's not enough data.
There's another unpleasant thing, Some tasks just aren't worth ML. Seriously, it happens that there's not enough data, supporting the model is more expensive than the benefits, the business effect is minimal but many continue tuning the learning rate, changing architectures, running AutoML, going through 40 models because "we're doing AI".
Even though they haven't even looked at the distributions, the model's errors, the quality of the target and that's usually where the answer lies.
••••••••••••••••••••••••••••••••••••••
🤖 Data & ML | @DataXplore | 186 |
| 7 | LLM Hallucinations
Large language models look like omniscient experts. The text is smooth, confident, logical. Until it turns out that all of this was a hallucination. Let's figure out where the hallucinations are a "normal" behavior of the model, and where they quietly turn into a serious problem.
➡️ Where the Model Helps and Where It Lies?
1️⃣ Where Hallucinations Are "Normal"?
The Model Doesn't Know, It Keeps Going
LLM is not a knowledge base, but a super-powerful autocomplete. Its goal is to generate a plausible continuation, not the truth.
Insufficient or Ambiguous Data
If the question is rare, fresh, or niche, the model simply fills in the gaps. It doesn't know how to say "I don't know" without additional training.
Creative Tasks
In storytelling and brainstorming, hallucinations aren't a bug, but a feature. The problems start when the same mode kicks in in facts and code.
2️⃣ Where the Problems Begin?
Factual Questions
The chatbot confidently reports incorrect dates, names, and events. And the user accepts this as truth.
Code Generation
• Functions that don't exist.
• APIs that never existed.
• The code looks correct — until you run it.
Critical Domains
Law, medicine, finance. Here, "sounding convincing" = potential disaster.
A Confident Tone Without Knowledge
The most dangerous thing is that the model doesn't hesitate. It doesn't blush, pause, or qualify itself.
3️⃣ What Really Reduces Hallucinations?
RAG (Data Anchoring)
The model responds not "out of thin air", but based on specific documents. There's a source — less fantasy.
Re-training and Alignment
RLHF, domain fine-tuning, teaching the model to say "I'm not sure". The model is taught to be cautious, not talkative.
Clear Instructions:
— answer only based on context
— if you don't know — say so
— justify every step
Sometimes this is enough.
• Post-checks and rules
• Tests for code
• Link verification
• Filters for prohibited patterns
Ask the Model:
— check itself
— assess confidence
— review the answer
4️⃣ What Distinguishes a Reliable System from "Just an LLM"?
— The model isn't the only source of truth
— There are data, checks, and restrictions
— The error is caught before the user
— Confidence ≠ correctness
Hallucinations aren't a "bad model". They're a consequence of the LLM always trying to respond. And if you don't surround it with context, checks, and rules, it will shoot itself in the foot just as confidently as it reasoned.
•••••••••••••••••••••••••••••••••••
🤖 Data & ML | @DataXplore | 274 |
| 8 | This mathematics lies at the heart of every AI model currently being trained.
Gradient… Jacobian… Hessian…
Three words that initially seem intimidating, but in reality, they're just three ways of measuring change.
➡️ Which are the ways and how it works?
1. scaling function:
f : ℝⁿ → ℝ
Returns the vector of first partial derivatives.
It answers the question:
"In which direction does the function f grow the fastest?"
That's why gradients are the foundation of optimization.
Gradient descent goes in the opposite direction because the gradient points to the direction of maximum growth.
Backpropagation efficiently calculates gradients during training.
2. vector-valued function:
F : ℝⁿ → ℝᵐ
Returns the m × n matrix of first partial derivatives.
It answers:
"How does each output depend on each input?"
The Jacobian is a local linear mapping of a vector function.
It appears in:
→ sensitivity analysis
→ variable substitution
→ automatic differentiation
→ forward-mode AD
→ reverse-mode AD / backpropagation
In simple terms:
forward-mode AD uses Jacobian–vector products.
reverse-mode AD uses vector–Jacobian products.
3. scalar function:
f : ℝⁿ → ℝ
Returns the n × n matrix of second partial derivatives.
It answers:
"How does the gradient itself change?"
That is, the Hessian measures curvature.
When the second partial derivatives are continuous, the Hessian is symmetric.
At a critical point:
→ positive-definite Hessian → strict local minimum
→ negative-definite Hessian → strict local maximum
→ indefinite Hessian → saddle point
A pure mental model
Gradient = first derivatives of a single output
→ shows direction
Jacobian = first derivatives of many outputs
→ shows sensitivity
Hessian = second derivatives of a single output
→ shows curvature
And the connection between them is simple:
The Hessian is the Jacobian of the gradient.
For a scalar output, the Jacobian contains the same partial derivatives as the gradient, up to the convention on rows/columns.
Same idea: measuring change.
Different objects: direction, sensitivity, curvature.
When this becomes clear, optimization stops looking like a set of formulas. It starts looking like a map of the task.
•••••••••••••••••••••••••••••••••••
🤖 Data & ML | @DataXplore | 231 |
| 9 | Why are open-source models changing the AI market?
A couple of years ago, it seemed that AI would be completely controlled by a few large companies. Whoever had more GPUs and money was the boss.
Then came Llama, Mistral, DeepSeek, Qwen, and Phi, and it became clear that the market would take a completely different path.
➡️ How it is changing AI Market?
It's not just about quality. The most interesting thing is that open-source models are changing the industry, not just because of quality. Although their quality is already pretty good.
The problem is that closed models tie you too tightly to someone else's infrastructure. Today, the API works; tomorrow prices have changed, limits have been cut, policies have been changed, a region has been shut down, the model has gotten worse after an update, and you have no control over any of it.
Why do open-source models change the rules of the game?
With open-source, everything is different.
You want to run locally, fine-tune, quantize, change the inference stack, optimize latency, and keep data within the company? Fine.
For businesses, this makes a huge difference. Especially regarding private data, compliance, large volumes of requests, and expensive inference. There's another important effect: Open-source is rapidly moving the industry forward because thousands of engineers test models, find weaknesses, work on optimizations, create inference engines, and release fine-tuning tools.
Progress doesn't come from the top down but from all sides at once.
What's particularly interesting right now?
Sometimes a small open-source model on a good inference pipeline feels more useful than a huge closed LLM, especially in production, because in reality, it's not just about benchmarks.
What matters? Price, control, latency, stability, and the ability to integrate the model into the system.
Main idea seems to be that the AI market is gradually moving away from the concept of "One gigantic model for everything" towards "Many specialized models for specific tasks."
••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••
🤖 Data & ML | @DataXplore | 216 |
| 10 | Why data normalization sometimes worsens the model?
Beginners in ML often hear:
Always normalize the data.
And they start scaling everything, then the model quality... drops.
➡️ Why does this happen?
Because normalization isn't always necessary.
What does normalization actually do?
it brings the features to the same scale.
For example:
age → 18–60
salary → 1000–100000
After scaling:
the values become comparable and the training becomes more stable
When normalization is really needed?
It's especially important for models that are sensitive to scale:
Logistic Regression, Linear Regression, SVM, KNN, Neural Networks
Without scaling, such models may work worse
or train unstably.
And now the most important thing, Trees usually don't need scaling.
These are Random Forest, XGBoost, LightGBM, CatBoost
Why? Because trees make splits: feature < threshold
And it doesn't matter to them: whether it's 0.5 or 5000 and the scale hardly matters
How normalization can worsen the model?
1. It adds noise
Sometimes scaling blurs the distributions, amplifies outliers, worsens separability Especially on bad data.
2. It breaks interpretability
It used to be: income = 5000
Now it's: income = -0.73
It's harder to explain this to the business.
3. Incorrect scaling = leakage
A classic mistake: scaling on the entire dataset, then splitting
The test has already "leaked" into the train.
4. CatBoost can get worse
CatBoost works well with: categorical features, original distributions
Sometimes extra preprocessing just gets in the way.
The most important insight, Scaling isn't a "data improvement" tool. It's a tool for a specific model.
What to do in practice?
A simple rule: linear models / distance-based → scaling is needed, trees → usually not needed
Normalization isn't always useful, for some models it's useless, and sometimes even harmful.
••••••••••••••••••••••••••••••••••••••
🤖 Data & ML | @DataXplore | 207 |
| 11 | Feature Engineering is more important than model selection
Most unpopular fact in ML:
the model isn't the most important thing.
You can spend hours choosing between:
XGBoost, LightGBM, CatBoost ...and get a +1% increase in quality.
But you can change the features - and get a +20% increase.
➡️ Let's figure out why?
The model only learns from what you give it
Garbage in → garbage out
If the features:
- are noisy
- are irrelevant
- don't reflect the task well
👉 no model will save you
Even the biggest one.
Real-life example
Task: predict customer churn
Features:
- age
- city
- tariff
Model: ok, but weak result
Added:
- time since last action
- frequency of use
- change in activity
👉 sharp increase in quality
Why?
Because the features started to reflect real behavior
Feature Engineering = implementing knowledge about the task
The model doesn't know:
- the business
- the context
- the causal relationships
But you do.
And when you create features -
you "embed" this knowledge into the data.
Model vs Features
What we change → effect
Model → +1–5%
Hyperparameters → +1–3%
Feature Engineering → +10–50%
Where FE is especially crucial
- Tabular data
- Small datasets
- Business tasks
👉 where there aren't millions of examples, features are everything
When the model is more important
- CV (images)
- NLP (texts)
- Speech
👉 where features learn automatically
Why everyone ignores FE
Because:
- it's hard
- it takes a long time
- there's no "magic button"
- it requires understanding the data
It's much easier to:
"let's try another model"
Main insight
ML isn't a competition of models.
It's a competition of data representations.
In one sentence: best way to improve a model is to
stop tuning the model and start tuning the data
••••••••••••••••••••••••••••••••••••••
🤖 Data & ML | @DataXplore | 233 |
| 12 | MIT released a new RL method - Pedagogical RL.
The main lesson: even correct reasoning paths can be bad data for learning.
Idea is similar to teaching someone backprop.
Suppose you have a small computational graph:
z = w * x + b
a = ReLU(z)
L = (a - y)^2
If you already understand backprop, you can immediately write the gradient:
dL/dw = 2 (a - y) 1[z > 0] * x
The answer is correct, but it skips the reasoning process.
To reach it correctly, you need to break the calculation into local parts:
dL/da = 2 * (a - y)
da/dz = 1[z > 0]
dz/dw = x
Then backprop is just a composition of local derivatives in reverse order:
dL/dw = dL/da da/dz dz/dw = 2 (a - y) 1[z > 0] * x
➡️ What problem it solves and How?
Showing the student only the final gradient does not teach them to find gradients on new graphs.
Even the phrase "just use the chain rule" can be too big a leap if the student does not know how to break the calculation into intermediate nodes and local derivatives.
Reasoning RL faces the same problem.
A rollout may pass the test, but it may contain a step that the student-model almost never would have done.
The trajectory gives the correct answer, but the learning signal is unstable because the path is too far from the student's current policy.
Pedagogical RL:
Trains a "privileged" teacher who knows the answer.
Rewards him for creating trajectories that the student can learn from.
The trick: use spike-oriented rewards.
It penalizes individual sharp "surprises" in the trajectory, even if the average probability looks normal.
The student learns through surprisal-gated imitation:
The teacher's tokens that are still too surprising receive a reduced weight.
The teacher learns how to teach at the current level of the student.
The effect of Pedagogical RL:
RL becomes more effective by selecting trajectories that the student is ready to learn from.
There is less expectation of "successful" rollouts.
There is more learning signal from examples that correspond to the current level of the student.
Get here
••••••••••••••••••••••••••••••••••••••
🤖 Data & ML | @DataXplore | 265 |
| 13 | Why is it called the "kernel trick"?
Many machine learning algorithms use kernels: the support vector machine, the principal component kernel, and others. Their task is to calculate the dot product in some transformed feature space, usually of high dimensionality, without explicitly transitioning to this space.
The idea is this: instead of explicitly constructing the mapping φ(x) to the new space and then calculating ⟨φ(X), φ(Y)⟩, the kernel function k(X, Y) is used, which immediately returns the result of this dot product.
An example with a polynomial kernel:
k(X, Y) = (1 + XᵀY)²
Let:
X = (x1, x2)
Y = (y1, y2)
If we expand the expression, it turns into the dot product of two vectors in a higher-dimensional space (in this case — 6 dimensions). At the same time, the coordinates themselves in this space are not explicitly calculated.
Hence the meaning of the "trick": the result is calculated in a high-dimensional space without explicitly constructing the vectors themselves in this space.
The Gaussian kernel (RBF) enhances this effect: it corresponds to working in an infinite-dimensional feature space, while the calculations remain finite and compact due to the form of the kernel function.
The mathematics behind the RBF kernel → link
••••••••••••••••••••••••••••••••••••••
🤖 Data & ML | @DataXplore | 290 |
| 14 | ML Roadmap: from the basics to mastering vibe coding
A learning map for machine learning (Machine Learning, Deep Learning, LLM, Generative AI, MLOps) - from the first import of numpy to the level of an engineer who understands how AI works internally and can write production systems, not just call APIs.
GitHub
•••••••••••••••••••••••••••••••••••••
🤖 Data & ML | @DataXplore | 3 593 |
| 15 | An unpleasant truth for beginners in Data Science: Why knowing math does not guarantee a job?
Good math ≠ readiness for real work.
You can understand
linear algebra, statistics, gradient descent, probabilities and still fail in practice.
➡️ Why does this happen?
Because the job of a Data Scientist is not just formulas.
It’s also: dirty data, unclear requirements, weak baselines, strange business constraints, communication with people. In reality, the task rarely looks like it does in a textbook.
Math helps, but does not replace practice
Math gives understanding of why the model works, where it can break, how to read metrics, how not to believe in magic
But it won’t teach you how to clean data, how to build a pipeline, how to write production code, how to do a proper train/test split, how to explain results to the business
Main mistake of beginners:
They think:
"First I’ll learn all the math, then I’ll start projects."
Problem is,
"all the math" never ends.
Jobs are given not for knowing formulas, but for the ability to solve problems.
What they really look for in interviews?
Usually they want to understand can you work with data, do you understand metrics, can you create a baseline, do you see leakage, can you explain your solution, do you have projects
Math is important but by itself it doesn’t sell you as a specialist.
What to do instead of endless theory?
The best way is learn math as needed, work on projects in parallel, analyze model errors, write code by hand, learn to explain conclusions in simple words
Theory should strengthen practice, not replace it.
Main insight: math is the foundation. But a house is not built by foundation alone.
In one sentence: to get a job in DS/ML, knowing formulas is not enough, you need to be able to turn data into working solutions.
••••••••••••••••••••••••••••••••••••••
🤖 Data & ML | @DataXplore | 272 |
| 16 | How to make AI not cheat if the model is smarter than the checker?
Models are getting smarter and smarter, and soon we won't be able to adequately check their output - either ourselves or with other models-judges. There's a real risk of sandbagging: when a powerful neural network deliberately pretends to be stupid (or hides bugs in the code) to deceive a weaker checker.
A group of researchers from Anthropic and Redwood Research conducted a decent experiment to cure this problem.
➡️ How to cure knowledge sabotage with standard training methods?
The experiment scheme involved a "Red vs. Blue" confrontation with a common metric - how well the Blues could extract the real performance of the model.
Blues aren't a competitor who gains something. Their role is purely instrumental: to assemble the most robust sabotage model so that the Blues have something to break.
The Blues trained GPT-OSS-120B to sandbag on a hidden flag in the system prompt: output plausible nonsense when the flag is on and work honestly when it's off.
The Blues had to extract the real capabilities of it, using models that were inherently weaker as judges - GPT-4o-mini and Llama 3.1 8B. The Blues had almost no reference answers (only 10 pieces for the entire dataset).
➜ How the Blues broke the sabotage?
☞ Only RL. A total failure.
If the model persistently sandbags, RL on a weak judge goes into reward hacking (learns to generate answers that look good to the judge but don't actually solve the task). On olympiad math and Super GPQA, almost all runs failed.
☞ SFT + RL (on the weak model's reference answers). Already better.
First, a short SFT-downtune of 80–100 steps, which is enough to break the sabotage pattern and make the model at least start trying to give correct answers. And then a top-up via RL on a weak reward signal.
➜ What the numbers say?
The SFT+RL combination extracts 86–99% of the model's real performance on hard benchmarks (Super GPQA, olympiad math, Code Contests), even when the checker was a frankly weak network.
By the way, it turned out that 10 reference examples for SFT are enough to restore the model's brain almost completely (83–99% on different tasks).
➜ Nuances
The magic only works as long as the model can't distinguish between test runs and production.
The authors tested a scenario where the Blues give the model full knowledge about which tasks belong to the training set and which to the deployment (the distributions remain the same, the difference is only that the model knows where it is).
In this scenario, the model changes its behavior: it behaves ok during SFT and RL, collects all the rewards, and in production it starts to mess up again.
On olympiads, elicitation in the production environment dropped from 88% to 36%, on Super GPQA - from 99% to 60%, on Code Contests - from 93% to 66%.
How a naturally born schematist, whose tendency to cheat broke through pretraining or extensive RL, will behave is an open question. The authors themselves mention this as the main limitation of the work.
➜ Output
Only RL on a weak judge can extract 86–99% of the model's real performance on hard benchmarks (Super GPQA, olympiad math, Code Contests), even when the checker was a frankly weak network.
But if the model can somehow understand that it's on a test stand - it will mimic.
For now, it's still possible to make AI stop cheating with basic pipelines without esoteric methods, and its good news, but if neural network somehow understands that it's on a test stand - it will mimic.
Article • #AI #ML #LLM #Research #Alignment
••••••••••••••••••••••••••••••••••••••
🤖 Data & ML | @DataXplore | 268 |
| 17 | How to Fight Overfitting in Neural Networks?
Overfitting is a moment when the model:
Perfectly knows train data and Performs poorly on new data
It memorizes, not generalizes.
➡️ How to fight this in practice?
1. More Data
The most reliable way.
If there's not enough data:
☞ collect new data
☞ do data augmentation
☞ use synthetic data
More diversity = less chance of memorizing noise.
2. Regularization
Add a penalty for model complexity.
Main options:
☞ L2 (weight decay)
☞ L1
Less weight → simpler model → less overfitting.
3. Dropout
During training, random neurons are "turned off".
What happens:
☞ the model can't rely on specific connections
☞ learns to be more robust
Usually used:
☞ 0.2 – 0.5
4. Early Stopping
Monitor validation:
☞ train loss drops
☞ val loss first drops, then rises
We stop training when val loss starts rising.
This is one of the most effective methods.
5. Simplify the Model
Sometimes the solution is obvious:
☞ fewer layers
☞ fewer parameters
☞ simpler architecture
A larger model is easier to overfit.
6. Data Augmentation
Especially important for:
CV:
☞ rotations
☞ noise
☞ crops
NLP:
☞ rephrasing
☞ substitutions
The model sees more variants of the same thing.
7. Batch Normalization
Helps:
☞ stabilize training
☞ slightly reduce overfitting
Not the main solution, but it reinforces the others.
8. Proper Validation
If the split is bad, you won't notice the problem.
Use:
☞ train / val / test
☞ k-fold with small data
Otherwise, you'll be optimizing an illusion.
Main Insight: overfitting is a signal:
☞ either not enough data
☞ or model is too complex
☞ or training is set up incorrectly
In One Sentence: To reduce overfitting - add data or reduce model complexity.
••••••••••••••••••••••••••••••••••••••
🤖 Data & ML | @DataXplore | 254 |
| 18 | You've probably encountered requirements for model calibration, interpretability and explainability. But what if at some point you need to answer the question: How reliable is this particular prediction?
Here's where conformal prediction comes to rescue. This approach adds an uncertainty estimate to model predictions.
Regular model says demand tomorrow will be 100 units and Conformal prediction says, demand tomorrow will likely be between 80 and 125 units.
In classification: a regular model says image shows object A and Conformal prediction says the most likely object in the image is A or C.
If we formalize problem, we want to construct such intervals or sets of answers that will contain correct answer in approximately p% of cases on new data. Size of this set shows the model's uncertainty: the wider it is, the less confident the model appears to be.
➡️ Where, How, and What tools to use?
Where can this be useful?
• Medicine. In medical tasks, one confident but incorrect diagnosis can be costly. Conformal prediction allows the model not to pretend to be omniscient. Instead of one diagnosis, it can give several likely options. For the doctor, this is an additional hint where to look more closely.
• Predictive maintenance. Instead of a point estimate, we get a risk window when important equipment fails.
• Retail. A forecast like "we'll sell 407 packages of ice cream tomorrow" looks too magical and doesn't give a full picture of demand. A range of forecasts is more convenient for inventory management: we can estimate the risk of a shortage or an overstocked warehouse.
How does it work?
1. We have a trained model. We take a separate part of the data that the model did not see during training. This is usually called a calibration sample.
2. On this sample, we look at how much the model is wrong.
3. We collect such errors and choose a threshold that covers the necessary percentage of cases. For example, we want 90% reliability - we take such a value of error that approximately 90% of past correct answers fall within the corresponding interval.
4. After this, for a new object, the model makes a prediction, and we add an interval around it.
The main idea: if the model on similar data usually made errors no more than a certain value, we use this value as a protective "gap" around future predictions.
It's important to remember that conformal prediction works well when the future data are similar to those on which we calibrated. But if the world has changed dramatically - for example, there was a pandemic, a crisis, or a sudden hype on social networks - past errors may not describe the future well enough.
How to USE it?
For a quick start, you can take a look at MAPIE. This is a library in the style of scikit-learn that allows you to build prediction intervals for regression and prediction sets for classification. Another good option is PUNCC. It's more flexible and allows you to apply conformal prediction on top of various models, including sklearn models, neural networks, and custom pipelines.
••••••••••••••••••••••••••••••••••••••
🤖 Data & ML | @DataXplore | 234 |
| 19 | Pre-processing and decoding in LLM inference
Have you ever wondered why first token always appears with a delay, while rest of stream proceeds almost instantly?
It's not network latency or model warmup, it's a structural property of how LLMs actually execute.
Inference consists of two phases that use the same model and execution path, but the workload in each phase is fundamentally different, and the bottlenecks are opposite.
𝗣𝗿𝗲𝗳𝗶𝗹𝗹 — this is the request processing phase. The model processes all input tokens in a single parallel pass, computing Q, K, and V for all tokens at once.
The attention mechanism is implemented as a large matrix operation, for which GPUs are optimized, so the computational units are heavily loaded and the chip operates at the limit of its arithmetic throughput.
Pre-processing is memory-bound, and the metric that reflects this is the time to the first token.
𝗗𝗲𝗰𝗼𝗱𝗲 starts after the first token appears. To generate the next one, the model calculates Q, K, and V only for the new token, because everything previous is already cached.
Then comes the "one token - one pass" cycle: the new query is multiplied by the already stored keys instead of the full matrix, and the computational volume becomes small.
However, the GPU still has to read all weights and the entire cache from memory to perform even this small operation, so the memory bandwidth becomes the bottleneck, and the computational units are idle.
Decoding is memory-bound, and the metric here is the delay between tokens.
This separation explains a number of effects that seem non-obvious from the outside.
The GPU load is high during pre-processing and drops sharply during decoding, because in the second phase, the memory becomes the limiting factor rather than the computations.
Adding computational power often doesn't help with slow generation, because for memory-bound workloads, the solution is faster memory or a smaller cache, not more FLOPs.
A long context slows down generation disproportionately, because the key and value caches grow with each token, and each step of decoding must read them all.
This cache is a key optimization, without which decoding would be impossible, because the attention would have to be recalculated for the entire growing sequence at each step.
With the cache, it's built once during pre-processing and then expanded by one element for each new token, reusing already computed values.
However, the cache is stored in GPU memory and grows linearly with the sequence length. For a 13B model, this is about 1 MB per token, so a 4K context occupies about 4 GB of video memory just for the cache.
Therefore, a long context feels slow not because of the "lack of model power," but because of the memory pressure.
Currently, the industry is optimizing this limitation through quantized caches, sliding windows, grouped attention, and PagedAttention, while the DeepSeek V4 series goes further and redesigns the attention mechanism itself to make the cache smaller from the start.
When attention starts being redesigned for memory constraints, it means that the limitation has shifted towards memory.
Practical takeaway: if the model seems slow, it's important to distinguish — is it slow starting or slow streaming? A slow start corresponds to pre-processing and computational bottlenecks, while slow streaming corresponds to decoding and memory limitations.
Read further material that breaks down LLM inference from scratch: tokenization, embeddings, attention, the separation of pre-processing and decoding, key/value caches and quantization.
••••••••••••••••••••••••••••••••••••••
🤖 Data & ML | @DataXplore | 195 |
| 20 | In an interview about ML engineering at Apple, question is asked:
There are two models with an accuracy of 88%.
- Model A has a confidence of 89%
- Model B has a confidence of 99%
Which would you choose?
The ANSWER "any of them, they have the same accuracy" ends the interview.
➡️ what's missing?
Modern neural networks often mislead.
They give an overconfidence in predictions.
For example, in one experiment on the CIFAR-100 dataset, LeNet and ResNet were compared.
LeNet:
- accuracy ≈ 0.55
- average confidence ≈ 0.54
ResNet:
- accuracy ≈ 0.7
- average confidence ≈ 0.9
Despite the higher accuracy, ResNet is overconfident in its predictions. The model believes it's right with a 90% probability, but the actual accuracy is about 70%.
Calibration solves this problem.
A model is considered calibrated if the probabilities of predictions correspond to the real outcomes.
For example: if the model gives a probability of 70%, then in about 70% of cases, the event should actually occur.
This is important because such models are used in decision-making.
A poorly calibrated but confident model can give critically misleading results.
Example: a state hospital plans expensive medical tests.
A realistic assessment of probabilities helps optimally allocate the budget and make decisions.
If the model is not calibrated, it will give overly confident predictions.
Reliability diagrams are used to visually check calibration.
They show the dependence of the actual accuracy on the predicted confidence (softmax values).
An ideally calibrated model gives a line y = x.
They also use a scalar metric - expected calibration error (ECE).
One of its approximations is to divide the predictions into intervals and average the difference between accuracy and confidence across these bins.
The main methods of model calibration:
For binary classification:
- histogram binning
- isotonic regression
- Platt scaling
For multi-class classification:
- binning
- matrix and vector scaling
••••••••••••••••••••••••••••••••••••••
🤖 Data & ML | @DataXplore | 207 |
متاح الآن! بحث تيليغرام 2025 — أهم رؤى العام 
