Data science/ML/AI
Data science and machine learning hub Python, SQL, stats, ML, deep learning, projects, PDFs, roadmaps and AI resources. For beginners, data scientists and ML engineers 👉 https://rebrand.ly/bigdatachannels DMCA: @disclosure_bds Contact: @mldatascientist
📈 Analytical overview of the Telegram channel Data science/ML/AI
The Data science/ML/AI channel (@datascience_bds) belongs to the English-language segment. It currently has 13,674 subscribers and ranks 9,380th in the Technology & Applications category and 31,607th in the India region.
📊 Audience metrics and dynamics
Since its creation (date unknown), the channel has grown to an audience of 13,674 subscribers.
According to the latest data, from June 10, 2026, the channel shows stable activity. Over the last 30 days the number of subscribers changed by 143, and over the last 24 hours by 2, while overall reach remains high.
- Verification status: Not verified
- Engagement rate (ER): The average audience engagement rate is 8.09%. In the first 24 hours after publication, a post typically collects reactions from 2.22% of all subscribers.
- Post reach: On average, each post receives 1,106 views. Within the first 24 hours, a post collects 304 views.
- Reactions and interactions: The average number of reactions per post is 5.
- Topical interests: Content focuses on key topics such as panda, learning, row, api, ethic.
📝 Description and content policy
The author describes the channel as follows:
“Data science and machine learning hub
Python, SQL, stats, ML, deep learning, projects, PDFs, roadmaps and AI resources.
For beginners, data scientists and ML engineers
👉 https://rebrand.ly/bigdatachannels
DMCA: @disclosure_bds
Contact: @mldatasci...”
Thanks to frequent updates (latest data retrieved June 11, 2026), the channel stays current and maintains a high level of post reach. Analytics show that the audience actively engages with the content, which makes the channel an important point of influence in the Technology & Applications category.
• Formula: (x - mean) / standard_deviation
• When to use:
• When your data follows a Gaussian (Normal) distribution.
• When your algorithm assumes features are normally distributed.
• When you have outliers (Standardization is less distorted by them than Normalization, although the mean and standard deviation are still pulled by extreme values).
• Vibe: "Let's put everyone on a common baseline relative to the average."
2. Normalization (Min-Max Scaling) ↔️
• What it does: Scales data to a fixed range, usually 0 to 1. It squeezes all values into this specific interval.
• Formula: (x - min) / (max - min)
• When to use:
• When you know your data doesn't follow a Gaussian distribution.
• When your algorithm requires inputs to be within a specific range (e.g., some neural network activation functions).
• When you don't have outliers (Normalization is very sensitive to extreme values).
• Vibe: "Let's squeeze everyone into this exact box, no matter what."
🐍 Code Example: Seeing the Difference with Scikit-learn
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Sample Data: 'Age' (small range) vs. 'Income' (large range)
data = {
    'Age': [25, 30, 45, 60, 20, 70],
    'Income': [40000, 60000, 90000, 150000, 30000, 1000000]  # An outlier in income!
}
df = pd.DataFrame(data)
print("Original Data:")
print(df)
# --- 1. Standardization ---
scaler_std = StandardScaler()
df_standardized = scaler_std.fit_transform(df)
print("\nStandardized Data (Mean=0, Std=1):")
print(pd.DataFrame(df_standardized, columns=df.columns))
# --- 2. Normalization ---
scaler_minmax = MinMaxScaler()
df_normalized = scaler_minmax.fit_transform(df)
print("\nNormalized Data (Range 0-1):")
print(pd.DataFrame(df_normalized, columns=df.columns))
Key Observation in Output:
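Notice how the 1,000,000 income outlier squeezes every other Income value into roughly the bottom 12% of the 0-1 range under Normalization, making them tiny. Standardization is a linear rescaling too, so the outlier still inflates the mean and standard deviation, but the results are not forced into a fixed range: the outlier simply lands a bit more than two standard deviations above the mean.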
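To check the effect of the outlier by hand, both formulas can be applied directly with NumPy. This is a minimal sketch on the Income column only; it assumes the population standard deviation (ddof=0), which is what StandardScaler uses.

```python
import numpy as np

# Income column from the example above, including the 1,000,000 outlier
income = np.array([40000, 60000, 90000, 150000, 30000, 1000000], dtype=float)

# Standardization: (x - mean) / standard_deviation
# np.std defaults to the population standard deviation (ddof=0)
standardized = (income - income.mean()) / income.std()

# Normalization: (x - min) / (max - min)
normalized = (income - income.min()) / (income.max() - income.min())

print("Standardized:", np.round(standardized, 2))
print("Normalized:  ", np.round(normalized, 3))

# Largest normalized value among the non-outlier incomes
largest_non_outlier = np.sort(normalized)[-2]
print("Largest non-outlier after Min-Max scaling:", round(largest_non_outlier, 3))
```

Both transforms are linear, so the ordering and relative spacing of the values are the same in each output; only the location and scale differ.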
The Takeaway 🧠
There's no single "best" scaling method. Your choice depends on:
1. The distribution of your data.
2. The specific Machine Learning algorithm you're using.
3. The presence of outliers.
Always experiment and evaluate which scaling method performs best for your particular task!
employees
id  name    department  salary  manager_id
1   Aditi   HR          30000   5
2   Rahul   IT          50000   6
3   Neha    IT          60000   6
4   Aman    Sales       40000   7
5   Kiran   HR          70000   NULL
6   Mohit   IT          80000   NULL
7   Suresh  Sales       65000   NULL
8   Pooja   HR          30000   5
1️⃣ Find average salary per department
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;
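One way to try these queries without a database server is Python's built-in sqlite3 module. The sketch below assumes an in-memory SQLite database and guesses the column types, since the post does not state them. It loads the employees table and runs the average-salary query, with an ORDER BY added only to make the output order predictable.

```python
import sqlite3

# Rows of the employees table shown above; None stands in for NULL
rows = [
    (1, "Aditi", "HR", 30000, 5),
    (2, "Rahul", "IT", 50000, 6),
    (3, "Neha", "IT", 60000, 6),
    (4, "Aman", "Sales", 40000, 7),
    (5, "Kiran", "HR", 70000, None),
    (6, "Mohit", "IT", 80000, None),
    (7, "Suresh", "Sales", 65000, None),
    (8, "Pooja", "HR", 30000, 5),
]

conn = sqlite3.connect(":memory:")
# Column types are an assumption made for this sketch
conn.execute(
    "CREATE TABLE employees ("
    "id INTEGER PRIMARY KEY, name TEXT, department TEXT, "
    "salary INTEGER, manager_id INTEGER)"
)
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?, ?)", rows)

# Average salary per department
avg_by_dept = conn.execute(
    "SELECT department, AVG(salary) AS avg_salary "
    "FROM employees GROUP BY department ORDER BY department"
).fetchall()

for department, avg_salary in avg_by_dept:
    print(department, round(avg_salary, 2))
```

Once the table exists, the other queries can be passed to conn.execute in the same way.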
2️⃣ Find employees earning above their department average
SELECT name, department, salary
FROM employees e
WHERE salary > (
SELECT AVG(salary)
FROM employees
WHERE department = e.department
);
3️⃣ Find highest salary in each department
SELECT department, MAX(salary) AS max_salary
FROM employees
GROUP BY department;
4️⃣ Find employees who earn more than their manager
SELECT e.name
FROM employees e
JOIN employees m ON e.manager_id = m.id
WHERE e.salary > m.salary;
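The same self-join can be sketched in pandas by merging the table with a renamed copy of itself. The DataFrame below mirrors the sample employees table; the intermediate column names (manager, manager_salary) are made up for illustration. On this sample data no employee earns more than their manager, so the result is empty.

```python
import pandas as pd

employees = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8],
    "name": ["Aditi", "Rahul", "Neha", "Aman", "Kiran", "Mohit", "Suresh", "Pooja"],
    "salary": [30000, 50000, 60000, 40000, 70000, 80000, 65000, 30000],
    "manager_id": [5, 6, 6, 7, None, None, None, 5],
})

# An inner join never matches NULL keys, so drop employees without a manager
# and restore an integer key for the merge
has_manager = employees.dropna(subset=["manager_id"]).astype({"manager_id": int})

# Manager-side view of the same table, renamed to avoid column clashes
managers = employees[["id", "name", "salary"]].rename(
    columns={"id": "manager_id", "name": "manager", "salary": "manager_salary"}
)

joined = has_manager.merge(managers, on="manager_id", how="inner")
result = joined[joined["salary"] > joined["manager_salary"]]

print(joined[["name", "salary", "manager", "manager_salary"]])
print("Employees earning more than their manager:", result["name"].tolist())
```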
5️⃣ Count employees in each department
SELECT department, COUNT(*) AS total_employees
FROM employees
GROUP BY department;
6️⃣ Find departments with more than 2 employees
SELECT department, COUNT(*) AS total
FROM employees
GROUP BY department
HAVING COUNT(*) > 2;
7️⃣ Find the second highest salary
SELECT MAX(salary)
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
8️⃣ Find employees without managers
SELECT name
FROM employees
WHERE manager_id IS NULL;
9️⃣ Rank employees by salary
SELECT name, salary,
RANK() OVER (ORDER BY salary DESC) AS salary_rank -- "rank" is a reserved word in some databases (e.g. MySQL 8+), so use a different alias
FROM employees;
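For comparison, here is a minimal pandas sketch of the same ranking. SQL's RANK() gives tied rows the same rank and then skips ahead, which corresponds to rank(method="min"). DENSE_RANK() would correspond to method="dense". The DataFrame repeats the name and salary columns of the sample table.

```python
import pandas as pd

employees = pd.DataFrame({
    "name": ["Aditi", "Rahul", "Neha", "Aman", "Kiran", "Mohit", "Suresh", "Pooja"],
    "salary": [30000, 50000, 60000, 40000, 70000, 80000, 65000, 30000],
})

# method="min" mirrors RANK(): ties share the lowest rank in their group
employees["salary_rank"] = (
    employees["salary"].rank(method="min", ascending=False).astype(int)
)

ranked = employees.sort_values("salary_rank").reset_index(drop=True)
print(ranked)
```

Aditi and Pooja both earn 30000, so they share the last rank.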
🔟 Find duplicate salary values
SELECT salary, COUNT(*)
FROM employees
GROUP BY salary
HAVING COUNT(*) > 1;
1️⃣1️⃣ Top 2 highest unique salaries
SELECT DISTINCT salary
FROM employees
ORDER BY salary DESC
LIMIT 2;
1️⃣2️⃣ Find the total salary payout per department
SELECT department, SUM(salary) AS total_payout
FROM employees
GROUP BY department;
1️⃣3️⃣ Find employees whose names start with 'A'
SELECT name
FROM employees
WHERE name LIKE 'A%';
1️⃣4️⃣ Find the manager's name for each employee (Self-Join)
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id;
1️⃣5️⃣ Find the department with the highest total salary expenditure
SELECT department
FROM employees
GROUP BY department
ORDER BY SUM(salary) DESC
LIMIT 1;
👉 Follow @datascience_bds for more.
.describe(), .info()
• Data Visualization:
• Histograms: For the distribution of a single numerical variable.
• Box Plots: To see distribution, outliers, and compare across categories.
• Scatter Plots: To check for relationships between two numerical variables.
• Bar Charts: For counts of categorical variables.
• Missing Value Analysis: Identify how much data is missing and where.
• Outlier Detection: Find extreme values that might skew your results.
• Correlation Matrices: Visualize how numerical variables relate to each other.
🐍 A Peek into EDA with Python (Pandas & Matplotlib/Seaborn)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Assume 'df' is your loaded DataFrame
# --- Basic Overview ---
print("Dataset Info:")
df.info() # Data types, non-null counts
print("\nSummary Statistics:")
print(df.describe()) # For numerical columns
# --- Visualizing Distributions ---
# For a numerical column like 'Age'
plt.figure(figsize=(8, 4))
sns.histplot(data=df, x='Age', kde=True) # kde=True adds a smooth curve
plt.title('Distribution of Age')
plt.show()
# For a categorical column like 'Category'
plt.figure(figsize=(8, 4))
sns.countplot(data=df, y='Category') # Use y for horizontal bars if many categories
plt.title('Count of Categories')
plt.show()
# --- Checking for Relationships ---
# Between two numerical columns like 'Salary' and 'YearsExperience'
plt.figure(figsize=(8, 5))
sns.scatterplot(data=df, x='YearsExperience', y='Salary', hue='Department') # hue adds another dimension
plt.title('Salary vs. Years of Experience')
plt.show()
# --- Correlation Matrix (for numerical columns) ---
plt.figure(figsize=(10, 7))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm') # numeric_only=True skips non-numeric columns; annot=True shows numbers, cmap changes color scheme
plt.title('Correlation Matrix of Numerical Features')
plt.show()
(Note: correlations are only defined for numerical columns. In pandas 2.0+, df.corr() raises an error on non-numeric columns unless you pass numeric_only=True. For categorical data, use other methods such as df.groupby().)
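The missing-value and outlier checks listed among the EDA techniques are not covered by the plotting walkthrough above. Here is a minimal sketch, assuming a small hypothetical DataFrame (column names and values are invented for illustration) and the common 1.5 × IQR rule for flagging outliers, which is one option among several:

```python
import pandas as pd

# Hypothetical data with one missing value per column and two suspicious ages
df = pd.DataFrame({
    "Age": [25, 30, 28, 35, 32, 27, -3, 120, None],
    "Category": ["A", "B", "A", None, "B", "A", "B", "A", "B"],
})

# --- Missing value analysis: count and share of missing entries per column ---
missing_counts = df.isna().sum()
missing_share = df.isna().mean().round(3)
print(pd.DataFrame({"missing": missing_counts, "share": missing_share}))

# --- Outlier detection with the 1.5 * IQR rule (an assumed rule of thumb) ---
q1 = df["Age"].quantile(0.25)
q3 = df["Age"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Comparisons with NaN evaluate to False, so missing ages are not flagged
outliers = df[(df["Age"] < lower) | (df["Age"] > upper)]
print(f"Bounds: [{lower}, {upper}]")
print(outliers)
```

Flagged rows are candidates for inspection, not automatic deletion: a negative age is a data error, while an unusually high value may be genuine.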
💡 The "Aha!" Moment
EDA is where you uncover the hidden stories in your data. You might find:
• "Wow, everyone in the 'South' region buys this product!"
• "This 'Age' column has weird negative values, something's wrong."
• "Sales dramatically drop after the 15th of the month."
🎯 What should you do?
✔️ Treat your data like a mystery to be solved, not a problem to be fixed.
✔️ Use basic plots and stats to understand your variables before building models.
✔️ Identify potential data quality issues early.
I'll just give them numbers! Paris = 1, Tokyo = 2, New York = 3.
⛔️ The Danger: The model will think New York (3) is "greater than" Paris (1). It might assume New York is three times more "City-ish" than Paris. This creates fake relationships in your data.
✅ The Right Way: One-Hot Encoding (Dummy Variables)
We create new columns for every category. If the row is "Paris," the Paris column gets a 1 and the others get a 0. It’s binary. It’s fair. It’s mathematical.
🐍 Let's see it in Python (Pandas)
import pandas as pd
# 1. Create a tiny dataset
data = {'User': ['Alex', 'Sam', 'Jo'],
        'Plan': ['Free', 'Premium', 'Free']}
df = pd.DataFrame(data)
# 2. The Magic Function
df_encoded = pd.get_dummies(df, columns=['Plan'])
print(df_encoded)
⚠️ The "Dummy Variable Trap" (Pro Tip)
If you know someone is NOT on the Free plan, you automatically know they ARE on the Premium plan (in a 2-plan system). Having both columns creates "Multicollinearity": basically, telling the model the same thing twice.
The Fix: Pass drop_first=True to pd.get_dummies to remove the redundant column and keep your model lean. This matters most for linear models such as unregularized linear or logistic regression.
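A minimal sketch of the fix, reusing the tiny User/Plan dataset from the example above. With drop_first=True, pandas drops the first category in sorted order (here 'Free'), so a single Plan_Premium column remains.

```python
import pandas as pd

# Same tiny dataset as in the one-hot encoding example
df = pd.DataFrame({
    "User": ["Alex", "Sam", "Jo"],
    "Plan": ["Free", "Premium", "Free"],
})

# drop_first=True removes the first category's column ('Free'),
# leaving one indicator column that still fully encodes the plan
df_encoded = pd.get_dummies(df, columns=["Plan"], drop_first=True)
print(df_encoded)

# The dropped category is implied: a row that is not Premium must be Free
is_free = ~df_encoded["Plan_Premium"].astype(bool)
print(is_free.tolist())
```

Depending on the pandas version, the indicator column holds booleans or 0/1 integers; the information is the same either way.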
🎯 Today's Takeaway
Don't just feed raw text into a model. Transform your categories into a "One-Hot" format to ensure your machine understands the difference without inventing a hierarchy.
