Data science/ML/AI
Data science and machine learning hub. Python, SQL, stats, ML, deep learning, projects, PDFs, roadmaps and AI resources. For beginners, data scientists and ML engineers. 👉 https://rebrand.ly/bigdatachannels DMCA: @disclosure_bds Contact: @mldatascientist
📈 Analytical overview of Telegram channel Data science/ML/AI
Channel Data science/ML/AI (@datascience_bds) in the English language segment is an active participant. Currently, the community unites 13 685 subscribers, ranking 9 380 in the Technologies & Applications category and 31 607 in the India region.
📊 Audience metrics and dynamics
Since its creation (date unknown), the project has demonstrated rapid growth, gathering an audience of 13 685 subscribers.
According to the latest data from 10 June, 2026, the channel demonstrates stable activity. Although there has been a change in the number of participants by 143 over the last 30 days and by 2 over the last 24 hours, overall reach remains high.
- Verification status: Not verified
- Engagement rate (ER): The average audience engagement rate is 8.09%. Within the first 24 hours after publication, content typically collects 2.22% reactions from the total number of subscribers.
- Post reach: On average, each post receives 1 106 views. Within the first day, a publication typically gains 304 views.
- Reactions and interaction: The audience actively supports content: the average number of reactions per post is 5.
- Thematic interests: Content is focused on key topics such as panda, learning, row, api, ethic.
Thanks to the high frequency of updates (latest data received on 11 June, 2026), the channel maintains relevance and a high level of publication reach. Analytics show that the audience actively interacts with content, making it an important point of influence in the Technologies & Applications category.
• Formula: (x - mean) / standard_deviation
• When to use:
• When your data follows a Gaussian (Normal) distribution.
• When your algorithm assumes features are normally distributed.
• When you have outliers (Standardization is less affected by them than Normalization).
• Vibe: "Let's put everyone on a common baseline relative to the average."
2. Normalization (Min-Max Scaling) ↔️
• What it does: Scales data to a fixed range, usually 0 to 1. It squeezes all values into this specific interval.
• Formula: (x - min) / (max - min)
• When to use:
• When you know your data doesn't follow a Gaussian distribution.
• When your algorithm requires inputs to be within a specific range (e.g., some neural network activation functions).
• When you don't have outliers (Normalization is very sensitive to extreme values).
• Vibe: "Let's squeeze everyone into this exact box, no matter what."
🐍 Code Example: Seeing the Difference with Scikit-learn
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Sample Data: 'Age' (small range) vs. 'Income' (large range)
data = {
    'Age': [25, 30, 45, 60, 20, 70],
    'Income': [40000, 60000, 90000, 150000, 30000, 1000000],  # An outlier in income!
}
df = pd.DataFrame(data)
print("Original Data:")
print(df)
# --- 1. Standardization ---
scaler_std = StandardScaler()
df_standardized = scaler_std.fit_transform(df)
print("\nStandardized Data (Mean=0, Std=1):")
print(pd.DataFrame(df_standardized, columns=df.columns))
# --- 2. Normalization ---
scaler_minmax = MinMaxScaler()
df_normalized = scaler_minmax.fit_transform(df)
print("\nNormalized Data (Range 0-1):")
print(pd.DataFrame(df_normalized, columns=df.columns))
Key Observation in Output:
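Notice how the 1,000,000 income outlier pulls every other Income value towards 0 under Normalization: the remaining incomes all land below about 0.13. Standardization has no fixed range, so the outlier simply sits about 2.2 standard deviations above the mean. It is not immune, though: both methods are linear rescalings, and the outlier inflates the mean and standard deviation, so the other incomes end up bunched between roughly -0.57 and -0.23.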
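To see the effect in actual numbers, here is a minimal sketch that applies both formulas by hand to the Income column, using only the standard library. The helper names standardize and min_max are made up for this illustration. pstdev (population standard deviation) is used on the assumption that you want to match StandardScaler, which divides by the population standard deviation.

```python
from statistics import fmean, pstdev

# Income column from the example above, including the 1,000,000 outlier
incomes = [40000, 60000, 90000, 150000, 30000, 1000000]

def standardize(xs):
    """(x - mean) / standard_deviation, using the population std."""
    mu, sigma = fmean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]

def min_max(xs):
    """(x - min) / (max - min), which maps the data onto [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

z = standardize(incomes)
mm = min_max(incomes)

# Print the raw value next to both scaled versions
for x, zi, mi in zip(incomes, z, mm):
    print(f"{x:>9,}  z={zi:+.3f}  minmax={mi:.3f}")
```

Both columns keep the same ordering and the same relative spacing. What changes is the scale and whether the result is forced into a fixed range.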
The Takeaway 🧠
There's no single "best" scaling method. Your choice depends on:
1. The distribution of your data.
2. The specific Machine Learning algorithm you're using.
3. The presence of outliers.
Always experiment and evaluate which scaling method performs best for your particular task!
employees
id | name   | department | salary | manager_id
1  | Aditi  | HR         | 30000  | 5
2  | Rahul  | IT         | 50000  | 6
3  | Neha   | IT         | 60000  | 6
4  | Aman   | Sales      | 40000  | 7
5  | Kiran  | HR         | 70000  | NULL
6  | Mohit  | IT         | 80000  | NULL
7  | Suresh | Sales      | 65000  | NULL
8  | Pooja  | HR         | 30000  | 5
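If you want to try the queries in this post yourself, one option is to load the table into an in-memory SQLite database from Python. This is only a sketch: the column types and the run helper are assumptions made for the illustration, and other databases (MySQL, PostgreSQL) may differ in small details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        name TEXT,
        department TEXT,
        salary INTEGER,
        manager_id INTEGER  -- NULL for employees without a manager
    )
""")
rows = [
    (1, "Aditi", "HR", 30000, 5),
    (2, "Rahul", "IT", 50000, 6),
    (3, "Neha", "IT", 60000, 6),
    (4, "Aman", "Sales", 40000, 7),
    (5, "Kiran", "HR", 70000, None),
    (6, "Mohit", "IT", 80000, None),
    (7, "Suresh", "Sales", 65000, None),
    (8, "Pooja", "HR", 30000, 5),
]
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?, ?)", rows)

def run(sql):
    """Hypothetical helper: execute a query and return all rows."""
    return conn.execute(sql).fetchall()

# Employees earning above their department average
above_avg = run("""
    SELECT name, department, salary
    FROM employees e
    WHERE salary > (SELECT AVG(salary) FROM employees WHERE department = e.department)
    ORDER BY name
""")
print(above_avg)

# Employees who earn more than their manager
more_than_manager = run("""
    SELECT e.name
    FROM employees e
    JOIN employees m ON e.manager_id = m.id
    WHERE e.salary > m.salary
""")
print(more_than_manager)
```

With this sample data the second query returns no rows, because every manager out-earns their reports. An empty result is worth recognising as a valid answer in an interview.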
1️⃣ Find average salary per department
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;
2️⃣ Find employees earning above their department average
SELECT name, department, salary
FROM employees e
WHERE salary > (
    SELECT AVG(salary)
    FROM employees
    WHERE department = e.department
);
3️⃣ Find highest salary in each department
SELECT department, MAX(salary) AS max_salary
FROM employees
GROUP BY department;
4️⃣ Find employees who earn more than their manager
SELECT e.name
FROM employees e
JOIN employees m ON e.manager_id = m.id
WHERE e.salary > m.salary;
5️⃣ Count employees in each department
SELECT department, COUNT(*) AS total_employees
FROM employees
GROUP BY department;
6️⃣ Find departments with more than 2 employees
SELECT department, COUNT(*) AS total
FROM employees
GROUP BY department
HAVING COUNT(*) > 2;
7️⃣ Find the second highest salary
SELECT MAX(salary)
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
8️⃣ Find employees without managers
SELECT name
FROM employees
WHERE manager_id IS NULL;
9️⃣ Rank employees by salary
-- The alias is salary_rank because RANK is a reserved word in MySQL 8+
SELECT name, salary,
       RANK() OVER (ORDER BY salary DESC) AS salary_rank
FROM employees;
🔟 Find duplicate salary values
SELECT salary, COUNT(*)
FROM employees
GROUP BY salary
HAVING COUNT(*) > 1;
1️⃣1️⃣ Top 2 highest unique salaries
SELECT DISTINCT salary
FROM employees
ORDER BY salary DESC
LIMIT 2;
1️⃣2️⃣ Find the total salary payout per department
SELECT department, SUM(salary) AS total_payout
FROM employees
GROUP BY department;
1️⃣3️⃣ Find employees whose names start with 'A'
SELECT name
FROM employees
WHERE name LIKE 'A%';
1️⃣4️⃣ Find the manager's name for each employee (Self-Join)
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id;
1️⃣5️⃣ Find the department with the highest total salary expenditure
SELECT department
FROM employees
GROUP BY department
ORDER BY SUM(salary) DESC
LIMIT 1;
👉 Follow @datascience_bds for more.
.describe(), .info()
• Data Visualization:
• Histograms: For the distribution of a single numerical variable.
• Box Plots: To see distribution, outliers, and compare across categories.
• Scatter Plots: To check for relationships between two numerical variables.
• Bar Charts: For counts of categorical variables.
• Missing Value Analysis: Identify how much data is missing and where.
• Outlier Detection: Find extreme values that might skew your results.
• Correlation Matrices: Visualize how numerical variables relate to each other.
🐍 A Peek into EDA with Python (Pandas & Matplotlib/Seaborn)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Assume 'df' is your loaded DataFrame
# --- Basic Overview ---
print("Dataset Info:")
df.info() # Data types, non-null counts
print("\nSummary Statistics:")
print(df.describe()) # For numerical columns
# --- Visualizing Distributions ---
# For a numerical column like 'Age'
plt.figure(figsize=(8, 4))
sns.histplot(data=df, x='Age', kde=True) # kde=True adds a smooth curve
plt.title('Distribution of Age')
plt.show()
# For a categorical column like 'Category'
plt.figure(figsize=(8, 4))
sns.countplot(data=df, y='Category') # Use y for horizontal bars if many categories
plt.title('Count of Categories')
plt.show()
# --- Checking for Relationships ---
# Between two numerical columns like 'Salary' and 'YearsExperience'
plt.figure(figsize=(8, 5))
sns.scatterplot(data=df, x='YearsExperience', y='Salary', hue='Department') # hue adds another dimension
plt.title('Salary vs. Years of Experience')
plt.show()
# --- Correlation Matrix (for numerical columns) ---
plt.figure(figsize=(10, 7))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm') # numeric_only=True skips text columns; annot=True shows numbers, cmap changes color scheme
plt.title('Correlation Matrix of Numerical Features')
plt.show()
(Note: correlation is only defined for numerical columns. In pandas 2.0 and later, df.corr() raises an error on text columns unless you pass numeric_only=True. For categorical data, use other methods such as df.groupby().)
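The plots above don't cover the missing-value and outlier checks from the list, so here is a rough standard-library sketch of both. The ages list is invented sample data, and the 1.5 × IQR rule is just one common convention for flagging outliers, not the only one. In pandas the same checks are usually done with df.isna().sum() and quantiles of the column.

```python
from statistics import quantiles

# Hypothetical sample: None marks a missing value, -50 is an impossible age
ages = [25, 30, None, 45, 60, 20, 70, -50]

# --- Missing value analysis ---
missing = sum(a is None for a in ages)
present = [a for a in ages if a is not None]

# --- Outlier detection with the 1.5 * IQR rule ---
q1, _, q3 = quantiles(present, n=4)  # quartiles (default 'exclusive' method)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [a for a in present if a < low or a > high]

print(f"missing: {missing} of {len(ages)}")
print(f"IQR fences: [{low}, {high}]")
print("outliers:", outliers)
```

A flagged value is a prompt to investigate, not an instruction to delete: a negative age is clearly a data error, but a very high income may be real.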
💡 The "Aha!" Moment
EDA is where you uncover the hidden stories in your data. You might find:
• "Wow, everyone in the 'South' region buys this product!"
• "This 'Age' column has weird negative values, something's wrong."
• "Sales dramatically drop after the 15th of the month."
🎯 What should you do?
✔️ Treat your data like a mystery to be solved, not a problem to be fixed.
✔️ Use basic plots and stats to understand your variables before building models.
✔️ Identify potential data quality issues early.
I'll just give them numbers! Paris = 1, Tokyo = 2, New York = 3.
⛔️ The Danger: The model will think New York (3) is "greater than" Paris (1). It might assume New York is three times more "City-ish" than Paris. This creates fake relationships in your data.
✅ The Right Way: One-Hot Encoding (Dummy Variables)
We create new columns for every category. If the row is "Paris," the Paris column gets a 1 and the others get a 0. It’s binary. It’s fair. It’s mathematical.
🐍 Let's see it in Python (Pandas)
import pandas as pd
# 1. Create a tiny dataset
data = {'User': ['Alex', 'Sam', 'Jo'],
        'Plan': ['Free', 'Premium', 'Free']}
df = pd.DataFrame(data)
# 2. The Magic Function
df_encoded = pd.get_dummies(df, columns=['Plan'])
print(df_encoded)
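One way to see why integer codes invent a hierarchy while one-hot columns don't is to compare distances between the encoded cities. This is only an illustrative sketch: the two dictionaries are written out by hand, and pairwise is a made-up helper name.

```python
from math import dist

# Integer labels: each city becomes a single number
label = {"Paris": (1,), "Tokyo": (2,), "New York": (3,)}

# One-hot: each city gets its own 0/1 column
one_hot = {"Paris": (1, 0, 0), "Tokyo": (0, 1, 0), "New York": (0, 0, 1)}

def pairwise(encoding):
    """Euclidean distance between every pair of encoded cities."""
    names = list(encoding)
    return {
        (a, b): dist(encoding[a], encoding[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }

print(pairwise(label))    # Paris is "closer" to Tokyo than to New York
print(pairwise(one_hot))  # every pair is the same distance apart
```

With integer labels, the model sees Paris and New York as twice as far apart as Paris and Tokyo, purely because of the numbering. With one-hot vectors, all cities are equally different from each other.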
⚠️ The "Dummy Variable Trap" (Pro Tip)
If you know someone is NOT on the Free plan, you automatically know they ARE on the Premium plan (in a 2-plan system). Having both columns creates "Multicollinearity": basically, telling the model the same thing twice.
The Fix: For linear models such as linear or logistic regression, pass drop_first=True to pd.get_dummies to remove that redundant column and keep your model lean. Tree-based models are generally not hurt by the extra column, so there you can keep it.
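To make the redundancy concrete, here is a small pure-Python sketch of what dropping the first category does. The one_hot function is a hypothetical stand-in written for this illustration, not the pandas implementation. It assumes categories are sorted alphabetically and the first one is dropped, which mirrors how drop_first behaves on this example.

```python
def one_hot(values, prefix, drop_first=False):
    """Hypothetical helper: one 0/1 column per category."""
    categories = sorted(set(values))
    if drop_first:
        # The dropped category becomes the implicit "all zeros" baseline
        categories = categories[1:]
    return [{f"{prefix}_{c}": int(v == c) for c in categories} for v in values]

plans = ['Free', 'Premium', 'Free']

full = one_hot(plans, 'Plan')
lean = one_hot(plans, 'Plan', drop_first=True)

print(full)  # two columns that always sum to 1
print(lean)  # one column carries the same information

# In the full encoding, Plan_Free is always 1 - Plan_Premium
redundant = all(row['Plan_Free'] == 1 - row['Plan_Premium'] for row in full)
print("Plan_Free is fully determined by Plan_Premium:", redundant)
```

With k categories you only need k - 1 columns, because a row of all zeros already identifies the dropped category.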
🎯 Today's Takeaway
Don't just feed raw text into a model. Transform your categories into a "One-Hot" format to ensure your machine understands the difference without inventing a hierarchy.
