Data science/ML/AI
Data science and machine learning hub Python, SQL, stats, ML, deep learning, projects, PDFs, roadmaps and AI resources. For beginners, data scientists and ML engineers 👉 https://rebrand.ly/bigdatachannels DMCA: @disclosure_bds Contact: @mldatascientist
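📈 Analytics overview of the Telegram channel Data science/ML/AI
The channel Data science/ML/AI (@datascience_bds) is an active participant in the English-language segment. The community currently has 13,674 subscribers, ranks 9,377th in the Technology & Applications category, and ranks 31,635th in the India region.
📊 Audience metrics and growth dynamics
Since its creation (date unknown), the project has grown to 13,674 subscribers.
According to the latest data from 09 June 2026, the channel is operating steadily. The subscriber count changed by 155 over the past 30 days and by 5 over the past 24 hours, and overall reach remains substantial.
- Verification status: not verified
- Engagement rate (ER): the average audience engagement rate is 8.03%. Within 24 hours of publication, a post typically gets reactions from 2.25% of subscribers.
- Post reach: each post averages 1,098 views, with about 308 views on the first day.
- Interaction and feedback: the audience is engaged, with an average of 5 reactions per post.
- Topic focus: content centers on core topics such as panda, learning, row, api, ethic.
📝 Description and content strategy
The author positions the channel as a platform for expressing subjective opinions: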
“Data science and machine learning hub
Python, SQL, stats, ML, deep learning, projects, PDFs, roadmaps and AI resources.
For beginners, data scientists and ML engineers
👉 https://rebrand.ly/bigdatachannels
DMCA: @disclosure_bds
Contact: @mldatasci...”
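With frequent updates (latest data collected on 10 June 2026), the channel stays fresh and maintains high reach. The analysis shows active audience interaction, making it a key point of influence in the Technology & Applications category.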
(x - mean) / standard_deviation
• When to use:
• When your data follows a Gaussian (Normal) distribution.
• When your algorithm assumes features are normally distributed.
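• When you have some outliers (Standardization has no fixed output range, so extreme values distort it less than Min-Max scaling, although they still shift the mean and standard deviation).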
• Vibe: "Let's put everyone on a common baseline relative to the average."
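The formula above can be checked by hand. Here is a minimal sketch, assuming NumPy and a made-up `ages` array used only for illustration. It uses the population standard deviation (`ddof=0`), which is also what scikit-learn's `StandardScaler` uses.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration
ages = np.array([25.0, 30.0, 45.0, 60.0, 20.0, 70.0])

# Standardization: (x - mean) / standard_deviation
mean = ages.mean()
std = ages.std()  # NumPy's default is the population std (ddof=0)
z_scores = (ages - mean) / std

print(z_scores.round(3))
# After standardization the column should have mean ~0 and std ~1
print("mean:", round(float(z_scores.mean()), 6))
print("std:", round(float(z_scores.std()), 6))
```

Each value now reads as "how many standard deviations above or below the average".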
2. Normalization (Min-Max Scaling) ↔️
• What it does: Scales data to a fixed range, usually 0 to 1. It squeezes all values into this specific interval.
• Formula: (x - min) / (max - min)
• When to use:
• When you know your data doesn't follow a Gaussian distribution.
• When your algorithm requires inputs to be within a specific range (e.g., some neural network activation functions).
• When you don't have outliers (Normalization is very sensitive to extreme values).
• Vibe: "Let's squeeze everyone into this exact box, no matter what."
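The Min-Max formula is just as easy to reproduce by hand. This is a small sketch, assuming NumPy and a made-up `incomes` array (no outlier here, on purpose).

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration
incomes = np.array([40000.0, 60000.0, 90000.0, 150000.0, 30000.0])

# Normalization: (x - min) / (max - min)
value_range = incomes.max() - incomes.min()
scaled = (incomes - incomes.min()) / value_range

print(scaled.round(3))
print("min:", scaled.min(), "max:", scaled.max())
```

The smallest value maps to 0 and the largest to 1. A constant column would make `value_range` zero, so real code needs a guard for that case.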
🐍 Code Example: Seeing the Difference with Scikit-learn
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Sample Data: 'Age' (small range) vs. 'Income' (large range)
data = {
    'Age': [25, 30, 45, 60, 20, 70],
    'Income': [40000, 60000, 90000, 150000, 30000, 1000000]  # An outlier in income!
}
df = pd.DataFrame(data)
print("Original Data:")
print(df)
# --- 1. Standardization ---
scaler_std = StandardScaler()
df_standardized = scaler_std.fit_transform(df)
print("\nStandardized Data (Mean=0, Std=1):")
print(pd.DataFrame(df_standardized, columns=df.columns))
# --- 2. Normalization ---
scaler_minmax = MinMaxScaler()
df_normalized = scaler_minmax.fit_transform(df)
print("\nNormalized Data (Range 0-1):")
print(pd.DataFrame(df_normalized, columns=df.columns))
Key Observation in Output:
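Notice how the huge 1,000,000 income outlier squeezes every other Income value into a narrow band near 0 under Normalization, because the outlier alone defines the top of the 0-1 range. Standardization is also affected, since the outlier inflates both the mean and the standard deviation, but it does not force the data into a fixed interval, so the outlier simply shows up as a large z-score.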
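To put numbers on this observation, here is a rough sketch that applies both formulas to the Income column by hand with pandas. The variable names are illustrative, not from the original post.

```python
import pandas as pd

# Income values from the example above, including the 1,000,000 outlier
income = pd.Series([40000, 60000, 90000, 150000, 30000, 1000000], name="Income")

# Apply both scaling formulas directly
min_max = (income - income.min()) / (income.max() - income.min())
z_score = (income - income.mean()) / income.std(ddof=0)

comparison = pd.DataFrame({"Income": income, "min_max": min_max, "z_score": z_score})
print(comparison.round(3))

# Largest min-max value among the five non-outlier incomes
largest_non_outlier = min_max[income < 1_000_000].max()
print("Largest non-outlier min-max value:", round(float(largest_non_outlier), 3))
```

With Min-Max scaling, the five ordinary incomes all land in the bottom eighth of the 0-1 range. In the standardized column the same five incomes end up below zero, because the outlier drags the mean upward.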
The Takeaway 🧠
There's no single "best" scaling method. Your choice depends on:
1. The distribution of your data.
2. The specific Machine Learning algorithm you're using.
3. The presence of outliers.
Always experiment and evaluate which scaling method performs best for your particular task!

employees
id  name    department  salary  manager_id
1   Aditi   HR          30000   5
2   Rahul   IT          50000   6
3   Neha    IT          60000   6
4   Aman    Sales       40000   7
5   Kiran   HR          70000   NULL
6   Mohit   IT          80000   NULL
7   Suresh  Sales       65000   NULL
8   Pooja   HR          30000   5
1️⃣ Find average salary per department
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;
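To try queries like the one above without a database server, one option is Python's built-in `sqlite3` module. This is only a sketch: the column types are assumptions, and the `ORDER BY` is added here just to make the printed order predictable.

```python
import sqlite3

# Rows copied from the employees table above (None stands for NULL)
rows = [
    (1, "Aditi", "HR", 30000, 5),
    (2, "Rahul", "IT", 50000, 6),
    (3, "Neha", "IT", 60000, 6),
    (4, "Aman", "Sales", 40000, 7),
    (5, "Kiran", "HR", 70000, None),
    (6, "Mohit", "IT", 80000, None),
    (7, "Suresh", "Sales", 65000, None),
    (8, "Pooja", "HR", 30000, 5),
]

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE employees ("
    "id INTEGER PRIMARY KEY, name TEXT, department TEXT, "
    "salary INTEGER, manager_id INTEGER)"
)
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?, ?)", rows)

query = """
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department
ORDER BY department;
"""
result = conn.execute(query).fetchall()
for department, avg_salary in result:
    print(department, round(avg_salary, 2))
conn.close()
```

Swapping the text of `query` is enough to run most of the other queries against the same table.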
2️⃣ Find employees earning above their department average
SELECT name, department, salary
FROM employees e
WHERE salary > (
SELECT AVG(salary)
FROM employees
WHERE department = e.department
);
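The same idea can be expressed in pandas, which may help if the correlated subquery feels abstract. This is a sketch, assuming the table is loaded into a DataFrame named `employees` (a name chosen for illustration).

```python
import pandas as pd

employees = pd.DataFrame({
    "name": ["Aditi", "Rahul", "Neha", "Aman", "Kiran", "Mohit", "Suresh", "Pooja"],
    "department": ["HR", "IT", "IT", "Sales", "HR", "IT", "Sales", "HR"],
    "salary": [30000, 50000, 60000, 40000, 70000, 80000, 65000, 30000],
})

# transform("mean") returns each row's department average, aligned row by row,
# which plays the role of the correlated subquery
dept_avg = employees.groupby("department")["salary"].transform("mean")

above_avg = employees[employees["salary"] > dept_avg]
print(above_avg)
```

On the sample table this keeps Kiran, Mohit, and Suresh, one person per department.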
3️⃣ Find highest salary in each department
SELECT department, MAX(salary) AS max_salary
FROM employees
GROUP BY department;
4️⃣ Find employees who earn more than their manager
SELECT e.name
FROM employees e
JOIN employees m ON e.manager_id = m.id
WHERE e.salary > m.salary;
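A self-join can be mirrored in pandas by merging a table with a renamed copy of itself. This is a sketch with illustrative names (`managers`, `manager_salary`). Note that on the sample table nobody out-earns their manager, so the result is empty.

```python
import pandas as pd

employees = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8],
    "name": ["Aditi", "Rahul", "Neha", "Aman", "Kiran", "Mohit", "Suresh", "Pooja"],
    "salary": [30000, 50000, 60000, 40000, 70000, 80000, 65000, 30000],
    "manager_id": [5, 6, 6, 7, None, None, None, 5],
})

# The "m" side of the join: every employee viewed as a potential manager
managers = employees[["id", "salary"]].rename(
    columns={"id": "manager_id", "salary": "manager_salary"}
)

# An inner join drops rows with no manager, just like the SQL JOIN
has_manager = employees.dropna(subset=["manager_id"]).astype({"manager_id": "int64"})
joined = has_manager.merge(managers, on="manager_id", how="inner")

earns_more = joined[joined["salary"] > joined["manager_salary"]]
print(joined[["name", "salary", "manager_salary"]])
print("Rows where the employee earns more:", len(earns_more))
```

An empty result is a valid answer here; it is worth checking by eye before assuming the query is broken.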
5️⃣ Count employees in each department
SELECT department, COUNT(*) AS total_employees
FROM employees
GROUP BY department;
6️⃣ Find departments with more than 2 employees
SELECT department, COUNT(*) AS total
FROM employees
GROUP BY department
HAVING COUNT(*) > 2;
7️⃣ Find the second highest salary
SELECT MAX(salary)
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
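The logic reads as "the largest salary that is strictly below the overall maximum". Here is a short pandas sketch of the same two steps, using the salary column from the table.

```python
import pandas as pd

salaries = pd.Series([30000, 50000, 60000, 40000, 70000, 80000, 65000, 30000])

# Inner query: the overall maximum
highest = salaries.max()

# Outer query: the maximum among values strictly below it
second_highest = salaries[salaries < highest].max()
print(second_highest)
```

Because the filter is strict, ties at the top are skipped, so the result is the second highest distinct salary.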
8️⃣ Find employees without managers
SELECT name
FROM employees
WHERE manager_id IS NULL;
9️⃣ Rank employees by salary
SELECT name, salary,
RANK() OVER (ORDER BY salary DESC) AS salary_rank -- "rank" is a reserved word in some databases (e.g. MySQL 8+)
FROM employees;
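`RANK()` gives tied rows the same rank and then skips ahead. In pandas, `rank(method="min")` behaves the same way. This is a sketch; the `salary_rank` column name is just an illustrative choice.

```python
import pandas as pd

employees = pd.DataFrame({
    "name": ["Aditi", "Rahul", "Neha", "Aman", "Kiran", "Mohit", "Suresh", "Pooja"],
    "salary": [30000, 50000, 60000, 40000, 70000, 80000, 65000, 30000],
})

# method="min" gives ties the lowest rank in their group, like SQL RANK()
employees["salary_rank"] = (
    employees["salary"].rank(method="min", ascending=False).astype(int)
)

ranked = employees.sort_values("salary_rank")
print(ranked)
```

Aditi and Pooja share the last rank because they earn the same salary.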
🔟 Find duplicate salary values
SELECT salary, COUNT(*)
FROM employees
GROUP BY salary
HAVING COUNT(*) > 1;
1️⃣1️⃣ Top 2 highest unique salaries
SELECT DISTINCT salary
FROM employees
ORDER BY salary DESC
LIMIT 2;
1️⃣2️⃣ Find the total salary payout per department
SELECT department, SUM(salary) AS total_payout
FROM employees
GROUP BY department;
1️⃣3️⃣ Find employees whose names start with 'A'
SELECT name
FROM employees
WHERE name LIKE 'A%';
1️⃣4️⃣ Find the manager's name for each employee (Self-Join)
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id;
1️⃣5️⃣ Find the department with the highest total salary expenditure
SELECT department
FROM employees
GROUP BY department
ORDER BY SUM(salary) DESC
LIMIT 1;
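For comparison, a pandas sketch of the same "group, total, take the top one" pattern, using the department and salary columns from the table.

```python
import pandas as pd

employees = pd.DataFrame({
    "department": ["HR", "IT", "IT", "Sales", "HR", "IT", "Sales", "HR"],
    "salary": [30000, 50000, 60000, 40000, 70000, 80000, 65000, 30000],
})

# GROUP BY department + SUM(salary)
totals = employees.groupby("department")["salary"].sum()

# ORDER BY ... DESC LIMIT 1 becomes "label of the largest total"
top_department = totals.idxmax()
print(totals)
print("Highest total payout:", top_department)
```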
👉 Follow @datascience_bds for more.

.describe(), .info()
• Data Visualization:
• Histograms: For the distribution of a single numerical variable.
• Box Plots: To see distribution, outliers, and compare across categories.
• Scatter Plots: To check for relationships between two numerical variables.
• Bar Charts: For counts of categorical variables.
• Missing Value Analysis: Identify how much data is missing and where.
• Outlier Detection: Find extreme values that might skew your results.
• Correlation Matrices: Visualize how numerical variables relate to each other.
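Two of the checks in this list, missing values and outliers, fit in a few lines of pandas. This is a sketch on made-up data; the 1.5 × IQR rule is one common convention for flagging outliers, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps and one suspicious age
df = pd.DataFrame({
    "Age": [25, 30, np.nan, 45, 28, 33, 31, 120],
    "Salary": [40000, 52000, 48000, np.nan, 45000, np.nan, 50000, 47000],
})

# Missing value analysis: how many and what share per column
missing_counts = df.isna().sum()
missing_share = df.isna().mean()
print(missing_counts)
print(missing_share.round(2))

# Outlier detection with the interquartile range (IQR) rule
q1 = df["Age"].quantile(0.25)
q3 = df["Age"].quantile(0.75)
iqr = q3 - q1
is_outlier = (df["Age"] < q1 - 1.5 * iqr) | (df["Age"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]
print(outliers)
```

Flagged rows are candidates for a closer look, not automatic deletions.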
🐍 A Peek into EDA with Python (Pandas & Matplotlib/Seaborn)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Assume 'df' is your loaded DataFrame
# --- Basic Overview ---
print("Dataset Info:")
df.info() # Data types, non-null counts
print("\nSummary Statistics:")
print(df.describe()) # For numerical columns
# --- Visualizing Distributions ---
# For a numerical column like 'Age'
plt.figure(figsize=(8, 4))
sns.histplot(data=df, x='Age', kde=True) # kde=True adds a smooth curve
plt.title('Distribution of Age')
plt.show()
# For a categorical column like 'Category'
plt.figure(figsize=(8, 4))
sns.countplot(data=df, y='Category') # Use y for horizontal bars if many categories
plt.title('Count of Categories')
plt.show()
# --- Checking for Relationships ---
# Between two numerical columns like 'Salary' and 'YearsExperience'
plt.figure(figsize=(8, 5))
sns.scatterplot(data=df, x='YearsExperience', y='Salary', hue='Department') # hue adds another dimension
plt.title('Salary vs. Years of Experience')
plt.show()
# --- Correlation Matrix (for numerical columns) ---
plt.figure(figsize=(10, 7))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm') # numeric_only=True skips non-numeric columns; annot=True shows numbers, cmap changes color scheme
plt.title('Correlation Matrix of Numerical Features')
plt.show()
(Note: correlation is only computed for numerical columns. In pandas 2.0 and later, df.corr() raises an error on non-numeric columns unless you pass numeric_only=True. For categorical data, use other methods such as df.groupby().)
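As a rough illustration of the groupby route for categorical columns, here is a sketch on a tiny made-up DataFrame. The column names follow the plotting example, the values are invented, and `numeric_only` assumes pandas 1.5 or newer.

```python
import pandas as pd

# Hypothetical data; only the column names come from the example above
df = pd.DataFrame({
    "Department": ["IT", "IT", "HR", "HR", "Sales"],
    "Salary": [50000, 60000, 30000, 70000, 40000],
    "YearsExperience": [2, 4, 1, 8, 3],
})

# Summarize a numerical column within each category
salary_by_dept = df.groupby("Department")["Salary"].agg(["mean", "median", "count"])
print(salary_by_dept)

# Correlation restricted to the numerical columns
numeric_corr = df.corr(numeric_only=True)
print(numeric_corr.round(2))
```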
💡 The "Aha!" Moment
EDA is where you uncover the hidden stories in your data. You might find:
• "Wow, everyone in the 'South' region buys this product!"
• "This 'Age' column has weird negative values, something's wrong."
• "Sales dramatically drop after the 15th of the month."
🎯 What should you do?
✔️ Treat your data like a mystery to be solved, not a problem to be fixed.
✔️ Use basic plots and stats to understand your variables before building models.
✔️ Identify potential data quality issues early.

I'll just give them numbers! Paris = 1, Tokyo = 2, New York = 3.
⛔️ The Danger: The model will think New York (3) is "greater than" Paris (1). It might assume New York is three times more "City-ish" than Paris. This creates fake relationships in your data.
✅ The Right Way: One-Hot Encoding (Dummy Variables)
We create new columns for every category. If the row is "Paris," the Paris column gets a 1 and the others get a 0. It’s binary. It’s fair. It’s mathematical.
🐍 Let's see it in Python (Pandas)
import pandas as pd
# 1. Create a tiny dataset
data = {'User': ['Alex', 'Sam', 'Jo'],
        'Plan': ['Free', 'Premium', 'Free']}
df = pd.DataFrame(data)
# 2. The Magic Function
df_encoded = pd.get_dummies(df, columns=['Plan'])
print(df_encoded)
⚠️ The "Dummy Variable Trap" (Pro Tip)
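If you know someone is NOT on the Free plan, you automatically know they ARE on the Premium plan (in a 2-plan system). Keeping both columns creates perfect multicollinearity: one column is fully determined by the other, so the model is told the same thing twice.
The Fix: Pass drop_first=True to pd.get_dummies() to remove the redundant column. This matters most for linear models such as unregularized linear or logistic regression; tree-based models usually work fine with all columns.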
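Here is a quick sketch of what `drop_first=True` changes, reusing the tiny User/Plan dataset from above. The variable names `full` and `reduced` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "User": ["Alex", "Sam", "Jo"],
    "Plan": ["Free", "Premium", "Free"],
})

# One column per category
full = pd.get_dummies(df, columns=["Plan"])

# Drop the first category (alphabetically "Free") to avoid the redundancy
reduced = pd.get_dummies(df, columns=["Plan"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
print(reduced)
```

In the reduced frame, a row with no flag set in `Plan_Premium` is implicitly on the Free plan, so no information is lost.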
🎯 Today's Takeaway
Don't just feed raw text into a model. Transform your categories into a "One-Hot" format to ensure your machine understands the difference without inventing a hierarchy.
