Data science/ML/AI
Data science and machine learning hub Python, SQL, stats, ML, deep learning, projects, PDFs, roadmaps and AI resources. For beginners, data scientists and ML engineers 👉 https://rebrand.ly/bigdatachannels DMCA: @disclosure_bds Contact: @mldatascientist
📈 Telegram channel analysis: Data science/ML/AI
The Data science/ML/AI channel (@datascience_bds) is an active player in the English-language segment. The community currently has 13,663 subscribers, ranking 9,387 in the Technology & Apps category and 31,771 in the India region.
📊 Audience metrics and dynamics
Since its creation (date unknown), the project has grown rapidly and attracted 13,663 subscribers.
According to the latest data, dated 05 June 2026, the channel shows stable activity. The subscriber count changed by 171 over the past 30 days and by 1 over the past 24 hours, and the channel has kept its broad reach.
- Verification status: not verified
- Engagement rate (ER): average audience engagement is 7.95%, and in the first 24 hours after publication a post typically earns reactions from 2.46% of all subscribers.
- Post reach: each post receives 1,086 views on average, of which about 336 typically come on the first day.
- Reactions and engagement: the audience is actively supportive; the average number of reactions per post is 5.
- Topical interests: the content focuses on key topics such as panda, learning, row, api, ethic.
📝 Description and content policy
The author describes this space as a place for expressing personal views:
“Data science and machine learning hub
Python, SQL, stats, ML, deep learning, projects, PDFs, roadmaps and AI resources.
For beginners, data scientists and ML engineers
👉 https://rebrand.ly/bigdatachannels
DMCA: @disclosure_bds
Contact: @mldatasci...”
Thanks to frequent updates (latest data dated 07 June 2026), the channel stays current and has high reach. The analytics show that the audience actively engages with the content, which has made the channel an important point of influence in the Technology & Apps category.
• Formula: (x - mean) / standard_deviation
• When to use:
• When your data follows a Gaussian (Normal) distribution.
• When your algorithm assumes features are normally distributed.
• When you have outliers (Standardization is less affected by them than Normalization).
• Vibe: "Let's put everyone on a common baseline relative to the average."
2. Normalization (Min-Max Scaling) ↔️
• What it does: Scales data to a fixed range, usually 0 to 1. It squeezes all values into this specific interval.
• Formula: (x - min) / (max - min)
• When to use:
• When you know your data doesn't follow a Gaussian distribution.
• When your algorithm requires inputs to be within a specific range (e.g., some neural network activation functions).
• When you don't have outliers (Normalization is very sensitive to extreme values).
• Vibe: "Let's squeeze everyone into this exact box, no matter what."
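As a rough illustration, both formulas can be applied by hand with only the Python standard library. This is a minimal sketch: the helper names `standardize` and `min_max_normalize` and the `ages` list are made up for this example, and it assumes the population standard deviation.

```python
from statistics import mean, pstdev


def standardize(values):
    """Apply (x - mean) / standard_deviation to every value."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    return [(x - mu) / sigma for x in values]


def min_max_normalize(values):
    """Apply (x - min) / (max - min) to every value."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]


# Hypothetical sample values, chosen only for illustration
ages = [20, 30, 40, 50, 60]

z = standardize(ages)
n = min_max_normalize(ages)

print([round(v, 3) for v in z])  # centered on 0, spread of 1
print(n)                         # squeezed into the 0 to 1 range
```

After standardization the values have mean 0 and standard deviation 1 but no fixed bounds; after normalization the smallest value is exactly 0 and the largest exactly 1.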
🐍 Code Example: Seeing the Difference with Scikit-learn
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Sample Data: 'Age' (small range) vs. 'Income' (large range)
data = {
    'Age': [25, 30, 45, 60, 20, 70],
    'Income': [40000, 60000, 90000, 150000, 30000, 1000000]  # An outlier in income!
}
df = pd.DataFrame(data)
print("Original Data:")
print(df)
# --- 1. Standardization ---
scaler_std = StandardScaler()
df_standardized = scaler_std.fit_transform(df)
print("\nStandardized Data (Mean=0, Std=1):")
print(pd.DataFrame(df_standardized, columns=df.columns))
# --- 2. Normalization ---
scaler_minmax = MinMaxScaler()
df_normalized = scaler_minmax.fit_transform(df)
print("\nNormalized Data (Range 0-1):")
print(pd.DataFrame(df_normalized, columns=df.columns))
Key Observation in Output:
Notice how the huge 1,000,000 income outlier in the original data pulls every other Income value towards 0 under Normalization, squeezing them all below about 0.13. Standardization is affected too, because the outlier inflates both the mean and the standard deviation, but it does not pin the outlier to a fixed upper bound of 1.
The Takeaway 🧠
There's no single "best" scaling method. Your choice depends on:
1. The distribution of your data.
2. The specific Machine Learning algorithm you're using.
3. The presence of outliers.
Always experiment and evaluate which scaling method performs best for your particular task!
employees
id  name    department  salary  manager_id
1   Aditi   HR          30000   5
2   Rahul   IT          50000   6
3   Neha    IT          60000   6
4   Aman    Sales       40000   7
5   Kiran   HR          70000   NULL
6   Mohit   IT          80000   NULL
7   Suresh  Sales       65000   NULL
8   Pooja   HR          30000   5
1️⃣ Find average salary per department
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;
2️⃣ Find employees earning above their department average
SELECT name, department, salary
FROM employees e
WHERE salary > (
SELECT AVG(salary)
FROM employees
WHERE department = e.department
);
3️⃣ Find highest salary in each department
SELECT department, MAX(salary) AS max_salary
FROM employees
GROUP BY department;
4️⃣ Find employees who earn more than their manager
SELECT e.name
FROM employees e
JOIN employees m ON e.manager_id = m.id
WHERE e.salary > m.salary;
5️⃣ Count employees in each department
SELECT department, COUNT(*) AS total_employees
FROM employees
GROUP BY department;
6️⃣ Find departments with more than 2 employees
SELECT department, COUNT(*) AS total
FROM employees
GROUP BY department
HAVING COUNT(*) > 2;
7️⃣ Find the second highest salary
SELECT MAX(salary)
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
8️⃣ Find employees without managers
SELECT name
FROM employees
WHERE manager_id IS NULL;
9️⃣ Rank employees by salary
SELECT name, salary,
RANK() OVER (ORDER BY salary DESC) AS salary_rank -- "rank" is a reserved word in some databases (e.g. MySQL 8)
FROM employees;
🔟 Find duplicate salary values
SELECT salary, COUNT(*)
FROM employees
GROUP BY salary
HAVING COUNT(*) > 1;
1️⃣1️⃣ Top 2 highest unique salaries
SELECT DISTINCT salary
FROM employees
ORDER BY salary DESC
LIMIT 2;
1️⃣2️⃣ Find the total salary payout per department
SELECT department, SUM(salary) AS total_payout
FROM employees
GROUP BY department;
1️⃣3️⃣ Find employees whose names start with 'A'
SELECT name
FROM employees
WHERE name LIKE 'A%';
1️⃣4️⃣ Find the manager's name for each employee (Self-Join)
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id;
1️⃣5️⃣ Find the department with the highest total salary expenditure
SELECT department
FROM employees
GROUP BY department
ORDER BY SUM(salary) DESC
LIMIT 1;
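To try the queries above without installing a database server, one option is Python's built-in sqlite3 module. This is a minimal sketch, assuming an in-memory SQLite database and the column types shown; the variable names are made up for illustration, and an ORDER BY was added to one query so the result order is predictable.

```python
import sqlite3

# In-memory database holding the sample employees table (column types assumed)
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        name TEXT,
        department TEXT,
        salary INTEGER,
        manager_id INTEGER
    )
    """
)
rows = [
    (1, "Aditi", "HR", 30000, 5),
    (2, "Rahul", "IT", 50000, 6),
    (3, "Neha", "IT", 60000, 6),
    (4, "Aman", "Sales", 40000, 7),
    (5, "Kiran", "HR", 70000, None),
    (6, "Mohit", "IT", 80000, None),
    (7, "Suresh", "Sales", 65000, None),
    (8, "Pooja", "HR", 30000, 5),
]
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?, ?)", rows)

# Query 2: employees earning above their department average
above_avg = conn.execute(
    """
    SELECT name, department, salary
    FROM employees e
    WHERE salary > (
        SELECT AVG(salary) FROM employees WHERE department = e.department
    )
    ORDER BY name
    """
).fetchall()

# Query 4: employees who earn more than their manager
earn_more_than_manager = conn.execute(
    """
    SELECT e.name
    FROM employees e
    JOIN employees m ON e.manager_id = m.id
    WHERE e.salary > m.salary
    """
).fetchall()

# Query 7: second highest salary
second_highest = conn.execute(
    "SELECT MAX(salary) FROM employees "
    "WHERE salary < (SELECT MAX(salary) FROM employees)"
).fetchone()[0]

print(above_avg)
print(earn_more_than_manager)
print(second_highest)
conn.close()
```

With this sample data only the three managers earn above their department average, and nobody earns more than their manager, so query 4 returns no rows.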
👉 Follow @datascience_bds for more.
.describe(), .info()
• Data Visualization:
• Histograms: For the distribution of a single numerical variable.
• Box Plots: To see distribution, outliers, and compare across categories.
• Scatter Plots: To check for relationships between two numerical variables.
• Bar Charts: For counts of categorical variables.
• Missing Value Analysis: Identify how much data is missing and where.
• Outlier Detection: Find extreme values that might skew your results.
• Correlation Matrices: Visualize how numerical variables relate to each other.
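Missing value analysis and outlier detection from the list above can be sketched in a few lines of pandas. This is only an illustration: the tiny DataFrame is made up, and the 1.5 × IQR rule is one common convention for flagging outliers, not the only one.

```python
import pandas as pd

# Hypothetical data with gaps and one suspicious age
df = pd.DataFrame({
    "Age": [25, 30, None, 45, 29, 31, 28, 200],
    "City": ["Paris", "Tokyo", "Paris", None, "Tokyo", "Paris", None, "Tokyo"],
})

# --- Missing value analysis ---
missing_counts = df.isna().sum()   # number of missing values per column
missing_share = df.isna().mean()   # fraction of missing values per column
print(missing_counts)
print(missing_share)

# --- Outlier detection with the 1.5 * IQR rule (an assumed convention) ---
q1 = df["Age"].quantile(0.25)
q3 = df["Age"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Rows whose Age falls outside the [lower, upper] fence
outliers = df[(df["Age"] < lower) | (df["Age"] > upper)]
print(outliers)
```

Flagged rows are candidates for a closer look, not automatic deletions: an age of 200 is probably a data entry error, but an extreme income may be perfectly real.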
🐍 A Peek into EDA with Python (Pandas & Matplotlib/Seaborn)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Assume 'df' is your loaded DataFrame
# --- Basic Overview ---
print("Dataset Info:")
df.info() # Data types, non-null counts
print("\nSummary Statistics:")
print(df.describe()) # For numerical columns
# --- Visualizing Distributions ---
# For a numerical column like 'Age'
plt.figure(figsize=(8, 4))
sns.histplot(data=df, x='Age', kde=True) # kde=True adds a smooth curve
plt.title('Distribution of Age')
plt.show()
# For a categorical column like 'Category'
plt.figure(figsize=(8, 4))
sns.countplot(data=df, y='Category') # Use y for horizontal bars if many categories
plt.title('Count of Categories')
plt.show()
# --- Checking for Relationships ---
# Between two numerical columns like 'Salary' and 'YearsExperience'
plt.figure(figsize=(8, 5))
sns.scatterplot(data=df, x='YearsExperience', y='Salary', hue='Department') # hue adds another dimension
plt.title('Salary vs. Years of Experience')
plt.show()
# --- Correlation Matrix (for numerical columns) ---
plt.figure(figsize=(10, 7))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm') # annot=True shows numbers, cmap changes color scheme
plt.title('Correlation Matrix of Numerical Features')
plt.show()
(Note: correlation is only defined for numerical columns. In recent pandas versions, df.corr() raises an error if the DataFrame contains text columns, so numeric_only=True (pandas 1.5 or later) tells it to skip them. For categorical data, use other methods such as df.groupby().)
💡 The "Aha!" Moment
EDA is where you uncover the hidden stories in your data. You might find:
• "Wow, everyone in the 'South' region buys this product!"
• "This 'Age' column has weird negative values, something's wrong."
• "Sales dramatically drop after the 15th of the month."
🎯 What should you do?
✔️ Treat your data like a mystery to be solved, not a problem to be fixed.
✔️ Use basic plots and stats to understand your variables before building models.
✔️ Identify potential data quality issues early.
I'll just give them numbers! Paris = 1, Tokyo = 2, New York = 3.
⛔️ The Danger: The model will think New York (3) is "greater than" Paris (1). It might assume New York is three times more "City-ish" than Paris. This creates fake relationships in your data.
✅ The Right Way: One-Hot Encoding (Dummy Variables)
We create new columns for every category. If the row is "Paris," the Paris column gets a 1 and the others get a 0. It's binary. It's fair. It's mathematical.
🐍 Let's see it in Python (Pandas)
import pandas as pd
# 1. Create a tiny dataset
data = {'User': ['Alex', 'Sam', 'Jo'],
'Plan': ['Free', 'Premium', 'Free']}
df = pd.DataFrame(data)
# 2. The Magic Function
df_encoded = pd.get_dummies(df, columns=['Plan'])
print(df_encoded)
⚠️ The "Dummy Variable Trap" (Pro Tip)
If you know someone is NOT on the Free plan, you automatically know they ARE on the Premium plan (in a 2-plan system). Having both columns creates "Multicollinearity"—basically, telling the model the same thing twice.
The Fix: For linear models such as linear or logistic regression, use drop_first=True in pd.get_dummies to remove that redundant column and keep your model lean.
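Here is a small sketch of what drop_first=True does, reusing the same tiny User/Plan dataset from the example above. Which column gets dropped depends on the sorted order of the category labels, so here it is "Free".

```python
import pandas as pd

# Same tiny dataset as in the earlier example
df = pd.DataFrame({
    "User": ["Alex", "Sam", "Jo"],
    "Plan": ["Free", "Premium", "Free"],
})

# drop_first=True removes the first category column ("Plan_Free"),
# leaving a single indicator column for the two-plan case
df_encoded = pd.get_dummies(df, columns=["Plan"], drop_first=True)
print(df_encoded)
```

A row where Plan_Premium is false (or 0) can only be a Free user, so no information is lost by dropping the extra column.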
🎯 Today's Takeaway
Don't just feed raw text into a model. Transform your categories into a "One-Hot" format to ensure your machine understands the difference without inventing a hierarchy.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # ❌ WRONG
Why is this wrong?
Because the scaler learns:
📊 mean
📉 standard deviation
…from the entire dataset, including test data.
🎯 Why this breaks everything
Your model now has indirect knowledge of the test set distribution.
Even though you didn’t touch labels, you leaked statistical information.
This leads to:
📈 overly optimistic accuracy
❌ poor real-world performance
✅ The Correct Way
Always split first:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test) # ✅ only transform
🔑 Any operation that “learns from data” must only see training data.
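The same rule applies during cross-validation, where it is easy to forget. One way to enforce it, not shown in the original post, is to wrap the scaler and the model in a scikit-learn pipeline so the scaler is re-fitted on the training part of every fold. This is a sketch under assumptions: the synthetic dataset from make_classification and the choice of LogisticRegression are only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only so the example is self-contained
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# The pipeline fits StandardScaler on the training folds only,
# then applies that fitted scaler to the held-out fold
model = make_pipeline(StandardScaler(), LogisticRegression())

scores = cross_val_score(model, X, y, cv=5)
print(scores)
print(scores.mean())
```

Because the scaler lives inside the pipeline, there is no moment at which it can see the held-out fold before the model is scored on it.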
