Data science/ML/AI
Data science and machine learning hub
Python, SQL, stats, ML, deep learning, projects, PDFs, roadmaps and AI resources.
For beginners, data scientists and ML engineers
👉 https://rebrand.ly/bigdatachannels
DMCA: @disclosure_bds
Contact: @mldatascientist
📈 Telegram channel analysis: Data science/ML/AI
The Data science/ML/AI channel (@datascience_bds) is a notable channel in the English-language segment. The community currently has 13,674 subscribers, ranking 9,380th in the Technologies and Applications category and 31,607th in the India region.
📊 Audience metrics and dynamics
Since its creation (date unknown), the project has grown rapidly, reaching 13,674 subscribers.
According to the latest data, from June 10, 2026, the channel's activity is stable. Membership changed by 143 over the last 30 days and by 2 over the last 24 hours, while reach remained high.
- Verification status: Not verified
- Engagement rate (ER): Average audience engagement is 8.09%. In the first 24 hours after publication, a post typically receives reactions equal to 2.22% of the total subscriber count.
- Post reach: Each post receives 1,106 views on average, with about 304 of them in the first day.
- Reactions and interaction: The audience responds actively; the average is 5 reactions per post.
- Topic interests: Content centers on key topics such as panda, learning, row, api, ethic.
Thanks to frequent updates (latest data received June 11, 2026), the channel stays current and keeps a broad reach. The analytics show that the audience actively engages with the content, which makes the channel a reference point in the Technologies and Applications category.
• Formula: (x - mean) / standard_deviation
• When to use:
• When your data follows a Gaussian (Normal) distribution.
• When your algorithm assumes features are normally distributed.
• When you have outliers (Standardization is less affected by them than Normalization).
• Vibe: "Let's put everyone on a common baseline relative to the average."
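A minimal sketch of the standardization formula in plain Python, to make the arithmetic visible. It assumes the population standard deviation (dividing by n), which is what scikit-learn's StandardScaler uses. The helper name standardize is made up for illustration.

```python
import statistics

def standardize(values):
    """Return z-scores: (x - mean) / standard_deviation."""
    mean = statistics.fmean(values)
    # Population standard deviation (divides by n, not n - 1).
    std = statistics.pstdev(values)
    return [(x - mean) / std for x in values]

ages = [25, 30, 45, 60, 20, 70]
z_scores = standardize(ages)

print([round(z, 2) for z in z_scores])
print(round(statistics.fmean(z_scores), 6))   # mean of the z-scores
print(round(statistics.pstdev(z_scores), 6))  # spread of the z-scores
```

After the transform the values have a mean of 0 and a standard deviation of 1, up to floating-point rounding, but they are not confined to any fixed range.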
2. Normalization (Min-Max Scaling) ↔️
• What it does: Scales data to a fixed range, usually 0 to 1. It squeezes all values into this specific interval.
• Formula: (x - min) / (max - min)
• When to use:
• When you know your data doesn't follow a Gaussian distribution.
• When your algorithm requires inputs to be within a specific range (e.g., some neural network activation functions).
• When you don't have outliers (Normalization is very sensitive to extreme values).
• Vibe: "Let's squeeze everyone into this exact box, no matter what."
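The min-max formula can be sketched the same way in plain Python. The helper name min_max_scale is hypothetical, and the income figures include one deliberately extreme value to show how sensitive the method is to outliers.

```python
def min_max_scale(values):
    """Return (x - min) / (max - min) for every x, mapping values into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical, so the range is zero")
    return [(x - lo) / (hi - lo) for x in values]

incomes = [40000, 60000, 90000, 150000, 30000, 1000000]  # last value is an outlier

scaled = min_max_scale(incomes)
print([round(s, 3) for s in scaled])

# The same values without the outlier spread across the whole interval.
scaled_no_outlier = min_max_scale(incomes[:-1])
print([round(s, 3) for s in scaled_no_outlier])
```

With the outlier present, every other income is squeezed into the bottom of the range. Without it, the same five incomes cover the full 0 to 1 interval.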
🐍 Code Example: Seeing the Difference with Scikit-learn
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Sample Data: 'Age' (small range) vs. 'Income' (large range)
data = {
    'Age': [25, 30, 45, 60, 20, 70],
    'Income': [40000, 60000, 90000, 150000, 30000, 1000000],  # An outlier in income!
}
df = pd.DataFrame(data)
print("Original Data:")
print(df)
# --- 1. Standardization ---
scaler_std = StandardScaler()
df_standardized = scaler_std.fit_transform(df)
print("\nStandardized Data (Mean=0, Std=1):")
print(pd.DataFrame(df_standardized, columns=df.columns))
# --- 2. Normalization ---
scaler_minmax = MinMaxScaler()
df_normalized = scaler_minmax.fit_transform(df)
print("\nNormalized Data (Range 0-1):")
print(pd.DataFrame(df_normalized, columns=df.columns))
Key Observation in Output:
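With Normalization, the 1,000,000 income outlier maps to 1 and squeezes every other Income value into roughly 0 to 0.12. Standardization is pulled by the outlier too, because it inflates both the mean and the standard deviation: the other incomes land between about -0.57 and -0.23, while the outlier sits near 2.2. The difference is that Standardization does not force values into a fixed range, so the outlier does not define the boundaries of the scale.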
The Takeaway 🧠
There's no single "best" scaling method. Your choice depends on:
1. The distribution of your data.
2. The specific Machine Learning algorithm you're using.
3. The presence of outliers.
Always experiment and evaluate which scaling method performs best for your particular task!
employees
id  name    department  salary  manager_id
1   Aditi   HR          30000   5
2   Rahul   IT          50000   6
3   Neha    IT          60000   6
4   Aman    Sales       40000   7
5   Kiran   HR          70000   NULL
6   Mohit   IT          80000   NULL
7   Suresh  Sales       65000   NULL
8   Pooja   HR          30000   5
1️⃣ Find average salary per department
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;
2️⃣ Find employees earning above their department average
SELECT name, department, salary
FROM employees e
WHERE salary > (
    SELECT AVG(salary)
    FROM employees
    WHERE department = e.department
);
3️⃣ Find highest salary in each department
SELECT department, MAX(salary) AS max_salary
FROM employees
GROUP BY department;
4️⃣ Find employees who earn more than their manager
SELECT e.name
FROM employees e
JOIN employees m ON e.manager_id = m.id
WHERE e.salary > m.salary;
5️⃣ Count employees in each department
SELECT department, COUNT(*) AS total_employees
FROM employees
GROUP BY department;
6️⃣ Find departments with more than 2 employees
SELECT department, COUNT(*) AS total
FROM employees
GROUP BY department
HAVING COUNT(*) > 2;
7️⃣ Find the second highest salary
SELECT MAX(salary)
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
8️⃣ Find employees without managers
SELECT name
FROM employees
WHERE manager_id IS NULL;
9️⃣ Rank employees by salary
SELECT name, salary,
       RANK() OVER (ORDER BY salary DESC) AS salary_rank
FROM employees;
🔟 Find duplicate salary values
SELECT salary, COUNT(*)
FROM employees
GROUP BY salary
HAVING COUNT(*) > 1;
1️⃣1️⃣ Top 2 highest unique salaries
SELECT DISTINCT salary
FROM employees
ORDER BY salary DESC
LIMIT 2;
1️⃣2️⃣ Find the total salary payout per department
SELECT department, SUM(salary) AS total_payout
FROM employees
GROUP BY department;
1️⃣3️⃣ Find employees whose names start with 'A'
SELECT name
FROM employees
WHERE name LIKE 'A%';
1️⃣4️⃣ Find the manager's name for each employee (Self-Join)
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id;
1️⃣5️⃣ Find the department with the highest total salary expenditure
SELECT department
FROM employees
GROUP BY department
ORDER BY SUM(salary) DESC
LIMIT 1;
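To try the queries above without installing a database server, here is a sketch using Python's built-in sqlite3 module. It loads the employees table into an in-memory database and runs a few of the queries. SQLite is an assumption here, and other databases can differ in details such as LIMIT. The ORDER BY clauses in queries 2 and 6 are additions that make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        name TEXT,
        department TEXT,
        salary INTEGER,
        manager_id INTEGER
    )
""")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Aditi", "HR", 30000, 5),
        (2, "Rahul", "IT", 50000, 6),
        (3, "Neha", "IT", 60000, 6),
        (4, "Aman", "Sales", 40000, 7),
        (5, "Kiran", "HR", 70000, None),   # None is stored as NULL
        (6, "Mohit", "IT", 80000, None),
        (7, "Suresh", "Sales", 65000, None),
        (8, "Pooja", "HR", 30000, 5),
    ],
)

# Query 2: employees earning above their department average
above_avg = conn.execute("""
    SELECT name, department, salary
    FROM employees e
    WHERE salary > (
        SELECT AVG(salary) FROM employees WHERE department = e.department
    )
    ORDER BY id
""").fetchall()

# Query 4: employees who earn more than their manager
earn_more_than_manager = conn.execute("""
    SELECT e.name
    FROM employees e
    JOIN employees m ON e.manager_id = m.id
    WHERE e.salary > m.salary
""").fetchall()

# Query 6: departments with more than 2 employees
big_departments = conn.execute("""
    SELECT department, COUNT(*) AS total
    FROM employees
    GROUP BY department
    HAVING COUNT(*) > 2
    ORDER BY department
""").fetchall()

# Query 7: second highest salary
second_highest = conn.execute("""
    SELECT MAX(salary)
    FROM employees
    WHERE salary < (SELECT MAX(salary) FROM employees)
""").fetchone()[0]

# Query 15: department with the highest total salary expenditure
top_department = conn.execute("""
    SELECT department
    FROM employees
    GROUP BY department
    ORDER BY SUM(salary) DESC
    LIMIT 1
""").fetchone()[0]

conn.close()

print(above_avg)
print(earn_more_than_manager)
print(big_departments)
print(second_highest)
print(top_department)
```

On this sample table, query 4 returns no rows: every employee with a manager earns less than that manager. An empty result there is correct, not a bug.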
👉 Follow @datascience_bds for more.
.describe(), .info()
• Data Visualization:
• Histograms: For the distribution of a single numerical variable.
• Box Plots: To see distribution, outliers, and compare across categories.
• Scatter Plots: To check for relationships between two numerical variables.
• Bar Charts: For counts of categorical variables.
• Missing Value Analysis: Identify how much data is missing and where.
• Outlier Detection: Find extreme values that might skew your results.
• Correlation Matrices: Visualize how numerical variables relate to each other.
🐍 A Peek into EDA with Python (Pandas & Matplotlib/Seaborn)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Assume 'df' is your loaded DataFrame
# --- Basic Overview ---
print("Dataset Info:")
df.info() # Data types, non-null counts
print("\nSummary Statistics:")
print(df.describe()) # For numerical columns
# --- Visualizing Distributions ---
# For a numerical column like 'Age'
plt.figure(figsize=(8, 4))
sns.histplot(data=df, x='Age', kde=True) # kde=True adds a smooth curve
plt.title('Distribution of Age')
plt.show()
# For a categorical column like 'Category'
plt.figure(figsize=(8, 4))
sns.countplot(data=df, y='Category') # Use y for horizontal bars if many categories
plt.title('Count of Categories')
plt.show()
# --- Checking for Relationships ---
# Between two numerical columns like 'Salary' and 'YearsExperience'
plt.figure(figsize=(8, 5))
sns.scatterplot(data=df, x='YearsExperience', y='Salary', hue='Department') # hue adds another dimension
plt.title('Salary vs. Years of Experience')
plt.show()
# --- Correlation Matrix (for numerical columns) ---
plt.figure(figsize=(10, 7))
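sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm') # numeric_only=True skips non-numeric columns; annot=True shows numbers, cmap changes color scheme
plt.title('Correlation Matrix of Numerical Features')
plt.show()
(Note: correlation is only defined for numerical columns. In pandas 2.0 and later, df.corr() raises an error if the DataFrame contains non-numeric columns, so pass numeric_only=True or select the numeric columns first. For categorical data, use other methods such as df.groupby().)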
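The checklist also mentions missing value analysis and outlier detection, which the plotting code does not cover. Here is a rough sketch of both in plain Python so it runs without any dependencies. The sample rows, the helper names, and the choice of the 1.5 × IQR rule are all assumptions for illustration. In pandas, df.isna().sum() gives the missing counts directly.

```python
import statistics

# Hypothetical sample rows; None marks a missing value.
rows = [
    {"Age": 25, "Salary": 40000},
    {"Age": 30, "Salary": None},
    {"Age": -4, "Salary": 52000},    # an impossible age
    {"Age": 45, "Salary": 61000},
    {"Age": None, "Salary": 58000},
    {"Age": 38, "Salary": 950000},   # an extreme salary
    {"Age": 52, "Salary": 47000},
    {"Age": 41, "Salary": 55000},
]

def missing_counts(rows):
    """Count None values per column."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (value is None)
    return counts

def iqr_outliers(values):
    """Return values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

salaries = [r["Salary"] for r in rows if r["Salary"] is not None]
ages = [r["Age"] for r in rows if r["Age"] is not None]

print("Missing per column:", missing_counts(rows))
print("Salary outliers (IQR rule):", iqr_outliers(salaries))

# A statistical rule is not a substitute for a domain check:
# an age below zero is wrong no matter how the other ages are distributed.
impossible_ages = [a for a in ages if a < 0]
print("Impossible ages:", impossible_ages)
```

The two checks complement each other: the IQR rule flags values that are extreme relative to the rest of the column, while the domain check flags values that cannot be valid at all.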
💡 The "Aha!" Moment
EDA is where you uncover the hidden stories in your data. You might find:
• "Wow, everyone in the 'South' region buys this product!"
• "This 'Age' column has weird negative values, something's wrong."
• "Sales dramatically drop after the 15th of the month."
🎯 What should you do?
✔️ Treat your data like a mystery to be solved, not a problem to be fixed.
✔️ Use basic plots and stats to understand your variables before building models.
✔️ Identify potential data quality issues early.
I'll just give them numbers! Paris = 1, Tokyo = 2, New York = 3.
⛔️ The Danger: The model will think New York (3) is "greater than" Paris (1). It might assume New York is three times more "City-ish" than Paris. This creates fake relationships in your data.
✅ The Right Way: One-Hot Encoding (Dummy Variables)
We create new columns for every category. If the row is "Paris," the Paris column gets a 1 and the others get a 0. It’s binary. It’s fair. It’s mathematical.
🐍 Let's see it in Python (Pandas)
import pandas as pd
# 1. Create a tiny dataset
data = {'User': ['Alex', 'Sam', 'Jo'],
        'Plan': ['Free', 'Premium', 'Free']}
df = pd.DataFrame(data)
# 2. The Magic Function
df_encoded = pd.get_dummies(df, columns=['Plan'])
print(df_encoded)
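The "fake relationships" point can be checked with simple arithmetic. This sketch uses only the standard library and the city codes from the text. It compares distances between cities under integer labels and under one-hot vectors, on the assumption that the model treats inputs as points in space (as distance-based and linear models do).

```python
import math

# Integer labels from the text: Paris = 1, Tokyo = 2, New York = 3
label = {"Paris": [1], "Tokyo": [2], "New York": [3]}

# One-hot vectors: one position per city
one_hot = {"Paris": [1, 0, 0], "Tokyo": [0, 1, 0], "New York": [0, 0, 1]}

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.dist(a, b)

for name, enc in (("label", label), ("one-hot", one_hot)):
    paris_tokyo = distance(enc["Paris"], enc["Tokyo"])
    paris_new_york = distance(enc["Paris"], enc["New York"])
    print(name, round(paris_tokyo, 3), round(paris_new_york, 3))
```

With integer labels, New York ends up twice as far from Paris as Tokyo is, purely because of the numbering. With one-hot vectors every pair of cities is the same distance apart, so no ordering is implied.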
⚠️ The "Dummy Variable Trap" (Pro Tip)
If you know someone is NOT on the Free plan, you automatically know they ARE on the Premium plan (in a 2-plan system). Having both columns creates "Multicollinearity"—basically, telling the model the same thing twice.
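The Fix: Pass drop_first=True to pd.get_dummies to remove that redundant column and keep your model lean. This matters most for linear and logistic regression; tree-based models are generally unaffected by the extra column.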
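To show what dropping the first column does, here is a small sketch in plain Python. The one_hot helper is hypothetical and only imitates the idea; in pandas the equivalent call would be pd.get_dummies(df, columns=['Plan'], drop_first=True).

```python
def one_hot(values, prefix, drop_first=False):
    """Build 0/1 indicator columns, one per distinct category (sorted)."""
    categories = sorted(set(values))
    if drop_first:
        # The dropped category becomes the all-zeros baseline.
        categories = categories[1:]
    return {
        f"{prefix}_{cat}": [int(v == cat) for v in values]
        for cat in categories
    }

plans = ["Free", "Premium", "Free"]

full = one_hot(plans, prefix="Plan")
reduced = one_hot(plans, prefix="Plan", drop_first=True)

print(full)
print(reduced)

# In the full encoding the two columns always sum to 1, so either one
# can be derived from the other. That is the redundancy being removed.
row_sums = [a + b for a, b in zip(full["Plan_Free"], full["Plan_Premium"])]
print(row_sums)
```

After dropping the first column, a 0 in Plan_Premium still means "Free", so no information is lost.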
🎯 Today's Takeaway
Don't just feed raw text into a model. Transform your categories into a "One-Hot" format to ensure your machine understands the difference without inventing a hierarchy.
