Data science/ML/AI


Data science and machine learning hub Python, SQL, stats, ML, deep learning, projects, PDFs, roadmaps and AI resources. For beginners, data scientists and ML engineers 👉 https://rebrand.ly/bigdatachannels DMCA: @disclosure_bds Contact: @mldatascientist


Network: Programming, data science, ML - free courses by Big Data Specialist | India rank: 31,635 | Technology & Applications rank: 9,377...

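📈 Analytics overview of the Telegram channel Data science/ML/AI

The channel Data science/ML/AI (@datascience_bds) is an active participant in the English-language segment. The community currently has 13,674 subscribers, ranking 9,377th in the Technology & Applications category and 31,635th in India.

📊 Audience metrics and growth dynamics

Since its creation (date unknown), the channel has grown to 13,674 subscribers.

According to the latest data from June 9, 2026, the channel is operating steadily. The subscriber count changed by +155 over the past 30 days and by +5 over the past 24 hours, and overall reach remains considerable.

Verification status: Not verified
Engagement rate (ER): The average audience engagement rate is 8.03%. Within 24 hours of publication, a post typically draws reactions from 2.25% of all subscribers.
Post reach: Each post averages 1,098 views, of which about 308 typically accumulate on the first day.
Engagement and feedback: The audience participates actively, averaging 5 reactions per post.
Topic focus: Content centers on core topics such as panda, learning, row, api, and ethic.

📝 Description and content strategy

The author positions the channel as a platform for expressing subjective opinions:
“Data science and machine learning hub Python, SQL, stats, ML, deep learning, projects, PDFs, roadmaps and AI resources. For beginners, data scientists and ML engineers 👉 https://rebrand.ly/bigdatachannels DMCA: @disclosure_bds Contact: @mldatasci...”

With frequent updates (latest data collected on June 10, 2026), the channel stays fresh and maintains high reach. The analysis shows active audience engagement, making it a key point of influence in the Technology & Applications category.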

Subscribers: 13,674 (+5 in 24 hours, +19 in 7 days, +155 in 30 days)
Post views: 1,098 (~308 in 24 hours, ~452 in 48 hours)
Engagement rate: 8.03%
Posts per day: ~1

Post archive

Repost from Programming Quiz Channel

Which activation function outputs values between 0 and 1?

Anonymous voting


Repost from Programming, data science, ML - free courses by Big Data Specialist

Azure Data Engineering: A comprehensive Roadmap Including Professional Notes + Interview Guide

AI Agents vs LLM vs RAG vs Agentic AI

📏 Feature Scaling (Standardization vs. Normalization) ⚖️

Imagine you're trying to compare apples and oranges... or rather, "Age" measured in years (0-100) and "Salary" measured in dollars (0-1,000,000). Many Machine Learning algorithms get confused if one feature has a massive range and another is tiny. The larger-ranged feature dominates the distance calculations or the gradient descent updates, biasing the model towards it.

👉 This is where Feature Scaling comes in: making all your features play nicely together on the same playground.

Why Do We Need It? 🤔
• Distance-based algorithms (K-Nearest Neighbors, K-Means Clustering, Support Vector Machines) are very sensitive to the magnitude of features. A small difference in a large-ranged feature can seem more important than a big difference in a small-ranged feature.
• Gradient-descent-based algorithms (Linear Regression, Logistic Regression, Neural Networks) converge much faster when features are on a similar scale.

Two Main Flavors: Standardization & Normalization

1. Standardization (Z-score Normalization) ⚡️
• What it does: Transforms data to have a mean of 0 and a standard deviation of 1. It centers the data on the mean and scales it by the standard deviation.
• Formula: (x - mean) / standard_deviation
• When to use:
  • When your data roughly follows a Gaussian (Normal) distribution.
  • When your algorithm assumes features are normally distributed.
  • When you have moderate outliers (Standardization is less affected by them than Normalization, although outliers still shift the mean and standard deviation).
• Vibe: "Let's put everyone on a common baseline relative to the average."

2. Normalization (Min-Max Scaling) ↔️
• What it does: Scales data to a fixed range, usually 0 to 1. It squeezes all values into this specific interval.
• Formula: (x - min) / (max - min)
• When to use:
  • When you know your data doesn't follow a Gaussian distribution.
  • When your algorithm requires inputs to be within a specific range (e.g., some neural network activation functions).
  • When you don't have outliers (Normalization is very sensitive to extreme values, because the minimum and maximum define the whole scale).
• Vibe: "Let's squeeze everyone into this exact box, no matter what."

🐍 Code Example: Seeing the Difference with Scikit-learn

import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Sample Data: 'Age' (small range) vs. 'Income' (large range)
data = {
    'Age': [25, 30, 45, 60, 20, 70],
    'Income': [40000, 60000, 90000, 150000, 30000, 1000000] # An outlier in income!
}
df = pd.DataFrame(data)

print("Original Data:")
print(df)

# --- 1. Standardization ---
scaler_std = StandardScaler()
df_standardized = scaler_std.fit_transform(df)
print("\nStandardized Data (Mean=0, Std=1):")
print(pd.DataFrame(df_standardized, columns=df.columns))

# --- 2. Normalization ---
scaler_minmax = MinMaxScaler()
df_normalized = scaler_minmax.fit_transform(df)
print("\nNormalized Data (Range 0-1):")
print(pd.DataFrame(df_normalized, columns=df.columns))

Key Observation in Output: Notice how the huge 1,000,000 income outlier squeezes all the other Income values below about 0.13 on the 0-1 scale under Normalization, making them tiny. Standardization is a linear rescaling too, so the outlier still inflates the mean and standard deviation, and the other incomes end up bunched together at similar negative z-scores. The difference is that the output is not forced into a fixed range.

The Takeaway 🧠
There's no single "best" scaling method. Your choice depends on:
1. The distribution of your data.
2. The specific Machine Learning algorithm you're using.
3. The presence of outliers.

Always experiment and evaluate which scaling method performs best for your particular task!
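The two formulas from the post can also be checked by hand. The following is a minimal sketch using only Python's standard library and the Income column from the sample data; the helper names `standardize` and `min_max` are made up for illustration. It uses the population standard deviation (`pstdev`), which is the same convention `StandardScaler` follows.

```python
from statistics import mean, pstdev

# Income column from the sample data above (the last value is the outlier)
incomes = [40000, 60000, 90000, 150000, 30000, 1000000]

def standardize(values):
    """Z-score: (x - mean) / standard_deviation, using the population std."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max(values):
    """Min-max scaling: (x - min) / (max - min), mapping values onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

z = standardize(incomes)
scaled = min_max(incomes)

for raw, z_value, mm_value in zip(incomes, z, scaled):
    print(f"{raw:>9}  z={z_value:+.3f}  minmax={mm_value:.3f}")
```

Both transforms are linear, so the outlier compresses the remaining values in either case; only the target range differs.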

Machine Learning for Newbies.pdf (2.33 KB)

Repost from Programming Quiz Channel

Which loss function is commonly used for classification?

Anonymous voting


Real-world SQL Questions with Answers 🔥

Let's dive into some real-world SQL questions using a practical mini dataset.

📊 Dataset: employees

id  name    department  salary  manager_id
1   Aditi   HR          30000   5
2   Rahul   IT          50000   6
3   Neha    IT          60000   6
4   Aman    Sales       40000   7
5   Kiran   HR          70000   NULL
6   Mohit   IT          80000   NULL
7   Suresh  Sales       65000   NULL
8   Pooja   HR          30000   5

1️⃣ Find average salary per department

SELECT department, AVG(salary) AS avg_salary 
FROM employees 
GROUP BY department;

2️⃣ Find employees earning above their department average

SELECT name, department, salary 
FROM employees e 
WHERE salary > ( 
    SELECT AVG(salary) 
    FROM employees 
    WHERE department = e.department 
);

3️⃣ Find highest salary in each department

SELECT department, MAX(salary) AS max_salary 
FROM employees 
GROUP BY department;

4️⃣ Find employees who earn more than their manager

SELECT e.name 
FROM employees e 
JOIN employees m ON e.manager_id = m.id 
WHERE e.salary > m.salary;

5️⃣ Count employees in each department

SELECT department, COUNT(*) AS total_employees 
FROM employees 
GROUP BY department;

6️⃣ Find departments with more than 2 employees

SELECT department, COUNT(*) AS total 
FROM employees 
GROUP BY department 
HAVING COUNT(*) > 2;

7️⃣ Find the second highest salary

SELECT MAX(salary) 
FROM employees 
WHERE salary < (SELECT MAX(salary) FROM employees);

8️⃣ Find employees without managers

SELECT name 
FROM employees 
WHERE manager_id IS NULL;

9️⃣ Rank employees by salary

-- "rank" is a reserved word in MySQL 8+, so the alias is salary_rank
SELECT name, salary,
       RANK() OVER (ORDER BY salary DESC) AS salary_rank
FROM employees;

🔟 Find duplicate salary values

SELECT salary, COUNT(*) 
FROM employees 
GROUP BY salary 
HAVING COUNT(*) > 1;

1️⃣1️⃣ Top 2 highest unique salaries

SELECT DISTINCT salary 
FROM employees 
ORDER BY salary DESC 
LIMIT 2;

1️⃣2️⃣ Find the total salary payout per department

SELECT department, SUM(salary) AS total_payout 
FROM employees 
GROUP BY department;

1️⃣3️⃣ Find employees whose names start with 'A'

SELECT name 
FROM employees 
WHERE name LIKE 'A%';

1️⃣4️⃣ Find the manager's name for each employee (Self-Join)

SELECT e.name AS employee, m.name AS manager 
FROM employees e 
LEFT JOIN employees m ON e.manager_id = m.id;

1️⃣5️⃣ Find the department with the highest total salary expenditure

SELECT department 
FROM employees 
GROUP BY department 
ORDER BY SUM(salary) DESC 
LIMIT 1;
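
A minimal sketch for trying the queries above locally, assuming Python's built-in `sqlite3` module and an in-memory database. The column types in `CREATE TABLE` are an assumption, since the post only shows the values. SQLite accepts `LIMIT`, and its window functions (question 9) need SQLite 3.25 or newer, so other engines may need small syntax changes.

```python
import sqlite3

# The employees dataset exactly as listed in the post (None stands for NULL)
rows = [
    (1, "Aditi", "HR", 30000, 5),
    (2, "Rahul", "IT", 50000, 6),
    (3, "Neha", "IT", 60000, 6),
    (4, "Aman", "Sales", 40000, 7),
    (5, "Kiran", "HR", 70000, None),
    (6, "Mohit", "IT", 80000, None),
    (7, "Suresh", "Sales", 65000, None),
    (8, "Pooja", "HR", 30000, 5),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "id INTEGER PRIMARY KEY, name TEXT, department TEXT, "
    "salary INTEGER, manager_id INTEGER)"
)
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?, ?)", rows)

# Question 2: employees earning above their department average
above_avg = conn.execute(
    """
    SELECT name, department, salary
    FROM employees e
    WHERE salary > (
        SELECT AVG(salary) FROM employees WHERE department = e.department
    )
    ORDER BY name
    """
).fetchall()

# Question 4: employees who earn more than their manager
beats_manager = conn.execute(
    """
    SELECT e.name
    FROM employees e
    JOIN employees m ON e.manager_id = m.id
    WHERE e.salary > m.salary
    """
).fetchall()

# Question 7: second highest salary
second_highest = conn.execute(
    "SELECT MAX(salary) FROM employees "
    "WHERE salary < (SELECT MAX(salary) FROM employees)"
).fetchone()[0]

print(above_avg)
print(beats_manager)
print(second_highest)
```

With this particular dataset, question 4 returns no rows, because every manager out-earns their direct reports.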

👉 Follow @datascience_bds for more.

Neural Networks Explained

Intro to ML in 50 Terms.pdf (1.47 MB)

Generative AI vs Agentic AI vs AI Agents

🔎 Exploratory Data Analysis (EDA) 📊

You've got your dataset. Now what? Before jumping into complex modeling or even creating charts, the absolute most critical phase is Exploratory Data Analysis (EDA). Think of it as a detective's initial investigation. EDA is your chance to get intimately familiar with your data before you ask it questions.

🕵️ 1. What is EDA?
EDA is the process of analyzing datasets to summarize their main characteristics, often with visual methods. It’s about understanding:
• What kind of data do I have?
• Are there any obvious errors or strange patterns?
• What are the potential relationships between variables?
• What questions can I even hope to answer with this data?

💡 2. Key EDA Activities
• Summary Statistics: Get a quick overview of your numerical columns (mean, median, min, max, standard deviation).
  • Pandas: .describe(), .info()
• Data Visualization:
  • Histograms: For the distribution of a single numerical variable.
  • Box Plots: To see distribution, outliers, and compare across categories.
  • Scatter Plots: To check for relationships between two numerical variables.
  • Bar Charts: For counts of categorical variables.
• Missing Value Analysis: Identify how much data is missing and where.
• Outlier Detection: Find extreme values that might skew your results.
• Correlation Matrices: Visualize how numerical variables relate to each other.

🐍 A Peek into EDA with Python (Pandas & Matplotlib/Seaborn)

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assume 'df' is your loaded DataFrame

# --- Basic Overview ---
print("Dataset Info:")
df.info() # Data types, non-null counts
print("\nSummary Statistics:")
print(df.describe()) # For numerical columns

# --- Visualizing Distributions ---
# For a numerical column like 'Age'
plt.figure(figsize=(8, 4))
sns.histplot(data=df, x='Age', kde=True) # kde=True adds a smooth curve
plt.title('Distribution of Age')
plt.show()

# For a categorical column like 'Category'
plt.figure(figsize=(8, 4))
sns.countplot(data=df, y='Category') # Use y for horizontal bars if many categories
plt.title('Count of Categories')
plt.show()

# --- Checking for Relationships ---
# Between two numerical columns like 'Salary' and 'YearsExperience'
plt.figure(figsize=(8, 5))
sns.scatterplot(data=df, x='YearsExperience', y='Salary', hue='Department') # hue adds another dimension
plt.title('Salary vs. Years of Experience')
plt.show()

# --- Correlation Matrix (for numerical columns) ---
plt.figure(figsize=(10, 7))
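# numeric_only=True skips text columns; without it, pandas 2.0+ raises an error on them
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm') # annot=True shows numbers, cmap changes color scheme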
plt.title('Correlation Matrix of Numerical Features')
plt.show()

(Note: correlation is only defined for numerical columns. On pandas 2.0 and later, df.corr() raises an error if the DataFrame contains text columns unless you pass numeric_only=True or select the numerical columns first. For categorical data, use other methods like df.groupby().)

💡 The "Aha!" Moment
EDA is where you uncover the hidden stories in your data. You might find:
• "Wow, everyone in the 'South' region buys this product!"
• "This 'Age' column has weird negative values, something's wrong."
• "Sales dramatically drop after the 15th of the month."

🎯 What should you do?
✔️ Treat your data like a mystery to be solved, not a problem to be fixed.
✔️ Use basic plots and stats to understand your variables before building models.
✔️ Identify potential data quality issues early.
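
The post lists Missing Value Analysis and Outlier Detection among the key activities without showing them. A minimal pandas sketch follows. The toy DataFrame is invented for illustration, and the 1.5 × IQR rule used in the hypothetical helper `iqr_outliers` is one common convention for flagging outliers, not the only one.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
df = pd.DataFrame({
    "Age": [25, 31, None, 42, 38, -3, 29, 35],
    "Salary": [40000, 52000, 61000, None, 58000, 75000, 45000, 50000],
})

# --- Missing value analysis ---
missing_counts = df.isna().sum()   # number of missing entries per column
missing_share = df.isna().mean()   # fraction of missing entries per column
print(missing_counts)
print(missing_share)

# --- Outlier detection with the 1.5 * IQR rule ---
def iqr_outliers(series):
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return series[(series < lower) | (series > upper)]

age_outliers = iqr_outliers(df["Age"])
print(age_outliers)
```

A flagged value is a prompt to investigate, not an automatic deletion: a negative age is a data-entry error, while an unusually high salary may be genuine.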

DAX Guide .pdf (10.51 MB)

🧹 The "80/20" Rule of Data Science (Data Cleaning) 📊

Most beginners think Data Science is 80% building fancy AI models and 20% looking at charts. In reality, it’s closer to the opposite: a common rule of thumb says roughly 80% of your time is spent cleaning and preparing data.

📌 Understanding this is the difference between a "theoretician" and a real-world Data Analyst.

🔹 1. What is Data Cleaning (Wrangling)?
Real-world data is "dirty." It’s full of mistakes, missing pieces, and weird formatting. Data cleaning is the process of fixing these issues so your analysis is actually accurate.

🔹 2. The Golden Rule: "Garbage In, Garbage Out" (GIGO)
If you feed "garbage" (bad data) into the most expensive AI model in the world, you will get "garbage" results. No model can save you from bad data.

🔹 3. Common "Dirty Data" Problems
• Missing Values (NaN): Someone forgot to fill out a form field.
• Duplicates: The same customer is listed three times with different IDs.
• Inconsistent Formatting: One row says "New York," another says "NY," and another says "ny."
• Outliers: A "human age" column contains the number 250.

🔹 4. How to Fix It? (The Toolkit)
• Imputation: Filling in missing values using the Mean or Median.
• Standardization: Converting all text to lowercase and trimming extra spaces.
• Deduplication: Removing the "noise" of repeated records.
• Filtering: Removing impossible outliers that would skew your average.

🔹 5. Why it matters for Visualization
If you don't clean your data before making a chart, your visualization will lie to you. A single "outlier" can make a bar chart look completely flat, hiding the real trends.

👉 A clean dataset and a simple model will usually beat a messy dataset and a complex model.
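
The four tools from the toolkit can be sketched in a few lines of pandas. The tiny `raw` table is made up for illustration, and the 0-120 age limits and the "ny" alias mapping are assumptions chosen for this example.

```python
import pandas as pd

# Hypothetical messy data, invented for illustration only
raw = pd.DataFrame({
    "customer": ["Ann", "Ann", "Bob", "Cara", "Dev"],
    "city": [" New York", " New York", "ny", "NY ", "Boston"],
    "age": [34, 34, None, 250, 41],
})

clean = raw.copy()

# Standardization: trim spaces, lowercase, then map a known alias to one spelling
clean["city"] = (
    clean["city"].str.strip().str.lower().replace({"ny": "new york"})
)

# Deduplication: drop rows that are exact repeats
clean = clean.drop_duplicates()

# Filtering: drop impossible ages but keep missing ones for the next step
plausible = clean["age"].isna() | clean["age"].between(0, 120)
clean = clean[plausible].copy()

# Imputation: fill the remaining gaps with the median age
clean["age"] = clean["age"].fillna(clean["age"].median())

clean = clean.reset_index(drop=True)
print(clean)
```

The order is a design choice: standardizing text first lets near-identical rows be caught as duplicates, and filtering before imputation keeps the impossible value 250 from distorting the median.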

Roadmap to Master Agentic AI

🤖 Teaching Machines to Read: One-Hot Encoding 🔢

Machine Learning models are geniuses at math, but they are illiterate when it comes to words. If your dataset has a column for "City" (Paris, Tokyo, New York), a linear regression model will typically raise an error because it can't multiply "Paris" by a coefficient.

👉 Feature Engineering is how we translate human categories into machine-readable math.

💡 The Wrong Way: Label Encoding
You might think:

I'll just give them numbers! Paris = 1, Tokyo = 2, New York = 3.

⛔️ The Danger: The model will think New York (3) is "greater than" Paris (1). It might assume New York is three times more "City-ish" than Paris. For categories with no natural order, this creates fake relationships in your data.

✅ The Right Way: One-Hot Encoding (Dummy Variables)
We create a new column for every category. If the row is "Paris," the Paris column gets a 1 and the others get a 0. It’s binary. It’s fair. It’s mathematical.

🐍 Let's see it in Python (Pandas)

import pandas as pd

# 1. Create a tiny dataset
data = {'User': ['Alex', 'Sam', 'Jo'], 
        'Plan': ['Free', 'Premium', 'Free']}
df = pd.DataFrame(data)

# 2. The Magic Function
df_encoded = pd.get_dummies(df, columns=['Plan'])

print(df_encoded)

⚠️ The "Dummy Variable Trap" (Pro Tip)
If you know someone is NOT on the Free plan, you automatically know they ARE on the Premium plan (in a 2-plan system). Having both columns creates "Multicollinearity": basically, telling the model the same thing twice.

The Fix: For linear models, use drop_first=True in pd.get_dummies to remove that redundant column and keep your model lean. Tree-based models are generally not hurt by the extra column.

🎯 Today's Takeaway
Don't just feed raw text into a model. Transform your categories into a "One-Hot" format to ensure your machine understands the difference without inventing a hierarchy.
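
A short sketch of the fix, reusing the toy dataset from the post. Passing `dtype=int` is an optional choice made here so the output shows 0/1 rather than True/False, which newer pandas versions produce by default.

```python
import pandas as pd

# Same tiny dataset as in the post
df = pd.DataFrame({
    "User": ["Alex", "Sam", "Jo"],
    "Plan": ["Free", "Premium", "Free"],
})

# One column per category: Plan_Free and Plan_Premium always sum to 1
full = pd.get_dummies(df, columns=["Plan"], dtype=int)

# drop_first=True removes the first category (Free), which becomes the baseline
reduced = pd.get_dummies(df, columns=["Plan"], drop_first=True, dtype=int)

print(full)
print(reduced)
```

In `reduced`, a 0 in `Plan_Premium` means "Free", so no information is lost by dropping the column.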

LLM Overview Module.pdf (0.71 KB)

🎯 Overfitting vs. Underfitting: Easy Explanation ⚖️

Imagine you’re studying for a math exam.

📌 Underfitting is like only learning that "1+1=2." You’re too simple. When the exam asks "5+5," you fail because you didn't learn enough patterns.
• What Happens: Your model is too lazy. It misses the point entirely.
• The Result: High error on both your training data and your new data.

📌 Overfitting is like memorizing every single practice question word-for-word. You know that on page 4, the answer is "C." But when the exam changes even one number, you fail because you didn't learn the logic, you just memorized the noise.
• What Happens: Your model is a try-hard. It’s "hallucinating" patterns that don't exist.
• The Result: Near-perfect accuracy on training data, but much worse results on real-world data.

✅ The Sweet Spot: Generalization 🎯
In Data Science, we want a model that learns the trend, not the hiccups.

How to spot it?
👉 Check your accuracy scores. If your training score is 99% but your testing score is 60%, you’ve overfitted. You built a model that is a genius at yesterday’s news but useless for tomorrow’s predictions.

Quick Fixes:
• Overfitting? Give the model more data to look at, or simplify the model so it stops overthinking.
• Underfitting? Give the model more "features" (details) to look at, or use a more powerful algorithm.

👉 In data, "perfect" is usually a red flag. You don't want a model that memorizes the past; you want one that understands the future.
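
The train-versus-test check can be sketched with a tiny k-nearest-neighbors classifier written in NumPy. Everything here is an illustrative assumption: the synthetic data, the 20% label noise, and the choice of k = 1 (a model that memorizes) against k = 15 (a simpler, smoother model).

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Synthetic 1-D data: the true label is x > 0, with about 20% of labels flipped."""
    x = rng.uniform(-1, 1, size=n)
    y = (x > 0).astype(int)
    flip = rng.random(n) < 0.2
    y[flip] = 1 - y[flip]
    return x, y

def knn_predict(x_train, y_train, x_query, k):
    """Majority vote among the k nearest training points (k should be odd)."""
    dist = np.abs(x_query[:, None] - x_train[None, :])
    nearest = np.argsort(dist, axis=1)[:, :k]
    votes = y_train[nearest].mean(axis=1)
    return (votes > 0.5).astype(int)

def accuracy(y_true, y_pred):
    return float((y_true == y_pred).mean())

x_train, y_train = make_data(200)
x_test, y_test = make_data(200)

scores = {}
for k in (1, 15):
    train_acc = accuracy(y_train, knn_predict(x_train, y_train, x_train, k))
    test_acc = accuracy(y_test, knn_predict(x_train, y_train, x_test, k))
    scores[k] = (train_acc, test_acc)
    print(f"k={k:>2}  train={train_acc:.2f}  test={test_acc:.2f}")
```

With k = 1 every training point is its own nearest neighbor, so the training score is perfect by construction, noise included. The gap between that score and the test score is the overfitting signal described above; raising k is one way to simplify the model.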

Data Modelling for Data Engineering.pdf (0.64 KB)

7 Layers of AI Automation

Analysis vs Analytics.pdf (1.75 KB)