Data science/ML/AI


Data science and machine learning hub Python, SQL, stats, ML, deep learning, projects, PDFs, roadmaps and AI resources. For beginners, data scientists and ML engineers 👉 https://rebrand.ly/bigdatachannels DMCA: @disclosure_bds Contact: @mldatascientist

Network: Programming, data science, ML - free courses by Big Data Specialist
Rankings: India 31 607 · Technologies & Applications 9 380...

📈 Analytical overview of Telegram channel Data science/ML/AI

The channel Data science/ML/AI (@datascience_bds) is active in the English-language segment. It currently has 13 685 subscribers and ranks 9 380 in the Technologies & Applications category and 31 607 in the India region.

📊 Audience metrics and dynamics

The channel's creation date is unknown. To date it has gathered an audience of 13 685 subscribers.

According to the latest data from 10 June, 2026, the channel shows stable activity: the number of subscribers changed by +143 over the last 30 days and by +2 over the last 24 hours.

Verification status: Not verified
Engagement rate (ER): The average audience engagement rate is 8.09%. Within the first 24 hours after publication, content typically collects 2.22% reactions from the total number of subscribers.
Post reach: On average, each post receives 1 106 views. Within the first day, a publication typically gains 304 views.
Reactions and interaction: The audience actively supports content: the average number of reactions per post is 5.
Thematic interests: Content is focused on key topics such as panda, learning, row, api, ethic.

📝 Description and content policy

The author describes the resource as follows:
“Data science and machine learning hub Python, SQL, stats, ML, deep learning, projects, PDFs, roadmaps and AI resources. For beginners, data scientists and ML engineers 👉 https://rebrand.ly/bigdatachannels DMCA: @disclosure_bds Contact: @mldatasci...”

The channel is updated regularly (latest data received on 11 June, 2026). The analytics above indicate that the audience interacts with the content consistently, which gives the channel a visible presence in the Technologies & Applications category.

Subscribers: 13 685 (+2 in 24 hours, +21 in 7 days, +143 in 30 days)
Post views: 1 106 (~304 in 24 hours, ~444 in 48 hours)
Engagement rate: 8.09%
Posts per day: ~1


Posts Archive


Repost from Programming Quiz Channel

Which activation function outputs values between 0 and 1?


Repost from Programming, data science, ML - free courses by Big Data Specialist

Azure Data Engineering: A comprehensive Roadmap Including Professional Notes + Interview Guide


AI Agents vs LLM vs RAG vs Agentic AI


📏 Feature Scaling (Standardization vs. Normalization) ⚖️

Imagine you're trying to compare apples and oranges... or rather, "Age" measured in years (0-100) and "Salary" measured in dollars (0-1,000,000). Many Machine Learning algorithms struggle when one feature has a massive range and another is tiny. The larger-ranged feature dominates the distance calculations or the gradient descent updates, biasing the model towards it.

👉 This is where Feature Scaling comes in: making all your features play nicely together on the same playground.

Why Do We Need It? 🤔
• Distance-based algorithms (K-Nearest Neighbors, K-Means Clustering, Support Vector Machines) are very sensitive to the magnitude of features. A small difference in a large-ranged feature can outweigh a big difference in a small-ranged feature.
• Gradient-descent-based algorithms (Linear Regression, Logistic Regression, Neural Networks) converge much faster when features are on a similar scale.

Two Main Flavors: Standardization & Normalization

1. Standardization (Z-score Normalization) ⚡️
• What it does: Transforms data to have a mean of 0 and a standard deviation of 1. It centers each feature on its mean and divides by its standard deviation.
• Formula: (x - mean) / standard_deviation
• When to use:
  • When your data roughly follows a Gaussian (Normal) distribution.
  • When your algorithm assumes features are normally distributed.
  • When you have outliers (Standardization is less affected by them than Normalization, although the mean and standard deviation are still pulled by extreme values).
• Vibe: "Let's put everyone on a common baseline relative to the average."

2. Normalization (Min-Max Scaling) ↔️
• What it does: Scales data to a fixed range, usually 0 to 1. It squeezes all values into this specific interval.
• Formula: (x - min) / (max - min)
• When to use:
  • When you know your data doesn't follow a Gaussian distribution.
  • When your algorithm requires inputs to be within a specific range (e.g., some neural network activation functions).
  • When you don't have outliers (Normalization is very sensitive to extreme values).
• Vibe: "Let's squeeze everyone into this exact box, no matter what."

🐍 Code Example: Seeing the Difference with Scikit-learn

import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Sample Data: 'Age' (small range) vs. 'Income' (large range)
data = {
    'Age': [25, 30, 45, 60, 20, 70],
    'Income': [40000, 60000, 90000, 150000, 30000, 1000000] # An outlier in income!
}
df = pd.DataFrame(data)

print("Original Data:")
print(df)

# --- 1. Standardization ---
scaler_std = StandardScaler()
df_standardized = scaler_std.fit_transform(df)
print("\nStandardized Data (Mean=0, Std=1):")
print(pd.DataFrame(df_standardized, columns=df.columns))

# --- 2. Normalization ---
scaler_minmax = MinMaxScaler()
df_normalized = scaler_minmax.fit_transform(df)
print("\nNormalized Data (Range 0-1):")
print(pd.DataFrame(df_normalized, columns=df.columns))

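Key Observation in Output: Notice how the huge 1,000,000 income outlier squeezes all other Income values into a narrow band near 0 under Normalization (roughly 0 to 0.12), while the outlier itself maps to exactly 1. Standardization does not force the values into a fixed range, but the outlier still inflates the mean and standard deviation, so the remaining incomes also end up bunched together (roughly -0.6 to -0.2) while the outlier sits above 2. Neither method removes the outlier's influence; Standardization is only somewhat less distorted by it.

The Takeaway 🧠
There's no single "best" scaling method. Your choice depends on:
1. The distribution of your data.
2. The specific Machine Learning algorithm you're using.
3. The presence of outliers.

Always experiment and evaluate which scaling method performs best for your particular task!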
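To connect the two formulas in the post to the scaler output, here is a minimal sketch that applies them by hand with NumPy to the same Income values. It assumes the population standard deviation (ddof=0), which is what scikit-learn's StandardScaler uses; the variable names are illustrative only.

```python
import numpy as np

# Same Income values as in the post, including the 1,000,000 outlier
income = np.array([40000, 60000, 90000, 150000, 30000, 1000000], dtype=float)

# Standardization: (x - mean) / standard_deviation
# np.std defaults to ddof=0 (population standard deviation)
z_scores = (income - income.mean()) / income.std()

# Normalization: (x - min) / (max - min)
min_max = (income - income.min()) / (income.max() - income.min())

print("Z-scores:", np.round(z_scores, 3))
print("Min-max: ", np.round(min_max, 3))
```

Comparing the two printed rows makes it easy to see how far the non-outlier incomes sit from each other under each method.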


Machine Learning for Newbies.pdf (2.33 KB)


Repost from Programming Quiz Channel

Which loss function is commonly used for classification?


Real-world SQL Questions with Answers 🔥

Let's dive into some real-world SQL questions using a practical mini dataset.

📊 Dataset: employees

id  name    department  salary  manager_id
1   Aditi   HR          30000   5
2   Rahul   IT          50000   6
3   Neha    IT          60000   6
4   Aman    Sales       40000   7
5   Kiran   HR          70000   NULL
6   Mohit   IT          80000   NULL
7   Suresh  Sales       65000   NULL
8   Pooja   HR          30000   5

1️⃣ Find average salary per department

SELECT department, AVG(salary) AS avg_salary 
FROM employees 
GROUP BY department;

2️⃣ Find employees earning above their department average

SELECT name, department, salary 
FROM employees e 
WHERE salary > ( 
    SELECT AVG(salary) 
    FROM employees 
    WHERE department = e.department 
);

3️⃣ Find highest salary in each department

SELECT department, MAX(salary) AS max_salary 
FROM employees 
GROUP BY department;

4️⃣ Find employees who earn more than their manager

SELECT e.name 
FROM employees e 
JOIN employees m ON e.manager_id = m.id 
WHERE e.salary > m.salary;

5️⃣ Count employees in each department

SELECT department, COUNT(*) AS total_employees 
FROM employees 
GROUP BY department;

6️⃣ Find departments with more than 2 employees

SELECT department, COUNT(*) AS total 
FROM employees 
GROUP BY department 
HAVING COUNT(*) > 2;

7️⃣ Find the second highest salary

SELECT MAX(salary) 
FROM employees 
WHERE salary < (SELECT MAX(salary) FROM employees);

8️⃣ Find employees without managers

SELECT name 
FROM employees 
WHERE manager_id IS NULL;

9️⃣ Rank employees by salary

-- "rank" is a reserved word in some databases (e.g. MySQL 8), so use a different alias
SELECT name, salary,
       RANK() OVER (ORDER BY salary DESC) AS salary_rank
FROM employees;

🔟 Find duplicate salary values

SELECT salary, COUNT(*) 
FROM employees 
GROUP BY salary 
HAVING COUNT(*) > 1;

1️⃣1️⃣ Top 2 highest unique salaries

SELECT DISTINCT salary 
FROM employees 
ORDER BY salary DESC 
LIMIT 2;

1️⃣2️⃣ Find the total salary payout per department

SELECT department, SUM(salary) AS total_payout 
FROM employees 
GROUP BY department;

1️⃣3️⃣ Find employees whose names start with 'A'

SELECT name 
FROM employees 
WHERE name LIKE 'A%';

1️⃣4️⃣ Find the manager's name for each employee (Self-Join)

SELECT e.name AS employee, m.name AS manager 
FROM employees e 
LEFT JOIN employees m ON e.manager_id = m.id;

1️⃣5️⃣ Find the department with the highest total salary expenditure

SELECT department 
FROM employees 
GROUP BY department 
ORDER BY SUM(salary) DESC 
LIMIT 1;
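To try the queries above without installing a database, one option is Python's built-in sqlite3 module. The sketch below loads the sample employees table into an in-memory database and runs three of the queries; the column types and the ORDER BY clauses are assumptions added so the results are deterministic.

```python
import sqlite3

# Sample rows from the post: (id, name, department, salary, manager_id)
rows = [
    (1, "Aditi", "HR", 30000, 5),
    (2, "Rahul", "IT", 50000, 6),
    (3, "Neha", "IT", 60000, 6),
    (4, "Aman", "Sales", 40000, 7),
    (5, "Kiran", "HR", 70000, None),
    (6, "Mohit", "IT", 80000, None),
    (7, "Suresh", "Sales", 65000, None),
    (8, "Pooja", "HR", 30000, 5),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "id INTEGER PRIMARY KEY, name TEXT, department TEXT, "
    "salary INTEGER, manager_id INTEGER)"
)
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?, ?)", rows)

# Question 2: employees earning above their department average
above_avg = conn.execute(
    """
    SELECT name, department, salary
    FROM employees e
    WHERE salary > (
        SELECT AVG(salary) FROM employees WHERE department = e.department
    )
    ORDER BY name
    """
).fetchall()

# Question 6: departments with more than 2 employees
big_departments = conn.execute(
    """
    SELECT department, COUNT(*) AS total
    FROM employees
    GROUP BY department
    HAVING COUNT(*) > 2
    ORDER BY department
    """
).fetchall()

# Question 7: second highest salary
second_highest = conn.execute(
    "SELECT MAX(salary) FROM employees "
    "WHERE salary < (SELECT MAX(salary) FROM employees)"
).fetchone()[0]

print(above_avg)
print(big_departments)
print(second_highest)
conn.close()
```

The remaining queries can be run the same way; the RANK() query needs a SQLite build with window-function support.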

👉 Follow @datascience_bds for more.


Neural Networks Explained


Intro to ML in 50 Terms.pdf (1.47 MB)


Generative AI vs Agentic AI vs AI Agents


🔎 Exploratory Data Analysis (EDA) 📊

You've got your dataset. Now what? Before jumping into complex modeling or even creating charts, the most critical phase is Exploratory Data Analysis (EDA). Think of it as a detective's initial investigation. EDA is your chance to get intimately familiar with your data before you ask it questions.

🕵️ 1. What is EDA?
EDA is the process of analyzing datasets to summarize their main characteristics, often with visual methods. It's about understanding:
• What kind of data do I have?
• Are there any obvious errors or strange patterns?
• What are the potential relationships between variables?
• What questions can I even hope to answer with this data?

💡 2. Key EDA Activities
• Summary Statistics: Get a quick overview of your numerical columns (mean, median, min, max, standard deviation). Pandas: .describe(), .info()
• Data Visualization:
  • Histograms: For the distribution of a single numerical variable.
  • Box Plots: To see distribution, outliers, and compare across categories.
  • Scatter Plots: To check for relationships between two numerical variables.
  • Bar Charts: For counts of categorical variables.
• Missing Value Analysis: Identify how much data is missing and where.
• Outlier Detection: Find extreme values that might skew your results.
• Correlation Matrices: Visualize how numerical variables relate to each other.

🐍 A Peek into EDA with Python (Pandas & Matplotlib/Seaborn)

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assume 'df' is your loaded DataFrame

# --- Basic Overview ---
print("Dataset Info:")
df.info() # Data types, non-null counts
print("\nSummary Statistics:")
print(df.describe()) # For numerical columns

# --- Visualizing Distributions ---
# For a numerical column like 'Age'
plt.figure(figsize=(8, 4))
sns.histplot(data=df, x='Age', kde=True) # kde=True adds a smooth curve
plt.title('Distribution of Age')
plt.show()

# For a categorical column like 'Category'
plt.figure(figsize=(8, 4))
sns.countplot(data=df, y='Category') # Use y for horizontal bars if many categories
plt.title('Count of Categories')
plt.show()

# --- Checking for Relationships ---
# Between two numerical columns like 'Salary' and 'YearsExperience'
plt.figure(figsize=(8, 5))
sns.scatterplot(data=df, x='YearsExperience', y='Salary', hue='Department') # hue adds another dimension
plt.title('Salary vs. Years of Experience')
plt.show()

# --- Correlation Matrix (for numerical columns) ---
plt.figure(figsize=(10, 7))
# numeric_only=True skips text columns; without it, pandas 2.0+ raises an error on non-numeric data
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm') # annot=True shows numbers, cmap changes color scheme
plt.title('Correlation Matrix of Numerical Features')
plt.show()

(Note: df.corr() only works on numerical columns. In pandas 2.0 and later it raises an error when non-numeric columns are present, unless you pass numeric_only=True or select the numeric columns first. For categorical data, use other methods such as df.groupby().)

💡 The "Aha!" Moment
EDA is where you uncover the hidden stories in your data. You might find:
• "Wow, everyone in the 'South' region buys this product!"
• "This 'Age' column has weird negative values, something's wrong."
• "Sales dramatically drop after the 15th of the month."

🎯 What should you do?
✔️ Treat your data like a mystery to be solved, not a problem to be fixed.
✔️ Use basic plots and stats to understand your variables before building models.
✔️ Identify potential data quality issues early.
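The post lists Missing Value Analysis and Outlier Detection among the key activities without showing them in code. A minimal sketch of both is below, using a hypothetical toy DataFrame and the common 1.5 × IQR rule; the rule and its threshold are one convention among several, not something the post prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing Age, one missing Region, one negative Age
df = pd.DataFrame({
    "Age": [25, 32, 41, np.nan, 29, -5, 38, 35],
    "Region": ["South", "North", "South", "East", None, "South", "North", "East"],
})

# --- Missing value analysis: count and share of missing values per column ---
missing_counts = df.isna().sum()
missing_share = df.isna().mean()
print(missing_counts)
print(missing_share.round(3))

# --- Outlier detection with the 1.5 * IQR rule ---
q1 = df["Age"].quantile(0.25)
q3 = df["Age"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Rows whose Age falls outside the [lower, upper] fence (NaN rows are not flagged)
outliers = df[(df["Age"] < lower) | (df["Age"] > upper)]
print(outliers)
```

Flagged rows are candidates for a closer look rather than automatic deletion; a negative age is a data error, while a genuinely extreme value may be real.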


DAX Guide .pdf (10.51 MB)


🧹 The "80/20" Rule of Data Science (Data Cleaning) 📊

Most beginners think Data Science is 80% building fancy AI models and 20% looking at charts. In practice it tends to be the opposite: a commonly cited rule of thumb is that about 80% of your time goes into cleaning and preparing data.

📌 Understanding this is the difference between a "theoretician" and a real-world Data Analyst.

🔹 1. What is Data Cleaning (Wrangling)?
Real-world data is "dirty." It's full of mistakes, missing pieces, and weird formatting. Data cleaning is the process of fixing these issues so your analysis is actually accurate.

🔹 2. The Golden Rule: "Garbage In, Garbage Out" (GIGO)
If you feed "garbage" (bad data) into the most expensive AI model in the world, you will get "garbage" results. No model can save you from bad data.

🔹 3. Common "Dirty Data" Problems
• Missing Values (NaN): Someone forgot to fill out a form field.
• Duplicates: The same customer is listed three times with different IDs.
• Inconsistent Formatting: One row says "New York," another says "NY," and another says "ny."
• Outliers: A "human age" column contains the number 250.

🔹 4. How to Fix It? (The Toolkit)
• Imputation: Filling in missing values using the Mean or Median.
• Standardization: Converting all text to lowercase and trimming extra spaces.
• Deduplication: Removing the "noise" of repeated records.
• Filtering: Removing impossible outliers that would skew your average.

🔹 5. Why it matters for Visualization
If you don't clean your data before making a chart, your visualization will lie to you. A single "outlier" can make a bar chart look completely flat, hiding the real trends.

👉 A clean dataset and a simple model will usually beat a messy dataset and a complex model.
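The four toolkit steps above can be sketched in pandas as follows. The DataFrame, the alias map for "ny", and the 0-120 plausible age range are all hypothetical choices made for illustration; real projects need their own rules.

```python
import numpy as np
import pandas as pd

# Hypothetical dirty data: inconsistent city spelling, a missing age,
# an impossible age (250), and a repeated record
raw = pd.DataFrame({
    "customer": ["Ann", "Ann", "Bob", "Cara", "Dev"],
    "city": [" New York", "new york ", "NY", "Boston", "boston"],
    "age": [34, 34, np.nan, 250, 41],
})

clean = raw.copy()

# Standardization: trim spaces, lowercase, then map known aliases
clean["city"] = clean["city"].str.strip().str.lower().replace({"ny": "new york"})

# Filtering: drop rows with impossible ages (keep missing ones for imputation)
plausible = clean["age"].isna() | clean["age"].between(0, 120)
clean = clean[plausible]

# Imputation: fill missing ages with the median of the remaining values
clean["age"] = clean["age"].fillna(clean["age"].median())

# Deduplication: remove rows that are now exact repeats
clean = clean.drop_duplicates().reset_index(drop=True)

print(clean)
```

Note that the two "Ann" rows only become exact duplicates after the text is standardized, which is why deduplication runs last in this sketch.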


Roadmap to Master Agentic AI


🤖 Teaching Machines to Read: One-Hot Encoding 🔢

Machine Learning models are geniuses at math, but they are illiterate when it comes to words. If your dataset has a column for "City" (Paris, Tokyo, New York), a linear regression model will raise an error because it can't multiply "Paris" by a coefficient.

👉 Feature Engineering is how we translate human categories into machine-readable math.

💡 The Wrong Way: Label Encoding
You might think:

"I'll just give them numbers! Paris = 1, Tokyo = 2, New York = 3."

⛔️ The Danger: The model will think New York (3) is "greater than" Paris (1). It might assume New York is three times more "City-ish" than Paris. For categories with no natural order, this creates fake relationships in your data.

✅ The Right Way: One-Hot Encoding (Dummy Variables)
We create a new column for every category. If the row is "Paris," the Paris column gets a 1 and the others get a 0. It's binary. It's fair. It's mathematical.

🐍 Let's see it in Python (Pandas)

import pandas as pd

# 1. Create a tiny dataset
data = {'User': ['Alex', 'Sam', 'Jo'], 
        'Plan': ['Free', 'Premium', 'Free']}
df = pd.DataFrame(data)

# 2. The Magic Function
df_encoded = pd.get_dummies(df, columns=['Plan'])

print(df_encoded)

⚠️ The "Dummy Variable Trap" (Pro Tip)
If you know someone is NOT on the Free plan, you automatically know they ARE on the Premium plan (in a 2-plan system). Having both columns creates "Multicollinearity": basically, telling the model the same thing twice.

The Fix: For linear models such as linear or logistic regression, pass drop_first=True to pd.get_dummies to remove that redundant column and keep your model lean. Tree-based models are generally not affected and can keep every column.

🎯 Today's Takeaway
Don't just feed raw text into a model. Transform your categories into a "One-Hot" format to ensure your machine understands the difference without inventing a hierarchy.
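As a quick illustration of the drop_first tip, the sketch below encodes the same tiny User/Plan dataset with and without drop_first=True. The dtype=int argument is an addition here so the columns print as 0/1 rather than True/False.

```python
import pandas as pd

df = pd.DataFrame({
    "User": ["Alex", "Sam", "Jo"],
    "Plan": ["Free", "Premium", "Free"],
})

# Full one-hot encoding: one column per category
full = pd.get_dummies(df, columns=["Plan"], dtype=int)

# drop_first=True removes the first category column (Plan_Free here),
# so "not Premium" implicitly means "Free"
reduced = pd.get_dummies(df, columns=["Plan"], drop_first=True, dtype=int)

print(full)
print(reduced)
```

In the full version the two Plan columns always sum to 1 for each row, which is exactly the redundancy the post describes.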


LLM Overview Module.pdf (0.71 KB)


🎯 Overfitting vs. Underfitting: Easy Explanation ⚖️

Imagine you're studying for a math exam.

📌 Underfitting is like only learning that "1+1=2." You're too simple. When the exam asks "5+5," you fail because you didn't learn enough patterns.
• What Happens: Your model is too lazy. It misses the point entirely.
• The Result: High error on both your training data and your new data.

📌 Overfitting is like memorizing every single practice question word-for-word. You know that on page 4, the answer is "C." But when the exam changes even one number, you fail because you didn't learn the logic, you just memorized the noise.
• What Happens: Your model is a try-hard. It's "hallucinating" patterns that don't exist.
• The Result: Near-perfect accuracy on training data, but much worse results on real-world data.

✅ The Sweet Spot: Generalization 🎯
In Data Science, we want a model that learns the trend, not the hiccups.

How to spot it?
👉 Compare your training and testing scores. If your training score is 99% but your testing score is 60%, you've overfitted. You built a model that is a genius at yesterday's news but useless for tomorrow's predictions.

Quick Fixes:
• Overfitting? Give the model more data to look at, or simplify the model so it stops overthinking.
• Underfitting? Give the model more "features" (details) to look at, or use a more powerful algorithm.

👉 In data, "perfect" is usually a red flag. You don't want a model that memorizes the past; you want one that understands the future.
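The "compare training and testing scores" check can be sketched with scikit-learn as below. The synthetic dataset, the 10% label noise, and the three tree depths are assumptions chosen only to make the gap visible; exact scores will vary with the data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise (flip_y) so memorizing cannot generalize
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Three decision trees of increasing complexity
scores = {}
for name, depth in [("depth=1", 1), ("depth=4", 4), ("unlimited", None)]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name:>10}: train={scores[name][0]:.2f}  test={scores[name][1]:.2f}")
```

The unlimited-depth tree fits the training set perfectly, so a clearly lower test score for it is the overfitting signal the post describes; a very shallow tree that scores poorly on both sets points to underfitting.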


Data Modelling for Data Engineering.pdf (0.64 KB)


7 Layers of AI Automation


Analysis vs Analytics.pdf (1.75 KB)